Full Code of lyogavin/godmodeanimation for AI

Repository: lyogavin/godmodeanimation
Branch: main
Commit: 0565112c5d39
Files: 60
Total size: 648.4 KB

Directory structure:
gitextract_c750wbnv/

├── LICENSE
├── README.md
└── i2v/
    ├── cog.yaml
    ├── configs/
    │   ├── inference_i2v_512_v1.0.yaml
    │   ├── inference_t2v_1024_v1.0.yaml
    │   ├── inference_t2v_512_v1.0.yaml
    │   ├── inference_t2v_512_v2.0.yaml
    │   └── train_t2v.yaml
    ├── gradio_app.py
    ├── lvdm/
    │   ├── basics.py
    │   ├── common.py
    │   ├── data/
    │   │   ├── frame_dataset.py
    │   │   └── taichi.py
    │   ├── distributions.py
    │   ├── ema.py
    │   ├── models/
    │   │   ├── autoencoder.py
    │   │   ├── ddpm3d.py
    │   │   ├── samplers/
    │   │   │   └── ddim.py
    │   │   └── utils_diffusion.py
    │   ├── modules/
    │   │   ├── attention.py
    │   │   ├── encoders/
    │   │   │   ├── condition.py
    │   │   │   └── ip_resampler.py
    │   │   ├── networks/
    │   │   │   ├── ae_modules.py
    │   │   │   └── openaimodel3d.py
    │   │   └── x_transformer.py
    │   └── utils/
    │       ├── callbacks.py
    │       ├── common_utils.py
    │       └── saving_utils.py
    ├── predict.py
    ├── prompts/
    │   ├── .ipynb_checkpoints/
    │   │   ├── gpt4_extended_prompts_gen_original_vc2_noact_050624-checkpoint.txt
    │   │   ├── gpt4_random_anthropomorphic_game_chars_050624-checkpoint.txt
    │   │   └── gpt4_random_anthropomorphic_game_chars_b2_050624-checkpoint.txt
    │   ├── bg_prompts.txt
    │   ├── gpt4_extended_prompts_gen_original_vc2_050624.txt
    │   ├── gpt4_extended_prompts_gen_original_vc2_noact_050624.txt
    │   ├── gpt4_extended_prompts_path_031324.json
    │   ├── gpt4_extended_prompts_path_031324.txt
    │   ├── gpt4_extended_prompts_path_jump_050624.txt
    │   ├── gpt4_extended_prompts_path_runjump_050624.txt
    │   ├── gpt4_random_anthropomorphic_game_chars_050624.txt
    │   ├── gpt4_random_anthropomorphic_game_chars_b2_050624.txt
    │   ├── i2v_prompts/
    │   │   └── test_prompts.txt
    │   └── test_prompts.txt
    ├── requirements.txt
    ├── scripts/
    │   ├── evaluation/
    │   │   ├── ddp_wrapper.py
    │   │   ├── funcs.py
    │   │   ├── inference.py
    │   │   ├── inference_util.py
    │   │   └── test_seg.py
    │   ├── gradio/
    │   │   ├── i2v_test.py
    │   │   └── t2v_test.py
    │   ├── run_image2video.sh
    │   └── run_text2video.sh
    ├── train.py
    ├── train_t2v_run_jump.sh
    ├── train_t2v_spinkick.sh
    ├── train_t2v_swordslash.sh
    └── utils/
        ├── common_utils.py
        ├── log.py
        └── utils.py

================================================
FILE CONTENTS
================================================

================================================
FILE: LICENSE
================================================
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


================================================
FILE: README.md
================================================
# God Mode Animation: 2D Game Animation Generation Model

## 🎉 Updates @May 2025:

**We've launched [God Mode AI 2.0](https://www.godmodeai.co/ai-sprite-generator)** - AI sprite generator that can generate professional game animation sprites from your image, and much more!

<img src="/assets/godmodeai-hero-anim.webp" width="300" />

## The previous generation of our models is below; for the latest work, see above:

<p align="middle">
  <img src="/assets/godmodeanimation_logo.png?raw=true" width="300" />
  <img src="/assets/godmodeanimation_logo1.png?raw=true" width="300" /> 
</p>

I trained text-to-video and image-to-video models to generate 2D game animations, and I'm open-sourcing the models and training code.


[Website](https://godmodeai.co)

## text to animation, image to animation

<p align="middle">
  <img src="https://god-mode-ai.vercel.app/merged_spinkick_8.gif" width="200" />
  <img src="https://f005.backblazeb2.com/file/godmodeaigendanims/samples/0009_0.gif" width="200" /> 
  <img src="https://god-mode-ai.vercel.app/sword_0.gif" width="200" /> 
  <img src="https://god-mode-ai.vercel.app/sword_7.gif" width="200" /> 
</p>

<p align="middle">
  <img src="https://f005.backblazeb2.com/file/godmodeaigendanims/samples/0048_0.gif" width="200" />
  <img src="https://god-mode-ai.vercel.app/sword_2.gif" width="200" /> 
  <img src="https://god-mode-ai.vercel.app/merged_spinkick_0.gif" width="200" /> 
  <img src="https://f005.backblazeb2.com/file/godmodeaigendanims/samples/0052_2.gif" width="200" /> 
</p>

<p align="middle">
  <img src="https://god-mode-ai.vercel.app/merged_spinkick_6.gif" width="200" />
  <img src="https://god-mode-ai.vercel.app/sword_1.gif" width="200" /> 
  <img src="https://f005.backblazeb2.com/file/godmodeaigendanims/samples/0012_1.gif" width="200" /> 
  <img src="https://god-mode-ai.vercel.app/merged_spinkick_3.gif" width="200" /> 
</p>
<p align="middle">
  <img src="https://f005.backblazeb2.com/file/godmodeaigendanims/samples/0058_2.gif" width="200" />
  <img src="https://f005.backblazeb2.com/file/godmodeaigendanims/samples/0054_0.gif" width="200" /> 
  <img src="https://f005.backblazeb2.com/file/godmodeaigendanims/samples/0010_1.gif" width="200" /> 
  <img src="https://god-mode-ai.vercel.app/sword_3.gif" width="200" /> 
</p>
<p align="middle">
  <img src="https://god-mode-ai.vercel.app/sword_4.gif" width="200" />
  <img src="https://god-mode-ai.vercel.app/merged_spinkick_7.gif" width="200" /> 
  <img src="https://god-mode-ai.vercel.app/sword_6.gif" width="200" /> 
  <img src="https://f005.backblazeb2.com/file/godmodeaigendanims/samples/0014_2.gif" width="200" /> 
</p>

## text to game based on animation model

You can try the games [here](https://www.godmodeai.co/godmodedino/)
<p align="middle">
    <p align="middle">a dino jumping cacti in the desert</p>
    <p align="middle"><img src="/assets/dino_game.gif" width="300" /></p>
</p>
<p align="middle">
    <p align="middle">Donald Trump jumping trash can in new york city</p>
    <p align="middle"><img src="/assets/trump_game.gif" width="300" /></p>
</p>
<p align="middle">
    <p align="middle">Harry Potter jumping tree in Hogwarts castle</p>
    <p align="middle"><img src="/assets/harrypotter_game.gif" width="300" /></p>
</p>
<p align="middle">
    <p align="middle">Tylor Swift jumping microphone in hotel room</p>
    <p align="middle"><img src="/assets/taylorswift_game.gif" width="300" /></p>
</p>

## Trained Models


| Motion         | Epochs        | Steps        | Model Type              | HuggingFace Model         |
| ------------ | ------------- | ------------- | ------------------ | ------------ |
| Sword Wield | 36    |2035    | VC2 T2V | [model](https://huggingface.co/lyogavin/godmodeanimation_vc2_sword_wield_ep36)    |
| Sword Wield | 38    |2145    | VC2 T2V | [model](https://huggingface.co/lyogavin/godmodeanimation_vc2_sword_wield_ep38)    |
| Spin Kick | 32    |1815    | VC2 T2V | [model](https://huggingface.co/lyogavin/godmodeanimation_vc2_spinkick_ep32)    |
| Spin Kick | 34    |1925    | VC2 T2V | [model](https://huggingface.co/lyogavin/godmodeanimation_vc2_spinkick_ep34)    |
| Run Jump | 30    |3379    | VC2 T2V | [model](https://huggingface.co/lyogavin/godmodeanimation_vc2_runjump_ep30)    |
| Run Jump | 34    |3815    | VC2 T2V | [model](https://huggingface.co/lyogavin/godmodeanimation_vc2_runjump_ep34)    |
| Run | 19    |1080    | DC I2V | [model](https://huggingface.co/lyogavin/godmodeanimation_dc_run_ep19)    |
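
Each HuggingFace repo above hosts a single `model.ckpt` (the same file `i2v/cog.yaml` fetches with `wget`). As a hedged sketch, a checkpoint can be downloaded programmatically with `huggingface_hub`:

```python
# Sketch only: pulls one released checkpoint into the local HF cache.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="lyogavin/godmodeanimation_vc2_runjump_ep34",
    filename="model.ckpt",
)
print(ckpt_path)  # local path to the downloaded checkpoint
```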




## How to Train T2V Model

1. Clone the repository.
   ```bash
   git clone https://github.com/lyogavin/godmodeanimation.git
   ```
2. Install the necessary dependencies.
   ```bash
   cd godmodeanimation/i2v
   pip install -r requirements.txt
   ```
3. Prepare the datasets by unzipping the dataset files into the `/root/ucf_ds` directory.
4. Download the pretrained base (VC2) checkpoint to `/root/vc2/model.ckpt`.
5. Run the training script.
   ```bash
   train_t2v_run_jump.sh # for run jump model
   train_t2v_spinkick.sh # for spin kick model
   train_t2v_swordslash.sh # for sword wield model
   ```
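
For reference, the `data` block of `i2v/configs/train_t2v.yaml` loads a UCF-101-style frame directory from `data_root`. A minimal sketch (assuming the dataset from step 3 is in place, including the `ucfTrainTestlist/classInd.txt` annotation file that `VideoFrameDataset` expects for UCF-101) of constructing the dataset directly with the config's parameters:

```python
# Sketch only: mirrors the data.params.train block of i2v/configs/train_t2v.yaml.
from lvdm.data.frame_dataset import VideoFrameDataset

ds = VideoFrameDataset(
    data_root="/root/ucf_ds",   # dataset location from step 3
    resolution=256,
    video_length=16,
    dataset_name="UCF-101",
    subset_split="train",
    clip_step=1,
    temporal_transform="rand_clips",
)
sample = ds[0]
print(sample["image"].shape)    # (c, t, h, w) frame tensor in [-1, 1]
print(sample["caption"])        # per-clip text condition
```
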
## How to Train DC I2V Model

The DC I2V model is based on [DynamiCrafter](https://github.com/Doubiiu/DynamiCrafter). Please follow the instructions in that repository to train it.


## Replicate Public Models

We created public Replicate models for Sword Wield, Spin Kick, Run Jump, and Run. You can try them on the Replicate platform; see [Replicate](https://replicate.com/lyogavin/godmodeanimation) for details.
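
A hedged sketch of calling a public model with the `replicate` Python client; the input field names below are hypothetical (the real schema is defined by `i2v/predict.py` and shown on the model page):

```python
# Hypothetical sketch: check https://replicate.com/lyogavin/godmodeanimation
# for the actual input schema and the current version hash.
import replicate

output = replicate.run(
    "lyogavin/godmodeanimation",  # optionally pin a version: "owner/model:<version>"
    input={"prompt": "a knight wielding a sword, 2d game animation"},  # hypothetical field
)
print(output)  # typically a URL (or list of URLs) to the generated video
```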


## Citing God Mode Animation

If you find this work useful in your research and wish to cite it, please use the following BibTeX entry:

```
@software{godmodeanimation2024,
  author = {Gavin Li},
  title = {2D Game Animation Generation: All you need is repeat the same motion 1000 times},
  url = {https://github.com/lyogavin/godmodeanimation/},
  version = {1.0},
  year = {2024},
}
```


## Contribution

Contributions, ideas, and discussions are welcome!

If you find this work useful or interesting, please ⭐ the repo or buy me a coffee! It means a lot to me! 🙏

[!["Buy Me A Coffee"](https://www.buymeacoffee.com/assets/img/custom_images/orange_img.png)](https://bmc.link/lyogavinQ)




================================================
FILE: i2v/cog.yaml
================================================
# Configuration for Cog ⚙️
# Reference: https://github.com/replicate/cog/blob/main/docs/yaml.md

build:
  gpu: true
  system_packages:
    - "libgl1-mesa-glx"
    - "libglib2.0-0"
  python_version: "3.11"
  python_packages:
    - "torch==2.0.1"
    - "numpy==1.26.4"
    - "opencv-python==4.8.1.78"
    - "torchvision==0.15.2"
    - "pytorch_lightning==2.1.0"
    - "einops==0.7.0"
    - "imageio==2.31.6"
    - "omegaconf==2.3.0"
    - "transformers==4.35.0"
    - "moviepy==1.0.3"
    - "av==10.0.0"
    - "decord==0.6.0"
    - "kornia==0.7.0"
    - "open-clip-torch==2.12.0"
    - "xformers==0.0.21"

  # commands run after the environment is set up
  # note: each `run` item executes in its own layer, so `cd` does not persist
  # between commands; download directly into checkpoints/ instead
  run:
    #- "git clone https://github.com/lyogavin/godmodeanimation.git"
    #- "cd godmodeanimation"
    - "mkdir -p checkpoints"
    - "wget https://huggingface.co/lyogavin/godmodeanimation_vc2_runjump_ep34/resolve/main/model.ckpt -O checkpoints/runjump.ckpt"
    - "wget https://huggingface.co/lyogavin/godmodeanimation_vc2_spinkick_ep34/resolve/main/model.ckpt -O checkpoints/spinkick.ckpt"
    - "wget https://huggingface.co/lyogavin/godmodeanimation_vc2_sword_wield_ep38/resolve/main/model.ckpt -O checkpoints/swordwield.ckpt"


predict: "predict.py:Predictor"


================================================
FILE: i2v/configs/inference_i2v_512_v1.0.yaml
================================================
model:
  target: lvdm.models.ddpm3d.LatentVisualDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    timesteps: 1000
    first_stage_key: video
    cond_stage_key: caption
    cond_stage_trainable: false
    conditioning_key: crossattn
    image_size:
    - 40
    - 64
    channels: 4
    scale_by_std: false
    scale_factor: 0.18215
    use_ema: false
    uncond_type: empty_seq
    use_scale: true
    scale_b: 0.7
    finegrained: true
    unet_config:
      target: lvdm.modules.networks.openaimodel3d.UNetModel
      params:
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions:
        - 4
        - 2
        - 1
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 4
        - 4
        num_head_channels: 64
        transformer_depth: 1
        context_dim: 1024
        use_linear: true
        use_checkpoint: true
        temporal_conv: true
        temporal_attention: true
        temporal_selfatt_only: true
        use_relative_position: false
        use_causal_attention: false
        use_image_attention: true
        temporal_length: 16
        addition_attention: true
        fps_cond: true
    first_stage_config:
      target: lvdm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 512
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
      params:
        freeze: true
        layer: penultimate
    cond_img_config:
      target: lvdm.modules.encoders.condition.FrozenOpenCLIPImageEmbedderV2
      params:
        freeze: true

================================================
FILE: i2v/configs/inference_t2v_1024_v1.0.yaml
================================================
model:
  target: lvdm.models.ddpm3d.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    timesteps: 1000
    first_stage_key: video
    cond_stage_key: caption
    cond_stage_trainable: false
    conditioning_key: crossattn
    image_size:
    - 72
    - 128
    channels: 4
    scale_by_std: false
    scale_factor: 0.18215
    use_ema: false
    uncond_type: empty_seq
    use_scale: true
    fix_scale_bug: true
    unet_config:
      target: lvdm.modules.networks.openaimodel3d.UNetModel
      params:
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions:
        - 4
        - 2
        - 1
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 4
        - 4
        num_head_channels: 64
        transformer_depth: 1
        context_dim: 1024
        use_linear: true
        use_checkpoint: true
        temporal_conv: false
        temporal_attention: true
        temporal_selfatt_only: true
        use_relative_position: true
        use_causal_attention: false
        temporal_length: 16
        addition_attention: true
        fps_cond: true
    first_stage_config:
      target: lvdm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 512
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
      params:
        freeze: true
        layer: penultimate


================================================
FILE: i2v/configs/inference_t2v_512_v1.0.yaml
================================================
model:
  target: lvdm.models.ddpm3d.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    timesteps: 1000
    first_stage_key: video
    cond_stage_key: caption
    cond_stage_trainable: false
    conditioning_key: crossattn
    image_size:
    - 40
    - 64
    channels: 4
    scale_by_std: false
    scale_factor: 0.18215
    use_ema: false
    uncond_type: empty_seq
    unet_config:
      target: lvdm.modules.networks.openaimodel3d.UNetModel
      params:
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions:
        - 4
        - 2
        - 1
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 4
        - 4
        num_head_channels: 64
        transformer_depth: 1
        context_dim: 1024
        use_linear: true
        use_checkpoint: true
        temporal_conv: false
        temporal_attention: true
        temporal_selfatt_only: true
        use_relative_position: true
        use_causal_attention: false
        temporal_length: 16
        addition_attention: true
    first_stage_config:
      target: lvdm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 512
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
      params:
        freeze: true
        layer: penultimate


================================================
FILE: i2v/configs/inference_t2v_512_v2.0.yaml
================================================
model:
  target: lvdm.models.ddpm3d.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    timesteps: 1000
    first_stage_key: video
    cond_stage_key: caption
    cond_stage_trainable: false
    conditioning_key: crossattn
    image_size:
    - 40
    - 64
    channels: 4
    scale_by_std: false
    scale_factor: 0.18215
    use_ema: false
    uncond_type: empty_seq
    use_scale: true
    scale_b: 0.7
    unet_config:
      target: lvdm.modules.networks.openaimodel3d.UNetModel
      params:
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions:
        - 4
        - 2
        - 1
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 4
        - 4
        num_head_channels: 64
        transformer_depth: 1
        context_dim: 1024
        use_linear: true
        use_checkpoint: true
        temporal_conv: true
        temporal_attention: true
        temporal_selfatt_only: true
        use_relative_position: false
        use_causal_attention: false
        temporal_length: 16
        addition_attention: true
        fps_cond: true
    first_stage_config:
      target: lvdm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 512
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
      params:
        freeze: true
        layer: penultimate


================================================
FILE: i2v/configs/train_t2v.yaml
================================================
model:
  base_learning_rate: 6.0e-05 # 1.5e-04
  scale_lr: False
  target: lvdm.models.ddpm3d.LatentDiffusion
  params:
    video_length: 16
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    log_every_t: 200
    timesteps: 1000
    first_stage_key: image
    cond_stage_key: caption
    cond_stage_trainable: false
    conditioning_key: crossattn
    image_size:
    - 40
    - 64
    channels: 4
    scale_by_std: false
    scale_factor: 0.18215
    use_ema: false
    uncond_type: empty_seq
    use_scale: true
    scale_b: 0.7

    unet_config:
      target: lvdm.modules.networks.openaimodel3d.UNetModel
      params:
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions:
        - 4
        - 2
        - 1
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 4
        - 4
        num_head_channels: 64
        transformer_depth: 1
        context_dim: 1024
        use_linear: true
        use_checkpoint: true
        temporal_conv: true
        temporal_attention: true
        temporal_selfatt_only: true
        use_relative_position: false
        use_causal_attention: false
        temporal_length: 16
        addition_attention: true
        fps_cond: true


    first_stage_config:
      target: lvdm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 512
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity

    cond_stage_config:
      target: lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
      params:
        freeze: true
        layer: penultimate




data:
  target: train.DataModuleFromConfig
  params:
    batch_size: 2
    num_workers: 0
    wrap: false
    train:
      target: lvdm.data.frame_dataset.VideoFrameDataset
      params:
        data_root: ${data_root}
        resolution: 256
        video_length: 16
        dataset_name: UCF-101
        subset_split: train
        #spatial_transform:
        clip_step: 1
        temporal_transform: rand_clips
    validation:
      target: lvdm.data.frame_dataset.VideoFrameDataset
      params:
        data_root: ${data_root}
        resolution: 256
        video_length: 16
        dataset_name: UCF-101
        subset_split: test
        #spatial_transform:
        clip_step: 1
        temporal_transform: rand_clips


lightning:
  callbacks:
    image_logger:
      target: lvdm.utils.callbacks.ImageLogger
      params:
        batch_frequency: 40
        max_images: 16
        increase_log_steps: False
        log_first_step: True
    metrics_over_trainsteps_checkpoint:
      target: pytorch_lightning.callbacks.ModelCheckpoint
      params:
        filename: "{epoch:06}-{step:09}"
        save_weights_only: False
        every_n_epochs: 300
        every_n_train_steps: null
  trainer:
    benchmark: True
    log_every_n_steps: 2
    check_val_every_n_epoch: 1
    batch_size: 1
    num_workers: 0
    num_nodes: 4
    accumulate_grad_batches: 4
    max_epochs: 2500 #2000
  modelcheckpoint:
    target: pytorch_lightning.callbacks.ModelCheckpoint
    params:
      every_n_epochs: 1
      save_top_k: -1
      filename: "{epoch:04}-{step:06}"

================================================
FILE: i2v/gradio_app.py
================================================
import os
import sys

# make the bundled lvdm package importable before pulling in modules that depend on it
sys.path.insert(1, os.path.join(sys.path[0], 'lvdm'))

import gradio as gr
from scripts.gradio.t2v_test import Text2Video

t2v_examples = [
    ['an elephant is walking under the sea, 4K, high definition',50, 12,1, 16],
    ['an astronaut riding a horse in outer space',25,12,1,16],
    ['a monkey is playing a piano',25,12,1,16],
    ['A fire is burning on a candle',25,12,1,16],
    ['a horse is drinking in the river',25,12,1,16],
    ['Robot dancing in times square',25,12,1,16],                    
]


def videocrafter_demo(result_dir='./tmp/'):
    text2video = Text2Video(result_dir)
    with gr.Blocks(analytics_enabled=False) as videocrafter_iface:
        gr.Markdown("<div align='center'> <h2> VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models </span> </h2> \
                     <a style='font-size:18px;color: #000000' href='https://github.com/AILab-CVC/VideoCrafter'> Github </div>")
        
        #######t2v#######
        with gr.Tab(label="Text2Video"):
            with gr.Column():
                with gr.Row().style(equal_height=False):
                    with gr.Column():
                        input_text = gr.Text(label='Prompts')
                        with gr.Row():
                            steps = gr.Slider(minimum=1, maximum=60, step=1, elem_id=f"steps", label="Sampling steps", value=50)
                            eta = gr.Slider(minimum=0.0, maximum=1.0, step=0.1, label='ETA', value=1.0, elem_id="eta")
                        with gr.Row():
                            cfg_scale = gr.Slider(minimum=1.0, maximum=30.0, step=0.5, label='CFG Scale', value=12.0, elem_id="cfg_scale")
                            fps = gr.Slider(minimum=4, maximum=32, step=1, label='fps', value=16, elem_id="fps")
                        send_btn = gr.Button("Send")
                    with gr.Tab(label='result'):
                        with gr.Row():
                            output_video_1 =  gr.Video().style(width=512)
                gr.Examples(examples=t2v_examples,
                            inputs=[input_text,steps,cfg_scale,eta,fps],
                            outputs=[output_video_1],
                            fn=text2video.get_prompt,
                            cache_examples=False)
                        #cache_examples=os.getenv('SYSTEM') == 'spaces')
            send_btn.click(
                fn=text2video.get_prompt, 
                inputs=[input_text,steps,cfg_scale,eta,fps],
                outputs=[output_video_1],
            )

    return videocrafter_iface

if __name__ == "__main__":
    result_dir = os.path.join('./', 'results')
    videocrafter_iface = videocrafter_demo(result_dir)
    videocrafter_iface.queue(concurrency_count=1, max_size=10)
    videocrafter_iface.launch()
    # videocrafter_iface.launch(server_name='0.0.0.0', server_port=80)

================================================
FILE: i2v/lvdm/basics.py
================================================
# adopted from
# https://github.com/openai/improved-diffusion/blob/main/improved_diffusion/gaussian_diffusion.py
# and
# https://github.com/lucidrains/denoising-diffusion-pytorch/blob/7706bdfc6f527f58d33f84b7b522e61e6e3164b3/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py
# and
# https://github.com/openai/guided-diffusion/blob/0ba878e517b276c45d1195eb29f6f5f72659a05b/guided_diffusion/nn.py
#
# thanks!

import torch.nn as nn
from utils.utils import instantiate_from_config


def disabled_train(self, mode=True):
    """Overwrite model.train with this function to make sure train/eval mode
    does not change anymore."""
    return self

def zero_module(module):
    """
    Zero out the parameters of a module and return it.
    """
    for p in module.parameters():
        p.detach().zero_()
    return module

def scale_module(module, scale):
    """
    Scale the parameters of a module and return it.
    """
    for p in module.parameters():
        p.detach().mul_(scale)
    return module


def conv_nd(dims, *args, **kwargs):
    """
    Create a 1D, 2D, or 3D convolution module.
    """
    if dims == 1:
        return nn.Conv1d(*args, **kwargs)
    elif dims == 2:
        return nn.Conv2d(*args, **kwargs)
    elif dims == 3:
        return nn.Conv3d(*args, **kwargs)
    raise ValueError(f"unsupported dimensions: {dims}")


def linear(*args, **kwargs):
    """
    Create a linear module.
    """
    return nn.Linear(*args, **kwargs)


def avg_pool_nd(dims, *args, **kwargs):
    """
    Create a 1D, 2D, or 3D average pooling module.
    """
    if dims == 1:
        return nn.AvgPool1d(*args, **kwargs)
    elif dims == 2:
        return nn.AvgPool2d(*args, **kwargs)
    elif dims == 3:
        return nn.AvgPool3d(*args, **kwargs)
    raise ValueError(f"unsupported dimensions: {dims}")


def nonlinearity(type='silu'):
    if type == 'silu':
        return nn.SiLU()
    elif type == 'leaky_relu':
        return nn.LeakyReLU()
    else:
        raise ValueError(f"unsupported nonlinearity type: {type}")


class GroupNormSpecific(nn.GroupNorm):
    def forward(self, x):
        return super().forward(x.float()).type(x.dtype)


def normalization(channels, num_groups=32):
    """
    Make a standard normalization layer.
    :param channels: number of input channels.
    :return: an nn.Module for normalization.
    """
    return GroupNormSpecific(num_groups, channels)


class HybridConditioner(nn.Module):

    def __init__(self, c_concat_config, c_crossattn_config):
        super().__init__()
        self.concat_conditioner = instantiate_from_config(c_concat_config)
        self.crossattn_conditioner = instantiate_from_config(c_crossattn_config)

    def forward(self, c_concat, c_crossattn):
        c_concat = self.concat_conditioner(c_concat)
        c_crossattn = self.crossattn_conditioner(c_crossattn)
        return {'c_concat': [c_concat], 'c_crossattn': [c_crossattn]}

================================================
FILE: i2v/lvdm/common.py
================================================
import math
from inspect import isfunction
import torch
from torch import nn
import torch.distributed as dist


def gather_data(data, return_np=True):
    ''' gather data from multiple processes to one list '''
    data_list = [torch.zeros_like(data) for _ in range(dist.get_world_size())]
    dist.all_gather(data_list, data)  # gather not supported with NCCL
    if return_np:
        data_list = [data.cpu().numpy() for data in data_list]
    return data_list

def autocast(f):
    def do_autocast(*args, **kwargs):
        with torch.cuda.amp.autocast(enabled=True,
                                     dtype=torch.get_autocast_gpu_dtype(),
                                     cache_enabled=torch.is_autocast_cache_enabled()):
            return f(*args, **kwargs)
    return do_autocast


def extract_into_tensor(a, t, x_shape):
    b, *_ = t.shape
    out = a.gather(-1, t)
    return out.reshape(b, *((1,) * (len(x_shape) - 1)))
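
# Usage sketch (illustrative, not part of the original file): `a` is a 1-D schedule
# tensor of length T (e.g. alphas_cumprod) and `t` holds integer timesteps of shape (b,).
# The result is a[t] reshaped to (b, 1, 1, ...) so it broadcasts against x of shape x_shape:
#   coef = extract_into_tensor(alphas_cumprod, t, x_start.shape)
#   x_t = coef.sqrt() * x_start + (1. - coef).sqrt() * noise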


def noise_like(shape, device, repeat=False):
    repeat_noise = lambda: torch.randn((1, *shape[1:]), device=device).repeat(shape[0], *((1,) * (len(shape) - 1)))
    noise = lambda: torch.randn(shape, device=device)
    return repeat_noise() if repeat else noise()


def default(val, d):
    if exists(val):
        return val
    return d() if isfunction(d) else d

def exists(val):
    return val is not None

def identity(*args, **kwargs):
    return nn.Identity()

def uniq(arr):
    # dicts preserve insertion order, so this de-duplicates while keeping order
    return {el: True for el in arr}.keys()

def mean_flat(tensor):
    """
    Take the mean over all non-batch dimensions.
    """
    return tensor.mean(dim=list(range(1, len(tensor.shape))))

def ismap(x):
    if not isinstance(x, torch.Tensor):
        return False
    return (len(x.shape) == 4) and (x.shape[1] > 3)

def isimage(x):
    if not isinstance(x,torch.Tensor):
        return False
    return (len(x.shape) == 4) and (x.shape[1] == 3 or x.shape[1] == 1)

def max_neg_value(t):
    return -torch.finfo(t.dtype).max

def shape_to_str(x):
    shape_str = "x".join([str(x) for x in x.shape])
    return shape_str

def init_(tensor):
    dim = tensor.shape[-1]
    std = 1 / math.sqrt(dim)
    tensor.uniform_(-std, std)
    return tensor

ckpt = torch.utils.checkpoint.checkpoint
def checkpoint(func, inputs, params, flag):
    """
    Evaluate a function without caching intermediate activations, allowing for
    reduced memory at the expense of extra compute in the backward pass.
    :param func: the function to evaluate.
    :param inputs: the argument sequence to pass to `func`.
    :param params: a sequence of parameters `func` depends on but does not
                   explicitly take as arguments.
    :param flag: if False, disable gradient checkpointing.
    """
    if flag:
        return ckpt(func, *inputs)
    else:
        return func(*inputs)
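
# Usage sketch (illustrative names): recompute a block's activations in the backward
# pass instead of caching them, trading compute for memory:
#   out = checkpoint(block.forward, (x, context), block.parameters(), flag=use_checkpoint)
# Note: this wrapper forwards only `inputs` to torch.utils.checkpoint; `params` is kept
# for compatibility with the upstream interface and is unused here.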



================================================
FILE: i2v/lvdm/data/frame_dataset.py
================================================
import os
import random
import re
from PIL import ImageFile
from PIL import Image

import torch
import torch.utils.data as data
import torchvision.transforms as transforms
import torchvision.transforms._transforms_video as transforms_video

""" VideoFrameDataset """

ImageFile.LOAD_TRUNCATED_IMAGES = True
IMG_EXTENSIONS = [
    '.jpg', '.JPG', '.jpeg', '.JPEG',
    '.png', '.PNG', '.ppm', '.PPM', '.bmp', '.BMP',
]


def pil_loader(path):
    # open path as file to avoid ResourceWarning (https://github.com/python-pillow/Pillow/issues/835)
    '''
    with open(path, 'rb') as f:
        with Image.open(f) as img:
            return img.convert('RGB')
    '''
    Im = Image.open(path)
    return Im.convert('RGB')

def accimage_loader(path):
    import accimage
    try:
        return accimage.Image(path)
    except IOError:
        # Potentially a decoding problem, fall back to PIL.Image
        return pil_loader(path)

def default_loader(path):
    '''
    from torchvision import get_image_backend
    if get_image_backend() == 'accimage':
        return accimage_loader(path)
    else:
    '''
    return pil_loader(path)

def is_image_file(filename):
    return any(filename.endswith(extension) for extension in IMG_EXTENSIONS)

def find_classes(dir):
    assert(os.path.exists(dir)), f'{dir} does not exist'
    classes = [d for d in os.listdir(dir) if os.path.isdir(os.path.join(dir, d))]
    classes.sort()
    class_to_idx = {classes[i]: i for i in range(len(classes))}
    return classes, class_to_idx

def class_name_to_idx(annotation_dir):
    """
    return class indices from 0 ~ num_classes-1
    """
    fpath = os.path.join(annotation_dir, "classInd.txt")
    with open(fpath, "r") as f:
        data = f.readlines()
        class_to_idx = {x.strip().split(" ")[1].lower():int(x.strip().split(" ")[0]) - 1 for x in data}
    return class_to_idx

def make_dataset(dir, nframes, class_to_idx, frame_stride=1, **kwargs):
    """
    videos are saved in second-level directory:
    dir: video dir. Format:
        videoxxx
            videoxxx_1
                frame1.jpg
                frame2.jpg
            videoxxx_2
                frame1.jpg
                ...
        videoxxx
        
    nframes: num of frames of every video clips
    class_to_idx: for mapping video name to video id
    """
    if frame_stride != 1:
        raise NotImplementedError
    
    clips = []
    videos = []
    n_clip = 0
    video_frames = []
    for video_name in sorted(os.listdir(dir)):
        if os.path.isdir(os.path.join(dir,video_name)):
            
            # eg: dir + '/rM7aPu9WV2Q'
            subfolder_path = os.path.join(dir, video_name) # video_name: rM7aPu9WV2Q
            for subsubfold in sorted(os.listdir(subfolder_path)):
                subsubfolder_path = os.path.join(subfolder_path, subsubfold)
                if os.path.isdir(subsubfolder_path): # eg: dir/rM7aPu9WV2Q/1'
                    clip_frames = []
                    i = 1
                    # traverse frames in one video
                    for fname in sorted(os.listdir(subsubfolder_path)):
                        if is_image_file(fname):
                            img_path = os.path.join(subsubfolder_path, fname) # eg: dir + '/rM7aPu9WV2Q/rM7aPu9WV2Q_1/rM7aPu9WV2Q_frames_00086552.jpg'
                            frame_info = (img_path, class_to_idx[video_name]) #(img_path, video_id)
                            clip_frames.append(frame_info)
                            video_frames.append(frame_info)
                            
                            # append clips, clip_step=n_frames (no frame overlap between clips).
                            if i % nframes == 0 and i >0:
                                clips.append(clip_frames)
                                n_clip += 1
                                clip_frames = []
                            i = i+1
                    
                    if len(video_frames) >= nframes:
                        videos.append(video_frames)
                    video_frames = []

    print('number of long videos:', len(videos))
    print('number of short videos', len(clips))
    return clips, videos

def split_by_captical(s):
    # split a CamelCase name into lowercase words, e.g. "BoxingSpeedBag" -> "boxing speed bag"
    s_list = re.sub(r"([A-Z])", r" \1", s).split()
    return " ".join(s_list).lower()

def make_dataset_ucf(dir, nframes, class_to_idx, frame_stride=1, clip_step=None):
    """
    Load consecutive clips and consecutive frames from `dir`.

    args:
        nframes: num of frames of every video clips
        class_to_idx: for mapping video name to video id
        frame_stride: select frames with a stride.
        clip_step: select clips with a step. if clip_step< nframes, 
            there will be overlapped frames among two consecutive clips.

    assert videos are saved in first-level directory:
        dir:
            videoxxx1
                frame1.jpg
                frame2.jpg
            videoxxx2
    """
    if clip_step is None:
        # consecutive clips with no frame overlap
        clip_step = nframes
    # make videos
    clips = [] # 2d list
    videos = [] # 2d list
    for video_name in sorted(os.listdir(dir)):
        if video_name != '_broken_clips':
            video_path = os.path.join(dir, video_name)
            assert(os.path.isdir(video_path))
            
            frames = []
            for i, fname in enumerate(sorted(os.listdir(video_path))):
                if not is_image_file(fname):
                    continue
                assert(is_image_file(fname)),f'fname={fname},video_path={video_path},dir={dir}'
                
                # get frame info
                img_path = os.path.join(video_path, fname)

                if "_" in video_name:
                    assert "_" in video_name, video_name
                    class_name = video_name.split("_")[1].lower() # v_BoxingSpeedBag_g12_c05 -> boxingspeedbag
                else:
                    class_name = video_name
                #class_caption = split_by_captical(video_name.split("_")[1]) # v_BoxingSpeedBag_g12_c05 -> BoxingSpeedBag -> boxing speed bag

                # check if caption .txt file exists:
                img_path_ext = img_path[img_path.rindex("."):]
                txt_path = img_path.replace(img_path_ext, ".txt")
                if os.path.exists(txt_path):
                    with open(txt_path,'r') as txtf:
                        class_caption = txtf.read()
                else:
                    class_caption = split_by_captical(video_name.split("_")[1]) # v_BoxingSpeedBag_g12_c05 -> BoxingSpeedBag -> boxing speed bag

                frame_info = {
                    "img_path": img_path, 
                    "class_index": class_to_idx[class_name],
                    "class_name": class_name,           #boxingspeedbag
                    "class_caption": class_caption      #boxing speed bag
                    }
                frames.append(frame_info)
            frames = frames[::frame_stride]
            
            # make videos
            if len(frames) >= nframes:
                videos.append(frames)
            
            # make clips
            start_indices = list(range(len(frames)))[::clip_step]
            for i in start_indices:
                clip = frames[i:i+nframes]
                if len(clip) == nframes:
                    clips.append(clip)
    return clips, videos
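
# Illustrative arithmetic for the clip construction above: with nframes=16 and
# clip_step=8, clip start indices are 0, 8, 16, ..., so consecutive clips share
# 8 frames; clip_step=None defaults to clip_step=nframes, giving back-to-back
# clips with no frame overlap.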

def load_and_transform_frames(frame_list, loader, img_transform=None):
    assert(isinstance(frame_list, list))
    clip = []
    labels = []
    for frame in frame_list:
        
        if isinstance(frame, tuple):
            fpath, label = frame
        elif isinstance(frame, dict):
            fpath = frame["img_path"]
            label = {
                "class_index": frame["class_index"],
                "class_name": frame["class_name"],
                "class_caption": frame["class_caption"],
                }
        
        labels.append(label)
        img = loader(fpath)
        if img_transform is not None:
            img = img_transform(img)
        img = img.view(img.size(0),1, img.size(1), img.size(2))
        clip.append(img)
    return clip, labels[0] # all frames have same label..

class VideoFrameDataset(data.Dataset):
    def __init__(self,
        data_root,
        resolution,
        video_length,                   # clip length
        dataset_name="",
        subset_split="",
        annotation_dir=None,
        spatial_transform="",
        temporal_transform="",
        frame_stride=1,
        clip_step=None,
        cfg_random_null_text_ratio=None,
        ):
        
        self.loader = default_loader
        self.video_length = video_length
        self.subset_split = subset_split
        self.temporal_transform = temporal_transform
        self.spatial_transform = spatial_transform
        self.frame_stride = frame_stride
        self.dataset_name = dataset_name

        assert(subset_split in ["train", "test", "all", ""]) # "" means no subset_split directory.
        assert(self.temporal_transform in ["", "rand_clips"])

        self.cfg_random_null_text_ratio = cfg_random_null_text_ratio

        if subset_split == 'all':
            video_dir = os.path.join(data_root, "train")
        else:
            video_dir = os.path.join(data_root, subset_split)
        
        if dataset_name == 'UCF-101':
            if annotation_dir is None:
                annotation_dir = os.path.join(data_root, "ucfTrainTestlist")
            class_to_idx = class_name_to_idx(annotation_dir)
            #assert(len(class_to_idx) == 101), f'num of classes = {len(class_to_idx)}, not 101'
        elif dataset_name == 'sky':
            classes, class_to_idx = find_classes(video_dir)
        else:
            class_to_idx = None
        
        # make dataset
        if dataset_name == 'UCF-101':
            func = make_dataset_ucf
        else:
            func = make_dataset
        self.clips, self.videos = func(video_dir, video_length,  class_to_idx, frame_stride=frame_stride, clip_step=clip_step)
        assert(len(self.clips[0]) == video_length), f"Invalid clip length = {len(self.clips[0])}"
        if self.temporal_transform == 'rand_clips':
            self.clips = self.videos
        
        if subset_split == 'all':
            # add test videos
            video_dir = os.path.join(os.path.dirname(video_dir), 'test') # note: rstrip('/train') would strip a char set, not the suffix
            cs, vs = func(video_dir, video_length, class_to_idx)
            if self.temporal_transform == 'rand_clips':
                self.clips += vs
            else:
                self.clips += cs
        
        print('[VideoFrameDataset] number of videos:', len(self.videos))
        print('[VideoFrameDataset] number of clips', len(self.clips))

        # check data
        if len(self.clips) == 0:
            raise(RuntimeError(f"Found 0 clips in {video_dir}. \n"
                               "Supported image extensions are: " + 
                               ",".join(IMG_EXTENSIONS)))
        
        # data transform: ToTensor gives [0, 1], then Normalize(0.5, 0.5) maps to [-1, 1]
        self.img_transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
        ])
        if self.spatial_transform == "center_crop_resize":
            print('Spatial transform: center crop and then resize')
            self.video_transform = transforms.Compose([
                transforms.Resize(resolution),
                transforms_video.CenterCropVideo(resolution),
                ])
        elif self.spatial_transform == "resize":
            print('Spatial transform: resize with no crop')
            self.video_transform = transforms.Resize((resolution, resolution))
        elif self.spatial_transform == "random_crop":
            self.video_transform = transforms.Compose([
                transforms_video.RandomCropVideo(resolution),
                ])
        elif self.spatial_transform == "":
            self.video_transform = None
        else:
            raise NotImplementedError

    def __getitem__(self, index):
        # get clip info
        if self.temporal_transform == 'rand_clips':
            raw_video = self.clips[index]
            rand_idx = random.randint(0, len(raw_video) - self.video_length)
            clip = raw_video[rand_idx:rand_idx+self.video_length]
        else:
            clip = self.clips[index]
        assert(len(clip) == self.video_length), f'current clip_length={len(clip)}, target clip_length={self.video_length}, {clip}'
        
        # make clip tensor
        frames, labels = load_and_transform_frames(clip, self.loader, self.img_transform)
        assert(len(frames) == self.video_length), f'current clip_length={len(frames)}, target clip_length={self.video_length}, {clip}'
        frames = torch.cat(frames, 1) # c,t,h,w
        if self.video_transform is not None:
            frames = self.video_transform(frames)
        
        example = dict()
        example["image"] = frames
        if labels is not None and self.dataset_name == 'UCF-101':
            example["caption"] = labels["class_caption"]
            example["class_label"] = labels["class_index"]
            example["class_name"] = labels["class_name"]
        example["frame_stride"] = self.frame_stride

        # nul text random:
        if self.cfg_random_null_text_ratio is not None:
            if random.random() <= self.cfg_random_null_text_ratio:
                example["caption"] = ""
        return example

    def __len__(self):
        return len(self.clips)


================================================
FILE: i2v/lvdm/data/taichi.py
================================================
import os
import random
import torch
from torch.utils.data import Dataset
from decord import VideoReader, cpu
import glob

class Taichi(Dataset):
    """
    Taichi Dataset.
    Assumes data is structured as follows.
    Taichi/
        train/
            xxx.mp4
            ...
        test/
            xxx.mp4
            ...
    """
    def __init__(self,
                 data_root,
                 resolution,
                 video_length,
                 subset_split,
                 frame_stride,
                 ):
        self.data_root = data_root
        self.resolution = resolution
        self.video_length = video_length
        self.subset_split = subset_split
        self.frame_stride = frame_stride
        assert(self.subset_split in ['train', 'test', 'all'])
        self.exts = ['avi', 'mp4', 'webm']

        if isinstance(self.resolution, int):
            self.resolution = [self.resolution, self.resolution]
        assert(isinstance(self.resolution, list) and len(self.resolution) == 2)

        self._make_dataset()
    
    def _make_dataset(self):
        if self.subset_split == 'all':
            data_folder = self.data_root
        else:
            data_folder = os.path.join(self.data_root, self.subset_split)
        self.videos = sum([glob.glob(os.path.join(data_folder, '**', f'*.{ext}'), recursive=True)
                     for ext in self.exts], [])
        print(f'Number of videos = {len(self.videos)}')

    def __getitem__(self, index):
        while True:
            video_path = self.videos[index]

            try:
                video_reader = VideoReader(video_path, ctx=cpu(0), width=self.resolution[1], height=self.resolution[0])
                if len(video_reader) < self.video_length:
                    # too few frames for one clip: try the next video (wrap around)
                    index = (index + 1) % len(self.videos)
                    continue
                else:
                    break
            except Exception:
                index = (index + 1) % len(self.videos)
                print(f"Load video failed! path = {video_path}")

        # candidate frame indices subsampled at frame_stride; fall back to
        # stride 1 if the strided sequence is shorter than one clip
        all_frames = list(range(0, len(video_reader), self.frame_stride))
        if len(all_frames) < self.video_length:
            all_frames = list(range(0, len(video_reader), 1))

        # select a random clip from the (possibly strided) frame indices
        rand_idx = random.randint(0, len(all_frames) - self.video_length)
        frame_indices = all_frames[rand_idx: rand_idx + self.video_length]
        frames = video_reader.get_batch(frame_indices)
        assert(frames.shape[0] == self.video_length),f'{len(frames)}, self.video_length={self.video_length}'

        frames = torch.tensor(frames.asnumpy()).permute(3, 0, 1, 2).float() # [t,h,w,c] -> [c,t,h,w]
        assert(frames.shape[2] == self.resolution[0] and frames.shape[3] == self.resolution[1]), f'frames={frames.shape}, self.resolution={self.resolution}'
        frames = (frames / 255 - 0.5) * 2
        data = {'video': frames}
        return data
    
    def __len__(self):
        return len(self.videos)
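
# Usage sketch (illustrative only: the path and hyperparameter values below
# are hypothetical, not repo defaults):
if __name__ == "__main__":
    from torch.utils.data import DataLoader

    dataset = Taichi(data_root='/data/Taichi', resolution=256,
                     video_length=16, subset_split='train', frame_stride=4)
    loader = DataLoader(dataset, batch_size=2, shuffle=True)
    batch = next(iter(loader))
    print(batch['video'].shape)  # torch.Size([2, 3, 16, 256, 256]), values in [-1, 1]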

================================================
FILE: i2v/lvdm/distributions.py
================================================
import torch
import numpy as np


class AbstractDistribution:
    def sample(self):
        raise NotImplementedError()

    def mode(self):
        raise NotImplementedError()


class DiracDistribution(AbstractDistribution):
    def __init__(self, value):
        self.value = value

    def sample(self):
        return self.value

    def mode(self):
        return self.value


class DiagonalGaussianDistribution(object):
    def __init__(self, parameters, deterministic=False):
        self.parameters = parameters
        self.mean, self.logvar = torch.chunk(parameters, 2, dim=1)
        self.logvar = torch.clamp(self.logvar, -30.0, 20.0)
        self.deterministic = deterministic
        self.std = torch.exp(0.5 * self.logvar)
        self.var = torch.exp(self.logvar)
        if self.deterministic:
            self.var = self.std = torch.zeros_like(self.mean).to(device=self.parameters.device)

    def sample(self, noise=None):
        if noise is None:
            noise = torch.randn(self.mean.shape)
        
        x = self.mean + self.std * noise.to(device=self.parameters.device)
        return x

    def kl(self, other=None):
        if self.deterministic:
            return torch.Tensor([0.])
        else:
            if other is None:
                return 0.5 * torch.sum(torch.pow(self.mean, 2)
                                       + self.var - 1.0 - self.logvar,
                                       dim=[1, 2, 3])
            else:
                return 0.5 * torch.sum(
                    torch.pow(self.mean - other.mean, 2) / other.var
                    + self.var / other.var - 1.0 - self.logvar + other.logvar,
                    dim=[1, 2, 3])

    def nll(self, sample, dims=[1,2,3]):
        if self.deterministic:
            return torch.Tensor([0.])
        logtwopi = np.log(2.0 * np.pi)
        return 0.5 * torch.sum(
            logtwopi + self.logvar + torch.pow(sample - self.mean, 2) / self.var,
            dim=dims)

    def mode(self):
        return self.mean


def normal_kl(mean1, logvar1, mean2, logvar2):
    """
    source: https://github.com/openai/guided-diffusion/blob/27c20a8fab9cb472df5d6bdd6c8d11c8f430b924/guided_diffusion/losses.py#L12
    Compute the KL divergence between two gaussians.
    Shapes are automatically broadcasted, so batches can be compared to
    scalars, among other use cases.
    """
    tensor = None
    for obj in (mean1, logvar1, mean2, logvar2):
        if isinstance(obj, torch.Tensor):
            tensor = obj
            break
    assert tensor is not None, "at least one argument must be a Tensor"

    # Force variances to be Tensors. Broadcasting helps convert scalars to
    # Tensors, but it does not work for torch.exp().
    logvar1, logvar2 = [
        x if isinstance(x, torch.Tensor) else torch.tensor(x).to(tensor)
        for x in (logvar1, logvar2)
    ]

    return 0.5 * (
        -1.0
        + logvar2
        - logvar1
        + torch.exp(logvar1 - logvar2)
        + ((mean1 - mean2) ** 2) * torch.exp(-logvar2)
    )


================================================
FILE: i2v/lvdm/ema.py
================================================
import torch
from torch import nn


class LitEma(nn.Module):
    def __init__(self, model, decay=0.9999, use_num_upates=True):
        super().__init__()
        if decay < 0.0 or decay > 1.0:
            raise ValueError('Decay must be between 0 and 1')

        self.m_name2s_name = {}
        self.register_buffer('decay', torch.tensor(decay, dtype=torch.float32))
        self.register_buffer('num_updates', torch.tensor(0,dtype=torch.int) if use_num_upates
                             else torch.tensor(-1,dtype=torch.int))

        for name, p in model.named_parameters():
            if p.requires_grad:
                # remove '.' as it is not allowed in buffer names
                s_name = name.replace('.','')
                self.m_name2s_name.update({name:s_name})
                self.register_buffer(s_name,p.clone().detach().data)

        self.collected_params = []

    def forward(self,model):
        decay = self.decay

        if self.num_updates >= 0:
            self.num_updates += 1
            decay = min(self.decay,(1 + self.num_updates) / (10 + self.num_updates))

        one_minus_decay = 1.0 - decay

        with torch.no_grad():
            m_param = dict(model.named_parameters())
            shadow_params = dict(self.named_buffers())

            for key in m_param:
                if m_param[key].requires_grad:
                    sname = self.m_name2s_name[key]
                    shadow_params[sname] = shadow_params[sname].type_as(m_param[key])
                    shadow_params[sname].sub_(one_minus_decay * (shadow_params[sname] - m_param[key]))
                else:
                    assert key not in self.m_name2s_name

    def copy_to(self, model):
        m_param = dict(model.named_parameters())
        shadow_params = dict(self.named_buffers())
        for key in m_param:
            if m_param[key].requires_grad:
                m_param[key].data.copy_(shadow_params[self.m_name2s_name[key]].data)
            else:
                assert key not in self.m_name2s_name

    def store(self, parameters):
        """
        Save the current parameters for restoring later.
        Args:
          parameters: Iterable of `torch.nn.Parameter`; the parameters to be
            temporarily stored.
        """
        self.collected_params = [param.clone() for param in parameters]

    def restore(self, parameters):
        """
        Restore the parameters stored with the `store` method.
        Useful to validate the model with EMA parameters without affecting the
        original optimization process. Store the parameters before the
        `copy_to` method. After validation (or model saving), use this to
        restore the former parameters.
        Args:
          parameters: Iterable of `torch.nn.Parameter`; the parameters to be
            updated with the stored parameters.
        """
        for c_param, param in zip(self.collected_params, parameters):
            param.data.copy_(c_param.data)
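
# Usage sketch (illustrative; in this repo the same pattern is driven by
# DDPM.ema_scope in lvdm/models/ddpm3d.py rather than called by hand):
if __name__ == "__main__":
    model = nn.Linear(4, 4)
    ema = LitEma(model, decay=0.999)
    # after every optimizer step during training:
    ema(model)                       # update the shadow (EMA) copies
    # around validation / checkpointing:
    ema.store(model.parameters())    # stash the live weights
    ema.copy_to(model)               # evaluate with EMA weights
    ema.restore(model.parameters())  # switch back to the live weights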


================================================
FILE: i2v/lvdm/models/autoencoder.py
================================================
import os
from contextlib import contextmanager
import torch
import numpy as np
from einops import rearrange
import torch.nn.functional as F
import pytorch_lightning as pl
from lvdm.modules.networks.ae_modules import Encoder, Decoder
from lvdm.distributions import DiagonalGaussianDistribution
from utils.utils import instantiate_from_config


class AutoencoderKL(pl.LightningModule):
    def __init__(self,
                 ddconfig,
                 lossconfig,
                 embed_dim,
                 ckpt_path=None,
                 ignore_keys=[],
                 image_key="image",
                 colorize_nlabels=None,
                 monitor=None,
                 test=False,
                 logdir=None,
                 input_dim=4,
                 test_args=None,
                 ):
        super().__init__()
        self.image_key = image_key
        self.encoder = Encoder(**ddconfig)
        self.decoder = Decoder(**ddconfig)
        self.loss = instantiate_from_config(lossconfig)
        assert ddconfig["double_z"]
        self.quant_conv = torch.nn.Conv2d(2*ddconfig["z_channels"], 2*embed_dim, 1)
        self.post_quant_conv = torch.nn.Conv2d(embed_dim, ddconfig["z_channels"], 1)
        self.embed_dim = embed_dim
        self.input_dim = input_dim
        self.test = test
        self.test_args = test_args
        self.logdir = logdir
        if colorize_nlabels is not None:
            assert type(colorize_nlabels)==int
            self.register_buffer("colorize", torch.randn(3, colorize_nlabels, 1, 1))
        if monitor is not None:
            self.monitor = monitor
        if ckpt_path is not None:
            self.init_from_ckpt(ckpt_path, ignore_keys=ignore_keys)
        if self.test:
            self.init_test()
    
    def init_test(self,):
        self.test = True
        save_dir = os.path.join(self.logdir, "test")
        if 'ckpt' in self.test_args:
            ckpt_name = os.path.basename(self.test_args.ckpt).split('.ckpt')[0] + f'_epoch{self._cur_epoch}'
            self.root = os.path.join(save_dir, ckpt_name)
        else:
            self.root = save_dir
        if 'test_subdir' in self.test_args:
            self.root = os.path.join(save_dir, self.test_args.test_subdir)

        self.root_zs = os.path.join(self.root, "zs")
        self.root_dec = os.path.join(self.root, "reconstructions")
        self.root_inputs = os.path.join(self.root, "inputs")
        os.makedirs(self.root, exist_ok=True)

        if self.test_args.save_z:
            os.makedirs(self.root_zs, exist_ok=True)
        if self.test_args.save_reconstruction:
            os.makedirs(self.root_dec, exist_ok=True)
        if self.test_args.save_input:
            os.makedirs(self.root_inputs, exist_ok=True)
        assert(self.test_args is not None)
        self.test_maximum = getattr(self.test_args, 'test_maximum', None) 
        self.count = 0
        self.eval_metrics = {}
        self.decodes = []
        self.save_decode_samples = 2048

    def init_from_ckpt(self, path, ignore_keys=list()):
        sd = torch.load(path, map_location="cpu")
        try:
            self._cur_epoch = sd['epoch']
            sd = sd["state_dict"]
        except KeyError:
            # raw state_dict checkpoints carry no 'epoch'/'state_dict' keys
            self._cur_epoch = 'null'
        keys = list(sd.keys())
        for k in keys:
            for ik in ignore_keys:
                if k.startswith(ik):
                    print("Deleting key {} from state_dict.".format(k))
                    del sd[k]
        self.load_state_dict(sd, strict=False)
        # self.load_state_dict(sd, strict=True)
        print(f"Restored from {path}")

    def encode(self, x, **kwargs):
        
        h = self.encoder(x)
        moments = self.quant_conv(h)
        posterior = DiagonalGaussianDistribution(moments)
        return posterior

    def decode(self, z, **kwargs):
        z = self.post_quant_conv(z)
        dec = self.decoder(z)
        return dec

    def forward(self, input, sample_posterior=True):
        posterior = self.encode(input)
        if sample_posterior:
            z = posterior.sample()
        else:
            z = posterior.mode()
        dec = self.decode(z)
        return dec, posterior

    def get_input(self, batch, k):
        x = batch[k]
        if x.dim() == 5 and self.input_dim == 4:
            b,c,t,h,w = x.shape
            self.b = b
            self.t = t 
            x = rearrange(x, 'b c t h w -> (b t) c h w')

        return x

    def training_step(self, batch, batch_idx, optimizer_idx):
        inputs = self.get_input(batch, self.image_key)
        reconstructions, posterior = self(inputs)

        if optimizer_idx == 0:
            # train encoder+decoder+logvar
            aeloss, log_dict_ae = self.loss(inputs, reconstructions, posterior, optimizer_idx, self.global_step,
                                            last_layer=self.get_last_layer(), split="train")
            self.log("aeloss", aeloss, prog_bar=True, logger=True, on_step=True, on_epoch=True)
            self.log_dict(log_dict_ae, prog_bar=False, logger=True, on_step=True, on_epoch=False)
            return aeloss

        if optimizer_idx == 1:
            # train the discriminator
            discloss, log_dict_disc = self.loss(inputs, reconstructions, posterior, optimizer_idx, self.global_step,
                                                last_layer=self.get_last_layer(), split="train")

            self.log("discloss", discloss, prog_bar=True, logger=True, on_step=True, on_epoch=True)
            self.log_dict(log_dict_disc, prog_bar=False, logger=True, on_step=True, on_epoch=False)
            return discloss

    def validation_step(self, batch, batch_idx):
        inputs = self.get_input(batch, self.image_key)
        reconstructions, posterior = self(inputs)
        aeloss, log_dict_ae = self.loss(inputs, reconstructions, posterior, 0, self.global_step,
                                        last_layer=self.get_last_layer(), split="val")

        discloss, log_dict_disc = self.loss(inputs, reconstructions, posterior, 1, self.global_step,
                                            last_layer=self.get_last_layer(), split="val")

        self.log("val/rec_loss", log_dict_ae["val/rec_loss"])
        self.log_dict(log_dict_ae)
        self.log_dict(log_dict_disc)
        return self.log_dict
    
    def configure_optimizers(self):
        lr = self.learning_rate
        opt_ae = torch.optim.Adam(list(self.encoder.parameters())+
                                  list(self.decoder.parameters())+
                                  list(self.quant_conv.parameters())+
                                  list(self.post_quant_conv.parameters()),
                                  lr=lr, betas=(0.5, 0.9))
        opt_disc = torch.optim.Adam(self.loss.discriminator.parameters(),
                                    lr=lr, betas=(0.5, 0.9))
        return [opt_ae, opt_disc], []

    def get_last_layer(self):
        return self.decoder.conv_out.weight

    @torch.no_grad()
    def log_images(self, batch, only_inputs=False, **kwargs):
        log = dict()
        x = self.get_input(batch, self.image_key)
        x = x.to(self.device)
        if not only_inputs:
            xrec, posterior = self(x)
            if x.shape[1] > 3:
                # colorize with random projection
                assert xrec.shape[1] > 3
                x = self.to_rgb(x)
                xrec = self.to_rgb(xrec)
            log["samples"] = self.decode(torch.randn_like(posterior.sample()))
            log["reconstructions"] = xrec
        log["inputs"] = x
        return log

    def to_rgb(self, x):
        assert self.image_key == "segmentation"
        if not hasattr(self, "colorize"):
            self.register_buffer("colorize", torch.randn(3, x.shape[1], 1, 1).to(x))
        x = F.conv2d(x, weight=self.colorize)
        x = 2.*(x-x.min())/(x.max()-x.min()) - 1.
        return x

class IdentityFirstStage(torch.nn.Module):
    def __init__(self, *args, vq_interface=False, **kwargs):
        self.vq_interface = vq_interface  # TODO: Should be true by default but check to not break older stuff
        super().__init__()

    def encode(self, x, *args, **kwargs):
        return x

    def decode(self, x, *args, **kwargs):
        return x

    def quantize(self, x, *args, **kwargs):
        if self.vq_interface:
            return x, None, [None, None, None]
        return x

    def forward(self, x, *args, **kwargs):
        return x
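
# Shape-flow sketch (illustrative; the real ddconfig/lossconfig come from the
# YAML files under i2v/configs and are built via instantiate_from_config):
#   ae = AutoencoderKL(ddconfig=..., lossconfig=..., embed_dim=4)
#   x = torch.randn(1, 3, 256, 256)
#   posterior = ae.encode(x)   # DiagonalGaussianDistribution over the latent
#   z = posterior.sample()     # [1, embed_dim, 256 / 2**num_downs, ...]
#   x_rec = ae.decode(z)       # back to [1, 3, 256, 256]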


================================================
FILE: i2v/lvdm/models/ddpm3d.py
================================================
"""
wild mixture of
https://github.com/openai/improved-diffusion/blob/e94489283bb876ac1477d5dd7709bbbd2d9902ce/improved_diffusion/gaussian_diffusion.py
https://github.com/lucidrains/denoising-diffusion-pytorch/blob/7706bdfc6f527f58d33f84b7b522e61e6e3164b3/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py
https://github.com/CompVis/taming-transformers
-- merci
"""

from functools import partial
from contextlib import contextmanager
import numpy as np
from tqdm import tqdm
from einops import rearrange, repeat
import logging
mainlogger = logging.getLogger('mainlogger')
import torch
import torch.nn as nn
from torchvision.utils import make_grid
import pytorch_lightning as pl
from utils.utils import instantiate_from_config
from lvdm.ema import LitEma
from lvdm.distributions import DiagonalGaussianDistribution
from lvdm.models.utils_diffusion import make_beta_schedule
from lvdm.modules.encoders.ip_resampler import ImageProjModel, Resampler
from lvdm.models.samplers.ddim import DDIMSampler

from lvdm.basics import disabled_train
from lvdm.common import (
    extract_into_tensor,
    noise_like,
    exists,
    default
)
from lvdm.utils.saving_utils import log_txt_as_img
from lvdm.utils.common_utils import ismap, isimage #exists, default, mean_flat, count_params, instantiate_from_config, check_istarget



__conditioning_keys__ = {'concat': 'c_concat',
                         'crossattn': 'c_crossattn',
                         'adm': 'y'}

class DDPM(pl.LightningModule):
    # classic DDPM with Gaussian diffusion, in image space
    def __init__(self,
                 unet_config,
                 timesteps=1000,
                 beta_schedule="linear",
                 loss_type="l2",
                 ckpt_path=None,
                 ignore_keys=[],
                 load_only_unet=False,
                 monitor=None,
                 use_ema=True,
                 first_stage_key="image",
                 image_size=256,
                 channels=3,
                 log_every_t=100,
                 clip_denoised=True,
                 linear_start=1e-4,
                 linear_end=2e-2,
                 cosine_s=8e-3,
                 given_betas=None,
                 original_elbo_weight=0.,
                 v_posterior=0.,  # weight for choosing posterior variance as sigma = (1-v) * beta_tilde + v * beta
                 l_simple_weight=1.,
                 conditioning_key=None,
                 parameterization="eps",  # all assuming fixed variance schedules
                 scheduler_config=None,
                 use_positional_encodings=False,
                 learn_logvar=False,
                 logvar_init=0.,
                 video_length=None,
                 *args, **kwargs
                 ):
        super().__init__()
        self.video_length = video_length
        self.total_length = video_length
        assert parameterization in ["eps", "x0"], 'currently only supporting "eps" and "x0"'
        self.parameterization = parameterization
        mainlogger.info(f"{self.__class__.__name__}: Running in {self.parameterization}-prediction mode")
        self.cond_stage_model = None
        self.clip_denoised = clip_denoised
        self.log_every_t = log_every_t
        self.first_stage_key = first_stage_key
        self.channels = channels
        self.temporal_length = unet_config.params.temporal_length
        self.image_size = image_size 
        if isinstance(self.image_size, int):
            self.image_size = [self.image_size, self.image_size]
        self.use_positional_encodings = use_positional_encodings
        self.model = DiffusionWrapper(unet_config, conditioning_key)
        self.use_ema = use_ema
        if self.use_ema:
            self.model_ema = LitEma(self.model)
            mainlogger.info(f"Keeping EMAs of {len(list(self.model_ema.buffers()))}.")

        self.use_scheduler = scheduler_config is not None
        if self.use_scheduler:
            self.scheduler_config = scheduler_config

        self.v_posterior = v_posterior
        self.original_elbo_weight = original_elbo_weight
        self.l_simple_weight = l_simple_weight

        if monitor is not None:
            self.monitor = monitor
        if ckpt_path is not None:
            self.init_from_ckpt(ckpt_path, ignore_keys=ignore_keys, only_model=load_only_unet)

        self.register_schedule(given_betas=given_betas, beta_schedule=beta_schedule, timesteps=timesteps,
                               linear_start=linear_start, linear_end=linear_end, cosine_s=cosine_s)

        self.loss_type = loss_type

        self.learn_logvar = learn_logvar
        self.logvar = torch.full(fill_value=logvar_init, size=(self.num_timesteps,))
        if self.learn_logvar:
            self.logvar = nn.Parameter(self.logvar, requires_grad=True)


    def register_schedule(self, given_betas=None, beta_schedule="linear", timesteps=1000,
                          linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3):
        if exists(given_betas):
            betas = given_betas
        else:
            betas = make_beta_schedule(beta_schedule, timesteps, linear_start=linear_start, linear_end=linear_end,
                                       cosine_s=cosine_s)
        alphas = 1. - betas
        alphas_cumprod = np.cumprod(alphas, axis=0)
        alphas_cumprod_prev = np.append(1., alphas_cumprod[:-1])

        timesteps, = betas.shape
        self.num_timesteps = int(timesteps)
        self.linear_start = linear_start
        self.linear_end = linear_end
        assert alphas_cumprod.shape[0] == self.num_timesteps, 'alphas have to be defined for each timestep'

        to_torch = partial(torch.tensor, dtype=torch.float32)

        self.register_buffer('betas', to_torch(betas))
        self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod))
        self.register_buffer('alphas_cumprod_prev', to_torch(alphas_cumprod_prev))

        # calculations for diffusion q(x_t | x_{t-1}) and others
        self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod)))
        self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod)))
        self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod)))
        self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod)))
        self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod - 1)))

        # calculations for posterior q(x_{t-1} | x_t, x_0)
        posterior_variance = (1 - self.v_posterior) * betas * (1. - alphas_cumprod_prev) / (
                    1. - alphas_cumprod) + self.v_posterior * betas
        # above: equal to 1. / (1. / (1. - alpha_cumprod_tm1) + alpha_t / beta_t)
        self.register_buffer('posterior_variance', to_torch(posterior_variance))
        # below: log calculation clipped because the posterior variance is 0 at the beginning of the diffusion chain
        self.register_buffer('posterior_log_variance_clipped', to_torch(np.log(np.maximum(posterior_variance, 1e-20))))
        self.register_buffer('posterior_mean_coef1', to_torch(
            betas * np.sqrt(alphas_cumprod_prev) / (1. - alphas_cumprod)))
        self.register_buffer('posterior_mean_coef2', to_torch(
            (1. - alphas_cumprod_prev) * np.sqrt(alphas) / (1. - alphas_cumprod)))

        if self.parameterization == "eps":
            lvlb_weights = self.betas ** 2 / (
                        2 * self.posterior_variance * to_torch(alphas) * (1 - self.alphas_cumprod))
        elif self.parameterization == "x0":
            lvlb_weights = 0.5 * np.sqrt(torch.Tensor(alphas_cumprod)) / (2. * 1 - torch.Tensor(alphas_cumprod))
        else:
            raise NotImplementedError("mu not supported")
        # TODO how to choose this term
        lvlb_weights[0] = lvlb_weights[1]
        self.register_buffer('lvlb_weights', lvlb_weights, persistent=False)
        assert not torch.isnan(self.lvlb_weights).any()  # no timestep weight may be NaN
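
    # (Schedule sketch) With the default linear schedule, betas ramp from
    # linear_start to linear_end over `timesteps` steps, and the buffers
    # registered above implement the closed forms used by q_sample/q_posterior:
    #   q(x_t | x_0)          = N( sqrt(alphas_cumprod_t) * x_0,
    #                              (1 - alphas_cumprod_t) * I )
    #   q(x_{t-1} | x_t, x_0) = N( coef1_t * x_0 + coef2_t * x_t,
    #                              posterior_variance_t * I )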

    @contextmanager
    def ema_scope(self, context=None):
        if self.use_ema:
            self.model_ema.store(self.model.parameters())
            self.model_ema.copy_to(self.model)
            if context is not None:
                mainlogger.info(f"{context}: Switched to EMA weights")
        try:
            yield None
        finally:
            if self.use_ema:
                self.model_ema.restore(self.model.parameters())
                if context is not None:
                    mainlogger.info(f"{context}: Restored training weights")

    def init_from_ckpt(self, path, ignore_keys=list(), only_model=False):
        sd = torch.load(path, map_location="cpu")
        if "state_dict" in list(sd.keys()):
            sd = sd["state_dict"]
        keys = list(sd.keys())
        for k in keys:
            for ik in ignore_keys:
                if k.startswith(ik):
                    mainlogger.info("Deleting key {} from state_dict.".format(k))
                    del sd[k]
        missing, unexpected = self.load_state_dict(sd, strict=False) if not only_model else self.model.load_state_dict(
            sd, strict=False)
        mainlogger.info(f"Restored from {path} with {len(missing)} missing and {len(unexpected)} unexpected keys")
        if len(missing) > 0:
            mainlogger.info(f"Missing Keys: {missing}")
        if len(unexpected) > 0:
            mainlogger.info(f"Unexpected Keys: {unexpected}")

    def forward(self, x, *args, **kwargs):
        t = torch.randint(0, self.num_timesteps, (x.shape[0],), device=self.device).long()
        # p_losses expects (x_start, cond, t); the unconditional base class
        # passes cond=None (LatentDiffusion overrides forward with a real cond)
        return self.p_losses(x, None, t, *args, **kwargs)

    def shared_step(self, batch):
        x = self.get_input(batch, self.first_stage_key)
        loss, loss_dict = self(x)
        return loss, loss_dict

    def training_step(self, batch, batch_idx):
        loss, loss_dict = self.shared_step(batch)

        self.log_dict(loss_dict, prog_bar=True,
                      logger=True, on_step=True, on_epoch=True)

        self.log("global_step", self.global_step,
                 prog_bar=True, logger=True, on_step=True, on_epoch=False)

        if self.use_scheduler:
            lr = self.optimizers().param_groups[0]['lr']
            self.log('lr_abs', lr, prog_bar=True, logger=True, on_step=True, on_epoch=False)

        return loss

    def validation_step(self, batch, batch_idx):
        loss, loss_dict = self.shared_step(batch)

        #self.log_dict(loss_dict, prog_bar=True,
        #              logger=True, on_step=True, on_epoch=True)

        self.log("val_loss", loss,
                 prog_bar=True, logger=True, on_step=True, on_epoch=False)



        #return loss

    def q_mean_variance(self, x_start, t):
        """
        Get the distribution q(x_t | x_0).
        :param x_start: the [N x C x ...] tensor of noiseless inputs.
        :param t: the number of diffusion steps (minus 1). Here, 0 means one step.
        :return: A tuple (mean, variance, log_variance), all of x_start's shape.
        """
        mean = (extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start)
        variance = extract_into_tensor(1.0 - self.alphas_cumprod, t, x_start.shape)
        log_variance = extract_into_tensor(self.log_one_minus_alphas_cumprod, t, x_start.shape)
        return mean, variance, log_variance

    def predict_start_from_noise(self, x_t, t, noise):
        return (
                extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t -
                extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape) * noise
        )

    def q_posterior(self, x_start, x_t, t):
        posterior_mean = (
                extract_into_tensor(self.posterior_mean_coef1, t, x_t.shape) * x_start +
                extract_into_tensor(self.posterior_mean_coef2, t, x_t.shape) * x_t
        )
        posterior_variance = extract_into_tensor(self.posterior_variance, t, x_t.shape)
        posterior_log_variance_clipped = extract_into_tensor(self.posterior_log_variance_clipped, t, x_t.shape)
        return posterior_mean, posterior_variance, posterior_log_variance_clipped

    def p_mean_variance(self, x, t, clip_denoised: bool):
        model_out = self.model(x, t)
        if self.parameterization == "eps":
            x_recon = self.predict_start_from_noise(x, t=t, noise=model_out)
        elif self.parameterization == "x0":
            x_recon = model_out
        if clip_denoised:
            x_recon.clamp_(-1., 1.)

        model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t)
        return model_mean, posterior_variance, posterior_log_variance

    @torch.no_grad()
    def p_sample(self, x, t, clip_denoised=True, repeat_noise=False):
        b, *_, device = *x.shape, x.device
        model_mean, _, model_log_variance = self.p_mean_variance(x=x, t=t, clip_denoised=clip_denoised)
        noise = noise_like(x.shape, device, repeat_noise)
        # no noise when t == 0
        nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))
        return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise

    @torch.no_grad()
    def p_sample_loop(self, shape, return_intermediates=False):
        device = self.betas.device
        b = shape[0]
        img = torch.randn(shape, device=device)
        intermediates = [img]
        for i in tqdm(reversed(range(0, self.num_timesteps)), desc='Sampling t', total=self.num_timesteps):
            img = self.p_sample(img, torch.full((b,), i, device=device, dtype=torch.long),
                                clip_denoised=self.clip_denoised)
            if i % self.log_every_t == 0 or i == self.num_timesteps - 1:
                intermediates.append(img)
        if return_intermediates:
            return img, intermediates
        return img

    @torch.no_grad()
    def sample(self, batch_size=16, return_intermediates=False):
        image_size = self.image_size
        channels = self.channels
        return self.p_sample_loop((batch_size, channels, image_size, image_size),
                                  return_intermediates=return_intermediates)

    def q_sample(self, x_start, t, noise=None):
        noise = default(noise, lambda: torch.randn_like(x_start))
        # standard forward diffusion q(x_t | x_0); scale_arr only exists on
        # LatentDiffusion (which overrides q_sample), so it is not used here
        return (extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start +
                extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise)

    def get_input(self, batch, k):
        x = batch[k]
        x = x.to(memory_format=torch.contiguous_format).float()
        return x

    def _get_rows_from_list(self, samples):
        n_imgs_per_row = len(samples)
        denoise_grid = rearrange(samples, 'n b c h w -> b n c h w')
        denoise_grid = rearrange(denoise_grid, 'b n c h w -> (b n) c h w')
        denoise_grid = make_grid(denoise_grid, nrow=n_imgs_per_row)
        return denoise_grid

    @torch.no_grad()
    def log_images(self, batch, N=8, n_row=2, sample=True, return_keys=None, **kwargs):
        log = dict()
        x = self.get_input(batch, self.first_stage_key)
        N = min(x.shape[0], N)
        n_row = min(x.shape[0], n_row)
        x = x.to(self.device)[:N]
        log["inputs"] = x

        # get diffusion row
        diffusion_row = list()
        x_start = x[:n_row]

        for t in range(self.num_timesteps):
            if t % self.log_every_t == 0 or t == self.num_timesteps - 1:
                t = repeat(torch.tensor([t]), '1 -> b', b=n_row)
                t = t.to(self.device).long()
                noise = torch.randn_like(x_start)
                x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)
                diffusion_row.append(x_noisy)

        log["diffusion_row"] = self._get_rows_from_list(diffusion_row)

        if sample:
            # get denoise row
            with self.ema_scope("Plotting"):
                samples, denoise_row = self.sample(batch_size=N, return_intermediates=True)

            log["samples"] = samples
            log["denoise_row"] = self._get_rows_from_list(denoise_row)

        if return_keys:
            if np.intersect1d(list(log.keys()), return_keys).shape[0] == 0:
                return log
            else:
                return {key: log[key] for key in return_keys}
        return log

    def get_loss(self, pred, target, mean=True, mask=None):
        if self.loss_type == 'l1':
            loss = (target - pred).abs()
            if mean:
                loss = loss.mean()
        elif self.loss_type == 'l2':
            if mean:
                loss = torch.nn.functional.mse_loss(target, pred)
            else:
                loss = torch.nn.functional.mse_loss(target, pred, reduction='none')
        else:
            raise NotImplementedError("unknown loss type '{loss_type}'")
        if mask is not None:
            assert(mean is False)
            assert(loss.shape[2:] == mask.shape[2:]) #thw need be the same
            loss = loss * mask
        return loss

    def p_losses(self, x_start, cond, t, noise=None, skip_qsample=False, x_noisy=None, cond_mask=None, **kwargs, ):
        if not skip_qsample:
            noise = default(noise, lambda: torch.randn_like(x_start))
            x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)
        else:
            assert (x_noisy is not None)
            assert (noise is not None)
        model_output = self.apply_model(x_noisy, t, cond, **kwargs)

        loss_dict = {}
        prefix = 'train' if self.training else 'val'

        if self.parameterization == "x0":
            target = x_start
        elif self.parameterization == "eps":
            target = noise
        else:
            raise NotImplementedError()

        loss_simple = self.get_loss(model_output, target, mean=False).mean([1, 2, 3, 4])
        loss_dict.update({f'{prefix}/loss_simple': loss_simple.mean()})
        if self.logvar.device != self.device:
            self.logvar = self.logvar.to(self.device)
        logvar_t = self.logvar[t]
        loss = loss_simple / torch.exp(logvar_t) + logvar_t
        if self.learn_logvar:
            loss_dict.update({f'{prefix}/loss_gamma': loss.mean()})
            loss_dict.update({'logvar': self.logvar.data.mean()})

        loss = self.l_simple_weight * loss.mean()

        loss_vlb = self.get_loss(model_output, target, mean=False).mean(dim=(1, 2, 3, 4))
        loss_vlb = (self.lvlb_weights[t] * loss_vlb).mean()
        loss_dict.update({f'{prefix}/loss_vlb': loss_vlb})
        loss += (self.original_elbo_weight * loss_vlb)
        loss_dict.update({f'{prefix}/loss': loss})

        return loss, loss_dict
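
    # (Objective sketch) With parameterization == "eps" and a fixed logvar
    # (learn_logvar=False, logvar_init=0.), the loss above reduces to
    #   L = l_simple_weight * E_t[ ||eps - eps_theta(x_t, t)||^2 ]
    #       + original_elbo_weight * E_t[ lvlb_weights[t] * ||eps - eps_theta||^2 ]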


    def configure_optimizers(self):
        lr = self.learning_rate
        params = list(self.model.parameters())
        if self.learn_logvar:
            params = params + [self.logvar]
        opt = torch.optim.AdamW(params, lr=lr)
        return opt




class LatentDiffusion(DDPM):
    """main class"""
    def __init__(self,
                 first_stage_config,
                 cond_stage_config,
                 num_timesteps_cond=None,
                 cond_stage_key="caption",
                 cond_stage_trainable=False,
                 cond_stage_forward=None,
                 conditioning_key=None,
                 uncond_prob=0.2,
                 uncond_type="empty_seq",
                 scale_factor=1.0,
                 scale_by_std=False,
                 encoder_type="2d",
                 only_model=False,
                 use_scale=False,
                 scale_a=1,
                 scale_b=0.3,
                 mid_step=400,
                 fix_scale_bug=False,
                 latent_frame_strde=None,
                 *args, **kwargs):
        self.num_timesteps_cond = default(num_timesteps_cond, 1)
        self.scale_by_std = scale_by_std
        assert self.num_timesteps_cond <= kwargs['timesteps']
        # p_sample_loop reads this flag; shortened conditioning schedules are
        # only active when more than one conditioning timestep is requested
        self.shorten_cond_schedule = self.num_timesteps_cond > 1
        # for backwards compatibility after implementation of DiffusionWrapper
        ckpt_path = kwargs.pop("ckpt_path", None)
        ignore_keys = kwargs.pop("ignore_keys", [])
        conditioning_key = default(conditioning_key, 'crossattn')
        super().__init__(conditioning_key=conditioning_key, *args, **kwargs)
        if self.shorten_cond_schedule:
            self.make_cond_schedule()

        self.cond_stage_trainable = cond_stage_trainable
        self.cond_stage_key = cond_stage_key

        # scale factor
        self.use_scale=use_scale
        if self.use_scale:
            self.scale_a=scale_a
            self.scale_b=scale_b
            if fix_scale_bug:
                scale_step=self.num_timesteps-mid_step
            else: #bug
                scale_step = self.num_timesteps

            scale_arr1 = np.linspace(scale_a, scale_b, mid_step)
            scale_arr2 = np.full(scale_step, scale_b)
            scale_arr = np.concatenate((scale_arr1, scale_arr2))
            scale_arr_prev = np.append(scale_a, scale_arr[:-1])
            to_torch = partial(torch.tensor, dtype=torch.float32)
            self.register_buffer('scale_arr', to_torch(scale_arr))

        try:
            self.num_downs = len(first_stage_config.params.ddconfig.ch_mult) - 1
        except Exception:
            self.num_downs = 0
        if not scale_by_std:
            self.scale_factor = scale_factor
        else:
            self.register_buffer('scale_factor', torch.tensor(scale_factor))
        self.instantiate_first_stage(first_stage_config)
        self.instantiate_cond_stage(cond_stage_config)
        self.first_stage_config = first_stage_config
        self.cond_stage_config = cond_stage_config        
        self.clip_denoised = False

        self.cond_stage_forward = cond_stage_forward
        self.encoder_type = encoder_type
        assert(encoder_type in ["2d", "3d"])
        self.uncond_prob = uncond_prob
        self.classifier_free_guidance = uncond_prob > 0
        assert(uncond_type in ["zero_embed", "empty_seq"])
        self.uncond_type = uncond_type

        self.latent_frame_strde = latent_frame_strde

        self.restarted_from_ckpt = False
        if ckpt_path is not None:
            self.init_from_ckpt(ckpt_path, ignore_keys, only_model=only_model)
            self.restarted_from_ckpt = True

    def make_cond_schedule(self, ):
        self.cond_ids = torch.full(size=(self.num_timesteps,), fill_value=self.num_timesteps - 1, dtype=torch.long)
        ids = torch.round(torch.linspace(0, self.num_timesteps - 1, self.num_timesteps_cond)).long()
        self.cond_ids[:self.num_timesteps_cond] = ids

    def q_sample(self, x_start, t, noise=None):
        noise = default(noise, lambda: torch.randn_like(x_start))
        if self.use_scale:  
            return (extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start *
                extract_into_tensor(self.scale_arr, t, x_start.shape) +
                extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise)
        else:
            return (extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start +
                extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise)


    def _freeze_model(self):
        for name, para in self.model.diffusion_model.named_parameters():
            para.requires_grad = False

    def instantiate_first_stage(self, config):
        model = instantiate_from_config(config)
        self.first_stage_model = model.eval()
        self.first_stage_model.train = disabled_train
        for param in self.first_stage_model.parameters():
            param.requires_grad = False

    def instantiate_cond_stage(self, config):
        if not self.cond_stage_trainable:
            model = instantiate_from_config(config)
            self.cond_stage_model = model.eval()
            self.cond_stage_model.train = disabled_train
            for param in self.cond_stage_model.parameters():
                param.requires_grad = False
        else:
            model = instantiate_from_config(config)
            self.cond_stage_model = model
    
    def get_learned_conditioning(self, c):
        if self.cond_stage_forward is None:
            if hasattr(self.cond_stage_model, 'encode') and callable(self.cond_stage_model.encode):
                c = self.cond_stage_model.encode(c)
                if isinstance(c, DiagonalGaussianDistribution):
                    c = c.mode()
            else:
                c = self.cond_stage_model(c)
        else:
            assert hasattr(self.cond_stage_model, self.cond_stage_forward)
            c = getattr(self.cond_stage_model, self.cond_stage_forward)(c)
        return c

    def get_first_stage_encoding(self, encoder_posterior, noise=None):
        if isinstance(encoder_posterior, DiagonalGaussianDistribution):
            z = encoder_posterior.sample(noise=noise)
        elif isinstance(encoder_posterior, torch.Tensor):
            z = encoder_posterior
        else:
            raise NotImplementedError(f"encoder_posterior of type '{type(encoder_posterior)}' not yet implemented")
        return self.scale_factor * z

    # old version
    '''
    # @torch.no_grad()
    # def encode_first_stage(self, x):
        if self.encoder_type == "2d" and x.dim() == 5:
            b, _, t, _, _ = x.shape
            x = rearrange(x, 'b c t h w -> (b t) c h w')
            reshape_back = True
        else:
            reshape_back = False
        
        encoder_posterior = self.first_stage_model.encode(x)
        results = self.get_first_stage_encoding(encoder_posterior).detach()
        
        if reshape_back:
            results = rearrange(results, '(b t) c h w -> b c t h w', b=b,t=t)
        
        return results
    '''

    @torch.no_grad()
    def encode_first_stage(self, x):
        if self.encoder_type == "2d" and x.dim() == 5:
            b, _, t, _, _ = x.shape
            x = rearrange(x, 'b c t h w -> (b t) c h w')
            results = self.first_stage_model.encode(x)
            results = rearrange(results, '(b t) c h w -> b c t h w', b=b,t=t)
        else:
            results = self.first_stage_model.encode(x)
        return results


    @torch.no_grad()
    def get_condition(self, batch, x, bs, force_c_encode, k, cond_key):
        is_conditional = self.model.conditioning_key is not None  # crossattn
        if is_conditional:
            if cond_key is None:
                cond_key = self.cond_stage_key

                # get condition batch of different condition type
            if cond_key != self.first_stage_key:
                if cond_key in ['caption', 'txt']:
                    xc = batch[cond_key]
                elif cond_key == 'class_label':
                    xc = batch
                else:
                    raise NotImplementedError
            else:
                xc = x

            # get learned condition.
            # can directly skip it: c = xc
            if self.cond_stage_config is not None and (not self.cond_stage_trainable or force_c_encode):
                if isinstance(xc, torch.Tensor):
                    xc = xc.to(self.device)
                c = self.get_learned_conditioning(xc)
                #print(f"get_condition in input: encoded {xc} to {c}")
            else:
                c = xc

            # process c
            if bs is not None:
                c = c[:bs]

        else:
            c = None
            xc = None

        return c, xc


    @torch.no_grad()
    def get_input(self, batch, k, return_first_stage_outputs=False, force_c_encode=False,
                  cond_key=None, return_original_cond=False, bs=None):
        # get input images
        x = super().get_input(batch, k)  # k=first_stage_key

        if bs is not None:
            x = x[:bs]

        x = x.to(self.device)
        x_ori = x
        b, _, t, h, w = x.shape

        # encode video frames x to z via a 2D encoder
        if self.encoder_type == "2d":
            x = rearrange(x, 'b c t h w -> (b t) c h w')
        encoder_posterior = self.encode_first_stage(x)
        z = self.get_first_stage_encoding(encoder_posterior).detach()
        if self.encoder_type == "2d":
            z = rearrange(z, '(b t) c h w -> b c t h w', b=b, t=t)

        if self.latent_frame_strde:
            z = z[:, :, ::4, :, :]
            assert (z.shape[2] == self.temporal_length), f'z={z.shape}, self.temporal_length={self.temporal_length}'

        c, xc = self.get_condition(batch, x, bs, force_c_encode, k, cond_key)
        out = [z, c]

        if return_first_stage_outputs:
            xrec = self.decode_first_stage(z)
            out.extend([x_ori, xrec])
        if return_original_cond:
            if isinstance(xc, torch.Tensor) and xc.dim() == 4:
                xc = rearrange(xc, '(b t) c h w -> b c t h w', b=b, t=t)
            out.append(xc)

        return out
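
    # (Shape sketch, illustrative) For a video batch x of shape [b, 3, t, H, W]
    # with the 2D encoder, frames are folded into the batch dim, encoded, and
    # unfolded back to a latent z of shape [b, zc, t, h, w]; when
    # latent_frame_strde is set, z is additionally subsampled along time.
    # Returns [z, c], plus reconstructions / the original cond when requested.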


    def shared_step(self, batch, **kwargs):
        """Shared step of the LDM.
        If the condition stage is trainable, c is the raw condition (e.g. text);
        it is encoded inside the forward function below.
        """
        x, c = self.get_input(batch, self.first_stage_key)
        loss = self(x, c)
        return loss

    def forward(self, x, c, *args, **kwargs):
        t = torch.randint(0, self.num_timesteps, (x.shape[0],), device=self.device).long()
        if self.model.conditioning_key is not None:
            assert c is not None
            if self.cond_stage_trainable:
                c = self.get_learned_conditioning(c)
        return self.p_losses(x, c, t, *args, **kwargs)

    @torch.no_grad()
    def encode_first_stage_2DAE(self, x):

        b, _, t, _, _ = x.shape
        results = torch.cat([self.get_first_stage_encoding(self.first_stage_model.encode(x[:,:,i])).detach().unsqueeze(2) for i in range(t)], dim=2)
        
        return results
    
    def decode_core(self, z, **kwargs):
        if self.encoder_type == "2d" and z.dim() == 5:
            b, _, t, _, _ = z.shape
            z = rearrange(z, 'b c t h w -> (b t) c h w')
            reshape_back = True
        else:
            reshape_back = False
            
        z = 1. / self.scale_factor * z

        results = self.first_stage_model.decode(z, **kwargs)
            
        if reshape_back:
            results = rearrange(results, '(b t) c h w -> b c t h w', b=b,t=t)
        return results

    @torch.no_grad()
    def decode_first_stage(self, z, **kwargs):
        return self.decode_core(z, **kwargs)

    def apply_model(self, x_noisy, t, cond, **kwargs):
        if isinstance(cond, dict):
            # hybrid case: cond is expected to be a dict
            pass
        else:
            if not isinstance(cond, list):
                cond = [cond]
            key = 'c_concat' if self.model.conditioning_key == 'concat' else 'c_crossattn'
            cond = {key: cond}

        x_recon = self.model(x_noisy, t, **cond, **kwargs)

        if isinstance(x_recon, tuple):
            return x_recon[0]
        else:
            return x_recon
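
    # (Sketch) For non-dict conditions, apply_model wraps them before calling
    # the DiffusionWrapper, e.g. with conditioning_key == 'crossattn':
    #   self.apply_model(x_noisy, t, c)  ->  self.model(x_noisy, t, c_crossattn=[c])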

    def _get_denoise_row_from_list(self, samples, desc=''):
        denoise_row = []
        for zd in tqdm(samples, desc=desc):
            denoise_row.append(self.decode_first_stage(zd.to(self.device)))
        n_log_timesteps = len(denoise_row)

        denoise_row = torch.stack(denoise_row)  # n_log_timesteps, b, C, H, W
        
        if denoise_row.dim() == 5:
            # img, num_imgs= n_log_timesteps * bs, grid_size=[bs,n_log_timesteps]
            denoise_grid = rearrange(denoise_row, 'n b c h w -> b n c h w')
            denoise_grid = rearrange(denoise_grid, 'b n c h w -> (b n) c h w')
            denoise_grid = make_grid(denoise_grid, nrow=n_log_timesteps)
        elif denoise_row.dim() == 6:
            # video, grid_size=[n_log_timesteps*bs, t]
            video_length = denoise_row.shape[3]
            denoise_grid = rearrange(denoise_row, 'n b c t h w -> b n c t h w')
            denoise_grid = rearrange(denoise_grid, 'b n c t h w -> (b n) c t h w')
            denoise_grid = rearrange(denoise_grid, 'n c t h w -> (n t) c h w')
            denoise_grid = make_grid(denoise_grid, nrow=video_length)
        else:
            raise ValueError

        return denoise_grid
 

    @torch.no_grad()
    def decode_first_stage_2DAE(self, z, **kwargs):

        b, _, t, _, _ = z.shape
        z = 1. / self.scale_factor * z
        results = torch.cat([self.first_stage_model.decode(z[:,:,i], **kwargs).unsqueeze(2) for i in range(t)], dim=2)

        return results


    def p_mean_variance(self, x, c, t, clip_denoised: bool, return_x0=False, score_corrector=None, corrector_kwargs=None, **kwargs):
        t_in = t
        model_out = self.apply_model(x, t_in, c, **kwargs)

        if score_corrector is not None:
            assert self.parameterization == "eps"
            model_out = score_corrector.modify_score(self, model_out, x, t, c, **corrector_kwargs)

        if self.parameterization == "eps":
            x_recon = self.predict_start_from_noise(x, t=t, noise=model_out)
        elif self.parameterization == "x0":
            x_recon = model_out
        else:
            raise NotImplementedError()

        if clip_denoised:
            x_recon.clamp_(-1., 1.)

        model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t)

        if return_x0:
            return model_mean, posterior_variance, posterior_log_variance, x_recon
        else:
            return model_mean, posterior_variance, posterior_log_variance

    @torch.no_grad()
    def p_sample(self, x, c, t, clip_denoised=False, repeat_noise=False, return_x0=False, \
                 temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None, **kwargs):
        b, *_, device = *x.shape, x.device
        outputs = self.p_mean_variance(x=x, c=c, t=t, clip_denoised=clip_denoised, return_x0=return_x0, \
                                       score_corrector=score_corrector, corrector_kwargs=corrector_kwargs, **kwargs)
        if return_x0:
            model_mean, _, model_log_variance, x0 = outputs
        else:
            model_mean, _, model_log_variance = outputs

        noise = noise_like(x.shape, device, repeat_noise) * temperature
        if noise_dropout > 0.:
            noise = torch.nn.functional.dropout(noise, p=noise_dropout)
        # no noise when t == 0
        nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))

        if return_x0:
            return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise, x0
        else:
            return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise

    @torch.no_grad()
    def p_sample_loop(self, cond, shape, return_intermediates=False, x_T=None, verbose=True, callback=None, \
                      timesteps=None, mask=None, x0=None, img_callback=None, start_T=None, log_every_t=None, **kwargs):

        if not log_every_t:
            log_every_t = self.log_every_t
        device = self.betas.device
        b = shape[0]        
        # sample an initial noise
        if x_T is None:
            img = torch.randn(shape, device=device)
        else:
            img = x_T

        intermediates = [img]
        if timesteps is None:
            timesteps = self.num_timesteps
        if start_T is not None:
            timesteps = min(timesteps, start_T)

        iterator = tqdm(reversed(range(0, timesteps)), desc='Sampling t', total=timesteps) if verbose else reversed(range(0, timesteps))

        if mask is not None:
            assert x0 is not None
            assert x0.shape[2:3] == mask.shape[2:3]  # spatial size has to match

        for i in iterator:
            ts = torch.full((b,), i, device=device, dtype=torch.long)
            if self.shorten_cond_schedule:
                assert self.model.conditioning_key != 'hybrid'
                tc = self.cond_ids[ts].to(cond.device)
                cond = self.q_sample(x_start=cond, t=tc, noise=torch.randn_like(cond))

            img = self.p_sample(img, cond, ts, clip_denoised=self.clip_denoised, **kwargs)
            if mask is not None:
                img_orig = self.q_sample(x0, ts)
                img = img_orig * mask + (1. - mask) * img

            if i % log_every_t == 0 or i == timesteps - 1:
                intermediates.append(img)
            if callback: callback(i)
            if img_callback: img_callback(img, i)

        if return_intermediates:
            return img, intermediates
        return img

    @torch.no_grad()
    def sample(self, cond, batch_size=16, return_intermediates=False, x_T=None,
               verbose=True, timesteps=None,
               mask=None, x0=None, shape=None, **kwargs):
        if shape is None:
            shape = (batch_size, self.channels, self.total_length, *self.image_size)
        if cond is not None:
            if isinstance(cond, dict):
                cond = {key: cond[key][:batch_size] if not isinstance(cond[key], list) else
                list(map(lambda x: x[:batch_size], cond[key])) for key in cond}
            else:
                cond = [c[:batch_size] for c in cond] if isinstance(cond, list) else cond[:batch_size]
        return self.p_sample_loop(cond,
                                  shape,
                                  return_intermediates=return_intermediates, x_T=x_T,
                                  verbose=verbose, timesteps=timesteps,
                                  mask=mask, x0=x0,)

    @torch.no_grad()
    def log_condition(self, log, batch, xc, x, c, cond_stage_key=None):
        """
        xc: original condition before encoding.
        c: condition after encoding.
        """
        if x.dim() == 5:
            txt_img_shape = [x.shape[3], x.shape[4]]
        elif x.dim() == 4:
            txt_img_shape = [x.shape[2], x.shape[3]]
        else:
            raise ValueError
        if self.model.conditioning_key is not None:  # concat-time-mask
            if hasattr(self.cond_stage_model, "decode"):
                xc = self.cond_stage_model.decode(c)
                log["conditioning"] = xc
            elif cond_stage_key in ["caption", "txt"]:
                log["conditioning_txt_img"] = log_txt_as_img(txt_img_shape, batch[cond_stage_key],
                                                             size=x.shape[3] // 25)
                log["conditioning_txt"] = batch[cond_stage_key]
            elif cond_stage_key == 'class_label':
                try:
                    xc = log_txt_as_img(txt_img_shape, batch["human_label"], size=x.shape[3] // 25)
                except KeyError:
                    xc = log_txt_as_img(txt_img_shape, batch["class_name"], size=x.shape[3] // 25)
                log['conditioning'] = xc
            elif isimage(xc):
                log["conditioning"] = xc
            if ismap(xc):
                log["original_conditioning"] = self.to_rgb(xc)
            if isinstance(c, dict) and 'mask' in c:
                log['mask'] = self.mask_to_rgb(c['mask'])
        return log

    @torch.no_grad()
    def sample_log(self, cond, batch_size, ddim, ddim_steps, shape=None, **kwargs):
        if ddim:
            ddim_sampler = DDIMSampler(self)
            if shape is None:
                shape = (self.channels, self.total_length, *self.image_size)

            #print(f"ddim_sampler.sample args shape: {shape}, ddim_steps: {ddim_steps}")
            samples, intermediates = ddim_sampler.sample(ddim_steps, batch_size,
                                                        shape, cond, verbose=False, **kwargs)
        else:
            samples, intermediates = self.sample(cond=cond, batch_size=batch_size,
                                                 return_intermediates=True, **kwargs)
        return samples, intermediates

    def get_condition_validate(self, prompt):
        """ get the text embedding for validation prompts """
        if isinstance(prompt, str):
            prompt = [prompt]
        c = self.get_learned_conditioning(prompt)
        return c

    @torch.no_grad()
    def log_images(self, batch, N=8, n_row=4, sample=True, ddim_steps=50, ddim_eta=1.,
                   unconditional_guidance_scale=12.0,
                   cond_noisy_level=None,
                   c=None,
                   **kwargs):
        """ log images for LatentDiffusion """
        use_ddim = ddim_steps is not None
        log = dict()

        # get input
        z, c, x, xrec, xc = self.get_input(batch,
                                           k=self.first_stage_key,
                                           return_first_stage_outputs=True,
                                           force_c_encode=True,
                                           return_original_cond=True,
                                           bs=N,
                                           cond_key=None,
                                           )

        N = min(z.shape[0], N)
        n_row = min(x.shape[0], n_row)

        if unconditional_guidance_scale != 1.0:
            prompts = N * [""]
            uc = self.get_condition_validate(prompts)
        else:
            uc = None

        log["inputs"] = x
        log["reconstruction"] = xrec
        log = self.log_condition(log, batch, xc, x, c,
                                 cond_stage_key=self.cond_stage_key,
                                 )
        if cond_noisy_level is not None:
            noise = torch.randn_like(z)
            c['noisy_cond'] = self.q_sample(x_start=z, t=cond_noisy_level, noise=noise)

        if sample:
            with self.ema_scope("Plotting"):

                #print(f"ddim_sampler.sample input: cond:{c}")
                #print(f"ddim_sampler.sample input: use_ddim:{use_ddim}")
                #print(f"ddim_sampler.sample input: ddim_eta:{ddim_eta}")
                #print(f"ddim_sampler.sample input: unconditional_guidance_scale:{unconditional_guidance_scale}")
                #print(f"ddim_sampler.sample input: uc:{uc}")

                samples, z_denoise_row = self.sample_log(cond=c, batch_size=N, ddim=use_ddim,
                                                         ddim_steps=ddim_steps, eta=ddim_eta,
                                                         temporal_length=self.video_length,
                                                         unconditional_guidance_scale=unconditional_guidance_scale,
                                                         unconditional_conditioning=uc, **kwargs,
                                                         )
            x_samples = self.decode_first_stage(samples)
            log["samples"] = x_samples
        return log

    @torch.no_grad()
    def decode_first_stage(self, z, decode_bs=16, return_cpu=True,
                           bs=None, decode_single_video_allframes=False, max_z_t=None,
                           overlapped_length=0,
                           **kwargs):
        b, _, t, _, _ = z.shape

        if self.encoder_type == "2d" and z.dim() == 5:
            return self.decode_first_stage_2DAE(z, decode_bs=decode_bs, return_cpu=return_cpu, **kwargs)
        if decode_single_video_allframes:
            z = torch.split(z, 1, dim=0)  # tuple of zs
            cat_dim = 0
        elif max_z_t is not None:
            if self.encoder_type == "3d":
                z = torch.split(z, max_z_t, dim=2)  # tuple of zs
                cat_dim = 2
            if self.encoder_type == "2d":
                z = torch.split(z, max_z_t, dim=0)  # tuple of zs
                cat_dim = 0
        elif (self.split_clips and self.downfactor_t is not None) or (
                (self.clip_length is not None) and (self.downfactor_t is not None) and (
                z.shape[2] > self.clip_length // self.downfactor_t) \
                and self.encoder_type == "3d"):
            split_z_t = self.clip_length // self.downfactor_t
            print(f'split z ({z.shape}) to length={split_z_t} clips')
            z = split_video_to_clips(z, clip_length=split_z_t, drop_left=True)  # tensor
            if bs is not None and z.shape[0] > bs:
                print(f'split z ({z.shape}) to bs={bs}')
                z = torch.split(z, bs, dim=0)  # tuple of zs
                cat_dim = 0

        torch.cuda.empty_cache()
        if isinstance(z, tuple):
            zs = [self.decode(z_, **kwargs).cpu() for z_ in z]
            results = torch.cat(zs, dim=cat_dim)
        elif isinstance(z, torch.Tensor):
            results = self.decode(z, **kwargs)
        else:
            raise ValueError

        if self.split_clips and self.downfactor_t is not None:
            results = merge_clips_to_videos(results, bs=b)
        return results


class LatentVisualDiffusion(LatentDiffusion):
    def __init__(self, cond_img_config, finegrained=False, random_cond=False, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.random_cond = random_cond
        self.instantiate_img_embedder(cond_img_config, freeze=True)
        num_tokens = 16 if finegrained else 4
        self.image_proj_model = self.init_projector(use_finegrained=finegrained, num_tokens=num_tokens,
                                                    input_dim=1024, cross_attention_dim=1024, dim=1280)

    def instantiate_img_embedder(self, config, freeze=True):
        embedder = instantiate_from_config(config)
        self.embedder = embedder  # keep a reference even when not freezing
        if freeze:
            self.embedder.eval()
            self.embedder.train = disabled_train
            for param in self.embedder.parameters():
                param.requires_grad = False

    def init_projector(self, use_finegrained, num_tokens, input_dim, cross_attention_dim, dim):
        if not use_finegrained:
            image_proj_model = ImageProjModel(clip_extra_context_tokens=num_tokens, cross_attention_dim=cross_attention_dim,
                clip_embeddings_dim=input_dim
            )
        else:
            image_proj_model = Resampler(dim=input_dim, depth=4, dim_head=64, heads=12, num_queries=num_tokens,
                embedding_dim=dim, output_dim=cross_attention_dim, ff_mult=4
            )
        return image_proj_model

    ## Never delete this func: it is used in log_images() and at inference time
    def get_image_embeds(self, batch_imgs):
        ## img: b c h w
        img_token = self.embedder(batch_imgs)
        img_emb = self.image_proj_model(img_token)
        return img_emb


class DiffusionWrapper(pl.LightningModule):
    def __init__(self, diff_model_config, conditioning_key):
        super().__init__()
        self.diffusion_model = instantiate_from_config(diff_model_config)
        self.conditioning_key = conditioning_key

    def forward(self, x, t, c_concat: list = None, c_crossattn: list = None,
                c_adm=None, s=None, mask=None, **kwargs):
        if self.conditioning_key is None:
            out = self.diffusion_model(x, t)
        elif self.conditioning_key == 'concat':
            xc = torch.cat([x] + c_concat, dim=1)
            out = self.diffusion_model(xc, t, **kwargs)
        elif self.conditioning_key == 'crossattn':
            cc = torch.cat(c_crossattn, 1)
            out = self.diffusion_model(x, t, context=cc, **kwargs)
        elif self.conditioning_key == 'hybrid':
            ## it is just right [b,c,t,h,w]: concatenate in channel dim
            xc = torch.cat([x] + c_concat, dim=1)
            cc = torch.cat(c_crossattn, 1)
            out = self.diffusion_model(xc, t, context=cc)
        elif self.conditioning_key == 'resblockcond':
            cc = c_crossattn[0]
            out = self.diffusion_model(x, t, context=cc)
        elif self.conditioning_key == 'adm':
            cc = c_crossattn[0]
            out = self.diffusion_model(x, t, y=cc)
        elif self.conditioning_key == 'hybrid-adm':
            assert c_adm is not None
            xc = torch.cat([x] + c_concat, dim=1)
            cc = torch.cat(c_crossattn, 1)
            out = self.diffusion_model(xc, t, context=cc, y=c_adm)
        elif self.conditioning_key == 'hybrid-time':
            assert s is not None
            xc = torch.cat([x] + c_concat, dim=1)
            cc = torch.cat(c_crossattn, 1)
            out = self.diffusion_model(xc, t, context=cc, s=s)
        elif self.conditioning_key == 'concat-time-mask':
            # assert s is not None
            # mainlogger.info('x & mask:',x.shape,c_concat[0].shape)
            xc = torch.cat([x] + c_concat, dim=1)
            out = self.diffusion_model(xc, t, context=None, s=s, mask=mask)
        elif self.conditioning_key == 'concat-adm-mask':
            # assert s is not None
            # mainlogger.info('x & mask:',x.shape,c_concat[0].shape)
            if c_concat is not None:
                xc = torch.cat([x] + c_concat, dim=1)
            else:
                xc = x
            out = self.diffusion_model(xc, t, context=None, y=s, mask=mask)
        elif self.conditioning_key == 'hybrid-adm-mask':
            cc = torch.cat(c_crossattn, 1)
            if c_concat is not None:
                xc = torch.cat([x] + c_concat, dim=1)
            else:
                xc = x
            out = self.diffusion_model(xc, t, context=cc, y=s, mask=mask)
        elif self.conditioning_key == 'hybrid-time-adm': # adm means y, e.g., class index
            # assert s is not None
            assert c_adm is not None
            xc = torch.cat([x] + c_concat, dim=1)
            cc = torch.cat(c_crossattn, 1)
            out = self.diffusion_model(xc, t, context=cc, s=s, y=c_adm)
        else:
            raise NotImplementedError()

        return out

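
## ------------------------------------------------------------------
## Editor's sketch (not part of the original file): how the 'hybrid'
## branch of DiffusionWrapper.forward assembles its inputs. The shapes
## below are illustrative assumptions, not values from this repo's configs.
## ------------------------------------------------------------------
def _demo_hybrid_conditioning_shapes():
    import torch
    x = torch.randn(2, 4, 16, 32, 32)           # noisy video latent [b, c, t, h, w]
    c_concat = [torch.randn(2, 4, 16, 32, 32)]  # frame-aligned condition latents
    c_crossattn = [torch.randn(2, 77, 1024)]    # text embeddings for cross-attention
    xc = torch.cat([x] + c_concat, dim=1)       # channel concat -> [2, 8, 16, 32, 32]
    cc = torch.cat(c_crossattn, 1)              # context -> [2, 77, 1024]
    assert xc.shape == (2, 8, 16, 32, 32) and cc.shape == (2, 77, 1024)
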
================================================
FILE: i2v/lvdm/models/samplers/ddim.py
================================================
import numpy as np
from tqdm import tqdm
import torch
from lvdm.models.utils_diffusion import make_ddim_sampling_parameters, make_ddim_timesteps
from lvdm.common import noise_like


class DDIMSampler(object):
    def __init__(self, model, schedule="linear", **kwargs):
        super().__init__()
        self.model = model
        self.ddpm_num_timesteps = model.num_timesteps
        self.schedule = schedule
        self.counter = 0

    def register_buffer(self, name, attr):
        if type(attr) == torch.Tensor:
            if attr.device != torch.device("cuda"):
                attr = attr.to(torch.device("cuda"))
        setattr(self, name, attr)

    def make_schedule(self, ddim_num_steps, ddim_discretize="uniform", ddim_eta=0., verbose=True):
        self.ddim_timesteps = make_ddim_timesteps(ddim_discr_method=ddim_discretize, num_ddim_timesteps=ddim_num_steps,
                                                  num_ddpm_timesteps=self.ddpm_num_timesteps,verbose=verbose)
        alphas_cumprod = self.model.alphas_cumprod
        assert alphas_cumprod.shape[0] == self.ddpm_num_timesteps, 'alphas have to be defined for each timestep'
        to_torch = lambda x: x.clone().detach().to(torch.float32).to(self.model.device)

        self.register_buffer('betas', to_torch(self.model.betas))
        self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod))
        self.register_buffer('alphas_cumprod_prev', to_torch(self.model.alphas_cumprod_prev))
        self.use_scale = self.model.use_scale
        print('DDIM scale', self.use_scale)

        if self.use_scale:
            self.register_buffer('scale_arr', to_torch(self.model.scale_arr))
            ddim_scale_arr = self.scale_arr.cpu()[self.ddim_timesteps]
            self.register_buffer('ddim_scale_arr', ddim_scale_arr)
            ddim_scale_arr = np.asarray([self.scale_arr.cpu()[0]] + self.scale_arr.cpu()[self.ddim_timesteps[:-1]].tolist())
            self.register_buffer('ddim_scale_arr_prev', ddim_scale_arr)

        # calculations for diffusion q(x_t | x_{t-1}) and others
        self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod.cpu())))
        self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod.cpu())))
        self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod.cpu())))
        self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu())))
        self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu() - 1)))

        # ddim sampling parameters
        ddim_sigmas, ddim_alphas, ddim_alphas_prev = make_ddim_sampling_parameters(alphacums=alphas_cumprod.cpu(),
                                                                                   ddim_timesteps=self.ddim_timesteps,
                                                                                   eta=ddim_eta,verbose=verbose)
        self.register_buffer('ddim_sigmas', ddim_sigmas)
        self.register_buffer('ddim_alphas', ddim_alphas)
        self.register_buffer('ddim_alphas_prev', ddim_alphas_prev)
        self.register_buffer('ddim_sqrt_one_minus_alphas', np.sqrt(1. - ddim_alphas))
        sigmas_for_original_sampling_steps = ddim_eta * torch.sqrt(
            (1 - self.alphas_cumprod_prev) / (1 - self.alphas_cumprod) * (
                        1 - self.alphas_cumprod / self.alphas_cumprod_prev))
        self.register_buffer('ddim_sigmas_for_original_num_steps', sigmas_for_original_sampling_steps)

    @torch.no_grad()
    def sample(self,
               S,
               batch_size,
               shape,
               conditioning=None,
               callback=None,
               normals_sequence=None,
               img_callback=None,
               quantize_x0=False,
               eta=0.,
               mask=None,
               x0=None,
               temperature=1.,
               noise_dropout=0.,
               score_corrector=None,
               corrector_kwargs=None,
               verbose=True,
               schedule_verbose=False,
               x_T=None,
               log_every_t=100,
               unconditional_guidance_scale=1.,
               unconditional_conditioning=None,
               # must come in the same format as the conditioning, e.g. as encoded tokens
               **kwargs
               ):
        
        # check condition bs
        if conditioning is not None:
            if isinstance(conditioning, dict):
                try:
                    cbs = conditioning[list(conditioning.keys())[0]].shape[0]
                except AttributeError:  # the first condition entry may be a list of tensors
                    cbs = conditioning[list(conditioning.keys())[0]][0].shape[0]

                if cbs != batch_size:
                    print(f"Warning: Got {cbs} conditionings but batch-size is {batch_size}")
            else:
                if conditioning.shape[0] != batch_size:
                    print(f"Warning: Got {conditioning.shape[0]} conditionings but batch-size is {batch_size}")

        self.make_schedule(ddim_num_steps=S, ddim_eta=eta, verbose=schedule_verbose)
        
        # make shape
        if len(shape) == 3:
            C, H, W = shape
            size = (batch_size, C, H, W)
        elif len(shape) == 4:
            C, T, H, W = shape
            size = (batch_size, C, T, H, W)
        else:
            raise ValueError(f"unsupported shape {shape}; expected (C, H, W) or (C, T, H, W)")
        # print(f'Data shape for DDIM sampling is {size}, eta {eta}')
        
        samples, intermediates = self.ddim_sampling(conditioning, size,
                                                    callback=callback,
                                                    img_callback=img_callback,
                                                    quantize_denoised=quantize_x0,
                                                    mask=mask, x0=x0,
                                                    ddim_use_original_steps=False,
                                                    noise_dropout=noise_dropout,
                                                    temperature=temperature,
                                                    score_corrector=score_corrector,
                                                    corrector_kwargs=corrector_kwargs,
                                                    x_T=x_T,
                                                    log_every_t=log_every_t,
                                                    unconditional_guidance_scale=unconditional_guidance_scale,
                                                    unconditional_conditioning=unconditional_conditioning,
                                                    verbose=verbose,
                                                    **kwargs)
        return samples, intermediates

    @torch.no_grad()
    def ddim_sampling(self, cond, shape,
                      x_T=None, ddim_use_original_steps=False,
                      callback=None, timesteps=None, quantize_denoised=False,
                      mask=None, x0=None, img_callback=None, log_every_t=100,
                      temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None,
                      unconditional_guidance_scale=1., unconditional_conditioning=None, verbose=True,
                      cond_tau=1., target_size=None, start_timesteps=None,
                      **kwargs):
        device = self.model.betas.device        
        print('ddim device', device)
        b = shape[0]
        if x_T is None:
            img = torch.randn(shape, device=device)
        else:
            img = x_T
        
        if timesteps is None:
            timesteps = self.ddpm_num_timesteps if ddim_use_original_steps else self.ddim_timesteps
        elif timesteps is not None and not ddim_use_original_steps:
            subset_end = int(min(timesteps / self.ddim_timesteps.shape[0], 1) * self.ddim_timesteps.shape[0]) - 1
            timesteps = self.ddim_timesteps[:subset_end]
            
        intermediates = {'x_inter': [img], 'pred_x0': [img]}
        time_range = reversed(range(0,timesteps)) if ddim_use_original_steps else np.flip(timesteps)
        total_steps = timesteps if ddim_use_original_steps else timesteps.shape[0]
        if verbose:
            iterator = tqdm(time_range, desc='DDIM Sampler', total=total_steps)
        else:
            iterator = time_range

        init_x0 = False
        clean_cond = kwargs.pop("clean_cond", False)
        for i, step in enumerate(iterator):
            index = total_steps - i - 1
            ts = torch.full((b,), step, device=device, dtype=torch.long)
            if start_timesteps is not None:
                assert x0 is not None
                if step > start_timesteps*time_range[0]:
                    continue
                elif not init_x0:
                    img = self.model.q_sample(x0, ts) 
                    init_x0 = True

            # use mask to blend noised original latent (img_orig) & new sampled latent (img)
            if mask is not None:
                assert x0 is not None
                if clean_cond:
                    img_orig = x0
                else:
                    img_orig = self.model.q_sample(x0, ts)  # TODO: deterministic forward pass? <ddim inversion>
                img = img_orig * mask + (1. - mask) * img  # keep masked region from the original, fill the rest with the new sample
            
            index_clip = int((1 - cond_tau) * total_steps)
            if index <= index_clip and target_size is not None:
                target_size_ = [target_size[0], target_size[1] // 8, target_size[2] // 8]
                img = torch.nn.functional.interpolate(
                    img,
                    size=target_size_,
                    mode="nearest",
                )
            outs = self.p_sample_ddim(img, cond, ts, index=index, use_original_steps=ddim_use_original_steps,
                                      quantize_denoised=quantize_denoised, temperature=temperature,
                                      noise_dropout=noise_dropout, score_corrector=score_corrector,
                                      corrector_kwargs=corrector_kwargs,
                                      unconditional_guidance_scale=unconditional_guidance_scale,
                                      unconditional_conditioning=unconditional_conditioning,
                                      x0=x0,
                                      **kwargs)
            
            img, pred_x0 = outs
            if callback: callback(i)
            if img_callback: img_callback(pred_x0, i)

            if index % log_every_t == 0 or index == total_steps - 1:
                intermediates['x_inter'].append(img)
                intermediates['pred_x0'].append(pred_x0)

        return img, intermediates

    @torch.no_grad()
    def p_sample_ddim(self, x, c, t, index, repeat_noise=False, use_original_steps=False, quantize_denoised=False,
                      temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None,
                      unconditional_guidance_scale=1., unconditional_conditioning=None,
                      uc_type=None, conditional_guidance_scale_temporal=None, **kwargs):
        b, *_, device = *x.shape, x.device
        is_video = x.dim() == 5
        if unconditional_conditioning is None or unconditional_guidance_scale == 1.:
            e_t = self.model.apply_model(x, t, c, **kwargs) # unet denoiser
        else:
            # with unconditional condition
            if isinstance(c, torch.Tensor):
                e_t = self.model.apply_model(x, t, c, **kwargs)
                e_t_uncond = self.model.apply_model(x, t, unconditional_conditioning, **kwargs)
            elif isinstance(c, dict):
                e_t = self.model.apply_model(x, t, c, **kwargs)
                e_t_uncond = self.model.apply_model(x, t, unconditional_conditioning, **kwargs)
            else:
                raise NotImplementedError
            # text cfg
            if uc_type is None:
                e_t = e_t_uncond + unconditional_guidance_scale * (e_t - e_t_uncond)
            else:
                if uc_type == 'cfg_original':
                    e_t = e_t + unconditional_guidance_scale * (e_t - e_t_uncond)
                elif uc_type == 'cfg_ours':
                    e_t = e_t + unconditional_guidance_scale * (e_t_uncond - e_t)
                else:
                    raise NotImplementedError
            # temporal guidance
            if conditional_guidance_scale_temporal is not None:
                e_t_temporal = self.model.apply_model(x, t, c, **kwargs)
                e_t_image = self.model.apply_model(x, t, c, no_temporal_attn=True, **kwargs)
                e_t = e_t + conditional_guidance_scale_temporal * (e_t_temporal - e_t_image)

        if score_corrector is not None:
            assert self.model.parameterization == "eps"
            e_t = score_corrector.modify_score(self.model, e_t, x, t, c, **corrector_kwargs)

        alphas = self.model.alphas_cumprod if use_original_steps else self.ddim_alphas
        alphas_prev = self.model.alphas_cumprod_prev if use_original_steps else self.ddim_alphas_prev
        sqrt_one_minus_alphas = self.model.sqrt_one_minus_alphas_cumprod if use_original_steps else self.ddim_sqrt_one_minus_alphas
        sigmas = self.model.ddim_sigmas_for_original_num_steps if use_original_steps else self.ddim_sigmas
        # select parameters corresponding to the currently considered timestep
        
        if is_video:
            size = (b, 1, 1, 1, 1)
        else:
            size = (b, 1, 1, 1)
        a_t = torch.full(size, alphas[index], device=device)
        a_prev = torch.full(size, alphas_prev[index], device=device)
        sigma_t = torch.full(size, sigmas[index], device=device)
        sqrt_one_minus_at = torch.full(size, sqrt_one_minus_alphas[index],device=device)

        # current prediction for x_0
        pred_x0 = (x - sqrt_one_minus_at * e_t) / a_t.sqrt()
        if quantize_denoised:
            pred_x0, _, *_ = self.model.first_stage_model.quantize(pred_x0)
        # direction pointing to x_t
        dir_xt = (1. - a_prev - sigma_t**2).sqrt() * e_t

        noise = sigma_t * noise_like(x.shape, device, repeat_noise) * temperature
        if noise_dropout > 0.:
            noise = torch.nn.functional.dropout(noise, p=noise_dropout)
        
        if self.use_scale:
            scale_arr = self.model.scale_arr if use_original_steps else self.ddim_scale_arr
            scale_t = torch.full(size, scale_arr[index], device=device)
            scale_arr_prev = self.model.scale_arr_prev if use_original_steps else self.ddim_scale_arr_prev
            scale_t_prev = torch.full(size, scale_arr_prev[index], device=device)
            pred_x0 /= scale_t 
            x_prev = a_prev.sqrt() * scale_t_prev * pred_x0 + dir_xt + noise
        else:
            x_prev = a_prev.sqrt() * pred_x0 + dir_xt + noise

        return x_prev, pred_x0


    @torch.no_grad()
    def stochastic_encode(self, x0, t, use_original_steps=False, noise=None):
        # fast, but does not allow for exact reconstruction
        # t serves as an index to gather the correct alphas
        if use_original_steps:
            sqrt_alphas_cumprod = self.sqrt_alphas_cumprod
            sqrt_one_minus_alphas_cumprod = self.sqrt_one_minus_alphas_cumprod
        else:
            sqrt_alphas_cumprod = torch.sqrt(self.ddim_alphas)
            sqrt_one_minus_alphas_cumprod = self.ddim_sqrt_one_minus_alphas

        if noise is None:
            noise = torch.randn_like(x0)

        def extract_into_tensor(a, t, x_shape):
            b, *_ = t.shape
            out = a.gather(-1, t)
            return out.reshape(b, *((1,) * (len(x_shape) - 1)))

        return (extract_into_tensor(sqrt_alphas_cumprod, t, x0.shape) * x0 +
                extract_into_tensor(sqrt_one_minus_alphas_cumprod, t, x0.shape) * noise)

    @torch.no_grad()
    def decode(self, x_latent, cond, t_start, unconditional_guidance_scale=1.0, unconditional_conditioning=None,
               use_original_steps=False):

        timesteps = np.arange(self.ddpm_num_timesteps) if use_original_steps else self.ddim_timesteps
        timesteps = timesteps[:t_start]

        time_range = np.flip(timesteps)
        total_steps = timesteps.shape[0]
        print(f"Running DDIM Sampling with {total_steps} timesteps")

        iterator = tqdm(time_range, desc='Decoding image', total=total_steps)
        x_dec = x_latent
        for i, step in enumerate(iterator):
            index = total_steps - i - 1
            ts = torch.full((x_latent.shape[0],), step, device=x_latent.device, dtype=torch.long)
            x_dec, _ = self.p_sample_ddim(x_dec, cond, ts, index=index, use_original_steps=use_original_steps,
                                          unconditional_guidance_scale=unconditional_guidance_scale,
                                          unconditional_conditioning=unconditional_conditioning)
        return x_dec


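
## ------------------------------------------------------------------
## Editor's sketch (not part of the original file): the per-step DDIM
## update implemented in p_sample_ddim, reduced to scalar schedule values
## with eta = 0 (deterministic sampling) so it can be checked by hand.
## ------------------------------------------------------------------
def _demo_ddim_update_step():
    import torch
    a_t, a_prev, sigma_t = 0.5, 0.8, 0.0            # alphas_cumprod at t and t-1; eta=0 -> sigma=0
    x = torch.randn(1, 4, 8, 8)                     # current latent x_t
    e_t = torch.randn_like(x)                       # noise predicted by the UNet
    sqrt_one_minus_at = (1. - a_t) ** 0.5
    pred_x0 = (x - sqrt_one_minus_at * e_t) / a_t ** 0.5   # current estimate of x_0
    dir_xt = (1. - a_prev - sigma_t ** 2) ** 0.5 * e_t     # direction pointing to x_t
    x_prev = a_prev ** 0.5 * pred_x0 + dir_xt              # no noise term when sigma_t == 0
    assert x_prev.shape == x.shape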

================================================
FILE: i2v/lvdm/models/utils_diffusion.py
================================================
import math
import numpy as np
from einops import repeat
import torch
import torch.nn.functional as F


def timestep_embedding(timesteps, dim, max_period=10000, repeat_only=False):
    """
    Create sinusoidal timestep embeddings.
    :param timesteps: a 1-D Tensor of N indices, one per batch element.
                      These may be fractional.
    :param dim: the dimension of the output.
    :param max_period: controls the minimum frequency of the embeddings.
    :return: an [N x dim] Tensor of positional embeddings.
    """
    if not repeat_only:
        half = dim // 2
        freqs = torch.exp(
            -math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half
        ).to(device=timesteps.device)
        args = timesteps[:, None].float() * freqs[None]
        embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
        if dim % 2:
            embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
    else:
        embedding = repeat(timesteps, 'b -> b d', d=dim)
    return embedding

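
## Editor's sketch (not part of the original file): timestep_embedding
## pairs cosines and sines over geometrically spaced frequencies, so each
## timestep maps to a distinct [N, dim] code.
def _demo_timestep_embedding():
    ts = torch.tensor([0, 10, 999])
    emb = timestep_embedding(ts, dim=320)
    assert emb.shape == (3, 320)
    # t = 0 gives cos(0) = 1 for the first half and sin(0) = 0 for the second
    assert torch.allclose(emb[0, :160], torch.ones(160))
    assert torch.allclose(emb[0, 160:], torch.zeros(160))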

def make_beta_schedule(schedule, n_timestep, linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3):
    if schedule == "linear":
        betas = (
                torch.linspace(linear_start ** 0.5, linear_end ** 0.5, n_timestep, dtype=torch.float64) ** 2
        )

    elif schedule == "cosine":
        timesteps = (
                torch.arange(n_timestep + 1, dtype=torch.float64) / n_timestep + cosine_s
        )
        alphas = timesteps / (1 + cosine_s) * np.pi / 2
        alphas = torch.cos(alphas).pow(2)
        alphas = alphas / alphas[0]
        betas = 1 - alphas[1:] / alphas[:-1]
        betas = np.clip(betas, a_min=0, a_max=0.999)

    elif schedule == "sqrt_linear":
        betas = torch.linspace(linear_start, linear_end, n_timestep, dtype=torch.float64)
    elif schedule == "sqrt":
        betas = torch.linspace(linear_start, linear_end, n_timestep, dtype=torch.float64) ** 0.5
    else:
        raise ValueError(f"schedule '{schedule}' unknown.")
    return betas.numpy()

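
## Editor's sketch (not part of the original file): the "linear" schedule
## is linear in sqrt(beta); with the defaults it runs from about 1e-4 to 2e-2.
def _demo_make_beta_schedule():
    betas = make_beta_schedule("linear", n_timestep=1000)
    assert betas.shape == (1000,)
    assert abs(betas[0] - 1e-4) < 1e-6 and abs(betas[-1] - 2e-2) < 1e-6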

def make_ddim_timesteps(ddim_discr_method, num_ddim_timesteps, num_ddpm_timesteps, verbose=True):
    if ddim_discr_method == 'uniform':
        c = num_ddpm_timesteps // num_ddim_timesteps
        ddim_timesteps = np.asarray(list(range(0, num_ddpm_timesteps, c)))
    elif ddim_discr_method == 'quad':
        ddim_timesteps = ((np.linspace(0, np.sqrt(num_ddpm_timesteps * .8), num_ddim_timesteps)) ** 2).astype(int)
    else:
        raise NotImplementedError(f'There is no ddim discretization method called "{ddim_discr_method}"')

    # assert ddim_timesteps.shape[0] == num_ddim_timesteps
    # add one to get the final alpha values right (the ones from first scale to data during sampling)
    steps_out = ddim_timesteps + 1
    if verbose:
        print(f'Selected timesteps for ddim sampler: {steps_out}')
    return steps_out

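
## Editor's sketch (not part of the original file): 'uniform' strides
## through the DDPM steps, then shifts by one so the final alphas line up
## with a near-clean latent at the end of sampling.
def _demo_make_ddim_timesteps():
    steps = make_ddim_timesteps('uniform', num_ddim_timesteps=50,
                                num_ddpm_timesteps=1000, verbose=False)
    assert steps.shape == (50,)
    assert steps[0] == 1 and steps[-1] == 981   # range(0, 1000, 20) + 1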

def make_ddim_sampling_parameters(alphacums, ddim_timesteps, eta, verbose=True):
    # select alphas for computing the variance schedule
    # print(f'ddim_timesteps={ddim_timesteps}, len_alphacums={len(alphacums)}')
    alphas = alphacums[ddim_timesteps]
    alphas_prev = np.asarray([alphacums[0]] + alphacums[ddim_timesteps[:-1]].tolist())

    # according to the formula provided in https://arxiv.org/abs/2010.02502
    sigmas = eta * np.sqrt((1 - alphas_prev) / (1 - alphas) * (1 - alphas / alphas_prev))
    if verbose:
        print(f'Selected alphas for ddim sampler: a_t: {alphas}; a_(t-1): {alphas_prev}')
        print(f'For the chosen value of eta, which is {eta}, '
              f'this results in the following sigma_t schedule for ddim sampler {sigmas}')
    return sigmas, alphas, alphas_prev

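
## Editor's sketch (not part of the original file): with eta = 0 every
## sigma is 0 and sampling is fully deterministic; eta = 1 recovers
## DDPM-like stochasticity. The alphacums below are a stand-in schedule.
def _demo_make_ddim_sampling_parameters():
    alphacums = torch.linspace(0.999, 0.01, 1000)   # stand-in cumulative alphas
    ddim_ts = make_ddim_timesteps('uniform', 50, 1000, verbose=False)
    sigmas, alphas, alphas_prev = make_ddim_sampling_parameters(
        alphacums, ddim_ts, eta=0., verbose=False)
    assert np.allclose(sigmas, 0.) and len(alphas) == len(alphas_prev) == 50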

def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.999):
    """
    Create a beta schedule that discretizes the given alpha_t_bar function,
    which defines the cumulative product of (1-beta) over time from t = [0,1].
    :param num_diffusion_timesteps: the number of betas to produce.
    :param alpha_bar: a lambda that takes an argument t from 0 to 1 and
                      produces the cumulative product of (1-beta) up to that
                      part of the diffusion process.
    :param max_beta: the maximum beta to use; use values lower than 1 to
                     prevent singularities.
    """
    betas = []
    for i in range(num_diffusion_timesteps):
        t1 = i / num_diffusion_timesteps
        t2 = (i + 1) / num_diffusion_timesteps
        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return np.array(betas)

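
## Editor's sketch (not part of the original file): feeding the
## squared-cosine alpha_bar from https://arxiv.org/abs/2102.09672 yields
## the familiar cosine beta schedule, clipped at max_beta.
def _demo_betas_for_alpha_bar():
    alpha_bar = lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2
    betas = betas_for_alpha_bar(100, alpha_bar)
    assert betas.shape == (100,) and (betas > 0).all() and betas.max() <= 0.999
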
================================================
FILE: i2v/lvdm/modules/attention.py
================================================
from functools import partial
import torch
from torch import nn, einsum
import torch.nn.functional as F
from einops import rearrange, repeat
try:
    import xformers
    import xformers.ops
    XFORMERS_IS_AVAILABLE = True
except ImportError:
    XFORMERS_IS_AVAILABLE = False
from lvdm.common import (
    checkpoint,
    exists,
    default,
)
from lvdm.basics import (
    zero_module,
)

class RelativePosition(nn.Module):
    """ https://github.com/evelinehong/Transformer_Relative_Position_PyTorch/blob/master/relative_position.py """

    def __init__(self, num_units, max_relative_position):
        super().__init__()
        self.num_units = num_units
        self.max_relative_position = max_relative_position
        self.embeddings_table = nn.Parameter(torch.Tensor(max_relative_position * 2 + 1, num_units))
        nn.init.xavier_uniform_(self.embeddings_table)

    def forward(self, length_q, length_k):
        device = self.embeddings_table.device
        range_vec_q = torch.arange(length_q, device=device)
        range_vec_k = torch.arange(length_k, device=device)
        distance_mat = range_vec_k[None, :] - range_vec_q[:, None]
        distance_mat_clipped = torch.clamp(distance_mat, -self.max_relative_position, self.max_relative_position)
        final_mat = distance_mat_clipped + self.max_relative_position
        final_mat = final_mat.long()
        embeddings = self.embeddings_table[final_mat]
        return embeddings

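
## Editor's sketch (not part of the original file): each (query, key)
## offset indexes one learned embedding row, clipped to
## +/- max_relative_position, so distant offsets share an embedding.
def _demo_relative_position():
    rp = RelativePosition(num_units=64, max_relative_position=4)
    emb = rp(length_q=16, length_k=16)
    assert emb.shape == (16, 16, 64)
    assert torch.equal(emb[0, 4], emb[0, 15])   # offsets 4 and 15 both clip to +4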

class CrossAttention(nn.Module):

    def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0., 
                 relative_position=False, temporal_length=None, img_cross_attention=False):
        super().__init__()
        inner_dim = dim_head * heads
        context_dim = default(context_dim, query_dim)

        self.scale = dim_head**-0.5
        self.heads = heads
        self.dim_head = dim_head
        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_out = nn.Sequential(nn.Linear(inner_dim, query_dim), nn.Dropout(dropout))

        self.image_cross_attention_scale = 1.0
        self.text_context_len = 77
        self.img_cross_attention = img_cross_attention
        if self.img_cross_attention:
            self.to_k_ip = nn.Linear(context_dim, inner_dim, bias=False)
            self.to_v_ip = nn.Linear(context_dim, inner_dim, bias=False)
        
        self.relative_position = relative_position
        if self.relative_position:
            assert(temporal_length is not None)
            self.relative_position_k = RelativePosition(num_units=dim_head, max_relative_position=temporal_length)
            self.relative_position_v = RelativePosition(num_units=dim_head, max_relative_position=temporal_length)
        else:
            ## only used for spatial attention, while NOT for temporal attention
            if XFORMERS_IS_AVAILABLE and temporal_length is None:
                self.forward = self.efficient_forward

    def forward(self, x, context=None, mask=None):
        h = self.heads

        q = self.to_q(x)
        context = default(context, x)
        ## considering image token additionally
        if context is not None and self.img_cross_attention:
            context, context_img = context[:,:self.text_context_len,:], context[:,self.text_context_len:,:]
            k = self.to_k(context)
            v = self.to_v(context)
            k_ip = self.to_k_ip(context_img)
            v_ip = self.to_v_ip(context_img)
        else:
            k = self.to_k(context)
            v = self.to_v(context)

        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=h), (q, k, v))
        sim = torch.einsum('b i d, b j d -> b i j', q, k) * self.scale
        if self.relative_position:
            len_q, len_k, len_v = q.shape[1], k.shape[1], v.shape[1]
            k2 = self.relative_position_k(len_q, len_k)
            sim2 = einsum('b t d, t s d -> b t s', q, k2) * self.scale # TODO check 
            sim += sim2
        del k

        if exists(mask):
            ## feasible for causal attention mask only
            max_neg_value = -torch.finfo(sim.dtype).max
            mask = repeat(mask, 'b i j -> (b h) i j', h=h)
            sim.masked_fill_(~(mask>0.5), max_neg_value)

        # attention, what we cannot get enough of
        sim = sim.softmax(dim=-1)
        out = torch.einsum('b i j, b j d -> b i d', sim, v)
        if self.relative_position:
            v2 = self.relative_position_v(len_q, len_v)
            out2 = einsum('b t s, t s d -> b t d', sim, v2) # TODO check
            out += out2
        out = rearrange(out, '(b h) n d -> b n (h d)', h=h)

        ## considering image token additionally
        if context is not None and self.img_cross_attention:
            k_ip, v_ip = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=h), (k_ip, v_ip))
            sim_ip =  torch.einsum('b i d, b j d -> b i j', q, k_ip) * self.scale
            del k_ip
            sim_ip = sim_ip.softmax(dim=-1)
            out_ip = torch.einsum('b i j, b j d -> b i d', sim_ip, v_ip)
            out_ip = rearrange(out_ip, '(b h) n d -> b n (h d)', h=h)
            out = out + self.image_cross_attention_scale * out_ip
        del q

        return self.to_out(out)
    
    def efficient_forward(self, x, context=None, mask=None):
        q = self.to_q(x)
        context = default(context, x)

        ## considering image token additionally
        if context is not None and self.img_cross_attention:
            context, context_img = context[:,:self.text_context_len,:], context[:,self.text_context_len:,:]
            k = self.to_k(context)
            v = self.to_v(context)
            k_ip = self.to_k_ip(context_img)
            v_ip = self.to_v_ip(context_img)
        else:
            k = self.to_k(context)
            v = self.to_v(context)

        b, _, _ = q.shape
        q, k, v = map(
            lambda t: t.unsqueeze(3)
            .reshape(b, t.shape[1], self.heads, self.dim_head)
            .permute(0, 2, 1, 3)
            .reshape(b * self.heads, t.shape[1], self.dim_head)
            .contiguous(),
            (q, k, v),
        )
        # actually compute the attention, what we cannot get enough of
        out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None, op=None)

        ## considering image token additionally
        if context is not None and self.img_cross_attention:
            k_ip, v_ip = map(
                lambda t: t.unsqueeze(3)
                .reshape(b, t.shape[1], self.heads, self.dim_head)
                .permute(0, 2, 1, 3)
                .reshape(b * self.heads, t.shape[1], self.dim_head)
                .contiguous(),
                (k_ip, v_ip),
            )
            out_ip = xformers.ops.memory_efficient_attention(q, k_ip, v_ip, attn_bias=None, op=None)
            out_ip = (
                out_ip.unsqueeze(0)
                .reshape(b, self.heads, out.shape[1], self.dim_head)
                .permute(0, 2, 1, 3)
                .reshape(b, out.shape[1], self.heads * self.dim_head)
            )

        if exists(mask):
            raise NotImplementedError
        out = (
            out.unsqueeze(0)
            .reshape(b, self.heads, out.shape[1], self.dim_head)
            .permute(0, 2, 1, 3)
            .reshape(b, out.shape[1], self.heads * self.dim_head)
        )
        if context is not None and self.img_cross_attention:
            out = out + self.image_cross_attention_scale * out_ip
        return self.to_out(out)

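
## Editor's sketch (not part of the original file): without context the
## module self-attends; with context it cross-attends, and the output is
## always projected back to query_dim. The dimensions here are assumptions.
def _demo_cross_attention_shapes():
    attn = CrossAttention(query_dim=320, context_dim=1024, heads=8, dim_head=40)
    x = torch.randn(2, 64, 320)        # [b, tokens, query_dim]
    ctx = torch.randn(2, 77, 1024)     # e.g. CLIP text embeddings
    with torch.no_grad():
        out = attn(x, context=ctx)
    assert out.shape == (2, 64, 320)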

class BasicTransformerBlock(nn.Module):

    def __init__(self, dim, n_heads, d_head, dropout=0., context_dim=None, gated_ff=True, checkpoint=True,
                disable_self_attn=False, attention_cls=None, img_cross_attention=False):
        super().__init__()
        attn_cls = CrossAttention if attention_cls is None else attention_cls
        self.disable_self_attn = disable_self_attn
        self.attn1 = attn_cls(query_dim=dim, heads=n_heads, dim_head=d_head, dropout=dropout,
            context_dim=context_dim if self.disable_self_attn else None)
        self.ff = FeedForward(dim, dropout=dropout, glu=gated_ff)
        self.attn2 = attn_cls(query_dim=dim, context_dim=context_dim, heads=n_heads, dim_head=d_head, dropout=dropout,
            img_cross_attention=img_cross_attention)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.checkpoint = checkpoint

    def forward(self, x, context=None, mask=None):
        ## implementation trick: checkpointing doesn't support non-tensor (e.g. None or scalar) arguments,
        ## so the mask is bound via partial() and only tensors go through the input tuple
        input_tuple = (x,)      ## must be (x,), otherwise *input_tuple would unpack x into multiple arguments
        if context is not None:
            input_tuple = (x, context)
        if mask is not None:
            forward_mask = partial(self._forward, mask=mask)
            return checkpoint(forward_mask, input_tuple, self.parameters(), self.checkpoint)
        return checkpoint(self._forward, input_tuple, self.parameters(), self.checkpoint)

    def _forward(self, x, context=None, mask=None):
        x = self.attn1(self.norm1(x), context=context if self.disable_self_attn else None, mask=mask) + x
        x = self.attn2(self.norm2(x), context=context, mask=mask) + x
        x = self.ff(self.norm3(x)) + x
        return x


class SpatialTransformer(nn.Module):
    """
    Transformer block for image-like data in spatial axis.
    First, project the input (aka embedding)
    and reshape to b, t, d.
    Then apply standard transformer action.
    Finally, reshape to image
    NEW: use_linear for more efficiency instead of the 1x1 convs
    """

    def __init__(self, in_channels, n_heads, d_head, depth=1, dropout=0., context_dim=None,
                 use_checkpoint=True, disable_self_attn=False, use_linear=False, img_cross_attention=False):
        super().__init__()
        self.in_channels = in_channels
        inner_dim = n_heads * d_head
        self.norm = torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True)
        if not use_linear:
            self.proj_in = nn.Conv2d(in_channels, inner_dim, kernel_size=1, stride=1, padding=0)
        else:
            self.proj_in = nn.Linear(in_channels, inner_dim)

        self.transformer_blocks = nn.ModuleList([
            BasicTransformerBlock(
                inner_dim,
                n_heads,
                d_head,
                dropout=dropout,
                context_dim=context_dim,
                img_cross_attention=img_cross_attention,
                disable_self_attn=disable_self_attn,
                checkpoint=use_checkpoint) for d in range(depth)
        ])
        if not use_linear:
            self.proj_out = zero_module(nn.Conv2d(inner_dim, in_channels, kernel_size=1, stride=1, padding=0))
        else:
            self.proj_out = zero_module(nn.Linear(inner_dim, in_channels))
        self.use_linear = use_linear


    def forward(self, x, context=None):
        b, c, h, w = x.shape
        x_in = x
        x = self.norm(x)
        if not self.use_linear:
            x = self.proj_in(x)
        x = rearrange(x, 'b c h w -> b (h w) c').contiguous()
        if self.use_linear:
            x = self.proj_in(x)
        for i, block in enumerate(self.transformer_blocks):
            x = block(x, context=context)
        if self.use_linear:
            x = self.proj_out(x)
        x = rearrange(x, 'b (h w) c -> b c h w', h=h, w=w).contiguous()
        if not self.use_linear:
            x = self.proj_out(x)
        return x + x_in
    
    
class TemporalTransformer(nn.Module):
    """
    Transformer block for image-like data in temporal axis.
    First, reshape to b, t, d.
    Then apply standard transformer action.
    Finally, reshape to image
    """
    def __init__(self, in_channels, n_heads, d_head, depth=1, dropout=0., context_dim=None,
                 use_checkpoint=True, use_linear=False, only_self_att=True, causal_attention=False,
                 relative_position=False, temporal_length=None):
        super().__init__()
        self.only_self_att = only_self_att
        self.relative_position = relative_position
        self.causal_attention = causal_attention
        self.in_channels = in_channels
        inner_dim = n_heads * d_head
        self.norm = torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True)
        if not use_linear:
            self.proj_in = nn.Conv1d(in_channels, inner_dim, kernel_size=1, stride=1, padding=0)
        else:
            self.proj_in = nn.Linear(in_channels, inner_dim)

        if relative_position:
            assert(temporal_length is not None)
            attention_cls = partial(CrossAttention, relative_position=True, temporal_length=temporal_length)
        else:
            attention_cls = None
        if self.causal_attention:
            assert(temporal_length is not None)
            self.mask = torch.tril(torch.ones([1, temporal_length, temporal_length]))

        if self.only_self_att:
            context_dim = None
        self.transformer_blocks = nn.ModuleList([
            BasicTransformerBlock(
                inner_dim,
                n_heads,
                d_head,
                dropout=dropout,
                context_dim=context_dim,
                attention_cls=attention_cls,
                checkpoint=use_checkpoint) for d in range(depth)
        ])
        if not use_linear:
            self.proj_out = zero_module(nn.Conv1d(inner_dim, in_channels, kernel_size=1, stride=1, padding=0))
        else:
            self.proj_out = zero_module(nn.Linear(inner_dim, in_channels))
        self.use_linear = use_linear

    def forward(self, x, context=None):
        b, c, t, h, w = x.shape
        x_in = x
        x = self.norm(x)
        x = rearrange(x, 'b c t h w -> (b h w) c t').contiguous()
        if not self.use_linear:
            x = self.proj_in(x)
        x = rearrange(x, 'bhw c t -> bhw t c').contiguous()
        if self.use_linear:
            x = self.proj_in(x)

        if self.causal_attention:
            mask = self.mask.to(x.device)
            mask = repeat(mask, 'l i j -> (l bhw) i j', bhw=b*h*w)
        else:
            mask = None

        if self.only_self_att:
            ## note: if no context is given, cross-attention defaults to self-attention
            for i, block in enumerate(self.transformer_blocks):
                x = block(x, mask=mask)
            x = rearrange(x, '(b hw) t c -> b hw t c', b=b).contiguous()
        else:
            x = rearrange(x, '(b hw) t c -> b hw t c', b=b).contiguous()
            context = rearrange(context, '(b t) l con -> b t l con', t=t).contiguous()
            for i, block in enumerate(self.transformer_blocks):
                # process each batch element separately (some kernels cannot handle a dimension larger than 65,535)
                for j in range(b):
                    context_j = repeat(
                        context[j],
                        't l con -> (t r) l con', r=(h * w) // t, t=t).contiguous()
                    ## note: the causal mask is not applied in the cross-attention case
                    x[j] = block(x[j], context=context_j)
        
        if self.use_linear:
            x = self.proj_out(x)
            x = rearrange(x, 'b (h w) t c -> b c t h w', h=h, w=w).contiguous()
        if not self.use_linear:
            x = rearrange(x, 'b hw t c -> (b hw) c t').contiguous()
            x = self.proj_out(x)
            x = rearrange(x, '(b h w) c t -> b c t h w', b=b, h=h, w=w).contiguous()

        return x + x_in
    

class GEGLU(nn.Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.proj = nn.Linear(dim_in, dim_out * 2)

    def forward(self, x):
        x, gate = self.proj(x).chunk(2, dim=-1)
        return x * F.gelu(gate)


class FeedForward(nn.Module):
    def __init__(self, dim, dim_out=None, mult=4, glu=False, dropout=0.):
        super().__init__()
        inner_dim = int(dim * mult)
        dim_out = default(dim_out, dim)
        project_in = nn.Sequential(
            nn.Linear(dim, inner_dim),
            nn.GELU()
        ) if not glu else GEGLU(dim, inner_dim)

        self.net = nn.Sequential(
            project_in,
            nn.Dropout(dropout),
            nn.Linear(inner_dim, dim_out)
        )

    def forward(self, x):
        return self.net(x)

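
## Editor's sketch (not part of the original file): with glu=True the
## input projection is a GEGLU, i.e. half the hidden units gate the other
## half through GELU, which is how the transformer blocks above use it.
def _demo_feedforward_geglu():
    ff = FeedForward(dim=320, mult=4, glu=True)
    x = torch.randn(2, 64, 320)
    with torch.no_grad():
        y = ff(x)
    assert y.shape == x.shape
    assert isinstance(ff.net[0], GEGLU) and ff.net[0].proj.out_features == 320 * 4 * 2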

class LinearAttention(nn.Module):
    def __init__(self, dim, heads=4, dim_head=32):
        super().__init__()
        self.heads = heads
        hidden_dim = dim_head * heads
        self.to_qkv = nn.Conv2d(dim, hidden_dim * 3, 1, bias = False)
        self.to_out = nn.Conv2d(hidden_dim, dim, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        qkv = self.to_qkv(x)
        q, k, v = rearrange(qkv, 'b (qkv heads c) h w -> qkv b heads c (h w)', heads = self.heads, qkv=3)
        k = k.softmax(dim=-1)  
        context = torch.einsum('bhdn,bhen->bhde', k, v)
        out = torch.einsum('bhde,bhdn->bhen', context, q)
        out = rearrange(out, 'b heads c (h w) -> b (heads c) h w', heads=self.heads, h=h, w=w)
        return self.to_out(out)


class SpatialSelfAttention(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        self.in_channels = in_channels

        self.norm = torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True)
        self.q = torch.nn.Conv2d(in_channels,
                                 in_channels,
                                 kernel_size=1,
                                 stride=1,
                                 padding=0)
        self.k = torch.nn.Conv2d(in_channels,
                                 in_channels,
                                 kernel_size=1,
                                 stride=1,
                                 padding=0)
        self.v = torch.nn.Conv2d(in_channels,
                                 in_channels,
                                 kernel_size=1,
                                 stride=1,
                                 padding=0)
        self.proj_out = torch.nn.Conv2d(in_channels,
                                        in_channels,
                                        kernel_size=1,
                                        stride=1,
                                        padding=0)

    def forward(self, x):
        h_ = x
        h_ = self.norm(h_)
        q = self.q(h_)
        k = self.k(h_)
        v = self.v(h_)

        # compute attention
        b,c,h,w = q.shape
        q = rearrange(q, 'b c h w -> b (h w) c')
        k = rearrange(k, 'b c h w -> b c (h w)')
        w_ = torch.einsum('bij,bjk->bik', q, k)

        w_ = w_ * (int(c)**(-0.5))
        w_ = torch.nn.functional.softmax(w_, dim=2)

        # attend to values
        v = rearrange(v, 'b c h w -> b c (h w)')
        w_ = rearrange(w_, 'b i j -> b j i')
        h_ = torch.einsum('bij,bjk->bik', v, w_)
        h_ = rearrange(h_, 'b c (h w) -> b c h w', h=h)
        h_ = self.proj_out(h_)

        return x+h_

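
## Editor's sketch (not part of the original file): a shape check for the
## residual self-attention block; in_channels must be divisible by the 32
## GroupNorm groups.
def _demo_spatial_self_attention():
    attn = SpatialSelfAttention(in_channels=64)
    x = torch.randn(1, 64, 16, 16)
    with torch.no_grad():
        y = attn(x)
    assert y.shape == x.shape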

================================================
FILE: i2v/lvdm/modules/encoders/condition.py
================================================
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
import kornia
import open_clip
from transformers import T5Tokenizer, T5EncoderModel, CLIPTokenizer, CLIPTextModel
from lvdm.common import autocast
from utils.utils import count_params

class AbstractEncoder(nn.Module):
    def __init__(self):
        super().__init__()

    def encode(self, *args, **kwargs):
        raise NotImplementedError


class IdentityEncoder(AbstractEncoder):

    def encode(self, x):
        return x


class ClassEmbedder(nn.Module):
    def __init__(self, embed_dim, n_classes=1000, key='class', ucg_rate=0.1):
        super().__init__()
        self.key = key
        self.embedding = nn.Embedding(n_classes, embed_dim)
        self.n_classes = n_classes
        self.ucg_rate = ucg_rate

    def forward(self, batch, key=None, disable_dropout=False):
        if key is None:
            key = self.key
        # this is for use in crossattn
        c = batch[key][:, None]
        if self.ucg_rate > 0. and not disable_dropout:
            mask = 1. - torch.bernoulli(torch.ones_like(c) * self.ucg_rate)
            c = mask * c + (1 - mask) * torch.ones_like(c) * (self.n_classes - 1)
            c = c.long()
        c = self.embedding(c)
        return c

    def get_unconditional_conditioning(self, bs, device="cuda"):
        uc_class = self.n_classes - 1  # the last class index (n_classes - 1) is reserved for unconditional guidance
        uc = torch.ones((bs,), device=device) * uc_class
        uc = {self.key: uc}
        return uc

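
## Editor's sketch (not part of the original file): ucg_rate randomly
## replaces labels with the reserved last class, implementing the label
## dropout used for classifier-free guidance.
def _demo_class_embedder_ucg():
    emb = ClassEmbedder(embed_dim=8, n_classes=10, ucg_rate=0.5)
    batch = {'class': torch.randint(0, 9, (4,))}
    c = emb(batch)                                   # ~half the labels become class 9
    uc = emb.get_unconditional_conditioning(4, device='cpu')
    assert c.shape == (4, 1, 8) and uc['class'].shape == (4,)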

def disabled_train(self, mode=True):
    """Overwrite model.train with this function to make sure train/eval mode
    does not change anymore."""
    return self


class FrozenT5Embedder(AbstractEncoder):
    """Uses the T5 transformer encoder for text"""

    def __init__(self, version="google/t5-v1_1-large", device="cuda", max_length=77,
                 freeze=True):  # others are google/t5-v1_1-xl and google/t5-v1_1-xxl
        super().__init__()
        self.tokenizer = T5Tokenizer.from_pretrained(version)
        self.transformer = T5EncoderModel.from_pretrained(version)
        self.device = device
        self.max_length = max_length  # TODO: typical value?
        if freeze:
            self.freeze()

    def freeze(self):
        self.transformer = self.transformer.eval()
        # self.train = disabled_train
        for param in self.parameters():
            param.requires_grad = False

    def forward(self, text):
        batch_encoding = self.tokenizer(text, truncation=True, max_length=self.max_length, return_length=True,
                                        return_overflowing_tokens=False, padding="max_length", return_tensors="pt")
        tokens = batch_encoding["input_ids"].to(self.device)
        outputs = self.transformer(input_ids=tokens)

        z = outputs.last_hidden_state
        return z

    def encode(self, text):
        return self(text)


class FrozenCLIPEmbedder(AbstractEncoder):
    """Uses the CLIP transformer encoder for text (from huggingface)"""
    LAYERS = [
        "last",
        "pooled",
        "hidden"
    ]

    def __init__(self, version="openai/clip-vit-large-patch14", device="cuda", max_length=77,
                 freeze=True, layer="last", layer_idx=None):  # clip-vit-base-patch32
        super().__init__()
        assert layer in self.LAYERS
        self.tokenizer = CLIPTokenizer.from_pretrained(version)
        self.transformer = CLIPTextModel.from_pretrained(version)
        self.device = device
        self.max_length = max_length
        if freeze:
            self.freeze()
        self.layer = layer
        self.layer_idx = layer_idx
        if layer == "hidden":
            assert layer_idx is not None
            assert 0 <= abs(layer_idx) <= 12

    def freeze(self):
        self.transformer = self.transformer.eval()
        # self.train = disabled_train
        for param in self.parameters():
            param.requires_grad = False

    def forward(self, text):
        batch_encoding = self.tokenizer(text, truncation=True, max_length=self.max_length, return_length=True,
                                        return_overflowing_tokens=False, padding="max_length", return_tensors="pt")
        tokens = batch_encoding["input_ids"].to(self.device)
        outputs = self.transformer(input_ids=tokens, output_hidden_states=self.layer == "hidden")
        if self.layer == "last":
            z = outputs.last_hidden_state
        elif self.layer == "pooled":
            z = outputs.pooler_output[:, None, :]
        else:
            z = outputs.hidden_states[self.layer_idx]
        return z

    def encode(self, text):
        return self(text)
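
# Hypothetical usage sketch (illustration only): with the default
# openai/clip-vit-large-patch14 text encoder, the output is 77 tokens of
# width 768.
def _demo_frozen_clip_embedder():
    embedder = FrozenCLIPEmbedder(device="cpu")
    z = embedder.encode(["a castle at dawn, game background"])
    print(z.shape)  # torch.Size([1, 77, 768])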


class ClipImageEmbedder(nn.Module):
    def __init__(
            self,
            model,
            jit=False,
            device='cuda' if torch.cuda.is_available() else 'cpu',
            antialias=True,
            ucg_rate=0.
    ):
        super().__init__()
        from clip import load as load_clip
        self.model, _ = load_clip(name=model, device=device, jit=jit)

        self.antialias = antialias

        self.register_buffer('mean', torch.Tensor([0.48145466, 0.4578275, 0.40821073]), persistent=False)
        self.register_buffer('std', torch.Tensor([0.26862954, 0.26130258, 0.27577711]), persistent=False)
        self.ucg_rate = ucg_rate

    def preprocess(self, x):
        # normalize to [0,1]
        x = kornia.geometry.resize(x, (224, 224),
                                   interpolation='bicubic', align_corners=True,
                                   antialias=self.antialias)
        x = (x + 1.) / 2.
        # re-normalize according to clip
        x = kornia.enhance.normalize(x, self.mean, self.std)
        return x

    def forward(self, x, no_dropout=False):
        # x is assumed to be in range [-1,1]
        out = self.model.encode_image(self.preprocess(x))
        out = out.to(x.dtype)
        if self.ucg_rate > 0. and not no_dropout:
            out = torch.bernoulli((1. - self.ucg_rate) * torch.ones(out.shape[0], device=out.device))[:, None] * out
        return out
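
# Hypothetical usage sketch (illustration only): assumes the OpenAI `clip`
# package is installed and can download ViT-L/14 weights. Inputs are images
# in [-1, 1]; preprocess() resizes to 224x224, so any input resolution works.
def _demo_clip_image_embedder():
    embedder = ClipImageEmbedder(model="ViT-L/14", device="cpu")
    x = torch.rand(2, 3, 256, 256) * 2. - 1.
    out = embedder(x)
    print(out.shape)  # torch.Size([2, 768]) pooled image embeddings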


class FrozenOpenCLIPEmbedder(AbstractEncoder):
    """
    Uses the OpenCLIP transformer encoder for text
    """
    LAYERS = [
        # "pooled",
        "last",
        "penultimate"
    ]

    def __init__(self, arch="ViT-H-14", version="laion2b_s32b_b79k", device="cuda", max_length=77,
                 freeze=True, layer="last"):
        super().__init__()
        assert layer in self.LAYERS
        model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'),
                                                            pretrained=version)
        del model.visual
        self.model = model

        self.device = device
        self.max_length = max_length
        if freeze:
            self.freeze()
        self.layer = layer
        if self.layer == "last":
            self.layer_idx = 0
        elif self.layer == "penultimate":
            self.layer_idx = 1
        else:
            raise NotImplementedError()

    def freeze(self):
        self.model = self.model.eval()
        for param in self.parameters():
            param.requires_grad = False

    def forward(self, text):
        self.device = self.model.positional_embedding.device
        tokens = open_clip.tokenize(text)
        z = self.encode_with_transformer(tokens.to(self.device))
        return z

    def encode_with_transformer(self, text):
        x = self.model.token_embedding(text)  # [batch_size, n_ctx, d_model]
        x = x + self.model.positional_embedding
        x = x.permute(1, 0, 2)  # NLD -> LND
        x = self.text_transformer_forward(x, attn_mask=self.model.attn_mask)
        x = x.permute(1, 0, 2)  # LND -> NLD
        x = self.model.ln_final(x)
        return x

    def text_transformer_forward(self, x: torch.Tensor, attn_mask=None):
        for i, r in enumerate(self.model.transformer.resblocks):
            if i == len(self.model.transformer.resblocks) - self.layer_idx:
                break
            if self.model.transformer.grad_checkpointing and not torch.jit.is_scripting():
                x = checkpoint(r, x, attn_mask)
            else:
                x = r(x, attn_mask=attn_mask)
        return x

    def encode(self, text):
        return self(text)
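
# Hypothetical usage sketch (illustration only): assumes open_clip can fetch
# the laion2b_s32b_b79k checkpoint. ViT-H-14's text tower has width 1024.
def _demo_frozen_open_clip_embedder():
    embedder = FrozenOpenCLIPEmbedder()
    z = embedder.encode(["a side-scrolling platformer level"])
    print(z.shape)  # torch.Size([1, 77, 1024])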


class FrozenOpenCLIPImageEmbedder(AbstractEncoder):
    """
    Uses the OpenCLIP vision transformer encoder for images
    """

    def __init__(self, arch="ViT-H-14", version="laion2b_s32b_b79k", device="cuda", max_length=77,
                 freeze=True, layer="pooled", antialias=True, ucg_rate=0.):
        super().__init__()
        model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'),
                                                            pretrained=version, )
        del model.transformer
        self.model = model

        self.device = device
        self.max_length = max_length
        if freeze:
            self.freeze()
        self.layer = layer
        if self.layer == "penultimate":
            raise NotImplementedError()

        self.antialias = antialias

        self.register_buffer('mean', torch.Tensor([0.48145466, 0.4578275, 0.40821073]), persistent=False)
        self.register_buffer('std', torch.Tensor([0.26862954, 0.26130258, 0.27577711]), persistent=False)
        self.ucg_rate = ucg_rate

    def preprocess(self, x):
        # normalize to [0,1]
        x = kornia.geometry.resize(x, (224, 224),
                                   interpolation='bicubic', align_corners=True,
                                   antialias=self.antialias)
        x = (x + 1.) / 2.
        # renormalize according to clip
        x = kornia.enhance.normalize(x, self.mean, self.std)
        return x

    def freeze(self):
        self.model = self.model.eval()
        for param in self.parameters():
            param.requires_grad = False

    @autocast
    def forward(self, image, no_dropout=False):
        z = self.encode_with_vision_transformer(image)
        if self.ucg_rate > 0. and not no_dropout:
            z = torch.bernoulli((1. - self.ucg_rate) * torch.ones(z.shape[0], device=z.device))[:, None] * z
        return z

    def encode_with_vision_transformer(self, img):
        img = self.preprocess(img)
        x = self.model.visual(img)
        return x

    def encode(self, image):
        return self(image)
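
# Hypothetical usage sketch (illustration only): forward() returns the pooled,
# projected image embedding from model.visual (1024-d for ViT-H-14), computed
# under the repo's autocast wrapper.
def _demo_frozen_open_clip_image_embedder():
    embedder = FrozenOpenCLIPImageEmbedder()
    image = torch.rand(2, 3, 256, 256) * 2. - 1.   # images in [-1, 1]
    z = embedder(image)
    print(z.shape)  # torch.Size([2, 1024])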



class FrozenOpenCLIPImageEmbedderV2(AbstractEncoder):
    """
    Uses the OpenCLIP vision transformer encoder for images
    """

    def __init__(self, arch="ViT-H-14", version="laion2b_s32b_b79k", device="cuda",
                 freeze=True, layer="pooled", antialias=True):
        super().__init__()
        model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'),
                                                            pretrained=version, )
        del model.transformer
        self.model = model
        self.device = device

        if freeze:
            self.freeze()
        self.layer = layer
        if self.layer == "penultimate":
            raise NotImplementedError()

        self.antialias = antialias
        self.register_buffer('mean', torch.Tensor([0.48145466, 0.4578275, 0.40821073]), persistent=False)
        self.register_buffer('std', torch.Tensor([0.26862954, 0.26130258, 0.27577711]), persistent=False)


    def preprocess(self, x):
        # normalize to [0,1]
        x = kornia.geometry.resize(x, (224, 224),
                                   interpolation='bicubic', align_corners=True,
                                   antialias=self.antialias)
        x = (x + 1.) / 2.
        # renormalize according to clip
        x = kornia.enhance.normalize(x, self.mean, self.std)
        return x

    def freeze(self):
        self.model = self.model.eval()
        for param in self.model.parameters():
            param.requires_grad = False

    def forward(self, image, no_dropout=False):
        ## image: b c h w
        z = self.encode_with_vision_transformer(image)
        return z

    def encode_with_vision_transformer(self, x):
        x = self.preprocess(x)

        # to patches - whether to use dual patchnorm - https://arxiv.org/abs/2302.01327v1
        if self.model.visual.input_patchnorm:
            # einops - rearrange(x, 'b c (h p1) (w p2) -> b (h w) (c p1 p2)')
            x = x.reshape(x.shape[0], x.shape[1], self.model.visual.grid_size[0], self.model.visual.patch_size[0], self.model.visual.grid_size[1], self.model.visual.patch_size[1])
            x = x.permute(0, 2, 4, 1, 3, 5)
            x = x.reshape(x.shape[0], self.model.visual.grid_size[0] * self.model.visual.grid_size[1], -1)
            x = self.model.visual.patchnorm_pre_ln(x)
            x = self.model.visual.conv1(x)
        else:
            x = self.model.visual.conv1(x)  # shape = [*, width, grid, grid]
            x = x.reshape(x.shape[0], x.shape[1], -1)  # shape = [*, width, grid ** 2]
            x = x.permute(0, 2, 1)  # shape = [*, grid ** 2, width]

        # class embeddings and positional embeddings
        x = torch.cat(
            [self.model.visual.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device),
             x], dim=1)  # shape = [*, grid ** 2 + 1, width]
        x = x + self.model.visual.positional_embedding.to(x.dtype)

        # a patch_dropout of 0. would mean it is disabled and this function would do nothing but return what was passed in
        x = self.model.visual.patch_dropout(x)
        x = self.model.visual.ln_pre(x)

        x = x.permute(1, 0, 2)  # NLD -> LND
        x = self.model.visual.transformer(x)
        x = x.permute(1, 0, 2)  # LND -> NLD

        return x
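
# Hypothetical usage sketch (illustration only): unlike the pooled variant
# above, V2 returns the full token sequence after the transformer but before
# the final layernorm/projection, i.e. (batch, 257, 1280) for ViT-H-14 at
# 224x224 (256 patch tokens + 1 class token).
def _demo_frozen_open_clip_image_embedder_v2():
    embedder = FrozenOpenCLIPImageEmbedderV2()
    image = torch.rand(2, 3, 256, 256) * 2. - 1.
    tokens = embedder(image)
    print(tokens.shape)  # torch.Size([2, 257, 1280])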


class FrozenCLIPT5Encoder(AbstractEncoder):
    def __init__(self, clip_version="openai/clip-vit-large-patch14", t5_version="google/t5-v1_1-xl", device="cuda",
                 clip_max_length=77, t5_max_length=77):
        super().__init__()
        self.clip_encoder = FrozenCLIPEmbedder(clip_version, device, max_length=clip_max_length)
        self.t5_encoder = FrozenT5Embedder(t5_version, device, max_length=t5_max_length)
        print(f"{self.clip_encoder.__class__.__name__} has {count_params(self.clip_encoder) * 1.e-6:.2f} M parameters, "
              f"{self.t5_encoder.__class__.__name__} comes with {count_params(self.t5_encoder) * 1.e-6:.2f} M params.")

    def encode(self, text):
        return self(text)

    def forward(self, text):
        clip_z = self.clip_encoder.encode(text)
        t5_z = self.t5_encoder.encode(text)
        return [clip_z, t5_z]
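
# Hypothetical usage sketch (illustration only): encode() returns a
# [clip_z, t5_z] pair; with the defaults this is (1, 77, 768) from CLIP and
# (1, 77, 2048) from google/t5-v1_1-xl (d_model=2048).
def _demo_frozen_clip_t5_encoder():
    encoder = FrozenCLIPT5Encoder(device="cpu")
    clip_z, t5_z = encoder.encode(["a wizard casting a spell"])
    print(clip_z.shape, t5_z.shape)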

================================================
FILE: i2v/lvdm/modules/encoders/ip_resampler.py
================================================
# modified from https://github.com/mlfoundations/open_flamingo/blob/main/open_flamingo/src/helpers.py
import math
import torch
import torch.nn as nn


class ImageProjModel(nn.Module):
    """Projection Model"""
    def __init__(self, cross_attention_dim=1024, clip_embeddings_dim=1024, clip_extra_context_tokens=4):
        super().__init__()        
        self.cross_attention_dim = cross_attention_dim
        self.clip_extra_context_tokens = clip_extra_context_tokens
        self.proj = nn.Linear(clip_embeddings_dim, self.clip_extra_context_tokens * cross_attention_dim)
        self.norm = nn.LayerNorm(cross_attention_dim)
        
    def forward(self, image_embeds):
        #embeds = image_embeds
        embeds = image_embeds.type(list(self.proj.parameters())[0].dtype)
        clip_extra_context_tokens = self.proj(embeds).reshape(-1, self.clip_extra_context_tokens, self.cross_attention_dim)
        clip_extra_context_tokens = self.norm(clip_extra_context_tokens)
        return clip_extra_context_tokens
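
# Hypothetical usage sketch (illustration only, fully self-contained): project
# a pooled CLIP image embedding into a few extra cross-attention context tokens.
def _demo_image_proj_model():
    proj = ImageProjModel(cross_attention_dim=1024, clip_embeddings_dim=768,
                          clip_extra_context_tokens=4)
    image_embeds = torch.randn(2, 768)
    tokens = proj(image_embeds)
    print(tokens.shape)  # torch.Size([2, 4, 1024])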

# FFN
def FeedForward(dim, mult=4):
    inner_dim = int(dim * mult)
    return nn.Sequential(
        nn.LayerNorm(dim),
        nn.Linear(dim, inner_dim, bias=False),
        nn.GELU(),
        nn.Linear(inner_dim, dim, bias=False),
    )
    
    
def reshape_tensor(x, heads):
    bs, length, width = x.shape
    #(bs, length, width) --> (bs, length, n_heads, dim_per_head)
    x = x.view(bs, length, heads, -1)
    # (bs, length, n_heads, dim_per_head) --> (bs, n_heads, length, dim_per_head)
    x = x.transpose(1, 2)
    # reshape keeps the (bs, n_heads, length, dim_per_head) layout (and makes it contiguous)
    x = x.reshape(bs, heads, length, -1)
    return x


class PerceiverAttention(nn.Module):
    def __init__(self, *, dim, dim_head=64, heads=8):
        super().__init__()
        self.scale = dim_head**-0.5
        self.dim_head = dim_head
        self.heads = heads
        inner_dim = dim_head * heads

        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

        self.to_q = nn.Linear(dim, inner_dim, bias=False)
        self.to_kv = nn.Linear(dim, inner_dim * 2, bias=False)
        self.to_out = nn.Linear(inner_dim, dim, bias=False)


    def forward(self, x, latents):
        """
        Args:
            x (torch.Tensor): image features
                shape (b, n1, D)
            latents (torch.Tensor): latent features
                shape (b, n2, D)
        """
        x = self.norm1(x)
        latents = self.norm2(latents)
        
        b, l, _ = latents.shape

        q = self.to_q(latents)
        kv_input = torch.cat((x, latents), dim=-2)
        k, v = self.to_kv(kv_input).chunk(2, dim=-1)
        
        q = reshape_tensor(q, self.heads)
        k = reshape_tensor(k, self.heads)
        v = reshape_tensor(v, self.heads)

        # attention
        scale = 1 / math.sqrt(math.sqrt(self.dim_head))
        weight = (q * scale) @ (k * scale).transpose(-2, -1) # More stable with f16 than dividing afterwards
        weight = torch.softmax(weight.float(), dim=-1).type(weight.dtype)
        out = weight @ v
        
        out = out.permute(0, 2, 1, 3).reshape(b, l, -1)

        return self.to_out(out)


class Resampler(nn.Module):
    def __init__(
        self,
        dim=1024,
        depth=8,
        dim_head=64,
        heads=16,
        num_queries=8,
        embedding_dim=768,
        output_dim=1024,
        ff_mult=4,
    ):
        super().__init__()
        
        self.latents = nn.Parameter(torch.randn(1, num_queries, dim) / dim**0.5)
        
        self.proj_in = nn.Linear(embedding_dim, dim)

        self.proj_out = nn.Linear(dim, output_dim)
        self.norm_out = nn.LayerNorm(output_dim)
        
        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(
                nn.ModuleList(
                    [
                        PerceiverAttention(dim=dim, dim_head=dim_head, heads=heads),
                        FeedForward(dim=dim, mult=ff_mult),
                    ]
                )
            )

    def forward(self, x):
        
        latents = self.latents.repeat(x.size(0), 1, 1)
        
        x = self.proj_in(x)
        
        for attn, ff in self.layers:
            latents = attn(x, latents) + latents
            latents = ff(latents) + latents
            
        latents = self.proj_out(latents)
        return self.norm_out(latents)
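
# Hypothetical usage sketch (illustration only, fully self-contained): resample
# a ViT-H-14-style token sequence (257 tokens of width 1280) down to
# num_queries learned latents of width output_dim.
def _demo_resampler():
    resampler = Resampler(dim=1024, depth=2, dim_head=64, heads=16,
                          num_queries=8, embedding_dim=1280, output_dim=1024)
    x = torch.randn(2, 257, 1280)
    out = resampler(x)
    print(out.shape)  # torch.Size([2, 8, 1024])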

================================================
FILE: i2v/lvdm/modules/networks/ae_modules.py
================================================
# pytorch_diffusion + derived encoder decoder
import math
import torch
import numpy as np
import torch.nn as nn
from einops import rearrange
from utils.utils import instantiate_from_config
from lvdm.modules.attention import LinearAttention

def nonlinearity(x):
    # swish
    return x*torch.sigmoid(x)


def Normalize(in_channels, num_groups=32):
    return torch.nn.GroupNorm(num_groups=num_groups, num_channels=in_channels, eps=1e-6, affine=True)



class LinAttnBlock(LinearAttention):
    """to match AttnBlock usage"""
    def __init__(self, in_channels):
        super().__init__(dim=in_channels, heads=1, dim_head=in_channels)


class AttnBlock(nn.Module):
    def __init__(self, in_channels):
        super().__init__()
        self.in_channels = in_channels

        self.norm = Normalize(in_channels)
        self.q = torch.nn.Conv2d(in_channels,
                                 in_channels,
                                 kernel_size=1,
                                 stride=1,
                                 padding=0)
        self.k = torch.nn.Conv2d(in_channels,
                                 in_channels,
                                 kernel_size=1,
                                 stride=1,
                                 padding=0)
        self.v = torch.nn.Conv2d(in_channels,
                                 in_channels,
                                 kernel_size=1,
                                 stride=1,
                                 padding=0)
        self.proj_out = torch.nn.Conv2d(in_channels,
                                        in_channels,
                                        kernel_size=1,
                                        stride=1,
                                        padding=0)

    def forward(self, x):
        h_ = x
        h_ = self.norm(h_)
        q = self.q(h_)
        k = self.k(h_)
        v = self.v(h_)

        # compute attention
        b,c,h,w = q.shape
        q = q.reshape(b,c,h*w) # bcl
        q = q.permute(0,2,1)   # bcl -> blc l=hw
        k = k.reshape(b,c,h*w) # bcl
        
        w_ = torch.bmm(q,k)    # b,hw,hw    w[b,i,j]=sum_c q[b,i,c]k[b,c,j]
        w_ = w_ * (int(c)**(-0.5))
        w_ = torch.nn.functional.softmax(w_, dim=2)

        # attend to values
        v = v.reshape(b,c,h*w)
        w_ = w_.permute(0,2,1)   # b,hw,hw (first hw of k, second of q)
        h_ = torch.bmm(v,w_)     # b, c,hw (hw of q) h_[b,c,j] = sum_i v[b,c,i] w_[b,i,j]
        h_ = h_.reshape(b,c,h,w)

        h_ = self.proj_out(h_)

        return x+h_
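
# Hypothetical usage sketch (illustration only): AttnBlock is residual
# single-head self-attention over spatial positions, so the shape is preserved.
def _demo_attn_block():
    attn = AttnBlock(in_channels=64)
    y = attn(torch.randn(1, 64, 16, 16))
    print(y.shape)  # torch.Size([1, 64, 16, 16])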

def make_attn(in_channels, attn_type="vanilla"):
    assert attn_type in ["vanilla", "linear", "none"], f'attn_type {attn_type} unknown'
    #print(f"making attention of type '{attn_type}' with {in_channels} in_channels")
    if attn_type == "vanilla":
        return AttnBlock(in_channels)
    elif attn_type == "none":
        return nn.Identity(in_channels)
    else:
        return LinAttnBlock(in_channels)
 
class Downsample(nn.Module):
    def __init__(self, in_channels, with_conv):
        super().__init__()
        self.with_conv = with_conv
        self.in_channels = in_channels
        if self.with_conv:
            # no asymmetric padding in torch conv, must do it ourselves
            self.conv = torch.nn.Conv2d(in_channels,
                                        in_channels,
                                        kernel_size=3,
                                        stride=2,
                                        padding=0)
    def forward(self, x):
        if self.with_conv:
            pad = (0,1,0,1)
            x = torch.nn.functional.pad(x, pad, mode="constant", value=0)
            x = self.conv(x)
        else:
            x = torch.nn.functional.avg_pool2d(x, kernel_size=2, stride=2)
        return x

class Upsample(nn.Module):
    def __init__(self, in_channels, with_conv):
        super().__init__()
        self.with_conv = with_conv
        self.in_channels = in_channels
        if self.with_conv:
            self.conv = torch.nn.Conv2d(in_channels,
                                        in_channels,
                                        kernel_size=3,
                                        stride=1,
                                        padding=1)

    def forward(self, x):
        x = torch.nn.functional.interpolate(x, scale_factor=2.0, mode="nearest")
        if self.with_conv:
            x = self.conv(x)
        return x

def get_timestep_embedding(timesteps, embedding_dim):
    """
    This matches the implementation in Denoising Diffusion Probabilistic Models:
    From Fairseq.
    Build sinusoidal embeddings.
    This matches the implementation in tensor2tensor, but differs slightly
    from the description in Section 3.5 of "Attention Is All You Need".
    """
    assert len(timesteps.shape) == 1

    half_dim = embedding_dim // 2
    emb = math.log(10000) / (half_dim - 1)
    emb = torch.exp(torch.arange(half_dim, dtype=torch.float32) * -emb)
    emb = emb.to(device=timesteps.device)
    emb = timesteps.float()[:, None] * emb[None, :]
    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1)
    if embedding_dim % 2 == 1:  # zero pad
        emb = torch.nn.functional.pad(emb, (0,1,0,0))
    return emb
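
# Hypothetical usage sketch (illustration only): the embedding is [sin | cos]
# halves over log-spaced frequencies, one row per timestep.
def _demo_get_timestep_embedding():
    t = torch.arange(4)
    emb = get_timestep_embedding(t, embedding_dim=128)
    print(emb.shape)  # torch.Size([4, 128])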



class ResnetBlock(nn.Module):
    def __init__(self, *, in_channels, out_channels=None, conv_shortcut=False,
                 dropout, temb_channels=512):
        super().__init__()
        self.in_channels = in_channels
        out_channels = in_channels if out_channels is None else out_channels
        self.out_channels = out_channels
        self.use_conv_shortcut = conv_shortcut

        self.norm1 = Normalize(in_channels)
        self.conv1 = torch.nn.Conv2d(in_channels,
                                     out_channels,
                                     kernel_size=3,
                                     stride=1,
                                     padding=1)
        if temb_channels > 0:
            self.temb_proj = torch.nn.Linear(temb_channels,
                                             out_channels)
        self.norm2 = Normalize(out_channels)
        self.dropout = torch.nn.Dropout(dropout)
        self.conv2 = torch.nn.Conv2d(out_channels,
                                     out_channels,
                                     kernel_size=3,
                                     stride=1,
                                     padding=1)
        if self.in_channels != self.out_channels:
            if self.use_conv_shortcut:
                self.conv_shortcut = torch.nn.Conv2d(in_channels,
                                                     out_channels,
                                                     kernel_size=3,
                                                     stride=1,
                                                     padding=1)
            else:
                self.nin_shortcut = torch.nn.Conv2d(in_channels,
                                                    out_channels,
                                                    kernel_size=1,
                                                    stride=1,
                                                    padding=0)

    def forward(self, x, temb):
        h = x
        h = self.norm1(h)
        h = nonlinearity(h)
        h = self.conv1(h)

        if temb is not None:
            h = h + self.temb_proj(nonlinearity(temb))[:,:,None,None]

        h = self.norm2(h)
        h = nonlinearity(h)
        h = self.dropout(h)
        h = self.conv2(h)

        if self.in_channels != self.out_channels:
            if self.use_conv_shortcut:
                x = self.conv_shortcut(x)
            else:
                x = self.nin_shortcut(x)

        return x+h
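
# Hypothetical usage sketch (illustration only): with temb_channels=0 the block
# runs without a timestep embedding; a 1x1 nin_shortcut adapts the channels.
def _demo_resnet_block():
    block = ResnetBlock(in_channels=64, out_channels=128, dropout=0.0,
                        temb_channels=0)
    y = block(torch.randn(1, 64, 32, 32), temb=None)
    print(y.shape)  # torch.Size([1, 128, 32, 32])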

class Model(nn.Module):
    def __init__(self, *, ch, out_ch, ch_mult=(1,2,4,8), num_res_blocks,
                 attn_resolutions, dropout=0.0, resamp_with_conv=True, in_channels,
                 resolution, use_timestep=True, use_linear_attn=False, attn_type="vanilla"):
        super().__init__()
        if use_linear_attn: attn_type = "linear"
        self.ch = ch
        self.temb_ch = self.ch*4
        self.num_resolutions = len(ch_mult)
        self.num_res_blocks = num_res_blocks
        self.resolution = resolution
        self.in_channels = in_channels

        self.use_timestep = use_timestep
        if self.use_timestep:
            # timestep embedding
            self.temb = nn.Module()
            self.temb.dense = nn.ModuleList([
                torch.nn.Linear(self.ch,
                                self.temb_ch),
                torch.nn.Linear(self.temb_ch,
                                self.temb_ch),
            ])

        # downsampling
        self.conv_in = torch.nn.Conv2d(in_channels,
                                       self.ch,
                                       kernel_size=3,
                                       stride=1,
                                       padding=1)

        curr_res = resolution
        in_ch_mult = (1,)+tuple(ch_mult)
        self.down = nn.ModuleList()
        for i_level in range(self.num_resolutions):
            block = nn.ModuleList()
            attn = nn.ModuleList()
            block_in = ch*in_ch_mult[i_level]
            block_out = ch*ch_mult[i_level]
            for i_block in range(self.num_res_blocks):
                block.append(ResnetBlock(in_channels=block_in,
                                         out_channels=block_out,
                                         temb_channels=self.temb_ch,
                                         dropout=dropout))
                block_in = block_out
                if curr_res in attn_resolutions:
                    attn.append(make_attn(block_in, attn_type=attn_type))
            down = nn.Module()
            down.block = block
            down.attn = attn
            if i_level != self.num_resolutions-1:
                down.downsample = Downsample(block_in, resamp_with_conv)
                curr_res = curr_res // 2
            self.down.append(down)

        # middle
        self.mid = nn.Module()
        self.mid.block_1 = ResnetBlock(in_channels=block_in,
                                       out_channels=block_in,
                                       temb_channels=self.temb_ch,
                                       dropout=dropout)
        self.mid.attn_1 = make_attn(block_in, attn_type=attn_type)
        self.mid.block_2 = ResnetBlock(in_channels=block_in,
                                       out_channels=block_in,
                                       temb_channels=self.temb_ch,
                                       dropout=dropout)

        # upsampling
        self.up = nn.ModuleList()
        for i_level in reversed(range(self.num_resolutions)):
            block = nn.ModuleList()
            attn = nn.ModuleList()
            block_out = ch*ch_mult[i_level]
            skip_in = ch*ch_mult[i_level]
            for i_block in range(self.num_res_blocks+1):
                if i_block == self.num_res_blocks:
                    skip_in = ch*in_ch_mult[i_level]
                block.append(ResnetBlock(in_channels=block_in+skip_in,
                                         out_channels=block_out,
                                         temb_channels=self.temb_ch,
                                         dropout=dropout))
                block_in = block_out
                if curr_res in attn_resolutions:
                    attn.append(make_attn(block_in, attn_type=attn_type))
            up = nn.Module()
            up.block = block
            up.attn = attn
            if i_level != 0:
                up.upsample = Upsample(block_in, resamp_with_conv)
                curr_res = curr_res * 2
            self.up.insert(0, up) # prepend to get consistent order

        # end
        self.norm_out = Normalize(block_in)
        self.conv_out = torch.nn.Conv2d(block_in,
                                        out_ch,
                                        kernel_size=3,
                                        stride=1,
                                        padding=1)

    def forward(self, x, t=None, context=None):
        #assert x.shape[2] == x.shape[3] == self.resolution
        if context is not None:
            # assume aligned context, cat along channel axis
            x = torch.cat((x, context), dim=1)
        if self.use_timestep:
            # timestep embedding
            assert t is not None
            temb = get_timestep_embedding(t, self.ch)
            temb = self.temb.dense[0](temb)
            temb = nonlinearity(temb)
            temb = self.temb.dense[1](temb)
        else:
            temb = None

        # downsampling
        hs = [self.conv_in(x)]
        for i_level in range(self.num_resolutions):
            for i_block in range(self.num_res_blocks):
                h = self.down[i_level].block[i_block](hs[-1], temb)
                if len(self.down[i_level].attn) > 0:
                    h = self.down[i_level].attn[i_block](h)
                hs.append(h)
            if i_level != self.num_resolutions-1:
                hs.append(self.down[i_level].downsample(hs[-1]))

        # middle
        h = hs[-1]
        h = self.mid.block_1(h, temb)
        h = self.mid.attn_1(h)
        h = self.mid.block_2(h, temb)

        # upsampling
        for i_level in reversed(range(self.num_resolutions)):
            for i_block in range(self.num_res_blocks+1):
                h = self.up[i_level].block[i_block](
                    torch.cat([h, hs.pop()], dim=1), temb)
                if len(self.up[i_level].attn) > 0:
                    h = self.up[i_level].attn[i_block](h)
            if i_level != 0:
                h = self.up[i_level].upsample(h)

        # end
        h = self.norm_out(h)
        h = nonlinearity(h)
        h = self.conv_out(h)
        return h

    def get_last_layer(self):
        return self.conv_out.weight


class Encoder(nn.Module):
    def __init__(self, *, ch, out_ch, ch_mult=(1,2,4,8), num_res_blocks,
                 attn_resolutions, dropout=0.0, resamp_with_conv=True, in_channels,
                 resolution, z_channels, double_z=True, use_linear_attn=False, attn_type="vanilla",
                 **ignore_kwargs):
        super().__init__()
        if use_linear_attn: attn_type = "linear"
        self.ch = ch
        self.temb_ch = 0
        self.num_resolutions = len(ch_mult)
        self.num_res_blocks = num_res_blocks
        self.resolution = resolution
        self.in_channels = in_channels

        # downsampling
        self.conv_in = torch.nn.Conv2d(in_channels,
                                       self.ch,
                                       kernel_size=3,
                                       stride=1,
                                       padding=1)

        curr_res = resolution
        in_ch_mult = (1,)+tuple(ch_mult)
        self.in_ch_mult = in_ch_mult
        self.down = nn.ModuleList()
        for i_level in range(self.num_resolutions):
            block = nn.ModuleList()
            attn = nn.ModuleList()
            block_in = ch*in_ch_mult[i_level]
            block_out = ch*ch_mult[i_level]
            for i_block in range(self.num_res_blocks):
                block.append(ResnetBlock(in_channels=block_in,
                                         out_channels=block_out,
                                         temb_channels=self.temb_ch,
                                         dropout=dropout))
                block_in = block_out
                if curr_res in attn_resolutions:
                    attn.append(make_attn(block_in, attn_type=attn_type))
            down = nn.Module()
            down.block = block
            down.attn = attn
            if i_level != self.num_resolutions-1:
                down.downsample = Downsample(block_in, resamp_with_conv)
                curr_res = curr_res // 2
            self.down.append(down)

        # middle
        self.mid = nn.Module()
        self.mid.block_1 = ResnetBlock(in_channels=block_in,
                                       out_channels=block_in,
                                       temb_channels=self.temb_ch,
                                       dropout=dropout)
        self.mid.attn_1 = make_attn(block_in, attn_type=attn_type)
        self.mid.block_2 = ResnetBlock(in_channels=block_in,
                                       out_channels=block_in,
                                       temb_channels=self.temb_ch,
                                       dropout=dropout)

        # end
        self.norm_out = Normalize(block_in)
        self.conv_out = torch.nn.Conv2d(block_in,
                                        2*z_channels if double_z else z_channels,
                                        kernel_size=3,
                                        stride=1,
                                        padding=1)

    def forward(self, x):
        # timestep embedding
        temb = None

        # print(f'encoder-input={x.shape}')
        # downsampling
        hs = [self.conv_in(x)]
        # print(f'encoder-conv in feat={hs[0].shape}')
        for i_level in range(self.num_resolutions):
            for i_block in range(self.num_res_blocks):
                h = self.down[i_level].block[i_block](hs[-1], temb)
                # print(f'encoder-down feat={h.shape}')
                if len(self.down[i_level].attn) > 0:
                    h = self.down[i_level].attn[i_block](h)
                hs.append(h)
            if i_level != self.num_resolutions-1:
                hs.append(self.down[i_level].downsample(hs[-1]))

        # middle
        h = hs[-1]
        h = self.mid.block_1(h, temb)
        h = self.mid.attn_1(h)
        h = self.mid.block_2(h, temb)

        # end
        h = self.norm_out(h)
        h = nonlinearity(h)
        h = self.conv_out(h)
        return h
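
# Hypothetical usage sketch (illustration only): a 64x64 RGB input with
# ch_mult=(1, 2, 4) is downsampled twice; with double_z=True the head emits
# 2*z_channels moment channels for the diagonal Gaussian posterior.
def _demo_encoder():
    enc = Encoder(ch=64, out_ch=3, ch_mult=(1, 2, 4), num_res_blocks=2,
                  attn_resolutions=[], dropout=0.0, in_channels=3,
                  resolution=64, z_channels=4, double_z=True)
    z = enc(torch.randn(1, 3, 64, 64))
    print(z.shape)  # torch.Size([1, 8, 16, 16])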

================================================
SYMBOL INDEX (497 symbols across 31 files)

FILE: i2v/gradio_app.py
  function videocrafter_demo (line 17) | def videocrafter_demo(result_dir='./tmp/'):

FILE: i2v/lvdm/basics.py
  function disabled_train (line 14) | def disabled_train(self, mode=True):
  function zero_module (line 19) | def zero_module(module):
  function scale_module (line 27) | def scale_module(module, scale):
  function conv_nd (line 36) | def conv_nd(dims, *args, **kwargs):
  function linear (line 49) | def linear(*args, **kwargs):
  function avg_pool_nd (line 56) | def avg_pool_nd(dims, *args, **kwargs):
  function nonlinearity (line 69) | def nonlinearity(type='silu'):
  class GroupNormSpecific (line 76) | class GroupNormSpecific(nn.GroupNorm):
    method forward (line 77) | def forward(self, x):
  function normalization (line 81) | def normalization(channels, num_groups=32):
  class HybridConditioner (line 90) | class HybridConditioner(nn.Module):
    method __init__ (line 92) | def __init__(self, c_concat_config, c_crossattn_config):
    method forward (line 97) | def forward(self, c_concat, c_crossattn):

FILE: i2v/lvdm/common.py
  function gather_data (line 8) | def gather_data(data, return_np=True):
  function autocast (line 16) | def autocast(f):
  function extract_into_tensor (line 25) | def extract_into_tensor(a, t, x_shape):
  function noise_like (line 31) | def noise_like(shape, device, repeat=False):
  function default (line 37) | def default(val, d):
  function exists (line 42) | def exists(val):
  function identity (line 45) | def identity(*args, **kwargs):
  function uniq (line 48) | def uniq(arr):
  function mean_flat (line 51) | def mean_flat(tensor):
  function ismap (line 57) | def ismap(x):
  function isimage (line 62) | def isimage(x):
  function max_neg_value (line 67) | def max_neg_value(t):
  function shape_to_str (line 70) | def shape_to_str(x):
  function init_ (line 74) | def init_(tensor):
  function checkpoint (line 81) | def checkpoint(func, inputs, params, flag):

FILE: i2v/lvdm/data/frame_dataset.py
  function pil_loader (line 21) | def pil_loader(path):
  function accimage_loader (line 31) | def accimage_loader(path):
  function default_loader (line 39) | def default_loader(path):
  function is_image_file (line 48) | def is_image_file(filename):
  function find_classes (line 51) | def find_classes(dir):
  function class_name_to_idx (line 58) | def class_name_to_idx(annotation_dir):
  function make_dataset (line 68) | def make_dataset(dir, nframes, class_to_idx, frame_stride=1, **kwargs):
  function split_by_captical (line 124) | def split_by_captical(s):
  function make_dataset_ucf (line 131) | def make_dataset_ucf(dir, nframes, class_to_idx, frame_stride=1, clip_st...
  function load_and_transform_frames (line 206) | def load_and_transform_frames(frame_list, loader, img_transform=None):
  class VideoFrameDataset (line 230) | class VideoFrameDataset(data.Dataset):
    method __init__ (line 231) | def __init__(self,
    method __getitem__ (line 324) | def __getitem__(self, index):
    method __len__ (line 355) | def __len__(self):

FILE: i2v/lvdm/data/taichi.py
  class Taichi (line 8) | class Taichi(Dataset):
    method __init__ (line 20) | def __init__(self,
    method _make_dataset (line 41) | def _make_dataset(self):
    method __getitem__ (line 50) | def __getitem__(self, index):
    method __len__ (line 81) | def __len__(self):

FILE: i2v/lvdm/distributions.py
  class AbstractDistribution (line 5) | class AbstractDistribution:
    method sample (line 6) | def sample(self):
    method mode (line 9) | def mode(self):
  class DiracDistribution (line 13) | class DiracDistribution(AbstractDistribution):
    method __init__ (line 14) | def __init__(self, value):
    method sample (line 17) | def sample(self):
    method mode (line 20) | def mode(self):
  class DiagonalGaussianDistribution (line 24) | class DiagonalGaussianDistribution(object):
    method __init__ (line 25) | def __init__(self, parameters, deterministic=False):
    method sample (line 35) | def sample(self, noise=None):
    method kl (line 42) | def kl(self, other=None):
    method nll (line 56) | def nll(self, sample, dims=[1,2,3]):
    method mode (line 64) | def mode(self):
  function normal_kl (line 68) | def normal_kl(mean1, logvar1, mean2, logvar2):

FILE: i2v/lvdm/ema.py
  class LitEma (line 5) | class LitEma(nn.Module):
    method __init__ (line 6) | def __init__(self, model, decay=0.9999, use_num_upates=True):
    method forward (line 25) | def forward(self,model):
    method copy_to (line 46) | def copy_to(self, model):
    method store (line 55) | def store(self, parameters):
    method restore (line 64) | def restore(self, parameters):

FILE: i2v/lvdm/models/autoencoder.py
  class AutoencoderKL (line 13) | class AutoencoderKL(pl.LightningModule):
    method __init__ (line 14) | def __init__(self,
    method init_test (line 51) | def init_test(self,):
    method init_from_ckpt (line 80) | def init_from_ckpt(self, path, ignore_keys=list()):
    method encode (line 97) | def encode(self, x, **kwargs):
    method decode (line 104) | def decode(self, z, **kwargs):
    method forward (line 109) | def forward(self, input, sample_posterior=True):
    method get_input (line 118) | def get_input(self, batch, k):
    method training_step (line 128) | def training_step(self, batch, batch_idx, optimizer_idx):
    method validation_step (line 149) | def validation_step(self, batch, batch_idx):
    method configure_optimizers (line 163) | def configure_optimizers(self):
    method get_last_layer (line 174) | def get_last_layer(self):
    method log_images (line 178) | def log_images(self, batch, only_inputs=False, **kwargs):
    method to_rgb (line 194) | def to_rgb(self, x):
  class IdentityFirstStage (line 202) | class IdentityFirstStage(torch.nn.Module):
    method __init__ (line 203) | def __init__(self, *args, vq_interface=False, **kwargs):
    method encode (line 207) | def encode(self, x, *args, **kwargs):
    method decode (line 210) | def decode(self, x, *args, **kwargs):
    method quantize (line 213) | def quantize(self, x, *args, **kwargs):
    method forward (line 218) | def forward(self, x, *args, **kwargs):

FILE: i2v/lvdm/models/ddpm3d.py
  class DDPM (line 43) | class DDPM(pl.LightningModule):
    method __init__ (line 45) | def __init__(self,
    method register_schedule (line 122) | def register_schedule(self, given_betas=None, beta_schedule="linear", ...
    method ema_scope (line 177) | def ema_scope(self, context=None):
    method init_from_ckpt (line 191) | def init_from_ckpt(self, path, ignore_keys=list(), only_model=False):
    method forward (line 209) | def forward(self, x, *args, **kwargs):
    method shared_step (line 213) | def shared_step(self, batch):
    method training_step (line 218) | def training_step(self, batch, batch_idx):
    method validation_step (line 233) | def validation_step(self, batch, batch_idx):
    method q_mean_variance (line 246) | def q_mean_variance(self, x_start, t):
    method predict_start_from_noise (line 258) | def predict_start_from_noise(self, x_t, t, noise):
    method q_posterior (line 264) | def q_posterior(self, x_start, x_t, t):
    method p_mean_variance (line 273) | def p_mean_variance(self, x, t, clip_denoised: bool):
    method p_sample (line 286) | def p_sample(self, x, t, clip_denoised=True, repeat_noise=False):
    method p_sample_loop (line 295) | def p_sample_loop(self, shape, return_intermediates=False):
    method sample (line 310) | def sample(self, batch_size=16, return_intermediates=False):
    method q_sample (line 316) | def q_sample(self, x_start, t, noise=None):
    method get_input (line 322) | def get_input(self, batch, k):
    method _get_rows_from_list (line 327) | def _get_rows_from_list(self, samples):
    method log_images (line 335) | def log_images(self, batch, N=8, n_row=2, sample=True, return_keys=Non...
    method get_loss (line 372) | def get_loss(self, pred, target, mean=True, mask=None):
    method p_losses (line 390) | def p_losses(self, x_start, cond, t, noise=None, skip_qsample=False, x...
    method configure_optimizers (line 430) | def configure_optimizers(self):
  class LatentDiffusion (line 441) | class LatentDiffusion(DDPM):
    method __init__ (line 443) | def __init__(self,
    method make_cond_schedule (line 524) | def make_cond_schedule(self, ):
    method q_sample (line 529) | def q_sample(self, x_start, t, noise=None):
    method _freeze_model (line 540) | def _freeze_model(self):
    method instantiate_first_stage (line 544) | def instantiate_first_stage(self, config):
    method instantiate_cond_stage (line 551) | def instantiate_cond_stage(self, config):
    method get_learned_conditioning (line 562) | def get_learned_conditioning(self, c):
    method get_first_stage_encoding (line 575) | def get_first_stage_encoding(self, encoder_posterior, noise=None):
    method encode_first_stage (line 605) | def encode_first_stage(self, x):
    method get_condition (line 617) | def get_condition(self, batch, x, bs, force_c_encode, k, cond_key):
    method get_input (line 656) | def get_input(self, batch, k, return_first_stage_outputs=False, force_...
    method shared_step (line 694) | def shared_step(self, batch, **kwargs):
    method forward (line 703) | def forward(self, x, c, *args, **kwargs):
    method encode_first_stage_2DAE (line 712) | def encode_first_stage_2DAE(self, x):
    method decode_core (line 719) | def decode_core(self, z, **kwargs):
    method decode_first_stage (line 736) | def decode_first_stage(self, z, **kwargs):
    method apply_model (line 739) | def apply_model(self, x_noisy, t, cond, **kwargs):
    method _get_denoise_row_from_list (line 756) | def _get_denoise_row_from_list(self, samples, desc=''):
    method decode_first_stage_2DAE (line 783) | def decode_first_stage_2DAE(self, z, **kwargs):
    method p_mean_variance (line 792) | def p_mean_variance(self, x, c, t, clip_denoised: bool, return_x0=Fals...
    method p_sample (line 818) | def p_sample(self, x, c, t, clip_denoised=False, repeat_noise=False, r...
    method p_sample_loop (line 840) | def p_sample_loop(self, cond, shape, return_intermediates=False, x_T=N...
    method sample (line 887) | def sample(self, cond, batch_size=16, return_intermediates=False, x_T=...
    method log_condition (line 905) | def log_condition(self, log, batch, xc, x, c, cond_stage_key=None):
    method sample_log (line 939) | def sample_log(self, cond, batch_size, ddim, ddim_steps, shape=None, *...
    method get_condition_validate (line 953) | def get_condition_validate(self, prompt):
    method log_images (line 964) | def log_images(self, batch, N=8, n_row=4, sample=True, ddim_steps=50, ...
    method decode_first_stage (line 1021) | def decode_first_stage(self, z, decode_bs=16, return_cpu=True,
  class LatentVisualDiffusion (line 1065) | class LatentVisualDiffusion(LatentDiffusion):
    method __init__ (line 1066) | def __init__(self, cond_img_config, finegrained=False, random_cond=Fal...
    method instantiate_img_embedder (line 1074) | def instantiate_img_embedder(self, config, freeze=True):
    method init_projector (line 1082) | def init_projector(self, use_finegrained, num_tokens, input_dim, cross...
    method get_image_embeds (line 1094) | def get_image_embeds(self, batch_imgs):
  class DiffusionWrapper (line 1101) | class DiffusionWrapper(pl.LightningModule):
    method __init__ (line 1102) | def __init__(self, diff_model_config, conditioning_key):
    method forward (line 1107) | def forward(self, x, t, c_concat: list = None, c_crossattn: list = None,

FILE: i2v/lvdm/models/samplers/ddim.py
  class DDIMSampler (line 8) | class DDIMSampler(object):
    method __init__ (line 9) | def __init__(self, model, schedule="linear", **kwargs):
    method register_buffer (line 16) | def register_buffer(self, name, attr):
    method make_schedule (line 22) | def make_schedule(self, ddim_num_steps, ddim_discretize="uniform", ddi...
    method sample (line 63) | def sample(self,
    method ddim_sampling (line 133) | def ddim_sampling(self, cond, shape,
    method p_sample_ddim (line 213) | def p_sample_ddim(self, x, c, t, index, repeat_noise=False, use_origin...
    method stochastic_encode (line 295) | def stochastic_encode(self, x0, t, use_original_steps=False, noise=None):
    method decode (line 317) | def decode(self, x_latent, cond, t_start, unconditional_guidance_scale...

FILE: i2v/lvdm/models/utils_diffusion.py
  function timestep_embedding (line 8) | def timestep_embedding(timesteps, dim, max_period=10000, repeat_only=Fal...
  function make_beta_schedule (line 31) | def make_beta_schedule(schedule, n_timestep, linear_start=1e-4, linear_e...
  function make_ddim_timesteps (line 56) | def make_ddim_timesteps(ddim_discr_method, num_ddim_timesteps, num_ddpm_...
  function make_ddim_sampling_parameters (line 73) | def make_ddim_sampling_parameters(alphacums, ddim_timesteps, eta, verbos...
  function betas_for_alpha_bar (line 88) | def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.9...

FILE: i2v/lvdm/modules/attention.py
  class RelativePosition (line 21) | class RelativePosition(nn.Module):
    method __init__ (line 24) | def __init__(self, num_units, max_relative_position):
    method forward (line 31) | def forward(self, length_q, length_k):
  class CrossAttention (line 43) | class CrossAttention(nn.Module):
    method __init__ (line 45) | def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, ...
    method forward (line 76) | def forward(self, x, context=None, mask=None):
    method efficient_forward (line 129) | def efficient_forward(self, x, context=None, mask=None):
  class BasicTransformerBlock (line 187) | class BasicTransformerBlock(nn.Module):
    method __init__ (line 189) | def __init__(self, dim, n_heads, d_head, dropout=0., context_dim=None,...
    method forward (line 204) | def forward(self, x, context=None, mask=None):
    method _forward (line 216) | def _forward(self, x, context=None, mask=None):
  class SpatialTransformer (line 223) | class SpatialTransformer(nn.Module):
    method __init__ (line 233) | def __init__(self, in_channels, n_heads, d_head, depth=1, dropout=0., ...
    method forward (line 262) | def forward(self, x, context=None):
  class TemporalTransformer (line 281) | class TemporalTransformer(nn.Module):
    method __init__ (line 288) | def __init__(self, in_channels, n_heads, d_head, depth=1, dropout=0., ...
    method forward (line 331) | def forward(self, x, context=None):
  class GEGLU (line 376) | class GEGLU(nn.Module):
    method __init__ (line 377) | def __init__(self, dim_in, dim_out):
    method forward (line 381) | def forward(self, x):
  class FeedForward (line 386) | class FeedForward(nn.Module):
    method __init__ (line 387) | def __init__(self, dim, dim_out=None, mult=4, glu=False, dropout=0.):
    method forward (line 402) | def forward(self, x):
  class LinearAttention (line 406) | class LinearAttention(nn.Module):
    method __init__ (line 407) | def __init__(self, dim, heads=4, dim_head=32):
    method forward (line 414) | def forward(self, x):
  class SpatialSelfAttention (line 425) | class SpatialSelfAttention(nn.Module):
    method __init__ (line 426) | def __init__(self, in_channels):
    method forward (line 452) | def forward(self, x):

FILE: i2v/lvdm/modules/encoders/condition.py
  class AbstractEncoder (line 10) | class AbstractEncoder(nn.Module):
    method __init__ (line 11) | def __init__(self):
    method encode (line 14) | def encode(self, *args, **kwargs):
  class IdentityEncoder (line 18) | class IdentityEncoder(AbstractEncoder):
    method encode (line 20) | def encode(self, x):
  class ClassEmbedder (line 24) | class ClassEmbedder(nn.Module):
    method __init__ (line 25) | def __init__(self, embed_dim, n_classes=1000, key='class', ucg_rate=0.1):
    method forward (line 32) | def forward(self, batch, key=None, disable_dropout=False):
    method get_unconditional_conditioning (line 44) | def get_unconditional_conditioning(self, bs, device="cuda"):
  function disabled_train (line 51) | def disabled_train(self, mode=True):
  class FrozenT5Embedder (line 57) | class FrozenT5Embedder(AbstractEncoder):
    method __init__ (line 60) | def __init__(self, version="google/t5-v1_1-large", device="cuda", max_...
    method freeze (line 70) | def freeze(self):
    method forward (line 76) | def forward(self, text):
    method encode (line 85) | def encode(self, text):
  class FrozenCLIPEmbedder (line 89) | class FrozenCLIPEmbedder(AbstractEncoder):
    method __init__ (line 97) | def __init__(self, version="openai/clip-vit-large-patch14", device="cu...
    method freeze (line 113) | def freeze(self):
    method forward (line 119) | def forward(self, text):
    method encode (line 132) | def encode(self, text):
  class ClipImageEmbedder (line 136) | class ClipImageEmbedder(nn.Module):
    method __init__ (line 137) | def __init__(
    method preprocess (line 155) | def preprocess(self, x):
    method forward (line 165) | def forward(self, x, no_dropout=False):
  class FrozenOpenCLIPEmbedder (line 174) | class FrozenOpenCLIPEmbedder(AbstractEncoder):
    method __init__ (line 184) | def __init__(self, arch="ViT-H-14", version="laion2b_s32b_b79k", devic...
    method freeze (line 204) | def freeze(self):
    method forward (line 209) | def forward(self, text):
    method encode_with_transformer (line 215) | def encode_with_transformer(self, text):
    method text_transformer_forward (line 224) | def text_transformer_forward(self, x: torch.Tensor, attn_mask=None):
    method encode (line 234) | def encode(self, text):
  class FrozenOpenCLIPImageEmbedder (line 238) | class FrozenOpenCLIPImageEmbedder(AbstractEncoder):
    method __init__ (line 243) | def __init__(self, arch="ViT-H-14", version="laion2b_s32b_b79k", devic...
    method preprocess (line 266) | def preprocess(self, x):
    method freeze (line 276) | def freeze(self):
    method forward (line 282) | def forward(self, image, no_dropout=False):
    method encode_with_vision_transformer (line 288) | def encode_with_vision_transformer(self, img):
    method encode (line 293) | def encode(self, text):
  class FrozenOpenCLIPImageEmbedderV2 (line 298) | class FrozenOpenCLIPImageEmbedderV2(AbstractEncoder):
    method __init__ (line 303) | def __init__(self, arch="ViT-H-14", version="laion2b_s32b_b79k", devic...
    method preprocess (line 324) | def preprocess(self, x):
    method freeze (line 334) | def freeze(self):
    method forward (line 339) | def forward(self, image, no_dropout=False):
    method encode_with_vision_transformer (line 344) | def encode_with_vision_transformer(self, x):
  class FrozenCLIPT5Encoder (line 377) | class FrozenCLIPT5Encoder(AbstractEncoder):
    method __init__ (line 378) | def __init__(self, clip_version="openai/clip-vit-large-patch14", t5_ve...
    method encode (line 386) | def encode(self, text):
    method forward (line 389) | def forward(self, text):

FILE: i2v/lvdm/modules/encoders/ip_resampler.py
  class ImageProjModel (line 7) | class ImageProjModel(nn.Module):
    method __init__ (line 9) | def __init__(self, cross_attention_dim=1024, clip_embeddings_dim=1024,...
    method forward (line 16) | def forward(self, image_embeds):
  function FeedForward (line 24) | def FeedForward(dim, mult=4):
  function reshape_tensor (line 34) | def reshape_tensor(x, heads):
  class PerceiverAttention (line 45) | class PerceiverAttention(nn.Module):
    method __init__ (line 46) | def __init__(self, *, dim, dim_head=64, heads=8):
    method forward (line 61) | def forward(self, x, latents):
  class Resampler (line 93) | class Resampler(nn.Module):
    method __init__ (line 94) | def __init__(
    method forward (line 125) | def forward(self, x):

FILE: i2v/lvdm/modules/networks/ae_modules.py
  function nonlinearity (line 10) | def nonlinearity(x):
  function Normalize (line 15) | def Normalize(in_channels, num_groups=32):
  class LinAttnBlock (line 20) | class LinAttnBlock(LinearAttention):
    method __init__ (line 22) | def __init__(self, in_channels):
  class AttnBlock (line 26) | class AttnBlock(nn.Module):
    method __init__ (line 27) | def __init__(self, in_channels):
    method forward (line 53) | def forward(self, x):
  function make_attn (line 80) | def make_attn(in_channels, attn_type="vanilla"):
  class Downsample (line 90) | class Downsample(nn.Module):
    method __init__ (line 91) | def __init__(self, in_channels, with_conv):
    method forward (line 102) | def forward(self, x):
  class Upsample (line 111) | class Upsample(nn.Module):
    method __init__ (line 112) | def __init__(self, in_channels, with_conv):
    method forward (line 123) | def forward(self, x):
  function get_timestep_embedding (line 129) | def get_timestep_embedding(timesteps, embedding_dim):
  class ResnetBlock (line 151) | class ResnetBlock(nn.Module):
    method __init__ (line 152) | def __init__(self, *, in_channels, out_channels=None, conv_shortcut=Fa...
    method forward (line 190) | def forward(self, x, temb):
  class Model (line 212) | class Model(nn.Module):
    method __init__ (line 213) | def __init__(self, *, ch, out_ch, ch_mult=(1,2,4,8), num_res_blocks,
    method forward (line 312) | def forward(self, x, t=None, context=None):
    method get_last_layer (line 360) | def get_last_layer(self):
  class Encoder (line 364) | class Encoder(nn.Module):
    method __init__ (line 365) | def __init__(self, *, ch, out_ch, ch_mult=(1,2,4,8), num_res_blocks,
    method forward (line 430) | def forward(self, x):
  class Decoder (line 466) | class Decoder(nn.Module):
    method __init__ (line 467) | def __init__(self, *, ch, out_ch, ch_mult=(1,2,4,8), num_res_blocks,
    method forward (line 539) | def forward(self, z):
  class SimpleDecoder (line 581) | class SimpleDecoder(nn.Module):
    method __init__ (line 582) | def __init__(self, in_channels, out_channels, *args, **kwargs):
    method forward (line 604) | def forward(self, x):
  class UpsampleDecoder (line 617) | class UpsampleDecoder(nn.Module):
    method __init__ (line 618) | def __init__(self, in_channels, out_channels, ch, num_res_blocks, reso...
    method forward (line 651) | def forward(self, x):
  class LatentRescaler (line 665) | class LatentRescaler(nn.Module):
    method __init__ (line 666) | def __init__(self, factor, in_channels, mid_channels, out_channels, de...
    method forward (line 690) | def forward(self, x):
  class MergedRescaleEncoder (line 702) | class MergedRescaleEncoder(nn.Module):
    method __init__ (line 703) | def __init__(self, in_channels, ch, resolution, out_ch, num_res_blocks,
    method forward (line 715) | def forward(self, x):
  class MergedRescaleDecoder (line 721) | class MergedRescaleDecoder(nn.Module):
    method __init__ (line 722) | def __init__(self, z_channels, out_ch, resolution, num_res_blocks, att...
    method forward (line 732) | def forward(self, x):
  class Upsampler (line 738) | class Upsampler(nn.Module):
    method __init__ (line 739) | def __init__(self, in_size, out_size, in_channels, out_channels, ch_mu...
    method forward (line 751) | def forward(self, x):
  class Resize (line 757) | class Resize(nn.Module):
    method __init__ (line 758) | def __init__(self, in_channels=None, learned=False, mode="bilinear"):
    method forward (line 773) | def forward(self, x, scale_factor=1.0):
  class FirstStagePostProcessor (line 780) | class FirstStagePostProcessor(nn.Module):
    method __init__ (line 782) | def __init__(self, ch_mult:list, in_channels,
    method instantiate_pretrained (line 817) | def instantiate_pretrained(self, config):
    method encode_with_pretrained (line 826) | def encode_with_pretrained(self,x):
    method forward (line 832) | def forward(self,x):
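
Note: get_timestep_embedding (line 129) is the standard DDPM sinusoidal timestep embedding; this file's header credits pytorch_diffusion, which uses the formula below. A self-contained version (implementations differ only in whether the frequency ladder divides by half or half - 1):

import math
import torch
import torch.nn.functional as F

def sinusoidal_timestep_embedding(timesteps, embedding_dim):
    # Geometrically spaced frequencies, then sin/cos of t * freq.
    half = embedding_dim // 2
    freqs = torch.exp(torch.arange(half, dtype=torch.float32)
                      * -(math.log(10000.0) / (half - 1)))
    args = timesteps.float()[:, None] * freqs[None, :]           # (B, half)
    emb = torch.cat([torch.sin(args), torch.cos(args)], dim=1)   # (B, 2*half)
    if embedding_dim % 2 == 1:
        emb = F.pad(emb, (0, 1))                                 # zero-pad odd dims
    return emb

print(sinusoidal_timestep_embedding(torch.tensor([0, 10, 999]), 128).shape)
# torch.Size([3, 128])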

FILE: i2v/lvdm/modules/networks/openaimodel3d.py
  class TimestepBlock (line 19) | class TimestepBlock(nn.Module):
    method forward (line 24) | def forward(self, x, emb):
  class TimestepEmbedSequential (line 30) | class TimestepEmbedSequential(nn.Sequential, TimestepBlock):
    method forward (line 36) | def forward(self, x, emb, context=None, batch_size=None):
  class Downsample (line 51) | class Downsample(nn.Module):
    method __init__ (line 60) | def __init__(self, channels, use_conv, dims=2, out_channels=None, padd...
    method forward (line 75) | def forward(self, x):
  class Upsample (line 80) | class Upsample(nn.Module):
    method __init__ (line 89) | def __init__(self, channels, use_conv, dims=2, out_channels=None, padd...
    method forward (line 98) | def forward(self, x):
  class ResBlock (line 109) | class ResBlock(TimestepBlock):
    method __init__ (line 124) | def __init__(
    method forward (line 195) | def forward(self, x, emb,  batch_size=None):
    method _forward (line 208) | def _forward(self, x, emb,  batch_size=None,):
  class TemporalConvBlock (line 237) | class TemporalConvBlock(nn.Module):
    method __init__ (line 242) | def __init__(self, in_channels, out_channels=None, dropout=0.0, spatia...
    method forward (line 269) | def forward(self, x):
  class UNetModel (line 279) | class UNetModel(nn.Module):
    method __init__ (line 307) | def __init__(self,
    method forward (line 534) | def forward(self, x, timesteps, context=None, features_adapter=None, f...
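
Note: TimestepEmbedSequential is the usual openaimodel dispatch trick: a Sequential whose forward inspects each child's type and routes the timestep embedding (and cross-attention context) only to the layers that accept it. A minimal sketch; ContextBlock is a hypothetical stand-in for the attention layers that take context, and the batch_size plumbing from the real signature is omitted:

from abc import abstractmethod
import torch.nn as nn

class TimestepBlock(nn.Module):
    """Any module whose forward also takes the timestep embedding."""
    @abstractmethod
    def forward(self, x, emb): ...

class ContextBlock(nn.Module):
    """Stand-in for layers that take cross-attention context."""
    @abstractmethod
    def forward(self, x, context): ...

class TimestepEmbedSequential(nn.Sequential):
    def forward(self, x, emb, context=None):
        for layer in self:
            if isinstance(layer, TimestepBlock):
                x = layer(x, emb)       # residual blocks see the t-embedding
            elif isinstance(layer, ContextBlock):
                x = layer(x, context)   # attention blocks see text/image context
            else:
                x = layer(x)            # plain layers (conv, norm, ...)
        return x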

FILE: i2v/lvdm/modules/x_transformer.py
  class AbsolutePositionalEmbedding (line 24) | class AbsolutePositionalEmbedding(nn.Module):
    method __init__ (line 25) | def __init__(self, dim, max_seq_len):
    method init_ (line 30) | def init_(self):
    method forward (line 33) | def forward(self, x):
  class FixedPositionalEmbedding (line 38) | class FixedPositionalEmbedding(nn.Module):
    method __init__ (line 39) | def __init__(self, dim):
    method forward (line 44) | def forward(self, x, seq_dim=1, offset=0):
  function exists (line 53) | def exists(val):
  function default (line 57) | def default(val, d):
  function always (line 63) | def always(val):
  function not_equals (line 69) | def not_equals(val):
  function equals (line 75) | def equals(val):
  function max_neg_value (line 81) | def max_neg_value(tensor):
  function pick_and_pop (line 87) | def pick_and_pop(keys, d):
  function group_dict_by_key (line 92) | def group_dict_by_key(cond, d):
  function string_begins_with (line 101) | def string_begins_with(prefix, str):
  function group_by_key_prefix (line 105) | def group_by_key_prefix(prefix, d):
  function groupby_prefix_and_trim (line 109) | def groupby_prefix_and_trim(prefix, d):
  class Scale (line 116) | class Scale(nn.Module):
    method __init__ (line 117) | def __init__(self, value, fn):
    method forward (line 122) | def forward(self, x, **kwargs):
  class Rezero (line 127) | class Rezero(nn.Module):
    method __init__ (line 128) | def __init__(self, fn):
    method forward (line 133) | def forward(self, x, **kwargs):
  class ScaleNorm (line 138) | class ScaleNorm(nn.Module):
    method __init__ (line 139) | def __init__(self, dim, eps=1e-5):
    method forward (line 145) | def forward(self, x):
  class RMSNorm (line 150) | class RMSNorm(nn.Module):
    method __init__ (line 151) | def __init__(self, dim, eps=1e-8):
    method forward (line 157) | def forward(self, x):
  class Residual (line 162) | class Residual(nn.Module):
    method forward (line 163) | def forward(self, x, residual):
  class GRUGating (line 167) | class GRUGating(nn.Module):
    method __init__ (line 168) | def __init__(self, dim):
    method forward (line 172) | def forward(self, x, residual):
  class GEGLU (line 183) | class GEGLU(nn.Module):
    method __init__ (line 184) | def __init__(self, dim_in, dim_out):
    method forward (line 188) | def forward(self, x):
  class FeedForward (line 193) | class FeedForward(nn.Module):
    method __init__ (line 194) | def __init__(self, dim, dim_out=None, mult=4, glu=False, dropout=0.):
    method forward (line 209) | def forward(self, x):
  class Attention (line 214) | class Attention(nn.Module):
    method __init__ (line 215) | def __init__(
    method forward (line 267) | def forward(
  class AttentionLayers (line 369) | class AttentionLayers(nn.Module):
    method __init__ (line 370) | def __init__(
    method forward (line 480) | def forward(
  class Encoder (line 540) | class Encoder(AttentionLayers):
    method __init__ (line 541) | def __init__(self, **kwargs):
  class TransformerWrapper (line 547) | class TransformerWrapper(nn.Module):
    method __init__ (line 548) | def __init__(
    method init_ (line 594) | def init_(self):
    method forward (line 597) | def forward(
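
Note: GEGLU (line 183) is the gated-GELU feed-forward variant from "GLU Variants Improve Transformer" (Shazeer, 2020): a single linear projection produces both a value and a gate, and the gate passes through GELU. A self-contained sketch:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GEGLU(nn.Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.proj = nn.Linear(dim_in, dim_out * 2)   # value and gate in one matmul

    def forward(self, x):
        x, gate = self.proj(x).chunk(2, dim=-1)
        return x * F.gelu(gate)                      # gated activation

ff = nn.Sequential(GEGLU(512, 2048), nn.Dropout(0.1), nn.Linear(2048, 512))
print(ff(torch.randn(2, 77, 512)).shape)             # torch.Size([2, 77, 512])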

FILE: i2v/lvdm/utils/callbacks.py
  class ImageLogger (line 18) | class ImageLogger(Callback):
    method __init__ (line 19) | def __init__(self, batch_frequency, max_images, clamp=True, increase_l...
    method _testtube (line 51) | def _testtube(self, pl_module, images, batch_idx, split):
    method log_local (line 75) | def log_local(self, save_dir, split, images,
    method log_img (line 187) | def log_img(self, pl_module, batch, batch_idx, split="train", rank=0):
    method check_frequency (line 234) | def check_frequency(self, check_idx, split="train"):
    method on_train_batch_end (line 247) | def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch...
    method on_validation_batch_end (line 252) | def on_validation_batch_end(self, trainer, pl_module, outputs, batch, ...
    method on_validation_epoch_end (line 263) | def on_validation_epoch_end(self, trainer, pl_module):
  class CUDACallback (line 284) | class CUDACallback(Callback):
    method on_train_epoch_start (line 285) | def on_train_epoch_start(self, trainer, pl_module):
    method on_train_epoch_end (line 296) | def on_train_epoch_end(self, trainer, pl_module):
  class SetupCallback_low (line 316) | class SetupCallback_low(Callback):
    method __init__ (line 317) | def __init__(self, resume, now, logdir, ckptdir, cfgdir, config, light...
    method on_keyboard_interrupt (line 329) | def on_keyboard_interrupt(self, trainer, pl_module):
    method on_pretrain_routine_start (line 336) | def on_pretrain_routine_start(self, trainer, pl_module):
  class SetupCallback_high (line 364) | class SetupCallback_high(Callback):
    method __init__ (line 365) | def __init__(self, resume, now, logdir, ckptdir, cfgdir, config, light...
    method on_keyboard_interrupt (line 377) | def on_keyboard_interrupt(self, trainer, pl_module):
    method on_fit_start (line 384) | def on_fit_start(self, trainer, pl_module):

FILE: i2v/lvdm/utils/common_utils.py
  function shape_to_str (line 9) | def shape_to_str(x):
  function str2bool (line 14) | def str2bool(v):
  function get_obj_from_str (line 25) | def get_obj_from_str(string, reload=False):
  function instantiate_from_config (line 33) | def instantiate_from_config(config):
  function shift_dim (line 44) | def shift_dim(x, src_dim=-1, dest_dim=-1, make_contiguous=True):
  function torch_to_np (line 73) | def torch_to_np(x):
  function np_to_torch_video (line 84) | def np_to_torch_video(x):
  function load_npz_from_dir (line 90) | def load_npz_from_dir(data_dir):
  function load_npz_from_paths (line 96) | def load_npz_from_paths(data_paths):
  function ismap (line 102) | def ismap(x):
  function isimage (line 108) | def isimage(x):
  function exists (line 114) | def exists(x):
  function default (line 118) | def default(val, d):
  function mean_flat (line 124) | def mean_flat(tensor):
  function count_params (line 132) | def count_params(model, verbose=False):
  function check_istarget (line 139) | def check_istarget(name, para_list):
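
Note: get_obj_from_str / instantiate_from_config is the configuration idiom used across the latent-diffusion family; it is what lets the YAML files in i2v/configs/ pair a "target" import path with a "params" dict. A sketch of the standard implementation (this file very likely matches it, but treat the details as an assumption):

import importlib

def get_obj_from_str(string, reload=False):
    # "pkg.module.ClassName" -> the ClassName object itself
    module_name, cls_name = string.rsplit(".", 1)
    module = importlib.import_module(module_name)
    if reload:
        importlib.reload(module)
    return getattr(module, cls_name)

def instantiate_from_config(config):
    # config: {"target": "pkg.module.Class", "params": {...}}
    if "target" not in config:
        raise KeyError("Expected key `target` to instantiate.")
    return get_obj_from_str(config["target"])(**config.get("params", dict()))

# Usage mirroring the configs in this repo:
# model = instantiate_from_config(
#     {"target": "lvdm.models.ddpm3d.LatentDiffusion", "params": {...}})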

FILE: i2v/lvdm/utils/saving_utils.py
  function tensor_to_mp4 (line 20) | def tensor_to_mp4(video, savepath, fps, rescale=True, nrow=None):
  function savenp2sheet (line 38) | def savenp2sheet(imgs, savepath, nrow=None):
  function save_np_to_img (line 71) | def save_np_to_img(img, path, norm=True):
  function npz_to_imgsheet_5d (line 79) | def npz_to_imgsheet_5d(data_path, res_dir, nrow=None,):
  function npz_to_imgsheet_4d (line 96) | def npz_to_imgsheet_4d(data_path, res_path, nrow=None,):
  function tensor_to_imgsheet (line 108) | def tensor_to_imgsheet(tensor, save_path):
  function npz_to_frames (line 120) | def npz_to_frames(data_path, res_dir, norm, num_frames=None, num_samples...
  function npz_to_gifs (line 144) | def npz_to_gifs(data_path, res_dir, duration=0.2, start_idx=0, num_video...
  function fill_with_black_squares (line 166) | def fill_with_black_squares(video, desired_len: int) -> Tensor:
  function load_num_videos (line 176) | def load_num_videos(data_path, num_videos):
  function npz_to_video_grid (line 190) | def npz_to_video_grid(data_path, out_path, num_frames=None, fps=8, num_v...
  function npz_to_gif_grid (line 228) | def npz_to_gif_grid(data_path, out_path, n_cols=None, num_videos=20):
  function torch_to_video_grid (line 249) | def torch_to_video_grid(videos, out_path, num_frames, fps, num_videos=No...
  function log_txt_as_img (line 274) | def log_txt_as_img(wh, xc, size=10):
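
Note: tensor_to_mp4 presumably takes a (b, c, t, h, w) video tensor in [-1, 1], tiles the batch into a per-frame grid, and writes an mp4. The sketch below shows that pattern with torchvision; the shapes, the rescale convention, and the writer are assumptions based on the VideoCrafter lineage, not verified against this file:

import torch
from einops import rearrange
from torchvision.utils import make_grid
from torchvision.io import write_video

def video_tensor_to_mp4(video, savepath, fps=8, rescale=True, nrow=None):
    # video: (b, c, t, h, w); values in [-1, 1] if rescale else [0, 1]
    video = video.detach().cpu()
    if rescale:
        video = (video + 1.0) / 2.0
    video = video.clamp(0, 1)
    frames = rearrange(video, "b c t h w -> t b c h w")
    nrow = nrow or max(1, int(video.shape[0] ** 0.5))
    grid = torch.stack([make_grid(f, nrow=nrow) for f in frames])  # (t, c, H, W)
    grid = (grid * 255).to(torch.uint8)
    write_video(savepath, rearrange(grid, "t c h w -> t h w c"), fps=fps)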

FILE: i2v/predict.py
  class Predictor (line 26) | class Predictor(BasePredictor):
    method setup (line 27) | def setup(self) -> None:
    method predict (line 41) | def predict(

FILE: i2v/scripts/evaluation/ddp_wrapper.py
  function setup_dist (line 8) | def setup_dist(local_rank):
  function get_dist_info (line 15) | def get_dist_info():
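
Note: setup_dist / get_dist_info is the standard NCCL bootstrap for multi-GPU evaluation. A sketch assuming torchrun-style environment variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE); the real file may differ in details:

import torch
import torch.distributed as dist

def setup_dist(local_rank):
    torch.cuda.set_device(local_rank)
    # Default init_method="env://" reads the rendezvous info from the env.
    dist.init_process_group(backend="nccl")

def get_dist_info():
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank(), dist.get_world_size()
    return 0, 1  # single-process fallback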

FILE: i2v/scripts/evaluation/funcs.py
  function batch_ddim_sampling (line 13) | def batch_ddim_sampling(model, cond, noise_shape, n_samples=1, ddim_step...
  function get_filelist (line 85) | def get_filelist(data_dir, ext='*'):
  function get_dirlist (line 90) | def get_dirlist(path):
  function load_model_checkpoint (line 102) | def load_model_checkpoint(model, ckpt):
  function load_prompts (line 121) | def load_prompts(prompt_file):
  function load_video_batch (line 132) | def load_video_batch(filepath_list, frame_stride, video_size=(256,256), ...
  function load_image_batch (line 171) | def load_image_batch(filepath_list, image_size=(256,256)):
  function save_videos (line 195) | def save_videos(batch_tensors, savedir, filenames, fps=10):
  function save_videos_gif (line 209) | def save_videos_gif(batch_tensors, savepath, fps=10):
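
Note: load_prompts reads files like prompts/test_prompts.txt, one prompt per line; several prompt files in this repo carry Windows \r\n endings, so stripping matters. A likely-equivalent sketch (an assumption, not the file's exact code):

def load_prompts(prompt_file):
    # One prompt per line; strip \r\n and skip blank lines.
    with open(prompt_file, "r") as f:
        return [line.strip() for line in f if line.strip()]

# prompts = load_prompts("prompts/test_prompts.txt")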

FILE: i2v/scripts/evaluation/inference.py
  function get_parser (line 19) | def get_parser():
  function run_inference (line 44) | def run_inference(args, gpu_num, gpu_no, **kwargs):

FILE: i2v/scripts/evaluation/inference_util.py
  function get_parser (line 19) | def get_parser():
  function run_inference (line 43) | def run_inference(
  function ensure_model (line 107) | def ensure_model(args):

FILE: i2v/scripts/gradio/i2v_test.py
  class Image2Video (line 9) | class Image2Video():
    method __init__ (line 10) | def __init__(self,result_dir='./tmp/',gpu_num=1) -> None:
    method get_image (line 31) | def get_image(self, image, prompt, steps=50, cfg_scale=12.0, eta=1.0, ...
    method download_model (line 70) | def download_model(self):

FILE: i2v/scripts/gradio/t2v_test.py
  class Text2Video (line 9) | class Text2Video():
    method __init__ (line 10) | def __init__(self,result_dir='./tmp/',gpu_num=1) -> None:
    method get_prompt (line 31) | def get_prompt(self, prompt, steps=50, cfg_scale=12.0, eta=1.0, fps=16):
    method download_model (line 62) | def download_model(self):

FILE: i2v/train.py
  function get_parser (line 33) | def get_parser(**parser_kwargs):
  function nondefault_trainer_args (line 54) | def nondefault_trainer_args(opt):
  class WrappedDataset (line 61) | class WrappedDataset(Dataset):
    method __init__ (line 64) | def __init__(self, dataset):
    method __len__ (line 67) | def __len__(self):
    method __getitem__ (line 70) | def __getitem__(self, idx):
  class DataModuleFromConfig (line 74) | class DataModuleFromConfig(pl.LightningDataModule):
    method __init__ (line 75) | def __init__(self, batch_size, train=None, validation=None, test=None,...
    method prepare_data (line 101) | def prepare_data(self):
    method setup (line 104) | def setup(self, stage=None):
    method _train_dataloader (line 112) | def _train_dataloader(self):
    method _val_dataloader (line 119) | def _val_dataloader(self, shuffle=False):
    method _test_dataloader (line 132) | def _test_dataloader(self, shuffle=False):
    method _predict_dataloader (line 142) | def _predict_dataloader(self, shuffle=False):
  function melk (line 484) | def melk(*args, **kwargs):
  function divein (line 491) | def divein(*args, **kwargs):
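
Note: WrappedDataset is the small shim that lets DataModuleFromConfig hand any indexable object to a torch DataLoader. The signatures above match the canonical latent-diffusion version, sketched here:

from torch.utils.data import Dataset

class WrappedDataset(Dataset):
    """Wraps an arbitrary indexable object so DataLoader can consume it."""
    def __init__(self, dataset):
        self.data = dataset

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]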

FILE: i2v/utils/common_utils.py
  function shape_to_str (line 9) | def shape_to_str(x):
  function str2bool (line 14) | def str2bool(v):
  function get_obj_from_str (line 25) | def get_obj_from_str(string, reload=False):
  function instantiate_from_config (line 33) | def instantiate_from_config(config):
  function shift_dim (line 44) | def shift_dim(x, src_dim=-1, dest_dim=-1, make_contiguous=True):
  function torch_to_np (line 73) | def torch_to_np(x):
  function np_to_torch_video (line 84) | def np_to_torch_video(x):
  function load_npz_from_dir (line 90) | def load_npz_from_dir(data_dir):
  function load_npz_from_paths (line 96) | def load_npz_from_paths(data_paths):
  function ismap (line 102) | def ismap(x):
  function isimage (line 108) | def isimage(x):
  function exists (line 114) | def exists(x):
  function default (line 118) | def default(val, d):
  function mean_flat (line 124) | def mean_flat(tensor):
  function count_params (line 132) | def count_params(model, verbose=False):
  function check_istarget (line 139) | def check_istarget(name, para_list):

FILE: i2v/utils/log.py
  function release_logger_files (line 18) | def release_logger_files():
  function set_experiment_logger (line 30) | def set_experiment_logger(path_out, file_name=FILE_LOGS, reset=True):
  function set_ptl_logger (line 53) | def set_ptl_logger(path_out, phase, file_name="ptl.log", reset=True):

FILE: i2v/utils/utils.py
  function count_params (line 8) | def count_params(model, verbose=False):
  function check_istarget (line 15) | def check_istarget(name, para_list):
  function instantiate_from_config (line 27) | def instantiate_from_config(config):
  function get_obj_from_str (line 37) | def get_obj_from_str(string, reload=False):
  function load_npz_from_dir (line 45) | def load_npz_from_dir(data_dir):
  function load_npz_from_paths (line 51) | def load_npz_from_paths(data_paths):
  function resize_numpy_image (line 57) | def resize_numpy_image(image, max_resolution=512 * 512, resize_short_edg...
  function setup_dist (line 70) | def setup_dist(args):
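
Note: resize_numpy_image caps an input image at a pixel budget (the signature shows max_resolution=512 * 512 plus a short-edge option) before inference. A plausible cv2-based sketch; the snap-to-multiple-of-64 step is an assumption borrowed from similar adapter codebases:

import cv2

def resize_numpy_image(image, max_resolution=512 * 512, resize_short_edge=None):
    h, w = image.shape[:2]
    if resize_short_edge is not None:
        k = resize_short_edge / min(h, w)         # pin the short edge
    else:
        k = (max_resolution / (h * w)) ** 0.5     # cap total pixel count
    h = max(64, int(round(h * k / 64)) * 64)      # snap to multiples of 64,
    w = max(64, int(round(w * k / 64)) * 64)      # a UNet-friendly choice
    return cv2.resize(image, (w, h), interpolation=cv2.INTER_LANCZOS4)
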
Condensed preview: 60 files, each showing path, character count, and a content snippet.
[
  {
    "path": "LICENSE",
    "chars": 11357,
    "preview": "                                 Apache License\n                           Version 2.0, January 2004\n                   "
  },
  {
    "path": "README.md",
    "chars": 6270,
    "preview": "# God Mode Animation: 2D Game Animation Generation Model\n\n## 🎉 Updates @May 2025:\n\n**We've launched [God Mode AI 2.0](ht"
  },
  {
    "path": "i2v/cog.yaml",
    "chars": 1205,
    "preview": "# Configuration for Cog ⚙️\n# Reference: https://github.com/replicate/cog/blob/main/docs/yaml.md\n\nbuild:\n  gpu: true\n  sy"
  },
  {
    "path": "i2v/configs/inference_i2v_512_v1.0.yaml",
    "chars": 2037,
    "preview": "model:\n  target: lvdm.models.ddpm3d.LatentVisualDiffusion\n  params:\n    linear_start: 0.00085\n    linear_end: 0.012\n    "
  },
  {
    "path": "i2v/configs/inference_t2v_1024_v1.0.yaml",
    "chars": 1852,
    "preview": "model:\n  target: lvdm.models.ddpm3d.LatentDiffusion\n  params:\n    linear_start: 0.00085\n    linear_end: 0.012\n    num_ti"
  },
  {
    "path": "i2v/configs/inference_t2v_512_v1.0.yaml",
    "chars": 1784,
    "preview": "model:\n  target: lvdm.models.ddpm3d.LatentDiffusion\n  params:\n    linear_start: 0.00085\n    linear_end: 0.012\n    num_ti"
  },
  {
    "path": "i2v/configs/inference_t2v_512_v2.0.yaml",
    "chars": 1844,
    "preview": "model:\n  target: lvdm.models.ddpm3d.LatentDiffusion\n  params:\n    linear_start: 0.00085\n    linear_end: 0.012\n    num_ti"
  },
  {
    "path": "i2v/configs/train_t2v.yaml",
    "chars": 3514,
    "preview": "model:\n  base_learning_rate: 6.0e-05 # 1.5e-04\n  scale_lr: False\n  target: lvdm.models.ddpm3d.LatentDiffusion\n  params:\n"
  },
  {
    "path": "i2v/gradio_app.py",
    "chars": 2874,
    "preview": "import os\nimport sys\nimport gradio as gr\nfrom scripts.gradio.t2v_test import Text2Video\nsys.path.insert(1, os.path.join("
  },
  {
    "path": "i2v/lvdm/basics.py",
    "chars": 2849,
    "preview": "# adopted from\n# https://github.com/openai/improved-diffusion/blob/main/improved_diffusion/gaussian_diffusion.py\n# and\n#"
  },
  {
    "path": "i2v/lvdm/common.py",
    "chars": 2800,
    "preview": "import math\nfrom inspect import isfunction\nimport torch\nfrom torch import nn\nimport torch.distributed as dist\n\n\ndef gath"
  },
  {
    "path": "i2v/lvdm/data/frame_dataset.py",
    "chars": 13588,
    "preview": "import os\nimport random\nimport re\nfrom PIL import ImageFile\nfrom PIL import Image\n\nimport torch\nimport torch.utils.data "
  },
  {
    "path": "i2v/lvdm/data/taichi.py",
    "chars": 2915,
    "preview": "import os\nimport random\nimport torch\nfrom torch.utils.data import Dataset\nfrom decord import VideoReader, cpu\nimport glo"
  },
  {
    "path": "i2v/lvdm/distributions.py",
    "chars": 3043,
    "preview": "import torch\nimport numpy as np\n\n\nclass AbstractDistribution:\n    def sample(self):\n        raise NotImplementedError()\n"
  },
  {
    "path": "i2v/lvdm/ema.py",
    "chars": 2982,
    "preview": "import torch\nfrom torch import nn\n\n\nclass LitEma(nn.Module):\n    def __init__(self, model, decay=0.9999, use_num_upates="
  },
  {
    "path": "i2v/lvdm/models/autoencoder.py",
    "chars": 8474,
    "preview": "import os\nfrom contextlib import contextmanager\nimport torch\nimport numpy as np\nfrom einops import rearrange\nimport torc"
  },
  {
    "path": "i2v/lvdm/models/ddpm3d.py",
    "chars": 49740,
    "preview": "\"\"\"\nwild mixture of\nhttps://github.com/openai/improved-diffusion/blob/e94489283bb876ac1477d5dd7709bbbd2d9902ce/improved_"
  },
  {
    "path": "i2v/lvdm/models/samplers/ddim.py",
    "chars": 17138,
    "preview": "import numpy as np\nfrom tqdm import tqdm\nimport torch\nfrom lvdm.models.utils_diffusion import make_ddim_sampling_paramet"
  },
  {
    "path": "i2v/lvdm/models/utils_diffusion.py",
    "chars": 4604,
    "preview": "import math\nimport numpy as np\nfrom einops import repeat\nimport torch\nimport torch.nn.functional as F\n\n\ndef timestep_emb"
  },
  {
    "path": "i2v/lvdm/modules/attention.py",
    "chars": 19456,
    "preview": "from functools import partial\nimport torch\nfrom torch import nn, einsum\nimport torch.nn.functional as F\nfrom einops impo"
  },
  {
    "path": "i2v/lvdm/modules/encoders/condition.py",
    "chars": 14630,
    "preview": "import torch\nimport torch.nn as nn\nfrom torch.utils.checkpoint import checkpoint\nimport kornia\nimport open_clip\nfrom tra"
  },
  {
    "path": "i2v/lvdm/modules/encoders/ip_resampler.py",
    "chars": 4429,
    "preview": "# modified from https://github.com/mlfoundations/open_flamingo/blob/main/open_flamingo/src/helpers.py\nimport math\nimport"
  },
  {
    "path": "i2v/lvdm/modules/networks/ae_modules.py",
    "chars": 34246,
    "preview": "# pytorch_diffusion + derived encoder decoder\nimport math\nimport torch\nimport numpy as np\nimport torch.nn as nn\nfrom ein"
  },
  {
    "path": "i2v/lvdm/modules/networks/openaimodel3d.py",
    "chars": 23973,
    "preview": "from functools import partial\nfrom abc import abstractmethod\nimport torch\nimport torch.nn as nn\nfrom einops import rearr"
  },
  {
    "path": "i2v/lvdm/modules/x_transformer.py",
    "chars": 20159,
    "preview": "\"\"\"shout-out to https://github.com/lucidrains/x-transformers/tree/main/x_transformers\"\"\"\nfrom functools import partial\nf"
  },
  {
    "path": "i2v/lvdm/utils/callbacks.py",
    "chars": 18035,
    "preview": "import logging\nimport time, os\nimport numpy as np\nfrom PIL import Image\nfrom einops import rearrange\nfrom omegaconf impo"
  },
  {
    "path": "i2v/lvdm/utils/common_utils.py",
    "chars": 3796,
    "preview": "import os\nimport importlib\nimport numpy as np\nfrom inspect import isfunction\n\nimport torch\n\n\ndef shape_to_str(x):\n    sh"
  },
  {
    "path": "i2v/lvdm/utils/saving_utils.py",
    "chars": 11549,
    "preview": "import os\nimport sys\nsys.path.insert(1, os.path.join(sys.path[0], '..'))\nimport cv2\nimport os\nimport time\nimport imageio"
  },
  {
    "path": "i2v/predict.py",
    "chars": 5746,
    "preview": "# Prediction interface for Cog ⚙️\n# https://github.com/replicate/cog/blob/main/docs/python.md\n\n\nimport os\nimport sys\nimp"
  },
  {
    "path": "i2v/prompts/.ipynb_checkpoints/gpt4_extended_prompts_gen_original_vc2_noact_050624-checkpoint.txt",
    "chars": 23576,
    "preview": "Game character animation of a stealthy rogue with black hood, dual daggers, and dark leather armor, side view facing rig"
  },
  {
    "path": "i2v/prompts/.ipynb_checkpoints/gpt4_random_anthropomorphic_game_chars_050624-checkpoint.txt",
    "chars": 16726,
    "preview": "Game character animation of an anthropomorphic fox mage casting a purple fire spell, side view facing right, solid color"
  },
  {
    "path": "i2v/prompts/.ipynb_checkpoints/gpt4_random_anthropomorphic_game_chars_b2_050624-checkpoint.txt",
    "chars": 17339,
    "preview": "Game character animation of an anthropomorphic fox archer in emerald green robes, pulling an arrow from a quiver, glowin"
  },
  {
    "path": "i2v/prompts/bg_prompts.txt",
    "chars": 38480,
    "preview": "Game background scene, fixed camera, repeatable loopable animation, dynamic objects, Urban cityscape with alleyway, graf"
  },
  {
    "path": "i2v/prompts/gpt4_extended_prompts_gen_original_vc2_050624.txt",
    "chars": 24573,
    "preview": "Game character idle breathing animation of a stealthy rogue with black hood, dual daggers, and dark leather armor, side "
  },
  {
    "path": "i2v/prompts/gpt4_extended_prompts_gen_original_vc2_noact_050624.txt",
    "chars": 23576,
    "preview": "Game character animation of a stealthy rogue with black hood, dual daggers, and dark leather armor, side view facing rig"
  },
  {
    "path": "i2v/prompts/gpt4_extended_prompts_path_031324.json",
    "chars": 24173,
    "preview": "[\"Game character running animation of a stealthy rogue with black hood, dual daggers, and dark leather armor, side view "
  },
  {
    "path": "i2v/prompts/gpt4_extended_prompts_path_031324.txt",
    "chars": 23873,
    "preview": "Game character running animation of a stealthy rogue with black hood, dual daggers, and dark leather armor, side view ru"
  },
  {
    "path": "i2v/prompts/gpt4_extended_prompts_path_jump_050624.txt",
    "chars": 23873,
    "preview": "Game character jumping animation of a stealthy rogue with black hood, dual daggers, and dark leather armor, side view ru"
  },
  {
    "path": "i2v/prompts/gpt4_extended_prompts_path_runjump_050624.txt",
    "chars": 47746,
    "preview": "Game character running animation of a stealthy rogue with black hood, dual daggers, and dark leather armor, side view ru"
  },
  {
    "path": "i2v/prompts/gpt4_random_anthropomorphic_game_chars_050624.txt",
    "chars": 16726,
    "preview": "Game character animation of an anthropomorphic fox mage casting a purple fire spell, side view facing right, solid color"
  },
  {
    "path": "i2v/prompts/gpt4_random_anthropomorphic_game_chars_b2_050624.txt",
    "chars": 17339,
    "preview": "Game character animation of an anthropomorphic fox archer in emerald green robes, pulling an arrow from a quiver, glowin"
  },
  {
    "path": "i2v/prompts/i2v_prompts/test_prompts.txt",
    "chars": 83,
    "preview": "horses are walking on the grassland\r\na boy and a girl are talking on the seashore\r\n"
  },
  {
    "path": "i2v/prompts/test_prompts.txt",
    "chars": 239,
    "preview": "Game character running animation of Stone-skinned muscular monster with glowing red accents, running animation, pixel ar"
  },
  {
    "path": "i2v/requirements.txt",
    "chars": 338,
    "preview": "decord==0.6.0\neinops==0.3.0\nimageio==2.9.0\nnumpy #==1.24.2\nomegaconf==2.1.1\nopencv_python\npandas==2.0.0\nPillow==9.5.0\npy"
  },
  {
    "path": "i2v/scripts/evaluation/ddp_wrapper.py",
    "chars": 1481,
    "preview": "import datetime\r\nimport argparse, importlib\r\nfrom pytorch_lightning import seed_everything\r\n\r\nimport torch\r\nimport torch"
  },
  {
    "path": "i2v/scripts/evaluation/funcs.py",
    "chars": 9978,
    "preview": "import os, sys, glob\r\nimport numpy as np\r\nfrom collections import OrderedDict\r\nfrom decord import VideoReader, cpu\r\nimpo"
  },
  {
    "path": "i2v/scripts/evaluation/inference.py",
    "chars": 7713,
    "preview": "import argparse, os, sys, glob, yaml, math, random\r\nimport datetime, time\r\nimport numpy as np\r\nfrom omegaconf import Ome"
  },
  {
    "path": "i2v/scripts/evaluation/inference_util.py",
    "chars": 5700,
    "preview": "import argparse, os, sys, glob, yaml, math, random\r\nimport datetime, time\r\nimport numpy as np\r\nfrom omegaconf import Ome"
  },
  {
    "path": "i2v/scripts/evaluation/test_seg.py",
    "chars": 331,
    "preview": "import sys\n\nsys.path.append('/root/godmodeai/god-mode-ai-server/src/bg_tasks/')\n\nfrom segment_animation import segment_g"
  },
  {
    "path": "i2v/scripts/gradio/i2v_test.py",
    "chars": 3703,
    "preview": "import os\nimport time\nfrom omegaconf import OmegaConf\nimport torch\nfrom scripts.evaluation.funcs import load_model_check"
  },
  {
    "path": "i2v/scripts/gradio/t2v_test.py",
    "chars": 3291,
    "preview": "import os\nimport time\nfrom omegaconf import OmegaConf\nimport torch\nfrom scripts.evaluation.funcs import load_model_check"
  },
  {
    "path": "i2v/scripts/run_image2video.sh",
    "chars": 541,
    "preview": "name=\"i2v_512_test\"\n\nckpt='checkpoints/i2v_512_v1/model.ckpt'\nconfig='configs/inference_i2v_512_v1.0.yaml'\n\nprompt_file="
  },
  {
    "path": "i2v/scripts/run_text2video.sh",
    "chars": 450,
    "preview": "name=\"base_512_v2\"\n\nckpt='/root/vc2/model.ckpt'\nconfig='configs/inference_t2v_512_v2.0.yaml'\n\nprompt_file=\"prompts/test_"
  },
  {
    "path": "i2v/train.py",
    "chars": 23568,
    "preview": "import os, sys, datetime, glob\nimport logging\nimport argparse\nfrom functools import partial\nfrom packaging import versio"
  },
  {
    "path": "i2v/train_t2v_run_jump.sh",
    "chars": 1692,
    "preview": "\nPROJ_ROOT=\"/root/animation_training/src/video_crafter\"                      # root directory for saving experiment logs"
  },
  {
    "path": "i2v/train_t2v_spinkick.sh",
    "chars": 702,
    "preview": "\nPROJ_ROOT=\"/root/animation_training/src/video_crafter\"                      # root directory for saving experiment logs"
  },
  {
    "path": "i2v/train_t2v_swordslash.sh",
    "chars": 704,
    "preview": "\nPROJ_ROOT=\"/root/animation_training/src/video_crafter\"                      # root directory for saving experiment logs"
  },
  {
    "path": "i2v/utils/common_utils.py",
    "chars": 3796,
    "preview": "import os\nimport importlib\nimport numpy as np\nfrom inspect import isfunction\n\nimport torch\n\n\ndef shape_to_str(x):\n    sh"
  },
  {
    "path": "i2v/utils/log.py",
    "chars": 2623,
    "preview": "import os\nimport logging\nimport multiprocessing as mproc\nimport types\n\n#: number of available CPUs on this computer\nCPU_"
  },
  {
    "path": "i2v/utils/utils.py",
    "chars": 2171,
    "preview": "import importlib\nimport numpy as np\nimport cv2\nimport torch\nimport torch.distributed as dist\n\n\ndef count_params(model, v"
  }
]

About this extraction

This page contains the full source code of the lyogavin/godmodeanimation GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 60 files (648.4 KB), approximately 148.6k tokens, and a symbol index with 497 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract, a free GitHub-repo-to-text converter for AI. Built by Nikandr Surkov.
