Repository: arthur-qiu/LongerCrafter
Branch: main
Commit: 404b68b1fea5
Files: 42
Total size: 332.1 KB
Directory structure:
LongerCrafter/
├── LICENSE
├── README.md
├── cog.yaml
├── configs/
│   ├── inference_t2v_1024_v1.0.yaml
│   ├── inference_t2v_1024_v1.0_freenoise.yaml
│   ├── inference_t2v_tconv256_v1.0.yaml
│   ├── inference_t2v_tconv256_v1.0_freenoise.yaml
│   ├── inference_t2v_tconv512_v2.0.yaml
│   └── inference_t2v_tconv512_v2.0_freenoise.yaml
├── lvdm/
│   ├── basics.py
│   ├── common.py
│   ├── distributions.py
│   ├── ema.py
│   ├── models/
│   │   ├── autoencoder.py
│   │   ├── ddpm3d.py
│   │   ├── samplers/
│   │   │   ├── ddim.py
│   │   │   └── ddim_mp.py
│   │   └── utils_diffusion.py
│   └── modules/
│       ├── attention.py
│       ├── attention_freenoise.py
│       ├── encoders/
│       │   ├── condition.py
│       │   └── ip_resampler.py
│       ├── networks/
│       │   ├── ae_modules.py
│       │   ├── openaimodel3d.py
│       │   └── openaimodel3d_freenoise.py
│       └── x_transformer.py
├── predict.py
├── prompts/
│   ├── mp_prompts.txt
│   └── single_prompts.txt
├── requirements.txt
├── scripts/
│   ├── evaluation/
│   │   ├── ddp_wrapper.py
│   │   ├── funcs.py
│   │   ├── inference.py
│   │   ├── inference_freenoise.py
│   │   └── inference_freenoise_mp.py
│   ├── run_text2video.sh
│   ├── run_text2video_freenoise_1024.sh
│   ├── run_text2video_freenoise_256.sh
│   ├── run_text2video_freenoise_512.sh
│   ├── run_text2video_freenoise_mp_256.sh
│   └── run_text2video_freenoise_mp_512.sh
└── utils/
    └── utils.py
================================================
FILE CONTENTS
================================================
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: README.md
================================================
## ___***FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling***___
### 🔥🔥🔥 FreeNoise for longer high-quality video generation is now released!
<div align="center">
<p style="font-weight: bold">
✅ totally <span style="color: red; font-weight: bold">no</span> tuning
✅ less than <span style="color: red; font-weight: bold">20%</span> extra time
✅ support <span style="color: red; font-weight: bold">512</span> frames
</p>
<a href='https://arxiv.org/abs/2310.15169'><img src='https://img.shields.io/badge/arXiv-2310.15169-b31b1b.svg'></a>
<a href='http://haonanqiu.com/projects/FreeNoise.html'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
[Hugging Face Demo](https://huggingface.co/spaces/MoonQiu/FreeNoise)
[Replicate Demo](https://replicate.com/cjwbw/longercrafter)
_**[Haonan Qiu](http://haonanqiu.com/), [Menghan Xia*](https://menghanxia.github.io), [Yong Zhang](https://yzhang2016.github.io), [Yingqing He](https://github.com/YingqingHe),
<br>
[Xintao Wang](https://xinntao.github.io), [Ying Shan](https://scholar.google.com/citations?hl=zh-CN&user=4oXBp9UAAAAJ), and [Ziwei Liu*](https://liuziwei7.github.io/)**_
<br><br>
(* corresponding author)
From Tencent AI Lab and Nanyang Technological University.
<img src=assets/t2v/hd01.gif>
<p>Input: "A chihuahua in astronaut suit floating in space, cinematic lighting, glow effect";
<br>
Resolution: 1024 x 576; Frames: 64.</p>
<img src=assets/t2v/hd02.gif>
<p>Input: "Campfire at night in a snowy forest with starry sky in the background";
<br>
Resolution: 1024 x 576; Frames: 64.</p>
</div>
## 🔆 Introduction
🤗🤗🤗 LongerCrafter (FreeNoise) is a tuning-free and time-efficient paradigm for longer video generation based on pretrained video diffusion models.
### 1. Longer Single-Prompt Text-to-video Generation
<div align="center">
<img src=assets/t2v/sp512.gif>
<p>Longer single-prompt results. Resolution: 256 x 256; Frames: 512. (Compressed)</p>
</div>
### 2. Longer Multi-Prompt Text-to-video Generation
<div align="center">
<img src=assets/t2v/mp256.gif>
<p>Longer multi-prompt results. Resolution: 256 x 256; Frames: 256. (Compressed)</p>
</div>
## 📝 Changelog
- __[2024.01.28]__: 🔥🔥 Support FreeNoise on VideoCrafter2!
- __[2024.01.23]__: 🔥🔥 Support FreeNoise on other two video frameworks AnimateDiff and LaVie!
- __[2023.10.25]__: 🔥🔥 Release the 256x256 model and support multi-prompt generation!
- __[2023.10.24]__: 🔥🔥 Release the LongerCrafter (FreeNoise), longer video generation!
<br>
## 🧰 Models
|Model|Resolution|Checkpoint|Description|
|:---------|:---------|:--------|:--------|
|VideoCrafter (Text2Video)|576x1024|[Hugging Face](https://huggingface.co/VideoCrafter/Text2Video-1024-v1.0/blob/main/model.ckpt)|Support 64 frames on NVIDIA A100 (40GB)
|VideoCrafter (Text2Video)|256x256|[Hugging Face](https://huggingface.co/VideoCrafter)|Support 512 frames on NVIDIA A100 (40GB)
|VideoCrafter2 (Text2Video)|320x512|[Hugging Face](https://huggingface.co/VideoCrafter/VideoCrafter2/blob/main/model.ckpt)|Support 128 frames on NVIDIA A100 (40GB)
(Reduce the number of frames if you have a smaller GPU, e.g., 64 frames at 256x256 resolution.)
## ⚙️ Setup
### Install Environment via Anaconda (Recommended)
```bash
conda create -n freenoise python=3.8.5
conda activate freenoise
pip install -r requirements.txt
```
## 💫 Inference
### 1. Longer Text-to-Video
<!-- 1) Download pretrained T2V models via [Hugging Face](https://huggingface.co/VideoCrafter/Text2Video-512-v1/blob/main/model.ckpt), and put the `model.ckpt` in `checkpoints/base_512_v1/model.ckpt`.
2) Input the following commands in the terminal.
```bash
sh scripts/run_text2video_freenoise_512.sh
``` -->
1) Download pretrained T2V models via [Hugging Face](https://huggingface.co/VideoCrafter/Text2Video-1024-v1.0/blob/main/model.ckpt), and put the `model.ckpt` in `checkpoints/base_1024_v1/model.ckpt`.
2) Run the following command in the terminal.
```bash
sh scripts/run_text2video_freenoise_1024.sh
```
### 2. Longer Multi-Prompt Text-to-Video
1) Download pretrained T2V models via [Hugging Face](https://huggingface.co/VideoCrafter), and put the `model.ckpt` in `checkpoints/base_256_v1/model.ckpt`.
2) Run the following command in the terminal.
```bash
sh scripts/run_text2video_freenoise_mp_256.sh
```
## 🧲 Support For Other Models
FreeNoise is expected to work with other, similar frameworks. An easy way to test compatibility is to shuffle the noise and check whether a new but similar video is generated (set eta to 0). If you have any questions about applying FreeNoise to other frameworks, feel free to contact [Haonan Qiu](http://haonanqiu.com/).
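The compatibility test described above hinges on FreeNoise's noise rescheduling: repeat the initial noise frames and locally shuffle their order, so every temporal window sees the same noise statistics while long-range content stays correlated. A minimal NumPy sketch of that idea (the function name and shapes here are illustrative; the repo's actual implementation lives in the `*_freenoise` modules):

```python
import numpy as np

def reschedule_noise(base_noise, total_frames, window=16, seed=0):
    """Extend a short noise clip to `total_frames` frames by re-using the
    last `window` frames in a shuffled order, block by block.
    base_noise: array of shape (frames, channels, h, w)."""
    rng = np.random.default_rng(seed)
    frames = [base_noise[i] for i in range(base_noise.shape[0])]
    while len(frames) < total_frames:
        # permute the most recent window of frames and append it
        idx = rng.permutation(window)
        start = len(frames) - window
        frames.extend(frames[start + i] for i in idx)
    return np.stack(frames[:total_frames])
```

Because every appended frame is a re-used earlier frame, each window of the extended clip is a permutation of valid per-frame noise rather than fresh noise, which is what keeps a pretrained model's per-window statistics intact.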
Current official implementation: [FreeNoise-VideoCrafter](https://github.com/AILab-CVC/FreeNoise), [FreeNoise-AnimateDiff](https://github.com/arthur-qiu/FreeNoise-AnimateDiff), [FreeNoise-LaVie](https://github.com/arthur-qiu/FreeNoise-LaVie)
## 🚀 My Free Series
[FreeScale](https://github.com/ali-vilab/FreeScale): Tuning-free method for high-resolution image/video generation.
[FreeTraj](https://github.com/arthur-qiu/FreeTraj): Tuning-free method for trajectory control.
## 👫 Crafter Family
[VideoCrafter](https://github.com/AILab-CVC/VideoCrafter): Framework for high-quality video generation.
[ScaleCrafter](https://github.com/YingqingHe/ScaleCrafter): Tuning-free method for high-resolution image/video generation.
[TaleCrafter](https://github.com/AILab-CVC/TaleCrafter): An interactive story visualization tool that supports multiple characters.
## 😉 Citation
```bib
@misc{qiu2023freenoise,
title={FreeNoise: Tuning-Free Longer Video Diffusion Via Noise Rescheduling},
author={Haonan Qiu and Menghan Xia and Yong Zhang and Yingqing He and Xintao Wang and Ying Shan and Ziwei Liu},
year={2023},
eprint={2310.15169},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```
## 📢 Disclaimer
We developed this repository for RESEARCH purposes, so it may only be used for personal, research, or other non-commercial purposes.
================================================
FILE: cog.yaml
================================================
# Configuration for Cog ⚙️
# Reference: https://github.com/replicate/cog/blob/main/docs/yaml.md
build:
  gpu: true
  system_packages:
    - "libgl1-mesa-glx"
    - "libglib2.0-0"
  python_version: "3.11"
  python_packages:
    - "decord==0.6.0"
    - "einops==0.3.0"
    - "imageio==2.9.0"
    - "numpy==1.24.2"
    - "omegaconf==2.1.1"
    - "opencv_python==4.8.1.78"
    - "pandas==2.0.0"
    - "Pillow==9.5.0"
    - "pytorch_lightning==1.8.3"
    - "PyYAML==6.0"
    - "setuptools==65.6.3"
    - "torch==2.0.1"
    - "torchvision==0.15.2"
    - "tqdm==4.65.0"
    - "transformers==4.25.1"
    - "moviepy==1.0.3"
    - "av==10.0.0"
    - "xformers==0.0.22"
    - "timm==0.9.8"
    - "scikit-learn==1.3.2"
    - "open_clip_torch==2.23.0"
    - "kornia==0.7.0"
predict: "predict.py:Predictor"
================================================
FILE: configs/inference_t2v_1024_v1.0.yaml
================================================
model:
  target: lvdm.models.ddpm3d.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    timesteps: 1000
    first_stage_key: video
    cond_stage_key: caption
    cond_stage_trainable: false
    conditioning_key: crossattn
    image_size:
    - 72
    - 128
    channels: 4
    scale_by_std: false
    scale_factor: 0.18215
    use_ema: false
    uncond_type: empty_seq
    use_scale: true
    fix_scale_bug: true
    unet_config:
      target: lvdm.modules.networks.openaimodel3d.UNetModel
      params:
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions:
        - 4
        - 2
        - 1
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 4
        - 4
        num_head_channels: 64
        transformer_depth: 1
        context_dim: 1024
        use_linear: true
        use_checkpoint: true
        temporal_conv: false
        temporal_attention: true
        temporal_selfatt_only: true
        use_relative_position: true
        use_causal_attention: false
        temporal_length: 16
        addition_attention: true
        fps_cond: true
    first_stage_config:
      target: lvdm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 512
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
      params:
        freeze: true
        layer: penultimate
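Each `target`/`params` pair in these configs is resolved at runtime by the `instantiate_from_config` helper in `utils/utils.py`. Since that helper's body is not shown in this extract, here is a minimal sketch of the usual pattern, exercised with a stdlib class instead of an `lvdm` module (the config dict below is a made-up example):

```python
import importlib

def get_obj_from_str(string):
    # "pkg.module.ClassName" -> the ClassName object itself
    module, cls = string.rsplit(".", 1)
    return getattr(importlib.import_module(module), cls)

def instantiate_from_config(config):
    # config is a mapping like {"target": "...", "params": {...}}
    return get_obj_from_str(config["target"])(**config.get("params", {}))

# hypothetical demo config using a standard-library target
cfg = {"target": "fractions.Fraction",
       "params": {"numerator": 3, "denominator": 4}}
obj = instantiate_from_config(cfg)
```

The same mechanism is what lets a one-line change of `unet_config.target` (e.g. `openaimodel3d` vs. `openaimodel3d_freenoise`) swap the whole UNet implementation without touching code.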
================================================
FILE: configs/inference_t2v_1024_v1.0_freenoise.yaml
================================================
model:
  target: lvdm.models.ddpm3d.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    timesteps: 1000
    first_stage_key: video
    cond_stage_key: caption
    cond_stage_trainable: false
    conditioning_key: crossattn
    image_size:
    - 72
    - 128
    channels: 4
    scale_by_std: false
    scale_factor: 0.18215
    use_ema: false
    uncond_type: empty_seq
    use_scale: true
    fix_scale_bug: true
    unet_config:
      target: lvdm.modules.networks.openaimodel3d_freenoise.UNetModel
      params:
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions:
        - 4
        - 2
        - 1
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 4
        - 4
        num_head_channels: 64
        transformer_depth: 1
        context_dim: 1024
        use_linear: true
        use_checkpoint: true
        temporal_conv: false
        temporal_attention: true
        temporal_selfatt_only: true
        use_relative_position: true
        use_causal_attention: false
        temporal_length: 16
        addition_attention: true
        fps_cond: true
    first_stage_config:
      target: lvdm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 512
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
      params:
        freeze: true
        layer: penultimate
================================================
FILE: configs/inference_t2v_tconv256_v1.0.yaml
================================================
model:
  target: lvdm.models.ddpm3d.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    timesteps: 1000
    first_stage_key: video
    cond_stage_key: caption
    cond_stage_trainable: false
    conditioning_key: crossattn
    image_size:
    - 32
    - 32
    channels: 4
    scale_by_std: false
    scale_factor: 0.18215
    use_ema: false
    uncond_type: empty_seq
    use_scale: false
    fix_scale_bug: true
    unet_config:
      target: lvdm.modules.networks.openaimodel3d_freenoise.UNetModel
      params:
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions:
        - 4
        - 2
        - 1
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 4
        - 4
        num_head_channels: 64
        transformer_depth: 1
        context_dim: 1024
        use_linear: true
        use_checkpoint: true
        temporal_conv: true
        temporal_attention: true
        temporal_selfatt_only: true
        use_relative_position: false
        use_causal_attention: false
        temporal_length: 16
        addition_attention: true
        fps_cond: true
    first_stage_config:
      target: lvdm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 512
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
      params:
        freeze: true
        layer: penultimate
================================================
FILE: configs/inference_t2v_tconv256_v1.0_freenoise.yaml
================================================
model:
  target: lvdm.models.ddpm3d.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    timesteps: 1000
    first_stage_key: video
    cond_stage_key: caption
    cond_stage_trainable: false
    conditioning_key: crossattn
    image_size:
    - 32
    - 32
    channels: 4
    scale_by_std: false
    scale_factor: 0.18215
    use_ema: false
    uncond_type: empty_seq
    use_scale: false
    fix_scale_bug: true
    unet_config:
      target: lvdm.modules.networks.openaimodel3d_freenoise.UNetModel
      params:
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions:
        - 4
        - 2
        - 1
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 4
        - 4
        num_head_channels: 64
        transformer_depth: 1
        context_dim: 1024
        use_linear: true
        use_checkpoint: true
        temporal_conv: true
        temporal_attention: true
        temporal_selfatt_only: true
        use_relative_position: false
        use_causal_attention: false
        temporal_length: 16
        addition_attention: true
        fps_cond: true
    first_stage_config:
      target: lvdm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 512
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
      params:
        freeze: true
        layer: penultimate
================================================
FILE: configs/inference_t2v_tconv512_v2.0.yaml
================================================
model:
  target: lvdm.models.ddpm3d.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    timesteps: 1000
    first_stage_key: video
    cond_stage_key: caption
    cond_stage_trainable: false
    conditioning_key: crossattn
    image_size:
    - 40
    - 64
    channels: 4
    scale_by_std: false
    scale_factor: 0.18215
    use_ema: false
    uncond_type: empty_seq
    use_scale: true
    scale_b: 0.7
    unet_config:
      target: lvdm.modules.networks.openaimodel3d.UNetModel
      params:
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions:
        - 4
        - 2
        - 1
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 4
        - 4
        num_head_channels: 64
        transformer_depth: 1
        context_dim: 1024
        use_linear: true
        use_checkpoint: true
        temporal_conv: true
        temporal_attention: true
        temporal_selfatt_only: true
        use_relative_position: false
        use_causal_attention: false
        temporal_length: 16
        addition_attention: true
        fps_cond: true
    first_stage_config:
      target: lvdm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 512
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
      params:
        freeze: true
        layer: penultimate
================================================
FILE: configs/inference_t2v_tconv512_v2.0_freenoise.yaml
================================================
model:
  target: lvdm.models.ddpm3d.LatentDiffusion
  params:
    linear_start: 0.00085
    linear_end: 0.012
    num_timesteps_cond: 1
    timesteps: 1000
    first_stage_key: video
    cond_stage_key: caption
    cond_stage_trainable: false
    conditioning_key: crossattn
    image_size:
    - 40
    - 64
    channels: 4
    scale_by_std: false
    scale_factor: 0.18215
    use_ema: false
    uncond_type: empty_seq
    use_scale: true
    scale_b: 0.7
    unet_config:
      target: lvdm.modules.networks.openaimodel3d_freenoise.UNetModel
      params:
        in_channels: 4
        out_channels: 4
        model_channels: 320
        attention_resolutions:
        - 4
        - 2
        - 1
        num_res_blocks: 2
        channel_mult:
        - 1
        - 2
        - 4
        - 4
        num_head_channels: 64
        transformer_depth: 1
        context_dim: 1024
        use_linear: true
        use_checkpoint: true
        temporal_conv: true
        temporal_attention: true
        temporal_selfatt_only: true
        use_relative_position: false
        use_causal_attention: false
        temporal_length: 16
        addition_attention: true
        fps_cond: true
    first_stage_config:
      target: lvdm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
        monitor: val/rec_loss
        ddconfig:
          double_z: true
          z_channels: 4
          resolution: 512
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 2
          - 4
          - 4
          num_res_blocks: 2
          attn_resolutions: []
          dropout: 0.0
        lossconfig:
          target: torch.nn.Identity
    cond_stage_config:
      target: lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder
      params:
        freeze: true
        layer: penultimate
================================================
FILE: lvdm/basics.py
================================================
# adopted from
# https://github.com/openai/improved-diffusion/blob/main/improved_diffusion/gaussian_diffusion.py
# and
# https://github.com/lucidrains/denoising-diffusion-pytorch/blob/7706bdfc6f527f58d33f84b7b522e61e6e3164b3/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py
# and
# https://github.com/openai/guided-diffusion/blob/0ba878e517b276c45d1195eb29f6f5f72659a05b/guided_diffusion/nn.py
#
# thanks!
import torch.nn as nn
from utils.utils import instantiate_from_config


def disabled_train(self, mode=True):
    """Overwrite model.train with this function to make sure train/eval mode
    does not change anymore."""
    return self


def zero_module(module):
    """
    Zero out the parameters of a module and return it.
    """
    for p in module.parameters():
        p.detach().zero_()
    return module


def scale_module(module, scale):
    """
    Scale the parameters of a module and return it.
    """
    for p in module.parameters():
        p.detach().mul_(scale)
    return module


def conv_nd(dims, *args, **kwargs):
    """
    Create a 1D, 2D, or 3D convolution module.
    """
    if dims == 1:
        return nn.Conv1d(*args, **kwargs)
    elif dims == 2:
        return nn.Conv2d(*args, **kwargs)
    elif dims == 3:
        return nn.Conv3d(*args, **kwargs)
    raise ValueError(f"unsupported dimensions: {dims}")


def linear(*args, **kwargs):
    """
    Create a linear module.
    """
    return nn.Linear(*args, **kwargs)


def avg_pool_nd(dims, *args, **kwargs):
    """
    Create a 1D, 2D, or 3D average pooling module.
    """
    if dims == 1:
        return nn.AvgPool1d(*args, **kwargs)
    elif dims == 2:
        return nn.AvgPool2d(*args, **kwargs)
    elif dims == 3:
        return nn.AvgPool3d(*args, **kwargs)
    raise ValueError(f"unsupported dimensions: {dims}")


def nonlinearity(type='silu'):
    if type == 'silu':
        return nn.SiLU()
    elif type == 'leaky_relu':
        return nn.LeakyReLU()


class GroupNormSpecific(nn.GroupNorm):
    def forward(self, x):
        return super().forward(x.float()).type(x.dtype)


def normalization(channels, num_groups=32):
    """
    Make a standard normalization layer.
    :param channels: number of input channels.
    :return: an nn.Module for normalization.
    """
    return GroupNormSpecific(num_groups, channels)


class HybridConditioner(nn.Module):
    def __init__(self, c_concat_config, c_crossattn_config):
        super().__init__()
        self.concat_conditioner = instantiate_from_config(c_concat_config)
        self.crossattn_conditioner = instantiate_from_config(c_crossattn_config)

    def forward(self, c_concat, c_crossattn):
        c_concat = self.concat_conditioner(c_concat)
        c_crossattn = self.crossattn_conditioner(c_crossattn)
        return {'c_concat': [c_concat], 'c_crossattn': [c_crossattn]}
================================================
FILE: lvdm/common.py
================================================
import math
from inspect import isfunction
import torch
from torch import nn
import torch.distributed as dist
def gather_data(data, return_np=True):
''' gather data from multiple processes to one list '''
data_list = [torch.zeros_like(data) for _ in range(dist.get_world_size())]
dist.all_gather(data_list, data) # gather not supported with NCCL
if return_np:
data_list = [data.cpu().numpy() for data in data_list]
return data_list
def autocast(f):
def do_autocast(*args, **kwargs):
with torch.cuda.amp.autocast(enabled=True,
dtype=torch.get_autocast_gpu_dtype(),
cache_enabled=torch.is_autocast_cache_enabled()):
return f(*args, **kwargs)
return do_autocast
def extract_into_tensor(a, t, x_shape):
b, *_ = t.shape
out = a.gather(-1, t)
return out.reshape(b, *((1,) * (len(x_shape) - 1)))
def noise_like(shape, device, repeat=False):
repeat_noise = lambda: torch.randn((1, *shape[1:]), device=device).repeat(shape[0], *((1,) * (len(shape) - 1)))
noise = lambda: torch.randn(shape, device=device)
return repeat_noise() if repeat else noise()
def default(val, d):
if exists(val):
return val
return d() if isfunction(d) else d
def exists(val):
return val is not None
def identity(*args, **kwargs):
return nn.Identity()
def uniq(arr):
return{el: True for el in arr}.keys()
def mean_flat(tensor):
"""
Take the mean over all non-batch dimensions.
"""
return tensor.mean(dim=list(range(1, len(tensor.shape))))
def ismap(x):
if not isinstance(x, torch.Tensor):
return False
return (len(x.shape) == 4) and (x.shape[1] > 3)
def isimage(x):
if not isinstance(x,torch.Tensor):
return False
return (len(x.shape) == 4) and (x.shape[1] == 3 or x.shape[1] == 1)
def max_neg_value(t):
return -torch.finfo(t.dtype).max
def shape_to_str(x):
shape_str = "x".join([str(x) for x in x.shape])
return shape_str
def init_(tensor):
dim = tensor.shape[-1]
std = 1 / math.sqrt(dim)
tensor.uniform_(-std, std)
return tensor
ckpt = torch.utils.checkpoint.checkpoint
def checkpoint(func, inputs, params, flag):
"""
Evaluate a function without caching intermediate activations, allowing for
reduced memory at the expense of extra compute in the backward pass.
:param func: the function to evaluate.
:param inputs: the argument sequence to pass to `func`.
:param params: a sequence of parameters `func` depends on but does not
explicitly take as arguments.
:param flag: if False, disable gradient checkpointing.
"""
if flag:
return ckpt(func, *inputs)
else:
return func(*inputs)
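These helpers are used throughout the diffusion code; `extract_into_tensor` in particular gathers one schedule value per batch element and reshapes it so it broadcasts over the remaining dimensions of `x`. A minimal self-contained sketch of that behavior (the shapes and schedule values below are illustrative, not taken from the repo's configs):

```python
import torch

def extract_into_tensor(a, t, x_shape):
    # gather a[t[i]] for each batch element i, then reshape to (b, 1, 1, ...)
    b, *_ = t.shape
    out = a.gather(-1, t)
    return out.reshape(b, *((1,) * (len(x_shape) - 1)))

schedule = torch.linspace(0.1, 1.0, 10)   # e.g. a sqrt_alphas_cumprod-like table
t = torch.tensor([0, 4, 9])               # one timestep index per batch element
x = torch.zeros(3, 4, 16, 32, 32)         # (b, c, t, h, w) latent video batch
coef = extract_into_tensor(schedule, t, x.shape)
print(coef.shape)                         # torch.Size([3, 1, 1, 1, 1])
```

The trailing singleton dimensions let expressions like `coef * x` broadcast the per-sample coefficient over channels, frames, and pixels.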
================================================
FILE: lvdm/distributions.py
================================================
import torch
import numpy as np
class AbstractDistribution:
def sample(self):
raise NotImplementedError()
def mode(self):
raise NotImplementedError()
class DiracDistribution(AbstractDistribution):
def __init__(self, value):
self.value = value
def sample(self):
return self.value
def mode(self):
return self.value
class DiagonalGaussianDistribution(object):
def __init__(self, parameters, deterministic=False):
self.parameters = parameters
self.mean, self.logvar = torch.chunk(parameters, 2, dim=1)
self.logvar = torch.clamp(self.logvar, -30.0, 20.0)
self.deterministic = deterministic
self.std = torch.exp(0.5 * self.logvar)
self.var = torch.exp(self.logvar)
if self.deterministic:
self.var = self.std = torch.zeros_like(self.mean).to(device=self.parameters.device)
def sample(self, noise=None):
if noise is None:
noise = torch.randn(self.mean.shape)
x = self.mean + self.std * noise.to(device=self.parameters.device)
return x
def kl(self, other=None):
if self.deterministic:
return torch.Tensor([0.])
else:
if other is None:
return 0.5 * torch.sum(torch.pow(self.mean, 2)
+ self.var - 1.0 - self.logvar,
dim=[1, 2, 3])
else:
return 0.5 * torch.sum(
torch.pow(self.mean - other.mean, 2) / other.var
+ self.var / other.var - 1.0 - self.logvar + other.logvar,
dim=[1, 2, 3])
def nll(self, sample, dims=[1,2,3]):
if self.deterministic:
return torch.Tensor([0.])
logtwopi = np.log(2.0 * np.pi)
return 0.5 * torch.sum(
logtwopi + self.logvar + torch.pow(sample - self.mean, 2) / self.var,
dim=dims)
def mode(self):
return self.mean
def normal_kl(mean1, logvar1, mean2, logvar2):
"""
source: https://github.com/openai/guided-diffusion/blob/27c20a8fab9cb472df5d6bdd6c8d11c8f430b924/guided_diffusion/losses.py#L12
Compute the KL divergence between two gaussians.
Shapes are automatically broadcasted, so batches can be compared to
scalars, among other use cases.
"""
tensor = None
for obj in (mean1, logvar1, mean2, logvar2):
if isinstance(obj, torch.Tensor):
tensor = obj
break
assert tensor is not None, "at least one argument must be a Tensor"
# Force variances to be Tensors. Broadcasting helps convert scalars to
# Tensors, but it does not work for torch.exp().
logvar1, logvar2 = [
x if isinstance(x, torch.Tensor) else torch.tensor(x).to(tensor)
for x in (logvar1, logvar2)
]
return 0.5 * (
-1.0
+ logvar2
- logvar1
+ torch.exp(logvar1 - logvar2)
+ ((mean1 - mean2) ** 2) * torch.exp(-logvar2)
)
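`DiagonalGaussianDistribution` splits the encoder output into mean and log-variance halves along the channel dimension, samples via the reparameterization trick, and computes a closed-form KL term. A small sketch of those operations (shapes are illustrative):

```python
import torch

mean = torch.zeros(2, 4, 8, 8)
logvar = torch.zeros(2, 4, 8, 8)             # logvar = 0  ->  std = 1
params = torch.cat([mean, logvar], dim=1)    # packed as the VAE head emits them

m, lv = torch.chunk(params, 2, dim=1)        # what __init__ does
std = torch.exp(0.5 * lv)
sample = m + std * torch.randn_like(m)       # reparameterized sample

# KL against a standard normal, per batch element (the other=None branch of .kl())
kl = 0.5 * torch.sum(m.pow(2) + lv.exp() - 1.0 - lv, dim=[1, 2, 3])
print(kl)                                    # zeros: N(0, I) against N(0, I)
```

With zero mean and unit variance the KL vanishes exactly, which is a handy sanity check on the formula.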
================================================
FILE: lvdm/ema.py
================================================
import torch
from torch import nn
class LitEma(nn.Module):
    def __init__(self, model, decay=0.9999, use_num_updates=True):
        super().__init__()
        if decay < 0.0 or decay > 1.0:
            raise ValueError('Decay must be between 0 and 1')
        self.m_name2s_name = {}
        self.register_buffer('decay', torch.tensor(decay, dtype=torch.float32))
        self.register_buffer('num_updates', torch.tensor(0, dtype=torch.int) if use_num_updates
                             else torch.tensor(-1, dtype=torch.int))
for name, p in model.named_parameters():
if p.requires_grad:
#remove as '.'-character is not allowed in buffers
s_name = name.replace('.','')
self.m_name2s_name.update({name:s_name})
self.register_buffer(s_name,p.clone().detach().data)
self.collected_params = []
def forward(self,model):
decay = self.decay
if self.num_updates >= 0:
self.num_updates += 1
decay = min(self.decay,(1 + self.num_updates) / (10 + self.num_updates))
one_minus_decay = 1.0 - decay
with torch.no_grad():
m_param = dict(model.named_parameters())
shadow_params = dict(self.named_buffers())
for key in m_param:
if m_param[key].requires_grad:
sname = self.m_name2s_name[key]
shadow_params[sname] = shadow_params[sname].type_as(m_param[key])
shadow_params[sname].sub_(one_minus_decay * (shadow_params[sname] - m_param[key]))
else:
                    assert key not in self.m_name2s_name
def copy_to(self, model):
m_param = dict(model.named_parameters())
shadow_params = dict(self.named_buffers())
for key in m_param:
if m_param[key].requires_grad:
m_param[key].data.copy_(shadow_params[self.m_name2s_name[key]].data)
else:
                assert key not in self.m_name2s_name
def store(self, parameters):
"""
Save the current parameters for restoring later.
Args:
parameters: Iterable of `torch.nn.Parameter`; the parameters to be
temporarily stored.
"""
self.collected_params = [param.clone() for param in parameters]
def restore(self, parameters):
"""
Restore the parameters stored with the `store` method.
Useful to validate the model with EMA parameters without affecting the
original optimization process. Store the parameters before the
`copy_to` method. After validation (or model saving), use this to
restore the former parameters.
Args:
parameters: Iterable of `torch.nn.Parameter`; the parameters to be
updated with the stored parameters.
"""
for c_param, param in zip(self.collected_params, parameters):
param.data.copy_(c_param.data)
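`LitEma`'s in-place update `shadow.sub_(one_minus_decay * (shadow - param))` is algebraically the classic exponential moving average `decay * shadow + (1 - decay) * param`, and the effective decay is warmed up as `(1 + n) / (10 + n)` for the first updates so early shadows track the model closely. A plain-Python sketch of both pieces (the concrete numbers are illustrative):

```python
decay = 0.9999

# warm-up: the effective decay grows toward the configured value with update count n
for n in [1, 10, 100]:
    eff = min(decay, (1 + n) / (10 + n))
    print(n, eff)

# the update itself: subtract form == convex-combination form
shadow, param, d = 1.0, 0.0, 0.9
updated = shadow - (1 - d) * (shadow - param)
assert abs(updated - (d * shadow + (1 - d) * param)) < 1e-12
```

The subtract form is what lets the code update the registered buffers in place with `sub_` instead of reallocating them each step.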
================================================
FILE: lvdm/models/autoencoder.py
================================================
import os
from contextlib import contextmanager
import torch
import numpy as np
from einops import rearrange
import torch.nn.functional as F
import pytorch_lightning as pl
from lvdm.modules.networks.ae_modules import Encoder, Decoder
from lvdm.distributions import DiagonalGaussianDistribution
from utils.utils import instantiate_from_config
class AutoencoderKL(pl.LightningModule):
def __init__(self,
ddconfig,
lossconfig,
embed_dim,
ckpt_path=None,
ignore_keys=[],
image_key="image",
colorize_nlabels=None,
monitor=None,
test=False,
logdir=None,
input_dim=4,
test_args=None,
):
super().__init__()
self.image_key = image_key
self.encoder = Encoder(**ddconfig)
self.decoder = Decoder(**ddconfig)
self.loss = instantiate_from_config(lossconfig)
assert ddconfig["double_z"]
self.quant_conv = torch.nn.Conv2d(2*ddconfig["z_channels"], 2*embed_dim, 1)
self.post_quant_conv = torch.nn.Conv2d(embed_dim, ddconfig["z_channels"], 1)
self.embed_dim = embed_dim
self.input_dim = input_dim
self.test = test
self.test_args = test_args
self.logdir = logdir
if colorize_nlabels is not None:
assert type(colorize_nlabels)==int
self.register_buffer("colorize", torch.randn(3, colorize_nlabels, 1, 1))
if monitor is not None:
self.monitor = monitor
if ckpt_path is not None:
self.init_from_ckpt(ckpt_path, ignore_keys=ignore_keys)
if self.test:
self.init_test()
def init_test(self,):
self.test = True
save_dir = os.path.join(self.logdir, "test")
if 'ckpt' in self.test_args:
ckpt_name = os.path.basename(self.test_args.ckpt).split('.ckpt')[0] + f'_epoch{self._cur_epoch}'
self.root = os.path.join(save_dir, ckpt_name)
else:
self.root = save_dir
if 'test_subdir' in self.test_args:
self.root = os.path.join(save_dir, self.test_args.test_subdir)
self.root_zs = os.path.join(self.root, "zs")
self.root_dec = os.path.join(self.root, "reconstructions")
self.root_inputs = os.path.join(self.root, "inputs")
os.makedirs(self.root, exist_ok=True)
if self.test_args.save_z:
os.makedirs(self.root_zs, exist_ok=True)
if self.test_args.save_reconstruction:
os.makedirs(self.root_dec, exist_ok=True)
if self.test_args.save_input:
os.makedirs(self.root_inputs, exist_ok=True)
        assert self.test_args is not None
self.test_maximum = getattr(self.test_args, 'test_maximum', None)
self.count = 0
self.eval_metrics = {}
self.decodes = []
self.save_decode_samples = 2048
def init_from_ckpt(self, path, ignore_keys=list()):
sd = torch.load(path, map_location="cpu")
try:
self._cur_epoch = sd['epoch']
sd = sd["state_dict"]
        except KeyError:
self._cur_epoch = 'null'
keys = list(sd.keys())
for k in keys:
for ik in ignore_keys:
if k.startswith(ik):
print("Deleting key {} from state_dict.".format(k))
del sd[k]
self.load_state_dict(sd, strict=False)
# self.load_state_dict(sd, strict=True)
print(f"Restored from {path}")
def encode(self, x, **kwargs):
h = self.encoder(x)
moments = self.quant_conv(h)
posterior = DiagonalGaussianDistribution(moments)
return posterior
def decode(self, z, **kwargs):
z = self.post_quant_conv(z)
dec = self.decoder(z)
return dec
def forward(self, input, sample_posterior=True):
posterior = self.encode(input)
if sample_posterior:
z = posterior.sample()
else:
z = posterior.mode()
dec = self.decode(z)
return dec, posterior
def get_input(self, batch, k):
x = batch[k]
if x.dim() == 5 and self.input_dim == 4:
b,c,t,h,w = x.shape
self.b = b
self.t = t
x = rearrange(x, 'b c t h w -> (b t) c h w')
return x
def training_step(self, batch, batch_idx, optimizer_idx):
inputs = self.get_input(batch, self.image_key)
reconstructions, posterior = self(inputs)
if optimizer_idx == 0:
# train encoder+decoder+logvar
aeloss, log_dict_ae = self.loss(inputs, reconstructions, posterior, optimizer_idx, self.global_step,
last_layer=self.get_last_layer(), split="train")
self.log("aeloss", aeloss, prog_bar=True, logger=True, on_step=True, on_epoch=True)
self.log_dict(log_dict_ae, prog_bar=False, logger=True, on_step=True, on_epoch=False)
return aeloss
if optimizer_idx == 1:
# train the discriminator
discloss, log_dict_disc = self.loss(inputs, reconstructions, posterior, optimizer_idx, self.global_step,
last_layer=self.get_last_layer(), split="train")
self.log("discloss", discloss, prog_bar=True, logger=True, on_step=True, on_epoch=True)
self.log_dict(log_dict_disc, prog_bar=False, logger=True, on_step=True, on_epoch=False)
return discloss
def validation_step(self, batch, batch_idx):
inputs = self.get_input(batch, self.image_key)
reconstructions, posterior = self(inputs)
aeloss, log_dict_ae = self.loss(inputs, reconstructions, posterior, 0, self.global_step,
last_layer=self.get_last_layer(), split="val")
discloss, log_dict_disc = self.loss(inputs, reconstructions, posterior, 1, self.global_step,
last_layer=self.get_last_layer(), split="val")
self.log("val/rec_loss", log_dict_ae["val/rec_loss"])
self.log_dict(log_dict_ae)
self.log_dict(log_dict_disc)
return self.log_dict
def configure_optimizers(self):
lr = self.learning_rate
opt_ae = torch.optim.Adam(list(self.encoder.parameters())+
list(self.decoder.parameters())+
list(self.quant_conv.parameters())+
list(self.post_quant_conv.parameters()),
lr=lr, betas=(0.5, 0.9))
opt_disc = torch.optim.Adam(self.loss.discriminator.parameters(),
lr=lr, betas=(0.5, 0.9))
return [opt_ae, opt_disc], []
def get_last_layer(self):
return self.decoder.conv_out.weight
@torch.no_grad()
def log_images(self, batch, only_inputs=False, **kwargs):
log = dict()
x = self.get_input(batch, self.image_key)
x = x.to(self.device)
if not only_inputs:
xrec, posterior = self(x)
if x.shape[1] > 3:
# colorize with random projection
assert xrec.shape[1] > 3
x = self.to_rgb(x)
xrec = self.to_rgb(xrec)
log["samples"] = self.decode(torch.randn_like(posterior.sample()))
log["reconstructions"] = xrec
log["inputs"] = x
return log
def to_rgb(self, x):
assert self.image_key == "segmentation"
if not hasattr(self, "colorize"):
self.register_buffer("colorize", torch.randn(3, x.shape[1], 1, 1).to(x))
x = F.conv2d(x, weight=self.colorize)
x = 2.*(x-x.min())/(x.max()-x.min()) - 1.
return x
class IdentityFirstStage(torch.nn.Module):
    def __init__(self, *args, vq_interface=False, **kwargs):
        super().__init__()
        self.vq_interface = vq_interface  # TODO: Should be true by default but check to not break older stuff
def encode(self, x, *args, **kwargs):
return x
def decode(self, x, *args, **kwargs):
return x
def quantize(self, x, *args, **kwargs):
if self.vq_interface:
return x, None, [None, None, None]
return x
def forward(self, x, *args, **kwargs):
return x
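`AutoencoderKL` is a 2D, per-frame model, so 5-D video batches are folded into the batch dimension before encoding (`get_input`'s `rearrange(x, 'b c t h w -> (b t) c h w')`) and unfolded afterwards. A pure-torch sketch of what that einops pattern computes (shapes are illustrative):

```python
import torch

b, c, t, h, w = 2, 3, 5, 8, 8
x = torch.randn(b, c, t, h, w)               # a batch of short videos

# rearrange(x, 'b c t h w -> (b t) c h w'): frames become independent samples
frames = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
print(frames.shape)                          # torch.Size([10, 3, 8, 8])

# rearrange(frames, '(b t) c h w -> b c t h w', b=b, t=t): undo the folding
back = frames.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
assert torch.equal(back, x)
```

The round trip is lossless, which is why the model can treat every frame as an ordinary image sample and still reassemble the video afterwards.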
================================================
FILE: lvdm/models/ddpm3d.py
================================================
"""
wild mixture of
https://github.com/openai/improved-diffusion/blob/e94489283bb876ac1477d5dd7709bbbd2d9902ce/improved_diffusion/gaussian_diffusion.py
https://github.com/lucidrains/denoising-diffusion-pytorch/blob/7706bdfc6f527f58d33f84b7b522e61e6e3164b3/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py
https://github.com/CompVis/taming-transformers
-- merci
"""
from functools import partial
from contextlib import contextmanager
import numpy as np
from tqdm import tqdm
from einops import rearrange, repeat
import logging
mainlogger = logging.getLogger('mainlogger')
import torch
import torch.nn as nn
from torchvision.utils import make_grid
import pytorch_lightning as pl
from utils.utils import instantiate_from_config
from lvdm.ema import LitEma
from lvdm.distributions import DiagonalGaussianDistribution
from lvdm.models.utils_diffusion import make_beta_schedule
from lvdm.modules.encoders.ip_resampler import ImageProjModel, Resampler
from lvdm.basics import disabled_train
from lvdm.common import (
extract_into_tensor,
noise_like,
exists,
default
)
__conditioning_keys__ = {'concat': 'c_concat',
'crossattn': 'c_crossattn',
'adm': 'y'}
class DDPM(pl.LightningModule):
# classic DDPM with Gaussian diffusion, in image space
def __init__(self,
unet_config,
timesteps=1000,
beta_schedule="linear",
loss_type="l2",
ckpt_path=None,
ignore_keys=[],
load_only_unet=False,
monitor=None,
use_ema=True,
first_stage_key="image",
image_size=256,
channels=3,
log_every_t=100,
clip_denoised=True,
linear_start=1e-4,
linear_end=2e-2,
cosine_s=8e-3,
given_betas=None,
original_elbo_weight=0.,
v_posterior=0., # weight for choosing posterior variance as sigma = (1-v) * beta_tilde + v * beta
l_simple_weight=1.,
conditioning_key=None,
parameterization="eps", # all assuming fixed variance schedules
scheduler_config=None,
use_positional_encodings=False,
learn_logvar=False,
logvar_init=0.
):
super().__init__()
assert parameterization in ["eps", "x0"], 'currently only supporting "eps" and "x0"'
self.parameterization = parameterization
mainlogger.info(f"{self.__class__.__name__}: Running in {self.parameterization}-prediction mode")
self.cond_stage_model = None
self.clip_denoised = clip_denoised
self.log_every_t = log_every_t
self.first_stage_key = first_stage_key
self.channels = channels
self.temporal_length = unet_config.params.temporal_length
self.image_size = image_size
if isinstance(self.image_size, int):
self.image_size = [self.image_size, self.image_size]
self.use_positional_encodings = use_positional_encodings
self.model = DiffusionWrapper(unet_config, conditioning_key)
self.use_ema = use_ema
if self.use_ema:
self.model_ema = LitEma(self.model)
mainlogger.info(f"Keeping EMAs of {len(list(self.model_ema.buffers()))}.")
self.use_scheduler = scheduler_config is not None
if self.use_scheduler:
self.scheduler_config = scheduler_config
self.v_posterior = v_posterior
self.original_elbo_weight = original_elbo_weight
self.l_simple_weight = l_simple_weight
if monitor is not None:
self.monitor = monitor
if ckpt_path is not None:
self.init_from_ckpt(ckpt_path, ignore_keys=ignore_keys, only_model=load_only_unet)
self.register_schedule(given_betas=given_betas, beta_schedule=beta_schedule, timesteps=timesteps,
linear_start=linear_start, linear_end=linear_end, cosine_s=cosine_s)
self.loss_type = loss_type
self.learn_logvar = learn_logvar
self.logvar = torch.full(fill_value=logvar_init, size=(self.num_timesteps,))
if self.learn_logvar:
self.logvar = nn.Parameter(self.logvar, requires_grad=True)
def register_schedule(self, given_betas=None, beta_schedule="linear", timesteps=1000,
linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3):
if exists(given_betas):
betas = given_betas
else:
betas = make_beta_schedule(beta_schedule, timesteps, linear_start=linear_start, linear_end=linear_end,
cosine_s=cosine_s)
alphas = 1. - betas
alphas_cumprod = np.cumprod(alphas, axis=0)
alphas_cumprod_prev = np.append(1., alphas_cumprod[:-1])
timesteps, = betas.shape
self.num_timesteps = int(timesteps)
self.linear_start = linear_start
self.linear_end = linear_end
assert alphas_cumprod.shape[0] == self.num_timesteps, 'alphas have to be defined for each timestep'
to_torch = partial(torch.tensor, dtype=torch.float32)
self.register_buffer('betas', to_torch(betas))
self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod))
self.register_buffer('alphas_cumprod_prev', to_torch(alphas_cumprod_prev))
# calculations for diffusion q(x_t | x_{t-1}) and others
self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod)))
self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod)))
self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod)))
self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod)))
self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod - 1)))
# calculations for posterior q(x_{t-1} | x_t, x_0)
posterior_variance = (1 - self.v_posterior) * betas * (1. - alphas_cumprod_prev) / (
1. - alphas_cumprod) + self.v_posterior * betas
# above: equal to 1. / (1. / (1. - alpha_cumprod_tm1) + alpha_t / beta_t)
self.register_buffer('posterior_variance', to_torch(posterior_variance))
# below: log calculation clipped because the posterior variance is 0 at the beginning of the diffusion chain
self.register_buffer('posterior_log_variance_clipped', to_torch(np.log(np.maximum(posterior_variance, 1e-20))))
self.register_buffer('posterior_mean_coef1', to_torch(
betas * np.sqrt(alphas_cumprod_prev) / (1. - alphas_cumprod)))
self.register_buffer('posterior_mean_coef2', to_torch(
(1. - alphas_cumprod_prev) * np.sqrt(alphas) / (1. - alphas_cumprod)))
if self.parameterization == "eps":
lvlb_weights = self.betas ** 2 / (
2 * self.posterior_variance * to_torch(alphas) * (1 - self.alphas_cumprod))
elif self.parameterization == "x0":
lvlb_weights = 0.5 * np.sqrt(torch.Tensor(alphas_cumprod)) / (2. * 1 - torch.Tensor(alphas_cumprod))
else:
raise NotImplementedError("mu not supported")
# TODO how to choose this term
lvlb_weights[0] = lvlb_weights[1]
self.register_buffer('lvlb_weights', lvlb_weights, persistent=False)
assert not torch.isnan(self.lvlb_weights).all()
@contextmanager
def ema_scope(self, context=None):
if self.use_ema:
self.model_ema.store(self.model.parameters())
self.model_ema.copy_to(self.model)
if context is not None:
mainlogger.info(f"{context}: Switched to EMA weights")
try:
yield None
finally:
if self.use_ema:
self.model_ema.restore(self.model.parameters())
if context is not None:
mainlogger.info(f"{context}: Restored training weights")
def init_from_ckpt(self, path, ignore_keys=list(), only_model=False):
sd = torch.load(path, map_location="cpu")
if "state_dict" in list(sd.keys()):
sd = sd["state_dict"]
keys = list(sd.keys())
for k in keys:
for ik in ignore_keys:
if k.startswith(ik):
mainlogger.info("Deleting key {} from state_dict.".format(k))
del sd[k]
missing, unexpected = self.load_state_dict(sd, strict=False) if not only_model else self.model.load_state_dict(
sd, strict=False)
mainlogger.info(f"Restored from {path} with {len(missing)} missing and {len(unexpected)} unexpected keys")
if len(missing) > 0:
mainlogger.info(f"Missing Keys: {missing}")
if len(unexpected) > 0:
mainlogger.info(f"Unexpected Keys: {unexpected}")
def q_mean_variance(self, x_start, t):
"""
Get the distribution q(x_t | x_0).
:param x_start: the [N x C x ...] tensor of noiseless inputs.
:param t: the number of diffusion steps (minus 1). Here, 0 means one step.
:return: A tuple (mean, variance, log_variance), all of x_start's shape.
"""
mean = (extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start)
variance = extract_into_tensor(1.0 - self.alphas_cumprod, t, x_start.shape)
log_variance = extract_into_tensor(self.log_one_minus_alphas_cumprod, t, x_start.shape)
return mean, variance, log_variance
def predict_start_from_noise(self, x_t, t, noise):
return (
extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t -
extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape) * noise
)
def q_posterior(self, x_start, x_t, t):
posterior_mean = (
extract_into_tensor(self.posterior_mean_coef1, t, x_t.shape) * x_start +
extract_into_tensor(self.posterior_mean_coef2, t, x_t.shape) * x_t
)
posterior_variance = extract_into_tensor(self.posterior_variance, t, x_t.shape)
posterior_log_variance_clipped = extract_into_tensor(self.posterior_log_variance_clipped, t, x_t.shape)
return posterior_mean, posterior_variance, posterior_log_variance_clipped
def p_mean_variance(self, x, t, clip_denoised: bool):
model_out = self.model(x, t)
if self.parameterization == "eps":
x_recon = self.predict_start_from_noise(x, t=t, noise=model_out)
elif self.parameterization == "x0":
x_recon = model_out
if clip_denoised:
x_recon.clamp_(-1., 1.)
model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t)
return model_mean, posterior_variance, posterior_log_variance
@torch.no_grad()
def p_sample(self, x, t, clip_denoised=True, repeat_noise=False):
b, *_, device = *x.shape, x.device
model_mean, _, model_log_variance = self.p_mean_variance(x=x, t=t, clip_denoised=clip_denoised)
noise = noise_like(x.shape, device, repeat_noise)
# no noise when t == 0
nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))
return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise
@torch.no_grad()
def p_sample_loop(self, shape, return_intermediates=False):
device = self.betas.device
b = shape[0]
img = torch.randn(shape, device=device)
intermediates = [img]
for i in tqdm(reversed(range(0, self.num_timesteps)), desc='Sampling t', total=self.num_timesteps):
img = self.p_sample(img, torch.full((b,), i, device=device, dtype=torch.long),
clip_denoised=self.clip_denoised)
if i % self.log_every_t == 0 or i == self.num_timesteps - 1:
intermediates.append(img)
if return_intermediates:
return img, intermediates
return img
@torch.no_grad()
def sample(self, batch_size=16, return_intermediates=False):
image_size = self.image_size
channels = self.channels
return self.p_sample_loop((batch_size, channels, image_size, image_size),
return_intermediates=return_intermediates)
    def q_sample(self, x_start, t, noise=None):
        noise = default(noise, lambda: torch.randn_like(x_start))
        # base DDPM has no scale_arr buffer (it is registered only by LatentDiffusion
        # when use_scale is set), so use the standard forward-process sample here
        return (extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start +
                extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise)
def get_input(self, batch, k):
x = batch[k]
x = x.to(memory_format=torch.contiguous_format).float()
return x
def _get_rows_from_list(self, samples):
n_imgs_per_row = len(samples)
denoise_grid = rearrange(samples, 'n b c h w -> b n c h w')
denoise_grid = rearrange(denoise_grid, 'b n c h w -> (b n) c h w')
denoise_grid = make_grid(denoise_grid, nrow=n_imgs_per_row)
return denoise_grid
@torch.no_grad()
def log_images(self, batch, N=8, n_row=2, sample=True, return_keys=None, **kwargs):
log = dict()
x = self.get_input(batch, self.first_stage_key)
N = min(x.shape[0], N)
n_row = min(x.shape[0], n_row)
x = x.to(self.device)[:N]
log["inputs"] = x
# get diffusion row
diffusion_row = list()
x_start = x[:n_row]
for t in range(self.num_timesteps):
if t % self.log_every_t == 0 or t == self.num_timesteps - 1:
t = repeat(torch.tensor([t]), '1 -> b', b=n_row)
t = t.to(self.device).long()
noise = torch.randn_like(x_start)
x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)
diffusion_row.append(x_noisy)
log["diffusion_row"] = self._get_rows_from_list(diffusion_row)
if sample:
# get denoise row
with self.ema_scope("Plotting"):
samples, denoise_row = self.sample(batch_size=N, return_intermediates=True)
log["samples"] = samples
log["denoise_row"] = self._get_rows_from_list(denoise_row)
if return_keys:
if np.intersect1d(list(log.keys()), return_keys).shape[0] == 0:
return log
else:
return {key: log[key] for key in return_keys}
return log
class LatentDiffusion(DDPM):
"""main class"""
def __init__(self,
first_stage_config,
cond_stage_config,
num_timesteps_cond=None,
cond_stage_key="caption",
cond_stage_trainable=False,
cond_stage_forward=None,
conditioning_key=None,
uncond_prob=0.2,
uncond_type="empty_seq",
scale_factor=1.0,
scale_by_std=False,
encoder_type="2d",
only_model=False,
use_scale=False,
scale_a=1,
scale_b=0.3,
mid_step=400,
fix_scale_bug=False,
*args, **kwargs):
self.num_timesteps_cond = default(num_timesteps_cond, 1)
self.scale_by_std = scale_by_std
assert self.num_timesteps_cond <= kwargs['timesteps']
# for backwards compatibility after implementation of DiffusionWrapper
ckpt_path = kwargs.pop("ckpt_path", None)
ignore_keys = kwargs.pop("ignore_keys", [])
conditioning_key = default(conditioning_key, 'crossattn')
super().__init__(conditioning_key=conditioning_key, *args, **kwargs)
self.cond_stage_trainable = cond_stage_trainable
self.cond_stage_key = cond_stage_key
# scale factor
self.use_scale=use_scale
if self.use_scale:
self.scale_a=scale_a
self.scale_b=scale_b
if fix_scale_bug:
scale_step=self.num_timesteps-mid_step
else: #bug
scale_step = self.num_timesteps
scale_arr1 = np.linspace(scale_a, scale_b, mid_step)
scale_arr2 = np.full(scale_step, scale_b)
scale_arr = np.concatenate((scale_arr1, scale_arr2))
scale_arr_prev = np.append(scale_a, scale_arr[:-1])
to_torch = partial(torch.tensor, dtype=torch.float32)
self.register_buffer('scale_arr', to_torch(scale_arr))
try:
self.num_downs = len(first_stage_config.params.ddconfig.ch_mult) - 1
        except Exception:
self.num_downs = 0
if not scale_by_std:
self.scale_factor = scale_factor
else:
self.register_buffer('scale_factor', torch.tensor(scale_factor))
self.instantiate_first_stage(first_stage_config)
self.instantiate_cond_stage(cond_stage_config)
self.first_stage_config = first_stage_config
self.cond_stage_config = cond_stage_config
self.clip_denoised = False
self.cond_stage_forward = cond_stage_forward
self.encoder_type = encoder_type
assert(encoder_type in ["2d", "3d"])
self.uncond_prob = uncond_prob
        self.classifier_free_guidance = uncond_prob > 0
assert(uncond_type in ["zero_embed", "empty_seq"])
self.uncond_type = uncond_type
self.restarted_from_ckpt = False
if ckpt_path is not None:
self.init_from_ckpt(ckpt_path, ignore_keys, only_model=only_model)
self.restarted_from_ckpt = True
def make_cond_schedule(self, ):
self.cond_ids = torch.full(size=(self.num_timesteps,), fill_value=self.num_timesteps - 1, dtype=torch.long)
ids = torch.round(torch.linspace(0, self.num_timesteps - 1, self.num_timesteps_cond)).long()
self.cond_ids[:self.num_timesteps_cond] = ids
def q_sample(self, x_start, t, noise=None):
noise = default(noise, lambda: torch.randn_like(x_start))
if self.use_scale:
return (extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start *
extract_into_tensor(self.scale_arr, t, x_start.shape) +
extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise)
else:
return (extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start +
extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise)
def _freeze_model(self):
for name, para in self.model.diffusion_model.named_parameters():
para.requires_grad = False
def instantiate_first_stage(self, config):
model = instantiate_from_config(config)
self.first_stage_model = model.eval()
self.first_stage_model.train = disabled_train
for param in self.first_stage_model.parameters():
param.requires_grad = False
def instantiate_cond_stage(self, config):
if not self.cond_stage_trainable:
model = instantiate_from_config(config)
self.cond_stage_model = model.eval()
self.cond_stage_model.train = disabled_train
for param in self.cond_stage_model.parameters():
param.requires_grad = False
else:
model = instantiate_from_config(config)
self.cond_stage_model = model
def get_learned_conditioning(self, c):
if self.cond_stage_forward is None:
if hasattr(self.cond_stage_model, 'encode') and callable(self.cond_stage_model.encode):
c = self.cond_stage_model.encode(c)
if isinstance(c, DiagonalGaussianDistribution):
c = c.mode()
else:
c = self.cond_stage_model(c)
else:
assert hasattr(self.cond_stage_model, self.cond_stage_forward)
c = getattr(self.cond_stage_model, self.cond_stage_forward)(c)
return c
def get_first_stage_encoding(self, encoder_posterior, noise=None):
if isinstance(encoder_posterior, DiagonalGaussianDistribution):
z = encoder_posterior.sample(noise=noise)
elif isinstance(encoder_posterior, torch.Tensor):
z = encoder_posterior
else:
raise NotImplementedError(f"encoder_posterior of type '{type(encoder_posterior)}' not yet implemented")
return self.scale_factor * z
@torch.no_grad()
def encode_first_stage(self, x):
if self.encoder_type == "2d" and x.dim() == 5:
b, _, t, _, _ = x.shape
x = rearrange(x, 'b c t h w -> (b t) c h w')
reshape_back = True
else:
reshape_back = False
encoder_posterior = self.first_stage_model.encode(x)
results = self.get_first_stage_encoding(encoder_posterior).detach()
if reshape_back:
results = rearrange(results, '(b t) c h w -> b c t h w', b=b,t=t)
return results
@torch.no_grad()
def encode_first_stage_2DAE(self, x):
b, _, t, _, _ = x.shape
results = torch.cat([self.get_first_stage_encoding(self.first_stage_model.encode(x[:,:,i])).detach().unsqueeze(2) for i in range(t)], dim=2)
return results
def decode_core(self, z, **kwargs):
if self.encoder_type == "2d" and z.dim() == 5:
b, _, t, _, _ = z.shape
z = rearrange(z, 'b c t h w -> (b t) c h w')
reshape_back = True
else:
reshape_back = False
z = 1. / self.scale_factor * z
results = self.first_stage_model.decode(z, **kwargs)
if reshape_back:
results = rearrange(results, '(b t) c h w -> b c t h w', b=b,t=t)
return results
@torch.no_grad()
def decode_first_stage(self, z, **kwargs):
return self.decode_core(z, **kwargs)
def apply_model(self, x_noisy, t, cond, **kwargs):
if isinstance(cond, dict):
            # hybrid case, cond is expected to be a dict
pass
else:
if not isinstance(cond, list):
cond = [cond]
key = 'c_concat' if self.model.conditioning_key == 'concat' else 'c_crossattn'
cond = {key: cond}
x_recon = self.model(x_noisy, t, **cond, **kwargs)
if isinstance(x_recon, tuple):
return x_recon[0]
else:
return x_recon
def _get_denoise_row_from_list(self, samples, desc=''):
denoise_row = []
for zd in tqdm(samples, desc=desc):
denoise_row.append(self.decode_first_stage(zd.to(self.device)))
n_log_timesteps = len(denoise_row)
denoise_row = torch.stack(denoise_row) # n_log_timesteps, b, C, H, W
if denoise_row.dim() == 5:
# img, num_imgs= n_log_timesteps * bs, grid_size=[bs,n_log_timesteps]
denoise_grid = rearrange(denoise_row, 'n b c h w -> b n c h w')
denoise_grid = rearrange(denoise_grid, 'b n c h w -> (b n) c h w')
denoise_grid = make_grid(denoise_grid, nrow=n_log_timesteps)
elif denoise_row.dim() == 6:
# video, grid_size=[n_log_timesteps*bs, t]
video_length = denoise_row.shape[3]
denoise_grid = rearrange(denoise_row, 'n b c t h w -> b n c t h w')
denoise_grid = rearrange(denoise_grid, 'b n c t h w -> (b n) c t h w')
denoise_grid = rearrange(denoise_grid, 'n c t h w -> (n t) c h w')
denoise_grid = make_grid(denoise_grid, nrow=video_length)
else:
raise ValueError
return denoise_grid
@torch.no_grad()
def decode_first_stage_2DAE(self, z, **kwargs):
b, _, t, _, _ = z.shape
z = 1. / self.scale_factor * z
results = torch.cat([self.first_stage_model.decode(z[:,:,i], **kwargs).unsqueeze(2) for i in range(t)], dim=2)
return results
def p_mean_variance(self, x, c, t, clip_denoised: bool, return_x0=False, score_corrector=None, corrector_kwargs=None, **kwargs):
t_in = t
model_out = self.apply_model(x, t_in, c, **kwargs)
if score_corrector is not None:
assert self.parameterization == "eps"
model_out = score_corrector.modify_score(self, model_out, x, t, c, **corrector_kwargs)
if self.parameterization == "eps":
x_recon = self.predict_start_from_noise(x, t=t, noise=model_out)
elif self.parameterization == "x0":
x_recon = model_out
else:
raise NotImplementedError()
if clip_denoised:
x_recon.clamp_(-1., 1.)
model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t)
if return_x0:
return model_mean, posterior_variance, posterior_log_variance, x_recon
else:
return model_mean, posterior_variance, posterior_log_variance
@torch.no_grad()
def p_sample(self, x, c, t, clip_denoised=False, repeat_noise=False, return_x0=False, \
temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None, **kwargs):
b, *_, device = *x.shape, x.device
outputs = self.p_mean_variance(x=x, c=c, t=t, clip_denoised=clip_denoised, return_x0=return_x0, \
score_corrector=score_corrector, corrector_kwargs=corrector_kwargs, **kwargs)
if return_x0:
model_mean, _, model_log_variance, x0 = outputs
else:
model_mean, _, model_log_variance = outputs
noise = noise_like(x.shape, device, repeat_noise) * temperature
if noise_dropout > 0.:
noise = torch.nn.functional.dropout(noise, p=noise_dropout)
# no noise when t == 0
nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))
if return_x0:
return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise, x0
else:
return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise
@torch.no_grad()
def p_sample_loop(self, cond, shape, return_intermediates=False, x_T=None, verbose=True, callback=None, \
timesteps=None, mask=None, x0=None, img_callback=None, start_T=None, log_every_t=None, **kwargs):
if not log_every_t:
log_every_t = self.log_every_t
device = self.betas.device
b = shape[0]
# sample an initial noise
if x_T is None:
img = torch.randn(shape, device=device)
else:
img = x_T
intermediates = [img]
if timesteps is None:
timesteps = self.num_timesteps
if start_T is not None:
timesteps = min(timesteps, start_T)
iterator = tqdm(reversed(range(0, timesteps)), desc='Sampling t', total=timesteps) if verbose else reversed(range(0, timesteps))
if mask is not None:
assert x0 is not None
assert x0.shape[2:3] == mask.shape[2:3] # spatial size has to match
for i in iterator:
ts = torch.full((b,), i, device=device, dtype=torch.long)
if self.shorten_cond_schedule:
assert self.model.conditioning_key != 'hybrid'
tc = self.cond_ids[ts].to(cond.device)
cond = self.q_sample(x_start=cond, t=tc, noise=torch.randn_like(cond))
img = self.p_sample(img, cond, ts, clip_denoised=self.clip_denoised, **kwargs)
if mask is not None:
img_orig = self.q_sample(x0, ts)
img = img_orig * mask + (1. - mask) * img
if i % log_every_t == 0 or i == timesteps - 1:
intermediates.append(img)
if callback: callback(i)
if img_callback: img_callback(img, i)
if return_intermediates:
return img, intermediates
return img
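The `p_sample` / `p_sample_loop` pair above implements ancestral DDPM sampling: each step adds variance-scaled noise to the posterior mean, except at t == 0 where the noise term is masked out. A minimal scalar sketch of that update (illustrative only; the name `ddpm_step` is not part of this codebase):

```python
import math

def ddpm_step(model_mean, model_log_variance, t, noise):
    # One ancestral sampling update: add posterior-variance-scaled noise,
    # mirroring the nonzero_mask logic in p_sample above (no noise at t == 0).
    nonzero_mask = 0.0 if t == 0 else 1.0
    return model_mean + nonzero_mask * math.exp(0.5 * model_log_variance) * noise

# At t == 0 the noise term is masked out and the mean is returned unchanged.
x_final = ddpm_step(model_mean=0.3, model_log_variance=-2.0, t=0, noise=0.7)
# At t > 0 the noise is scaled by the posterior standard deviation exp(0.5 * logvar).
x_mid = ddpm_step(model_mean=0.3, model_log_variance=-2.0, t=10, noise=0.7)
```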
class LatentVisualDiffusion(LatentDiffusion):
def __init__(self, cond_img_config, finegrained=False, random_cond=False, *args, **kwargs):
super().__init__(*args, **kwargs)
self.random_cond = random_cond
self.instantiate_img_embedder(cond_img_config, freeze=True)
num_tokens = 16 if finegrained else 4
self.image_proj_model = self.init_projector(use_finegrained=finegrained, num_tokens=num_tokens, input_dim=1024,\
cross_attention_dim=1024, dim=1280)
def instantiate_img_embedder(self, config, freeze=True):
embedder = instantiate_from_config(config)
if freeze:
self.embedder = embedder.eval()
self.embedder.train = disabled_train
for param in self.embedder.parameters():
param.requires_grad = False
def init_projector(self, use_finegrained, num_tokens, input_dim, cross_attention_dim, dim):
if not use_finegrained:
image_proj_model = ImageProjModel(clip_extra_context_tokens=num_tokens, cross_attention_dim=cross_attention_dim,
clip_embeddings_dim=input_dim
)
else:
image_proj_model = Resampler(dim=input_dim, depth=4, dim_head=64, heads=12, num_queries=num_tokens,
embedding_dim=dim, output_dim=cross_attention_dim, ff_mult=4
)
return image_proj_model
## Never delete this func: it is used in log_images() and inference stage
def get_image_embeds(self, batch_imgs):
## img: b c h w
img_token = self.embedder(batch_imgs)
img_emb = self.image_proj_model(img_token)
return img_emb
class DiffusionWrapper(pl.LightningModule):
def __init__(self, diff_model_config, conditioning_key):
super().__init__()
self.diffusion_model = instantiate_from_config(diff_model_config)
self.conditioning_key = conditioning_key
def forward(self, x, t, c_concat: list = None, c_crossattn: list = None,
c_adm=None, s=None, mask=None, **kwargs):
if self.conditioning_key is None:
out = self.diffusion_model(x, t)
elif self.conditioning_key == 'concat':
xc = torch.cat([x] + c_concat, dim=1)
out = self.diffusion_model(xc, t, **kwargs)
elif self.conditioning_key == 'crossattn':
cc = torch.cat(c_crossattn, 1)
out = self.diffusion_model(x, t, context=cc, **kwargs)
elif self.conditioning_key == 'hybrid':
## hybrid case: input is [b,c,t,h,w]; concatenate conditioning along the channel dim
xc = torch.cat([x] + c_concat, dim=1)
cc = torch.cat(c_crossattn, 1)
out = self.diffusion_model(xc, t, context=cc)
elif self.conditioning_key == 'resblockcond':
cc = c_crossattn[0]
out = self.diffusion_model(x, t, context=cc)
elif self.conditioning_key == 'adm':
cc = c_crossattn[0]
out = self.diffusion_model(x, t, y=cc)
elif self.conditioning_key == 'hybrid-adm':
assert c_adm is not None
xc = torch.cat([x] + c_concat, dim=1)
cc = torch.cat(c_crossattn, 1)
out = self.diffusion_model(xc, t, context=cc, y=c_adm)
elif self.conditioning_key == 'hybrid-time':
assert s is not None
xc = torch.cat([x] + c_concat, dim=1)
cc = torch.cat(c_crossattn, 1)
out = self.diffusion_model(xc, t, context=cc, s=s)
elif self.conditioning_key == 'concat-time-mask':
# assert s is not None
# mainlogger.info('x & mask:',x.shape,c_concat[0].shape)
xc = torch.cat([x] + c_concat, dim=1)
out = self.diffusion_model(xc, t, context=None, s=s, mask=mask)
elif self.conditioning_key == 'concat-adm-mask':
# assert s is not None
# mainlogger.info('x & mask:',x.shape,c_concat[0].shape)
if c_concat is not None:
xc = torch.cat([x] + c_concat, dim=1)
else:
xc = x
out = self.diffusion_model(xc, t, context=None, y=s, mask=mask)
elif self.conditioning_key == 'hybrid-adm-mask':
cc = torch.cat(c_crossattn, 1)
if c_concat is not None:
xc = torch.cat([x] + c_concat, dim=1)
else:
xc = x
out = self.diffusion_model(xc, t, context=cc, y=s, mask=mask)
elif self.conditioning_key == 'hybrid-time-adm': # adm means y, e.g., class index
# assert s is not None
assert c_adm is not None
xc = torch.cat([x] + c_concat, dim=1)
cc = torch.cat(c_crossattn, 1)
out = self.diffusion_model(xc, t, context=cc, s=s, y=c_adm)
else:
raise NotImplementedError()
return out
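DiffusionWrapper.forward above routes conditioning by `conditioning_key`: 'concat' conditioning is joined with the input along the channel axis, 'crossattn' conditioning is passed separately as attention context, and 'hybrid' does both. A minimal sketch of that dispatch, with channels represented as flat lists for illustration (the name `route_conditioning` is hypothetical, not part of this codebase):

```python
def route_conditioning(conditioning_key, x, c_concat=None, c_crossattn=None):
    # 'concat': conditioning joins the input along the channel axis.
    # 'crossattn': conditioning is handed to the denoiser as attention context.
    # 'hybrid': both at once, as in the branch above.
    if conditioning_key == 'concat':
        return {'input': x + c_concat, 'context': None}
    elif conditioning_key == 'crossattn':
        return {'input': x, 'context': c_crossattn}
    elif conditioning_key == 'hybrid':
        return {'input': x + c_concat, 'context': c_crossattn}
    raise NotImplementedError(conditioning_key)

out = route_conditioning('hybrid', x=['x0'], c_concat=['mask'], c_crossattn=['text'])
```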
================================================
FILE: lvdm/models/samplers/ddim.py
================================================
import numpy as np
from tqdm import tqdm
import torch
from lvdm.models.utils_diffusion import make_ddim_sampling_parameters, make_ddim_timesteps
from lvdm.common import noise_like
class DDIMSampler(object):
def __init__(self, model, schedule="linear", **kwargs):
super().__init__()
self.model = model
self.ddpm_num_timesteps = model.num_timesteps
self.schedule = schedule
self.counter = 0
def register_buffer(self, name, attr):
if type(attr) == torch.Tensor:
if attr.device != torch.device("cuda"):
attr = attr.to(torch.device("cuda"))
setattr(self, name, attr)
def make_schedule(self, ddim_num_steps, ddim_discretize="uniform", ddim_eta=0., verbose=True):
self.ddim_timesteps = make_ddim_timesteps(ddim_discr_method=ddim_discretize, num_ddim_timesteps=ddim_num_steps,
num_ddpm_timesteps=self.ddpm_num_timesteps,verbose=verbose)
alphas_cumprod = self.model.alphas_cumprod
assert alphas_cumprod.shape[0] == self.ddpm_num_timesteps, 'alphas have to be defined for each timestep'
to_torch = lambda x: x.clone().detach().to(torch.float32).to(self.model.device)
self.register_buffer('betas', to_torch(self.model.betas))
self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod))
self.register_buffer('alphas_cumprod_prev', to_torch(self.model.alphas_cumprod_prev))
self.use_scale = self.model.use_scale
print('DDIM scale', self.use_scale)
if self.use_scale:
self.register_buffer('scale_arr', to_torch(self.model.scale_arr))
ddim_scale_arr = self.scale_arr.cpu()[self.ddim_timesteps]
self.register_buffer('ddim_scale_arr', ddim_scale_arr)
ddim_scale_arr = np.asarray([self.scale_arr.cpu()[0]] + self.scale_arr.cpu()[self.ddim_timesteps[:-1]].tolist())
self.register_buffer('ddim_scale_arr_prev', ddim_scale_arr)
# calculations for diffusion q(x_t | x_{t-1}) and others
self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod.cpu())))
self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod.cpu())))
self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod.cpu())))
self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu())))
self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu() - 1)))
# ddim sampling parameters
ddim_sigmas, ddim_alphas, ddim_alphas_prev = make_ddim_sampling_parameters(alphacums=alphas_cumprod.cpu(),
ddim_timesteps=self.ddim_timesteps,
eta=ddim_eta,verbose=verbose)
self.register_buffer('ddim_sigmas', ddim_sigmas)
self.register_buffer('ddim_alphas', ddim_alphas)
self.register_buffer('ddim_alphas_prev', ddim_alphas_prev)
self.register_buffer('ddim_sqrt_one_minus_alphas', np.sqrt(1. - ddim_alphas))
sigmas_for_original_sampling_steps = ddim_eta * torch.sqrt(
(1 - self.alphas_cumprod_prev) / (1 - self.alphas_cumprod) * (
1 - self.alphas_cumprod / self.alphas_cumprod_prev))
self.register_buffer('ddim_sigmas_for_original_num_steps', sigmas_for_original_sampling_steps)
@torch.no_grad()
def sample(self,
S,
batch_size,
shape,
conditioning=None,
callback=None,
normals_sequence=None,
img_callback=None,
quantize_x0=False,
eta=0.,
mask=None,
x0=None,
temperature=1.,
noise_dropout=0.,
score_corrector=None,
corrector_kwargs=None,
verbose=True,
schedule_verbose=False,
x_T=None,
log_every_t=100,
unconditional_guidance_scale=1.,
unconditional_conditioning=None,
# this has to come in the same format as the conditioning, # e.g. as encoded tokens, ...
**kwargs
):
# check condition bs
if conditioning is not None:
if isinstance(conditioning, dict):
try:
cbs = conditioning[list(conditioning.keys())[0]].shape[0]
except Exception:
cbs = conditioning[list(conditioning.keys())[0]][0].shape[0]
if cbs != batch_size:
print(f"Warning: Got {cbs} conditionings but batch-size is {batch_size}")
else:
if conditioning.shape[0] != batch_size:
print(f"Warning: Got {conditioning.shape[0]} conditionings but batch-size is {batch_size}")
self.make_schedule(ddim_num_steps=S, ddim_eta=eta, verbose=schedule_verbose)
# make shape
if len(shape) == 3:
C, H, W = shape
size = (batch_size, C, H, W)
elif len(shape) == 4:
C, T, H, W = shape
size = (batch_size, C, T, H, W)
else:
raise ValueError(f"Unsupported shape length: {len(shape)}")
# print(f'Data shape for DDIM sampling is {size}, eta {eta}')
samples, intermediates = self.ddim_sampling(conditioning, size,
callback=callback,
img_callback=img_callback,
quantize_denoised=quantize_x0,
mask=mask, x0=x0,
ddim_use_original_steps=False,
noise_dropout=noise_dropout,
temperature=temperature,
score_corrector=score_corrector,
corrector_kwargs=corrector_kwargs,
x_T=x_T,
log_every_t=log_every_t,
unconditional_guidance_scale=unconditional_guidance_scale,
unconditional_conditioning=unconditional_conditioning,
verbose=verbose,
**kwargs)
return samples, intermediates
@torch.no_grad()
def ddim_sampling(self, cond, shape,
x_T=None, ddim_use_original_steps=False,
callback=None, timesteps=None, quantize_denoised=False,
mask=None, x0=None, img_callback=None, log_every_t=100,
temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None,
unconditional_guidance_scale=1., unconditional_conditioning=None, verbose=True,
cond_tau=1., target_size=None, start_timesteps=None,
**kwargs):
device = self.model.betas.device
print('ddim device', device)
b = shape[0]
if x_T is None:
img = torch.randn(shape, device=device)
else:
img = x_T
if timesteps is None:
timesteps = self.ddpm_num_timesteps if ddim_use_original_steps else self.ddim_timesteps
elif timesteps is not None and not ddim_use_original_steps:
subset_end = int(min(timesteps / self.ddim_timesteps.shape[0], 1) * self.ddim_timesteps.shape[0]) - 1
timesteps = self.ddim_timesteps[:subset_end]
intermediates = {'x_inter': [img], 'pred_x0': [img]}
time_range = reversed(range(0,timesteps)) if ddim_use_original_steps else np.flip(timesteps)
total_steps = timesteps if ddim_use_original_steps else timesteps.shape[0]
if verbose:
iterator = tqdm(time_range, desc='DDIM Sampler', total=total_steps)
else:
iterator = time_range
init_x0 = False
clean_cond = kwargs.pop("clean_cond", False)
for i, step in enumerate(iterator):
index = total_steps - i - 1
ts = torch.full((b,), step, device=device, dtype=torch.long)
if start_timesteps is not None:
assert x0 is not None
if step > start_timesteps*time_range[0]:
continue
elif not init_x0:
img = self.model.q_sample(x0, ts)
init_x0 = True
# use mask to blend noised original latent (img_orig) & new sampled latent (img)
if mask is not None:
assert x0 is not None
if clean_cond:
img_orig = x0
else:
img_orig = self.model.q_sample(x0, ts) # TODO: deterministic forward pass? <ddim inversion>
img = img_orig * mask + (1. - mask) * img # keep original latent where mask==1; use sampled latent elsewhere
index_clip = int((1 - cond_tau) * total_steps)
if index <= index_clip and target_size is not None:
target_size_ = [target_size[0], target_size[1]//8, target_size[2]//8]
img = torch.nn.functional.interpolate(
img,
size=target_size_,
mode="nearest",
)
outs = self.p_sample_ddim(img, cond, ts, index=index, use_original_steps=ddim_use_original_steps,
quantize_denoised=quantize_denoised, temperature=temperature,
noise_dropout=noise_dropout, score_corrector=score_corrector,
corrector_kwargs=corrector_kwargs,
unconditional_guidance_scale=unconditional_guidance_scale,
unconditional_conditioning=unconditional_conditioning,
x0=x0,
**kwargs)
img, pred_x0 = outs
if callback: callback(i)
if img_callback: img_callback(pred_x0, i)
if index % log_every_t == 0 or index == total_steps - 1:
intermediates['x_inter'].append(img)
intermediates['pred_x0'].append(pred_x0)
return img, intermediates
@torch.no_grad()
def p_sample_ddim(self, x, c, t, index, repeat_noise=False, use_original_steps=False, quantize_denoised=False,
temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None,
unconditional_guidance_scale=1., unconditional_conditioning=None,
uc_type=None, conditional_guidance_scale_temporal=None, **kwargs):
b, *_, device = *x.shape, x.device
if x.dim() == 5:
is_video = True
else:
is_video = False
if unconditional_conditioning is None or unconditional_guidance_scale == 1.:
e_t = self.model.apply_model(x, t, c, **kwargs) # unet denoiser
else:
# with unconditional condition
if isinstance(c, torch.Tensor):
e_t = self.model.apply_model(x, t, c, **kwargs)
e_t_uncond = self.model.apply_model(x, t, unconditional_conditioning, **kwargs)
elif isinstance(c, dict):
e_t = self.model.apply_model(x, t, c, **kwargs)
e_t_uncond = self.model.apply_model(x, t, unconditional_conditioning, **kwargs)
else:
raise NotImplementedError
# text cfg
if uc_type is None:
e_t = e_t_uncond + unconditional_guidance_scale * (e_t - e_t_uncond)
else:
if uc_type == 'cfg_original':
e_t = e_t + unconditional_guidance_scale * (e_t - e_t_uncond)
elif uc_type == 'cfg_ours':
e_t = e_t + unconditional_guidance_scale * (e_t_uncond - e_t)
else:
raise NotImplementedError
# temporal guidance
if conditional_guidance_scale_temporal is not None:
e_t_temporal = self.model.apply_model(x, t, c, **kwargs)
e_t_image = self.model.apply_model(x, t, c, no_temporal_attn=True, **kwargs)
e_t = e_t + conditional_guidance_scale_temporal * (e_t_temporal - e_t_image)
if score_corrector is not None:
assert self.model.parameterization == "eps"
e_t = score_corrector.modify_score(self.model, e_t, x, t, c, **corrector_kwargs)
alphas = self.model.alphas_cumprod if use_original_steps else self.ddim_alphas
alphas_prev = self.model.alphas_cumprod_prev if use_original_steps else self.ddim_alphas_prev
sqrt_one_minus_alphas = self.model.sqrt_one_minus_alphas_cumprod if use_original_steps else self.ddim_sqrt_one_minus_alphas
sigmas = self.model.ddim_sigmas_for_original_num_steps if use_original_steps else self.ddim_sigmas
# select parameters corresponding to the currently considered timestep
if is_video:
size = (b, 1, 1, 1, 1)
else:
size = (b, 1, 1, 1)
a_t = torch.full(size, alphas[index], device=device)
a_prev = torch.full(size, alphas_prev[index], device=device)
sigma_t = torch.full(size, sigmas[index], device=device)
sqrt_one_minus_at = torch.full(size, sqrt_one_minus_alphas[index],device=device)
# current prediction for x_0
pred_x0 = (x - sqrt_one_minus_at * e_t) / a_t.sqrt()
if quantize_denoised:
pred_x0, _, *_ = self.model.first_stage_model.quantize(pred_x0)
# direction pointing to x_t
dir_xt = (1. - a_prev - sigma_t**2).sqrt() * e_t
noise = sigma_t * noise_like(x.shape, device, repeat_noise) * temperature
if noise_dropout > 0.:
noise = torch.nn.functional.dropout(noise, p=noise_dropout)
if self.use_scale:
scale_arr = self.model.scale_arr if use_original_steps else self.ddim_scale_arr
scale_t = torch.full(size, scale_arr[index], device=device)
scale_arr_prev = self.model.scale_arr_prev if use_original_steps else self.ddim_scale_arr_prev
scale_t_prev = torch.full(size, scale_arr_prev[index], device=device)
pred_x0 /= scale_t
x_prev = a_prev.sqrt() * scale_t_prev * pred_x0 + dir_xt + noise
else:
x_prev = a_prev.sqrt() * pred_x0 + dir_xt + noise
return x_prev, pred_x0
@torch.no_grad()
def stochastic_encode(self, x0, t, use_original_steps=False, noise=None):
# fast, but does not allow for exact reconstruction
# t serves as an index to gather the correct alphas
if use_original_steps:
sqrt_alphas_cumprod = self.sqrt_alphas_cumprod
sqrt_one_minus_alphas_cumprod = self.sqrt_one_minus_alphas_cumprod
else:
sqrt_alphas_cumprod = torch.sqrt(self.ddim_alphas)
sqrt_one_minus_alphas_cumprod = self.ddim_sqrt_one_minus_alphas
if noise is None:
noise = torch.randn_like(x0)
def extract_into_tensor(a, t, x_shape):
b, *_ = t.shape
out = a.gather(-1, t)
return out.reshape(b, *((1,) * (len(x_shape) - 1)))
return (extract_into_tensor(sqrt_alphas_cumprod, t, x0.shape) * x0 +
extract_into_tensor(sqrt_one_minus_alphas_cumprod, t, x0.shape) * noise)
@torch.no_grad()
def decode(self, x_latent, cond, t_start, unconditional_guidance_scale=1.0, unconditional_conditioning=None,
use_original_steps=False):
timesteps = np.arange(self.ddpm_num_timesteps) if use_original_steps else self.ddim_timesteps
timesteps = timesteps[:t_start]
time_range = np.flip(timesteps)
total_steps = timesteps.shape[0]
print(f"Running DDIM Sampling with {total_steps} timesteps")
iterator = tqdm(time_range, desc='Decoding image', total=total_steps)
x_dec = x_latent
for i, step in enumerate(iterator):
index = total_steps - i - 1
ts = torch.full((x_latent.shape[0],), step, device=x_latent.device, dtype=torch.long)
x_dec, _ = self.p_sample_ddim(x_dec, cond, ts, index=index, use_original_steps=use_original_steps,
unconditional_guidance_scale=unconditional_guidance_scale,
unconditional_conditioning=unconditional_conditioning)
return x_dec
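p_sample_ddim above combines classifier-free guidance with the DDIM update: guided noise e_t is used both to predict x0 and to step toward x_{t-1}. A scalar sketch of the math (function names are illustrative; with eta = 0, sigma_t = 0 and the update is deterministic):

```python
import math

def cfg(e_t_cond, e_t_uncond, scale):
    # Standard classifier-free guidance, as in the `uc_type is None` branch above.
    return e_t_uncond + scale * (e_t_cond - e_t_uncond)

def ddim_step(x, e_t, a_t, a_prev, sigma_t, noise=0.0):
    # DDIM update: predict x0 from the current latent, then recombine it with
    # the direction pointing to x_t, as in p_sample_ddim above.
    pred_x0 = (x - math.sqrt(1.0 - a_t) * e_t) / math.sqrt(a_t)
    dir_xt = math.sqrt(1.0 - a_prev - sigma_t ** 2) * e_t
    x_prev = math.sqrt(a_prev) * pred_x0 + dir_xt + sigma_t * noise
    return x_prev, pred_x0

e_t = cfg(e_t_cond=0.5, e_t_uncond=0.2, scale=7.5)
x_prev, pred_x0 = ddim_step(x=1.0, e_t=e_t, a_t=0.5, a_prev=0.8, sigma_t=0.0)
```

With guidance scale 1.0, cfg reduces to the conditional prediction; with a_t = a_prev = 1 and zero noise prediction, the update is the identity.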
================================================
FILE: lvdm/models/samplers/ddim_mp.py
================================================
import numpy as np
from tqdm import tqdm
import torch
from lvdm.models.utils_diffusion import make_ddim_sampling_parameters, make_ddim_timesteps
from lvdm.common import noise_like
class DDIMSampler(object):
def __init__(self, model, schedule="linear", **kwargs):
super().__init__()
self.model = model
self.ddpm_num_timesteps = model.num_timesteps
self.schedule = schedule
self.counter = 0
def register_buffer(self, name, attr):
if type(attr) == torch.Tensor:
if attr.device != torch.device("cuda"):
attr = attr.to(torch.device("cuda"))
setattr(self, name, attr)
def make_schedule(self, ddim_num_steps, ddim_discretize="uniform", ddim_eta=0., verbose=True):
self.ddim_timesteps = make_ddim_timesteps(ddim_discr_method=ddim_discretize, num_ddim_timesteps=ddim_num_steps,
num_ddpm_timesteps=self.ddpm_num_timesteps,verbose=verbose)
alphas_cumprod = self.model.alphas_cumprod
assert alphas_cumprod.shape[0] == self.ddpm_num_timesteps, 'alphas have to be defined for each timestep'
to_torch = lambda x: x.clone().detach().to(torch.float32).to(self.model.device)
self.register_buffer('betas', to_torch(self.model.betas))
self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod))
self.register_buffer('alphas_cumprod_prev', to_torch(self.model.alphas_cumprod_prev))
self.use_scale = self.model.use_scale
print('DDIM scale', self.use_scale)
if self.use_scale:
self.register_buffer('scale_arr', to_torch(self.model.scale_arr))
ddim_scale_arr = self.scale_arr.cpu()[self.ddim_timesteps]
self.register_buffer('ddim_scale_arr', ddim_scale_arr)
ddim_scale_arr = np.asarray([self.scale_arr.cpu()[0]] + self.scale_arr.cpu()[self.ddim_timesteps[:-1]].tolist())
self.register_buffer('ddim_scale_arr_prev', ddim_scale_arr)
# calculations for diffusion q(x_t | x_{t-1}) and others
self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod.cpu())))
self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod.cpu())))
self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod.cpu())))
self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu())))
self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu() - 1)))
# ddim sampling parameters
ddim_sigmas, ddim_alphas, ddim_alphas_prev = make_ddim_sampling_parameters(alphacums=alphas_cumprod.cpu(),
ddim_timesteps=self.ddim_timesteps,
eta=ddim_eta,verbose=verbose)
self.register_buffer('ddim_sigmas', ddim_sigmas)
self.register_buffer('ddim_alphas', ddim_alphas)
self.register_buffer('ddim_alphas_prev', ddim_alphas_prev)
self.register_buffer('ddim_sqrt_one_minus_alphas', np.sqrt(1. - ddim_alphas))
sigmas_for_original_sampling_steps = ddim_eta * torch.sqrt(
(1 - self.alphas_cumprod_prev) / (1 - self.alphas_cumprod) * (
1 - self.alphas_cumprod / self.alphas_cumprod_prev))
self.register_buffer('ddim_sigmas_for_original_num_steps', sigmas_for_original_sampling_steps)
@torch.no_grad()
def sample(self,
S,
batch_size,
shape,
conditioning=None,
callback=None,
normals_sequence=None,
img_callback=None,
quantize_x0=False,
eta=0.,
mask=None,
x0=None,
temperature=1.,
noise_dropout=0.,
score_corrector=None,
corrector_kwargs=None,
verbose=True,
schedule_verbose=False,
x_T=None,
log_every_t=100,
unconditional_guidance_scale=1.,
unconditional_conditioning=None,
# this has to come in the same format as the conditioning, # e.g. as encoded tokens, ...
**kwargs
):
# condition batch-size check (see ddim.py) is skipped here to allow multi-prompt batches
self.make_schedule(ddim_num_steps=S, ddim_eta=eta, verbose=schedule_verbose)
# make shape
if len(shape) == 3:
C, H, W = shape
size = (batch_size, C, H, W)
elif len(shape) == 4:
C, T, H, W = shape
size = (batch_size, C, T, H, W)
else:
raise ValueError(f"Unsupported shape length: {len(shape)}")
# print(f'Data shape for DDIM sampling is {size}, eta {eta}')
samples, intermediates = self.ddim_sampling(conditioning, size,
callback=callback,
img_callback=img_callback,
quantize_denoised=quantize_x0,
mask=mask, x0=x0,
ddim_use_original_steps=False,
noise_dropout=noise_dropout,
temperature=temperature,
score_corrector=score_corrector,
corrector_kwargs=corrector_kwargs,
x_T=x_T,
log_every_t=log_every_t,
unconditional_guidance_scale=unconditional_guidance_scale,
unconditional_conditioning=unconditional_conditioning,
verbose=verbose,
**kwargs)
return samples, intermediates
@torch.no_grad()
def ddim_sampling(self, cond, shape,
x_T=None, ddim_use_original_steps=False,
callback=None, timesteps=None, quantize_denoised=False,
mask=None, x0=None, img_callback=None, log_every_t=100,
temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None,
unconditional_guidance_scale=1., unconditional_conditioning=None, verbose=True,
cond_tau=1., target_size=None, start_timesteps=None,
**kwargs):
device = self.model.betas.device
print('ddim device', device)
b = shape[0]
if x_T is None:
img = torch.randn(shape, device=device)
else:
img = x_T
if timesteps is None:
timesteps = self.ddpm_num_timesteps if ddim_use_original_steps else self.ddim_timesteps
elif timesteps is not None and not ddim_use_original_steps:
subset_end = int(min(timesteps / self.ddim_timesteps.shape[0], 1) * self.ddim_timesteps.shape[0]) - 1
timesteps = self.ddim_timesteps[:subset_end]
intermediates = {'x_inter': [img], 'pred_x0': [img]}
time_range = reversed(range(0,timesteps)) if ddim_use_original_steps else np.flip(timesteps)
total_steps = timesteps if ddim_use_original_steps else timesteps.shape[0]
if verbose:
iterator = tqdm(time_range, desc='DDIM Sampler', total=total_steps)
else:
iterator = time_range
init_x0 = False
clean_cond = kwargs.pop("clean_cond", False)
for i, step in enumerate(iterator):
index = total_steps - i - 1
ts = torch.full((b,), step, device=device, dtype=torch.long)
if start_timesteps is not None:
assert x0 is not None
if step > start_timesteps*time_range[0]:
continue
elif not init_x0:
img = self.model.q_sample(x0, ts)
init_x0 = True
# use mask to blend noised original latent (img_orig) & new sampled latent (img)
if mask is not None:
assert x0 is not None
if clean_cond:
img_orig = x0
else:
img_orig = self.model.q_sample(x0, ts) # TODO: deterministic forward pass? <ddim inversion>
img = img_orig * mask + (1. - mask) * img # keep original latent where mask==1; use sampled latent elsewhere
index_clip = int((1 - cond_tau) * total_steps)
if index <= index_clip and target_size is not None:
target_size_ = [target_size[0], target_size[1]//8, target_size[2]//8]
img = torch.nn.functional.interpolate(
img,
size=target_size_,
mode="nearest",
)
outs = self.p_sample_ddim(img, cond, ts, index=index, use_original_steps=ddim_use_original_steps,
quantize_denoised=quantize_denoised, temperature=temperature,
noise_dropout=noise_dropout, score_corrector=score_corrector,
corrector_kwargs=corrector_kwargs,
unconditional_guidance_scale=unconditional_guidance_scale,
unconditional_conditioning=unconditional_conditioning,
x0=x0,
step=i,
**kwargs)
img, pred_x0 = outs
if callback: callback(i)
if img_callback: img_callback(pred_x0, i)
if index % log_every_t == 0 or index == total_steps - 1:
intermediates['x_inter'].append(img)
intermediates['pred_x0'].append(pred_x0)
return img, intermediates
@torch.no_grad()
def p_sample_ddim(self, x, c, t, index, repeat_noise=False, use_original_steps=False, quantize_denoised=False,
temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None,
unconditional_guidance_scale=1., unconditional_conditioning=None,
uc_type=None, conditional_guidance_scale_temporal=None, step=0, **kwargs):
b, *_, device = *x.shape, x.device
if x.dim() == 5:
is_video = True
else:
is_video = False
if unconditional_conditioning is None or unconditional_guidance_scale == 1.:
e_t = self.model.apply_model(x, t, c, **kwargs) # unet denoiser
else:
# with unconditional condition
if step < 5 or step > 15: # apply injection only at the early/late sampling steps
e_t = self.model.apply_model(x, t, c, use_injection=True, **kwargs)
e_t_uncond = self.model.apply_model(x, t, unconditional_conditioning, **kwargs)
elif isinstance(c, torch.Tensor):
e_t = self.model.apply_model(x, t, c, **kwargs)
e_t_uncond = self.model.apply_model(x, t, unconditional_conditioning, **kwargs)
elif isinstance(c, dict):
e_t = self.model.apply_model(x, t, c, **kwargs)
e_t_uncond = self.model.apply_model(x, t, unconditional_conditioning, **kwargs)
else:
raise NotImplementedError
# text cfg
if uc_type is None:
e_t = e_t_uncond + unconditional_guidance_scale * (e_t - e_t_uncond)
else:
if uc_type == 'cfg_original':
e_t = e_t + unconditional_guidance_scale * (e_t - e_t_uncond)
elif uc_type == 'cfg_ours':
e_t = e_t + unconditional_guidance_scale * (e_t_uncond - e_t)
else:
raise NotImplementedError
# temporal guidance
if conditional_guidance_scale_temporal is not None:
e_t_temporal = self.model.apply_model(x, t, c, **kwargs)
e_t_image = self.model.apply_model(x, t, c, no_temporal_attn=True, **kwargs)
e_t = e_t + conditional_guidance_scale_temporal * (e_t_temporal - e_t_image)
if score_corrector is not None:
assert self.model.parameterization == "eps"
e_t = score_corrector.modify_score(self.model, e_t, x, t, c, **corrector_kwargs)
alphas = self.model.alphas_cumprod if use_original_steps else self.ddim_alphas
alphas_prev = self.model.alphas_cumprod_prev if use_original_steps else self.ddim_alphas_prev
sqrt_one_minus_alphas = self.model.sqrt_one_minus_alphas_cumprod if use_original_steps else self.ddim_sqrt_one_minus_alphas
sigmas = self.model.ddim_sigmas_for_original_num_steps if use_original_steps else self.ddim_sigmas
# select parameters corresponding to the currently considered timestep
if is_video:
size = (b, 1, 1, 1, 1)
else:
size = (b, 1, 1, 1)
a_t = torch.full(size, alphas[index], device=device)
a_prev = torch.full(size, alphas_prev[index], device=device)
sigma_t = torch.full(size, sigmas[index], device=device)
sqrt_one_minus_at = torch.full(size, sqrt_one_minus_alphas[index],device=device)
# current prediction for x_0
pred_x0 = (x - sqrt_one_minus_at * e_t) / a_t.sqrt()
if quantize_denoised:
pred_x0, _, *_ = self.model.first_stage_model.quantize(pred_x0)
# direction pointing to x_t
dir_xt = (1. - a_prev - sigma_t**2).sqrt() * e_t
noise = sigma_t * noise_like(x.shape, device, repeat_noise) * temperature
if noise_dropout > 0.:
noise = torch.nn.functional.dropout(noise, p=noise_dropout)
if self.use_scale:
scale_arr = self.model.scale_arr if use_original_steps else self.ddim_scale_arr
scale_t = torch.full(size, scale_arr[index], device=device)
scale_arr_prev = self.model.scale_arr_prev if use_original_steps else self.ddim_scale_arr_prev
scale_t_prev = torch.full(size, scale_arr_prev[index], device=device)
pred_x0 /= scale_t
x_prev = a_prev.sqrt() * scale_t_prev * pred_x0 + dir_xt + noise
else:
x_prev = a_prev.sqrt() * pred_x0 + dir_xt + noise
return x_prev, pred_x0
@torch.no_grad()
def stochastic_encode(self, x0, t, use_original_steps=False, noise=None):
# fast, but does not allow for exact reconstruction
# t serves as an index to gather the correct alphas
if use_original_steps:
sqrt_alphas_cumprod = self.sqrt_alphas_cumprod
sqrt_one_minus_alphas_cumprod = self.sqrt_one_minus_alphas_cumprod
else:
sqrt_alphas_cumprod = torch.sqrt(self.ddim_alphas)
sqrt_one_minus_alphas_cumprod = self.ddim_sqrt_one_minus_alphas
if noise is None:
noise = torch.randn_like(x0)
def extract_into_tensor(a, t, x_shape):
b, *_ = t.shape
out = a.gather(-1, t)
return out.reshape(b, *((1,) * (len(x_shape) - 1)))
return (extract_into_tensor(sqrt_alphas_cumprod, t, x0.shape) * x0 +
extract_into_tensor(sqrt_one_minus_alphas_cumprod, t, x0.shape) * noise)
@torch.no_grad()
def decode(self, x_latent, cond, t_start, unconditional_guidance_scale=1.0, unconditional_conditioning=None,
use_original_steps=False):
timesteps = np.arange(self.ddpm_num_timesteps) if use_original_steps else self.ddim_timesteps
timesteps = timesteps[:t_start]
time_range = np.flip(timesteps)
total_steps = timesteps.shape[0]
print(f"Running DDIM Sampling with {total_steps} timesteps")
iterator = tqdm(time_range, desc='Decoding image', total=total_steps)
x_dec = x_latent
for i, step in enumerate(iterator):
index = total_steps - i - 1
ts = torch.full((x_latent.shape[0],), step, device=x_latent.device, dtype=torch.long)
x_dec, _ = self.p_sample_ddim(x_dec, cond, ts, index=index, use_original_steps=use_original_steps,
unconditional_guidance_scale=unconditional_guidance_scale,
unconditional_conditioning=unconditional_conditioning)
return x_dec
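For reference, the core update in `p_sample_ddim` above can be sketched in a few lines of NumPy. This is a minimal, illustrative restatement (variable names are mine, not the repo's API) of the deterministic `sigma_t = 0` case:

```python
import numpy as np

def ddim_step(x_t, e_t, a_t, a_prev, sigma_t=0.0, noise=0.0):
    """One DDIM update, mirroring p_sample_ddim above (sigma_t=0 is deterministic)."""
    # current prediction for x_0
    pred_x0 = (x_t - np.sqrt(1.0 - a_t) * e_t) / np.sqrt(a_t)
    # direction pointing to x_t
    dir_xt = np.sqrt(1.0 - a_prev - sigma_t ** 2) * e_t
    x_prev = np.sqrt(a_prev) * pred_x0 + dir_xt + sigma_t * noise
    return x_prev, pred_x0

# if e_t equals the true noise used to corrupt x0, pred_x0 recovers x0 exactly
rng = np.random.default_rng(0)
x0, eps = rng.standard_normal(4), rng.standard_normal(4)
a_t, a_prev = 0.5, 0.7
x_t = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps
x_prev, pred_x0 = ddim_step(x_t, eps, a_t, a_prev)
```

With a perfect noise estimate, one step lands exactly on the re-noised clean sample at the previous alpha level.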
================================================
FILE: lvdm/models/utils_diffusion.py
================================================
import math
import numpy as np
from einops import repeat
import torch
import torch.nn.functional as F
def timestep_embedding(timesteps, dim, max_period=10000, repeat_only=False):
    """
    Create sinusoidal timestep embeddings.
    :param timesteps: a 1-D Tensor of N indices, one per batch element.
                      These may be fractional.
    :param dim: the dimension of the output.
    :param max_period: controls the minimum frequency of the embeddings.
    :param repeat_only: if True, skip the sinusoidal encoding and simply
                        repeat the raw timesteps across the embedding dim.
    :return: an [N x dim] Tensor of positional embeddings.
    """
if not repeat_only:
half = dim // 2
freqs = torch.exp(
-math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half
).to(device=timesteps.device)
args = timesteps[:, None].float() * freqs[None]
embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
if dim % 2:
embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
else:
embedding = repeat(timesteps, 'b -> b d', d=dim)
return embedding
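As a quick sanity check, the sinusoidal branch of `timestep_embedding` can be restated in NumPy (illustrative only; note the cosine half comes first, matching the concatenation order above):

```python
import numpy as np

def timestep_embedding_np(timesteps, dim, max_period=10000):
    # NumPy restatement of the sinusoidal branch above: cos half first, then sin half
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(timesteps, dtype=np.float64)[:, None] * freqs[None]
    return np.concatenate([np.cos(args), np.sin(args)], axis=-1)

emb = timestep_embedding_np([0, 10, 999], dim=8)
```

At `t = 0` every frequency argument is zero, so the embedding is four ones (cos) followed by four zeros (sin).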
def make_beta_schedule(schedule, n_timestep, linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3):
if schedule == "linear":
betas = (
torch.linspace(linear_start ** 0.5, linear_end ** 0.5, n_timestep, dtype=torch.float64) ** 2
)
elif schedule == "cosine":
timesteps = (
torch.arange(n_timestep + 1, dtype=torch.float64) / n_timestep + cosine_s
)
alphas = timesteps / (1 + cosine_s) * np.pi / 2
alphas = torch.cos(alphas).pow(2)
alphas = alphas / alphas[0]
betas = 1 - alphas[1:] / alphas[:-1]
betas = np.clip(betas, a_min=0, a_max=0.999)
elif schedule == "sqrt_linear":
betas = torch.linspace(linear_start, linear_end, n_timestep, dtype=torch.float64)
elif schedule == "sqrt":
betas = torch.linspace(linear_start, linear_end, n_timestep, dtype=torch.float64) ** 0.5
else:
raise ValueError(f"schedule '{schedule}' unknown.")
return betas.numpy()
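The `"linear"` branch interpolates in sqrt(beta) space and then squares, so the endpoints land exactly on `linear_start` and `linear_end` while the interior grows faster than a plain linear ramp. A minimal NumPy sketch:

```python
import numpy as np

# the "linear" branch interpolates in sqrt(beta) space, then squares,
# so betas[0] == linear_start and betas[-1] == linear_end
n_timestep, linear_start, linear_end = 1000, 1e-4, 2e-2
betas = np.linspace(linear_start ** 0.5, linear_end ** 0.5, n_timestep) ** 2
```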
def make_ddim_timesteps(ddim_discr_method, num_ddim_timesteps, num_ddpm_timesteps, verbose=True):
if ddim_discr_method == 'uniform':
c = num_ddpm_timesteps // num_ddim_timesteps
ddim_timesteps = np.asarray(list(range(0, num_ddpm_timesteps, c)))
elif ddim_discr_method == 'quad':
ddim_timesteps = ((np.linspace(0, np.sqrt(num_ddpm_timesteps * .8), num_ddim_timesteps)) ** 2).astype(int)
else:
raise NotImplementedError(f'There is no ddim discretization method called "{ddim_discr_method}"')
# assert ddim_timesteps.shape[0] == num_ddim_timesteps
# add one to get the final alpha values right (the ones from first scale to data during sampling)
steps_out = ddim_timesteps + 1
if verbose:
print(f'Selected timesteps for ddim sampler: {steps_out}')
return steps_out
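With the `'uniform'` method, the selected steps are just an even stride through the DDPM range plus the final `+1` offset. A worked example with a hypothetical 1000-step DDPM subsampled to 50 DDIM steps:

```python
import numpy as np

# the 'uniform' method strides evenly through the DDPM range, then adds 1
num_ddpm_timesteps, num_ddim_timesteps = 1000, 50
c = num_ddpm_timesteps // num_ddim_timesteps            # stride of 20
steps_out = np.arange(0, num_ddpm_timesteps, c) + 1     # 1, 21, ..., 981
```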
def make_ddim_sampling_parameters(alphacums, ddim_timesteps, eta, verbose=True):
# select alphas for computing the variance schedule
# print(f'ddim_timesteps={ddim_timesteps}, len_alphacums={len(alphacums)}')
alphas = alphacums[ddim_timesteps]
alphas_prev = np.asarray([alphacums[0]] + alphacums[ddim_timesteps[:-1]].tolist())
    # according to the formula provided in https://arxiv.org/abs/2010.02502
sigmas = eta * np.sqrt((1 - alphas_prev) / (1 - alphas) * (1 - alphas / alphas_prev))
if verbose:
print(f'Selected alphas for ddim sampler: a_t: {alphas}; a_(t-1): {alphas_prev}')
print(f'For the chosen value of eta, which is {eta}, '
f'this results in the following sigma_t schedule for ddim sampler {sigmas}')
return sigmas, alphas, alphas_prev
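A small worked example with illustrative values (not tied to any config in this repo): with `eta = 0` the sigma schedule collapses to zero, which is the deterministic DDIM case used for standard sampling:

```python
import numpy as np

# hypothetical schedule: cumulative alphas from a linear beta schedule
alphacums = np.cumprod(1.0 - np.linspace(1e-4, 2e-2, 1000))
ddim_timesteps = np.arange(0, 1000, 100) + 1
alphas = alphacums[ddim_timesteps]
alphas_prev = np.asarray([alphacums[0]] + alphacums[ddim_timesteps[:-1]].tolist())
eta = 0.0  # deterministic DDIM: every sigma_t is zero
sigmas = eta * np.sqrt((1 - alphas_prev) / (1 - alphas) * (1 - alphas / alphas_prev))
```

Since the cumulative alphas decrease over time, each `alphas_prev` entry exceeds its paired `alphas` entry, keeping the expressions under the square root non-negative.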
def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.999):
"""
Create a beta schedule that discretizes the given alpha_t_bar function,
which defines the cumulative product of (1-beta) over time from t = [0,1].
:param num_diffusion_timesteps: the number of betas to produce.
:param alpha_bar: a lambda that takes an argument t from 0 to 1 and
produces the cumulative product of (1-beta) up to that
part of the diffusion process.
:param max_beta: the maximum beta to use; use values lower than 1 to
prevent singularities.
"""
betas = []
for i in range(num_diffusion_timesteps):
t1 = i / num_diffusion_timesteps
t2 = (i + 1) / num_diffusion_timesteps
betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
return np.array(betas)
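A typical use of `betas_for_alpha_bar` is the squared-cosine schedule. The sketch below supplies that `alpha_bar` as an assumption (it is not defined in this file) and inlines the same min/clip loop:

```python
import math
import numpy as np

def cosine_alpha_bar(t, s=0.008):
    # squared-cosine cumulative schedule; an assumed example, not defined in this file
    return math.cos((t + s) / (1 + s) * math.pi / 2) ** 2

num_diffusion_timesteps, max_beta = 100, 0.999
betas = np.array([
    min(1 - cosine_alpha_bar((i + 1) / num_diffusion_timesteps)
            / cosine_alpha_bar(i / num_diffusion_timesteps), max_beta)
    for i in range(num_diffusion_timesteps)
])
```

Because `alpha_bar` vanishes at `t = 1`, the last ratios approach zero and `max_beta` caps the final betas, which is exactly the singularity the docstring warns about.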
================================================
FILE: lvdm/modules/attention.py
================================================
from functools import partial
import torch
from torch import nn, einsum
import torch.nn.functional as F
from einops import rearrange, repeat
try:
import xformers
import xformers.ops
XFORMERS_IS_AVAILBLE = True
except ImportError:
XFORMERS_IS_AVAILBLE = False
from lvdm.common import (
checkpoint,
exists,
default,
)
from lvdm.basics import (
zero_module,
)
class RelativePosition(nn.Module):
""" https://github.com/evelinehong/Transformer_Relative_Position_PyTorch/blob/master/relative_position.py """
def __init__(self, num_units, max_relative_position):
super().__init__()
self.num_units = num_units
self.max_relative_position = max_relative_position
self.embeddings_table = nn.Parameter(torch.Tensor(max_relative_position * 2 + 1, num_units))
nn.init.xavier_uniform_(self.embeddings_table)
def forward(self, length_q, length_k):
device = self.embeddings_table.device
range_vec_q = torch.arange(length_q, device=device)
range_vec_k = torch.arange(length_k, device=device)
distance_mat = range_vec_k[None, :] - range_vec_q[:, None]
distance_mat_clipped = torch.clamp(distance_mat, -self.max_relative_position, self.max_relative_position)
final_mat = distance_mat_clipped + self.max_relative_position
final_mat = final_mat.long()
embeddings = self.embeddings_table[final_mat]
return embeddings
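The index matrix built in `RelativePosition.forward` is simply the pairwise offset `k - q`, clipped to the maximum relative position and shifted to be a valid (non-negative) table index. A tiny NumPy illustration:

```python
import numpy as np

# same construction as RelativePosition.forward: pairwise offset k - q,
# clipped to +/- max_relative_position, then shifted to index the embedding table
length_q = length_k = 3
max_relative_position = 2
distance_mat = np.arange(length_k)[None, :] - np.arange(length_q)[:, None]
final_mat = np.clip(distance_mat, -max_relative_position, max_relative_position) + max_relative_position
```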
class CrossAttention(nn.Module):
def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0.,
relative_position=False, temporal_length=None, img_cross_attention=False):
super().__init__()
inner_dim = dim_head * heads
context_dim = default(context_dim, query_dim)
self.scale = dim_head**-0.5
self.heads = heads
self.dim_head = dim_head
self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
self.to_out = nn.Sequential(nn.Linear(inner_dim, query_dim), nn.Dropout(dropout))
self.image_cross_attention_scale = 1.0
self.text_context_len = 77
self.img_cross_attention = img_cross_attention
if self.img_cross_attention:
self.to_k_ip = nn.Linear(context_dim, inner_dim, bias=False)
self.to_v_ip = nn.Linear(context_dim, inner_dim, bias=False)
self.relative_position = relative_position
if self.relative_position:
assert(temporal_length is not None)
self.relative_position_k = RelativePosition(num_units=dim_head, max_relative_position=temporal_length)
self.relative_position_v = RelativePosition(num_units=dim_head, max_relative_position=temporal_length)
else:
            ## the xformers fast path is only used for spatial attention, NOT for temporal attention
if XFORMERS_IS_AVAILBLE and temporal_length is None:
self.forward = self.efficient_forward
def forward(self, x, context=None, mask=None):
h = self.heads
q = self.to_q(x)
context = default(context, x)
## considering image token additionally
if context is not None and self.img_cross_attention:
context, context_img = context[:,:self.text_context_len,:], context[:,self.text_context_len:,:]
k = self.to_k(context)
v = self.to_v(context)
k_ip = self.to_k_ip(context_img)
v_ip = self.to_v_ip(context_img)
else:
k = self.to_k(context)
v = self.to_v(context)
q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=h), (q, k, v))
sim = torch.einsum('b i d, b j d -> b i j', q, k) * self.scale
if self.relative_position:
len_q, len_k, len_v = q.shape[1], k.shape[1], v.shape[1]
k2 = self.relative_position_k(len_q, len_k)
sim2 = einsum('b t d, t s d -> b t s', q, k2) * self.scale # TODO check
sim += sim2
del k
if exists(mask):
## feasible for causal attention mask only
max_neg_value = -torch.finfo(sim.dtype).max
mask = repeat(mask, 'b i j -> (b h) i j', h=h)
sim.masked_fill_(~(mask>0.5), max_neg_value)
# attention, what we cannot get enough of
sim = sim.softmax(dim=-1)
out = torch.einsum('b i j, b j d -> b i d', sim, v)
if self.relative_position:
v2 = self.relative_position_v(len_q, len_v)
out2 = einsum('b t s, t s d -> b t d', sim, v2) # TODO check
out += out2
out = rearrange(out, '(b h) n d -> b n (h d)', h=h)
## considering image token additionally
if context is not None and self.img_cross_attention:
k_ip, v_ip = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=h), (k_ip, v_ip))
sim_ip = torch.einsum('b i d, b j d -> b i j', q, k_ip) * self.scale
del k_ip
sim_ip = sim_ip.softmax(dim=-1)
out_ip = torch.einsum('b i j, b j d -> b i d', sim_ip, v_ip)
out_ip = rearrange(out_ip, '(b h) n d -> b n (h d)', h=h)
out = out + self.image_cross_attention_scale * out_ip
del q
return self.to_out(out)
def efficient_forward(self, x, context=None, mask=None):
q = self.to_q(x)
context = default(context, x)
## considering image token additionally
if context is not None and self.img_cross_attention:
context, context_img = context[:,:self.text_context_len,:], context[:,self.text_context_len:,:]
k = self.to_k(context)
v = self.to_v(context)
k_ip = self.to_k_ip(context_img)
v_ip = self.to_v_ip(context_img)
else:
k = self.to_k(context)
v = self.to_v(context)
b, _, _ = q.shape
q, k, v = map(
lambda t: t.unsqueeze(3)
.reshape(b, t.shape[1], self.heads, self.dim_head)
.permute(0, 2, 1, 3)
.reshape(b * self.heads, t.shape[1], self.dim_head)
.contiguous(),
(q, k, v),
)
# actually compute the attention, what we cannot get enough of
out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None, op=None)
## considering image token additionally
if context is not None and self.img_cross_attention:
k_ip, v_ip = map(
lambda t: t.unsqueeze(3)
.reshape(b, t.shape[1], self.heads, self.dim_head)
.permute(0, 2, 1, 3)
.reshape(b * self.heads, t.shape[1], self.dim_head)
.contiguous(),
(k_ip, v_ip),
)
out_ip = xformers.ops.memory_efficient_attention(q, k_ip, v_ip, attn_bias=None, op=None)
out_ip = (
out_ip.unsqueeze(0)
.reshape(b, self.heads, out.shape[1], self.dim_head)
.permute(0, 2, 1, 3)
.reshape(b, out.shape[1], self.heads * self.dim_head)
)
if exists(mask):
raise NotImplementedError
out = (
out.unsqueeze(0)
.reshape(b, self.heads, out.shape[1], self.dim_head)
.permute(0, 2, 1, 3)
.reshape(b, out.shape[1], self.heads * self.dim_head)
)
if context is not None and self.img_cross_attention:
out = out + self.image_cross_attention_scale * out_ip
return self.to_out(out)
class BasicTransformerBlock(nn.Module):
def __init__(self, dim, n_heads, d_head, dropout=0., context_dim=None, gated_ff=True, checkpoint=True,
disable_self_attn=False, attention_cls=None, img_cross_attention=False):
super().__init__()
attn_cls = CrossAttention if attention_cls is None else attention_cls
self.disable_self_attn = disable_self_attn
self.attn1 = attn_cls(query_dim=dim, heads=n_heads, dim_head=d_head, dropout=dropout,
context_dim=context_dim if self.disable_self_attn else None)
self.ff = FeedForward(dim, dropout=dropout, glu=gated_ff)
self.attn2 = attn_cls(query_dim=dim, context_dim=context_dim, heads=n_heads, dim_head=d_head, dropout=dropout,
img_cross_attention=img_cross_attention)
self.norm1 = nn.LayerNorm(dim)
self.norm2 = nn.LayerNorm(dim)
self.norm3 = nn.LayerNorm(dim)
self.checkpoint = checkpoint
    def forward(self, x, context=None, mask=None):
        ## implementation trick: checkpointing doesn't support non-tensor (e.g. None or scalar) arguments
        input_tuple = (x,)  ## must not be (x), otherwise *input_tuple would decouple x into multiple arguments
        if context is not None:
            input_tuple = (x, context)
        if mask is not None:
            ## bind mask via partial and keep context (if any) in the tuple instead of dropping it
            forward_mask = partial(self._forward, mask=mask)
            return checkpoint(forward_mask, input_tuple, self.parameters(), self.checkpoint)
        return checkpoint(self._forward, input_tuple, self.parameters(), self.checkpoint)
def _forward(self, x, context=None, mask=None):
x = self.attn1(self.norm1(x), context=context if self.disable_self_attn else None, mask=mask) + x
x = self.attn2(self.norm2(x), context=context, mask=mask) + x
x = self.ff(self.norm3(x)) + x
return x
class SpatialTransformer(nn.Module):
    """
    Transformer block for image-like data along the spatial axes.
    First, project the input (aka embedding)
    and reshape it to a token sequence of shape b, (h w), c.
    Then apply standard transformer action.
    Finally, reshape back to an image.
    NEW: use_linear for more efficiency instead of the 1x1 convs
    """
def __init__(self, in_channels, n_heads, d_head, depth=1, dropout=0., context_dim=None,
use_checkpoint=True, disable_self_attn=False, use_linear=False, img_cross_attention=False):
super().__init__()
self.in_channels = in_channels
inner_dim = n_heads * d_head
self.norm = torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True)
if not use_linear:
self.proj_in = nn.Conv2d(in_channels, inner_dim, kernel_size=1, stride=1, padding=0)
else:
self.proj_in = nn.Linear(in_channels, inner_dim)
self.transformer_blocks = nn.ModuleList([
BasicTransformerBlock(
inner_dim,
n_heads,
d_head,
dropout=dropout,
context_dim=context_dim,
img_cross_attention=img_cross_attention,
disable_self_attn=disable_self_attn,
checkpoint=use_checkpoint) for d in range(depth)
])
if not use_linear:
self.proj_out = zero_module(nn.Conv2d(inner_dim, in_channels, kernel_size=1, stride=1, padding=0))
else:
self.proj_out = zero_module(nn.Linear(inner_dim, in_channels))
self.use_linear = use_linear
def forward(self, x, context=None):
b, c, h, w = x.shape
x_in = x
x = self.norm(x)
if not self.use_linear:
x = self.proj_in(x)
x = rearrange(x, 'b c h w -> b (h w) c').contiguous()
if self.use_linear:
x = self.proj_in(x)
for i, block in enumerate(self.transformer_blocks):
x = block(x, context=context)
if self.use_linear:
x = self.proj_out(x)
x = rearrange(x, 'b (h w) c -> b c h w', h=h, w=w).contiguous()
if not self.use_linear:
x = self.proj_out(x)
return x + x_in
class TemporalTransformer(nn.Module):
"""
Transformer block for image-like data in temporal axis.
First, reshape to b, t, d.
Then apply standard transformer action.
Finally, reshape to image
"""
def __init__(self, in_channels, n_heads, d_head, depth=1, dropout=0., context_dim=None,
use_checkpoint=True, use_linear=False, only_self_att=True, causal_attention=False,
relative_position=False, temporal_length=None):
super().__init__()
self.only_self_att = only_self_att
self.relative_position = relative_position
self.causal_attention = causal_attention
self.in_channels = in_channels
inner_dim = n_heads * d_head
self.norm = torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True)
        if not use_linear:
            self.proj_in = nn.Conv1d(in_channels, inner_dim, kernel_size=1, stride=1, padding=0)
        else:
            self.proj_in = nn.Linear(in_channels, inner_dim)
if relative_position:
assert(temporal_length is not None)
attention_cls = partial(CrossAttention, relative_position=True, temporal_length=temporal_length)
else:
attention_cls = None
if self.causal_attention:
assert(temporal_length is not None)
self.mask = torch.tril(torch.ones([1, temporal_length, temporal_length]))
if self.only_self_att:
context_dim = None
self.transformer_blocks = nn.ModuleList([
BasicTransformerBlock(
inner_dim,
n_heads,
d_head,
dropout=dropout,
context_dim=context_dim,
attention_cls=attention_cls,
checkpoint=use_checkpoint) for d in range(depth)
])
if not use_linear:
self.proj_out = zero_module(nn.Conv1d(inner_dim, in_channels, kernel_size=1, stride=1, padding=0))
else:
self.proj_out = zero_module(nn.Linear(inner_dim, in_channels))
self.use_linear = use_linear
def forward(self, x, context=None):
b, c, t, h, w = x.shape
x_in = x
x = self.norm(x)
x = rearrange(x, 'b c t h w -> (b h w) c t').contiguous()
if not self.use_linear:
x = self.proj_in(x)
x = rearrange(x, 'bhw c t -> bhw t c').contiguous()
if self.use_linear:
x = self.proj_in(x)
if self.causal_attention:
mask = self.mask.to(x.device)
mask = repeat(mask, 'l i j -> (l bhw) i j', bhw=b*h*w)
else:
mask = None
if self.only_self_att:
## note: if no context is given, cross-attention defaults to self-attention
for i, block in enumerate(self.transformer_blocks):
x = block(x, mask=mask)
x = rearrange(x, '(b hw) t c -> b hw t c', b=b).contiguous()
else:
x = rearrange(x, '(b hw) t c -> b hw t c', b=b).contiguous()
context = rearrange(context, '(b t) l con -> b t l con', t=t).contiguous()
for i, block in enumerate(self.transformer_blocks):
                # process each batch element separately (some backends cannot handle a tensor dimension larger than 65,535)
for j in range(b):
context_j = repeat(
context[j],
't l con -> (t r) l con', r=(h * w) // t, t=t).contiguous()
                    ## note: the causal mask is not applied in the cross-attention case
x[j] = block(x[j], context=context_j)
if self.use_linear:
x = self.proj_out(x)
x = rearrange(x, 'b (h w) t c -> b c t h w', h=h, w=w).contiguous()
if not self.use_linear:
x = rearrange(x, 'b hw t c -> (b hw) c t').contiguous()
x = self.proj_out(x)
x = rearrange(x, '(b h w) c t -> b c t h w', b=b, h=h, w=w).contiguous()
return x + x_in
class GEGLU(nn.Module):
def __init__(self, dim_in, dim_out):
super().__init__()
self.proj = nn.Linear(dim_in, dim_out * 2)
def forward(self, x):
x, gate = self.proj(x).chunk(2, dim=-1)
return x * F.gelu(gate)
class FeedForward(nn.Module):
def __init__(self, dim, dim_out=None, mult=4, glu=False, dropout=0.):
super().__init__()
inner_dim = int(dim * mult)
dim_out = default(dim_out, dim)
project_in = nn.Sequential(
nn.Linear(dim, inner_dim),
nn.GELU()
) if not glu else GEGLU(dim, inner_dim)
self.net = nn.Sequential(
project_in,
nn.Dropout(dropout),
nn.Linear(inner_dim, dim_out)
)
def forward(self, x):
return self.net(x)
class LinearAttention(nn.Module):
def __init__(self, dim, heads=4, dim_head=32):
super().__init__()
self.heads = heads
hidden_dim = dim_head * heads
self.to_qkv = nn.Conv2d(dim, hidden_dim * 3, 1, bias = False)
self.to_out = nn.Conv2d(hidden_dim, dim, 1)
def forward(self, x):
b, c, h, w = x.shape
qkv = self.to_qkv(x)
q, k, v = rearrange(qkv, 'b (qkv heads c) h w -> qkv b heads c (h w)', heads = self.heads, qkv=3)
k = k.softmax(dim=-1)
context = torch.einsum('bhdn,bhen->bhde', k, v)
out = torch.einsum('bhde,bhdn->bhen', context, q)
out = rearrange(out, 'b heads c (h w) -> b (heads c) h w', heads=self.heads, h=h, w=w)
return self.to_out(out)
class SpatialSelfAttention(nn.Module):
def __init__(self, in_channels):
super().__init__()
self.in_channels = in_channels
self.norm = torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True)
self.q = torch.nn.Conv2d(in_channels,
in_channels,
kernel_size=1,
stride=1,
padding=0)
self.k = torch.nn.Conv2d(in_channels,
in_channels,
kernel_size=1,
stride=1,
padding=0)
self.v = torch.nn.Conv2d(in_channels,
in_channels,
kernel_size=1,
stride=1,
padding=0)
self.proj_out = torch.nn.Conv2d(in_channels,
in_channels,
kernel_size=1,
stride=1,
padding=0)
def forward(self, x):
h_ = x
h_ = self.norm(h_)
q = self.q(h_)
k = self.k(h_)
v = self.v(h_)
# compute attention
b,c,h,w = q.shape
q = rearrange(q, 'b c h w -> b (h w) c')
k = rearrange(k, 'b c h w -> b c (h w)')
w_ = torch.einsum('bij,bjk->bik', q, k)
w_ = w_ * (int(c)**(-0.5))
w_ = torch.nn.functional.softmax(w_, dim=2)
# attend to values
v = rearrange(v, 'b c h w -> b c (h w)')
w_ = rearrange(w_, 'b i j -> b j i')
h_ = torch.einsum('bij,bjk->bik', v, w_)
h_ = rearrange(h_, 'b c (h w) -> b c h w', h=h)
h_ = self.proj_out(h_)
return x+h_
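Throughout this file, heads are split either with einops (`'b n (h d) -> (b h) n d'` in `forward`) or with the unsqueeze/reshape/permute chain in `efficient_forward`. The two are equivalent; a NumPy check with illustrative shapes:

```python
import numpy as np

b, n, heads, dim_head = 2, 5, 4, 8
x = np.random.default_rng(1).standard_normal((b, n, heads * dim_head))

# einops-style split: 'b n (h d) -> (b h) n d'
split_a = x.reshape(b, n, heads, dim_head).transpose(0, 2, 1, 3).reshape(b * heads, n, dim_head)
# manual chain from efficient_forward (unsqueeze + reshape + permute + reshape)
split_b = (x[:, :, :, None]
           .reshape(b, n, heads, dim_head)
           .transpose(0, 2, 1, 3)
           .reshape(b * heads, n, dim_head))
```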
================================================
FILE: lvdm/modules/attention_freenoise.py
================================================
from functools import partial
import torch
from torch import nn, einsum
import torch.nn.functional as F
from einops import rearrange, repeat
try:
import xformers
import xformers.ops
XFORMERS_IS_AVAILBLE = True
except ImportError:
XFORMERS_IS_AVAILBLE = False
from lvdm.common import (
checkpoint,
exists,
default,
)
from lvdm.basics import (
zero_module,
)
def generate_weight_sequence(n):
if n % 2 == 0:
max_weight = n // 2
weight_sequence = list(range(1, max_weight + 1, 1)) + list(range(max_weight, 0, -1))
else:
max_weight = (n + 1) // 2
weight_sequence = list(range(1, max_weight, 1)) + [max_weight] + list(range(max_weight - 1, 0, -1))
return weight_sequence
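`generate_weight_sequence` produces a triangular ramp (e.g. `[1, 2, 2, 1]` for n=4), which `CrossAttention.forward` below uses to blend overlapping temporal windows via weighted overlap-add (the `value / count` normalization). A minimal sketch with a hypothetical two-window schedule:

```python
import numpy as np

def generate_weight_sequence(n):
    # same triangular ramp as the function above
    if n % 2 == 0:
        m = n // 2
        return list(range(1, m + 1)) + list(range(m, 0, -1))
    m = (n + 1) // 2
    return list(range(1, m)) + [m] + list(range(m - 1, 0, -1))

# overlap-add fusion in the style of CrossAttention.forward: each window's output
# is accumulated with triangular weights, then normalized by the weight total
windows = [(0, 4), (2, 6)]              # hypothetical window schedule over 6 frames
value = np.zeros(6)
count = np.zeros(6)
for t_start, t_end in windows:
    w = np.array(generate_weight_sequence(t_end - t_start), dtype=float)
    value[t_start:t_end] += 1.0 * w     # pretend each window outputs all-ones
    count[t_start:t_end] += w
fused = np.where(count > 0, value / np.maximum(count, 1e-8), value)
```

The ramp down-weights frames near a window boundary, so overlapping windows hand off smoothly instead of producing seams.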
class RelativePosition(nn.Module):
""" https://github.com/evelinehong/Transformer_Relative_Position_PyTorch/blob/master/relative_position.py """
def __init__(self, num_units, max_relative_position):
super().__init__()
self.num_units = num_units
self.max_relative_position = max_relative_position
self.embeddings_table = nn.Parameter(torch.Tensor(max_relative_position * 2 + 1, num_units))
nn.init.xavier_uniform_(self.embeddings_table)
def forward(self, length_q, length_k):
device = self.embeddings_table.device
range_vec_q = torch.arange(length_q, device=device)
range_vec_k = torch.arange(length_k, device=device)
distance_mat = range_vec_k[None, :] - range_vec_q[:, None]
distance_mat_clipped = torch.clamp(distance_mat, -self.max_relative_position, self.max_relative_position)
final_mat = distance_mat_clipped + self.max_relative_position
final_mat = final_mat.long()
embeddings = self.embeddings_table[final_mat]
return embeddings
class CrossAttention(nn.Module):
def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0.,
relative_position=False, temporal_length=None, img_cross_attention=False, injection=False):
super().__init__()
inner_dim = dim_head * heads
context_dim = default(context_dim, query_dim)
self.scale = dim_head**-0.5
self.heads = heads
self.dim_head = dim_head
self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
self.to_out = nn.Sequential(nn.Linear(inner_dim, query_dim), nn.Dropout(dropout))
self.image_cross_attention_scale = 1.0
self.text_context_len = 77
self.img_cross_attention = img_cross_attention
if self.img_cross_attention:
self.to_k_ip = nn.Linear(context_dim, inner_dim, bias=False)
self.to_v_ip = nn.Linear(context_dim, inner_dim, bias=False)
self.relative_position = relative_position
if self.relative_position:
assert(temporal_length is not None)
self.relative_position_k = RelativePosition(num_units=dim_head, max_relative_position=temporal_length)
self.relative_position_v = RelativePosition(num_units=dim_head, max_relative_position=temporal_length)
else:
            ## the xformers fast path is only used for spatial attention, NOT for temporal attention
if XFORMERS_IS_AVAILBLE and temporal_length is None:
self.forward = self.efficient_forward
self.injection = injection
def forward(self, x, context=None, mask=None, context_next=None, use_injection=False):
sa_flag = False
if context is None:
sa_flag = True
h = self.heads
all_q = self.to_q(x)
context = default(context, x)
## considering image token additionally
if context is not None and self.img_cross_attention:
context, context_img = context[:,:self.text_context_len,:], context[:,self.text_context_len:,:]
all_k = self.to_k(context)
all_v = self.to_v(context)
all_k_ip = self.to_k_ip(context_img)
all_v_ip = self.to_v_ip(context_img)
else:
all_k = self.to_k(context)
all_v = self.to_v(context)
count = torch.zeros_like(all_k)
value = torch.zeros_like(all_k)
if (sa_flag) and (context_next is not None):
all_q, all_k, all_v = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=h), (all_q, all_k, all_v))
if context is not None and self.img_cross_attention:
all_k_ip, all_v_ip = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=h), (all_k_ip, all_v_ip))
for t_start, t_end in context_next:
weight_sequence = generate_weight_sequence(t_end - t_start)
weight_tensor = torch.ones_like(count[:, t_start:t_end])
weight_tensor = weight_tensor * torch.Tensor(weight_sequence).to(x.device).unsqueeze(0).unsqueeze(-1)
q = all_q[:, t_start:t_end]
k = all_k[:, t_start:t_end]
v = all_v[:, t_start:t_end]
sim = torch.einsum('b i d, b j d -> b i j', q, k) * self.scale
if self.relative_position:
len_q, len_k, len_v = q.shape[1], k.shape[1], v.shape[1]
k2 = self.relative_position_k(len_q, len_k)
sim2 = einsum('b t d, t s d -> b t s', q, k2) * self.scale # TODO check
sim += sim2
del k
if exists(mask):
## feasible for causal attention mask only
max_neg_value = -torch.finfo(sim.dtype).max
mask = repeat(mask, 'b i j -> (b h) i j', h=h)
sim.masked_fill_(~(mask>0.5), max_neg_value)
# attention, what we cannot get enough of
sim = sim.softmax(dim=-1)
out = torch.einsum('b i j, b j d -> b i d', sim, v)
if self.relative_position:
v2 = self.relative_position_v(len_q, len_v)
out2 = einsum('b t s, t s d -> b t d', sim, v2) # TODO check
out += out2
out = rearrange(out, '(b h) n d -> b n (h d)', h=h)
## considering image token additionally
if context is not None and self.img_cross_attention:
k_ip = all_k_ip[:, t_start:t_end]
v_ip = all_v_ip[:, t_start:t_end]
sim_ip = torch.einsum('b i d, b j d -> b i j', q, k_ip) * self.scale
del k_ip
sim_ip = sim_ip.softmax(dim=-1)
out_ip = torch.einsum('b i j, b j d -> b i d', sim_ip, v_ip)
out_ip = rearrange(out_ip, '(b h) n d -> b n (h d)', h=h)
out = out + self.image_cross_attention_scale * out_ip
del q
value[:,t_start:t_end] += out * weight_tensor
count[:,t_start:t_end] += weight_tensor
final_out = torch.where(count>0, value/count, value)
else:
q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=h), (all_q, all_k, all_v))
sim = torch.einsum('b i d, b j d -> b i j', q, k) * self.scale
if self.relative_position:
len_q, len_k, len_v = q.shape[1], k.shape[1], v.shape[1]
k2 = self.relative_position_k(len_q, len_k)
sim2 = einsum('b t d, t s d -> b t s', q, k2) * self.scale # TODO check
sim += sim2
del k
if exists(mask):
## feasible for causal attention mask only
max_neg_value = -torch.finfo(sim.dtype).max
mask = repeat(mask, 'b i j -> (b h) i j', h=h)
sim.masked_fill_(~(mask>0.5), max_neg_value)
# attention, what we cannot get enough of
sim = sim.softmax(dim=-1)
out = torch.einsum('b i j, b j d -> b i d', sim, v)
if self.relative_position:
v2 = self.relative_position_v(len_q, len_v)
out2 = einsum('b t s, t s d -> b t d', sim, v2) # TODO check
out += out2
final_out = rearrange(out, '(b h) n d -> b n (h d)', h=h)
## considering image token additionally
if context is not None and self.img_cross_attention:
k_ip, v_ip = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=h), (all_k_ip, all_v_ip))
sim_ip = torch.einsum('b i d, b j d -> b i j', q, k_ip) * self.scale
del k_ip
sim_ip = sim_ip.softmax(dim=-1)
out_ip = torch.einsum('b i j, b j d -> b i d', sim_ip, v_ip)
out_ip = rearrange(out_ip, '(b h) n d -> b n (h d)', h=h)
final_out = final_out + self.image_cross_attention_scale * out_ip
del q
return self.to_out(final_out)
def efficient_forward(self, x, context=None, mask=None, context_next=None, use_injection=False):
sa_flag = False
if context is None:
sa_flag = True
q = self.to_q(x)
context = default(context, x)
if not sa_flag:
sq_size = x.shape[0]
if self.injection and use_injection:
context_new = context[-sq_size:]
else:
context_new = context[:sq_size]
else:
context_new = context.clone()
## considering image token additionally
if context is not None and self.img_cross_attention:
context, context_img = context_new[:,:self.text_context_len,:], context_new[:,self.text_context_len:,:]
k = self.to_k(context)
v = self.to_v(context)
k_ip = self.to_k_ip(context_img)
v_ip = self.to_v_ip(context_img)
else:
k = self.to_k(context_new)
v = self.to_v(context_new)
b, _, _ = q.shape
q, k, v = map(
lambda t: t.unsqueeze(3)
.reshape(b, t.shape[1], self.heads, self.dim_head)
.permute(0, 2, 1, 3)
.reshape(b * self.heads, t.shape[1], self.dim_head)
.contiguous(),
(q, k, v),
)
# actually compute the attention, what we cannot get enough of
out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None, op=None)
## considering image token additionally
if context is not None and self.img_cross_attention:
k_ip, v_ip = map(
lambda t: t.unsqueeze(3)
.reshape(b, t.shape[1], self.heads, self.dim_head)
.permute(0, 2, 1, 3)
.reshape(b * self.heads, t.shape[1], self.dim_head)
.contiguous(),
(k_ip, v_ip),
)
out_ip = xformers.ops.memory_efficient_attention(q, k_ip, v_ip, attn_bias=None, op=None)
out_ip = (
out_ip.unsqueeze(0)
.reshape(b, self.heads, out.shape[1], self.dim_head)
.permute(0, 2, 1, 3)
.reshape(b, out.shape[1], self.heads * self.dim_head)
)
if exists(mask):
raise NotImplementedError
out = (
out.unsqueeze(0)
.reshape(b, self.heads, out.shape[1], self.dim_head)
.permute(0, 2, 1, 3)
.reshape(b, out.shape[1], self.heads * self.dim_head)
)
if context is not None and self.img_cross_attention:
out = out + self.image_cross_attention_scale * out_ip
return self.to_out(out)
class BasicTransformerBlock(nn.Module):
def __init__(self, dim, n_heads, d_head, dropout=0., context_dim=None, gated_ff=True, checkpoint=True,
disable_self_attn=False, attention_cls=None, img_cross_attention=False, injection=False):
super().__init__()
attn_cls = CrossAttention if attention_cls is None else attention_cls
self.disable_self_attn = disable_self_attn
self.attn1 = attn_cls(query_dim=dim, heads=n_heads, dim_head=d_head, dropout=dropout,
context_dim=context_dim if self.disable_self_attn else None, injection=injection)
self.ff = FeedForward(dim, dropout=dropout, glu=gated_ff)
self.attn2 = attn_cls(query_dim=dim, context_dim=context_dim, heads=n_heads, dim_head=d_head, dropout=dropout,
img_cross_attention=img_cross_attention, injection=injection)
self.norm1 = nn.LayerNorm(dim)
self.norm2 = nn.LayerNorm(dim)
self.norm3 = nn.LayerNorm(dim)
self.checkpoint = checkpoint
    def forward(self, x, context=None, mask=None, context_next=None, use_injection=False, **kwargs):
        ## implementation trick: checkpointing doesn't support non-tensor keyword arguments,
        ## so mask (and the extra FreeNoise arguments) are bound via partial when a mask is given
        if mask is not None:
            forward_mask = partial(self._forward, mask=mask, context_next=context_next, use_injection=use_injection)
            return checkpoint(forward_mask, (x, context), self.parameters(), self.checkpoint)
        input_tuple = (x, context, mask, context_next, use_injection)
        return checkpoint(self._forward, input_tuple, self.parameters(), self.checkpoint)
def _forward(self, x, context=None, mask=None, context_next=None, use_injection=False):
x = self.attn1(self.norm1(x), context=context if self.disable_self_attn else None, mask=mask, context_next=context_next, use_injection=False) + x
x = self.attn2(self.norm2(x), context=context, mask=mask, context_next=context_next, use_injection=use_injection) + x
x = self.ff(self.norm3(x)) + x
return x
class SpatialTransformer(nn.Module):
    """
    Transformer block for image-like data along the spatial axes.
    First, project the input (aka embedding)
    and reshape it to a token sequence of shape b, (h w), c.
    Then apply standard transformer action.
    Finally, reshape back to an image.
    NEW: use_linear for more efficiency instead of the 1x1 convs
    """
def __init__(self, in_channels, n_heads, d_head, depth=1, dropout=0., context_dim=None,
use_checkpoint=True, disable_self_attn=False, use_linear=False, img_cross_attention=False, injection=False):
super().__init__()
self.in_channels = in_channels
inner_dim = n_heads * d_head
self.norm = torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True)
if not use_linear:
self.proj_in = nn.Conv2d(in_channels, inner_dim, kernel_size=1, stride=1, padding=0)
else:
self.proj_in = nn.Linear(in_channels, inner_dim)
self.transformer_blocks = nn.ModuleList([
BasicTransformerBlock(
inner_dim,
n_heads,
d_head,
dropout=dropout,
context_dim=context_dim,
img_cross_attention=img_cross_attention,
disable_self_attn=disable_self_attn,
checkpoint=use_checkpoint,
injection=injection) for d in range(depth)
])
if not use_linear:
self.proj_out = zero_module(nn.Conv2d(inner_dim, in_channels, kernel_size=1, stride=1, padding=0))
else:
self.proj_out = zero_module(nn.Linear(inner_dim, in_channels))
self.use_linear = use_linear
def forward(self, x, context=None, **kwargs):
b, c, h, w = x.shape
x_in = x
x = self.norm(x)
if not self.use_linear:
x = self.proj_in(x)
x = rearrange(x, 'b c h w -> b (h w) c').contiguous()
if self.use_linear:
x = self.proj_in(x)
for i, block in enumerate(self.transformer_blocks):
x = block(x, context=context, **kwargs)
if self.use_linear:
x = self.proj_out(x)
x = rearrange(x, 'b (h w) c -> b c h w', h=h, w=w).contiguous()
if not self.use_linear:
x = self.proj_out(x)
return x + x_in
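The `'b c h w -> b (h w) c'` rearrange above turns each spatial position into a token carrying its channel vector. A tiny pure-Python sketch of the same index bookkeeping (illustrative only, nested lists instead of tensors):

```python
# A 1x2x2x3 "tensor" as nested lists: b=1, c=2, h=2, w=3
x = [[[[1, 2, 3],
       [4, 5, 6]],
      [[7, 8, 9],
       [10, 11, 12]]]]
b, c, h, w = 1, 2, 2, 3

# 'b c h w -> b (h w) c': one token per spatial position, each a channel vector
flat = [[[x[bi][ci][hi][wi] for ci in range(c)]
         for hi in range(h) for wi in range(w)]
        for bi in range(b)]

assert flat[0][0] == [1, 7]   # channel vector of the top-left pixel
assert len(flat[0]) == h * w  # h*w tokens per batch element
```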
class TemporalTransformer(nn.Module):
"""
Transformer block for image-like data in temporal axis.
First, reshape to b, t, d.
Then apply standard transformer action.
Finally, reshape to image
"""
def __init__(self, in_channels, n_heads, d_head, depth=1, dropout=0., context_dim=None,
use_checkpoint=True, use_linear=False, only_self_att=True, causal_attention=False,
relative_position=False, temporal_length=None, injection=False):
super().__init__()
self.only_self_att = only_self_att
self.relative_position = relative_position
self.causal_attention = causal_attention
self.in_channels = in_channels
inner_dim = n_heads * d_head
self.norm = torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True)
        if not use_linear:
            self.proj_in = nn.Conv1d(in_channels, inner_dim, kernel_size=1, stride=1, padding=0)
        else:
            self.proj_in = nn.Linear(in_channels, inner_dim)
if relative_position:
assert(temporal_length is not None)
attention_cls = partial(CrossAttention, relative_position=True, temporal_length=temporal_length)
else:
attention_cls = partial(CrossAttention, temporal_length=temporal_length)
if self.causal_attention:
assert(temporal_length is not None)
self.mask = torch.tril(torch.ones([1, temporal_length, temporal_length]))
if self.only_self_att:
context_dim = None
self.transformer_blocks = nn.ModuleList([
BasicTransformerBlock(
inner_dim,
n_heads,
d_head,
dropout=dropout,
context_dim=context_dim,
attention_cls=attention_cls,
checkpoint=use_checkpoint,
injection=injection) for d in range(depth)
])
if not use_linear:
self.proj_out = zero_module(nn.Conv1d(inner_dim, in_channels, kernel_size=1, stride=1, padding=0))
else:
self.proj_out = zero_module(nn.Linear(inner_dim, in_channels))
self.use_linear = use_linear
def forward(self, x, context=None, **kwargs):
b, c, t, h, w = x.shape
x_in = x
x = self.norm(x)
x = rearrange(x, 'b c t h w -> (b h w) c t').contiguous()
if not self.use_linear:
x = self.proj_in(x)
x = rearrange(x, 'bhw c t -> bhw t c').contiguous()
if self.use_linear:
x = self.proj_in(x)
if self.causal_attention:
mask = self.mask.to(x.device)
mask = repeat(mask, 'l i j -> (l bhw) i j', bhw=b*h*w)
else:
mask = None
if self.only_self_att:
## note: if no context is given, cross-attention defaults to self-attention
for i, block in enumerate(self.transformer_blocks):
x = block(x, mask=mask, **kwargs)
x = rearrange(x, '(b hw) t c -> b hw t c', b=b).contiguous()
else:
x = rearrange(x, '(b hw) t c -> b hw t c', b=b).contiguous()
context = rearrange(context, '(b t) l con -> b t l con', t=t).contiguous()
for i, block in enumerate(self.transformer_blocks):
                # process each batch element separately (some kernels cannot handle a dimension larger than 65,535)
for j in range(b):
context_j = repeat(
context[j],
't l con -> (t r) l con', r=(h * w) // t, t=t).contiguous()
                    ## note: the causal mask is not applied in the cross-attention case
x[j] = block(x[j], context=context_j, **kwargs)
if self.use_linear:
x = self.proj_out(x)
x = rearrange(x, 'b (h w) t c -> b c t h w', h=h, w=w).contiguous()
if not self.use_linear:
x = rearrange(x, 'b hw t c -> (b hw) c t').contiguous()
x = self.proj_out(x)
x = rearrange(x, '(b h w) c t -> b c t h w', b=b, h=h, w=w).contiguous()
return x + x_in
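For reference, the causal mask registered in `__init__` above is just a lower-triangular ones matrix over `temporal_length`. A minimal stand-alone sketch (plain lists instead of `torch.tril`):

```python
def tril_mask(t):
    # lower-triangular ones: frame i may attend to frames 0..i
    return [[1 if j <= i else 0 for j in range(t)] for i in range(t)]

m = tril_mask(4)
assert m[0] == [1, 0, 0, 0]  # first frame sees only itself
assert m[3] == [1, 1, 1, 1]  # last frame sees all earlier frames
```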
class GEGLU(nn.Module):
def __init__(self, dim_in, dim_out):
super().__init__()
self.proj = nn.Linear(dim_in, dim_out * 2)
def forward(self, x):
x, gate = self.proj(x).chunk(2, dim=-1)
return x * F.gelu(gate)
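The gating in `GEGLU` splits the doubled projection into a value half and a gate half, and multiplies them after a GELU. A dependency-free sketch of that elementwise rule, using the exact erf form of GELU (which matches `F.gelu`'s default):

```python
import math

def gelu(v):
    # exact (erf-based) GELU
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def geglu(values, gates):
    # one half of the projection gates the other half through GELU
    return [x * gelu(g) for x, g in zip(values, gates)]

out = geglu([1.0, 2.0], [0.0, 10.0])  # gate 0 blocks fully, gate 10 passes almost fully
```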
class FeedForward(nn.Module):
def __init__(self, dim, dim_out=None, mult=4, glu=False, dropout=0.):
super().__init__()
inner_dim = int(dim * mult)
dim_out = default(dim_out, dim)
project_in = nn.Sequential(
nn.Linear(dim, inner_dim),
nn.GELU()
) if not glu else GEGLU(dim, inner_dim)
self.net = nn.Sequential(
project_in,
nn.Dropout(dropout),
nn.Linear(inner_dim, dim_out)
)
def forward(self, x):
return self.net(x)
class LinearAttention(nn.Module):
def __init__(self, dim, heads=4, dim_head=32):
super().__init__()
self.heads = heads
hidden_dim = dim_head * heads
self.to_qkv = nn.Conv2d(dim, hidden_dim * 3, 1, bias = False)
self.to_out = nn.Conv2d(hidden_dim, dim, 1)
def forward(self, x):
b, c, h, w = x.shape
qkv = self.to_qkv(x)
q, k, v = rearrange(qkv, 'b (qkv heads c) h w -> qkv b heads c (h w)', heads = self.heads, qkv=3)
k = k.softmax(dim=-1)
context = torch.einsum('bhdn,bhen->bhde', k, v)
out = torch.einsum('bhde,bhdn->bhen', context, q)
out = rearrange(out, 'b heads c (h w) -> b (heads c) h w', heads=self.heads, h=h, w=w)
return self.to_out(out)
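What makes `LinearAttention` linear in sequence length is the einsum order: it contracts `K^T V` into a small d-by-d context before multiplying by `Q`, instead of materializing the n-by-n map `Q K^T`. A small pure-Python illustration of that associativity (the softmax over k is omitted for brevity):

```python
def matmul(A, B):
    # plain matrix multiply over nested lists
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# n=3 tokens, d=2 dims
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
KT = [list(col) for col in zip(*K)]

left = matmul(matmul(Q, KT), V)   # (Q K^T) V: materializes an n x n matrix
right = matmul(Q, matmul(KT, V))  # Q (K^T V): materializes only a d x d matrix
assert all(abs(a - b) < 1e-9 for ra, rb in zip(left, right) for a, b in zip(ra, rb))
```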
class SpatialSelfAttention(nn.Module):
def __init__(self, in_channels):
super().__init__()
self.in_channels = in_channels
self.norm = torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True)
self.q = torch.nn.Conv2d(in_channels,
in_channels,
kernel_size=1,
stride=1,
padding=0)
self.k = torch.nn.Conv2d(in_channels,
in_channels,
kernel_size=1,
stride=1,
padding=0)
self.v = torch.nn.Conv2d(in_channels,
in_channels,
kernel_size=1,
stride=1,
padding=0)
self.proj_out = torch.nn.Conv2d(in_channels,
in_channels,
kernel_size=1,
stride=1,
padding=0)
def forward(self, x, **kwargs):
h_ = x
h_ = self.norm(h_)
q = self.q(h_)
k = self.k(h_)
v = self.v(h_)
# compute attention
b,c,h,w = q.shape
q = rearrange(q, 'b c h w -> b (h w) c')
k = rearrange(k, 'b c h w -> b c (h w)')
w_ = torch.einsum('bij,bjk->bik', q, k)
w_ = w_ * (int(c)**(-0.5))
w_ = torch.nn.functional.softmax(w_, dim=2)
# attend to values
v = rearrange(v, 'b c h w -> b c (h w)')
w_ = rearrange(w_, 'b i j -> b j i')
h_ = torch.einsum('bij,bjk->bik', v, w_)
h_ = rearrange(h_, 'b c (h w) -> b c h w', h=h)
h_ = self.proj_out(h_)
return x+h_
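The `c ** -0.5` factor above keeps the attention logits from saturating the softmax as the channel count grows. A quick scalar illustration of that effect:

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

c = 64                      # channel count; the scale is c ** -0.5
logits = [8.0, 0.0, -8.0]
unscaled = softmax(logits)
scaled = softmax([v * c ** -0.5 for v in logits])

assert max(unscaled) > 0.99          # raw logits are nearly one-hot
assert max(scaled) < max(unscaled)   # scaled logits keep the distribution softer
```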
================================================
FILE: lvdm/modules/encoders/condition.py
================================================
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
import kornia
import open_clip
from transformers import T5Tokenizer, T5EncoderModel, CLIPTokenizer, CLIPTextModel
from lvdm.common import autocast
from utils.utils import count_params
class AbstractEncoder(nn.Module):
def __init__(self):
super().__init__()
def encode(self, *args, **kwargs):
raise NotImplementedError
class IdentityEncoder(AbstractEncoder):
def encode(self, x):
return x
class ClassEmbedder(nn.Module):
def __init__(self, embed_dim, n_classes=1000, key='class', ucg_rate=0.1):
super().__init__()
self.key = key
self.embedding = nn.Embedding(n_classes, embed_dim)
self.n_classes = n_classes
self.ucg_rate = ucg_rate
def forward(self, batch, key=None, disable_dropout=False):
if key is None:
key = self.key
# this is for use in crossattn
c = batch[key][:, None]
if self.ucg_rate > 0. and not disable_dropout:
mask = 1. - torch.bernoulli(torch.ones_like(c) * self.ucg_rate)
c = mask * c + (1 - mask) * torch.ones_like(c) * (self.n_classes - 1)
c = c.long()
c = self.embedding(c)
return c
def get_unconditional_conditioning(self, bs, device="cuda"):
uc_class = self.n_classes - 1 # 1000 classes --> 0 ... 999, one extra class for ucg (class 1000)
uc = torch.ones((bs,), device=device) * uc_class
uc = {self.key: uc}
return uc
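The `ucg_rate` logic in `ClassEmbedder.forward` implements classifier-free-guidance label dropout: each label is swapped for the extra "unconditional" class id (`n_classes - 1`) with probability `ucg_rate`. A minimal sketch of the same rule without tensors:

```python
import random
random.seed(0)

n_classes, ucg_rate = 1000, 0.1
labels = [3, 7, 42, 512]
# with probability ucg_rate, replace a label by the extra "unconditional" class id
dropped = [c if random.random() >= ucg_rate else n_classes - 1 for c in labels]
assert all(d == c or d == n_classes - 1 for d, c in zip(dropped, labels))
```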
def disabled_train(self, mode=True):
"""Overwrite model.train with this function to make sure train/eval mode
does not change anymore."""
return self
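`disabled_train` is meant to be bound over a frozen module's `.train` attribute so that later `.train()` calls cannot flip it back into training mode. A self-contained sketch of that binding trick (the `Toy` class is illustrative, not from the codebase):

```python
class Toy:
    def __init__(self):
        self.training = True

    def train(self, mode=True):
        self.training = mode
        return self

def disabled_train(self, mode=True):
    # ignores the requested mode, so later .train() calls are no-ops
    return self

m = Toy()
m.training = False                   # put the "module" in eval mode
m.train = disabled_train.__get__(m)  # bind over the instance's .train
m.train(True)                        # would normally re-enable training
assert m.training is False
```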
class FrozenT5Embedder(AbstractEncoder):
"""Uses the T5 transformer encoder for text"""
def __init__(self, version="google/t5-v1_1-large", device="cuda", max_length=77,
freeze=True): # others are google/t5-v1_1-xl and google/t5-v1_1-xxl
super().__init__()
self.tokenizer = T5Tokenizer.from_pretrained(version)
self.transformer = T5EncoderModel.from_pretrained(version)
self.device = device
self.max_length = max_length # TODO: typical value?
if freeze:
self.freeze()
def freeze(self):
self.transformer = self.transformer.eval()
# self.train = disabled_train
for param in self.parameters():
param.requires_grad = False
def forward(self, text):
batch_encoding = self.tokenizer(text, truncation=True, max_length=self.max_length, return_length=True,
return_overflowing_tokens=False, padding="max_length", return_tensors="pt")
tokens = batch_encoding["input_ids"].to(self.device)
outputs = self.transformer(input_ids=tokens)
z = outputs.last_hidden_state
return z
def encode(self, text):
return self(text)
class FrozenCLIPEmbedder(AbstractEncoder):
"""Uses the CLIP transformer encoder for text (from huggingface)"""
LAYERS = [
"last",
"pooled",
"hidden"
]
def __init__(self, version="openai/clip-vit-large-patch14", device="cuda", max_length=77,
freeze=True, layer="last", layer_idx=None): # clip-vit-base-patch32
super().__init__()
assert layer in self.LAYERS
self.tokenizer = CLIPTokenizer.from_pretrained(version)
self.transformer = CLIPTextModel.from_pretrained(version)
self.device = device
self.max_length = max_length
if freeze:
self.freeze()
self.layer = layer
self.layer_idx = layer_idx
if layer == "hidden":
assert layer_idx is not None
assert 0 <= abs(layer_idx) <= 12
def freeze(self):
self.transformer = self.transformer.eval()
# self.train = disabled_train
for param in self.parameters():
param.requires_grad = False
def forward(self, text):
batch_encoding = self.tokenizer(text, truncation=True, max_length=self.max_length, return_length=True,
return_overflowing_tokens=False, padding="max_length", return_tensors="pt")
tokens = batch_encoding["input_ids"].to(self.device)
outputs = self.transformer(input_ids=tokens, output_hidden_states=self.layer == "hidden")
if self.layer == "last":
z = outputs.last_hidden_state
elif self.layer == "pooled":
z = outputs.pooler_output[:, None, :]
else:
z = outputs.hidden_states[self.layer_idx]
return z
def encode(self, text):
return self(text)
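Both text encoders above rely on the tokenizer's `padding="max_length", truncation=True` contract: every prompt becomes exactly `max_length` (77) token ids. The invariant, sketched without `transformers` (pad id 0 is a placeholder, not CLIP's real pad token):

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    # truncate long sequences, right-pad short ones to a fixed length
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

assert len(pad_or_truncate(list(range(100)), 77)) == 77
assert pad_or_truncate([1, 2], 5) == [1, 2, 0, 0, 0]
```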
class ClipImageEmbedder(nn.Module):
def __init__(
self,
model,
jit=False,
device='cuda' if torch.cuda.is_available() else 'cpu',
antialias=True,
ucg_rate=0.
):
super().__init__()
from clip import load as load_clip
self.model, _ = load_clip(name=model, device=device, jit=jit)
self.antialias = antialias
self.register_buffer('mean', torch.Tensor([0.48145466, 0.4578275, 0.40821073]), persistent=False)
self.register_buffer('std', torch.Tensor([0.26862954, 0.26130258, 0.27577711]), persistent=False)
self.ucg_rate = ucg_rate
def preprocess(self, x):
# normalize to [0,1]
x = kornia.geometry.resize(x, (224, 224),
interpolation='bicubic', align_corners=True,
antialias=self.antialias)
x = (x + 1.) / 2.
# re-normalize according to clip
x = kornia.enhance.normalize(x, self.mean, self.std)
return x
def forward(self, x, no_dropout=False):
# x is assumed to be in range [-1,1]
out = self.model.encode_image(self.preprocess(x))
out = out.to(x.dtype)
if self.ucg_rate > 0. and not no_dropout:
out = torch.bernoulli((1. - self.ucg_rate) * torch.ones(out.shape[0], device=out.device))[:, None] * out
return out
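`preprocess` maps the network-range pixels in [-1, 1] to [0, 1] and then standardizes with CLIP's mean/std. The per-pixel arithmetic in isolation:

```python
mean = [0.48145466, 0.4578275, 0.40821073]   # CLIP channel means
std = [0.26862954, 0.26130258, 0.27577711]   # CLIP channel stds

def normalize_pixel(rgb):
    # [-1, 1] -> [0, 1], then standardize per channel
    return [((v + 1.0) / 2.0 - m) / s for v, m, s in zip(rgb, mean, std)]

out = normalize_pixel([0.0, 0.0, 0.0])  # mid-gray input lands near zero
assert abs(out[0] - 0.0690) < 1e-3
```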
class FrozenOpenCLIPEmbedder(AbstractEncoder):
"""
Uses the OpenCLIP transformer encoder for text
"""
LAYERS = [
# "pooled",
"last",
"penultimate"
]
def __init__(self, arch="ViT-H-14", version="laion2b_s32b_b79k", device="cuda", max_length=77,
freeze=True, layer="last"):
super().__init__()
assert layer in self.LAYERS
        model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'), pretrained=version)
del model.visual
self.model = model
self.device = device
self.max_length = max_length
if freeze:
self.freeze()
self.layer = layer
if self.layer == "last":
self.layer_idx = 0
elif self.layer == "penultimate":
self.layer_idx = 1
else:
raise NotImplementedError()
def freeze(self):
self.model = self.model.eval()
for param in self.parameters():
param.requires_grad = False
def forward(self, text):
self.device = self.model.positional_embedding.device
tokens = open_clip.tokenize(text)
z = self.encode_with_transformer(tokens.to(self.device))
return z
def encode_with_transformer(self, text):
x = self.model.token_embedding(text) # [batch_size, n_ctx, d_model]
x = x + self.model.positional_embedding
x = x.permute(1, 0, 2) # NLD -> LND
x = self.text_transformer_forward(x, attn_mask=self.model.attn_mask)
x = x.permute(1, 0, 2) # LND -> NLD
x = self.model.ln_final(x)
return x
def text_transformer_forward(self, x: torch.Tensor, attn_mask=None):
for i, r in enumerate(self.model.transformer.resblocks):
if i == len(self.model.transformer.resblocks) - self.layer_idx:
break
if self.model.transformer.grad_checkpointing and not torch.jit.is_scripting():
x = checkpoint(r, x, attn_mask)
else:
x = r(x, attn_mask=attn_mask)
return x
def encode(self, text):
return self(text)
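`text_transformer_forward` realizes the "penultimate" option by breaking out of the resblock loop `layer_idx` blocks before the end. The loop-exit arithmetic, sketched in isolation:

```python
def run_blocks(n_blocks, layer_idx):
    # mirror of the break condition: stop layer_idx blocks before the end
    executed = []
    for i in range(n_blocks):
        if i == n_blocks - layer_idx:
            break
        executed.append(i)
    return executed

assert run_blocks(24, 0) == list(range(24))  # "last": every block runs
assert run_blocks(24, 1) == list(range(23))  # "penultimate": final block skipped
```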
class FrozenOpenCLIPImageEmbedder(AbstractEncoder):
"""
Uses the OpenCLIP vision transformer encoder for images
"""
def __init__(self, arch="ViT-H-14", version="laion2b_s32b_b79k", device="cuda", max_length=77,
freeze=True, layer="pooled", antialias=True, ucg_rate=0.):
super().__init__()
model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'),
pretrained=version, )
del model.transformer
self.model = model
self.device = device
self.max_length = max_length
if freeze:
self.freeze()
self.layer = layer
if self.layer == "penultimate":
raise NotImplementedError()
self.layer_idx = 1
self.antialias = antialias
self.register_buffer('mean', torch.Tensor([0.48145466, 0.4578275, 0.40821073]), persistent=False)
self.register_buffer('std', torch.Tensor([0.26862954, 0.26130258, 0.27577711]), persistent=False)
self.ucg_rate = ucg_rate
def preprocess(self, x):
# normalize to [0,1]
x = kornia.geometry.resize(x, (224, 224),
interpolation='bicubic', align_corners=True,
antialias=self.antialias)
x = (x + 1.) / 2.
# renormalize according to clip
x = kornia.enhance.normalize(x, self.mean, self.std)
return x
def freeze(self):
self.model = self.model.eval()
for param in self.parameters():
param.requires_grad = False
@autocast
def forward(self, image, no_dropout=False):
z = self.encode_with_vision_transformer(image)
if self.ucg_rate > 0. and not no_dropout:
z = torch.bernoulli((1. - self.ucg_rate) * torch.ones(z.shape[0], device=z.device))[:, None] * z
return z
def encode_with_vision_transformer(self, img):
img = self.preprocess(img)
x = self.model.visual(img)
return x
def encode(self, text):
return self(text)
class FrozenOpenCLIPImageEmbedderV2(AbstractEncoder):
"""
Uses the OpenCLIP vision transformer encoder for images
"""
def __init__(self, arch="ViT-H-14", version="laion2b_s32b_b79k", device="cuda",
freeze=True, layer="pooled", antialias=True):
super().__init__()
model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'),
pretrained=version, )
del model.transformer
self.model = model
self.device = device
if freeze:
self.freeze()
self.layer = layer
if self.layer == "penultimate":
raise NotImplementedError()
self.layer_idx = 1
self.antialias = antialias
self.register_buffer('mean', torch.Tensor([0.48145466, 0.4578275, 0.40821073]), persistent=False)
self.register_buffer('std', torch.Tensor([0.26862954, 0.26130258, 0.27577711]), persistent=False)
def preprocess(self, x):
# normalize to [0,1]
x = kornia.geometry.resize(x, (224, 224),
interpolation='bicubic', align_corners=True,
antialias=self.antialias)
x = (x + 1.) / 2.
# renormalize according to clip
x = kornia.enhance.normalize(x, self.mean, self.std)
return x
def freeze(self):
self.model = self.model.eval()
for param in self.model.parameters():
param.requires_grad = False
def forward(self, image, no_dropout=False):
## image: b c h w
z = self.encode_with_vision_transformer(image)
return z
def encode_with_vision_transformer(self, x):
x = self.preprocess(x)
# to patches - whether to use dual patchnorm - https://arxiv.org/abs/2302.01327v1
if self.model.visual.input_patchnorm:
# einops - rearrange(x, 'b c (h p1) (w p2) -> b (h w) (c p1 p2)')
x = x.reshape(x.shape[0], x.shape[1], self.model.visual.grid_size[0], self.model.visual.patch_size[0], self.model.visual.grid_size[1], self.model.visual.patch_size[1])
x = x.permute(0, 2, 4, 1, 3, 5)
x = x.reshape(x.shape[0], self.model.visual.grid_size[0] * self.model.visual.grid_size[1], -1)
x = self.model.visual.patchnorm_pre_ln(x)
x = self.model.visual.conv1(x)
else:
x = self.model.visual.conv1(x) # shape = [*, width, grid, grid]
x = x.reshape(x.shape[0], x.shape[1], -1) # shape = [*, width, grid ** 2]
x = x.permute(0, 2, 1) # shape = [*, grid ** 2, width]
# class embeddings and positional embeddings
x = torch.cat(
[self.model.visual.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device),
x], dim=1) # shape = [*, grid ** 2 + 1, width]
x = x + self.model.visual.positional_embedding.to(x.dtype)
# a patch_dropout of 0. would mean it is disabled and this function would do nothing but return what was passed in
x = self.model.visual.patch_dropout(x)
x = self.model.visual.ln_pre(x)
x = x.permute(1, 0, 2) # NLD -> LND
x = self.model.visual.transformer(x)
x = x.permute(1, 0, 2) # LND -> NLD
return x
class FrozenCLIPT5Encoder(AbstractEncoder):
def __init__(self, clip_version="openai/clip-vit-large-patch14", t5_version="google/t5-v1_1-xl", device="cuda",
clip_max_length=77, t5_max_length=77):
super().__init__()
self.clip_encoder = FrozenCLIPEmbedder(clip_version, device, max_length=clip_max_length)
self.t5_encoder = FrozenT5Embedder(t5_version, device, max_length=t5_max_length)
print(f"{self.clip_encoder.__class__.__name__} has {count_params(self.clip_encoder) * 1.e-6:.2f} M parameters, "
f"{self.t5_encoder.__class__.__name__} comes with {count_params(self.t5_encoder) * 1.e-6:.2f} M params.")
def encode(self, text):
return self(text)
def forward(self, text):
clip_z = self.clip_encoder.encode(text)
t5_z = self.t5_encoder.encode(text)
return [clip_z, t5_z]
================================================
FILE: lvdm/modules/encoders/ip_resampler.py
================================================
# modified from https://github.com/mlfoundations/open_flamingo/blob/main/open_flamingo/src/helpers.py
import math
import torch
import torch.nn as nn
class ImageProjModel(nn.Module):
"""Projection Model"""
def __init__(self, cross_attention_dim=1024, clip_embeddings_dim=1024, clip_extra_context_tokens=4):
super().__init__()
self.cross_attention_dim = cross_attention_dim
self.clip_extra_context_tokens = clip_extra_context_tokens
self.proj = nn.Linear(clip_embeddings_dim, self.clip_extra_context_tokens * cross_attention_dim)
self.norm = nn.LayerNorm(cross_attention_dim)
def forward(self, image_embeds):
#embeds = image_embeds
embeds = image_embeds.type(list(self.proj.parameters())[0].dtype)
clip_extra_context_tokens = self.proj(embeds).reshape(-1, self.clip_extra_context_tokens, self.cross_attention_dim)
clip_extra_context_tokens = self.norm(clip_extra_context_tokens)
return clip_extra_context_tokens
# FFN
def FeedForward(dim, mult=4):
inner_dim = int(dim * mult)
return nn.Sequential(
nn.LayerNorm(dim),
nn.Linear(dim, inner_dim, bias=False),
nn.GELU(),
nn.Linear(inner_dim, dim, bias=False),
)
def reshape_tensor(x, heads):
bs, length, width = x.shape
#(bs, length, width) --> (bs, length, n_heads, dim_per_head)
x = x.view(bs, length, heads, -1)
# (bs, length, n_heads, dim_per_head) --> (bs, n_heads, length, dim_per_head)
x = x.transpose(1, 2)
# (bs, n_heads, length, dim_per_head) --> (bs*n_heads, length, dim_per_head)
x = x.reshape(bs, heads, length, -1)
return x
class PerceiverAttention(nn.Module):
def __init__(self, *, dim, dim_head=64, heads=8):
super().__init__()
self.scale = dim_head**-0.5
self.dim_head = dim_head
self.heads = heads
inner_dim = dim_head * heads
self.norm1 = nn.LayerNorm(dim)
self.norm2 = nn.LayerNorm(dim)
self.to_q = nn.Linear(dim, inner_dim, bias=False)
self.to_kv = nn.Linear(dim, inner_dim * 2, bias=False)
self.to_out = nn.Linear(inner_dim, dim, bias=False)
def forward(self, x, latents):
"""
Args:
x (torch.Tensor): image features
shape (b, n1, D)
latent (torch.Tensor): latent features
shape (b, n2, D)
"""
x = self.norm1(x)
latents = self.norm2(latents)
b, l, _ = latents.shape
q = self.to_q(latents)
kv_input = torch.cat((x, latents), dim=-2)
k, v = self.to_kv(kv_input).chunk(2, dim=-1)
q = reshape_tensor(q, self.heads)
k = reshape_tensor(k, self.heads)
v = reshape_tensor(v, self.heads)
# attention
scale = 1 / math.sqrt(math.sqrt(self.dim_head))
weight = (q * scale) @ (k * scale).transpose(-2, -1) # More stable with f16 than dividing afterwards
weight = torch.softmax(weight.float(), dim=-1).type(weight.dtype)
out = weight @ v
out = out.permute(0, 2, 1, 3).reshape(b, l, -1)
return self.to_out(out)
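The `1 / sqrt(sqrt(dim_head))` factor above, applied to q and k separately, is algebraically identical to dividing the logits by `sqrt(dim_head)` afterwards, but keeps intermediate values smaller, which is friendlier in fp16. A scalar check:

```python
import math

d = 64                                 # dim_head
scale = 1 / math.sqrt(math.sqrt(d))    # fourth root, applied to q and k separately
q, k = 3.0, 5.0
# (q * scale) * (k * scale) == (q * k) / sqrt(d): same logits, smaller intermediates
assert abs((q * scale) * (k * scale) - (q * k) / math.sqrt(d)) < 1e-12
```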
class Resampler(nn.Module):
def __init__(
self,
dim=1024,
depth=8,
dim_head=64,
heads=16,
num_queries=8,
embedding_dim=768,
output_dim=1024,
ff_mult=4,
):
super().__init__()
self.latents = nn.Parameter(torch.randn(1, num_queries, dim) / dim**0.5)
self.proj_in = nn.Linear(embedding_dim, dim)
self.proj_out = nn.Linear(dim, output_dim)
self.norm_out = nn.LayerNorm(output_dim)
self.layers = nn.ModuleList([])
for _ in range(depth):
self.layers.append(
nn.ModuleList(
[
PerceiverAttention(dim=dim, dim_head=dim_head, heads=heads),
FeedForward(dim=dim, mult=ff_mult),
]
)
)
def forward(self, x):
latents = self.latents.repeat(x.size(0), 1, 1)
x = self.proj_in(x)
for attn, ff in self.layers:
latents = attn(x, latents) + latents
latents = ff(latents) + latents
latents = self.proj_out(latents)
return self.norm_out(latents)
================================================
FILE: lvdm/modules/networks/ae_modules.py
================================================
# pytorch_diffusion + derived encoder decoder
import math
import torch
import numpy as np
import torch.nn as nn
from einops import rearrange
from utils.utils import instantiate_from_config
from lvdm.modules.attention import LinearAttention
def nonlinearity(x):
# swish
return x*torch.sigmoid(x)
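`nonlinearity` is swish/SiLU, `x * sigmoid(x)`. In scalar form, it vanishes at the origin and approaches the identity for large inputs:

```python
import math

def swish(v):
    # x * sigmoid(x), i.e. SiLU
    return v / (1.0 + math.exp(-v))

assert abs(swish(0.0)) < 1e-12         # vanishes at the origin
assert abs(swish(10.0) - 10.0) < 1e-3  # close to identity for large inputs
```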
def Normalize(in_channels, num_groups=32):
return torch.nn.GroupNorm(num_groups=num_groups, num_channels=in_channels, eps=1e-6, affine=True)
class LinAttnBlock(LinearAttention):
"""to match AttnBlock usage"""
def __init__(self, in_channels):
super().__init__(dim=in_channels, heads=1, dim_head=in_channels)
class AttnBlock(nn.Module):
def __init__(self, in_channels):
super().__init__()
self.in_channels = in_channels
self.norm = Normalize(in_channels)
self.q = torch.nn.Conv2d(in_channels,
in_channels,
kernel_size=1,
stride=1,
padding=0)
self.k = torch.nn.Conv2d(in_channels,
in_channels,
kernel_size=1,
stride=1,
padding=0)
self.v = torch.nn.Conv2d(in_channels,
in_channels,
kernel_size=1,
stride=1,
padding=0)
self.proj_out = torch.nn.Conv2d(in_channels,
in_channels,
kernel_size=1,
stride=1,
padding=0)
def forward(self, x):
h_ = x
h_ = self.norm(h_)
q = self.q(h_)
k = self.k(h_)
v = self.v(h_)
# compute attention
b,c,h,w = q.shape
q = q.reshape(b,c,h*w) # bcl
q = q.permute(0,2,1) # bcl -> blc l=hw
k = k.reshape(b,c,h*w) # bcl
w_ = torch.bmm(q,k) # b,hw,hw w[b,i,j]=sum_c q[b,i,c]k[b,c,j]
w_ = w_ * (int(c)**(-0.5))
w_ = torch.nn.functional.softmax(w_, dim=2)
# attend to values
v = v.reshape(b,c,h*w)
w_ = w_.permute(0,2,1) # b,hw,hw (first hw of k, second of q)
h_ = torch.bmm(v,w_) # b, c,hw (hw of q) h_[b,c,j] = sum_i v[b,c,i] w_[b,i,j]
h_ = h_.reshape(b,c,h,w)
h_ = self.proj_out(h_)
return x+h_
def make_attn(in_channels, attn_type="vanilla"):
assert attn_type in ["vanilla", "linear", "none"], f'attn_type {attn_type} unknown'
#print(f"making attention of type '{attn_type}' with {in_channels} in_channels")
if attn_type == "vanilla":
return AttnBlock(in_channels)
elif attn_type == "none":
return nn.Identity(in_channels)
else:
return LinAttnBlock(in_channels)
class Downsample(nn.Module):
def __init__(self, in_channels, with_conv):
super().__init__()
self.with_conv = with_conv
self.in_channels = in_channels
if self.with_conv:
# no asymmetric padding in torch conv, must do it ourselves
self.conv = torch.nn.Conv2d(in_channels,
in_channels,
kernel_size=3,
stride=2,
padding=0)
def forward(self, x):
if self.with_conv:
pad = (0,1,0,1)
x = torch.nn.functional.pad(x, pad, mode="constant", value=0)
x = self.conv(x)
else:
x = torch.nn.functional.avg_pool2d(x, kernel_size=2, stride=2)
return x
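With the asymmetric `(0, 1, 0, 1)` padding, the stride-2, kernel-3 conv above halves each spatial side exactly. The output-size arithmetic, where `pad` counts the single pixel added on the right/bottom:

```python
def downsampled(size, kernel=3, stride=2, pad=1):
    # standard conv output size with one pixel of padding on one side
    return (size + pad - kernel) // stride + 1

assert downsampled(64) == 32
assert downsampled(256) == 128
```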
class Upsample(nn.Module):
def __init__(self, in_channels, with_conv):
super().__init__()
self.with_conv = with_conv
self.in_channels = in_channels
if self.with_conv:
self.conv = torch.nn.Conv2d(in_channels,
in_channels,
kernel_size=3,
stride=1,
padding=1)
def forward(self, x):
x = torch.nn.functional.interpolate(x, scale_factor=2.0, mode="nearest")
if self.with_conv:
x = self.conv(x)
return x
def get_timestep_embedding(timesteps, embedding_dim):
"""
This matches the implementation in Denoising Diffusion Probabilistic Models:
From Fairseq.
Build sinusoidal embeddings.
This matches the implementation in tensor2tensor, but differs slightly
from the description in Section 3.5 of "Attention Is All You Need".
"""
assert len(timesteps.shape) == 1
half_dim = embedding_dim // 2
emb = math.log(10000) / (half_dim - 1)
emb = torch.exp(torch.arange(half_dim, dtype=torch.float32) * -emb)
emb = emb.to(device=timesteps.device)
emb = timesteps.float()[:, None] * emb[None, :]
emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1)
if embedding_dim % 2 == 1: # zero pad
emb = torch.nn.functional.pad(emb, (0,1,0,0))
return emb
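A stand-alone scalar version of the sinusoidal embedding above, restricted for clarity to a single timestep and an even `embedding_dim`:

```python
import math

def timestep_embedding(t, dim):
    # single-timestep, even-dim version of get_timestep_embedding
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / (half - 1)) for i in range(half)]
    args = [t * f for f in freqs]
    return [math.sin(a) for a in args] + [math.cos(a) for a in args]

emb = timestep_embedding(0, 8)
assert emb[:4] == [0.0, 0.0, 0.0, 0.0]  # sin(0) = 0
assert emb[4:] == [1.0, 1.0, 1.0, 1.0]  # cos(0) = 1
```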
class ResnetBlock(nn.Module):
def __init__(self, *, in_channels, out_channels=None, conv_shortcut=False,
dropout, temb_channels=512):
super().__init__()
self.in_channels = in_channels
out_channels = in_channels if out_channels is None else out_channels
self.out_channels = out_channels
self.use_conv_shortcut = conv_shortcut
self.norm1 = Normalize(in_channels)
self.conv1 = torch.nn.Conv2d(in_channels,
out_channels,
kernel_size=3,
stride=1,
padding=1)
if temb_channels > 0:
self.temb_proj = torch.nn.Linear(temb_channels,
out_channels)
self.norm2 = Normalize(out_channels)
self.dropout = torch.nn.Dropout(dropout)
self.conv2 = torch.nn.Conv2d(out_channels,
out_channels,
kernel_size=3,
stride=1,
padding=1)
if self.in_channels != self.out_channels:
if self.use_conv_shortcut:
self.conv_shortcut = torch.nn.Conv2d(in_channels,
out_channels,
kernel_size=3,
stride=1,
padding=1)
else:
self.nin_shortcut = torch.nn.Conv2d(in_channels,
out_channels,
kernel_size=1,
stride=1,
padding=0)
def forward(self, x, temb):
h = x
h = self.norm1(h)
h = nonlinearity(h)
h = self.conv1(h)
if temb is not None:
h = h + self.temb_proj(nonlinearity(temb))[:,:,None,None]
h = self.norm2(h)
h = nonlinearity(h)
h = self.dropout(h)
h = self.conv2(h)
if self.in_channels != self.out_channels:
if self.use_conv_shortcut:
x = self.conv_shortcut(x)
else:
x = self.nin_shortcut(x)
return x+h
class Model(nn.Module):
def __init__(self, *, ch, out_ch, ch_mult=(1,2,4,8), num_res_blocks,
attn_resolutions, dropout=0.0, resamp_with_conv=True, in_channels,
resolution, use_timestep=True, use_linear_attn=False, attn_type="vanilla"):
super().__init__()
if use_linear_attn: attn_type = "linear"
self.ch = ch
self.temb_ch = self.ch*4
self.num_resolutions = len(ch_mult)
self.num_res_blocks = num_res_blocks
self.resolution = resolution
self.in_channels = in_channels
self.use_timestep = use_timestep
if self.use_timestep:
# timestep embedding
self.temb = nn.Module()
self.temb.dense = nn.ModuleList([
torch.nn.Linear(self.ch,
self.temb_ch),
torch.nn.Linear(self.temb_ch,
self.temb_ch),
])
# downsampling
self.conv_in = torch.nn.Conv2d(in_channels,
self.ch,
kernel_size=3,
stride=1,
padding=1)
curr_res = resolution
in_ch_mult = (1,)+tuple(ch_mult)
self.down = nn.ModuleList()
for i_level in range(self.num_resolutions):
block = nn.ModuleList()
attn = nn.ModuleList()
block_in = ch*in_ch_mult[i_level]
block_out = ch*ch_mult[i_level]
for i_block in range(self.num_res_blocks):
block.append(ResnetBlock(in_channels=block_in,
out_channels=block_out,
temb_channels=self.temb_ch,
dropout=dropout))
block_in = block_out
if curr_res in attn_resolutions:
attn.append(make_attn(block_in, attn_type=attn_type))
down = nn.Module()
down.block = block
down.attn = attn
if i_level != self.num_resolutions-1:
down.downsample = Downsample(block_in, resamp_with_conv)
curr_res = curr_res // 2
self.down.append(down)
# middle
self.mid = nn.Module()
self.mid.block_1 = ResnetBlock(in_channels=block_in,
out_channels=block_in,
temb_channels=self.temb_ch,
dropout=dropout)
self.mid.attn_1 = make_attn(block_in, attn_type=attn_type)
self.mid.block_2 = ResnetBlock(in_channels=block_in,
out_channels=block_in,
temb_channels=self.temb_ch,
dropout=dropout)
# upsampling
self.up = nn.ModuleList()
for i_level in reversed(range(self.num_resolutions)):
block = nn.ModuleList()
attn = nn.ModuleList()
block_out = ch*ch_mult[i_level]
skip_in = ch*ch_mult[i_level]
for i_block in range(self.num_res_blocks+1):
if i_block == self.num_res_blocks:
skip_in = ch*in_ch_mult[i_level]
block.append(ResnetBlock(in_channels=block_in+skip_in,
out_channels=block_out,
temb_channels=self.temb_ch,
dropout=dropout))
block_in = block_out
if curr_res in attn_resolutions:
attn.append(make_attn(block_in, attn_type=attn_type))
up = nn.Module()
up.block = block
up.attn = attn
if i_level != 0:
up.upsample = Upsample(block_in, resamp_with_conv)
curr_res = curr_res * 2
self.up.insert(0, up) # prepend to get consistent order
# end
self.norm_out = Normalize(block_in)
self.conv_out = torch.nn.Conv2d(block_in,
out_ch,
kernel_size=3,
                                        stride=1,
                                        padding=1)
gitextract_8we83d97/
├── LICENSE
├── README.md
├── cog.yaml
├── configs/
│ ├── inference_t2v_1024_v1.0.yaml
│ ├── inference_t2v_1024_v1.0_freenoise.yaml
│ ├── inference_t2v_tconv256_v1.0.yaml
│ ├── inference_t2v_tconv256_v1.0_freenoise.yaml
│ ├── inference_t2v_tconv512_v2.0.yaml
│ └── inference_t2v_tconv512_v2.0_freenoise.yaml
├── lvdm/
│ ├── basics.py
│ ├── common.py
│ ├── distributions.py
│ ├── ema.py
│ ├── models/
│ │ ├── autoencoder.py
│ │ ├── ddpm3d.py
│ │ ├── samplers/
│ │ │ ├── ddim.py
│ │ │ └── ddim_mp.py
│ │ └── utils_diffusion.py
│ └── modules/
│ ├── attention.py
│ ├── attention_freenoise.py
│ ├── encoders/
│ │ ├── condition.py
│ │ └── ip_resampler.py
│ ├── networks/
│ │ ├── ae_modules.py
│ │ ├── openaimodel3d.py
│ │ └── openaimodel3d_freenoise.py
│ └── x_transformer.py
├── predict.py
├── prompts/
│ ├── mp_prompts.txt
│ └── single_prompts.txt
├── requirements.txt
├── scripts/
│ ├── evaluation/
│ │ ├── ddp_wrapper.py
│ │ ├── funcs.py
│ │ ├── inference.py
│ │ ├── inference_freenoise.py
│ │ └── inference_freenoise_mp.py
│ ├── run_text2video.sh
│ ├── run_text2video_freenoise_1024.sh
│ ├── run_text2video_freenoise_256.sh
│ ├── run_text2video_freenoise_512.sh
│ ├── run_text2video_freenoise_mp_256.sh
│ └── run_text2video_freenoise_mp_512.sh
└── utils/
└── utils.py
SYMBOL INDEX (430 symbols across 24 files)
FILE: lvdm/basics.py
function disabled_train (line 14) | def disabled_train(self, mode=True):
function zero_module (line 19) | def zero_module(module):
function scale_module (line 27) | def scale_module(module, scale):
function conv_nd (line 36) | def conv_nd(dims, *args, **kwargs):
function linear (line 49) | def linear(*args, **kwargs):
function avg_pool_nd (line 56) | def avg_pool_nd(dims, *args, **kwargs):
function nonlinearity (line 69) | def nonlinearity(type='silu'):
class GroupNormSpecific (line 76) | class GroupNormSpecific(nn.GroupNorm):
method forward (line 77) | def forward(self, x):
function normalization (line 81) | def normalization(channels, num_groups=32):
class HybridConditioner (line 90) | class HybridConditioner(nn.Module):
method __init__ (line 92) | def __init__(self, c_concat_config, c_crossattn_config):
method forward (line 97) | def forward(self, c_concat, c_crossattn):
FILE: lvdm/common.py
function gather_data (line 8) | def gather_data(data, return_np=True):
function autocast (line 16) | def autocast(f):
function extract_into_tensor (line 25) | def extract_into_tensor(a, t, x_shape):
function noise_like (line 31) | def noise_like(shape, device, repeat=False):
function default (line 37) | def default(val, d):
function exists (line 42) | def exists(val):
function identity (line 45) | def identity(*args, **kwargs):
function uniq (line 48) | def uniq(arr):
function mean_flat (line 51) | def mean_flat(tensor):
function ismap (line 57) | def ismap(x):
function isimage (line 62) | def isimage(x):
function max_neg_value (line 67) | def max_neg_value(t):
function shape_to_str (line 70) | def shape_to_str(x):
function init_ (line 74) | def init_(tensor):
function checkpoint (line 81) | def checkpoint(func, inputs, params, flag):
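The `extract_into_tensor` helper indexed above is the standard diffusion-codebase trick for gathering per-batch schedule coefficients so they broadcast against an image or video batch. A NumPy sketch of the idea (the torch version in `lvdm/common.py` may differ in details):

```python
import numpy as np

def extract_into_tensor(a, t, x_shape):
    # Gather one schedule coefficient a[t_i] per batch element, then
    # reshape to (b, 1, 1, ...) so it broadcasts over a batch of x_shape.
    b = t.shape[0]
    out = a[t]
    return out.reshape(b, *((1,) * (len(x_shape) - 1)))
```

With a length-10 schedule and two batch elements, the result has shape `(2, 1, 1, 1)` and multiplies cleanly against a `(2, C, H, W)` batch.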
FILE: lvdm/distributions.py
class AbstractDistribution (line 5) | class AbstractDistribution:
method sample (line 6) | def sample(self):
method mode (line 9) | def mode(self):
class DiracDistribution (line 13) | class DiracDistribution(AbstractDistribution):
method __init__ (line 14) | def __init__(self, value):
method sample (line 17) | def sample(self):
method mode (line 20) | def mode(self):
class DiagonalGaussianDistribution (line 24) | class DiagonalGaussianDistribution(object):
method __init__ (line 25) | def __init__(self, parameters, deterministic=False):
method sample (line 35) | def sample(self, noise=None):
method kl (line 42) | def kl(self, other=None):
method nll (line 56) | def nll(self, sample, dims=[1,2,3]):
method mode (line 64) | def mode(self):
function normal_kl (line 68) | def normal_kl(mean1, logvar1, mean2, logvar2):
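The `normal_kl(mean1, logvar1, mean2, logvar2)` signature above matches the closed-form KL divergence between two diagonal Gaussians parameterized by mean and log-variance. A scalar pure-Python sketch of that formula (the file's version operates elementwise on tensors):

```python
import math

def normal_kl(mean1, logvar1, mean2, logvar2):
    # KL( N(mean1, exp(logvar1)) || N(mean2, exp(logvar2)) ), scalar form.
    return 0.5 * (-1.0 + logvar2 - logvar1
                  + math.exp(logvar1 - logvar2)
                  + (mean1 - mean2) ** 2 * math.exp(-logvar2))
```

Two sanity checks: identical distributions give 0, and a unit-variance Gaussian shifted by 1 against the standard normal gives 0.5.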
FILE: lvdm/ema.py
class LitEma (line 5) | class LitEma(nn.Module):
method __init__ (line 6) | def __init__(self, model, decay=0.9999, use_num_upates=True):
method forward (line 25) | def forward(self,model):
method copy_to (line 46) | def copy_to(self, model):
method store (line 55) | def store(self, parameters):
method restore (line 64) | def restore(self, parameters):
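`LitEma` maintains exponentially moving averages of model parameters (note the default `decay=0.9999`). The per-step update it applies can be sketched independently of the module (a minimal list-of-floats version; the real class works on named buffers and also warms up the decay via `num_updates`):

```python
def ema_update(shadow, current, decay=0.9999):
    # shadow <- decay * shadow + (1 - decay) * current, per parameter.
    return [decay * s + (1.0 - decay) * c for s, c in zip(shadow, current)]
```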
FILE: lvdm/models/autoencoder.py
class AutoencoderKL (line 13) | class AutoencoderKL(pl.LightningModule):
method __init__ (line 14) | def __init__(self,
method init_test (line 51) | def init_test(self,):
method init_from_ckpt (line 80) | def init_from_ckpt(self, path, ignore_keys=list()):
method encode (line 97) | def encode(self, x, **kwargs):
method decode (line 104) | def decode(self, z, **kwargs):
method forward (line 109) | def forward(self, input, sample_posterior=True):
method get_input (line 118) | def get_input(self, batch, k):
method training_step (line 128) | def training_step(self, batch, batch_idx, optimizer_idx):
method validation_step (line 149) | def validation_step(self, batch, batch_idx):
method configure_optimizers (line 163) | def configure_optimizers(self):
method get_last_layer (line 174) | def get_last_layer(self):
method log_images (line 178) | def log_images(self, batch, only_inputs=False, **kwargs):
method to_rgb (line 194) | def to_rgb(self, x):
class IdentityFirstStage (line 202) | class IdentityFirstStage(torch.nn.Module):
method __init__ (line 203) | def __init__(self, *args, vq_interface=False, **kwargs):
method encode (line 207) | def encode(self, x, *args, **kwargs):
method decode (line 210) | def decode(self, x, *args, **kwargs):
method quantize (line 213) | def quantize(self, x, *args, **kwargs):
method forward (line 218) | def forward(self, x, *args, **kwargs):
FILE: lvdm/models/ddpm3d.py
class DDPM (line 38) | class DDPM(pl.LightningModule):
method __init__ (line 40) | def __init__(self,
method register_schedule (line 113) | def register_schedule(self, given_betas=None, beta_schedule="linear", ...
method ema_scope (line 168) | def ema_scope(self, context=None):
method init_from_ckpt (line 182) | def init_from_ckpt(self, path, ignore_keys=list(), only_model=False):
method q_mean_variance (line 200) | def q_mean_variance(self, x_start, t):
method predict_start_from_noise (line 212) | def predict_start_from_noise(self, x_t, t, noise):
method q_posterior (line 218) | def q_posterior(self, x_start, x_t, t):
method p_mean_variance (line 227) | def p_mean_variance(self, x, t, clip_denoised: bool):
method p_sample (line 240) | def p_sample(self, x, t, clip_denoised=True, repeat_noise=False):
method p_sample_loop (line 249) | def p_sample_loop(self, shape, return_intermediates=False):
method sample (line 264) | def sample(self, batch_size=16, return_intermediates=False):
method q_sample (line 270) | def q_sample(self, x_start, t, noise=None):
method get_input (line 276) | def get_input(self, batch, k):
method _get_rows_from_list (line 281) | def _get_rows_from_list(self, samples):
method log_images (line 289) | def log_images(self, batch, N=8, n_row=2, sample=True, return_keys=Non...
class LatentDiffusion (line 327) | class LatentDiffusion(DDPM):
method __init__ (line 329) | def __init__(self,
method make_cond_schedule (line 407) | def make_cond_schedule(self, ):
method q_sample (line 412) | def q_sample(self, x_start, t, noise=None):
method _freeze_model (line 423) | def _freeze_model(self):
method instantiate_first_stage (line 427) | def instantiate_first_stage(self, config):
method instantiate_cond_stage (line 434) | def instantiate_cond_stage(self, config):
method get_learned_conditioning (line 445) | def get_learned_conditioning(self, c):
method get_first_stage_encoding (line 458) | def get_first_stage_encoding(self, encoder_posterior, noise=None):
method encode_first_stage (line 468) | def encode_first_stage(self, x):
method encode_first_stage_2DAE (line 485) | def encode_first_stage_2DAE(self, x):
method decode_core (line 492) | def decode_core(self, z, **kwargs):
method decode_first_stage (line 509) | def decode_first_stage(self, z, **kwargs):
method apply_model (line 512) | def apply_model(self, x_noisy, t, cond, **kwargs):
method _get_denoise_row_from_list (line 529) | def _get_denoise_row_from_list(self, samples, desc=''):
method decode_first_stage_2DAE (line 556) | def decode_first_stage_2DAE(self, z, **kwargs):
method p_mean_variance (line 565) | def p_mean_variance(self, x, c, t, clip_denoised: bool, return_x0=Fals...
method p_sample (line 591) | def p_sample(self, x, c, t, clip_denoised=False, repeat_noise=False, r...
method p_sample_loop (line 613) | def p_sample_loop(self, cond, shape, return_intermediates=False, x_T=N...
class LatentVisualDiffusion (line 660) | class LatentVisualDiffusion(LatentDiffusion):
method __init__ (line 661) | def __init__(self, cond_img_config, finegrained=False, random_cond=Fal...
method instantiate_img_embedder (line 669) | def instantiate_img_embedder(self, config, freeze=True):
method init_projector (line 677) | def init_projector(self, use_finegrained, num_tokens, input_dim, cross...
method get_image_embeds (line 689) | def get_image_embeds(self, batch_imgs):
class DiffusionWrapper (line 696) | class DiffusionWrapper(pl.LightningModule):
method __init__ (line 697) | def __init__(self, diff_model_config, conditioning_key):
method forward (line 702) | def forward(self, x, t, c_concat: list = None, c_crossattn: list = None,
FILE: lvdm/models/samplers/ddim.py
class DDIMSampler (line 8) | class DDIMSampler(object):
method __init__ (line 9) | def __init__(self, model, schedule="linear", **kwargs):
method register_buffer (line 16) | def register_buffer(self, name, attr):
method make_schedule (line 22) | def make_schedule(self, ddim_num_steps, ddim_discretize="uniform", ddi...
method sample (line 63) | def sample(self,
method ddim_sampling (line 133) | def ddim_sampling(self, cond, shape,
method p_sample_ddim (line 213) | def p_sample_ddim(self, x, c, t, index, repeat_noise=False, use_origin...
method stochastic_encode (line 295) | def stochastic_encode(self, x0, t, use_original_steps=False, noise=None):
method decode (line 317) | def decode(self, x_latent, cond, t_start, unconditional_guidance_scale...
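`p_sample_ddim` performs a single DDIM denoising step. The standard DDIM update it is built around can be sketched in scalar form (variable names are illustrative; the sampler additionally handles classifier-free guidance and tensor batching):

```python
import math

def ddim_step(x_t, eps, alpha_t, alpha_prev, eta=0.0, noise=0.0):
    # One DDIM update from cumulative alpha_t to alpha_prev.
    # eta=0 gives the fully deterministic sampler.
    pred_x0 = (x_t - math.sqrt(1.0 - alpha_t) * eps) / math.sqrt(alpha_t)
    sigma = eta * math.sqrt((1.0 - alpha_prev) / (1.0 - alpha_t)
                            * (1.0 - alpha_t / alpha_prev))
    dir_xt = math.sqrt(max(1.0 - alpha_prev - sigma ** 2, 0.0)) * eps
    return math.sqrt(alpha_prev) * pred_x0 + dir_xt + sigma * noise
```

Useful sanity check: with `eta=0` and `alpha_prev == alpha_t`, the step is the identity.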
FILE: lvdm/models/samplers/ddim_mp.py
class DDIMSampler (line 8) | class DDIMSampler(object):
method __init__ (line 9) | def __init__(self, model, schedule="linear", **kwargs):
method register_buffer (line 16) | def register_buffer(self, name, attr):
method make_schedule (line 22) | def make_schedule(self, ddim_num_steps, ddim_discretize="uniform", ddi...
method sample (line 63) | def sample(self,
method ddim_sampling (line 133) | def ddim_sampling(self, cond, shape,
method p_sample_ddim (line 214) | def p_sample_ddim(self, x, c, t, index, repeat_noise=False, use_origin...
method stochastic_encode (line 299) | def stochastic_encode(self, x0, t, use_original_steps=False, noise=None):
method decode (line 321) | def decode(self, x_latent, cond, t_start, unconditional_guidance_scale...
FILE: lvdm/models/utils_diffusion.py
function timestep_embedding (line 8) | def timestep_embedding(timesteps, dim, max_period=10000, repeat_only=Fal...
function make_beta_schedule (line 31) | def make_beta_schedule(schedule, n_timestep, linear_start=1e-4, linear_e...
function make_ddim_timesteps (line 56) | def make_ddim_timesteps(ddim_discr_method, num_ddim_timesteps, num_ddpm_...
function make_ddim_sampling_parameters (line 73) | def make_ddim_sampling_parameters(alphacums, ddim_timesteps, eta, verbos...
function betas_for_alpha_bar (line 88) | def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.9...
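The `timestep_embedding(timesteps, dim, max_period=10000, ...)` signature above follows the standard sinusoidal construction from the Transformer/DDPM literature. A pure-Python sketch of the non-`repeat_only` path, assuming an even `dim` (the file's version returns a torch tensor):

```python
import math

def timestep_embedding(timesteps, dim, max_period=10000):
    # Per timestep t: [cos(t*f_0), ..., cos(t*f_{h-1}), sin(t*f_0), ...]
    # with geometrically spaced frequencies f_i = max_period^(-i/h).
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [[math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]
            for t in timesteps]
```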
FILE: lvdm/modules/attention.py
class RelativePosition (line 21) | class RelativePosition(nn.Module):
method __init__ (line 24) | def __init__(self, num_units, max_relative_position):
method forward (line 31) | def forward(self, length_q, length_k):
class CrossAttention (line 43) | class CrossAttention(nn.Module):
method __init__ (line 45) | def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, ...
method forward (line 76) | def forward(self, x, context=None, mask=None):
method efficient_forward (line 129) | def efficient_forward(self, x, context=None, mask=None):
class BasicTransformerBlock (line 187) | class BasicTransformerBlock(nn.Module):
method __init__ (line 189) | def __init__(self, dim, n_heads, d_head, dropout=0., context_dim=None,...
method forward (line 204) | def forward(self, x, context=None, mask=None):
method _forward (line 216) | def _forward(self, x, context=None, mask=None):
class SpatialTransformer (line 223) | class SpatialTransformer(nn.Module):
method __init__ (line 233) | def __init__(self, in_channels, n_heads, d_head, depth=1, dropout=0., ...
method forward (line 262) | def forward(self, x, context=None):
class TemporalTransformer (line 281) | class TemporalTransformer(nn.Module):
method __init__ (line 288) | def __init__(self, in_channels, n_heads, d_head, depth=1, dropout=0., ...
method forward (line 331) | def forward(self, x, context=None):
class GEGLU (line 376) | class GEGLU(nn.Module):
method __init__ (line 377) | def __init__(self, dim_in, dim_out):
method forward (line 381) | def forward(self, x):
class FeedForward (line 386) | class FeedForward(nn.Module):
method __init__ (line 387) | def __init__(self, dim, dim_out=None, mult=4, glu=False, dropout=0.):
method forward (line 402) | def forward(self, x):
class LinearAttention (line 406) | class LinearAttention(nn.Module):
method __init__ (line 407) | def __init__(self, dim, heads=4, dim_head=32):
method forward (line 414) | def forward(self, x):
class SpatialSelfAttention (line 425) | class SpatialSelfAttention(nn.Module):
method __init__ (line 426) | def __init__(self, in_channels):
method forward (line 452) | def forward(self, x):
FILE: lvdm/modules/attention_freenoise.py
function generate_weight_sequence (line 21) | def generate_weight_sequence(n):
class RelativePosition (line 30) | class RelativePosition(nn.Module):
method __init__ (line 33) | def __init__(self, num_units, max_relative_position):
method forward (line 40) | def forward(self, length_q, length_k):
class CrossAttention (line 52) | class CrossAttention(nn.Module):
method __init__ (line 54) | def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, ...
method forward (line 87) | def forward(self, x, context=None, mask=None, context_next=None, use_i...
method efficient_forward (line 202) | def efficient_forward(self, x, context=None, mask=None, context_next=N...
class BasicTransformerBlock (line 274) | class BasicTransformerBlock(nn.Module):
method __init__ (line 276) | def __init__(self, dim, n_heads, d_head, dropout=0., context_dim=None,...
method forward (line 291) | def forward(self, x, context=None, mask=None, context_next=None, use_i...
method _forward (line 304) | def _forward(self, x, context=None, mask=None, context_next=None, use_...
class SpatialTransformer (line 311) | class SpatialTransformer(nn.Module):
method __init__ (line 321) | def __init__(self, in_channels, n_heads, d_head, depth=1, dropout=0., ...
method forward (line 351) | def forward(self, x, context=None, **kwargs):
class TemporalTransformer (line 370) | class TemporalTransformer(nn.Module):
method __init__ (line 377) | def __init__(self, in_channels, n_heads, d_head, depth=1, dropout=0., ...
method forward (line 421) | def forward(self, x, context=None, **kwargs):
class GEGLU (line 466) | class GEGLU(nn.Module):
method __init__ (line 467) | def __init__(self, dim_in, dim_out):
method forward (line 471) | def forward(self, x):
class FeedForward (line 476) | class FeedForward(nn.Module):
method __init__ (line 477) | def __init__(self, dim, dim_out=None, mult=4, glu=False, dropout=0.):
method forward (line 492) | def forward(self, x):
class LinearAttention (line 496) | class LinearAttention(nn.Module):
method __init__ (line 497) | def __init__(self, dim, heads=4, dim_head=32):
method forward (line 504) | def forward(self, x):
class SpatialSelfAttention (line 515) | class SpatialSelfAttention(nn.Module):
method __init__ (line 516) | def __init__(self, in_channels):
method forward (line 542) | def forward(self, x, **kwargs):
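The `generate_weight_sequence(n)` helper at the top of this file supplies per-frame blending weights for merging overlapping temporal attention windows in FreeNoise. A plausible pure-Python sketch — the triangular ramp shape is an assumption from the signature and its role, not quoted from the file body:

```python
def generate_weight_sequence(n):
    # Triangular weights 1, 2, ..., peak, ..., 2, 1 of length n, so frames
    # near a window's center dominate when overlapping windows are fused.
    if n % 2 == 0:
        m = n // 2
        return list(range(1, m + 1)) + list(range(m, 0, -1))
    m = (n + 1) // 2
    return list(range(1, m)) + [m] + list(range(m - 1, 0, -1))
```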
FILE: lvdm/modules/encoders/condition.py
class AbstractEncoder (line 10) | class AbstractEncoder(nn.Module):
method __init__ (line 11) | def __init__(self):
method encode (line 14) | def encode(self, *args, **kwargs):
class IdentityEncoder (line 18) | class IdentityEncoder(AbstractEncoder):
method encode (line 20) | def encode(self, x):
class ClassEmbedder (line 24) | class ClassEmbedder(nn.Module):
method __init__ (line 25) | def __init__(self, embed_dim, n_classes=1000, key='class', ucg_rate=0.1):
method forward (line 32) | def forward(self, batch, key=None, disable_dropout=False):
method get_unconditional_conditioning (line 44) | def get_unconditional_conditioning(self, bs, device="cuda"):
function disabled_train (line 51) | def disabled_train(self, mode=True):
class FrozenT5Embedder (line 57) | class FrozenT5Embedder(AbstractEncoder):
method __init__ (line 60) | def __init__(self, version="google/t5-v1_1-large", device="cuda", max_...
method freeze (line 70) | def freeze(self):
method forward (line 76) | def forward(self, text):
method encode (line 85) | def encode(self, text):
class FrozenCLIPEmbedder (line 89) | class FrozenCLIPEmbedder(AbstractEncoder):
method __init__ (line 97) | def __init__(self, version="openai/clip-vit-large-patch14", device="cu...
method freeze (line 113) | def freeze(self):
method forward (line 119) | def forward(self, text):
method encode (line 132) | def encode(self, text):
class ClipImageEmbedder (line 136) | class ClipImageEmbedder(nn.Module):
method __init__ (line 137) | def __init__(
method preprocess (line 155) | def preprocess(self, x):
method forward (line 165) | def forward(self, x, no_dropout=False):
class FrozenOpenCLIPEmbedder (line 174) | class FrozenOpenCLIPEmbedder(AbstractEncoder):
method __init__ (line 184) | def __init__(self, arch="ViT-H-14", version="laion2b_s32b_b79k", devic...
method freeze (line 204) | def freeze(self):
method forward (line 209) | def forward(self, text):
method encode_with_transformer (line 215) | def encode_with_transformer(self, text):
method text_transformer_forward (line 224) | def text_transformer_forward(self, x: torch.Tensor, attn_mask=None):
method encode (line 234) | def encode(self, text):
class FrozenOpenCLIPImageEmbedder (line 238) | class FrozenOpenCLIPImageEmbedder(AbstractEncoder):
method __init__ (line 243) | def __init__(self, arch="ViT-H-14", version="laion2b_s32b_b79k", devic...
method preprocess (line 266) | def preprocess(self, x):
method freeze (line 276) | def freeze(self):
method forward (line 282) | def forward(self, image, no_dropout=False):
method encode_with_vision_transformer (line 288) | def encode_with_vision_transformer(self, img):
method encode (line 293) | def encode(self, text):
class FrozenOpenCLIPImageEmbedderV2 (line 298) | class FrozenOpenCLIPImageEmbedderV2(AbstractEncoder):
method __init__ (line 303) | def __init__(self, arch="ViT-H-14", version="laion2b_s32b_b79k", devic...
method preprocess (line 324) | def preprocess(self, x):
method freeze (line 334) | def freeze(self):
method forward (line 339) | def forward(self, image, no_dropout=False):
method encode_with_vision_transformer (line 344) | def encode_with_vision_transformer(self, x):
class FrozenCLIPT5Encoder (line 377) | class FrozenCLIPT5Encoder(AbstractEncoder):
method __init__ (line 378) | def __init__(self, clip_version="openai/clip-vit-large-patch14", t5_ve...
method encode (line 386) | def encode(self, text):
method forward (line 389) | def forward(self, text):
FILE: lvdm/modules/encoders/ip_resampler.py
class ImageProjModel (line 7) | class ImageProjModel(nn.Module):
method __init__ (line 9) | def __init__(self, cross_attention_dim=1024, clip_embeddings_dim=1024,...
method forward (line 16) | def forward(self, image_embeds):
function FeedForward (line 24) | def FeedForward(dim, mult=4):
function reshape_tensor (line 34) | def reshape_tensor(x, heads):
class PerceiverAttention (line 45) | class PerceiverAttention(nn.Module):
method __init__ (line 46) | def __init__(self, *, dim, dim_head=64, heads=8):
method forward (line 61) | def forward(self, x, latents):
class Resampler (line 93) | class Resampler(nn.Module):
method __init__ (line 94) | def __init__(
method forward (line 125) | def forward(self, x):
FILE: lvdm/modules/networks/ae_modules.py
function nonlinearity (line 10) | def nonlinearity(x):
function Normalize (line 15) | def Normalize(in_channels, num_groups=32):
class LinAttnBlock (line 20) | class LinAttnBlock(LinearAttention):
method __init__ (line 22) | def __init__(self, in_channels):
class AttnBlock (line 26) | class AttnBlock(nn.Module):
method __init__ (line 27) | def __init__(self, in_channels):
method forward (line 53) | def forward(self, x):
function make_attn (line 80) | def make_attn(in_channels, attn_type="vanilla"):
class Downsample (line 90) | class Downsample(nn.Module):
method __init__ (line 91) | def __init__(self, in_channels, with_conv):
method forward (line 102) | def forward(self, x):
class Upsample (line 111) | class Upsample(nn.Module):
method __init__ (line 112) | def __init__(self, in_channels, with_conv):
method forward (line 123) | def forward(self, x):
function get_timestep_embedding (line 129) | def get_timestep_embedding(timesteps, embedding_dim):
class ResnetBlock (line 151) | class ResnetBlock(nn.Module):
method __init__ (line 152) | def __init__(self, *, in_channels, out_channels=None, conv_shortcut=Fa...
method forward (line 190) | def forward(self, x, temb):
class Model (line 212) | class Model(nn.Module):
method __init__ (line 213) | def __init__(self, *, ch, out_ch, ch_mult=(1,2,4,8), num_res_blocks,
method forward (line 312) | def forward(self, x, t=None, context=None):
method get_last_layer (line 360) | def get_last_layer(self):
class Encoder (line 364) | class Encoder(nn.Module):
method __init__ (line 365) | def __init__(self, *, ch, out_ch, ch_mult=(1,2,4,8), num_res_blocks,
method forward (line 430) | def forward(self, x):
class Decoder (line 466) | class Decoder(nn.Module):
method __init__ (line 467) | def __init__(self, *, ch, out_ch, ch_mult=(1,2,4,8), num_res_blocks,
method forward (line 539) | def forward(self, z):
class SimpleDecoder (line 581) | class SimpleDecoder(nn.Module):
method __init__ (line 582) | def __init__(self, in_channels, out_channels, *args, **kwargs):
method forward (line 604) | def forward(self, x):
class UpsampleDecoder (line 617) | class UpsampleDecoder(nn.Module):
method __init__ (line 618) | def __init__(self, in_channels, out_channels, ch, num_res_blocks, reso...
method forward (line 651) | def forward(self, x):
class LatentRescaler (line 665) | class LatentRescaler(nn.Module):
method __init__ (line 666) | def __init__(self, factor, in_channels, mid_channels, out_channels, de...
method forward (line 690) | def forward(self, x):
class MergedRescaleEncoder (line 702) | class MergedRescaleEncoder(nn.Module):
method __init__ (line 703) | def __init__(self, in_channels, ch, resolution, out_ch, num_res_blocks,
method forward (line 715) | def forward(self, x):
class MergedRescaleDecoder (line 721) | class MergedRescaleDecoder(nn.Module):
method __init__ (line 722) | def __init__(self, z_channels, out_ch, resolution, num_res_blocks, att...
method forward (line 732) | def forward(self, x):
class Upsampler (line 738) | class Upsampler(nn.Module):
method __init__ (line 739) | def __init__(self, in_size, out_size, in_channels, out_channels, ch_mu...
method forward (line 751) | def forward(self, x):
class Resize (line 757) | class Resize(nn.Module):
method __init__ (line 758) | def __init__(self, in_channels=None, learned=False, mode="bilinear"):
method forward (line 773) | def forward(self, x, scale_factor=1.0):
class FirstStagePostProcessor (line 780) | class FirstStagePostProcessor(nn.Module):
method __init__ (line 782) | def __init__(self, ch_mult:list, in_channels,
method instantiate_pretrained (line 817) | def instantiate_pretrained(self, config):
method encode_with_pretrained (line 826) | def encode_with_pretrained(self,x):
method forward (line 832) | def forward(self,x):
FILE: lvdm/modules/networks/openaimodel3d.py
class TimestepBlock (line 19) | class TimestepBlock(nn.Module):
method forward (line 24) | def forward(self, x, emb):
class TimestepEmbedSequential (line 30) | class TimestepEmbedSequential(nn.Sequential, TimestepBlock):
method forward (line 36) | def forward(self, x, emb, context=None, batch_size=None):
class Downsample (line 51) | class Downsample(nn.Module):
method __init__ (line 60) | def __init__(self, channels, use_conv, dims=2, out_channels=None, padd...
method forward (line 75) | def forward(self, x):
class Upsample (line 80) | class Upsample(nn.Module):
method __init__ (line 89) | def __init__(self, channels, use_conv, dims=2, out_channels=None, padd...
method forward (line 98) | def forward(self, x):
class ResBlock (line 109) | class ResBlock(TimestepBlock):
method __init__ (line 124) | def __init__(
method forward (line 195) | def forward(self, x, emb, batch_size=None):
method _forward (line 208) | def _forward(self, x, emb, batch_size=None,):
class TemporalConvBlock (line 237) | class TemporalConvBlock(nn.Module):
method __init__ (line 242) | def __init__(self, in_channels, out_channels=None, dropout=0.0, spatia...
method forward (line 269) | def forward(self, x):
class UNetModel (line 279) | class UNetModel(nn.Module):
method __init__ (line 307) | def __init__(self,
method forward (line 534) | def forward(self, x, timesteps, context=None, features_adapter=None, f...
FILE: lvdm/modules/networks/openaimodel3d_freenoise.py
class TimestepBlock (line 19) | class TimestepBlock(nn.Module):
method forward (line 24) | def forward(self, x, emb):
class TimestepEmbedSequential (line 30) | class TimestepEmbedSequential(nn.Sequential, TimestepBlock):
method forward (line 36) | def forward(self, x, emb, context=None, batch_size=None, use_injection...
class Downsample (line 51) | class Downsample(nn.Module):
method __init__ (line 60) | def __init__(self, channels, use_conv, dims=2, out_channels=None, padd...
method forward (line 75) | def forward(self, x):
class Upsample (line 80) | class Upsample(nn.Module):
method __init__ (line 89) | def __init__(self, channels, use_conv, dims=2, out_channels=None, padd...
method forward (line 98) | def forward(self, x):
class ResBlock (line 109) | class ResBlock(TimestepBlock):
method __init__ (line 124) | def __init__(
method forward (line 195) | def forward(self, x, emb, batch_size=None):
method _forward (line 208) | def _forward(self, x, emb, batch_size=None,):
class TemporalConvBlock (line 237) | class TemporalConvBlock(nn.Module):
method __init__ (line 242) | def __init__(self, in_channels, out_channels=None, dropout=0.0, spatia...
method forward (line 269) | def forward(self, x):
class UNetModel (line 279) | class UNetModel(nn.Module):
method __init__ (line 307) | def __init__(self,
method forward (line 534) | def forward(self, x, timesteps, context=None, features_adapter=None, f...
FILE: lvdm/modules/x_transformer.py
class AbsolutePositionalEmbedding (line 24) | class AbsolutePositionalEmbedding(nn.Module):
method __init__ (line 25) | def __init__(self, dim, max_seq_len):
method init_ (line 30) | def init_(self):
method forward (line 33) | def forward(self, x):
class FixedPositionalEmbedding (line 38) | class FixedPositionalEmbedding(nn.Module):
method __init__ (line 39) | def __init__(self, dim):
method forward (line 44) | def forward(self, x, seq_dim=1, offset=0):
function exists (line 53) | def exists(val):
function default (line 57) | def default(val, d):
function always (line 63) | def always(val):
function not_equals (line 69) | def not_equals(val):
function equals (line 75) | def equals(val):
function max_neg_value (line 81) | def max_neg_value(tensor):
function pick_and_pop (line 87) | def pick_and_pop(keys, d):
function group_dict_by_key (line 92) | def group_dict_by_key(cond, d):
function string_begins_with (line 101) | def string_begins_with(prefix, str):
function group_by_key_prefix (line 105) | def group_by_key_prefix(prefix, d):
function groupby_prefix_and_trim (line 109) | def groupby_prefix_and_trim(prefix, d):
class Scale (line 116) | class Scale(nn.Module):
method __init__ (line 117) | def __init__(self, value, fn):
method forward (line 122) | def forward(self, x, **kwargs):
class Rezero (line 127) | class Rezero(nn.Module):
method __init__ (line 128) | def __init__(self, fn):
method forward (line 133) | def forward(self, x, **kwargs):
class ScaleNorm (line 138) | class ScaleNorm(nn.Module):
method __init__ (line 139) | def __init__(self, dim, eps=1e-5):
method forward (line 145) | def forward(self, x):
class RMSNorm (line 150) | class RMSNorm(nn.Module):
method __init__ (line 151) | def __init__(self, dim, eps=1e-8):
method forward (line 157) | def forward(self, x):
class Residual (line 162) | class Residual(nn.Module):
method forward (line 163) | def forward(self, x, residual):
class GRUGating (line 167) | class GRUGating(nn.Module):
method __init__ (line 168) | def __init__(self, dim):
method forward (line 172) | def forward(self, x, residual):
class GEGLU (line 183) | class GEGLU(nn.Module):
method __init__ (line 184) | def __init__(self, dim_in, dim_out):
method forward (line 188) | def forward(self, x):
class FeedForward (line 193) | class FeedForward(nn.Module):
method __init__ (line 194) | def __init__(self, dim, dim_out=None, mult=4, glu=False, dropout=0.):
method forward (line 209) | def forward(self, x):
class Attention (line 214) | class Attention(nn.Module):
method __init__ (line 215) | def __init__(
method forward (line 267) | def forward(
class AttentionLayers (line 369) | class AttentionLayers(nn.Module):
method __init__ (line 370) | def __init__(
method forward (line 480) | def forward(
class Encoder (line 540) | class Encoder(AttentionLayers):
method __init__ (line 541) | def __init__(self, **kwargs):
class TransformerWrapper (line 547) | class TransformerWrapper(nn.Module):
method __init__ (line 548) | def __init__(
method init_ (line 594) | def init_(self):
method forward (line 597) | def forward(
FILE: predict.py
class Predictor (line 26) | class Predictor(BasePredictor):
method setup (line 27) | def setup(self) -> None:
method predict (line 49) | def predict(
FILE: scripts/evaluation/ddp_wrapper.py
function setup_dist (line 8) | def setup_dist(local_rank):
function get_dist_info (line 15) | def get_dist_info():
FILE: scripts/evaluation/funcs.py
function get_views (line 13) | def get_views(video_length, window_size=16, stride=4):
function batch_ddim_sampling (line 22) | def batch_ddim_sampling(model, cond, noise_shape, n_samples=1, ddim_step...
function batch_ddim_sampling_freenoise (line 79) | def batch_ddim_sampling_freenoise(model, cond, noise_shape, n_samples=1,...
function batch_ddim_sampling_freenoise_mp (line 139) | def batch_ddim_sampling_freenoise_mp(model, cond, noise_shape, n_samples...
function get_filelist (line 214) | def get_filelist(data_dir, ext='*'):
function get_dirlist (line 219) | def get_dirlist(path):
function load_model_checkpoint (line 231) | def load_model_checkpoint(model, ckpt):
function load_prompts (line 250) | def load_prompts(prompt_file):
function load_prompts_mp (line 260) | def load_prompts_mp(prompt_file):
function load_video_batch (line 277) | def load_video_batch(filepath_list, frame_stride, video_size=(256,256), ...
function load_image_batch (line 316) | def load_image_batch(filepath_list, image_size=(256,256)):
function save_videos (line 340) | def save_videos(batch_tensors, savedir, filenames, fps=10):
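`get_views(video_length, window_size=16, stride=4)` enumerates the overlapping temporal windows that the FreeNoise samplers process; a sketch consistent with that signature (the exact boundary handling in `funcs.py` is an assumption):

```python
def get_views(video_length, window_size=16, stride=4):
    # Enumerate overlapping temporal windows [t_start, t_end) over the video.
    num_blocks = (video_length - window_size) // stride + 1
    return [(i * stride, i * stride + window_size) for i in range(num_blocks)]
```

For a 24-frame video this yields windows `(0, 16)`, `(4, 20)`, `(8, 24)`, each sharing 12 frames with its neighbor.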
FILE: scripts/evaluation/inference.py
function get_parser (line 18) | def get_parser():
function run_inference (line 42) | def run_inference(args, gpu_num, gpu_no, **kwargs):
FILE: scripts/evaluation/inference_freenoise.py
function get_parser (line 18) | def get_parser():
function run_inference (line 45) | def run_inference(args, gpu_num, gpu_no, **kwargs):
FILE: scripts/evaluation/inference_freenoise_mp.py
function get_parser (line 18) | def get_parser():
function run_inference (line 45) | def run_inference(args, gpu_num, gpu_no, **kwargs):
FILE: utils/utils.py
function count_params (line 8) | def count_params(model, verbose=False):
function check_istarget (line 15) | def check_istarget(name, para_list):
function instantiate_from_config (line 27) | def instantiate_from_config(config):
function get_obj_from_str (line 37) | def get_obj_from_str(string, reload=False):
function load_npz_from_dir (line 45) | def load_npz_from_dir(data_dir):
function load_npz_from_paths (line 51) | def load_npz_from_paths(data_paths):
function resize_numpy_image (line 57) | def resize_numpy_image(image, max_resolution=512 * 512, resize_short_edg...
function setup_dist (line 70) | def setup_dist(args):
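`instantiate_from_config` and `get_obj_from_str` implement the config-driven factory pattern standard across latent-diffusion codebases: a YAML node names a `target` class and its constructor `params`. A sketch under that assumption (exact error handling in `utils/utils.py` may differ):

```python
import importlib

def get_obj_from_str(string, reload=False):
    # "pkg.module.ClassName" -> the class object itself.
    module, cls = string.rsplit(".", 1)
    m = importlib.import_module(module)
    if reload:
        importlib.reload(m)
    return getattr(m, cls)

def instantiate_from_config(config):
    # config: {"target": "pkg.mod.Class", "params": {...}}
    if "target" not in config:
        raise KeyError("Expected key `target` to instantiate.")
    return get_obj_from_str(config["target"])(**config.get("params", dict()))
```

This is how the inference scripts turn the `model:` block of the YAML configs into a `LatentDiffusion` instance without hard-coding the class.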
Condensed preview — 42 files, each showing path, character count, and a content snippet (full structured content: 353K chars).
[
{
"path": "LICENSE",
"chars": 11357,
"preview": " Apache License\n Version 2.0, January 2004\n "
},
{
"path": "README.md",
"chars": 6315,
"preview": "## ___***FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling***___\n\n### 🔥🔥🔥 FreeNoise for longer high-q"
},
{
"path": "cog.yaml",
"chars": 794,
"preview": "# Configuration for Cog ⚙️\n# Reference: https://github.com/replicate/cog/blob/main/docs/yaml.md\n\nbuild:\n gpu: true\n sy"
},
{
"path": "configs/inference_t2v_1024_v1.0.yaml",
"chars": 1852,
"preview": "model:\n target: lvdm.models.ddpm3d.LatentDiffusion\n params:\n linear_start: 0.00085\n linear_end: 0.012\n num_ti"
},
{
"path": "configs/inference_t2v_1024_v1.0_freenoise.yaml",
"chars": 1862,
"preview": "model:\n target: lvdm.models.ddpm3d.LatentDiffusion\n params:\n linear_start: 0.00085\n linear_end: 0.012\n num_ti"
},
{
"path": "configs/inference_t2v_tconv256_v1.0.yaml",
"chars": 1862,
"preview": "model:\n target: lvdm.models.ddpm3d.LatentDiffusion\n params:\n linear_start: 0.00085\n linear_end: 0.012\n num_ti"
},
{
"path": "configs/inference_t2v_tconv256_v1.0_freenoise.yaml",
"chars": 1862,
"preview": "model:\n target: lvdm.models.ddpm3d.LatentDiffusion\n params:\n linear_start: 0.00085\n linear_end: 0.012\n num_ti"
},
{
"path": "configs/inference_t2v_tconv512_v2.0.yaml",
"chars": 1844,
"preview": "model:\n target: lvdm.models.ddpm3d.LatentDiffusion\n params:\n linear_start: 0.00085\n linear_end: 0.012\n num_ti"
},
{
"path": "configs/inference_t2v_tconv512_v2.0_freenoise.yaml",
"chars": 1854,
"preview": "model:\n target: lvdm.models.ddpm3d.LatentDiffusion\n params:\n linear_start: 0.00085\n linear_end: 0.012\n num_ti"
},
{
"path": "lvdm/basics.py",
"chars": 2849,
"preview": "# adopted from\n# https://github.com/openai/improved-diffusion/blob/main/improved_diffusion/gaussian_diffusion.py\n# and\n#"
},
{
"path": "lvdm/common.py",
"chars": 2800,
"preview": "import math\nfrom inspect import isfunction\nimport torch\nfrom torch import nn\nimport torch.distributed as dist\n\n\ndef gath"
},
{
"path": "lvdm/distributions.py",
"chars": 3043,
"preview": "import torch\nimport numpy as np\n\n\nclass AbstractDistribution:\n def sample(self):\n raise NotImplementedError()\n"
},
{
"path": "lvdm/ema.py",
"chars": 2982,
"preview": "import torch\nfrom torch import nn\n\n\nclass LitEma(nn.Module):\n def __init__(self, model, decay=0.9999, use_num_upates="
},
{
"path": "lvdm/models/autoencoder.py",
"chars": 8474,
"preview": "import os\nfrom contextlib import contextmanager\nimport torch\nimport numpy as np\nfrom einops import rearrange\nimport torc"
},
{
"path": "lvdm/models/ddpm3d.py",
"chars": 33370,
"preview": "\"\"\"\nwild mixture of\nhttps://github.com/openai/improved-diffusion/blob/e94489283bb876ac1477d5dd7709bbbd2d9902ce/improved_"
},
{
"path": "lvdm/models/samplers/ddim.py",
"chars": 17138,
"preview": "import numpy as np\nfrom tqdm import tqdm\nimport torch\nfrom lvdm.models.utils_diffusion import make_ddim_sampling_paramet"
},
{
"path": "lvdm/models/samplers/ddim_mp.py",
"chars": 17434,
"preview": "import numpy as np\nfrom tqdm import tqdm\nimport torch\nfrom lvdm.models.utils_diffusion import make_ddim_sampling_paramet"
},
{
"path": "lvdm/models/utils_diffusion.py",
"chars": 4604,
"preview": "import math\nimport numpy as np\nfrom einops import repeat\nimport torch\nimport torch.nn.functional as F\n\n\ndef timestep_emb"
},
{
"path": "lvdm/modules/attention.py",
"chars": 19456,
"preview": "from functools import partial\nimport torch\nfrom torch import nn, einsum\nimport torch.nn.functional as F\nfrom einops impo"
},
{
"path": "lvdm/modules/attention_freenoise.py",
"chars": 24040,
"preview": "from functools import partial\nimport torch\nfrom torch import nn, einsum\nimport torch.nn.functional as F\nfrom einops impo"
},
{
"path": "lvdm/modules/encoders/condition.py",
"chars": 14630,
"preview": "import torch\nimport torch.nn as nn\nfrom torch.utils.checkpoint import checkpoint\nimport kornia\nimport open_clip\nfrom tra"
},
{
"path": "lvdm/modules/encoders/ip_resampler.py",
"chars": 4429,
"preview": "# modified from https://github.com/mlfoundations/open_flamingo/blob/main/open_flamingo/src/helpers.py\nimport math\nimport"
},
{
"path": "lvdm/modules/networks/ae_modules.py",
"chars": 34246,
"preview": "# pytorch_diffusion + derived encoder decoder\nimport math\nimport torch\nimport numpy as np\nimport torch.nn as nn\nfrom ein"
},
{
"path": "lvdm/modules/networks/openaimodel3d.py",
"chars": 24165,
"preview": "from functools import partial\nfrom abc import abstractmethod\nimport torch\nimport torch.nn as nn\nfrom einops import rearr"
},
{
"path": "lvdm/modules/networks/openaimodel3d_freenoise.py",
"chars": 24334,
"preview": "from functools import partial\nfrom abc import abstractmethod\nimport torch\nimport torch.nn as nn\nfrom einops import rearr"
},
{
"path": "lvdm/modules/x_transformer.py",
"chars": 20159,
"preview": "\"\"\"shout-out to https://github.com/lucidrains/x-transformers/tree/main/x_transformers\"\"\"\nfrom functools import partial\nf"
},
{
"path": "predict.py",
"chars": 6520,
"preview": "# Prediction interface for Cog ⚙️\n# https://github.com/replicate/cog/blob/main/docs/python.md\n\n\nimport os\nimport sys\nimp"
},
{
"path": "prompts/mp_prompts.txt",
"chars": 228,
"preview": "A bigfoot giving a thumbs up in the snow, towards the camera;A bigfoot waving hands in the snow, towards the camera\nA wo"
},
{
"path": "prompts/single_prompts.txt",
"chars": 100,
"preview": "A chihuahua in astronaut suit floating in space, cinematic lighting, glow effect\nA corgi is swimming"
},
{
"path": "requirements.txt",
"chars": 299,
"preview": "decord==0.6.0\neinops==0.3.0\nimageio==2.9.0\nnumpy==1.24.2\nomegaconf==2.1.1\nopencv_python\npandas==2.0.0\nPillow==9.5.0\npyto"
},
{
"path": "scripts/evaluation/ddp_wrapper.py",
"chars": 1481,
"preview": "import datetime\r\nimport argparse, importlib\r\nfrom pytorch_lightning import seed_everything\r\n\r\nimport torch\r\nimport torch"
},
{
"path": "scripts/evaluation/funcs.py",
"chars": 15434,
"preview": "import os, sys, glob\r\nimport numpy as np\r\nfrom collections import OrderedDict\r\nfrom decord import VideoReader, cpu\r\nimpo"
},
{
"path": "scripts/evaluation/inference.py",
"chars": 6986,
"preview": "import argparse, os, sys, glob, yaml, math, random\r\nimport datetime, time\r\nimport numpy as np\r\nfrom omegaconf import Ome"
},
{
"path": "scripts/evaluation/inference_freenoise.py",
"chars": 7702,
"preview": "import argparse, os, sys, glob, yaml, math, random\r\nimport datetime, time\r\nimport numpy as np\r\nfrom omegaconf import Ome"
},
{
"path": "scripts/evaluation/inference_freenoise_mp.py",
"chars": 8033,
"preview": "import argparse, os, sys, glob, yaml, math, random\r\nimport datetime, time\r\nimport numpy as np\r\nfrom omegaconf import Ome"
},
{
"path": "scripts/run_text2video.sh",
"chars": 498,
"preview": "name=\"base_512_test\"\n\nckpt='checkpoints/base_512_v1/model.ckpt'\nconfig='configs/inference_t2v_tconv512_v1.0.yaml'\n\npromp"
},
{
"path": "scripts/run_text2video_freenoise_1024.sh",
"chars": 568,
"preview": "name=\"base_1024_test\"\n\nckpt='checkpoints/base_1024_v1/model.ckpt'\nconfig='configs/inference_t2v_1024_v1.0_freenoise.yaml"
},
{
"path": "scripts/run_text2video_freenoise_256.sh",
"chars": 569,
"preview": "name=\"base_256_test\"\n\nckpt='checkpoints/base_256_v1/model.ckpt'\nconfig='configs/inference_t2v_tconv256_v1.0_freenoise.ya"
},
{
"path": "scripts/run_text2video_freenoise_512.sh",
"chars": 568,
"preview": "name=\"base_512_test\"\n\nckpt='checkpoints/base_512_v2/model.ckpt'\nconfig='configs/inference_t2v_tconv512_v2.0_freenoise.ya"
},
{
"path": "scripts/run_text2video_freenoise_mp_256.sh",
"chars": 563,
"preview": "name=\"base_256_test\"\n\nckpt='checkpoints/base_256_v1/model.ckpt'\nconfig='configs/inference_t2v_tconv256_v1.0_freenoise.ya"
},
{
"path": "scripts/run_text2video_freenoise_mp_512.sh",
"chars": 563,
"preview": "name=\"base_512_test\"\n\nckpt='checkpoints/base_512_v2/model.ckpt'\nconfig='configs/inference_t2v_tconv512_v2.0_freenoise.ya"
},
{
"path": "utils/utils.py",
"chars": 2171,
"preview": "import importlib\nimport numpy as np\nimport cv2\nimport torch\nimport torch.distributed as dist\n\n\ndef count_params(model, v"
}
]
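An index like the one above is plain JSON, so it can be consumed directly, for example to rank files by size before deciding which to load in full. A minimal sketch; the `index_json` literal here is a three-entry stand-in for the full 42-entry array:

```python
import json

# Stand-in for the full 42-entry array shown above.
index_json = """
[
  {"path": "LICENSE", "chars": 11357, "preview": "Apache License"},
  {"path": "lvdm/models/ddpm3d.py", "chars": 33370, "preview": "wild mixture of ..."},
  {"path": "requirements.txt", "chars": 299, "preview": "decord==0.6.0"}
]
"""

entries = json.loads(index_json)

# Sort by character count, largest first, to prioritise big source files.
largest = sorted(entries, key=lambda e: e["chars"], reverse=True)
for e in largest:
    print(f"{e['chars']:>6}  {e['path']}")
```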
About this extraction
This page contains the full source code of the arthur-qiu/LongerCrafter GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction covers 42 files (332.1 KB, approximately 81.5k tokens) and includes a symbol index of 430 extracted functions, classes, methods, constants, and types.
Extracted by GitExtract, a GitHub-repo-to-text converter by Nikandr Surkov.