Repository: AIGODLIKE/ComfyUI-ToonCrafter Branch: main Commit: 96024189ecb2 Files: 98 Total size: 933.1 KB Directory structure: gitextract_h58eh22a/ ├── .github/ │ └── workflows/ │ └── publish.yml ├── .gitignore ├── LICENSE ├── ToonCrafter/ │ ├── .gitignore │ ├── LICENSE │ ├── README.md │ ├── __init__.py │ ├── cldm/ │ │ ├── cldm.py │ │ ├── ddim_hacked.py │ │ ├── hack.py │ │ ├── logger.py │ │ └── model.py │ ├── configs/ │ │ ├── cldm_v21.yaml │ │ ├── inference_512_v1.0.yaml │ │ ├── training_1024_v1.0/ │ │ │ ├── config.yaml │ │ │ └── run.sh │ │ └── training_512_v1.0/ │ │ ├── config.yaml │ │ └── run.sh │ ├── gradio_app.py │ ├── ldm/ │ │ ├── data/ │ │ │ ├── __init__.py │ │ │ └── util.py │ │ ├── models/ │ │ │ ├── autoencoder.py │ │ │ └── diffusion/ │ │ │ ├── __init__.py │ │ │ ├── ddim.py │ │ │ ├── ddpm.py │ │ │ ├── dpm_solver/ │ │ │ │ ├── __init__.py │ │ │ │ ├── dpm_solver.py │ │ │ │ └── sampler.py │ │ │ ├── plms.py │ │ │ └── sampling_util.py │ │ ├── modules/ │ │ │ ├── attention.py │ │ │ ├── diffusionmodules/ │ │ │ │ ├── __init__.py │ │ │ │ ├── model.py │ │ │ │ ├── openaimodel.py │ │ │ │ ├── upscaling.py │ │ │ │ └── util.py │ │ │ ├── distributions/ │ │ │ │ ├── __init__.py │ │ │ │ └── distributions.py │ │ │ ├── ema.py │ │ │ ├── encoders/ │ │ │ │ ├── __init__.py │ │ │ │ └── modules.py │ │ │ ├── image_degradation/ │ │ │ │ ├── __init__.py │ │ │ │ ├── bsrgan.py │ │ │ │ ├── bsrgan_light.py │ │ │ │ └── utils_image.py │ │ │ └── midas/ │ │ │ ├── __init__.py │ │ │ ├── api.py │ │ │ ├── midas/ │ │ │ │ ├── __init__.py │ │ │ │ ├── base_model.py │ │ │ │ ├── blocks.py │ │ │ │ ├── dpt_depth.py │ │ │ │ ├── midas_net.py │ │ │ │ ├── midas_net_custom.py │ │ │ │ ├── transforms.py │ │ │ │ └── vit.py │ │ │ └── utils.py │ │ └── util.py │ ├── lvdm/ │ │ ├── __init__.py │ │ ├── basics.py │ │ ├── common.py │ │ ├── data/ │ │ │ ├── base.py │ │ │ └── webvid.py │ │ ├── distributions.py │ │ ├── ema.py │ │ ├── models/ │ │ │ ├── autoencoder.py │ │ │ ├── autoencoder_dualref.py │ │ │ ├── ddpm3d.py │ │ │ ├── samplers/ │ │ │ │ ├── ddim.py │ │ │ │ └── ddim_multiplecond.py │ │ │ └── utils_diffusion.py │ │ └── modules/ │ │ ├── attention.py │ │ ├── attention_svd.py │ │ ├── encoders/ │ │ │ ├── condition.py │ │ │ └── resampler.py │ │ ├── networks/ │ │ │ ├── ae_modules.py │ │ │ └── openaimodel3d.py │ │ └── x_transformer.py │ ├── main/ │ │ ├── __init__.py │ │ ├── callbacks.py │ │ ├── trainer.py │ │ ├── utils_data.py │ │ └── utils_train.py │ ├── prompts/ │ │ └── 512_interp/ │ │ └── prompts.txt │ ├── requirements.txt │ ├── scripts/ │ │ ├── evaluation/ │ │ │ ├── ddp_wrapper.py │ │ │ ├── funcs.py │ │ │ └── inference.py │ │ ├── gradio/ │ │ │ ├── i2v_test.py │ │ │ └── i2v_test_application.py │ │ └── run.sh │ └── utils/ │ ├── __init__.py │ ├── save_video.py │ └── utils.py ├── __init__.py ├── pre_run.py ├── pyproject.toml ├── readme.md └── requirements.txt ================================================ FILE CONTENTS ================================================ ================================================ FILE: .github/workflows/publish.yml ================================================ name: Publish to Comfy registry on: workflow_dispatch: push: branches: - main - master paths: - "pyproject.toml" jobs: publish-node: name: Publish Custom Node to registry runs-on: ubuntu-latest steps: - name: Check out code uses: actions/checkout@v4 - name: Publish Custom Node uses: Comfy-Org/publish-node-action@main with: ## Add your own personal access token to your Github Repository secrets and reference it here. personal_access_token: ${{ secrets.REGISTRY_ACCESS_TOKEN }} ================================================ FILE: .gitignore ================================================ .DS_Store *pyc .vscode __pycache__ *.egg-info checkpoints ToonCrafter/checkpoints results backup LOG /models ToonCrafter/tmp Thumbs.db ================================================ FILE: LICENSE ================================================ Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. Copyright Tencent Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ================================================ FILE: ToonCrafter/.gitignore ================================================ .DS_Store *pyc .vscode __pycache__ *.egg-info checkpoints results backup LOG ================================================ FILE: ToonCrafter/LICENSE ================================================ Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. Copyright Tencent Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ================================================ FILE: ToonCrafter/README.md ================================================ ## ___***ToonCrafter: Generative Cartoon Interpolation***___
## 🔆 Introduction ⚠️ Please check our [disclaimer](#disc) first. 🤗 ToonCrafter can interpolate two cartoon images by leveraging the pre-trained image-to-video diffusion priors. Please check our project page and paper for more information.
### 1.1 Showcases (512x320)
Input starting frame Input ending frame Generated video
### 1.2 Sparse sketch guidance
Input starting frame Input ending frame Input sketch guidance Generated video
### 2. Applications #### 2.1 Cartoon Sketch Interpolation (see project page for more details)
Input starting frame Input ending frame Generated video
#### 2.2 Reference-based Sketch Colorization
Input sketch Input reference Colorization results
## 📝 Changelog - [ ] Add sketch control and colorization function. - __[2024.05.29]__: 🔥🔥 Release code and model weights. - __[2024.05.28]__: Launch the project page and update the arXiv preprint.
## 🧰 Models |Model|Resolution|GPU Mem. & Inference Time (A100, ddim 50steps)|Checkpoint| |:---------|:---------|:--------|:--------| |ToonCrafter_512|320x512| TBD (`perframe_ae=True`)|[Hugging Face](https://huggingface.co/Doubiiu/ToonCrafter/blob/main/model.ckpt)| Currently, our ToonCrafter can support generating videos of up to 16 frames with a resolution of 512x320. The inference time can be reduced by using fewer DDIM steps. ## ⚙️ Setup ### Install Environment via Anaconda (Recommended) ```bash conda create -n tooncrafter python=3.8.5 conda activate tooncrafter pip install -r requirements.txt ``` ## 💫 Inference ### 1. Command line Download pretrained ToonCrafter_512 and put the `model.ckpt` in `checkpoints/tooncrafter_512_interp_v1/model.ckpt`. ```bash sh scripts/run.sh ``` ### 2. Local Gradio demo Download the pretrained model and put it in the corresponding directory according to the previous guidelines. ```bash python gradio_app.py ``` ## 📢 Disclaimer Calm down. Our framework opens up the era of generative cartoon interpolation, but due to the variaity of generative video prior, the success rate is not guaranteed. ⚠️This is an open-source research exploration, instead of commercial products. It can't meet all your expectations. This project strives to impact the domain of AI-driven video generation positively. Users are granted the freedom to create videos using this tool, but they are expected to comply with local laws and utilize it responsibly. The developers do not assume any responsibility for potential misuse by users. **** ================================================ FILE: ToonCrafter/__init__.py ================================================ import sys from pathlib import Path sys.path.append(Path(__file__).parent.as_posix()) ================================================ FILE: ToonCrafter/cldm/cldm.py ================================================ import einops import torch import torch as th import torch.nn as nn from ToonCrafter.ldm.modules.diffusionmodules.util import ( conv_nd, linear, zero_module, timestep_embedding, ) from einops import rearrange, repeat from torchvision.utils import make_grid from ToonCrafter.ldm.modules.attention import SpatialTransformer from ToonCrafter.ldm.modules.diffusionmodules.openaimodel import TimestepEmbedSequential, ResBlock, Downsample, AttentionBlock from lvdm.modules.networks.openaimodel3d import UNetModel from ToonCrafter.ldm.models.diffusion.ddpm import LatentDiffusion from ToonCrafter.ldm.util import log_txt_as_img, exists, instantiate_from_config from ToonCrafter.ldm.models.diffusion.ddim import DDIMSampler class ControlledUnetModel(UNetModel): def forward(self, x, timesteps, context=None, features_adapter=None, fs=None, control = None, **kwargs): b,_,t,_,_ = x.shape t_emb = timestep_embedding(timesteps, self.model_channels, repeat_only=False).type(x.dtype) emb = self.time_embed(t_emb) ## repeat t times for context [(b t) 77 768] & time embedding ## check if we use per-frame image conditioning _, l_context, _ = context.shape if l_context == 77 + t*16: ## !!! HARD CODE here context_text, context_img = context[:,:77,:], context[:,77:,:] context_text = context_text.repeat_interleave(repeats=t, dim=0) context_img = rearrange(context_img, 'b (t l) c -> (b t) l c', t=t) context = torch.cat([context_text, context_img], dim=1) else: context = context.repeat_interleave(repeats=t, dim=0) emb = emb.repeat_interleave(repeats=t, dim=0) ## always in shape (b t) c h w, except for temporal layer x = rearrange(x, 'b c t h w -> (b t) c h w') ## combine emb if self.fs_condition: if fs is None: fs = torch.tensor( [self.default_fs] * b, dtype=torch.long, device=x.device) fs_emb = timestep_embedding(fs, self.model_channels, repeat_only=False).type(x.dtype) fs_embed = self.fps_embedding(fs_emb) fs_embed = fs_embed.repeat_interleave(repeats=t, dim=0) emb = emb + fs_embed h = x.type(self.dtype) adapter_idx = 0 hs = [] with torch.no_grad(): for id, module in enumerate(self.input_blocks): h = module(h, emb, context=context, batch_size=b) if id ==0 and self.addition_attention: h = self.init_attn(h, emb, context=context, batch_size=b) ## plug-in adapter features if ((id+1)%3 == 0) and features_adapter is not None: h = h + features_adapter[adapter_idx] adapter_idx += 1 hs.append(h) if features_adapter is not None: assert len(features_adapter)==adapter_idx, 'Wrong features_adapter' h = self.middle_block(h, emb, context=context, batch_size=b) if control is not None: h += control.pop() for module in self.output_blocks: if control is None: h = torch.cat([h, hs.pop()], dim=1) else: h = torch.cat([h, hs.pop() + control.pop()], dim=1) h = module(h, emb, context=context, batch_size=b) h = h.type(x.dtype) y = self.out(h) # reshape back to (b c t h w) y = rearrange(y, '(b t) c h w -> b c t h w', b=b) return y class ControlNet(nn.Module): def __init__( self, image_size, in_channels, model_channels, hint_channels, num_res_blocks, attention_resolutions, dropout=0, channel_mult=(1, 2, 4, 8), conv_resample=True, dims=2, use_checkpoint=False, use_fp16=False, num_heads=-1, num_head_channels=-1, num_heads_upsample=-1, use_scale_shift_norm=False, resblock_updown=False, use_new_attention_order=False, use_spatial_transformer=False, # custom transformer support transformer_depth=1, # custom transformer support context_dim=None, # custom transformer support n_embed=None, # custom support for prediction of discrete ids into codebook of first stage vq model legacy=True, disable_self_attentions=None, num_attention_blocks=None, disable_middle_self_attn=False, use_linear_in_transformer=False, ): super().__init__() if use_spatial_transformer: assert context_dim is not None, 'Fool!! You forgot to include the dimension of your cross-attention conditioning...' if context_dim is not None: assert use_spatial_transformer, 'Fool!! You forgot to use the spatial transformer for your cross-attention conditioning...' from omegaconf.listconfig import ListConfig if type(context_dim) == ListConfig: context_dim = list(context_dim) if num_heads_upsample == -1: num_heads_upsample = num_heads if num_heads == -1: assert num_head_channels != -1, 'Either num_heads or num_head_channels has to be set' if num_head_channels == -1: assert num_heads != -1, 'Either num_heads or num_head_channels has to be set' self.dims = dims self.image_size = image_size self.in_channels = in_channels self.model_channels = model_channels if isinstance(num_res_blocks, int): self.num_res_blocks = len(channel_mult) * [num_res_blocks] else: if len(num_res_blocks) != len(channel_mult): raise ValueError("provide num_res_blocks either as an int (globally constant) or " "as a list/tuple (per-level) with the same length as channel_mult") self.num_res_blocks = num_res_blocks if disable_self_attentions is not None: # should be a list of booleans, indicating whether to disable self-attention in TransformerBlocks or not assert len(disable_self_attentions) == len(channel_mult) if num_attention_blocks is not None: assert len(num_attention_blocks) == len(self.num_res_blocks) assert all(map(lambda i: self.num_res_blocks[i] >= num_attention_blocks[i], range(len(num_attention_blocks)))) print(f"Constructor of UNetModel received num_attention_blocks={num_attention_blocks}. " f"This option has LESS priority than attention_resolutions {attention_resolutions}, " f"i.e., in cases where num_attention_blocks[i] > 0 but 2**i not in attention_resolutions, " f"attention will still not be set.") self.attention_resolutions = attention_resolutions self.dropout = dropout self.channel_mult = channel_mult self.conv_resample = conv_resample self.use_checkpoint = use_checkpoint self.dtype = th.float16 if use_fp16 else th.float32 self.num_heads = num_heads self.num_head_channels = num_head_channels self.num_heads_upsample = num_heads_upsample self.predict_codebook_ids = n_embed is not None time_embed_dim = model_channels * 4 self.time_embed = nn.Sequential( linear(model_channels, time_embed_dim), nn.SiLU(), linear(time_embed_dim, time_embed_dim), ) self.input_blocks = nn.ModuleList( [ TimestepEmbedSequential( conv_nd(dims, in_channels, model_channels, 3, padding=1) ) ] ) self.zero_convs = nn.ModuleList([self.make_zero_conv(model_channels)]) self.input_hint_block = TimestepEmbedSequential( conv_nd(dims, hint_channels, 16, 3, padding=1), nn.SiLU(), conv_nd(dims, 16, 16, 3, padding=1), nn.SiLU(), conv_nd(dims, 16, 32, 3, padding=1, stride=2), nn.SiLU(), conv_nd(dims, 32, 32, 3, padding=1), nn.SiLU(), conv_nd(dims, 32, 96, 3, padding=1, stride=2), nn.SiLU(), conv_nd(dims, 96, 96, 3, padding=1), nn.SiLU(), conv_nd(dims, 96, 256, 3, padding=1, stride=2), nn.SiLU(), zero_module(conv_nd(dims, 256, model_channels, 3, padding=1)) ) self._feature_size = model_channels input_block_chans = [model_channels] ch = model_channels ds = 1 for level, mult in enumerate(channel_mult): for nr in range(self.num_res_blocks[level]): layers = [ ResBlock( ch, time_embed_dim, dropout, out_channels=mult * model_channels, dims=dims, use_checkpoint=use_checkpoint, use_scale_shift_norm=use_scale_shift_norm, ) ] ch = mult * model_channels if ds in attention_resolutions: if num_head_channels == -1: dim_head = ch // num_heads else: num_heads = ch // num_head_channels dim_head = num_head_channels if legacy: # num_heads = 1 dim_head = ch // num_heads if use_spatial_transformer else num_head_channels if exists(disable_self_attentions): disabled_sa = disable_self_attentions[level] else: disabled_sa = False if not exists(num_attention_blocks) or nr < num_attention_blocks[level]: layers.append( AttentionBlock( ch, use_checkpoint=use_checkpoint, num_heads=num_heads, num_head_channels=dim_head, use_new_attention_order=use_new_attention_order, ) if not use_spatial_transformer else SpatialTransformer( ch, num_heads, dim_head, depth=transformer_depth, context_dim=context_dim, disable_self_attn=disabled_sa, use_linear=use_linear_in_transformer, use_checkpoint=use_checkpoint ) ) self.input_blocks.append(TimestepEmbedSequential(*layers)) self.zero_convs.append(self.make_zero_conv(ch)) self._feature_size += ch input_block_chans.append(ch) if level != len(channel_mult) - 1: out_ch = ch self.input_blocks.append( TimestepEmbedSequential( ResBlock( ch, time_embed_dim, dropout, out_channels=out_ch, dims=dims, use_checkpoint=use_checkpoint, use_scale_shift_norm=use_scale_shift_norm, down=True, ) if resblock_updown else Downsample( ch, conv_resample, dims=dims, out_channels=out_ch ) ) ) ch = out_ch input_block_chans.append(ch) self.zero_convs.append(self.make_zero_conv(ch)) ds *= 2 self._feature_size += ch if num_head_channels == -1: dim_head = ch // num_heads else: num_heads = ch // num_head_channels dim_head = num_head_channels if legacy: # num_heads = 1 dim_head = ch // num_heads if use_spatial_transformer else num_head_channels self.middle_block = TimestepEmbedSequential( ResBlock( ch, time_embed_dim, dropout, dims=dims, use_checkpoint=use_checkpoint, use_scale_shift_norm=use_scale_shift_norm, ), AttentionBlock( ch, use_checkpoint=use_checkpoint, num_heads=num_heads, num_head_channels=dim_head, use_new_attention_order=use_new_attention_order, ) if not use_spatial_transformer else SpatialTransformer( # always uses a self-attn ch, num_heads, dim_head, depth=transformer_depth, context_dim=context_dim, disable_self_attn=disable_middle_self_attn, use_linear=use_linear_in_transformer, use_checkpoint=use_checkpoint ), ResBlock( ch, time_embed_dim, dropout, dims=dims, use_checkpoint=use_checkpoint, use_scale_shift_norm=use_scale_shift_norm, ), ) self.middle_block_out = self.make_zero_conv(ch) self._feature_size += ch def make_zero_conv(self, channels): return TimestepEmbedSequential(zero_module(conv_nd(self.dims, channels, channels, 1, padding=0))) def forward(self, x, hint, timesteps, context, **kwargs): t_emb = timestep_embedding(timesteps, self.model_channels, repeat_only=False) emb = self.time_embed(t_emb) guided_hint = self.input_hint_block(hint, emb, context) outs = [] h = x.type(self.dtype) for module, zero_conv in zip(self.input_blocks, self.zero_convs): if guided_hint is not None: h = module(h, emb, context) h += guided_hint guided_hint = None else: h = module(h, emb, context) outs.append(zero_conv(h, emb, context, True)) h = self.middle_block(h, emb, context) outs.append(self.middle_block_out(h, emb, context)) return outs class ControlLDM(LatentDiffusion): def __init__(self, control_stage_config, control_key, only_mid_control, *args, **kwargs): super().__init__(*args, **kwargs) self.control_model = instantiate_from_config(control_stage_config) self.control_key = control_key self.only_mid_control = only_mid_control self.control_scales = [1.0] * 13 @torch.no_grad() def get_input(self, batch, k, bs=None, *args, **kwargs): x, c = super().get_input(batch, self.first_stage_key, *args, **kwargs) control = batch[self.control_key] if bs is not None: control = control[:bs] control = control.to(self.device) control = einops.rearrange(control, 'b h w c -> b c h w') control = control.to(memory_format=torch.contiguous_format).float() return x, dict(c_crossattn=[c], c_concat=[control]) def apply_model(self, x_noisy, t, cond, *args, **kwargs): assert isinstance(cond, dict) diffusion_model = self.model.diffusion_model cond_txt = torch.cat(cond['c_crossattn'], 1) if cond['c_concat'] is None: eps = diffusion_model(x=x_noisy, timesteps=t, context=cond_txt, control=None, only_mid_control=self.only_mid_control) else: control = self.control_model(x=x_noisy, hint=torch.cat(cond['c_concat'], 1), timesteps=t, context=cond_txt) control = [c * scale for c, scale in zip(control, self.control_scales)] eps = diffusion_model(x=x_noisy, timesteps=t, context=cond_txt, control=control, only_mid_control=self.only_mid_control) return eps @torch.no_grad() def get_unconditional_conditioning(self, N): return self.get_learned_conditioning([""] * N) @torch.no_grad() def log_images(self, batch, N=4, n_row=2, sample=False, ddim_steps=50, ddim_eta=0.0, return_keys=None, quantize_denoised=True, inpaint=True, plot_denoise_rows=False, plot_progressive_rows=True, plot_diffusion_rows=False, unconditional_guidance_scale=9.0, unconditional_guidance_label=None, use_ema_scope=True, **kwargs): use_ddim = ddim_steps is not None log = dict() z, c = self.get_input(batch, self.first_stage_key, bs=N) c_cat, c = c["c_concat"][0][:N], c["c_crossattn"][0][:N] N = min(z.shape[0], N) n_row = min(z.shape[0], n_row) log["reconstruction"] = self.decode_first_stage(z) log["control"] = c_cat * 2.0 - 1.0 log["conditioning"] = log_txt_as_img((512, 512), batch[self.cond_stage_key], size=16) if plot_diffusion_rows: # get diffusion row diffusion_row = list() z_start = z[:n_row] for t in range(self.num_timesteps): if t % self.log_every_t == 0 or t == self.num_timesteps - 1: t = repeat(torch.tensor([t]), '1 -> b', b=n_row) t = t.to(self.device).long() noise = torch.randn_like(z_start) z_noisy = self.q_sample(x_start=z_start, t=t, noise=noise) diffusion_row.append(self.decode_first_stage(z_noisy)) diffusion_row = torch.stack(diffusion_row) # n_log_step, n_row, C, H, W diffusion_grid = rearrange(diffusion_row, 'n b c h w -> b n c h w') diffusion_grid = rearrange(diffusion_grid, 'b n c h w -> (b n) c h w') diffusion_grid = make_grid(diffusion_grid, nrow=diffusion_row.shape[0]) log["diffusion_row"] = diffusion_grid if sample: # get denoise row samples, z_denoise_row = self.sample_log(cond={"c_concat": [c_cat], "c_crossattn": [c]}, batch_size=N, ddim=use_ddim, ddim_steps=ddim_steps, eta=ddim_eta) x_samples = self.decode_first_stage(samples) log["samples"] = x_samples if plot_denoise_rows: denoise_grid = self._get_denoise_row_from_list(z_denoise_row) log["denoise_row"] = denoise_grid if unconditional_guidance_scale > 1.0: uc_cross = self.get_unconditional_conditioning(N) uc_cat = c_cat # torch.zeros_like(c_cat) uc_full = {"c_concat": [uc_cat], "c_crossattn": [uc_cross]} samples_cfg, _ = self.sample_log(cond={"c_concat": [c_cat], "c_crossattn": [c]}, batch_size=N, ddim=use_ddim, ddim_steps=ddim_steps, eta=ddim_eta, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=uc_full, ) x_samples_cfg = self.decode_first_stage(samples_cfg) log[f"samples_cfg_scale_{unconditional_guidance_scale:.2f}"] = x_samples_cfg return log @torch.no_grad() def sample_log(self, cond, batch_size, ddim, ddim_steps, **kwargs): ddim_sampler = DDIMSampler(self) b, c, h, w = cond["c_concat"][0].shape shape = (self.channels, h // 8, w // 8) samples, intermediates = ddim_sampler.sample(ddim_steps, batch_size, shape, cond, verbose=False, **kwargs) return samples, intermediates def configure_optimizers(self): lr = self.learning_rate params = list(self.control_model.parameters()) if not self.sd_locked: params += list(self.model.diffusion_model.output_blocks.parameters()) params += list(self.model.diffusion_model.out.parameters()) opt = torch.optim.AdamW(params, lr=lr) return opt def low_vram_shift(self, is_diffusing): if is_diffusing: self.model = self.model.cuda() self.control_model = self.control_model.cuda() self.first_stage_model = self.first_stage_model.cpu() self.cond_stage_model = self.cond_stage_model.cpu() else: self.model = self.model.cpu() self.control_model = self.control_model.cpu() self.first_stage_model = self.first_stage_model.cuda() self.cond_stage_model = self.cond_stage_model.cuda() ================================================ FILE: ToonCrafter/cldm/ddim_hacked.py ================================================ """SAMPLING ONLY.""" import torch import numpy as np from tqdm import tqdm from ToonCrafter.ldm.modules.diffusionmodules.util import make_ddim_sampling_parameters, make_ddim_timesteps, noise_like, extract_into_tensor class DDIMSampler(object): def __init__(self, model, schedule="linear", **kwargs): super().__init__() self.model = model self.ddpm_num_timesteps = model.num_timesteps self.schedule = schedule def register_buffer(self, name, attr): if type(attr) == torch.Tensor: if attr.device != torch.device("cuda"): attr = attr.to(torch.device("cuda")) setattr(self, name, attr) def make_schedule(self, ddim_num_steps, ddim_discretize="uniform", ddim_eta=0., verbose=True): self.ddim_timesteps = make_ddim_timesteps(ddim_discr_method=ddim_discretize, num_ddim_timesteps=ddim_num_steps, num_ddpm_timesteps=self.ddpm_num_timesteps,verbose=verbose) alphas_cumprod = self.model.alphas_cumprod assert alphas_cumprod.shape[0] == self.ddpm_num_timesteps, 'alphas have to be defined for each timestep' to_torch = lambda x: x.clone().detach().to(torch.float32).to(self.model.device) self.register_buffer('betas', to_torch(self.model.betas)) self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod)) self.register_buffer('alphas_cumprod_prev', to_torch(self.model.alphas_cumprod_prev)) # calculations for diffusion q(x_t | x_{t-1}) and others self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod.cpu()))) self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod.cpu()))) self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod.cpu()))) self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu()))) self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu() - 1))) # ddim sampling parameters ddim_sigmas, ddim_alphas, ddim_alphas_prev = make_ddim_sampling_parameters(alphacums=alphas_cumprod.cpu(), ddim_timesteps=self.ddim_timesteps, eta=ddim_eta,verbose=verbose) self.register_buffer('ddim_sigmas', ddim_sigmas) self.register_buffer('ddim_alphas', ddim_alphas) self.register_buffer('ddim_alphas_prev', ddim_alphas_prev) self.register_buffer('ddim_sqrt_one_minus_alphas', np.sqrt(1. - ddim_alphas)) sigmas_for_original_sampling_steps = ddim_eta * torch.sqrt( (1 - self.alphas_cumprod_prev) / (1 - self.alphas_cumprod) * ( 1 - self.alphas_cumprod / self.alphas_cumprod_prev)) self.register_buffer('ddim_sigmas_for_original_num_steps', sigmas_for_original_sampling_steps) @torch.no_grad() def sample(self, S, batch_size, shape, conditioning=None, callback=None, normals_sequence=None, img_callback=None, quantize_x0=False, eta=0., mask=None, x0=None, temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None, verbose=True, x_T=None, log_every_t=100, unconditional_guidance_scale=1., unconditional_conditioning=None, # this has to come in the same format as the conditioning, # e.g. as encoded tokens, ... dynamic_threshold=None, ucg_schedule=None, **kwargs ): if conditioning is not None: if isinstance(conditioning, dict): ctmp = conditioning[list(conditioning.keys())[0]] while isinstance(ctmp, list): ctmp = ctmp[0] cbs = ctmp.shape[0] if cbs != batch_size: print(f"Warning: Got {cbs} conditionings but batch-size is {batch_size}") elif isinstance(conditioning, list): for ctmp in conditioning: if ctmp.shape[0] != batch_size: print(f"Warning: Got {cbs} conditionings but batch-size is {batch_size}") else: if conditioning.shape[0] != batch_size: print(f"Warning: Got {conditioning.shape[0]} conditionings but batch-size is {batch_size}") self.make_schedule(ddim_num_steps=S, ddim_eta=eta, verbose=verbose) # sampling C, H, W = shape size = (batch_size, C, H, W) print(f'Data shape for DDIM sampling is {size}, eta {eta}') samples, intermediates = self.ddim_sampling(conditioning, size, callback=callback, img_callback=img_callback, quantize_denoised=quantize_x0, mask=mask, x0=x0, ddim_use_original_steps=False, noise_dropout=noise_dropout, temperature=temperature, score_corrector=score_corrector, corrector_kwargs=corrector_kwargs, x_T=x_T, log_every_t=log_every_t, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=unconditional_conditioning, dynamic_threshold=dynamic_threshold, ucg_schedule=ucg_schedule ) return samples, intermediates @torch.no_grad() def ddim_sampling(self, cond, shape, x_T=None, ddim_use_original_steps=False, callback=None, timesteps=None, quantize_denoised=False, mask=None, x0=None, img_callback=None, log_every_t=100, temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None, unconditional_guidance_scale=1., unconditional_conditioning=None, dynamic_threshold=None, ucg_schedule=None): device = self.model.betas.device b = shape[0] if x_T is None: img = torch.randn(shape, device=device) else: img = x_T if timesteps is None: timesteps = self.ddpm_num_timesteps if ddim_use_original_steps else self.ddim_timesteps elif timesteps is not None and not ddim_use_original_steps: subset_end = int(min(timesteps / self.ddim_timesteps.shape[0], 1) * self.ddim_timesteps.shape[0]) - 1 timesteps = self.ddim_timesteps[:subset_end] intermediates = {'x_inter': [img], 'pred_x0': [img]} time_range = reversed(range(0,timesteps)) if ddim_use_original_steps else np.flip(timesteps) total_steps = timesteps if ddim_use_original_steps else timesteps.shape[0] print(f"Running DDIM Sampling with {total_steps} timesteps") iterator = tqdm(time_range, desc='DDIM Sampler', total=total_steps) for i, step in enumerate(iterator): index = total_steps - i - 1 ts = torch.full((b,), step, device=device, dtype=torch.long) if mask is not None: assert x0 is not None img_orig = self.model.q_sample(x0, ts) # TODO: deterministic forward pass? img = img_orig * mask + (1. - mask) * img if ucg_schedule is not None: assert len(ucg_schedule) == len(time_range) unconditional_guidance_scale = ucg_schedule[i] outs = self.p_sample_ddim(img, cond, ts, index=index, use_original_steps=ddim_use_original_steps, quantize_denoised=quantize_denoised, temperature=temperature, noise_dropout=noise_dropout, score_corrector=score_corrector, corrector_kwargs=corrector_kwargs, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=unconditional_conditioning, dynamic_threshold=dynamic_threshold) img, pred_x0 = outs if callback: callback(i) if img_callback: img_callback(pred_x0, i) if index % log_every_t == 0 or index == total_steps - 1: intermediates['x_inter'].append(img) intermediates['pred_x0'].append(pred_x0) return img, intermediates @torch.no_grad() def p_sample_ddim(self, x, c, t, index, repeat_noise=False, use_original_steps=False, quantize_denoised=False, temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None, unconditional_guidance_scale=1., unconditional_conditioning=None, dynamic_threshold=None): b, *_, device = *x.shape, x.device if unconditional_conditioning is None or unconditional_guidance_scale == 1.: model_output = self.model.apply_model(x, t, c) else: model_t = self.model.apply_model(x, t, c) model_uncond = self.model.apply_model(x, t, unconditional_conditioning) model_output = model_uncond + unconditional_guidance_scale * (model_t - model_uncond) if self.model.parameterization == "v": e_t = self.model.predict_eps_from_z_and_v(x, t, model_output) else: e_t = model_output if score_corrector is not None: assert self.model.parameterization == "eps", 'not implemented' e_t = score_corrector.modify_score(self.model, e_t, x, t, c, **corrector_kwargs) alphas = self.model.alphas_cumprod if use_original_steps else self.ddim_alphas alphas_prev = self.model.alphas_cumprod_prev if use_original_steps else self.ddim_alphas_prev sqrt_one_minus_alphas = self.model.sqrt_one_minus_alphas_cumprod if use_original_steps else self.ddim_sqrt_one_minus_alphas sigmas = self.model.ddim_sigmas_for_original_num_steps if use_original_steps else self.ddim_sigmas # select parameters corresponding to the currently considered timestep a_t = torch.full((b, 1, 1, 1), alphas[index], device=device) a_prev = torch.full((b, 1, 1, 1), alphas_prev[index], device=device) sigma_t = torch.full((b, 1, 1, 1), sigmas[index], device=device) sqrt_one_minus_at = torch.full((b, 1, 1, 1), sqrt_one_minus_alphas[index],device=device) # current prediction for x_0 if self.model.parameterization != "v": pred_x0 = (x - sqrt_one_minus_at * e_t) / a_t.sqrt() else: pred_x0 = self.model.predict_start_from_z_and_v(x, t, model_output) if quantize_denoised: pred_x0, _, *_ = self.model.first_stage_model.quantize(pred_x0) if dynamic_threshold is not None: raise NotImplementedError() # direction pointing to x_t dir_xt = (1. - a_prev - sigma_t**2).sqrt() * e_t noise = sigma_t * noise_like(x.shape, device, repeat_noise) * temperature if noise_dropout > 0.: noise = torch.nn.functional.dropout(noise, p=noise_dropout) x_prev = a_prev.sqrt() * pred_x0 + dir_xt + noise return x_prev, pred_x0 @torch.no_grad() def encode(self, x0, c, t_enc, use_original_steps=False, return_intermediates=None, unconditional_guidance_scale=1.0, unconditional_conditioning=None, callback=None): timesteps = np.arange(self.ddpm_num_timesteps) if use_original_steps else self.ddim_timesteps num_reference_steps = timesteps.shape[0] assert t_enc <= num_reference_steps num_steps = t_enc if use_original_steps: alphas_next = self.alphas_cumprod[:num_steps] alphas = self.alphas_cumprod_prev[:num_steps] else: alphas_next = self.ddim_alphas[:num_steps] alphas = torch.tensor(self.ddim_alphas_prev[:num_steps]) x_next = x0 intermediates = [] inter_steps = [] for i in tqdm(range(num_steps), desc='Encoding Image'): t = torch.full((x0.shape[0],), timesteps[i], device=self.model.device, dtype=torch.long) if unconditional_guidance_scale == 1.: noise_pred = self.model.apply_model(x_next, t, c) else: assert unconditional_conditioning is not None e_t_uncond, noise_pred = torch.chunk( self.model.apply_model(torch.cat((x_next, x_next)), torch.cat((t, t)), torch.cat((unconditional_conditioning, c))), 2) noise_pred = e_t_uncond + unconditional_guidance_scale * (noise_pred - e_t_uncond) xt_weighted = (alphas_next[i] / alphas[i]).sqrt() * x_next weighted_noise_pred = alphas_next[i].sqrt() * ( (1 / alphas_next[i] - 1).sqrt() - (1 / alphas[i] - 1).sqrt()) * noise_pred x_next = xt_weighted + weighted_noise_pred if return_intermediates and i % ( num_steps // return_intermediates) == 0 and i < num_steps - 1: intermediates.append(x_next) inter_steps.append(i) elif return_intermediates and i >= num_steps - 2: intermediates.append(x_next) inter_steps.append(i) if callback: callback(i) out = {'x_encoded': x_next, 'intermediate_steps': inter_steps} if return_intermediates: out.update({'intermediates': intermediates}) return x_next, out @torch.no_grad() def stochastic_encode(self, x0, t, use_original_steps=False, noise=None): # fast, but does not allow for exact reconstruction # t serves as an index to gather the correct alphas if use_original_steps: sqrt_alphas_cumprod = self.sqrt_alphas_cumprod sqrt_one_minus_alphas_cumprod = self.sqrt_one_minus_alphas_cumprod else: sqrt_alphas_cumprod = torch.sqrt(self.ddim_alphas) sqrt_one_minus_alphas_cumprod = self.ddim_sqrt_one_minus_alphas if noise is None: noise = torch.randn_like(x0) return (extract_into_tensor(sqrt_alphas_cumprod, t, x0.shape) * x0 + extract_into_tensor(sqrt_one_minus_alphas_cumprod, t, x0.shape) * noise) @torch.no_grad() def decode(self, x_latent, cond, t_start, unconditional_guidance_scale=1.0, unconditional_conditioning=None, use_original_steps=False, callback=None): timesteps = np.arange(self.ddpm_num_timesteps) if use_original_steps else self.ddim_timesteps timesteps = timesteps[:t_start] time_range = np.flip(timesteps) total_steps = timesteps.shape[0] print(f"Running DDIM Sampling with {total_steps} timesteps") iterator = tqdm(time_range, desc='Decoding image', total=total_steps) x_dec = x_latent for i, step in enumerate(iterator): index = total_steps - i - 1 ts = torch.full((x_latent.shape[0],), step, device=x_latent.device, dtype=torch.long) x_dec, _ = self.p_sample_ddim(x_dec, cond, ts, index=index, use_original_steps=use_original_steps, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=unconditional_conditioning) if callback: callback(i) return x_dec ================================================ FILE: ToonCrafter/cldm/hack.py ================================================ import torch import einops import ldm.modules.encoders.modules import ldm.modules.attention from transformers import logging from ToonCrafter.ldm.modules.attention import default def disable_verbosity(): logging.set_verbosity_error() print('logging improved.') return def enable_sliced_attention(): ldm.modules.attention.CrossAttention.forward = _hacked_sliced_attentin_forward print('Enabled sliced_attention.') return def hack_everything(clip_skip=0): disable_verbosity() ldm.modules.encoders.modules.FrozenCLIPEmbedder.forward = _hacked_clip_forward ldm.modules.encoders.modules.FrozenCLIPEmbedder.clip_skip = clip_skip print('Enabled clip hacks.') return # Written by Lvmin def _hacked_clip_forward(self, text): PAD = self.tokenizer.pad_token_id EOS = self.tokenizer.eos_token_id BOS = self.tokenizer.bos_token_id def tokenize(t): return self.tokenizer(t, truncation=False, add_special_tokens=False)["input_ids"] def transformer_encode(t): if self.clip_skip > 1: rt = self.transformer(input_ids=t, output_hidden_states=True) return self.transformer.text_model.final_layer_norm(rt.hidden_states[-self.clip_skip]) else: return self.transformer(input_ids=t, output_hidden_states=False).last_hidden_state def split(x): return x[75 * 0: 75 * 1], x[75 * 1: 75 * 2], x[75 * 2: 75 * 3] def pad(x, p, i): return x[:i] if len(x) >= i else x + [p] * (i - len(x)) raw_tokens_list = tokenize(text) tokens_list = [] for raw_tokens in raw_tokens_list: raw_tokens_123 = split(raw_tokens) raw_tokens_123 = [[BOS] + raw_tokens_i + [EOS] for raw_tokens_i in raw_tokens_123] raw_tokens_123 = [pad(raw_tokens_i, PAD, 77) for raw_tokens_i in raw_tokens_123] tokens_list.append(raw_tokens_123) tokens_list = torch.IntTensor(tokens_list).to(self.device) feed = einops.rearrange(tokens_list, 'b f i -> (b f) i') y = transformer_encode(feed) z = einops.rearrange(y, '(b f) i c -> b (f i) c', f=3) return z # Stolen from https://github.com/basujindal/stable-diffusion/blob/main/optimizedSD/splitAttention.py def _hacked_sliced_attentin_forward(self, x, context=None, mask=None): h = self.heads q = self.to_q(x) context = default(context, x) k = self.to_k(context) v = self.to_v(context) del context, x q, k, v = map(lambda t: einops.rearrange(t, 'b n (h d) -> (b h) n d', h=h), (q, k, v)) limit = k.shape[0] att_step = 1 q_chunks = list(torch.tensor_split(q, limit // att_step, dim=0)) k_chunks = list(torch.tensor_split(k, limit // att_step, dim=0)) v_chunks = list(torch.tensor_split(v, limit // att_step, dim=0)) q_chunks.reverse() k_chunks.reverse() v_chunks.reverse() sim = torch.zeros(q.shape[0], q.shape[1], v.shape[2], device=q.device) del k, q, v for i in range(0, limit, att_step): q_buffer = q_chunks.pop() k_buffer = k_chunks.pop() v_buffer = v_chunks.pop() sim_buffer = torch.einsum('b i d, b j d -> b i j', q_buffer, k_buffer) * self.scale del k_buffer, q_buffer # attention, what we cannot get enough of, by chunks sim_buffer = sim_buffer.softmax(dim=-1) sim_buffer = torch.einsum('b i j, b j d -> b i d', sim_buffer, v_buffer) del v_buffer sim[i:i + att_step, :, :] = sim_buffer del sim_buffer sim = einops.rearrange(sim, '(b h) n d -> b n (h d)', h=h) return self.to_out(sim) ================================================ FILE: ToonCrafter/cldm/logger.py ================================================ import os import numpy as np import torch import torchvision from PIL import Image from pytorch_lightning.callbacks import Callback from pytorch_lightning.utilities.distributed import rank_zero_only class ImageLogger(Callback): def __init__(self, batch_frequency=2000, max_images=4, clamp=True, increase_log_steps=True, rescale=True, disabled=False, log_on_batch_idx=False, log_first_step=False, log_images_kwargs=None): super().__init__() self.rescale = rescale self.batch_freq = batch_frequency self.max_images = max_images if not increase_log_steps: self.log_steps = [self.batch_freq] self.clamp = clamp self.disabled = disabled self.log_on_batch_idx = log_on_batch_idx self.log_images_kwargs = log_images_kwargs if log_images_kwargs else {} self.log_first_step = log_first_step @rank_zero_only def log_local(self, save_dir, split, images, global_step, current_epoch, batch_idx): root = os.path.join(save_dir, "image_log", split) for k in images: grid = torchvision.utils.make_grid(images[k], nrow=4) if self.rescale: grid = (grid + 1.0) / 2.0 # -1,1 -> 0,1; c,h,w grid = grid.transpose(0, 1).transpose(1, 2).squeeze(-1) grid = grid.numpy() grid = (grid * 255).astype(np.uint8) filename = "{}_gs-{:06}_e-{:06}_b-{:06}.png".format(k, global_step, current_epoch, batch_idx) path = os.path.join(root, filename) os.makedirs(os.path.split(path)[0], exist_ok=True) Image.fromarray(grid).save(path) def log_img(self, pl_module, batch, batch_idx, split="train"): check_idx = batch_idx # if self.log_on_batch_idx else pl_module.global_step if (self.check_frequency(check_idx) and # batch_idx % self.batch_freq == 0 hasattr(pl_module, "log_images") and callable(pl_module.log_images) and self.max_images > 0): logger = type(pl_module.logger) is_train = pl_module.training if is_train: pl_module.eval() with torch.no_grad(): images = pl_module.log_images(batch, split=split, **self.log_images_kwargs) for k in images: N = min(images[k].shape[0], self.max_images) images[k] = images[k][:N] if isinstance(images[k], torch.Tensor): images[k] = images[k].detach().cpu() if self.clamp: images[k] = torch.clamp(images[k], -1., 1.) self.log_local(pl_module.logger.save_dir, split, images, pl_module.global_step, pl_module.current_epoch, batch_idx) if is_train: pl_module.train() def check_frequency(self, check_idx): return check_idx % self.batch_freq == 0 def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx): if not self.disabled: self.log_img(pl_module, batch, batch_idx, split="train") ================================================ FILE: ToonCrafter/cldm/model.py ================================================ import os import torch from omegaconf import OmegaConf from comfy.ldm.util import instantiate_from_config def get_state_dict(d): return d.get('state_dict', d) def load_state_dict(ckpt_path, location='cpu'): _, extension = os.path.splitext(ckpt_path) if extension.lower() == ".safetensors": import safetensors.torch state_dict = safetensors.torch.load_file(ckpt_path, device=location) else: state_dict = get_state_dict(torch.load(ckpt_path, map_location=torch.device(location))) state_dict = get_state_dict(state_dict) print(f'Loaded state_dict from [{ckpt_path}]') return state_dict def create_model(config_path): config = OmegaConf.load(config_path) model = instantiate_from_config(config.model).cpu() print(f'Loaded model config from [{config_path}]') return model ================================================ FILE: ToonCrafter/configs/cldm_v21.yaml ================================================ control_stage_config: target: ToonCrafter.cldm.cldm.ControlNet params: use_checkpoint: True image_size: 32 # unused in_channels: 4 hint_channels: 1 model_channels: 320 attention_resolutions: [ 4, 2, 1 ] num_res_blocks: 2 channel_mult: [ 1, 2, 4, 4 ] num_head_channels: 64 # need to fix for flash-attn use_spatial_transformer: True use_linear_in_transformer: True transformer_depth: 1 context_dim: 1024 legacy: False ================================================ FILE: ToonCrafter/configs/inference_512_v1.0.yaml ================================================ model: target: lvdm.models.ddpm3d.LatentVisualDiffusion params: rescale_betas_zero_snr: True parameterization: "v" linear_start: 0.00085 linear_end: 0.012 num_timesteps_cond: 1 timesteps: 1000 first_stage_key: video cond_stage_key: caption cond_stage_trainable: False conditioning_key: hybrid image_size: [40, 64] channels: 4 scale_by_std: False scale_factor: 0.18215 use_ema: False uncond_type: 'empty_seq' use_dynamic_rescale: true base_scale: 0.7 fps_condition_type: 'fps' perframe_ae: True loop_video: true unet_config: target: lvdm.modules.networks.openaimodel3d.UNetModel params: in_channels: 8 out_channels: 4 model_channels: 320 attention_resolutions: - 4 - 2 - 1 num_res_blocks: 2 channel_mult: - 1 - 2 - 4 - 4 dropout: 0.1 num_head_channels: 64 transformer_depth: 1 context_dim: 1024 use_linear: true use_checkpoint: True temporal_conv: True temporal_attention: True temporal_selfatt_only: true use_relative_position: false use_causal_attention: False temporal_length: 16 addition_attention: true image_cross_attention: true default_fs: 24 fs_condition: true first_stage_config: target: lvdm.models.autoencoder.AutoencoderKL_Dualref params: embed_dim: 4 monitor: val/rec_loss ddconfig: double_z: True z_channels: 4 resolution: 256 in_channels: 3 out_ch: 3 ch: 128 ch_mult: - 1 - 2 - 4 - 4 num_res_blocks: 2 attn_resolutions: [] dropout: 0.0 lossconfig: target: torch.nn.Identity cond_stage_config: target: lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder params: freeze: true layer: "penultimate" img_cond_stage_config: target: lvdm.modules.encoders.condition.FrozenOpenCLIPImageEmbedderV2 params: freeze: true image_proj_stage_config: target: lvdm.modules.encoders.resampler.Resampler params: dim: 1024 depth: 4 dim_head: 64 heads: 12 num_queries: 16 embedding_dim: 1280 output_dim: 1024 ff_mult: 4 video_length: 16 ================================================ FILE: ToonCrafter/configs/training_1024_v1.0/config.yaml ================================================ model: pretrained_checkpoint: checkpoints/dynamicrafter_1024_v1/model.ckpt base_learning_rate: 1.0e-05 scale_lr: False target: lvdm.models.ddpm3d.LatentVisualDiffusion params: rescale_betas_zero_snr: True parameterization: "v" linear_start: 0.00085 linear_end: 0.012 num_timesteps_cond: 1 log_every_t: 200 timesteps: 1000 first_stage_key: video cond_stage_key: caption cond_stage_trainable: False image_proj_model_trainable: True conditioning_key: hybrid image_size: [72, 128] channels: 4 scale_by_std: False scale_factor: 0.18215 use_ema: False uncond_prob: 0.05 uncond_type: 'empty_seq' rand_cond_frame: true use_dynamic_rescale: true base_scale: 0.3 fps_condition_type: 'fps' perframe_ae: True unet_config: target: lvdm.modules.networks.openaimodel3d.UNetModel params: in_channels: 8 out_channels: 4 model_channels: 320 attention_resolutions: - 4 - 2 - 1 num_res_blocks: 2 channel_mult: - 1 - 2 - 4 - 4 dropout: 0.1 num_head_channels: 64 transformer_depth: 1 context_dim: 1024 use_linear: true use_checkpoint: True temporal_conv: True temporal_attention: True temporal_selfatt_only: true use_relative_position: false use_causal_attention: False temporal_length: 16 addition_attention: true image_cross_attention: true default_fs: 10 fs_condition: true first_stage_config: target: lvdm.models.autoencoder.AutoencoderKL params: embed_dim: 4 monitor: val/rec_loss ddconfig: double_z: True z_channels: 4 resolution: 256 in_channels: 3 out_ch: 3 ch: 128 ch_mult: - 1 - 2 - 4 - 4 num_res_blocks: 2 attn_resolutions: [] dropout: 0.0 lossconfig: target: torch.nn.Identity cond_stage_config: target: lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder params: freeze: true layer: "penultimate" img_cond_stage_config: target: lvdm.modules.encoders.condition.FrozenOpenCLIPImageEmbedderV2 params: freeze: true image_proj_stage_config: target: lvdm.modules.encoders.resampler.Resampler params: dim: 1024 depth: 4 dim_head: 64 heads: 12 num_queries: 16 embedding_dim: 1280 output_dim: 1024 ff_mult: 4 video_length: 16 data: target: utils_data.DataModuleFromConfig params: batch_size: 1 num_workers: 12 wrap: false train: target: lvdm.data.webvid.WebVid params: data_dir: meta_path: <.csv FILE> video_length: 16 frame_stride: 6 load_raw_resolution: true resolution: [576, 1024] spatial_transform: resize_center_crop random_fs: true ## if true, we uniformly sample fs with max_fs=frame_stride (above) lightning: precision: 16 # strategy: deepspeed_stage_2 trainer: benchmark: True accumulate_grad_batches: 2 max_steps: 100000 # logger log_every_n_steps: 50 # val val_check_interval: 0.5 gradient_clip_algorithm: 'norm' gradient_clip_val: 0.5 callbacks: model_checkpoint: target: pytorch_lightning.callbacks.ModelCheckpoint params: every_n_train_steps: 9000 #1000 filename: "{epoch}-{step}" save_weights_only: True metrics_over_trainsteps_checkpoint: target: pytorch_lightning.callbacks.ModelCheckpoint params: filename: '{epoch}-{step}' save_weights_only: True every_n_train_steps: 10000 #20000 # 3s/step*2w= batch_logger: target: callbacks.ImageLogger params: batch_frequency: 500 to_local: False max_images: 8 log_images_kwargs: ddim_steps: 50 unconditional_guidance_scale: 7.5 timestep_spacing: uniform_trailing guidance_rescale: 0.7 ================================================ FILE: ToonCrafter/configs/training_1024_v1.0/run.sh ================================================ # NCCL configuration # export NCCL_DEBUG=INFO # export NCCL_IB_DISABLE=0 # export NCCL_IB_GID_INDEX=3 # export NCCL_NET_GDR_LEVEL=3 # export NCCL_TOPO_FILE=/tmp/topo.txt # args name="training_1024_v1.0" config_file=configs/${name}/config.yaml # save root dir for logs, checkpoints, tensorboard record, etc. save_root="" mkdir -p $save_root/$name ## run CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -m torch.distributed.launch \ --nproc_per_node=$HOST_GPU_NUM --nnodes=1 --master_addr=127.0.0.1 --master_port=12352 --node_rank=0 \ ./main/trainer.py \ --base $config_file \ --train \ --name $name \ --logdir $save_root \ --devices $HOST_GPU_NUM \ lightning.trainer.num_nodes=1 ## debugging # CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch \ # --nproc_per_node=4 --nnodes=1 --master_addr=127.0.0.1 --master_port=12352 --node_rank=0 \ # ./main/trainer.py \ # --base $config_file \ # --train \ # --name $name \ # --logdir $save_root \ # --devices 4 \ # lightning.trainer.num_nodes=1 ================================================ FILE: ToonCrafter/configs/training_512_v1.0/config.yaml ================================================ model: pretrained_checkpoint: checkpoints/dynamicrafter_512_v1/model.ckpt base_learning_rate: 1.0e-05 scale_lr: False target: lvdm.models.ddpm3d.LatentVisualDiffusion params: rescale_betas_zero_snr: True parameterization: "v" linear_start: 0.00085 linear_end: 0.012 num_timesteps_cond: 1 log_every_t: 200 timesteps: 1000 first_stage_key: video cond_stage_key: caption cond_stage_trainable: False image_proj_model_trainable: True conditioning_key: hybrid image_size: [40, 64] channels: 4 scale_by_std: False scale_factor: 0.18215 use_ema: False uncond_prob: 0.05 uncond_type: 'empty_seq' rand_cond_frame: true use_dynamic_rescale: true base_scale: 0.7 fps_condition_type: 'fps' perframe_ae: True unet_config: target: lvdm.modules.networks.openaimodel3d.UNetModel params: in_channels: 8 out_channels: 4 model_channels: 320 attention_resolutions: - 4 - 2 - 1 num_res_blocks: 2 channel_mult: - 1 - 2 - 4 - 4 dropout: 0.1 num_head_channels: 64 transformer_depth: 1 context_dim: 1024 use_linear: true use_checkpoint: True temporal_conv: True temporal_attention: True temporal_selfatt_only: true use_relative_position: false use_causal_attention: False temporal_length: 16 addition_attention: true image_cross_attention: true default_fs: 10 fs_condition: true first_stage_config: target: lvdm.models.autoencoder.AutoencoderKL params: embed_dim: 4 monitor: val/rec_loss ddconfig: double_z: True z_channels: 4 resolution: 256 in_channels: 3 out_ch: 3 ch: 128 ch_mult: - 1 - 2 - 4 - 4 num_res_blocks: 2 attn_resolutions: [] dropout: 0.0 lossconfig: target: torch.nn.Identity cond_stage_config: target: lvdm.modules.encoders.condition.FrozenOpenCLIPEmbedder params: freeze: true layer: "penultimate" img_cond_stage_config: target: lvdm.modules.encoders.condition.FrozenOpenCLIPImageEmbedderV2 params: freeze: true image_proj_stage_config: target: lvdm.modules.encoders.resampler.Resampler params: dim: 1024 depth: 4 dim_head: 64 heads: 12 num_queries: 16 embedding_dim: 1280 output_dim: 1024 ff_mult: 4 video_length: 16 data: target: utils_data.DataModuleFromConfig params: batch_size: 2 num_workers: 12 wrap: false train: target: lvdm.data.webvid.WebVid params: data_dir: meta_path: <.csv FILE> video_length: 16 frame_stride: 6 load_raw_resolution: true resolution: [320, 512] spatial_transform: resize_center_crop random_fs: true ## if true, we uniformly sample fs with max_fs=frame_stride (above) lightning: precision: 16 # strategy: deepspeed_stage_2 trainer: benchmark: True accumulate_grad_batches: 2 max_steps: 100000 # logger log_every_n_steps: 50 # val val_check_interval: 0.5 gradient_clip_algorithm: 'norm' gradient_clip_val: 0.5 callbacks: model_checkpoint: target: pytorch_lightning.callbacks.ModelCheckpoint params: every_n_train_steps: 9000 #1000 filename: "{epoch}-{step}" save_weights_only: True metrics_over_trainsteps_checkpoint: target: pytorch_lightning.callbacks.ModelCheckpoint params: filename: '{epoch}-{step}' save_weights_only: True every_n_train_steps: 10000 #20000 # 3s/step*2w= batch_logger: target: callbacks.ImageLogger params: batch_frequency: 500 to_local: False max_images: 8 log_images_kwargs: ddim_steps: 50 unconditional_guidance_scale: 7.5 timestep_spacing: uniform_trailing guidance_rescale: 0.7 ================================================ FILE: ToonCrafter/configs/training_512_v1.0/run.sh ================================================ # NCCL configuration # export NCCL_DEBUG=INFO # export NCCL_IB_DISABLE=0 # export NCCL_IB_GID_INDEX=3 # export NCCL_NET_GDR_LEVEL=3 # export NCCL_TOPO_FILE=/tmp/topo.txt # args name="training_512_v1.0" config_file=configs/${name}/config.yaml # save root dir for logs, checkpoints, tensorboard record, etc. save_root="" mkdir -p $save_root/$name ## run CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -m torch.distributed.launch \ --nproc_per_node=$HOST_GPU_NUM --nnodes=1 --master_addr=127.0.0.1 --master_port=12352 --node_rank=0 \ ./main/trainer.py \ --base $config_file \ --train \ --name $name \ --logdir $save_root \ --devices $HOST_GPU_NUM \ lightning.trainer.num_nodes=1 ## debugging # CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch \ # --nproc_per_node=4 --nnodes=1 --master_addr=127.0.0.1 --master_port=12352 --node_rank=0 \ # ./main/trainer.py \ # --base $config_file \ # --train \ # --name $name \ # --logdir $save_root \ # --devices 4 \ # lightning.trainer.num_nodes=1 ================================================ FILE: ToonCrafter/gradio_app.py ================================================ import os import argparse import sys import gradio as gr from scripts.gradio.i2v_test_application import Image2Video sys.path.insert(1, os.path.join(sys.path[0], 'lvdm')) i2v_examples_interp_512 = [ ['prompts/512_interp/74906_1462_frame1.png', 'walking man', 50, 7.5, 1.0, 10, 123, 'prompts/512_interp/74906_1462_frame3.png'], ['prompts/512_interp/Japan_v2_2_062266_s2_frame1.png', 'an anime scene', 50, 7.5, 1.0, 10, 789, 'prompts/512_interp/Japan_v2_2_062266_s2_frame3.png'], ['prompts/512_interp/Japan_v2_3_119235_s2_frame1.png', 'an anime scene', 50, 7.5, 1.0, 10, 123, 'prompts/512_interp/Japan_v2_3_119235_s2_frame3.png'], ] def dynamicrafter_demo(result_dir='./tmp/', res=512): if res == 1024: resolution = '576_1024' css = """#input_img {max-width: 1024px !important} #output_vid {max-width: 1024px; max-height:576px}""" elif res == 512: resolution = '320_512' css = """#input_img {max-width: 512px !important} #output_vid {max-width: 512px; max-height: 320px} #input_img2 {max-width: 512px !important} #output_vid {max-width: 512px; max-height: 320px}""" elif res == 256: resolution = '256_256' css = """#input_img {max-width: 256px !important} #output_vid {max-width: 256px; max-height: 256px}""" else: raise NotImplementedError(f"Unsupported resolution: {res}") image2video = Image2Video(result_dir, resolution=resolution) with gr.Blocks(analytics_enabled=False, css=css) as dynamicrafter_iface: with gr.Tab(label='ToonCrafter_320x512'): with gr.Column(): with gr.Row(): with gr.Column(): with gr.Row(): i2v_input_image = gr.Image(label="Input Image1", elem_id="input_img") with gr.Row(): i2v_input_text = gr.Text(label='Prompts') with gr.Row(): i2v_seed = gr.Slider(label='Random Seed', minimum=0, maximum=50000, step=1, value=123) i2v_eta = gr.Slider(minimum=0.0, maximum=1.0, step=0.1, label='ETA', value=1.0, elem_id="i2v_eta") i2v_cfg_scale = gr.Slider(minimum=1.0, maximum=15.0, step=0.5, label='CFG Scale', value=7.5, elem_id="i2v_cfg_scale") with gr.Row(): i2v_steps = gr.Slider(minimum=1, maximum=60, step=1, elem_id="i2v_steps", label="Sampling steps", value=50) i2v_motion = gr.Slider(minimum=5, maximum=30, step=1, elem_id="i2v_motion", label="FPS", value=10) i2v_end_btn = gr.Button("Generate") with gr.Column(): with gr.Row(): i2v_input_image2 = gr.Image(label="Input Image2", elem_id="input_img2") with gr.Row(): i2v_output_video = gr.Video(label="Generated Video", elem_id="output_vid", autoplay=True, show_share_button=True) gr.Examples(examples=i2v_examples_interp_512, inputs=[i2v_input_image, i2v_input_text, i2v_steps, i2v_cfg_scale, i2v_eta, i2v_motion, i2v_seed, i2v_input_image2], outputs=[i2v_output_video], fn=image2video.get_image, cache_examples=False, ) i2v_end_btn.click(inputs=[i2v_input_image, i2v_input_text, i2v_steps, i2v_cfg_scale, i2v_eta, i2v_motion, i2v_seed, i2v_input_image2], outputs=[i2v_output_video], fn=image2video.get_image ) return dynamicrafter_iface def get_parser(): parser = argparse.ArgumentParser() return parser if __name__ == "__main__": parser = get_parser() args = parser.parse_args() result_dir = os.path.join('./', 'results') dynamicrafter_iface = dynamicrafter_demo(result_dir) dynamicrafter_iface.queue(max_size=12) dynamicrafter_iface.launch(max_threads=1) # dynamicrafter_iface.launch(server_name='0.0.0.0', server_port=80, max_threads=1) ================================================ FILE: ToonCrafter/ldm/data/__init__.py ================================================ ================================================ FILE: ToonCrafter/ldm/data/util.py ================================================ import torch from ToonCrafter.ldm.modules.midas.api import load_midas_transform class AddMiDaS(object): def __init__(self, model_type): super().__init__() self.transform = load_midas_transform(model_type) def pt2np(self, x): x = ((x + 1.0) * .5).detach().cpu().numpy() return x def np2pt(self, x): x = torch.from_numpy(x) * 2 - 1. return x def __call__(self, sample): # sample['jpg'] is tensor hwc in [-1, 1] at this point x = self.pt2np(sample['jpg']) x = self.transform({"image": x})["image"] sample['midas_in'] = x return sample ================================================ FILE: ToonCrafter/ldm/models/autoencoder.py ================================================ import torch import pytorch_lightning as pl import torch.nn.functional as F from contextlib import contextmanager from ToonCrafter.ldm.modules.diffusionmodules.model import Encoder, Decoder from ToonCrafter.ldm.modules.distributions.distributions import DiagonalGaussianDistribution from ToonCrafter.ldm.util import instantiate_from_config from ToonCrafter.ldm.modules.ema import LitEma class AutoencoderKL(pl.LightningModule): def __init__(self, ddconfig, lossconfig, embed_dim, ckpt_path=None, ignore_keys=[], image_key="image", colorize_nlabels=None, monitor=None, ema_decay=None, learn_logvar=False ): super().__init__() self.learn_logvar = learn_logvar self.image_key = image_key self.encoder = Encoder(**ddconfig) self.decoder = Decoder(**ddconfig) self.loss = instantiate_from_config(lossconfig) assert ddconfig["double_z"] self.quant_conv = torch.nn.Conv2d(2*ddconfig["z_channels"], 2*embed_dim, 1) self.post_quant_conv = torch.nn.Conv2d(embed_dim, ddconfig["z_channels"], 1) self.embed_dim = embed_dim if colorize_nlabels is not None: assert type(colorize_nlabels)==int self.register_buffer("colorize", torch.randn(3, colorize_nlabels, 1, 1)) if monitor is not None: self.monitor = monitor self.use_ema = ema_decay is not None if self.use_ema: self.ema_decay = ema_decay assert 0. < ema_decay < 1. self.model_ema = LitEma(self, decay=ema_decay) print(f"Keeping EMAs of {len(list(self.model_ema.buffers()))}.") if ckpt_path is not None: self.init_from_ckpt(ckpt_path, ignore_keys=ignore_keys) def init_from_ckpt(self, path, ignore_keys=list()): sd = torch.load(path, map_location="cpu")["state_dict"] keys = list(sd.keys()) for k in keys: for ik in ignore_keys: if k.startswith(ik): print("Deleting key {} from state_dict.".format(k)) del sd[k] self.load_state_dict(sd, strict=False) print(f"Restored from {path}") @contextmanager def ema_scope(self, context=None): if self.use_ema: self.model_ema.store(self.parameters()) self.model_ema.copy_to(self) if context is not None: print(f"{context}: Switched to EMA weights") try: yield None finally: if self.use_ema: self.model_ema.restore(self.parameters()) if context is not None: print(f"{context}: Restored training weights") def on_train_batch_end(self, *args, **kwargs): if self.use_ema: self.model_ema(self) def encode(self, x): h = self.encoder(x) moments = self.quant_conv(h) posterior = DiagonalGaussianDistribution(moments) return posterior def decode(self, z): z = self.post_quant_conv(z) dec = self.decoder(z) return dec def forward(self, input, sample_posterior=True): posterior = self.encode(input) if sample_posterior: z = posterior.sample() else: z = posterior.mode() dec = self.decode(z) return dec, posterior def get_input(self, batch, k): x = batch[k] if len(x.shape) == 3: x = x[..., None] x = x.permute(0, 3, 1, 2).to(memory_format=torch.contiguous_format).float() return x def training_step(self, batch, batch_idx, optimizer_idx): inputs = self.get_input(batch, self.image_key) reconstructions, posterior = self(inputs) if optimizer_idx == 0: # train encoder+decoder+logvar aeloss, log_dict_ae = self.loss(inputs, reconstructions, posterior, optimizer_idx, self.global_step, last_layer=self.get_last_layer(), split="train") self.log("aeloss", aeloss, prog_bar=True, logger=True, on_step=True, on_epoch=True) self.log_dict(log_dict_ae, prog_bar=False, logger=True, on_step=True, on_epoch=False) return aeloss if optimizer_idx == 1: # train the discriminator discloss, log_dict_disc = self.loss(inputs, reconstructions, posterior, optimizer_idx, self.global_step, last_layer=self.get_last_layer(), split="train") self.log("discloss", discloss, prog_bar=True, logger=True, on_step=True, on_epoch=True) self.log_dict(log_dict_disc, prog_bar=False, logger=True, on_step=True, on_epoch=False) return discloss def validation_step(self, batch, batch_idx): log_dict = self._validation_step(batch, batch_idx) with self.ema_scope(): log_dict_ema = self._validation_step(batch, batch_idx, postfix="_ema") return log_dict def _validation_step(self, batch, batch_idx, postfix=""): inputs = self.get_input(batch, self.image_key) reconstructions, posterior = self(inputs) aeloss, log_dict_ae = self.loss(inputs, reconstructions, posterior, 0, self.global_step, last_layer=self.get_last_layer(), split="val"+postfix) discloss, log_dict_disc = self.loss(inputs, reconstructions, posterior, 1, self.global_step, last_layer=self.get_last_layer(), split="val"+postfix) self.log(f"val{postfix}/rec_loss", log_dict_ae[f"val{postfix}/rec_loss"]) self.log_dict(log_dict_ae) self.log_dict(log_dict_disc) return self.log_dict def configure_optimizers(self): lr = self.learning_rate ae_params_list = list(self.encoder.parameters()) + list(self.decoder.parameters()) + list( self.quant_conv.parameters()) + list(self.post_quant_conv.parameters()) if self.learn_logvar: print(f"{self.__class__.__name__}: Learning logvar") ae_params_list.append(self.loss.logvar) opt_ae = torch.optim.Adam(ae_params_list, lr=lr, betas=(0.5, 0.9)) opt_disc = torch.optim.Adam(self.loss.discriminator.parameters(), lr=lr, betas=(0.5, 0.9)) return [opt_ae, opt_disc], [] def get_last_layer(self): return self.decoder.conv_out.weight @torch.no_grad() def log_images(self, batch, only_inputs=False, log_ema=False, **kwargs): log = dict() x = self.get_input(batch, self.image_key) x = x.to(self.device) if not only_inputs: xrec, posterior = self(x) if x.shape[1] > 3: # colorize with random projection assert xrec.shape[1] > 3 x = self.to_rgb(x) xrec = self.to_rgb(xrec) log["samples"] = self.decode(torch.randn_like(posterior.sample())) log["reconstructions"] = xrec if log_ema or self.use_ema: with self.ema_scope(): xrec_ema, posterior_ema = self(x) if x.shape[1] > 3: # colorize with random projection assert xrec_ema.shape[1] > 3 xrec_ema = self.to_rgb(xrec_ema) log["samples_ema"] = self.decode(torch.randn_like(posterior_ema.sample())) log["reconstructions_ema"] = xrec_ema log["inputs"] = x return log def to_rgb(self, x): assert self.image_key == "segmentation" if not hasattr(self, "colorize"): self.register_buffer("colorize", torch.randn(3, x.shape[1], 1, 1).to(x)) x = F.conv2d(x, weight=self.colorize) x = 2.*(x-x.min())/(x.max()-x.min()) - 1. return x class IdentityFirstStage(torch.nn.Module): def __init__(self, *args, vq_interface=False, **kwargs): self.vq_interface = vq_interface super().__init__() def encode(self, x, *args, **kwargs): return x def decode(self, x, *args, **kwargs): return x def quantize(self, x, *args, **kwargs): if self.vq_interface: return x, None, [None, None, None] return x def forward(self, x, *args, **kwargs): return x ================================================ FILE: ToonCrafter/ldm/models/diffusion/__init__.py ================================================ ================================================ FILE: ToonCrafter/ldm/models/diffusion/ddim.py ================================================ """SAMPLING ONLY.""" import torch import numpy as np from tqdm import tqdm from ToonCrafter.ldm.modules.diffusionmodules.util import make_ddim_sampling_parameters, make_ddim_timesteps, noise_like, extract_into_tensor class DDIMSampler(object): def __init__(self, model, schedule="linear", **kwargs): super().__init__() self.model = model self.ddpm_num_timesteps = model.num_timesteps self.schedule = schedule def register_buffer(self, name, attr): if type(attr) == torch.Tensor: if attr.device != torch.device("cuda"): attr = attr.to(torch.device("cuda")) setattr(self, name, attr) def make_schedule(self, ddim_num_steps, ddim_discretize="uniform", ddim_eta=0., verbose=True): self.ddim_timesteps = make_ddim_timesteps(ddim_discr_method=ddim_discretize, num_ddim_timesteps=ddim_num_steps, num_ddpm_timesteps=self.ddpm_num_timesteps,verbose=verbose) alphas_cumprod = self.model.alphas_cumprod assert alphas_cumprod.shape[0] == self.ddpm_num_timesteps, 'alphas have to be defined for each timestep' to_torch = lambda x: x.clone().detach().to(torch.float32).to(self.model.device) self.register_buffer('betas', to_torch(self.model.betas)) self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod)) self.register_buffer('alphas_cumprod_prev', to_torch(self.model.alphas_cumprod_prev)) # calculations for diffusion q(x_t | x_{t-1}) and others self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod.cpu()))) self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod.cpu()))) self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod.cpu()))) self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu()))) self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu() - 1))) # ddim sampling parameters ddim_sigmas, ddim_alphas, ddim_alphas_prev = make_ddim_sampling_parameters(alphacums=alphas_cumprod.cpu(), ddim_timesteps=self.ddim_timesteps, eta=ddim_eta,verbose=verbose) self.register_buffer('ddim_sigmas', ddim_sigmas) self.register_buffer('ddim_alphas', ddim_alphas) self.register_buffer('ddim_alphas_prev', ddim_alphas_prev) self.register_buffer('ddim_sqrt_one_minus_alphas', np.sqrt(1. - ddim_alphas)) sigmas_for_original_sampling_steps = ddim_eta * torch.sqrt( (1 - self.alphas_cumprod_prev) / (1 - self.alphas_cumprod) * ( 1 - self.alphas_cumprod / self.alphas_cumprod_prev)) self.register_buffer('ddim_sigmas_for_original_num_steps', sigmas_for_original_sampling_steps) @torch.no_grad() def sample(self, S, batch_size, shape, conditioning=None, callback=None, normals_sequence=None, img_callback=None, quantize_x0=False, eta=0., mask=None, x0=None, temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None, verbose=True, x_T=None, log_every_t=100, unconditional_guidance_scale=1., unconditional_conditioning=None, # this has to come in the same format as the conditioning, # e.g. as encoded tokens, ... dynamic_threshold=None, ucg_schedule=None, **kwargs ): if conditioning is not None: if isinstance(conditioning, dict): ctmp = conditioning[list(conditioning.keys())[0]] while isinstance(ctmp, list): ctmp = ctmp[0] cbs = ctmp.shape[0] if cbs != batch_size: print(f"Warning: Got {cbs} conditionings but batch-size is {batch_size}") elif isinstance(conditioning, list): for ctmp in conditioning: if ctmp.shape[0] != batch_size: print(f"Warning: Got {cbs} conditionings but batch-size is {batch_size}") else: if conditioning.shape[0] != batch_size: print(f"Warning: Got {conditioning.shape[0]} conditionings but batch-size is {batch_size}") self.make_schedule(ddim_num_steps=S, ddim_eta=eta, verbose=verbose) # sampling C, H, W = shape size = (batch_size, C, H, W) print(f'Data shape for DDIM sampling is {size}, eta {eta}') samples, intermediates = self.ddim_sampling(conditioning, size, callback=callback, img_callback=img_callback, quantize_denoised=quantize_x0, mask=mask, x0=x0, ddim_use_original_steps=False, noise_dropout=noise_dropout, temperature=temperature, score_corrector=score_corrector, corrector_kwargs=corrector_kwargs, x_T=x_T, log_every_t=log_every_t, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=unconditional_conditioning, dynamic_threshold=dynamic_threshold, ucg_schedule=ucg_schedule ) return samples, intermediates @torch.no_grad() def ddim_sampling(self, cond, shape, x_T=None, ddim_use_original_steps=False, callback=None, timesteps=None, quantize_denoised=False, mask=None, x0=None, img_callback=None, log_every_t=100, temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None, unconditional_guidance_scale=1., unconditional_conditioning=None, dynamic_threshold=None, ucg_schedule=None): device = self.model.betas.device b = shape[0] if x_T is None: img = torch.randn(shape, device=device) else: img = x_T if timesteps is None: timesteps = self.ddpm_num_timesteps if ddim_use_original_steps else self.ddim_timesteps elif timesteps is not None and not ddim_use_original_steps: subset_end = int(min(timesteps / self.ddim_timesteps.shape[0], 1) * self.ddim_timesteps.shape[0]) - 1 timesteps = self.ddim_timesteps[:subset_end] intermediates = {'x_inter': [img], 'pred_x0': [img]} time_range = reversed(range(0,timesteps)) if ddim_use_original_steps else np.flip(timesteps) total_steps = timesteps if ddim_use_original_steps else timesteps.shape[0] print(f"Running DDIM Sampling with {total_steps} timesteps") iterator = tqdm(time_range, desc='DDIM Sampler', total=total_steps) for i, step in enumerate(iterator): index = total_steps - i - 1 ts = torch.full((b,), step, device=device, dtype=torch.long) if mask is not None: assert x0 is not None img_orig = self.model.q_sample(x0, ts) # TODO: deterministic forward pass? img = img_orig * mask + (1. - mask) * img if ucg_schedule is not None: assert len(ucg_schedule) == len(time_range) unconditional_guidance_scale = ucg_schedule[i] outs = self.p_sample_ddim(img, cond, ts, index=index, use_original_steps=ddim_use_original_steps, quantize_denoised=quantize_denoised, temperature=temperature, noise_dropout=noise_dropout, score_corrector=score_corrector, corrector_kwargs=corrector_kwargs, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=unconditional_conditioning, dynamic_threshold=dynamic_threshold) img, pred_x0 = outs if callback: callback(i) if img_callback: img_callback(pred_x0, i) if index % log_every_t == 0 or index == total_steps - 1: intermediates['x_inter'].append(img) intermediates['pred_x0'].append(pred_x0) return img, intermediates @torch.no_grad() def p_sample_ddim(self, x, c, t, index, repeat_noise=False, use_original_steps=False, quantize_denoised=False, temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None, unconditional_guidance_scale=1., unconditional_conditioning=None, dynamic_threshold=None): b, *_, device = *x.shape, x.device if unconditional_conditioning is None or unconditional_guidance_scale == 1.: model_output = self.model.apply_model(x, t, c) else: x_in = torch.cat([x] * 2) t_in = torch.cat([t] * 2) if isinstance(c, dict): assert isinstance(unconditional_conditioning, dict) c_in = dict() for k in c: if isinstance(c[k], list): c_in[k] = [torch.cat([ unconditional_conditioning[k][i], c[k][i]]) for i in range(len(c[k]))] else: c_in[k] = torch.cat([ unconditional_conditioning[k], c[k]]) elif isinstance(c, list): c_in = list() assert isinstance(unconditional_conditioning, list) for i in range(len(c)): c_in.append(torch.cat([unconditional_conditioning[i], c[i]])) else: c_in = torch.cat([unconditional_conditioning, c]) model_uncond, model_t = self.model.apply_model(x_in, t_in, c_in).chunk(2) model_output = model_uncond + unconditional_guidance_scale * (model_t - model_uncond) if self.model.parameterization == "v": e_t = self.model.predict_eps_from_z_and_v(x, t, model_output) else: e_t = model_output if score_corrector is not None: assert self.model.parameterization == "eps", 'not implemented' e_t = score_corrector.modify_score(self.model, e_t, x, t, c, **corrector_kwargs) alphas = self.model.alphas_cumprod if use_original_steps else self.ddim_alphas alphas_prev = self.model.alphas_cumprod_prev if use_original_steps else self.ddim_alphas_prev sqrt_one_minus_alphas = self.model.sqrt_one_minus_alphas_cumprod if use_original_steps else self.ddim_sqrt_one_minus_alphas sigmas = self.model.ddim_sigmas_for_original_num_steps if use_original_steps else self.ddim_sigmas # select parameters corresponding to the currently considered timestep a_t = torch.full((b, 1, 1, 1), alphas[index], device=device) a_prev = torch.full((b, 1, 1, 1), alphas_prev[index], device=device) sigma_t = torch.full((b, 1, 1, 1), sigmas[index], device=device) sqrt_one_minus_at = torch.full((b, 1, 1, 1), sqrt_one_minus_alphas[index],device=device) # current prediction for x_0 if self.model.parameterization != "v": pred_x0 = (x - sqrt_one_minus_at * e_t) / a_t.sqrt() else: pred_x0 = self.model.predict_start_from_z_and_v(x, t, model_output) if quantize_denoised: pred_x0, _, *_ = self.model.first_stage_model.quantize(pred_x0) if dynamic_threshold is not None: raise NotImplementedError() # direction pointing to x_t dir_xt = (1. - a_prev - sigma_t**2).sqrt() * e_t noise = sigma_t * noise_like(x.shape, device, repeat_noise) * temperature if noise_dropout > 0.: noise = torch.nn.functional.dropout(noise, p=noise_dropout) x_prev = a_prev.sqrt() * pred_x0 + dir_xt + noise return x_prev, pred_x0 @torch.no_grad() def encode(self, x0, c, t_enc, use_original_steps=False, return_intermediates=None, unconditional_guidance_scale=1.0, unconditional_conditioning=None, callback=None): num_reference_steps = self.ddpm_num_timesteps if use_original_steps else self.ddim_timesteps.shape[0] assert t_enc <= num_reference_steps num_steps = t_enc if use_original_steps: alphas_next = self.alphas_cumprod[:num_steps] alphas = self.alphas_cumprod_prev[:num_steps] else: alphas_next = self.ddim_alphas[:num_steps] alphas = torch.tensor(self.ddim_alphas_prev[:num_steps]) x_next = x0 intermediates = [] inter_steps = [] for i in tqdm(range(num_steps), desc='Encoding Image'): t = torch.full((x0.shape[0],), i, device=self.model.device, dtype=torch.long) if unconditional_guidance_scale == 1.: noise_pred = self.model.apply_model(x_next, t, c) else: assert unconditional_conditioning is not None e_t_uncond, noise_pred = torch.chunk( self.model.apply_model(torch.cat((x_next, x_next)), torch.cat((t, t)), torch.cat((unconditional_conditioning, c))), 2) noise_pred = e_t_uncond + unconditional_guidance_scale * (noise_pred - e_t_uncond) xt_weighted = (alphas_next[i] / alphas[i]).sqrt() * x_next weighted_noise_pred = alphas_next[i].sqrt() * ( (1 / alphas_next[i] - 1).sqrt() - (1 / alphas[i] - 1).sqrt()) * noise_pred x_next = xt_weighted + weighted_noise_pred if return_intermediates and i % ( num_steps // return_intermediates) == 0 and i < num_steps - 1: intermediates.append(x_next) inter_steps.append(i) elif return_intermediates and i >= num_steps - 2: intermediates.append(x_next) inter_steps.append(i) if callback: callback(i) out = {'x_encoded': x_next, 'intermediate_steps': inter_steps} if return_intermediates: out.update({'intermediates': intermediates}) return x_next, out @torch.no_grad() def stochastic_encode(self, x0, t, use_original_steps=False, noise=None): # fast, but does not allow for exact reconstruction # t serves as an index to gather the correct alphas if use_original_steps: sqrt_alphas_cumprod = self.sqrt_alphas_cumprod sqrt_one_minus_alphas_cumprod = self.sqrt_one_minus_alphas_cumprod else: sqrt_alphas_cumprod = torch.sqrt(self.ddim_alphas) sqrt_one_minus_alphas_cumprod = self.ddim_sqrt_one_minus_alphas if noise is None: noise = torch.randn_like(x0) return (extract_into_tensor(sqrt_alphas_cumprod, t, x0.shape) * x0 + extract_into_tensor(sqrt_one_minus_alphas_cumprod, t, x0.shape) * noise) @torch.no_grad() def decode(self, x_latent, cond, t_start, unconditional_guidance_scale=1.0, unconditional_conditioning=None, use_original_steps=False, callback=None): timesteps = np.arange(self.ddpm_num_timesteps) if use_original_steps else self.ddim_timesteps timesteps = timesteps[:t_start] time_range = np.flip(timesteps) total_steps = timesteps.shape[0] print(f"Running DDIM Sampling with {total_steps} timesteps") iterator = tqdm(time_range, desc='Decoding image', total=total_steps) x_dec = x_latent for i, step in enumerate(iterator): index = total_steps - i - 1 ts = torch.full((x_latent.shape[0],), step, device=x_latent.device, dtype=torch.long) x_dec, _ = self.p_sample_ddim(x_dec, cond, ts, index=index, use_original_steps=use_original_steps, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=unconditional_conditioning) if callback: callback(i) return x_dec ================================================ FILE: ToonCrafter/ldm/models/diffusion/ddpm.py ================================================ """ wild mixture of https://github.com/lucidrains/denoising-diffusion-pytorch/blob/7706bdfc6f527f58d33f84b7b522e61e6e3164b3/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py https://github.com/openai/improved-diffusion/blob/e94489283bb876ac1477d5dd7709bbbd2d9902ce/improved_diffusion/gaussian_diffusion.py https://github.com/CompVis/taming-transformers -- merci """ import torch import torch.nn as nn import numpy as np import pytorch_lightning as pl from torch.optim.lr_scheduler import LambdaLR from einops import rearrange, repeat from contextlib import contextmanager, nullcontext from functools import partial import itertools from tqdm import tqdm from torchvision.utils import make_grid from pytorch_lightning.utilities import rank_zero_only from omegaconf import ListConfig from ToonCrafter.ldm.util import log_txt_as_img, exists, default, ismap, isimage, mean_flat, count_params, instantiate_from_config from ToonCrafter.ldm.modules.ema import LitEma from ToonCrafter.ldm.modules.distributions.distributions import normal_kl, DiagonalGaussianDistribution from ToonCrafter.ldm.models.autoencoder import IdentityFirstStage, AutoencoderKL from ToonCrafter.ldm.modules.diffusionmodules.util import make_beta_schedule, extract_into_tensor, noise_like from ToonCrafter.ldm.models.diffusion.ddim import DDIMSampler __conditioning_keys__ = {'concat': 'c_concat', 'crossattn': 'c_crossattn', 'adm': 'y'} def disabled_train(self, mode=True): """Overwrite model.train with this function to make sure train/eval mode does not change anymore.""" return self def uniform_on_device(r1, r2, shape, device): return (r1 - r2) * torch.rand(*shape, device=device) + r2 class DDPM(pl.LightningModule): # classic DDPM with Gaussian diffusion, in image space def __init__(self, unet_config, timesteps=1000, beta_schedule="linear", loss_type="l2", ckpt_path=None, ignore_keys=[], load_only_unet=False, monitor="val/loss", use_ema=True, first_stage_key="image", image_size=256, channels=3, log_every_t=100, clip_denoised=True, linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3, given_betas=None, original_elbo_weight=0., v_posterior=0., # weight for choosing posterior variance as sigma = (1-v) * beta_tilde + v * beta l_simple_weight=1., conditioning_key=None, parameterization="eps", # all assuming fixed variance schedules scheduler_config=None, use_positional_encodings=False, learn_logvar=False, logvar_init=0., make_it_fit=False, ucg_training=None, reset_ema=False, reset_num_ema_updates=False, ): super().__init__() assert parameterization in ["eps", "x0", "v"], 'currently only supporting "eps" and "x0" and "v"' self.parameterization = parameterization print(f"{self.__class__.__name__}: Running in {self.parameterization}-prediction mode") self.cond_stage_model = None self.clip_denoised = clip_denoised self.log_every_t = log_every_t self.first_stage_key = first_stage_key self.image_size = image_size # try conv? self.channels = channels self.use_positional_encodings = use_positional_encodings self.model = DiffusionWrapper(unet_config, conditioning_key) count_params(self.model, verbose=True) self.use_ema = use_ema if self.use_ema: self.model_ema = LitEma(self.model) print(f"Keeping EMAs of {len(list(self.model_ema.buffers()))}.") self.use_scheduler = scheduler_config is not None if self.use_scheduler: self.scheduler_config = scheduler_config self.v_posterior = v_posterior self.original_elbo_weight = original_elbo_weight self.l_simple_weight = l_simple_weight if monitor is not None: self.monitor = monitor self.make_it_fit = make_it_fit if reset_ema: assert exists(ckpt_path) if ckpt_path is not None: self.init_from_ckpt(ckpt_path, ignore_keys=ignore_keys, only_model=load_only_unet) if reset_ema: assert self.use_ema print(f"Resetting ema to pure model weights. This is useful when restoring from an ema-only checkpoint.") self.model_ema = LitEma(self.model) if reset_num_ema_updates: print(" +++++++++++ WARNING: RESETTING NUM_EMA UPDATES TO ZERO +++++++++++ ") assert self.use_ema self.model_ema.reset_num_updates() self.register_schedule(given_betas=given_betas, beta_schedule=beta_schedule, timesteps=timesteps, linear_start=linear_start, linear_end=linear_end, cosine_s=cosine_s) self.loss_type = loss_type self.learn_logvar = learn_logvar logvar = torch.full(fill_value=logvar_init, size=(self.num_timesteps,)) if self.learn_logvar: self.logvar = nn.Parameter(self.logvar, requires_grad=True) else: self.register_buffer('logvar', logvar) self.ucg_training = ucg_training or dict() if self.ucg_training: self.ucg_prng = np.random.RandomState() def register_schedule(self, given_betas=None, beta_schedule="linear", timesteps=1000, linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3): if exists(given_betas): betas = given_betas else: betas = make_beta_schedule(beta_schedule, timesteps, linear_start=linear_start, linear_end=linear_end, cosine_s=cosine_s) alphas = 1. - betas alphas_cumprod = np.cumprod(alphas, axis=0) alphas_cumprod_prev = np.append(1., alphas_cumprod[:-1]) timesteps, = betas.shape self.num_timesteps = int(timesteps) self.linear_start = linear_start self.linear_end = linear_end assert alphas_cumprod.shape[0] == self.num_timesteps, 'alphas have to be defined for each timestep' to_torch = partial(torch.tensor, dtype=torch.float32) self.register_buffer('betas', to_torch(betas)) self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod)) self.register_buffer('alphas_cumprod_prev', to_torch(alphas_cumprod_prev)) # calculations for diffusion q(x_t | x_{t-1}) and others self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod))) self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod))) self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod))) self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod))) self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod - 1))) # calculations for posterior q(x_{t-1} | x_t, x_0) posterior_variance = (1 - self.v_posterior) * betas * (1. - alphas_cumprod_prev) / ( 1. - alphas_cumprod) + self.v_posterior * betas # above: equal to 1. / (1. / (1. - alpha_cumprod_tm1) + alpha_t / beta_t) self.register_buffer('posterior_variance', to_torch(posterior_variance)) # below: log calculation clipped because the posterior variance is 0 at the beginning of the diffusion chain self.register_buffer('posterior_log_variance_clipped', to_torch(np.log(np.maximum(posterior_variance, 1e-20)))) self.register_buffer('posterior_mean_coef1', to_torch( betas * np.sqrt(alphas_cumprod_prev) / (1. - alphas_cumprod))) self.register_buffer('posterior_mean_coef2', to_torch( (1. - alphas_cumprod_prev) * np.sqrt(alphas) / (1. - alphas_cumprod))) if self.parameterization == "eps": lvlb_weights = self.betas ** 2 / ( 2 * self.posterior_variance * to_torch(alphas) * (1 - self.alphas_cumprod)) elif self.parameterization == "x0": lvlb_weights = 0.5 * np.sqrt(torch.Tensor(alphas_cumprod)) / (2. * 1 - torch.Tensor(alphas_cumprod)) elif self.parameterization == "v": lvlb_weights = torch.ones_like(self.betas ** 2 / ( 2 * self.posterior_variance * to_torch(alphas) * (1 - self.alphas_cumprod))) else: raise NotImplementedError("mu not supported") lvlb_weights[0] = lvlb_weights[1] self.register_buffer('lvlb_weights', lvlb_weights, persistent=False) assert not torch.isnan(self.lvlb_weights).all() @contextmanager def ema_scope(self, context=None): if self.use_ema: self.model_ema.store(self.model.parameters()) self.model_ema.copy_to(self.model) if context is not None: print(f"{context}: Switched to EMA weights") try: yield None finally: if self.use_ema: self.model_ema.restore(self.model.parameters()) if context is not None: print(f"{context}: Restored training weights") @torch.no_grad() def init_from_ckpt(self, path, ignore_keys=list(), only_model=False): sd = torch.load(path, map_location="cpu") if "state_dict" in list(sd.keys()): sd = sd["state_dict"] keys = list(sd.keys()) for k in keys: for ik in ignore_keys: if k.startswith(ik): print("Deleting key {} from state_dict.".format(k)) del sd[k] if self.make_it_fit: n_params = len([name for name, _ in itertools.chain(self.named_parameters(), self.named_buffers())]) for name, param in tqdm( itertools.chain(self.named_parameters(), self.named_buffers()), desc="Fitting old weights to new weights", total=n_params ): if not name in sd: continue old_shape = sd[name].shape new_shape = param.shape assert len(old_shape) == len(new_shape) if len(new_shape) > 2: # we only modify first two axes assert new_shape[2:] == old_shape[2:] # assumes first axis corresponds to output dim if not new_shape == old_shape: new_param = param.clone() old_param = sd[name] if len(new_shape) == 1: for i in range(new_param.shape[0]): new_param[i] = old_param[i % old_shape[0]] elif len(new_shape) >= 2: for i in range(new_param.shape[0]): for j in range(new_param.shape[1]): new_param[i, j] = old_param[i % old_shape[0], j % old_shape[1]] n_used_old = torch.ones(old_shape[1]) for j in range(new_param.shape[1]): n_used_old[j % old_shape[1]] += 1 n_used_new = torch.zeros(new_shape[1]) for j in range(new_param.shape[1]): n_used_new[j] = n_used_old[j % old_shape[1]] n_used_new = n_used_new[None, :] while len(n_used_new.shape) < len(new_shape): n_used_new = n_used_new.unsqueeze(-1) new_param /= n_used_new sd[name] = new_param missing, unexpected = self.load_state_dict(sd, strict=False) if not only_model else self.model.load_state_dict( sd, strict=False) print(f"Restored from {path} with {len(missing)} missing and {len(unexpected)} unexpected keys") if len(missing) > 0: print(f"Missing Keys:\n {missing}") if len(unexpected) > 0: print(f"\nUnexpected Keys:\n {unexpected}") def q_mean_variance(self, x_start, t): """ Get the distribution q(x_t | x_0). :param x_start: the [N x C x ...] tensor of noiseless inputs. :param t: the number of diffusion steps (minus 1). Here, 0 means one step. :return: A tuple (mean, variance, log_variance), all of x_start's shape. """ mean = (extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start) variance = extract_into_tensor(1.0 - self.alphas_cumprod, t, x_start.shape) log_variance = extract_into_tensor(self.log_one_minus_alphas_cumprod, t, x_start.shape) return mean, variance, log_variance def predict_start_from_noise(self, x_t, t, noise): return ( extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t - extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape) * noise ) def predict_start_from_z_and_v(self, x_t, t, v): # self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod))) # self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod))) return ( extract_into_tensor(self.sqrt_alphas_cumprod, t, x_t.shape) * x_t - extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_t.shape) * v ) def predict_eps_from_z_and_v(self, x_t, t, v): return ( extract_into_tensor(self.sqrt_alphas_cumprod, t, x_t.shape) * v + extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_t.shape) * x_t ) def q_posterior(self, x_start, x_t, t): posterior_mean = ( extract_into_tensor(self.posterior_mean_coef1, t, x_t.shape) * x_start + extract_into_tensor(self.posterior_mean_coef2, t, x_t.shape) * x_t ) posterior_variance = extract_into_tensor(self.posterior_variance, t, x_t.shape) posterior_log_variance_clipped = extract_into_tensor(self.posterior_log_variance_clipped, t, x_t.shape) return posterior_mean, posterior_variance, posterior_log_variance_clipped def p_mean_variance(self, x, t, clip_denoised: bool): model_out = self.model(x, t) if self.parameterization == "eps": x_recon = self.predict_start_from_noise(x, t=t, noise=model_out) elif self.parameterization == "x0": x_recon = model_out if clip_denoised: x_recon.clamp_(-1., 1.) model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t) return model_mean, posterior_variance, posterior_log_variance @torch.no_grad() def p_sample(self, x, t, clip_denoised=True, repeat_noise=False): b, *_, device = *x.shape, x.device model_mean, _, model_log_variance = self.p_mean_variance(x=x, t=t, clip_denoised=clip_denoised) noise = noise_like(x.shape, device, repeat_noise) # no noise when t == 0 nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1))) return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise @torch.no_grad() def p_sample_loop(self, shape, return_intermediates=False): device = self.betas.device b = shape[0] img = torch.randn(shape, device=device) intermediates = [img] for i in tqdm(reversed(range(0, self.num_timesteps)), desc='Sampling t', total=self.num_timesteps): img = self.p_sample(img, torch.full((b,), i, device=device, dtype=torch.long), clip_denoised=self.clip_denoised) if i % self.log_every_t == 0 or i == self.num_timesteps - 1: intermediates.append(img) if return_intermediates: return img, intermediates return img @torch.no_grad() def sample(self, batch_size=16, return_intermediates=False): image_size = self.image_size channels = self.channels return self.p_sample_loop((batch_size, channels, image_size, image_size), return_intermediates=return_intermediates) def q_sample(self, x_start, t, noise=None): noise = default(noise, lambda: torch.randn_like(x_start)) return (extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start + extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise) def get_v(self, x, noise, t): return ( extract_into_tensor(self.sqrt_alphas_cumprod, t, x.shape) * noise - extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x.shape) * x ) def get_loss(self, pred, target, mean=True): if self.loss_type == 'l1': loss = (target - pred).abs() if mean: loss = loss.mean() elif self.loss_type == 'l2': if mean: loss = torch.nn.functional.mse_loss(target, pred) else: loss = torch.nn.functional.mse_loss(target, pred, reduction='none') else: raise NotImplementedError("unknown loss type '{loss_type}'") return loss def p_losses(self, x_start, t, noise=None): noise = default(noise, lambda: torch.randn_like(x_start)) x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise) model_out = self.model(x_noisy, t) loss_dict = {} if self.parameterization == "eps": target = noise elif self.parameterization == "x0": target = x_start elif self.parameterization == "v": target = self.get_v(x_start, noise, t) else: raise NotImplementedError(f"Parameterization {self.parameterization} not yet supported") loss = self.get_loss(model_out, target, mean=False).mean(dim=[1, 2, 3]) log_prefix = 'train' if self.training else 'val' loss_dict.update({f'{log_prefix}/loss_simple': loss.mean()}) loss_simple = loss.mean() * self.l_simple_weight loss_vlb = (self.lvlb_weights[t] * loss).mean() loss_dict.update({f'{log_prefix}/loss_vlb': loss_vlb}) loss = loss_simple + self.original_elbo_weight * loss_vlb loss_dict.update({f'{log_prefix}/loss': loss}) return loss, loss_dict def forward(self, x, *args, **kwargs): # b, c, h, w, device, img_size, = *x.shape, x.device, self.image_size # assert h == img_size and w == img_size, f'height and width of image must be {img_size}' t = torch.randint(0, self.num_timesteps, (x.shape[0],), device=self.device).long() return self.p_losses(x, t, *args, **kwargs) def get_input(self, batch, k): x = batch[k] if len(x.shape) == 3: x = x[..., None] x = rearrange(x, 'b h w c -> b c h w') x = x.to(memory_format=torch.contiguous_format).float() return x def shared_step(self, batch): x = self.get_input(batch, self.first_stage_key) loss, loss_dict = self(x) return loss, loss_dict def training_step(self, batch, batch_idx): for k in self.ucg_training: p = self.ucg_training[k]["p"] val = self.ucg_training[k]["val"] if val is None: val = "" for i in range(len(batch[k])): if self.ucg_prng.choice(2, p=[1 - p, p]): batch[k][i] = val loss, loss_dict = self.shared_step(batch) self.log_dict(loss_dict, prog_bar=True, logger=True, on_step=True, on_epoch=True) self.log("global_step", self.global_step, prog_bar=True, logger=True, on_step=True, on_epoch=False) if self.use_scheduler: lr = self.optimizers().param_groups[0]['lr'] self.log('lr_abs', lr, prog_bar=True, logger=True, on_step=True, on_epoch=False) return loss @torch.no_grad() def validation_step(self, batch, batch_idx): _, loss_dict_no_ema = self.shared_step(batch) with self.ema_scope(): _, loss_dict_ema = self.shared_step(batch) loss_dict_ema = {key + '_ema': loss_dict_ema[key] for key in loss_dict_ema} self.log_dict(loss_dict_no_ema, prog_bar=False, logger=True, on_step=False, on_epoch=True) self.log_dict(loss_dict_ema, prog_bar=False, logger=True, on_step=False, on_epoch=True) def on_train_batch_end(self, *args, **kwargs): if self.use_ema: self.model_ema(self.model) def _get_rows_from_list(self, samples): n_imgs_per_row = len(samples) denoise_grid = rearrange(samples, 'n b c h w -> b n c h w') denoise_grid = rearrange(denoise_grid, 'b n c h w -> (b n) c h w') denoise_grid = make_grid(denoise_grid, nrow=n_imgs_per_row) return denoise_grid @torch.no_grad() def log_images(self, batch, N=8, n_row=2, sample=True, return_keys=None, **kwargs): log = dict() x = self.get_input(batch, self.first_stage_key) N = min(x.shape[0], N) n_row = min(x.shape[0], n_row) x = x.to(self.device)[:N] log["inputs"] = x # get diffusion row diffusion_row = list() x_start = x[:n_row] for t in range(self.num_timesteps): if t % self.log_every_t == 0 or t == self.num_timesteps - 1: t = repeat(torch.tensor([t]), '1 -> b', b=n_row) t = t.to(self.device).long() noise = torch.randn_like(x_start) x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise) diffusion_row.append(x_noisy) log["diffusion_row"] = self._get_rows_from_list(diffusion_row) if sample: # get denoise row with self.ema_scope("Plotting"): samples, denoise_row = self.sample(batch_size=N, return_intermediates=True) log["samples"] = samples log["denoise_row"] = self._get_rows_from_list(denoise_row) if return_keys: if np.intersect1d(list(log.keys()), return_keys).shape[0] == 0: return log else: return {key: log[key] for key in return_keys} return log def configure_optimizers(self): lr = self.learning_rate params = list(self.model.parameters()) if self.learn_logvar: params = params + [self.logvar] opt = torch.optim.AdamW(params, lr=lr) return opt class LatentDiffusion(DDPM): """main class""" def __init__(self, first_stage_config, cond_stage_config, num_timesteps_cond=None, cond_stage_key="image", cond_stage_trainable=False, concat_mode=True, cond_stage_forward=None, conditioning_key=None, scale_factor=1.0, scale_by_std=False, force_null_conditioning=False, *args, **kwargs): self.force_null_conditioning = force_null_conditioning self.num_timesteps_cond = default(num_timesteps_cond, 1) self.scale_by_std = scale_by_std assert self.num_timesteps_cond <= kwargs['timesteps'] # for backwards compatibility after implementation of DiffusionWrapper if conditioning_key is None: conditioning_key = 'concat' if concat_mode else 'crossattn' if cond_stage_config == '__is_unconditional__' and not self.force_null_conditioning: conditioning_key = None ckpt_path = kwargs.pop("ckpt_path", None) reset_ema = kwargs.pop("reset_ema", False) reset_num_ema_updates = kwargs.pop("reset_num_ema_updates", False) ignore_keys = kwargs.pop("ignore_keys", []) super().__init__(conditioning_key=conditioning_key, *args, **kwargs) self.concat_mode = concat_mode self.cond_stage_trainable = cond_stage_trainable self.cond_stage_key = cond_stage_key try: self.num_downs = len(first_stage_config.params.ddconfig.ch_mult) - 1 except: self.num_downs = 0 if not scale_by_std: self.scale_factor = scale_factor else: self.register_buffer('scale_factor', torch.tensor(scale_factor)) self.instantiate_first_stage(first_stage_config) self.instantiate_cond_stage(cond_stage_config) self.cond_stage_forward = cond_stage_forward self.clip_denoised = False self.bbox_tokenizer = None self.restarted_from_ckpt = False if ckpt_path is not None: self.init_from_ckpt(ckpt_path, ignore_keys) self.restarted_from_ckpt = True if reset_ema: assert self.use_ema print( f"Resetting ema to pure model weights. This is useful when restoring from an ema-only checkpoint.") self.model_ema = LitEma(self.model) if reset_num_ema_updates: print(" +++++++++++ WARNING: RESETTING NUM_EMA UPDATES TO ZERO +++++++++++ ") assert self.use_ema self.model_ema.reset_num_updates() def make_cond_schedule(self, ): self.cond_ids = torch.full(size=(self.num_timesteps,), fill_value=self.num_timesteps - 1, dtype=torch.long) ids = torch.round(torch.linspace(0, self.num_timesteps - 1, self.num_timesteps_cond)).long() self.cond_ids[:self.num_timesteps_cond] = ids @rank_zero_only @torch.no_grad() def on_train_batch_start(self, batch, batch_idx, dataloader_idx): # only for very first batch if self.scale_by_std and self.current_epoch == 0 and self.global_step == 0 and batch_idx == 0 and not self.restarted_from_ckpt: assert self.scale_factor == 1., 'rather not use custom rescaling and std-rescaling simultaneously' # set rescale weight to 1./std of encodings print("### USING STD-RESCALING ###") x = super().get_input(batch, self.first_stage_key) x = x.to(self.device) encoder_posterior = self.encode_first_stage(x) z = self.get_first_stage_encoding(encoder_posterior).detach() del self.scale_factor self.register_buffer('scale_factor', 1. / z.flatten().std()) print(f"setting self.scale_factor to {self.scale_factor}") print("### USING STD-RESCALING ###") def register_schedule(self, given_betas=None, beta_schedule="linear", timesteps=1000, linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3): super().register_schedule(given_betas, beta_schedule, timesteps, linear_start, linear_end, cosine_s) self.shorten_cond_schedule = self.num_timesteps_cond > 1 if self.shorten_cond_schedule: self.make_cond_schedule() def instantiate_first_stage(self, config): model = instantiate_from_config(config) self.first_stage_model = model.eval() self.first_stage_model.train = disabled_train for param in self.first_stage_model.parameters(): param.requires_grad = False def instantiate_cond_stage(self, config): if not self.cond_stage_trainable: if config == "__is_first_stage__": print("Using first stage also as cond stage.") self.cond_stage_model = self.first_stage_model elif config == "__is_unconditional__": print(f"Training {self.__class__.__name__} as an unconditional model.") self.cond_stage_model = None # self.be_unconditional = True else: model = instantiate_from_config(config) self.cond_stage_model = model.eval() self.cond_stage_model.train = disabled_train for param in self.cond_stage_model.parameters(): param.requires_grad = False else: assert config != '__is_first_stage__' assert config != '__is_unconditional__' model = instantiate_from_config(config) self.cond_stage_model = model def _get_denoise_row_from_list(self, samples, desc='', force_no_decoder_quantization=False): denoise_row = [] for zd in tqdm(samples, desc=desc): denoise_row.append(self.decode_first_stage(zd.to(self.device), force_not_quantize=force_no_decoder_quantization)) n_imgs_per_row = len(denoise_row) denoise_row = torch.stack(denoise_row) # n_log_step, n_row, C, H, W denoise_grid = rearrange(denoise_row, 'n b c h w -> b n c h w') denoise_grid = rearrange(denoise_grid, 'b n c h w -> (b n) c h w') denoise_grid = make_grid(denoise_grid, nrow=n_imgs_per_row) return denoise_grid def get_first_stage_encoding(self, encoder_posterior): if isinstance(encoder_posterior, DiagonalGaussianDistribution): z = encoder_posterior.sample() elif isinstance(encoder_posterior, torch.Tensor): z = encoder_posterior else: raise NotImplementedError(f"encoder_posterior of type '{type(encoder_posterior)}' not yet implemented") return self.scale_factor * z def get_learned_conditioning(self, c): if self.cond_stage_forward is None: if hasattr(self.cond_stage_model, 'encode') and callable(self.cond_stage_model.encode): c = self.cond_stage_model.encode(c) if isinstance(c, DiagonalGaussianDistribution): c = c.mode() else: c = self.cond_stage_model(c) else: assert hasattr(self.cond_stage_model, self.cond_stage_forward) c = getattr(self.cond_stage_model, self.cond_stage_forward)(c) return c def meshgrid(self, h, w): y = torch.arange(0, h).view(h, 1, 1).repeat(1, w, 1) x = torch.arange(0, w).view(1, w, 1).repeat(h, 1, 1) arr = torch.cat([y, x], dim=-1) return arr def delta_border(self, h, w): """ :param h: height :param w: width :return: normalized distance to image border, wtith min distance = 0 at border and max dist = 0.5 at image center """ lower_right_corner = torch.tensor([h - 1, w - 1]).view(1, 1, 2) arr = self.meshgrid(h, w) / lower_right_corner dist_left_up = torch.min(arr, dim=-1, keepdims=True)[0] dist_right_down = torch.min(1 - arr, dim=-1, keepdims=True)[0] edge_dist = torch.min(torch.cat([dist_left_up, dist_right_down], dim=-1), dim=-1)[0] return edge_dist def get_weighting(self, h, w, Ly, Lx, device): weighting = self.delta_border(h, w) weighting = torch.clip(weighting, self.split_input_params["clip_min_weight"], self.split_input_params["clip_max_weight"], ) weighting = weighting.view(1, h * w, 1).repeat(1, 1, Ly * Lx).to(device) if self.split_input_params["tie_braker"]: L_weighting = self.delta_border(Ly, Lx) L_weighting = torch.clip(L_weighting, self.split_input_params["clip_min_tie_weight"], self.split_input_params["clip_max_tie_weight"]) L_weighting = L_weighting.view(1, 1, Ly * Lx).to(device) weighting = weighting * L_weighting return weighting def get_fold_unfold(self, x, kernel_size, stride, uf=1, df=1): # todo load once not every time, shorten code """ :param x: img of size (bs, c, h, w) :return: n img crops of size (n, bs, c, kernel_size[0], kernel_size[1]) """ bs, nc, h, w = x.shape # number of crops in image Ly = (h - kernel_size[0]) // stride[0] + 1 Lx = (w - kernel_size[1]) // stride[1] + 1 if uf == 1 and df == 1: fold_params = dict(kernel_size=kernel_size, dilation=1, padding=0, stride=stride) unfold = torch.nn.Unfold(**fold_params) fold = torch.nn.Fold(output_size=x.shape[2:], **fold_params) weighting = self.get_weighting(kernel_size[0], kernel_size[1], Ly, Lx, x.device).to(x.dtype) normalization = fold(weighting).view(1, 1, h, w) # normalizes the overlap weighting = weighting.view((1, 1, kernel_size[0], kernel_size[1], Ly * Lx)) elif uf > 1 and df == 1: fold_params = dict(kernel_size=kernel_size, dilation=1, padding=0, stride=stride) unfold = torch.nn.Unfold(**fold_params) fold_params2 = dict(kernel_size=(kernel_size[0] * uf, kernel_size[0] * uf), dilation=1, padding=0, stride=(stride[0] * uf, stride[1] * uf)) fold = torch.nn.Fold(output_size=(x.shape[2] * uf, x.shape[3] * uf), **fold_params2) weighting = self.get_weighting(kernel_size[0] * uf, kernel_size[1] * uf, Ly, Lx, x.device).to(x.dtype) normalization = fold(weighting).view(1, 1, h * uf, w * uf) # normalizes the overlap weighting = weighting.view((1, 1, kernel_size[0] * uf, kernel_size[1] * uf, Ly * Lx)) elif df > 1 and uf == 1: fold_params = dict(kernel_size=kernel_size, dilation=1, padding=0, stride=stride) unfold = torch.nn.Unfold(**fold_params) fold_params2 = dict(kernel_size=(kernel_size[0] // df, kernel_size[0] // df), dilation=1, padding=0, stride=(stride[0] // df, stride[1] // df)) fold = torch.nn.Fold(output_size=(x.shape[2] // df, x.shape[3] // df), **fold_params2) weighting = self.get_weighting(kernel_size[0] // df, kernel_size[1] // df, Ly, Lx, x.device).to(x.dtype) normalization = fold(weighting).view(1, 1, h // df, w // df) # normalizes the overlap weighting = weighting.view((1, 1, kernel_size[0] // df, kernel_size[1] // df, Ly * Lx)) else: raise NotImplementedError return fold, unfold, normalization, weighting @torch.no_grad() def get_input(self, batch, k, return_first_stage_outputs=False, force_c_encode=False, cond_key=None, return_original_cond=False, bs=None, return_x=False): x = super().get_input(batch, k) if bs is not None: x = x[:bs] x = x.to(self.device) encoder_posterior = self.encode_first_stage(x) z = self.get_first_stage_encoding(encoder_posterior).detach() if self.model.conditioning_key is not None and not self.force_null_conditioning: if cond_key is None: cond_key = self.cond_stage_key if cond_key != self.first_stage_key: if cond_key in ['caption', 'coordinates_bbox', "txt"]: xc = batch[cond_key] elif cond_key in ['class_label', 'cls']: xc = batch else: xc = super().get_input(batch, cond_key).to(self.device) else: xc = x if not self.cond_stage_trainable or force_c_encode: if isinstance(xc, dict) or isinstance(xc, list): c = self.get_learned_conditioning(xc) else: c = self.get_learned_conditioning(xc.to(self.device)) else: c = xc if bs is not None: c = c[:bs] if self.use_positional_encodings: pos_x, pos_y = self.compute_latent_shifts(batch) ckey = __conditioning_keys__[self.model.conditioning_key] c = {ckey: c, 'pos_x': pos_x, 'pos_y': pos_y} else: c = None xc = None if self.use_positional_encodings: pos_x, pos_y = self.compute_latent_shifts(batch) c = {'pos_x': pos_x, 'pos_y': pos_y} out = [z, c] if return_first_stage_outputs: xrec = self.decode_first_stage(z) out.extend([x, xrec]) if return_x: out.extend([x]) if return_original_cond: out.append(xc) return out @torch.no_grad() def decode_first_stage(self, z, predict_cids=False, force_not_quantize=False): if predict_cids: if z.dim() == 4: z = torch.argmax(z.exp(), dim=1).long() z = self.first_stage_model.quantize.get_codebook_entry(z, shape=None) z = rearrange(z, 'b h w c -> b c h w').contiguous() z = 1. / self.scale_factor * z return self.first_stage_model.decode(z) @torch.no_grad() def encode_first_stage(self, x): return self.first_stage_model.encode(x) def shared_step(self, batch, **kwargs): x, c = self.get_input(batch, self.first_stage_key) loss = self(x, c) return loss def forward(self, x, c, *args, **kwargs): t = torch.randint(0, self.num_timesteps, (x.shape[0],), device=self.device).long() if self.model.conditioning_key is not None: assert c is not None if self.cond_stage_trainable: c = self.get_learned_conditioning(c) if self.shorten_cond_schedule: # TODO: drop this option tc = self.cond_ids[t].to(self.device) c = self.q_sample(x_start=c, t=tc, noise=torch.randn_like(c.float())) return self.p_losses(x, c, t, *args, **kwargs) def apply_model(self, x_noisy, t, cond, return_ids=False): if isinstance(cond, dict): # hybrid case, cond is expected to be a dict pass else: if not isinstance(cond, list): cond = [cond] key = 'c_concat' if self.model.conditioning_key == 'concat' else 'c_crossattn' cond = {key: cond} x_recon = self.model(x_noisy, t, **cond) if isinstance(x_recon, tuple) and not return_ids: return x_recon[0] else: return x_recon def _predict_eps_from_xstart(self, x_t, t, pred_xstart): return (extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t - pred_xstart) / \ extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape) def _prior_bpd(self, x_start): """ Get the prior KL term for the variational lower-bound, measured in bits-per-dim. This term can't be optimized, as it only depends on the encoder. :param x_start: the [N x C x ...] tensor of inputs. :return: a batch of [N] KL values (in bits), one per batch element. """ batch_size = x_start.shape[0] t = torch.tensor([self.num_timesteps - 1] * batch_size, device=x_start.device) qt_mean, _, qt_log_variance = self.q_mean_variance(x_start, t) kl_prior = normal_kl(mean1=qt_mean, logvar1=qt_log_variance, mean2=0.0, logvar2=0.0) return mean_flat(kl_prior) / np.log(2.0) def p_losses(self, x_start, cond, t, noise=None): noise = default(noise, lambda: torch.randn_like(x_start)) x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise) model_output = self.apply_model(x_noisy, t, cond) loss_dict = {} prefix = 'train' if self.training else 'val' if self.parameterization == "x0": target = x_start elif self.parameterization == "eps": target = noise elif self.parameterization == "v": target = self.get_v(x_start, noise, t) else: raise NotImplementedError() loss_simple = self.get_loss(model_output, target, mean=False).mean([1, 2, 3]) loss_dict.update({f'{prefix}/loss_simple': loss_simple.mean()}) logvar_t = self.logvar[t].to(self.device) loss = loss_simple / torch.exp(logvar_t) + logvar_t # loss = loss_simple / torch.exp(self.logvar) + self.logvar if self.learn_logvar: loss_dict.update({f'{prefix}/loss_gamma': loss.mean()}) loss_dict.update({'logvar': self.logvar.data.mean()}) loss = self.l_simple_weight * loss.mean() loss_vlb = self.get_loss(model_output, target, mean=False).mean(dim=(1, 2, 3)) loss_vlb = (self.lvlb_weights[t] * loss_vlb).mean() loss_dict.update({f'{prefix}/loss_vlb': loss_vlb}) loss += (self.original_elbo_weight * loss_vlb) loss_dict.update({f'{prefix}/loss': loss}) return loss, loss_dict def p_mean_variance(self, x, c, t, clip_denoised: bool, return_codebook_ids=False, quantize_denoised=False, return_x0=False, score_corrector=None, corrector_kwargs=None): t_in = t model_out = self.apply_model(x, t_in, c, return_ids=return_codebook_ids) if score_corrector is not None: assert self.parameterization == "eps" model_out = score_corrector.modify_score(self, model_out, x, t, c, **corrector_kwargs) if return_codebook_ids: model_out, logits = model_out if self.parameterization == "eps": x_recon = self.predict_start_from_noise(x, t=t, noise=model_out) elif self.parameterization == "x0": x_recon = model_out else: raise NotImplementedError() if clip_denoised: x_recon.clamp_(-1., 1.) if quantize_denoised: x_recon, _, [_, _, indices] = self.first_stage_model.quantize(x_recon) model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t) if return_codebook_ids: return model_mean, posterior_variance, posterior_log_variance, logits elif return_x0: return model_mean, posterior_variance, posterior_log_variance, x_recon else: return model_mean, posterior_variance, posterior_log_variance @torch.no_grad() def p_sample(self, x, c, t, clip_denoised=False, repeat_noise=False, return_codebook_ids=False, quantize_denoised=False, return_x0=False, temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None): b, *_, device = *x.shape, x.device outputs = self.p_mean_variance(x=x, c=c, t=t, clip_denoised=clip_denoised, return_codebook_ids=return_codebook_ids, quantize_denoised=quantize_denoised, return_x0=return_x0, score_corrector=score_corrector, corrector_kwargs=corrector_kwargs) if return_codebook_ids: raise DeprecationWarning("Support dropped.") model_mean, _, model_log_variance, logits = outputs elif return_x0: model_mean, _, model_log_variance, x0 = outputs else: model_mean, _, model_log_variance = outputs noise = noise_like(x.shape, device, repeat_noise) * temperature if noise_dropout > 0.: noise = torch.nn.functional.dropout(noise, p=noise_dropout) # no noise when t == 0 nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1))) if return_codebook_ids: return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise, logits.argmax(dim=1) if return_x0: return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise, x0 else: return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise @torch.no_grad() def progressive_denoising(self, cond, shape, verbose=True, callback=None, quantize_denoised=False, img_callback=None, mask=None, x0=None, temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None, batch_size=None, x_T=None, start_T=None, log_every_t=None): if not log_every_t: log_every_t = self.log_every_t timesteps = self.num_timesteps if batch_size is not None: b = batch_size if batch_size is not None else shape[0] shape = [batch_size] + list(shape) else: b = batch_size = shape[0] if x_T is None: img = torch.randn(shape, device=self.device) else: img = x_T intermediates = [] if cond is not None: if isinstance(cond, dict): cond = {key: cond[key][:batch_size] if not isinstance(cond[key], list) else list(map(lambda x: x[:batch_size], cond[key])) for key in cond} else: cond = [c[:batch_size] for c in cond] if isinstance(cond, list) else cond[:batch_size] if start_T is not None: timesteps = min(timesteps, start_T) iterator = tqdm(reversed(range(0, timesteps)), desc='Progressive Generation', total=timesteps) if verbose else reversed( range(0, timesteps)) if type(temperature) == float: temperature = [temperature] * timesteps for i in iterator: ts = torch.full((b,), i, device=self.device, dtype=torch.long) if self.shorten_cond_schedule: assert self.model.conditioning_key != 'hybrid' tc = self.cond_ids[ts].to(cond.device) cond = self.q_sample(x_start=cond, t=tc, noise=torch.randn_like(cond)) img, x0_partial = self.p_sample(img, cond, ts, clip_denoised=self.clip_denoised, quantize_denoised=quantize_denoised, return_x0=True, temperature=temperature[i], noise_dropout=noise_dropout, score_corrector=score_corrector, corrector_kwargs=corrector_kwargs) if mask is not None: assert x0 is not None img_orig = self.q_sample(x0, ts) img = img_orig * mask + (1. - mask) * img if i % log_every_t == 0 or i == timesteps - 1: intermediates.append(x0_partial) if callback: callback(i) if img_callback: img_callback(img, i) return img, intermediates @torch.no_grad() def p_sample_loop(self, cond, shape, return_intermediates=False, x_T=None, verbose=True, callback=None, timesteps=None, quantize_denoised=False, mask=None, x0=None, img_callback=None, start_T=None, log_every_t=None): if not log_every_t: log_every_t = self.log_every_t device = self.betas.device b = shape[0] if x_T is None: img = torch.randn(shape, device=device) else: img = x_T intermediates = [img] if timesteps is None: timesteps = self.num_timesteps if start_T is not None: timesteps = min(timesteps, start_T) iterator = tqdm(reversed(range(0, timesteps)), desc='Sampling t', total=timesteps) if verbose else reversed( range(0, timesteps)) if mask is not None: assert x0 is not None assert x0.shape[2:3] == mask.shape[2:3] # spatial size has to match for i in iterator: ts = torch.full((b,), i, device=device, dtype=torch.long) if self.shorten_cond_schedule: assert self.model.conditioning_key != 'hybrid' tc = self.cond_ids[ts].to(cond.device) cond = self.q_sample(x_start=cond, t=tc, noise=torch.randn_like(cond)) img = self.p_sample(img, cond, ts, clip_denoised=self.clip_denoised, quantize_denoised=quantize_denoised) if mask is not None: img_orig = self.q_sample(x0, ts) img = img_orig * mask + (1. - mask) * img if i % log_every_t == 0 or i == timesteps - 1: intermediates.append(img) if callback: callback(i) if img_callback: img_callback(img, i) if return_intermediates: return img, intermediates return img @torch.no_grad() def sample(self, cond, batch_size=16, return_intermediates=False, x_T=None, verbose=True, timesteps=None, quantize_denoised=False, mask=None, x0=None, shape=None, **kwargs): if shape is None: shape = (batch_size, self.channels, self.image_size, self.image_size) if cond is not None: if isinstance(cond, dict): cond = {key: cond[key][:batch_size] if not isinstance(cond[key], list) else list(map(lambda x: x[:batch_size], cond[key])) for key in cond} else: cond = [c[:batch_size] for c in cond] if isinstance(cond, list) else cond[:batch_size] return self.p_sample_loop(cond, shape, return_intermediates=return_intermediates, x_T=x_T, verbose=verbose, timesteps=timesteps, quantize_denoised=quantize_denoised, mask=mask, x0=x0) @torch.no_grad() def sample_log(self, cond, batch_size, ddim, ddim_steps, **kwargs): if ddim: ddim_sampler = DDIMSampler(self) shape = (self.channels, self.image_size, self.image_size) samples, intermediates = ddim_sampler.sample(ddim_steps, batch_size, shape, cond, verbose=False, **kwargs) else: samples, intermediates = self.sample(cond=cond, batch_size=batch_size, return_intermediates=True, **kwargs) return samples, intermediates @torch.no_grad() def get_unconditional_conditioning(self, batch_size, null_label=None): if null_label is not None: xc = null_label if isinstance(xc, ListConfig): xc = list(xc) if isinstance(xc, dict) or isinstance(xc, list): c = self.get_learned_conditioning(xc) else: if hasattr(xc, "to"): xc = xc.to(self.device) c = self.get_learned_conditioning(xc) else: if self.cond_stage_key in ["class_label", "cls"]: xc = self.cond_stage_model.get_unconditional_conditioning(batch_size, device=self.device) return self.get_learned_conditioning(xc) else: raise NotImplementedError("todo") if isinstance(c, list): # in case the encoder gives us a list for i in range(len(c)): c[i] = repeat(c[i], '1 ... -> b ...', b=batch_size).to(self.device) else: c = repeat(c, '1 ... -> b ...', b=batch_size).to(self.device) return c @torch.no_grad() def log_images(self, batch, N=8, n_row=4, sample=True, ddim_steps=50, ddim_eta=0., return_keys=None, quantize_denoised=True, inpaint=True, plot_denoise_rows=False, plot_progressive_rows=True, plot_diffusion_rows=True, unconditional_guidance_scale=1., unconditional_guidance_label=None, use_ema_scope=True, **kwargs): ema_scope = self.ema_scope if use_ema_scope else nullcontext use_ddim = ddim_steps is not None log = dict() z, c, x, xrec, xc = self.get_input(batch, self.first_stage_key, return_first_stage_outputs=True, force_c_encode=True, return_original_cond=True, bs=N) N = min(x.shape[0], N) n_row = min(x.shape[0], n_row) log["inputs"] = x log["reconstruction"] = xrec if self.model.conditioning_key is not None: if hasattr(self.cond_stage_model, "decode"): xc = self.cond_stage_model.decode(c) log["conditioning"] = xc elif self.cond_stage_key in ["caption", "txt"]: xc = log_txt_as_img((x.shape[2], x.shape[3]), batch[self.cond_stage_key], size=x.shape[2] // 25) log["conditioning"] = xc elif self.cond_stage_key in ['class_label', "cls"]: try: xc = log_txt_as_img((x.shape[2], x.shape[3]), batch["human_label"], size=x.shape[2] // 25) log['conditioning'] = xc except KeyError: # probably no "human_label" in batch pass elif isimage(xc): log["conditioning"] = xc if ismap(xc): log["original_conditioning"] = self.to_rgb(xc) if plot_diffusion_rows: # get diffusion row diffusion_row = list() z_start = z[:n_row] for t in range(self.num_timesteps): if t % self.log_every_t == 0 or t == self.num_timesteps - 1: t = repeat(torch.tensor([t]), '1 -> b', b=n_row) t = t.to(self.device).long() noise = torch.randn_like(z_start) z_noisy = self.q_sample(x_start=z_start, t=t, noise=noise) diffusion_row.append(self.decode_first_stage(z_noisy)) diffusion_row = torch.stack(diffusion_row) # n_log_step, n_row, C, H, W diffusion_grid = rearrange(diffusion_row, 'n b c h w -> b n c h w') diffusion_grid = rearrange(diffusion_grid, 'b n c h w -> (b n) c h w') diffusion_grid = make_grid(diffusion_grid, nrow=diffusion_row.shape[0]) log["diffusion_row"] = diffusion_grid if sample: # get denoise row with ema_scope("Sampling"): samples, z_denoise_row = self.sample_log(cond=c, batch_size=N, ddim=use_ddim, ddim_steps=ddim_steps, eta=ddim_eta) # samples, z_denoise_row = self.sample(cond=c, batch_size=N, return_intermediates=True) x_samples = self.decode_first_stage(samples) log["samples"] = x_samples if plot_denoise_rows: denoise_grid = self._get_denoise_row_from_list(z_denoise_row) log["denoise_row"] = denoise_grid if quantize_denoised and not isinstance(self.first_stage_model, AutoencoderKL) and not isinstance( self.first_stage_model, IdentityFirstStage): # also display when quantizing x0 while sampling with ema_scope("Plotting Quantized Denoised"): samples, z_denoise_row = self.sample_log(cond=c, batch_size=N, ddim=use_ddim, ddim_steps=ddim_steps, eta=ddim_eta, quantize_denoised=True) # samples, z_denoise_row = self.sample(cond=c, batch_size=N, return_intermediates=True, # quantize_denoised=True) x_samples = self.decode_first_stage(samples.to(self.device)) log["samples_x0_quantized"] = x_samples if unconditional_guidance_scale > 1.0: uc = self.get_unconditional_conditioning(N, unconditional_guidance_label) if self.model.conditioning_key == "crossattn-adm": uc = {"c_crossattn": [uc], "c_adm": c["c_adm"]} with ema_scope("Sampling with classifier-free guidance"): samples_cfg, _ = self.sample_log(cond=c, batch_size=N, ddim=use_ddim, ddim_steps=ddim_steps, eta=ddim_eta, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=uc, ) x_samples_cfg = self.decode_first_stage(samples_cfg) log[f"samples_cfg_scale_{unconditional_guidance_scale:.2f}"] = x_samples_cfg if inpaint: # make a simple center square b, h, w = z.shape[0], z.shape[2], z.shape[3] mask = torch.ones(N, h, w).to(self.device) # zeros will be filled in mask[:, h // 4:3 * h // 4, w // 4:3 * w // 4] = 0. mask = mask[:, None, ...] with ema_scope("Plotting Inpaint"): samples, _ = self.sample_log(cond=c, batch_size=N, ddim=use_ddim, eta=ddim_eta, ddim_steps=ddim_steps, x0=z[:N], mask=mask) x_samples = self.decode_first_stage(samples.to(self.device)) log["samples_inpainting"] = x_samples log["mask"] = mask # outpaint mask = 1. - mask with ema_scope("Plotting Outpaint"): samples, _ = self.sample_log(cond=c, batch_size=N, ddim=use_ddim, eta=ddim_eta, ddim_steps=ddim_steps, x0=z[:N], mask=mask) x_samples = self.decode_first_stage(samples.to(self.device)) log["samples_outpainting"] = x_samples if plot_progressive_rows: with ema_scope("Plotting Progressives"): img, progressives = self.progressive_denoising(c, shape=(self.channels, self.image_size, self.image_size), batch_size=N) prog_row = self._get_denoise_row_from_list(progressives, desc="Progressive Generation") log["progressive_row"] = prog_row if return_keys: if np.intersect1d(list(log.keys()), return_keys).shape[0] == 0: return log else: return {key: log[key] for key in return_keys} return log def configure_optimizers(self): lr = self.learning_rate params = list(self.model.parameters()) if self.cond_stage_trainable: print(f"{self.__class__.__name__}: Also optimizing conditioner params!") params = params + list(self.cond_stage_model.parameters()) if self.learn_logvar: print('Diffusion model optimizing logvar') params.append(self.logvar) opt = torch.optim.AdamW(params, lr=lr) if self.use_scheduler: assert 'target' in self.scheduler_config scheduler = instantiate_from_config(self.scheduler_config) print("Setting up LambdaLR scheduler...") scheduler = [ { 'scheduler': LambdaLR(opt, lr_lambda=scheduler.schedule), 'interval': 'step', 'frequency': 1 }] return [opt], scheduler return opt @torch.no_grad() def to_rgb(self, x): x = x.float() if not hasattr(self, "colorize"): self.colorize = torch.randn(3, x.shape[1], 1, 1).to(x) x = nn.functional.conv2d(x, weight=self.colorize) x = 2. * (x - x.min()) / (x.max() - x.min()) - 1. return x class DiffusionWrapper(pl.LightningModule): def __init__(self, diff_model_config, conditioning_key): super().__init__() self.sequential_cross_attn = diff_model_config.pop("sequential_crossattn", False) self.diffusion_model = instantiate_from_config(diff_model_config) self.conditioning_key = conditioning_key assert self.conditioning_key in [None, 'concat', 'crossattn', 'hybrid', 'adm', 'hybrid-adm', 'crossattn-adm'] def forward(self, x, t, c_concat: list = None, c_crossattn: list = None, c_adm=None): if self.conditioning_key is None: out = self.diffusion_model(x, t) elif self.conditioning_key == 'concat': xc = torch.cat([x] + c_concat, dim=1) out = self.diffusion_model(xc, t) elif self.conditioning_key == 'crossattn': if not self.sequential_cross_attn: cc = torch.cat(c_crossattn, 1) else: cc = c_crossattn out = self.diffusion_model(x, t, context=cc) elif self.conditioning_key == 'hybrid': xc = torch.cat([x] + c_concat, dim=1) cc = torch.cat(c_crossattn, 1) out = self.diffusion_model(xc, t, context=cc) elif self.conditioning_key == 'hybrid-adm': assert c_adm is not None xc = torch.cat([x] + c_concat, dim=1) cc = torch.cat(c_crossattn, 1) out = self.diffusion_model(xc, t, context=cc, y=c_adm) elif self.conditioning_key == 'crossattn-adm': assert c_adm is not None cc = torch.cat(c_crossattn, 1) out = self.diffusion_model(x, t, context=cc, y=c_adm) elif self.conditioning_key == 'adm': cc = c_crossattn[0] out = self.diffusion_model(x, t, y=cc) else: raise NotImplementedError() return out class LatentUpscaleDiffusion(LatentDiffusion): def __init__(self, *args, low_scale_config, low_scale_key="LR", noise_level_key=None, **kwargs): super().__init__(*args, **kwargs) # assumes that neither the cond_stage nor the low_scale_model contain trainable params assert not self.cond_stage_trainable self.instantiate_low_stage(low_scale_config) self.low_scale_key = low_scale_key self.noise_level_key = noise_level_key def instantiate_low_stage(self, config): model = instantiate_from_config(config) self.low_scale_model = model.eval() self.low_scale_model.train = disabled_train for param in self.low_scale_model.parameters(): param.requires_grad = False @torch.no_grad() def get_input(self, batch, k, cond_key=None, bs=None, log_mode=False): if not log_mode: z, c = super().get_input(batch, k, force_c_encode=True, bs=bs) else: z, c, x, xrec, xc = super().get_input(batch, self.first_stage_key, return_first_stage_outputs=True, force_c_encode=True, return_original_cond=True, bs=bs) x_low = batch[self.low_scale_key][:bs] x_low = rearrange(x_low, 'b h w c -> b c h w') x_low = x_low.to(memory_format=torch.contiguous_format).float() zx, noise_level = self.low_scale_model(x_low) if self.noise_level_key is not None: # get noise level from batch instead, e.g. when extracting a custom noise level for bsr raise NotImplementedError('TODO') all_conds = {"c_concat": [zx], "c_crossattn": [c], "c_adm": noise_level} if log_mode: # TODO: maybe disable if too expensive x_low_rec = self.low_scale_model.decode(zx) return z, all_conds, x, xrec, xc, x_low, x_low_rec, noise_level return z, all_conds @torch.no_grad() def log_images(self, batch, N=8, n_row=4, sample=True, ddim_steps=200, ddim_eta=1., return_keys=None, plot_denoise_rows=False, plot_progressive_rows=True, plot_diffusion_rows=True, unconditional_guidance_scale=1., unconditional_guidance_label=None, use_ema_scope=True, **kwargs): ema_scope = self.ema_scope if use_ema_scope else nullcontext use_ddim = ddim_steps is not None log = dict() z, c, x, xrec, xc, x_low, x_low_rec, noise_level = self.get_input(batch, self.first_stage_key, bs=N, log_mode=True) N = min(x.shape[0], N) n_row = min(x.shape[0], n_row) log["inputs"] = x log["reconstruction"] = xrec log["x_lr"] = x_low log[f"x_lr_rec_@noise_levels{'-'.join(map(lambda x: str(x), list(noise_level.cpu().numpy())))}"] = x_low_rec if self.model.conditioning_key is not None: if hasattr(self.cond_stage_model, "decode"): xc = self.cond_stage_model.decode(c) log["conditioning"] = xc elif self.cond_stage_key in ["caption", "txt"]: xc = log_txt_as_img((x.shape[2], x.shape[3]), batch[self.cond_stage_key], size=x.shape[2] // 25) log["conditioning"] = xc elif self.cond_stage_key in ['class_label', 'cls']: xc = log_txt_as_img((x.shape[2], x.shape[3]), batch["human_label"], size=x.shape[2] // 25) log['conditioning'] = xc elif isimage(xc): log["conditioning"] = xc if ismap(xc): log["original_conditioning"] = self.to_rgb(xc) if plot_diffusion_rows: # get diffusion row diffusion_row = list() z_start = z[:n_row] for t in range(self.num_timesteps): if t % self.log_every_t == 0 or t == self.num_timesteps - 1: t = repeat(torch.tensor([t]), '1 -> b', b=n_row) t = t.to(self.device).long() noise = torch.randn_like(z_start) z_noisy = self.q_sample(x_start=z_start, t=t, noise=noise) diffusion_row.append(self.decode_first_stage(z_noisy)) diffusion_row = torch.stack(diffusion_row) # n_log_step, n_row, C, H, W diffusion_grid = rearrange(diffusion_row, 'n b c h w -> b n c h w') diffusion_grid = rearrange(diffusion_grid, 'b n c h w -> (b n) c h w') diffusion_grid = make_grid(diffusion_grid, nrow=diffusion_row.shape[0]) log["diffusion_row"] = diffusion_grid if sample: # get denoise row with ema_scope("Sampling"): samples, z_denoise_row = self.sample_log(cond=c, batch_size=N, ddim=use_ddim, ddim_steps=ddim_steps, eta=ddim_eta) # samples, z_denoise_row = self.sample(cond=c, batch_size=N, return_intermediates=True) x_samples = self.decode_first_stage(samples) log["samples"] = x_samples if plot_denoise_rows: denoise_grid = self._get_denoise_row_from_list(z_denoise_row) log["denoise_row"] = denoise_grid if unconditional_guidance_scale > 1.0: uc_tmp = self.get_unconditional_conditioning(N, unconditional_guidance_label) # TODO explore better "unconditional" choices for the other keys # maybe guide away from empty text label and highest noise level and maximally degraded zx? uc = dict() for k in c: if k == "c_crossattn": assert isinstance(c[k], list) and len(c[k]) == 1 uc[k] = [uc_tmp] elif k == "c_adm": # todo: only run with text-based guidance? assert isinstance(c[k], torch.Tensor) #uc[k] = torch.ones_like(c[k]) * self.low_scale_model.max_noise_level uc[k] = c[k] elif isinstance(c[k], list): uc[k] = [c[k][i] for i in range(len(c[k]))] else: uc[k] = c[k] with ema_scope("Sampling with classifier-free guidance"): samples_cfg, _ = self.sample_log(cond=c, batch_size=N, ddim=use_ddim, ddim_steps=ddim_steps, eta=ddim_eta, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=uc, ) x_samples_cfg = self.decode_first_stage(samples_cfg) log[f"samples_cfg_scale_{unconditional_guidance_scale:.2f}"] = x_samples_cfg if plot_progressive_rows: with ema_scope("Plotting Progressives"): img, progressives = self.progressive_denoising(c, shape=(self.channels, self.image_size, self.image_size), batch_size=N) prog_row = self._get_denoise_row_from_list(progressives, desc="Progressive Generation") log["progressive_row"] = prog_row return log class LatentFinetuneDiffusion(LatentDiffusion): """ Basis for different finetunas, such as inpainting or depth2image To disable finetuning mode, set finetune_keys to None """ def __init__(self, concat_keys: tuple, finetune_keys=("model.diffusion_model.input_blocks.0.0.weight", "model_ema.diffusion_modelinput_blocks00weight" ), keep_finetune_dims=4, # if model was trained without concat mode before and we would like to keep these channels c_concat_log_start=None, # to log reconstruction of c_concat codes c_concat_log_end=None, *args, **kwargs ): ckpt_path = kwargs.pop("ckpt_path", None) ignore_keys = kwargs.pop("ignore_keys", list()) super().__init__(*args, **kwargs) self.finetune_keys = finetune_keys self.concat_keys = concat_keys self.keep_dims = keep_finetune_dims self.c_concat_log_start = c_concat_log_start self.c_concat_log_end = c_concat_log_end if exists(self.finetune_keys): assert exists(ckpt_path), 'can only finetune from a given checkpoint' if exists(ckpt_path): self.init_from_ckpt(ckpt_path, ignore_keys) def init_from_ckpt(self, path, ignore_keys=list(), only_model=False): sd = torch.load(path, map_location="cpu") if "state_dict" in list(sd.keys()): sd = sd["state_dict"] keys = list(sd.keys()) for k in keys: for ik in ignore_keys: if k.startswith(ik): print("Deleting key {} from state_dict.".format(k)) del sd[k] # make it explicit, finetune by including extra input channels if exists(self.finetune_keys) and k in self.finetune_keys: new_entry = None for name, param in self.named_parameters(): if name in self.finetune_keys: print( f"modifying key '{name}' and keeping its original {self.keep_dims} (channels) dimensions only") new_entry = torch.zeros_like(param) # zero init assert exists(new_entry), 'did not find matching parameter to modify' new_entry[:, :self.keep_dims, ...] = sd[k] sd[k] = new_entry missing, unexpected = self.load_state_dict(sd, strict=False) if not only_model else self.model.load_state_dict( sd, strict=False) print(f"Restored from {path} with {len(missing)} missing and {len(unexpected)} unexpected keys") if len(missing) > 0: print(f"Missing Keys: {missing}") if len(unexpected) > 0: print(f"Unexpected Keys: {unexpected}") @torch.no_grad() def log_images(self, batch, N=8, n_row=4, sample=True, ddim_steps=200, ddim_eta=1., return_keys=None, quantize_denoised=True, inpaint=True, plot_denoise_rows=False, plot_progressive_rows=True, plot_diffusion_rows=True, unconditional_guidance_scale=1., unconditional_guidance_label=None, use_ema_scope=True, **kwargs): ema_scope = self.ema_scope if use_ema_scope else nullcontext use_ddim = ddim_steps is not None log = dict() z, c, x, xrec, xc = self.get_input(batch, self.first_stage_key, bs=N, return_first_stage_outputs=True) c_cat, c = c["c_concat"][0], c["c_crossattn"][0] N = min(x.shape[0], N) n_row = min(x.shape[0], n_row) log["inputs"] = x log["reconstruction"] = xrec if self.model.conditioning_key is not None: if hasattr(self.cond_stage_model, "decode"): xc = self.cond_stage_model.decode(c) log["conditioning"] = xc elif self.cond_stage_key in ["caption", "txt"]: xc = log_txt_as_img((x.shape[2], x.shape[3]), batch[self.cond_stage_key], size=x.shape[2] // 25) log["conditioning"] = xc elif self.cond_stage_key in ['class_label', 'cls']: xc = log_txt_as_img((x.shape[2], x.shape[3]), batch["human_label"], size=x.shape[2] // 25) log['conditioning'] = xc elif isimage(xc): log["conditioning"] = xc if ismap(xc): log["original_conditioning"] = self.to_rgb(xc) if not (self.c_concat_log_start is None and self.c_concat_log_end is None): log["c_concat_decoded"] = self.decode_first_stage(c_cat[:, self.c_concat_log_start:self.c_concat_log_end]) if plot_diffusion_rows: # get diffusion row diffusion_row = list() z_start = z[:n_row] for t in range(self.num_timesteps): if t % self.log_every_t == 0 or t == self.num_timesteps - 1: t = repeat(torch.tensor([t]), '1 -> b', b=n_row) t = t.to(self.device).long() noise = torch.randn_like(z_start) z_noisy = self.q_sample(x_start=z_start, t=t, noise=noise) diffusion_row.append(self.decode_first_stage(z_noisy)) diffusion_row = torch.stack(diffusion_row) # n_log_step, n_row, C, H, W diffusion_grid = rearrange(diffusion_row, 'n b c h w -> b n c h w') diffusion_grid = rearrange(diffusion_grid, 'b n c h w -> (b n) c h w') diffusion_grid = make_grid(diffusion_grid, nrow=diffusion_row.shape[0]) log["diffusion_row"] = diffusion_grid if sample: # get denoise row with ema_scope("Sampling"): samples, z_denoise_row = self.sample_log(cond={"c_concat": [c_cat], "c_crossattn": [c]}, batch_size=N, ddim=use_ddim, ddim_steps=ddim_steps, eta=ddim_eta) # samples, z_denoise_row = self.sample(cond=c, batch_size=N, return_intermediates=True) x_samples = self.decode_first_stage(samples) log["samples"] = x_samples if plot_denoise_rows: denoise_grid = self._get_denoise_row_from_list(z_denoise_row) log["denoise_row"] = denoise_grid if unconditional_guidance_scale > 1.0: uc_cross = self.get_unconditional_conditioning(N, unconditional_guidance_label) uc_cat = c_cat uc_full = {"c_concat": [uc_cat], "c_crossattn": [uc_cross]} with ema_scope("Sampling with classifier-free guidance"): samples_cfg, _ = self.sample_log(cond={"c_concat": [c_cat], "c_crossattn": [c]}, batch_size=N, ddim=use_ddim, ddim_steps=ddim_steps, eta=ddim_eta, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=uc_full, ) x_samples_cfg = self.decode_first_stage(samples_cfg) log[f"samples_cfg_scale_{unconditional_guidance_scale:.2f}"] = x_samples_cfg return log class LatentInpaintDiffusion(LatentFinetuneDiffusion): """ can either run as pure inpainting model (only concat mode) or with mixed conditionings, e.g. mask as concat and text via cross-attn. To disable finetuning mode, set finetune_keys to None """ def __init__(self, concat_keys=("mask", "masked_image"), masked_image_key="masked_image", *args, **kwargs ): super().__init__(concat_keys, *args, **kwargs) self.masked_image_key = masked_image_key assert self.masked_image_key in concat_keys @torch.no_grad() def get_input(self, batch, k, cond_key=None, bs=None, return_first_stage_outputs=False): # note: restricted to non-trainable encoders currently assert not self.cond_stage_trainable, 'trainable cond stages not yet supported for inpainting' z, c, x, xrec, xc = super().get_input(batch, self.first_stage_key, return_first_stage_outputs=True, force_c_encode=True, return_original_cond=True, bs=bs) assert exists(self.concat_keys) c_cat = list() for ck in self.concat_keys: cc = rearrange(batch[ck], 'b h w c -> b c h w').to(memory_format=torch.contiguous_format).float() if bs is not None: cc = cc[:bs] cc = cc.to(self.device) bchw = z.shape if ck != self.masked_image_key: cc = torch.nn.functional.interpolate(cc, size=bchw[-2:]) else: cc = self.get_first_stage_encoding(self.encode_first_stage(cc)) c_cat.append(cc) c_cat = torch.cat(c_cat, dim=1) all_conds = {"c_concat": [c_cat], "c_crossattn": [c]} if return_first_stage_outputs: return z, all_conds, x, xrec, xc return z, all_conds @torch.no_grad() def log_images(self, *args, **kwargs): log = super(LatentInpaintDiffusion, self).log_images(*args, **kwargs) log["masked_image"] = rearrange(args[0]["masked_image"], 'b h w c -> b c h w').to(memory_format=torch.contiguous_format).float() return log class LatentDepth2ImageDiffusion(LatentFinetuneDiffusion): """ condition on monocular depth estimation """ def __init__(self, depth_stage_config, concat_keys=("midas_in",), *args, **kwargs): super().__init__(concat_keys=concat_keys, *args, **kwargs) self.depth_model = instantiate_from_config(depth_stage_config) self.depth_stage_key = concat_keys[0] @torch.no_grad() def get_input(self, batch, k, cond_key=None, bs=None, return_first_stage_outputs=False): # note: restricted to non-trainable encoders currently assert not self.cond_stage_trainable, 'trainable cond stages not yet supported for depth2img' z, c, x, xrec, xc = super().get_input(batch, self.first_stage_key, return_first_stage_outputs=True, force_c_encode=True, return_original_cond=True, bs=bs) assert exists(self.concat_keys) assert len(self.concat_keys) == 1 c_cat = list() for ck in self.concat_keys: cc = batch[ck] if bs is not None: cc = cc[:bs] cc = cc.to(self.device) cc = self.depth_model(cc) cc = torch.nn.functional.interpolate( cc, size=z.shape[2:], mode="bicubic", align_corners=False, ) depth_min, depth_max = torch.amin(cc, dim=[1, 2, 3], keepdim=True), torch.amax(cc, dim=[1, 2, 3], keepdim=True) cc = 2. * (cc - depth_min) / (depth_max - depth_min + 0.001) - 1. c_cat.append(cc) c_cat = torch.cat(c_cat, dim=1) all_conds = {"c_concat": [c_cat], "c_crossattn": [c]} if return_first_stage_outputs: return z, all_conds, x, xrec, xc return z, all_conds @torch.no_grad() def log_images(self, *args, **kwargs): log = super().log_images(*args, **kwargs) depth = self.depth_model(args[0][self.depth_stage_key]) depth_min, depth_max = torch.amin(depth, dim=[1, 2, 3], keepdim=True), \ torch.amax(depth, dim=[1, 2, 3], keepdim=True) log["depth"] = 2. * (depth - depth_min) / (depth_max - depth_min) - 1. return log class LatentUpscaleFinetuneDiffusion(LatentFinetuneDiffusion): """ condition on low-res image (and optionally on some spatial noise augmentation) """ def __init__(self, concat_keys=("lr",), reshuffle_patch_size=None, low_scale_config=None, low_scale_key=None, *args, **kwargs): super().__init__(concat_keys=concat_keys, *args, **kwargs) self.reshuffle_patch_size = reshuffle_patch_size self.low_scale_model = None if low_scale_config is not None: print("Initializing a low-scale model") assert exists(low_scale_key) self.instantiate_low_stage(low_scale_config) self.low_scale_key = low_scale_key def instantiate_low_stage(self, config): model = instantiate_from_config(config) self.low_scale_model = model.eval() self.low_scale_model.train = disabled_train for param in self.low_scale_model.parameters(): param.requires_grad = False @torch.no_grad() def get_input(self, batch, k, cond_key=None, bs=None, return_first_stage_outputs=False): # note: restricted to non-trainable encoders currently assert not self.cond_stage_trainable, 'trainable cond stages not yet supported for upscaling-ft' z, c, x, xrec, xc = super().get_input(batch, self.first_stage_key, return_first_stage_outputs=True, force_c_encode=True, return_original_cond=True, bs=bs) assert exists(self.concat_keys) assert len(self.concat_keys) == 1 # optionally make spatial noise_level here c_cat = list() noise_level = None for ck in self.concat_keys: cc = batch[ck] cc = rearrange(cc, 'b h w c -> b c h w') if exists(self.reshuffle_patch_size): assert isinstance(self.reshuffle_patch_size, int) cc = rearrange(cc, 'b c (p1 h) (p2 w) -> b (p1 p2 c) h w', p1=self.reshuffle_patch_size, p2=self.reshuffle_patch_size) if bs is not None: cc = cc[:bs] cc = cc.to(self.device) if exists(self.low_scale_model) and ck == self.low_scale_key: cc, noise_level = self.low_scale_model(cc) c_cat.append(cc) c_cat = torch.cat(c_cat, dim=1) if exists(noise_level): all_conds = {"c_concat": [c_cat], "c_crossattn": [c], "c_adm": noise_level} else: all_conds = {"c_concat": [c_cat], "c_crossattn": [c]} if return_first_stage_outputs: return z, all_conds, x, xrec, xc return z, all_conds @torch.no_grad() def log_images(self, *args, **kwargs): log = super().log_images(*args, **kwargs) log["lr"] = rearrange(args[0]["lr"], 'b h w c -> b c h w') return log ================================================ FILE: ToonCrafter/ldm/models/diffusion/dpm_solver/__init__.py ================================================ from .sampler import DPMSolverSampler ================================================ FILE: ToonCrafter/ldm/models/diffusion/dpm_solver/dpm_solver.py ================================================ import torch import torch.nn.functional as F import math from tqdm import tqdm class NoiseScheduleVP: def __init__( self, schedule='discrete', betas=None, alphas_cumprod=None, continuous_beta_0=0.1, continuous_beta_1=20., ): """Create a wrapper class for the forward SDE (VP type). *** Update: We support discrete-time diffusion models by implementing a picewise linear interpolation for log_alpha_t. We recommend to use schedule='discrete' for the discrete-time diffusion models, especially for high-resolution images. *** The forward SDE ensures that the condition distribution q_{t|0}(x_t | x_0) = N ( alpha_t * x_0, sigma_t^2 * I ). We further define lambda_t = log(alpha_t) - log(sigma_t), which is the half-logSNR (described in the DPM-Solver paper). Therefore, we implement the functions for computing alpha_t, sigma_t and lambda_t. For t in [0, T], we have: log_alpha_t = self.marginal_log_mean_coeff(t) sigma_t = self.marginal_std(t) lambda_t = self.marginal_lambda(t) Moreover, as lambda(t) is an invertible function, we also support its inverse function: t = self.inverse_lambda(lambda_t) =============================================================== We support both discrete-time DPMs (trained on n = 0, 1, ..., N-1) and continuous-time DPMs (trained on t in [t_0, T]). 1. For discrete-time DPMs: For discrete-time DPMs trained on n = 0, 1, ..., N-1, we convert the discrete steps to continuous time steps by: t_i = (i + 1) / N e.g. for N = 1000, we have t_0 = 1e-3 and T = t_{N-1} = 1. We solve the corresponding diffusion ODE from time T = 1 to time t_0 = 1e-3. Args: betas: A `torch.Tensor`. The beta array for the discrete-time DPM. (See the original DDPM paper for details) alphas_cumprod: A `torch.Tensor`. The cumprod alphas for the discrete-time DPM. (See the original DDPM paper for details) Note that we always have alphas_cumprod = cumprod(betas). Therefore, we only need to set one of `betas` and `alphas_cumprod`. **Important**: Please pay special attention for the args for `alphas_cumprod`: The `alphas_cumprod` is the \hat{alpha_n} arrays in the notations of DDPM. Specifically, DDPMs assume that q_{t_n | 0}(x_{t_n} | x_0) = N ( \sqrt{\hat{alpha_n}} * x_0, (1 - \hat{alpha_n}) * I ). Therefore, the notation \hat{alpha_n} is different from the notation alpha_t in DPM-Solver. In fact, we have alpha_{t_n} = \sqrt{\hat{alpha_n}}, and log(alpha_{t_n}) = 0.5 * log(\hat{alpha_n}). 2. For continuous-time DPMs: We support two types of VPSDEs: linear (DDPM) and cosine (improved-DDPM). The hyperparameters for the noise schedule are the default settings in DDPM and improved-DDPM: Args: beta_min: A `float` number. The smallest beta for the linear schedule. beta_max: A `float` number. The largest beta for the linear schedule. cosine_s: A `float` number. The hyperparameter in the cosine schedule. cosine_beta_max: A `float` number. The hyperparameter in the cosine schedule. T: A `float` number. The ending time of the forward process. =============================================================== Args: schedule: A `str`. The noise schedule of the forward SDE. 'discrete' for discrete-time DPMs, 'linear' or 'cosine' for continuous-time DPMs. Returns: A wrapper object of the forward SDE (VP type). =============================================================== Example: # For discrete-time DPMs, given betas (the beta array for n = 0, 1, ..., N - 1): >>> ns = NoiseScheduleVP('discrete', betas=betas) # For discrete-time DPMs, given alphas_cumprod (the \hat{alpha_n} array for n = 0, 1, ..., N - 1): >>> ns = NoiseScheduleVP('discrete', alphas_cumprod=alphas_cumprod) # For continuous-time DPMs (VPSDE), linear schedule: >>> ns = NoiseScheduleVP('linear', continuous_beta_0=0.1, continuous_beta_1=20.) """ if schedule not in ['discrete', 'linear', 'cosine']: raise ValueError( "Unsupported noise schedule {}. The schedule needs to be 'discrete' or 'linear' or 'cosine'".format( schedule)) self.schedule = schedule if schedule == 'discrete': if betas is not None: log_alphas = 0.5 * torch.log(1 - betas).cumsum(dim=0) else: assert alphas_cumprod is not None log_alphas = 0.5 * torch.log(alphas_cumprod) self.total_N = len(log_alphas) self.T = 1. self.t_array = torch.linspace(0., 1., self.total_N + 1)[1:].reshape((1, -1)) self.log_alpha_array = log_alphas.reshape((1, -1,)) else: self.total_N = 1000 self.beta_0 = continuous_beta_0 self.beta_1 = continuous_beta_1 self.cosine_s = 0.008 self.cosine_beta_max = 999. self.cosine_t_max = math.atan(self.cosine_beta_max * (1. + self.cosine_s) / math.pi) * 2. * ( 1. + self.cosine_s) / math.pi - self.cosine_s self.cosine_log_alpha_0 = math.log(math.cos(self.cosine_s / (1. + self.cosine_s) * math.pi / 2.)) self.schedule = schedule if schedule == 'cosine': # For the cosine schedule, T = 1 will have numerical issues. So we manually set the ending time T. # Note that T = 0.9946 may be not the optimal setting. However, we find it works well. self.T = 0.9946 else: self.T = 1. def marginal_log_mean_coeff(self, t): """ Compute log(alpha_t) of a given continuous-time label t in [0, T]. """ if self.schedule == 'discrete': return interpolate_fn(t.reshape((-1, 1)), self.t_array.to(t.device), self.log_alpha_array.to(t.device)).reshape((-1)) elif self.schedule == 'linear': return -0.25 * t ** 2 * (self.beta_1 - self.beta_0) - 0.5 * t * self.beta_0 elif self.schedule == 'cosine': log_alpha_fn = lambda s: torch.log(torch.cos((s + self.cosine_s) / (1. + self.cosine_s) * math.pi / 2.)) log_alpha_t = log_alpha_fn(t) - self.cosine_log_alpha_0 return log_alpha_t def marginal_alpha(self, t): """ Compute alpha_t of a given continuous-time label t in [0, T]. """ return torch.exp(self.marginal_log_mean_coeff(t)) def marginal_std(self, t): """ Compute sigma_t of a given continuous-time label t in [0, T]. """ return torch.sqrt(1. - torch.exp(2. * self.marginal_log_mean_coeff(t))) def marginal_lambda(self, t): """ Compute lambda_t = log(alpha_t) - log(sigma_t) of a given continuous-time label t in [0, T]. """ log_mean_coeff = self.marginal_log_mean_coeff(t) log_std = 0.5 * torch.log(1. - torch.exp(2. * log_mean_coeff)) return log_mean_coeff - log_std def inverse_lambda(self, lamb): """ Compute the continuous-time label t in [0, T] of a given half-logSNR lambda_t. """ if self.schedule == 'linear': tmp = 2. * (self.beta_1 - self.beta_0) * torch.logaddexp(-2. * lamb, torch.zeros((1,)).to(lamb)) Delta = self.beta_0 ** 2 + tmp return tmp / (torch.sqrt(Delta) + self.beta_0) / (self.beta_1 - self.beta_0) elif self.schedule == 'discrete': log_alpha = -0.5 * torch.logaddexp(torch.zeros((1,)).to(lamb.device), -2. * lamb) t = interpolate_fn(log_alpha.reshape((-1, 1)), torch.flip(self.log_alpha_array.to(lamb.device), [1]), torch.flip(self.t_array.to(lamb.device), [1])) return t.reshape((-1,)) else: log_alpha = -0.5 * torch.logaddexp(-2. * lamb, torch.zeros((1,)).to(lamb)) t_fn = lambda log_alpha_t: torch.arccos(torch.exp(log_alpha_t + self.cosine_log_alpha_0)) * 2. * ( 1. + self.cosine_s) / math.pi - self.cosine_s t = t_fn(log_alpha) return t def model_wrapper( model, noise_schedule, model_type="noise", model_kwargs={}, guidance_type="uncond", condition=None, unconditional_condition=None, guidance_scale=1., classifier_fn=None, classifier_kwargs={}, ): """Create a wrapper function for the noise prediction model. DPM-Solver needs to solve the continuous-time diffusion ODEs. For DPMs trained on discrete-time labels, we need to firstly wrap the model function to a noise prediction model that accepts the continuous time as the input. We support four types of the diffusion model by setting `model_type`: 1. "noise": noise prediction model. (Trained by predicting noise). 2. "x_start": data prediction model. (Trained by predicting the data x_0 at time 0). 3. "v": velocity prediction model. (Trained by predicting the velocity). The "v" prediction is derivation detailed in Appendix D of [1], and is used in Imagen-Video [2]. [1] Salimans, Tim, and Jonathan Ho. "Progressive distillation for fast sampling of diffusion models." arXiv preprint arXiv:2202.00512 (2022). [2] Ho, Jonathan, et al. "Imagen Video: High Definition Video Generation with Diffusion Models." arXiv preprint arXiv:2210.02303 (2022). 4. "score": marginal score function. (Trained by denoising score matching). Note that the score function and the noise prediction model follows a simple relationship: ``` noise(x_t, t) = -sigma_t * score(x_t, t) ``` We support three types of guided sampling by DPMs by setting `guidance_type`: 1. "uncond": unconditional sampling by DPMs. The input `model` has the following format: `` model(x, t_input, **model_kwargs) -> noise | x_start | v | score `` 2. "classifier": classifier guidance sampling [3] by DPMs and another classifier. The input `model` has the following format: `` model(x, t_input, **model_kwargs) -> noise | x_start | v | score `` The input `classifier_fn` has the following format: `` classifier_fn(x, t_input, cond, **classifier_kwargs) -> logits(x, t_input, cond) `` [3] P. Dhariwal and A. Q. Nichol, "Diffusion models beat GANs on image synthesis," in Advances in Neural Information Processing Systems, vol. 34, 2021, pp. 8780-8794. 3. "classifier-free": classifier-free guidance sampling by conditional DPMs. The input `model` has the following format: `` model(x, t_input, cond, **model_kwargs) -> noise | x_start | v | score `` And if cond == `unconditional_condition`, the model output is the unconditional DPM output. [4] Ho, Jonathan, and Tim Salimans. "Classifier-free diffusion guidance." arXiv preprint arXiv:2207.12598 (2022). The `t_input` is the time label of the model, which may be discrete-time labels (i.e. 0 to 999) or continuous-time labels (i.e. epsilon to T). We wrap the model function to accept only `x` and `t_continuous` as inputs, and outputs the predicted noise: `` def model_fn(x, t_continuous) -> noise: t_input = get_model_input_time(t_continuous) return noise_pred(model, x, t_input, **model_kwargs) `` where `t_continuous` is the continuous time labels (i.e. epsilon to T). And we use `model_fn` for DPM-Solver. =============================================================== Args: model: A diffusion model with the corresponding format described above. noise_schedule: A noise schedule object, such as NoiseScheduleVP. model_type: A `str`. The parameterization type of the diffusion model. "noise" or "x_start" or "v" or "score". model_kwargs: A `dict`. A dict for the other inputs of the model function. guidance_type: A `str`. The type of the guidance for sampling. "uncond" or "classifier" or "classifier-free". condition: A pytorch tensor. The condition for the guided sampling. Only used for "classifier" or "classifier-free" guidance type. unconditional_condition: A pytorch tensor. The condition for the unconditional sampling. Only used for "classifier-free" guidance type. guidance_scale: A `float`. The scale for the guided sampling. classifier_fn: A classifier function. Only used for the classifier guidance. classifier_kwargs: A `dict`. A dict for the other inputs of the classifier function. Returns: A noise prediction model that accepts the noised data and the continuous time as the inputs. """ def get_model_input_time(t_continuous): """ Convert the continuous-time `t_continuous` (in [epsilon, T]) to the model input time. For discrete-time DPMs, we convert `t_continuous` in [1 / N, 1] to `t_input` in [0, 1000 * (N - 1) / N]. For continuous-time DPMs, we just use `t_continuous`. """ if noise_schedule.schedule == 'discrete': return (t_continuous - 1. / noise_schedule.total_N) * 1000. else: return t_continuous def noise_pred_fn(x, t_continuous, cond=None): if t_continuous.reshape((-1,)).shape[0] == 1: t_continuous = t_continuous.expand((x.shape[0])) t_input = get_model_input_time(t_continuous) if cond is None: output = model(x, t_input, **model_kwargs) else: output = model(x, t_input, cond, **model_kwargs) if model_type == "noise": return output elif model_type == "x_start": alpha_t, sigma_t = noise_schedule.marginal_alpha(t_continuous), noise_schedule.marginal_std(t_continuous) dims = x.dim() return (x - expand_dims(alpha_t, dims) * output) / expand_dims(sigma_t, dims) elif model_type == "v": alpha_t, sigma_t = noise_schedule.marginal_alpha(t_continuous), noise_schedule.marginal_std(t_continuous) dims = x.dim() return expand_dims(alpha_t, dims) * output + expand_dims(sigma_t, dims) * x elif model_type == "score": sigma_t = noise_schedule.marginal_std(t_continuous) dims = x.dim() return -expand_dims(sigma_t, dims) * output def cond_grad_fn(x, t_input): """ Compute the gradient of the classifier, i.e. nabla_{x} log p_t(cond | x_t). """ with torch.enable_grad(): x_in = x.detach().requires_grad_(True) log_prob = classifier_fn(x_in, t_input, condition, **classifier_kwargs) return torch.autograd.grad(log_prob.sum(), x_in)[0] def model_fn(x, t_continuous): """ The noise predicition model function that is used for DPM-Solver. """ if t_continuous.reshape((-1,)).shape[0] == 1: t_continuous = t_continuous.expand((x.shape[0])) if guidance_type == "uncond": return noise_pred_fn(x, t_continuous) elif guidance_type == "classifier": assert classifier_fn is not None t_input = get_model_input_time(t_continuous) cond_grad = cond_grad_fn(x, t_input) sigma_t = noise_schedule.marginal_std(t_continuous) noise = noise_pred_fn(x, t_continuous) return noise - guidance_scale * expand_dims(sigma_t, dims=cond_grad.dim()) * cond_grad elif guidance_type == "classifier-free": if guidance_scale == 1. or unconditional_condition is None: return noise_pred_fn(x, t_continuous, cond=condition) else: x_in = torch.cat([x] * 2) t_in = torch.cat([t_continuous] * 2) c_in = torch.cat([unconditional_condition, condition]) noise_uncond, noise = noise_pred_fn(x_in, t_in, cond=c_in).chunk(2) return noise_uncond + guidance_scale * (noise - noise_uncond) assert model_type in ["noise", "x_start", "v"] assert guidance_type in ["uncond", "classifier", "classifier-free"] return model_fn class DPM_Solver: def __init__(self, model_fn, noise_schedule, predict_x0=False, thresholding=False, max_val=1.): """Construct a DPM-Solver. We support both the noise prediction model ("predicting epsilon") and the data prediction model ("predicting x0"). If `predict_x0` is False, we use the solver for the noise prediction model (DPM-Solver). If `predict_x0` is True, we use the solver for the data prediction model (DPM-Solver++). In such case, we further support the "dynamic thresholding" in [1] when `thresholding` is True. The "dynamic thresholding" can greatly improve the sample quality for pixel-space DPMs with large guidance scales. Args: model_fn: A noise prediction model function which accepts the continuous-time input (t in [epsilon, T]): `` def model_fn(x, t_continuous): return noise `` noise_schedule: A noise schedule object, such as NoiseScheduleVP. predict_x0: A `bool`. If true, use the data prediction model; else, use the noise prediction model. thresholding: A `bool`. Valid when `predict_x0` is True. Whether to use the "dynamic thresholding" in [1]. max_val: A `float`. Valid when both `predict_x0` and `thresholding` are True. The max value for thresholding. [1] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022b. """ self.model = model_fn self.noise_schedule = noise_schedule self.predict_x0 = predict_x0 self.thresholding = thresholding self.max_val = max_val def noise_prediction_fn(self, x, t): """ Return the noise prediction model. """ return self.model(x, t) def data_prediction_fn(self, x, t): """ Return the data prediction model (with thresholding). """ noise = self.noise_prediction_fn(x, t) dims = x.dim() alpha_t, sigma_t = self.noise_schedule.marginal_alpha(t), self.noise_schedule.marginal_std(t) x0 = (x - expand_dims(sigma_t, dims) * noise) / expand_dims(alpha_t, dims) if self.thresholding: p = 0.995 # A hyperparameter in the paper of "Imagen" [1]. s = torch.quantile(torch.abs(x0).reshape((x0.shape[0], -1)), p, dim=1) s = expand_dims(torch.maximum(s, self.max_val * torch.ones_like(s).to(s.device)), dims) x0 = torch.clamp(x0, -s, s) / s return x0 def model_fn(self, x, t): """ Convert the model to the noise prediction model or the data prediction model. """ if self.predict_x0: return self.data_prediction_fn(x, t) else: return self.noise_prediction_fn(x, t) def get_time_steps(self, skip_type, t_T, t_0, N, device): """Compute the intermediate time steps for sampling. Args: skip_type: A `str`. The type for the spacing of the time steps. We support three types: - 'logSNR': uniform logSNR for the time steps. - 'time_uniform': uniform time for the time steps. (**Recommended for high-resolutional data**.) - 'time_quadratic': quadratic time for the time steps. (Used in DDIM for low-resolutional data.) t_T: A `float`. The starting time of the sampling (default is T). t_0: A `float`. The ending time of the sampling (default is epsilon). N: A `int`. The total number of the spacing of the time steps. device: A torch device. Returns: A pytorch tensor of the time steps, with the shape (N + 1,). """ if skip_type == 'logSNR': lambda_T = self.noise_schedule.marginal_lambda(torch.tensor(t_T).to(device)) lambda_0 = self.noise_schedule.marginal_lambda(torch.tensor(t_0).to(device)) logSNR_steps = torch.linspace(lambda_T.cpu().item(), lambda_0.cpu().item(), N + 1).to(device) return self.noise_schedule.inverse_lambda(logSNR_steps) elif skip_type == 'time_uniform': return torch.linspace(t_T, t_0, N + 1).to(device) elif skip_type == 'time_quadratic': t_order = 2 t = torch.linspace(t_T ** (1. / t_order), t_0 ** (1. / t_order), N + 1).pow(t_order).to(device) return t else: raise ValueError( "Unsupported skip_type {}, need to be 'logSNR' or 'time_uniform' or 'time_quadratic'".format(skip_type)) def get_orders_and_timesteps_for_singlestep_solver(self, steps, order, skip_type, t_T, t_0, device): """ Get the order of each step for sampling by the singlestep DPM-Solver. We combine both DPM-Solver-1,2,3 to use all the function evaluations, which is named as "DPM-Solver-fast". Given a fixed number of function evaluations by `steps`, the sampling procedure by DPM-Solver-fast is: - If order == 1: We take `steps` of DPM-Solver-1 (i.e. DDIM). - If order == 2: - Denote K = (steps // 2). We take K or (K + 1) intermediate time steps for sampling. - If steps % 2 == 0, we use K steps of DPM-Solver-2. - If steps % 2 == 1, we use K steps of DPM-Solver-2 and 1 step of DPM-Solver-1. - If order == 3: - Denote K = (steps // 3 + 1). We take K intermediate time steps for sampling. - If steps % 3 == 0, we use (K - 2) steps of DPM-Solver-3, and 1 step of DPM-Solver-2 and 1 step of DPM-Solver-1. - If steps % 3 == 1, we use (K - 1) steps of DPM-Solver-3 and 1 step of DPM-Solver-1. - If steps % 3 == 2, we use (K - 1) steps of DPM-Solver-3 and 1 step of DPM-Solver-2. ============================================ Args: order: A `int`. The max order for the solver (2 or 3). steps: A `int`. The total number of function evaluations (NFE). skip_type: A `str`. The type for the spacing of the time steps. We support three types: - 'logSNR': uniform logSNR for the time steps. - 'time_uniform': uniform time for the time steps. (**Recommended for high-resolutional data**.) - 'time_quadratic': quadratic time for the time steps. (Used in DDIM for low-resolutional data.) t_T: A `float`. The starting time of the sampling (default is T). t_0: A `float`. The ending time of the sampling (default is epsilon). device: A torch device. Returns: orders: A list of the solver order of each step. """ if order == 3: K = steps // 3 + 1 if steps % 3 == 0: orders = [3, ] * (K - 2) + [2, 1] elif steps % 3 == 1: orders = [3, ] * (K - 1) + [1] else: orders = [3, ] * (K - 1) + [2] elif order == 2: if steps % 2 == 0: K = steps // 2 orders = [2, ] * K else: K = steps // 2 + 1 orders = [2, ] * (K - 1) + [1] elif order == 1: K = 1 orders = [1, ] * steps else: raise ValueError("'order' must be '1' or '2' or '3'.") if skip_type == 'logSNR': # To reproduce the results in DPM-Solver paper timesteps_outer = self.get_time_steps(skip_type, t_T, t_0, K, device) else: timesteps_outer = self.get_time_steps(skip_type, t_T, t_0, steps, device)[ torch.cumsum(torch.tensor([0, ] + orders)).to(device)] return timesteps_outer, orders def denoise_to_zero_fn(self, x, s): """ Denoise at the final step, which is equivalent to solve the ODE from lambda_s to infty by first-order discretization. """ return self.data_prediction_fn(x, s) def dpm_solver_first_update(self, x, s, t, model_s=None, return_intermediate=False): """ DPM-Solver-1 (equivalent to DDIM) from time `s` to time `t`. Args: x: A pytorch tensor. The initial value at time `s`. s: A pytorch tensor. The starting time, with the shape (x.shape[0],). t: A pytorch tensor. The ending time, with the shape (x.shape[0],). model_s: A pytorch tensor. The model function evaluated at time `s`. If `model_s` is None, we evaluate the model by `x` and `s`; otherwise we directly use it. return_intermediate: A `bool`. If true, also return the model value at time `s`. Returns: x_t: A pytorch tensor. The approximated solution at time `t`. """ ns = self.noise_schedule dims = x.dim() lambda_s, lambda_t = ns.marginal_lambda(s), ns.marginal_lambda(t) h = lambda_t - lambda_s log_alpha_s, log_alpha_t = ns.marginal_log_mean_coeff(s), ns.marginal_log_mean_coeff(t) sigma_s, sigma_t = ns.marginal_std(s), ns.marginal_std(t) alpha_t = torch.exp(log_alpha_t) if self.predict_x0: phi_1 = torch.expm1(-h) if model_s is None: model_s = self.model_fn(x, s) x_t = ( expand_dims(sigma_t / sigma_s, dims) * x - expand_dims(alpha_t * phi_1, dims) * model_s ) if return_intermediate: return x_t, {'model_s': model_s} else: return x_t else: phi_1 = torch.expm1(h) if model_s is None: model_s = self.model_fn(x, s) x_t = ( expand_dims(torch.exp(log_alpha_t - log_alpha_s), dims) * x - expand_dims(sigma_t * phi_1, dims) * model_s ) if return_intermediate: return x_t, {'model_s': model_s} else: return x_t def singlestep_dpm_solver_second_update(self, x, s, t, r1=0.5, model_s=None, return_intermediate=False, solver_type='dpm_solver'): """ Singlestep solver DPM-Solver-2 from time `s` to time `t`. Args: x: A pytorch tensor. The initial value at time `s`. s: A pytorch tensor. The starting time, with the shape (x.shape[0],). t: A pytorch tensor. The ending time, with the shape (x.shape[0],). r1: A `float`. The hyperparameter of the second-order solver. model_s: A pytorch tensor. The model function evaluated at time `s`. If `model_s` is None, we evaluate the model by `x` and `s`; otherwise we directly use it. return_intermediate: A `bool`. If true, also return the model value at time `s` and `s1` (the intermediate time). solver_type: either 'dpm_solver' or 'taylor'. The type for the high-order solvers. The type slightly impacts the performance. We recommend to use 'dpm_solver' type. Returns: x_t: A pytorch tensor. The approximated solution at time `t`. """ if solver_type not in ['dpm_solver', 'taylor']: raise ValueError("'solver_type' must be either 'dpm_solver' or 'taylor', got {}".format(solver_type)) if r1 is None: r1 = 0.5 ns = self.noise_schedule dims = x.dim() lambda_s, lambda_t = ns.marginal_lambda(s), ns.marginal_lambda(t) h = lambda_t - lambda_s lambda_s1 = lambda_s + r1 * h s1 = ns.inverse_lambda(lambda_s1) log_alpha_s, log_alpha_s1, log_alpha_t = ns.marginal_log_mean_coeff(s), ns.marginal_log_mean_coeff( s1), ns.marginal_log_mean_coeff(t) sigma_s, sigma_s1, sigma_t = ns.marginal_std(s), ns.marginal_std(s1), ns.marginal_std(t) alpha_s1, alpha_t = torch.exp(log_alpha_s1), torch.exp(log_alpha_t) if self.predict_x0: phi_11 = torch.expm1(-r1 * h) phi_1 = torch.expm1(-h) if model_s is None: model_s = self.model_fn(x, s) x_s1 = ( expand_dims(sigma_s1 / sigma_s, dims) * x - expand_dims(alpha_s1 * phi_11, dims) * model_s ) model_s1 = self.model_fn(x_s1, s1) if solver_type == 'dpm_solver': x_t = ( expand_dims(sigma_t / sigma_s, dims) * x - expand_dims(alpha_t * phi_1, dims) * model_s - (0.5 / r1) * expand_dims(alpha_t * phi_1, dims) * (model_s1 - model_s) ) elif solver_type == 'taylor': x_t = ( expand_dims(sigma_t / sigma_s, dims) * x - expand_dims(alpha_t * phi_1, dims) * model_s + (1. / r1) * expand_dims(alpha_t * ((torch.exp(-h) - 1.) / h + 1.), dims) * ( model_s1 - model_s) ) else: phi_11 = torch.expm1(r1 * h) phi_1 = torch.expm1(h) if model_s is None: model_s = self.model_fn(x, s) x_s1 = ( expand_dims(torch.exp(log_alpha_s1 - log_alpha_s), dims) * x - expand_dims(sigma_s1 * phi_11, dims) * model_s ) model_s1 = self.model_fn(x_s1, s1) if solver_type == 'dpm_solver': x_t = ( expand_dims(torch.exp(log_alpha_t - log_alpha_s), dims) * x - expand_dims(sigma_t * phi_1, dims) * model_s - (0.5 / r1) * expand_dims(sigma_t * phi_1, dims) * (model_s1 - model_s) ) elif solver_type == 'taylor': x_t = ( expand_dims(torch.exp(log_alpha_t - log_alpha_s), dims) * x - expand_dims(sigma_t * phi_1, dims) * model_s - (1. / r1) * expand_dims(sigma_t * ((torch.exp(h) - 1.) / h - 1.), dims) * (model_s1 - model_s) ) if return_intermediate: return x_t, {'model_s': model_s, 'model_s1': model_s1} else: return x_t def singlestep_dpm_solver_third_update(self, x, s, t, r1=1. / 3., r2=2. / 3., model_s=None, model_s1=None, return_intermediate=False, solver_type='dpm_solver'): """ Singlestep solver DPM-Solver-3 from time `s` to time `t`. Args: x: A pytorch tensor. The initial value at time `s`. s: A pytorch tensor. The starting time, with the shape (x.shape[0],). t: A pytorch tensor. The ending time, with the shape (x.shape[0],). r1: A `float`. The hyperparameter of the third-order solver. r2: A `float`. The hyperparameter of the third-order solver. model_s: A pytorch tensor. The model function evaluated at time `s`. If `model_s` is None, we evaluate the model by `x` and `s`; otherwise we directly use it. model_s1: A pytorch tensor. The model function evaluated at time `s1` (the intermediate time given by `r1`). If `model_s1` is None, we evaluate the model at `s1`; otherwise we directly use it. return_intermediate: A `bool`. If true, also return the model value at time `s`, `s1` and `s2` (the intermediate times). solver_type: either 'dpm_solver' or 'taylor'. The type for the high-order solvers. The type slightly impacts the performance. We recommend to use 'dpm_solver' type. Returns: x_t: A pytorch tensor. The approximated solution at time `t`. """ if solver_type not in ['dpm_solver', 'taylor']: raise ValueError("'solver_type' must be either 'dpm_solver' or 'taylor', got {}".format(solver_type)) if r1 is None: r1 = 1. / 3. if r2 is None: r2 = 2. / 3. ns = self.noise_schedule dims = x.dim() lambda_s, lambda_t = ns.marginal_lambda(s), ns.marginal_lambda(t) h = lambda_t - lambda_s lambda_s1 = lambda_s + r1 * h lambda_s2 = lambda_s + r2 * h s1 = ns.inverse_lambda(lambda_s1) s2 = ns.inverse_lambda(lambda_s2) log_alpha_s, log_alpha_s1, log_alpha_s2, log_alpha_t = ns.marginal_log_mean_coeff( s), ns.marginal_log_mean_coeff(s1), ns.marginal_log_mean_coeff(s2), ns.marginal_log_mean_coeff(t) sigma_s, sigma_s1, sigma_s2, sigma_t = ns.marginal_std(s), ns.marginal_std(s1), ns.marginal_std( s2), ns.marginal_std(t) alpha_s1, alpha_s2, alpha_t = torch.exp(log_alpha_s1), torch.exp(log_alpha_s2), torch.exp(log_alpha_t) if self.predict_x0: phi_11 = torch.expm1(-r1 * h) phi_12 = torch.expm1(-r2 * h) phi_1 = torch.expm1(-h) phi_22 = torch.expm1(-r2 * h) / (r2 * h) + 1. phi_2 = phi_1 / h + 1. phi_3 = phi_2 / h - 0.5 if model_s is None: model_s = self.model_fn(x, s) if model_s1 is None: x_s1 = ( expand_dims(sigma_s1 / sigma_s, dims) * x - expand_dims(alpha_s1 * phi_11, dims) * model_s ) model_s1 = self.model_fn(x_s1, s1) x_s2 = ( expand_dims(sigma_s2 / sigma_s, dims) * x - expand_dims(alpha_s2 * phi_12, dims) * model_s + r2 / r1 * expand_dims(alpha_s2 * phi_22, dims) * (model_s1 - model_s) ) model_s2 = self.model_fn(x_s2, s2) if solver_type == 'dpm_solver': x_t = ( expand_dims(sigma_t / sigma_s, dims) * x - expand_dims(alpha_t * phi_1, dims) * model_s + (1. / r2) * expand_dims(alpha_t * phi_2, dims) * (model_s2 - model_s) ) elif solver_type == 'taylor': D1_0 = (1. / r1) * (model_s1 - model_s) D1_1 = (1. / r2) * (model_s2 - model_s) D1 = (r2 * D1_0 - r1 * D1_1) / (r2 - r1) D2 = 2. * (D1_1 - D1_0) / (r2 - r1) x_t = ( expand_dims(sigma_t / sigma_s, dims) * x - expand_dims(alpha_t * phi_1, dims) * model_s + expand_dims(alpha_t * phi_2, dims) * D1 - expand_dims(alpha_t * phi_3, dims) * D2 ) else: phi_11 = torch.expm1(r1 * h) phi_12 = torch.expm1(r2 * h) phi_1 = torch.expm1(h) phi_22 = torch.expm1(r2 * h) / (r2 * h) - 1. phi_2 = phi_1 / h - 1. phi_3 = phi_2 / h - 0.5 if model_s is None: model_s = self.model_fn(x, s) if model_s1 is None: x_s1 = ( expand_dims(torch.exp(log_alpha_s1 - log_alpha_s), dims) * x - expand_dims(sigma_s1 * phi_11, dims) * model_s ) model_s1 = self.model_fn(x_s1, s1) x_s2 = ( expand_dims(torch.exp(log_alpha_s2 - log_alpha_s), dims) * x - expand_dims(sigma_s2 * phi_12, dims) * model_s - r2 / r1 * expand_dims(sigma_s2 * phi_22, dims) * (model_s1 - model_s) ) model_s2 = self.model_fn(x_s2, s2) if solver_type == 'dpm_solver': x_t = ( expand_dims(torch.exp(log_alpha_t - log_alpha_s), dims) * x - expand_dims(sigma_t * phi_1, dims) * model_s - (1. / r2) * expand_dims(sigma_t * phi_2, dims) * (model_s2 - model_s) ) elif solver_type == 'taylor': D1_0 = (1. / r1) * (model_s1 - model_s) D1_1 = (1. / r2) * (model_s2 - model_s) D1 = (r2 * D1_0 - r1 * D1_1) / (r2 - r1) D2 = 2. * (D1_1 - D1_0) / (r2 - r1) x_t = ( expand_dims(torch.exp(log_alpha_t - log_alpha_s), dims) * x - expand_dims(sigma_t * phi_1, dims) * model_s - expand_dims(sigma_t * phi_2, dims) * D1 - expand_dims(sigma_t * phi_3, dims) * D2 ) if return_intermediate: return x_t, {'model_s': model_s, 'model_s1': model_s1, 'model_s2': model_s2} else: return x_t def multistep_dpm_solver_second_update(self, x, model_prev_list, t_prev_list, t, solver_type="dpm_solver"): """ Multistep solver DPM-Solver-2 from time `t_prev_list[-1]` to time `t`. Args: x: A pytorch tensor. The initial value at time `s`. model_prev_list: A list of pytorch tensor. The previous computed model values. t_prev_list: A list of pytorch tensor. The previous times, each time has the shape (x.shape[0],) t: A pytorch tensor. The ending time, with the shape (x.shape[0],). solver_type: either 'dpm_solver' or 'taylor'. The type for the high-order solvers. The type slightly impacts the performance. We recommend to use 'dpm_solver' type. Returns: x_t: A pytorch tensor. The approximated solution at time `t`. """ if solver_type not in ['dpm_solver', 'taylor']: raise ValueError("'solver_type' must be either 'dpm_solver' or 'taylor', got {}".format(solver_type)) ns = self.noise_schedule dims = x.dim() model_prev_1, model_prev_0 = model_prev_list t_prev_1, t_prev_0 = t_prev_list lambda_prev_1, lambda_prev_0, lambda_t = ns.marginal_lambda(t_prev_1), ns.marginal_lambda( t_prev_0), ns.marginal_lambda(t) log_alpha_prev_0, log_alpha_t = ns.marginal_log_mean_coeff(t_prev_0), ns.marginal_log_mean_coeff(t) sigma_prev_0, sigma_t = ns.marginal_std(t_prev_0), ns.marginal_std(t) alpha_t = torch.exp(log_alpha_t) h_0 = lambda_prev_0 - lambda_prev_1 h = lambda_t - lambda_prev_0 r0 = h_0 / h D1_0 = expand_dims(1. / r0, dims) * (model_prev_0 - model_prev_1) if self.predict_x0: if solver_type == 'dpm_solver': x_t = ( expand_dims(sigma_t / sigma_prev_0, dims) * x - expand_dims(alpha_t * (torch.exp(-h) - 1.), dims) * model_prev_0 - 0.5 * expand_dims(alpha_t * (torch.exp(-h) - 1.), dims) * D1_0 ) elif solver_type == 'taylor': x_t = ( expand_dims(sigma_t / sigma_prev_0, dims) * x - expand_dims(alpha_t * (torch.exp(-h) - 1.), dims) * model_prev_0 + expand_dims(alpha_t * ((torch.exp(-h) - 1.) / h + 1.), dims) * D1_0 ) else: if solver_type == 'dpm_solver': x_t = ( expand_dims(torch.exp(log_alpha_t - log_alpha_prev_0), dims) * x - expand_dims(sigma_t * (torch.exp(h) - 1.), dims) * model_prev_0 - 0.5 * expand_dims(sigma_t * (torch.exp(h) - 1.), dims) * D1_0 ) elif solver_type == 'taylor': x_t = ( expand_dims(torch.exp(log_alpha_t - log_alpha_prev_0), dims) * x - expand_dims(sigma_t * (torch.exp(h) - 1.), dims) * model_prev_0 - expand_dims(sigma_t * ((torch.exp(h) - 1.) / h - 1.), dims) * D1_0 ) return x_t def multistep_dpm_solver_third_update(self, x, model_prev_list, t_prev_list, t, solver_type='dpm_solver'): """ Multistep solver DPM-Solver-3 from time `t_prev_list[-1]` to time `t`. Args: x: A pytorch tensor. The initial value at time `s`. model_prev_list: A list of pytorch tensor. The previous computed model values. t_prev_list: A list of pytorch tensor. The previous times, each time has the shape (x.shape[0],) t: A pytorch tensor. The ending time, with the shape (x.shape[0],). solver_type: either 'dpm_solver' or 'taylor'. The type for the high-order solvers. The type slightly impacts the performance. We recommend to use 'dpm_solver' type. Returns: x_t: A pytorch tensor. The approximated solution at time `t`. """ ns = self.noise_schedule dims = x.dim() model_prev_2, model_prev_1, model_prev_0 = model_prev_list t_prev_2, t_prev_1, t_prev_0 = t_prev_list lambda_prev_2, lambda_prev_1, lambda_prev_0, lambda_t = ns.marginal_lambda(t_prev_2), ns.marginal_lambda( t_prev_1), ns.marginal_lambda(t_prev_0), ns.marginal_lambda(t) log_alpha_prev_0, log_alpha_t = ns.marginal_log_mean_coeff(t_prev_0), ns.marginal_log_mean_coeff(t) sigma_prev_0, sigma_t = ns.marginal_std(t_prev_0), ns.marginal_std(t) alpha_t = torch.exp(log_alpha_t) h_1 = lambda_prev_1 - lambda_prev_2 h_0 = lambda_prev_0 - lambda_prev_1 h = lambda_t - lambda_prev_0 r0, r1 = h_0 / h, h_1 / h D1_0 = expand_dims(1. / r0, dims) * (model_prev_0 - model_prev_1) D1_1 = expand_dims(1. / r1, dims) * (model_prev_1 - model_prev_2) D1 = D1_0 + expand_dims(r0 / (r0 + r1), dims) * (D1_0 - D1_1) D2 = expand_dims(1. / (r0 + r1), dims) * (D1_0 - D1_1) if self.predict_x0: x_t = ( expand_dims(sigma_t / sigma_prev_0, dims) * x - expand_dims(alpha_t * (torch.exp(-h) - 1.), dims) * model_prev_0 + expand_dims(alpha_t * ((torch.exp(-h) - 1.) / h + 1.), dims) * D1 - expand_dims(alpha_t * ((torch.exp(-h) - 1. + h) / h ** 2 - 0.5), dims) * D2 ) else: x_t = ( expand_dims(torch.exp(log_alpha_t - log_alpha_prev_0), dims) * x - expand_dims(sigma_t * (torch.exp(h) - 1.), dims) * model_prev_0 - expand_dims(sigma_t * ((torch.exp(h) - 1.) / h - 1.), dims) * D1 - expand_dims(sigma_t * ((torch.exp(h) - 1. - h) / h ** 2 - 0.5), dims) * D2 ) return x_t def singlestep_dpm_solver_update(self, x, s, t, order, return_intermediate=False, solver_type='dpm_solver', r1=None, r2=None): """ Singlestep DPM-Solver with the order `order` from time `s` to time `t`. Args: x: A pytorch tensor. The initial value at time `s`. s: A pytorch tensor. The starting time, with the shape (x.shape[0],). t: A pytorch tensor. The ending time, with the shape (x.shape[0],). order: A `int`. The order of DPM-Solver. We only support order == 1 or 2 or 3. return_intermediate: A `bool`. If true, also return the model value at time `s`, `s1` and `s2` (the intermediate times). solver_type: either 'dpm_solver' or 'taylor'. The type for the high-order solvers. The type slightly impacts the performance. We recommend to use 'dpm_solver' type. r1: A `float`. The hyperparameter of the second-order or third-order solver. r2: A `float`. The hyperparameter of the third-order solver. Returns: x_t: A pytorch tensor. The approximated solution at time `t`. """ if order == 1: return self.dpm_solver_first_update(x, s, t, return_intermediate=return_intermediate) elif order == 2: return self.singlestep_dpm_solver_second_update(x, s, t, return_intermediate=return_intermediate, solver_type=solver_type, r1=r1) elif order == 3: return self.singlestep_dpm_solver_third_update(x, s, t, return_intermediate=return_intermediate, solver_type=solver_type, r1=r1, r2=r2) else: raise ValueError("Solver order must be 1 or 2 or 3, got {}".format(order)) def multistep_dpm_solver_update(self, x, model_prev_list, t_prev_list, t, order, solver_type='dpm_solver'): """ Multistep DPM-Solver with the order `order` from time `t_prev_list[-1]` to time `t`. Args: x: A pytorch tensor. The initial value at time `s`. model_prev_list: A list of pytorch tensor. The previous computed model values. t_prev_list: A list of pytorch tensor. The previous times, each time has the shape (x.shape[0],) t: A pytorch tensor. The ending time, with the shape (x.shape[0],). order: A `int`. The order of DPM-Solver. We only support order == 1 or 2 or 3. solver_type: either 'dpm_solver' or 'taylor'. The type for the high-order solvers. The type slightly impacts the performance. We recommend to use 'dpm_solver' type. Returns: x_t: A pytorch tensor. The approximated solution at time `t`. """ if order == 1: return self.dpm_solver_first_update(x, t_prev_list[-1], t, model_s=model_prev_list[-1]) elif order == 2: return self.multistep_dpm_solver_second_update(x, model_prev_list, t_prev_list, t, solver_type=solver_type) elif order == 3: return self.multistep_dpm_solver_third_update(x, model_prev_list, t_prev_list, t, solver_type=solver_type) else: raise ValueError("Solver order must be 1 or 2 or 3, got {}".format(order)) def dpm_solver_adaptive(self, x, order, t_T, t_0, h_init=0.05, atol=0.0078, rtol=0.05, theta=0.9, t_err=1e-5, solver_type='dpm_solver'): """ The adaptive step size solver based on singlestep DPM-Solver. Args: x: A pytorch tensor. The initial value at time `t_T`. order: A `int`. The (higher) order of the solver. We only support order == 2 or 3. t_T: A `float`. The starting time of the sampling (default is T). t_0: A `float`. The ending time of the sampling (default is epsilon). h_init: A `float`. The initial step size (for logSNR). atol: A `float`. The absolute tolerance of the solver. For image data, the default setting is 0.0078, followed [1]. rtol: A `float`. The relative tolerance of the solver. The default setting is 0.05. theta: A `float`. The safety hyperparameter for adapting the step size. The default setting is 0.9, followed [1]. t_err: A `float`. The tolerance for the time. We solve the diffusion ODE until the absolute error between the current time and `t_0` is less than `t_err`. The default setting is 1e-5. solver_type: either 'dpm_solver' or 'taylor'. The type for the high-order solvers. The type slightly impacts the performance. We recommend to use 'dpm_solver' type. Returns: x_0: A pytorch tensor. The approximated solution at time `t_0`. [1] A. Jolicoeur-Martineau, K. Li, R. Piché-Taillefer, T. Kachman, and I. Mitliagkas, "Gotta go fast when generating data with score-based models," arXiv preprint arXiv:2105.14080, 2021. """ ns = self.noise_schedule s = t_T * torch.ones((x.shape[0],)).to(x) lambda_s = ns.marginal_lambda(s) lambda_0 = ns.marginal_lambda(t_0 * torch.ones_like(s).to(x)) h = h_init * torch.ones_like(s).to(x) x_prev = x nfe = 0 if order == 2: r1 = 0.5 lower_update = lambda x, s, t: self.dpm_solver_first_update(x, s, t, return_intermediate=True) higher_update = lambda x, s, t, **kwargs: self.singlestep_dpm_solver_second_update(x, s, t, r1=r1, solver_type=solver_type, **kwargs) elif order == 3: r1, r2 = 1. / 3., 2. / 3. lower_update = lambda x, s, t: self.singlestep_dpm_solver_second_update(x, s, t, r1=r1, return_intermediate=True, solver_type=solver_type) higher_update = lambda x, s, t, **kwargs: self.singlestep_dpm_solver_third_update(x, s, t, r1=r1, r2=r2, solver_type=solver_type, **kwargs) else: raise ValueError("For adaptive step size solver, order must be 2 or 3, got {}".format(order)) while torch.abs((s - t_0)).mean() > t_err: t = ns.inverse_lambda(lambda_s + h) x_lower, lower_noise_kwargs = lower_update(x, s, t) x_higher = higher_update(x, s, t, **lower_noise_kwargs) delta = torch.max(torch.ones_like(x).to(x) * atol, rtol * torch.max(torch.abs(x_lower), torch.abs(x_prev))) norm_fn = lambda v: torch.sqrt(torch.square(v.reshape((v.shape[0], -1))).mean(dim=-1, keepdim=True)) E = norm_fn((x_higher - x_lower) / delta).max() if torch.all(E <= 1.): x = x_higher s = t x_prev = x_lower lambda_s = ns.marginal_lambda(s) h = torch.min(theta * h * torch.float_power(E, -1. / order).float(), lambda_0 - lambda_s) nfe += order print('adaptive solver nfe', nfe) return x def sample(self, x, steps=20, t_start=None, t_end=None, order=3, skip_type='time_uniform', method='singlestep', lower_order_final=True, denoise_to_zero=False, solver_type='dpm_solver', atol=0.0078, rtol=0.05, ): """ Compute the sample at time `t_end` by DPM-Solver, given the initial `x` at time `t_start`. ===================================================== We support the following algorithms for both noise prediction model and data prediction model: - 'singlestep': Singlestep DPM-Solver (i.e. "DPM-Solver-fast" in the paper), which combines different orders of singlestep DPM-Solver. We combine all the singlestep solvers with order <= `order` to use up all the function evaluations (steps). The total number of function evaluations (NFE) == `steps`. Given a fixed NFE == `steps`, the sampling procedure is: - If `order` == 1: - Denote K = steps. We use K steps of DPM-Solver-1 (i.e. DDIM). - If `order` == 2: - Denote K = (steps // 2) + (steps % 2). We take K intermediate time steps for sampling. - If steps % 2 == 0, we use K steps of singlestep DPM-Solver-2. - If steps % 2 == 1, we use (K - 1) steps of singlestep DPM-Solver-2 and 1 step of DPM-Solver-1. - If `order` == 3: - Denote K = (steps // 3 + 1). We take K intermediate time steps for sampling. - If steps % 3 == 0, we use (K - 2) steps of singlestep DPM-Solver-3, and 1 step of singlestep DPM-Solver-2 and 1 step of DPM-Solver-1. - If steps % 3 == 1, we use (K - 1) steps of singlestep DPM-Solver-3 and 1 step of DPM-Solver-1. - If steps % 3 == 2, we use (K - 1) steps of singlestep DPM-Solver-3 and 1 step of singlestep DPM-Solver-2. - 'multistep': Multistep DPM-Solver with the order of `order`. The total number of function evaluations (NFE) == `steps`. We initialize the first `order` values by lower order multistep solvers. Given a fixed NFE == `steps`, the sampling procedure is: Denote K = steps. - If `order` == 1: - We use K steps of DPM-Solver-1 (i.e. DDIM). - If `order` == 2: - We firstly use 1 step of DPM-Solver-1, then use (K - 1) step of multistep DPM-Solver-2. - If `order` == 3: - We firstly use 1 step of DPM-Solver-1, then 1 step of multistep DPM-Solver-2, then (K - 2) step of multistep DPM-Solver-3. - 'singlestep_fixed': Fixed order singlestep DPM-Solver (i.e. DPM-Solver-1 or singlestep DPM-Solver-2 or singlestep DPM-Solver-3). We use singlestep DPM-Solver-`order` for `order`=1 or 2 or 3, with total [`steps` // `order`] * `order` NFE. - 'adaptive': Adaptive step size DPM-Solver (i.e. "DPM-Solver-12" and "DPM-Solver-23" in the paper). We ignore `steps` and use adaptive step size DPM-Solver with a higher order of `order`. You can adjust the absolute tolerance `atol` and the relative tolerance `rtol` to balance the computatation costs (NFE) and the sample quality. - If `order` == 2, we use DPM-Solver-12 which combines DPM-Solver-1 and singlestep DPM-Solver-2. - If `order` == 3, we use DPM-Solver-23 which combines singlestep DPM-Solver-2 and singlestep DPM-Solver-3. ===================================================== Some advices for choosing the algorithm: - For **unconditional sampling** or **guided sampling with small guidance scale** by DPMs: Use singlestep DPM-Solver ("DPM-Solver-fast" in the paper) with `order = 3`. e.g. >>> dpm_solver = DPM_Solver(model_fn, noise_schedule, predict_x0=False) >>> x_sample = dpm_solver.sample(x, steps=steps, t_start=t_start, t_end=t_end, order=3, skip_type='time_uniform', method='singlestep') - For **guided sampling with large guidance scale** by DPMs: Use multistep DPM-Solver with `predict_x0 = True` and `order = 2`. e.g. >>> dpm_solver = DPM_Solver(model_fn, noise_schedule, predict_x0=True) >>> x_sample = dpm_solver.sample(x, steps=steps, t_start=t_start, t_end=t_end, order=2, skip_type='time_uniform', method='multistep') We support three types of `skip_type`: - 'logSNR': uniform logSNR for the time steps. **Recommended for low-resolutional images** - 'time_uniform': uniform time for the time steps. **Recommended for high-resolutional images**. - 'time_quadratic': quadratic time for the time steps. ===================================================== Args: x: A pytorch tensor. The initial value at time `t_start` e.g. if `t_start` == T, then `x` is a sample from the standard normal distribution. steps: A `int`. The total number of function evaluations (NFE). t_start: A `float`. The starting time of the sampling. If `T` is None, we use self.noise_schedule.T (default is 1.0). t_end: A `float`. The ending time of the sampling. If `t_end` is None, we use 1. / self.noise_schedule.total_N. e.g. if total_N == 1000, we have `t_end` == 1e-3. For discrete-time DPMs: - We recommend `t_end` == 1. / self.noise_schedule.total_N. For continuous-time DPMs: - We recommend `t_end` == 1e-3 when `steps` <= 15; and `t_end` == 1e-4 when `steps` > 15. order: A `int`. The order of DPM-Solver. skip_type: A `str`. The type for the spacing of the time steps. 'time_uniform' or 'logSNR' or 'time_quadratic'. method: A `str`. The method for sampling. 'singlestep' or 'multistep' or 'singlestep_fixed' or 'adaptive'. denoise_to_zero: A `bool`. Whether to denoise to time 0 at the final step. Default is `False`. If `denoise_to_zero` is `True`, the total NFE is (`steps` + 1). This trick is firstly proposed by DDPM (https://arxiv.org/abs/2006.11239) and score_sde (https://arxiv.org/abs/2011.13456). Such trick can improve the FID for diffusion models sampling by diffusion SDEs for low-resolutional images (such as CIFAR-10). However, we observed that such trick does not matter for high-resolutional images. As it needs an additional NFE, we do not recommend it for high-resolutional images. lower_order_final: A `bool`. Whether to use lower order solvers at the final steps. Only valid for `method=multistep` and `steps < 15`. We empirically find that this trick is a key to stabilizing the sampling by DPM-Solver with very few steps (especially for steps <= 10). So we recommend to set it to be `True`. solver_type: A `str`. The taylor expansion type for the solver. `dpm_solver` or `taylor`. We recommend `dpm_solver`. atol: A `float`. The absolute tolerance of the adaptive step size solver. Valid when `method` == 'adaptive'. rtol: A `float`. The relative tolerance of the adaptive step size solver. Valid when `method` == 'adaptive'. Returns: x_end: A pytorch tensor. The approximated solution at time `t_end`. """ t_0 = 1. / self.noise_schedule.total_N if t_end is None else t_end t_T = self.noise_schedule.T if t_start is None else t_start device = x.device if method == 'adaptive': with torch.no_grad(): x = self.dpm_solver_adaptive(x, order=order, t_T=t_T, t_0=t_0, atol=atol, rtol=rtol, solver_type=solver_type) elif method == 'multistep': assert steps >= order timesteps = self.get_time_steps(skip_type=skip_type, t_T=t_T, t_0=t_0, N=steps, device=device) assert timesteps.shape[0] - 1 == steps with torch.no_grad(): vec_t = timesteps[0].expand((x.shape[0])) model_prev_list = [self.model_fn(x, vec_t)] t_prev_list = [vec_t] # Init the first `order` values by lower order multistep DPM-Solver. for init_order in tqdm(range(1, order), desc="DPM init order"): vec_t = timesteps[init_order].expand(x.shape[0]) x = self.multistep_dpm_solver_update(x, model_prev_list, t_prev_list, vec_t, init_order, solver_type=solver_type) model_prev_list.append(self.model_fn(x, vec_t)) t_prev_list.append(vec_t) # Compute the remaining values by `order`-th order multistep DPM-Solver. for step in tqdm(range(order, steps + 1), desc="DPM multistep"): vec_t = timesteps[step].expand(x.shape[0]) if lower_order_final and steps < 15: step_order = min(order, steps + 1 - step) else: step_order = order x = self.multistep_dpm_solver_update(x, model_prev_list, t_prev_list, vec_t, step_order, solver_type=solver_type) for i in range(order - 1): t_prev_list[i] = t_prev_list[i + 1] model_prev_list[i] = model_prev_list[i + 1] t_prev_list[-1] = vec_t # We do not need to evaluate the final model value. if step < steps: model_prev_list[-1] = self.model_fn(x, vec_t) elif method in ['singlestep', 'singlestep_fixed']: if method == 'singlestep': timesteps_outer, orders = self.get_orders_and_timesteps_for_singlestep_solver(steps=steps, order=order, skip_type=skip_type, t_T=t_T, t_0=t_0, device=device) elif method == 'singlestep_fixed': K = steps // order orders = [order, ] * K timesteps_outer = self.get_time_steps(skip_type=skip_type, t_T=t_T, t_0=t_0, N=K, device=device) for i, order in enumerate(orders): t_T_inner, t_0_inner = timesteps_outer[i], timesteps_outer[i + 1] timesteps_inner = self.get_time_steps(skip_type=skip_type, t_T=t_T_inner.item(), t_0=t_0_inner.item(), N=order, device=device) lambda_inner = self.noise_schedule.marginal_lambda(timesteps_inner) vec_s, vec_t = t_T_inner.tile(x.shape[0]), t_0_inner.tile(x.shape[0]) h = lambda_inner[-1] - lambda_inner[0] r1 = None if order <= 1 else (lambda_inner[1] - lambda_inner[0]) / h r2 = None if order <= 2 else (lambda_inner[2] - lambda_inner[0]) / h x = self.singlestep_dpm_solver_update(x, vec_s, vec_t, order, solver_type=solver_type, r1=r1, r2=r2) if denoise_to_zero: x = self.denoise_to_zero_fn(x, torch.ones((x.shape[0],)).to(device) * t_0) return x ############################################################# # other utility functions ############################################################# def interpolate_fn(x, xp, yp): """ A piecewise linear function y = f(x), using xp and yp as keypoints. We implement f(x) in a differentiable way (i.e. applicable for autograd). The function f(x) is well-defined for all x-axis. (For x beyond the bounds of xp, we use the outmost points of xp to define the linear function.) Args: x: PyTorch tensor with shape [N, C], where N is the batch size, C is the number of channels (we use C = 1 for DPM-Solver). xp: PyTorch tensor with shape [C, K], where K is the number of keypoints. yp: PyTorch tensor with shape [C, K]. Returns: The function values f(x), with shape [N, C]. """ N, K = x.shape[0], xp.shape[1] all_x = torch.cat([x.unsqueeze(2), xp.unsqueeze(0).repeat((N, 1, 1))], dim=2) sorted_all_x, x_indices = torch.sort(all_x, dim=2) x_idx = torch.argmin(x_indices, dim=2) cand_start_idx = x_idx - 1 start_idx = torch.where( torch.eq(x_idx, 0), torch.tensor(1, device=x.device), torch.where( torch.eq(x_idx, K), torch.tensor(K - 2, device=x.device), cand_start_idx, ), ) end_idx = torch.where(torch.eq(start_idx, cand_start_idx), start_idx + 2, start_idx + 1) start_x = torch.gather(sorted_all_x, dim=2, index=start_idx.unsqueeze(2)).squeeze(2) end_x = torch.gather(sorted_all_x, dim=2, index=end_idx.unsqueeze(2)).squeeze(2) start_idx2 = torch.where( torch.eq(x_idx, 0), torch.tensor(0, device=x.device), torch.where( torch.eq(x_idx, K), torch.tensor(K - 2, device=x.device), cand_start_idx, ), ) y_positions_expanded = yp.unsqueeze(0).expand(N, -1, -1) start_y = torch.gather(y_positions_expanded, dim=2, index=start_idx2.unsqueeze(2)).squeeze(2) end_y = torch.gather(y_positions_expanded, dim=2, index=(start_idx2 + 1).unsqueeze(2)).squeeze(2) cand = start_y + (x - start_x) * (end_y - start_y) / (end_x - start_x) return cand def expand_dims(v, dims): """ Expand the tensor `v` to the dim `dims`. Args: `v`: a PyTorch tensor with shape [N]. `dim`: a `int`. Returns: a PyTorch tensor with shape [N, 1, 1, ..., 1] and the total dimension is `dims`. """ return v[(...,) + (None,) * (dims - 1)] ================================================ FILE: ToonCrafter/ldm/models/diffusion/dpm_solver/sampler.py ================================================ """SAMPLING ONLY.""" import torch from .dpm_solver import NoiseScheduleVP, model_wrapper, DPM_Solver MODEL_TYPES = { "eps": "noise", "v": "v" } class DPMSolverSampler(object): def __init__(self, model, **kwargs): super().__init__() self.model = model to_torch = lambda x: x.clone().detach().to(torch.float32).to(model.device) self.register_buffer('alphas_cumprod', to_torch(model.alphas_cumprod)) def register_buffer(self, name, attr): if type(attr) == torch.Tensor: if attr.device != torch.device("cuda"): attr = attr.to(torch.device("cuda")) setattr(self, name, attr) @torch.no_grad() def sample(self, S, batch_size, shape, conditioning=None, callback=None, normals_sequence=None, img_callback=None, quantize_x0=False, eta=0., mask=None, x0=None, temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None, verbose=True, x_T=None, log_every_t=100, unconditional_guidance_scale=1., unconditional_conditioning=None, # this has to come in the same format as the conditioning, # e.g. as encoded tokens, ... **kwargs ): if conditioning is not None: if isinstance(conditioning, dict): cbs = conditioning[list(conditioning.keys())[0]].shape[0] if cbs != batch_size: print(f"Warning: Got {cbs} conditionings but batch-size is {batch_size}") else: if conditioning.shape[0] != batch_size: print(f"Warning: Got {conditioning.shape[0]} conditionings but batch-size is {batch_size}") # sampling C, H, W = shape size = (batch_size, C, H, W) print(f'Data shape for DPM-Solver sampling is {size}, sampling steps {S}') device = self.model.betas.device if x_T is None: img = torch.randn(size, device=device) else: img = x_T ns = NoiseScheduleVP('discrete', alphas_cumprod=self.alphas_cumprod) model_fn = model_wrapper( lambda x, t, c: self.model.apply_model(x, t, c), ns, model_type=MODEL_TYPES[self.model.parameterization], guidance_type="classifier-free", condition=conditioning, unconditional_condition=unconditional_conditioning, guidance_scale=unconditional_guidance_scale, ) dpm_solver = DPM_Solver(model_fn, ns, predict_x0=True, thresholding=False) x = dpm_solver.sample(img, steps=S, skip_type="time_uniform", method="multistep", order=2, lower_order_final=True) return x.to(device), None ================================================ FILE: ToonCrafter/ldm/models/diffusion/plms.py ================================================ """SAMPLING ONLY.""" import torch import numpy as np from tqdm import tqdm from functools import partial from ToonCrafter.ldm.modules.diffusionmodules.util import make_ddim_sampling_parameters, make_ddim_timesteps, noise_like from ToonCrafter.ldm.models.diffusion.sampling_util import norm_thresholding class PLMSSampler(object): def __init__(self, model, schedule="linear", **kwargs): super().__init__() self.model = model self.ddpm_num_timesteps = model.num_timesteps self.schedule = schedule def register_buffer(self, name, attr): if type(attr) == torch.Tensor: if attr.device != torch.device("cuda"): attr = attr.to(torch.device("cuda")) setattr(self, name, attr) def make_schedule(self, ddim_num_steps, ddim_discretize="uniform", ddim_eta=0., verbose=True): if ddim_eta != 0: raise ValueError('ddim_eta must be 0 for PLMS') self.ddim_timesteps = make_ddim_timesteps(ddim_discr_method=ddim_discretize, num_ddim_timesteps=ddim_num_steps, num_ddpm_timesteps=self.ddpm_num_timesteps,verbose=verbose) alphas_cumprod = self.model.alphas_cumprod assert alphas_cumprod.shape[0] == self.ddpm_num_timesteps, 'alphas have to be defined for each timestep' to_torch = lambda x: x.clone().detach().to(torch.float32).to(self.model.device) self.register_buffer('betas', to_torch(self.model.betas)) self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod)) self.register_buffer('alphas_cumprod_prev', to_torch(self.model.alphas_cumprod_prev)) # calculations for diffusion q(x_t | x_{t-1}) and others self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod.cpu()))) self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod.cpu()))) self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod.cpu()))) self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu()))) self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu() - 1))) # ddim sampling parameters ddim_sigmas, ddim_alphas, ddim_alphas_prev = make_ddim_sampling_parameters(alphacums=alphas_cumprod.cpu(), ddim_timesteps=self.ddim_timesteps, eta=ddim_eta,verbose=verbose) self.register_buffer('ddim_sigmas', ddim_sigmas) self.register_buffer('ddim_alphas', ddim_alphas) self.register_buffer('ddim_alphas_prev', ddim_alphas_prev) self.register_buffer('ddim_sqrt_one_minus_alphas', np.sqrt(1. - ddim_alphas)) sigmas_for_original_sampling_steps = ddim_eta * torch.sqrt( (1 - self.alphas_cumprod_prev) / (1 - self.alphas_cumprod) * ( 1 - self.alphas_cumprod / self.alphas_cumprod_prev)) self.register_buffer('ddim_sigmas_for_original_num_steps', sigmas_for_original_sampling_steps) @torch.no_grad() def sample(self, S, batch_size, shape, conditioning=None, callback=None, normals_sequence=None, img_callback=None, quantize_x0=False, eta=0., mask=None, x0=None, temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None, verbose=True, x_T=None, log_every_t=100, unconditional_guidance_scale=1., unconditional_conditioning=None, # this has to come in the same format as the conditioning, # e.g. as encoded tokens, ... dynamic_threshold=None, **kwargs ): if conditioning is not None: if isinstance(conditioning, dict): cbs = conditioning[list(conditioning.keys())[0]].shape[0] if cbs != batch_size: print(f"Warning: Got {cbs} conditionings but batch-size is {batch_size}") else: if conditioning.shape[0] != batch_size: print(f"Warning: Got {conditioning.shape[0]} conditionings but batch-size is {batch_size}") self.make_schedule(ddim_num_steps=S, ddim_eta=eta, verbose=verbose) # sampling C, H, W = shape size = (batch_size, C, H, W) print(f'Data shape for PLMS sampling is {size}') samples, intermediates = self.plms_sampling(conditioning, size, callback=callback, img_callback=img_callback, quantize_denoised=quantize_x0, mask=mask, x0=x0, ddim_use_original_steps=False, noise_dropout=noise_dropout, temperature=temperature, score_corrector=score_corrector, corrector_kwargs=corrector_kwargs, x_T=x_T, log_every_t=log_every_t, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=unconditional_conditioning, dynamic_threshold=dynamic_threshold, ) return samples, intermediates @torch.no_grad() def plms_sampling(self, cond, shape, x_T=None, ddim_use_original_steps=False, callback=None, timesteps=None, quantize_denoised=False, mask=None, x0=None, img_callback=None, log_every_t=100, temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None, unconditional_guidance_scale=1., unconditional_conditioning=None, dynamic_threshold=None): device = self.model.betas.device b = shape[0] if x_T is None: img = torch.randn(shape, device=device) else: img = x_T if timesteps is None: timesteps = self.ddpm_num_timesteps if ddim_use_original_steps else self.ddim_timesteps elif timesteps is not None and not ddim_use_original_steps: subset_end = int(min(timesteps / self.ddim_timesteps.shape[0], 1) * self.ddim_timesteps.shape[0]) - 1 timesteps = self.ddim_timesteps[:subset_end] intermediates = {'x_inter': [img], 'pred_x0': [img]} time_range = list(reversed(range(0,timesteps))) if ddim_use_original_steps else np.flip(timesteps) total_steps = timesteps if ddim_use_original_steps else timesteps.shape[0] print(f"Running PLMS Sampling with {total_steps} timesteps") iterator = tqdm(time_range, desc='PLMS Sampler', total=total_steps) old_eps = [] for i, step in enumerate(iterator): index = total_steps - i - 1 ts = torch.full((b,), step, device=device, dtype=torch.long) ts_next = torch.full((b,), time_range[min(i + 1, len(time_range) - 1)], device=device, dtype=torch.long) if mask is not None: assert x0 is not None img_orig = self.model.q_sample(x0, ts) # TODO: deterministic forward pass? img = img_orig * mask + (1. - mask) * img outs = self.p_sample_plms(img, cond, ts, index=index, use_original_steps=ddim_use_original_steps, quantize_denoised=quantize_denoised, temperature=temperature, noise_dropout=noise_dropout, score_corrector=score_corrector, corrector_kwargs=corrector_kwargs, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=unconditional_conditioning, old_eps=old_eps, t_next=ts_next, dynamic_threshold=dynamic_threshold) img, pred_x0, e_t = outs old_eps.append(e_t) if len(old_eps) >= 4: old_eps.pop(0) if callback: callback(i) if img_callback: img_callback(pred_x0, i) if index % log_every_t == 0 or index == total_steps - 1: intermediates['x_inter'].append(img) intermediates['pred_x0'].append(pred_x0) return img, intermediates @torch.no_grad() def p_sample_plms(self, x, c, t, index, repeat_noise=False, use_original_steps=False, quantize_denoised=False, temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None, unconditional_guidance_scale=1., unconditional_conditioning=None, old_eps=None, t_next=None, dynamic_threshold=None): b, *_, device = *x.shape, x.device def get_model_output(x, t): if unconditional_conditioning is None or unconditional_guidance_scale == 1.: e_t = self.model.apply_model(x, t, c) else: x_in = torch.cat([x] * 2) t_in = torch.cat([t] * 2) c_in = torch.cat([unconditional_conditioning, c]) e_t_uncond, e_t = self.model.apply_model(x_in, t_in, c_in).chunk(2) e_t = e_t_uncond + unconditional_guidance_scale * (e_t - e_t_uncond) if score_corrector is not None: assert self.model.parameterization == "eps" e_t = score_corrector.modify_score(self.model, e_t, x, t, c, **corrector_kwargs) return e_t alphas = self.model.alphas_cumprod if use_original_steps else self.ddim_alphas alphas_prev = self.model.alphas_cumprod_prev if use_original_steps else self.ddim_alphas_prev sqrt_one_minus_alphas = self.model.sqrt_one_minus_alphas_cumprod if use_original_steps else self.ddim_sqrt_one_minus_alphas sigmas = self.model.ddim_sigmas_for_original_num_steps if use_original_steps else self.ddim_sigmas def get_x_prev_and_pred_x0(e_t, index): # select parameters corresponding to the currently considered timestep a_t = torch.full((b, 1, 1, 1), alphas[index], device=device) a_prev = torch.full((b, 1, 1, 1), alphas_prev[index], device=device) sigma_t = torch.full((b, 1, 1, 1), sigmas[index], device=device) sqrt_one_minus_at = torch.full((b, 1, 1, 1), sqrt_one_minus_alphas[index],device=device) # current prediction for x_0 pred_x0 = (x - sqrt_one_minus_at * e_t) / a_t.sqrt() if quantize_denoised: pred_x0, _, *_ = self.model.first_stage_model.quantize(pred_x0) if dynamic_threshold is not None: pred_x0 = norm_thresholding(pred_x0, dynamic_threshold) # direction pointing to x_t dir_xt = (1. - a_prev - sigma_t**2).sqrt() * e_t noise = sigma_t * noise_like(x.shape, device, repeat_noise) * temperature if noise_dropout > 0.: noise = torch.nn.functional.dropout(noise, p=noise_dropout) x_prev = a_prev.sqrt() * pred_x0 + dir_xt + noise return x_prev, pred_x0 e_t = get_model_output(x, t) if len(old_eps) == 0: # Pseudo Improved Euler (2nd order) x_prev, pred_x0 = get_x_prev_and_pred_x0(e_t, index) e_t_next = get_model_output(x_prev, t_next) e_t_prime = (e_t + e_t_next) / 2 elif len(old_eps) == 1: # 2nd order Pseudo Linear Multistep (Adams-Bashforth) e_t_prime = (3 * e_t - old_eps[-1]) / 2 elif len(old_eps) == 2: # 3nd order Pseudo Linear Multistep (Adams-Bashforth) e_t_prime = (23 * e_t - 16 * old_eps[-1] + 5 * old_eps[-2]) / 12 elif len(old_eps) >= 3: # 4nd order Pseudo Linear Multistep (Adams-Bashforth) e_t_prime = (55 * e_t - 59 * old_eps[-1] + 37 * old_eps[-2] - 9 * old_eps[-3]) / 24 x_prev, pred_x0 = get_x_prev_and_pred_x0(e_t_prime, index) return x_prev, pred_x0, e_t ================================================ FILE: ToonCrafter/ldm/models/diffusion/sampling_util.py ================================================ import torch import numpy as np def append_dims(x, target_dims): """Appends dimensions to the end of a tensor until it has target_dims dimensions. From https://github.com/crowsonkb/k-diffusion/blob/master/k_diffusion/utils.py""" dims_to_append = target_dims - x.ndim if dims_to_append < 0: raise ValueError(f'input has {x.ndim} dims but target_dims is {target_dims}, which is less') return x[(...,) + (None,) * dims_to_append] def norm_thresholding(x0, value): s = append_dims(x0.pow(2).flatten(1).mean(1).sqrt().clamp(min=value), x0.ndim) return x0 * (value / s) def spatial_norm_thresholding(x0, value): # b c h w s = x0.pow(2).mean(1, keepdim=True).sqrt().clamp(min=value) return x0 * (value / s) ================================================ FILE: ToonCrafter/ldm/modules/attention.py ================================================ from inspect import isfunction import math import torch import torch.nn.functional as F from torch import nn, einsum from einops import rearrange, repeat from typing import Optional, Any from ToonCrafter.ldm.modules.diffusionmodules.util import checkpoint try: import xformers import xformers.ops XFORMERS_IS_AVAILBLE = True except: XFORMERS_IS_AVAILBLE = False # CrossAttn precision handling import os _ATTN_PRECISION = os.environ.get("ATTN_PRECISION", "fp32") def exists(val): return val is not None def uniq(arr): return{el: True for el in arr}.keys() def default(val, d): if exists(val): return val return d() if isfunction(d) else d def max_neg_value(t): return -torch.finfo(t.dtype).max def init_(tensor): dim = tensor.shape[-1] std = 1 / math.sqrt(dim) tensor.uniform_(-std, std) return tensor # feedforward class GEGLU(nn.Module): def __init__(self, dim_in, dim_out): super().__init__() self.proj = nn.Linear(dim_in, dim_out * 2) def forward(self, x): x, gate = self.proj(x).chunk(2, dim=-1) return x * F.gelu(gate) class FeedForward(nn.Module): def __init__(self, dim, dim_out=None, mult=4, glu=False, dropout=0.): super().__init__() inner_dim = int(dim * mult) dim_out = default(dim_out, dim) project_in = nn.Sequential( nn.Linear(dim, inner_dim), nn.GELU() ) if not glu else GEGLU(dim, inner_dim) self.net = nn.Sequential( project_in, nn.Dropout(dropout), nn.Linear(inner_dim, dim_out) ) def forward(self, x): return self.net(x) def zero_module(module): """ Zero out the parameters of a module and return it. """ for p in module.parameters(): p.detach().zero_() return module def Normalize(in_channels): return torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True) class SpatialSelfAttention(nn.Module): def __init__(self, in_channels): super().__init__() self.in_channels = in_channels self.norm = Normalize(in_channels) self.q = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) self.k = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) self.v = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) self.proj_out = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) def forward(self, x): h_ = x h_ = self.norm(h_) q = self.q(h_) k = self.k(h_) v = self.v(h_) # compute attention b,c,h,w = q.shape q = rearrange(q, 'b c h w -> b (h w) c') k = rearrange(k, 'b c h w -> b c (h w)') w_ = torch.einsum('bij,bjk->bik', q, k) w_ = w_ * (int(c)**(-0.5)) w_ = torch.nn.functional.softmax(w_, dim=2) # attend to values v = rearrange(v, 'b c h w -> b c (h w)') w_ = rearrange(w_, 'b i j -> b j i') h_ = torch.einsum('bij,bjk->bik', v, w_) h_ = rearrange(h_, 'b c (h w) -> b c h w', h=h) h_ = self.proj_out(h_) return x+h_ class CrossAttention(nn.Module): def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0.): super().__init__() inner_dim = dim_head * heads context_dim = default(context_dim, query_dim) self.scale = dim_head ** -0.5 self.heads = heads self.to_q = nn.Linear(query_dim, inner_dim, bias=False) self.to_k = nn.Linear(context_dim, inner_dim, bias=False) self.to_v = nn.Linear(context_dim, inner_dim, bias=False) self.to_out = nn.Sequential( nn.Linear(inner_dim, query_dim), nn.Dropout(dropout) ) def forward(self, x, context=None, mask=None): h = self.heads q = self.to_q(x) context = default(context, x) k = self.to_k(context) v = self.to_v(context) q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=h), (q, k, v)) # force cast to fp32 to avoid overflowing if _ATTN_PRECISION =="fp32": with torch.autocast(enabled=False, device_type = 'cuda'): q, k = q.float(), k.float() sim = einsum('b i d, b j d -> b i j', q, k) * self.scale else: sim = einsum('b i d, b j d -> b i j', q, k) * self.scale del q, k if exists(mask): mask = rearrange(mask, 'b ... -> b (...)') max_neg_value = -torch.finfo(sim.dtype).max mask = repeat(mask, 'b j -> (b h) () j', h=h) sim.masked_fill_(~mask, max_neg_value) # attention, what we cannot get enough of sim = sim.softmax(dim=-1) out = einsum('b i j, b j d -> b i d', sim, v) out = rearrange(out, '(b h) n d -> b n (h d)', h=h) return self.to_out(out) class MemoryEfficientCrossAttention(nn.Module): # https://github.com/MatthieuTPHR/diffusers/blob/d80b531ff8060ec1ea982b65a1b8df70f73aa67c/src/diffusers/models/attention.py#L223 def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0.0): super().__init__() print(f"Setting up {self.__class__.__name__}. Query dim is {query_dim}, context_dim is {context_dim} and using " f"{heads} heads.") inner_dim = dim_head * heads context_dim = default(context_dim, query_dim) self.heads = heads self.dim_head = dim_head self.to_q = nn.Linear(query_dim, inner_dim, bias=False) self.to_k = nn.Linear(context_dim, inner_dim, bias=False) self.to_v = nn.Linear(context_dim, inner_dim, bias=False) self.to_out = nn.Sequential(nn.Linear(inner_dim, query_dim), nn.Dropout(dropout)) self.attention_op: Optional[Any] = None def forward(self, x, context=None, mask=None): q = self.to_q(x) context = default(context, x) k = self.to_k(context) v = self.to_v(context) b, _, _ = q.shape q, k, v = map( lambda t: t.unsqueeze(3) .reshape(b, t.shape[1], self.heads, self.dim_head) .permute(0, 2, 1, 3) .reshape(b * self.heads, t.shape[1], self.dim_head) .contiguous(), (q, k, v), ) # actually compute the attention, what we cannot get enough of out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None, op=self.attention_op) if exists(mask): raise NotImplementedError out = ( out.unsqueeze(0) .reshape(b, self.heads, out.shape[1], self.dim_head) .permute(0, 2, 1, 3) .reshape(b, out.shape[1], self.heads * self.dim_head) ) return self.to_out(out) class BasicTransformerBlock(nn.Module): ATTENTION_MODES = { "softmax": CrossAttention, # vanilla attention "softmax-xformers": MemoryEfficientCrossAttention } def __init__(self, dim, n_heads, d_head, dropout=0., context_dim=None, gated_ff=True, checkpoint=True, disable_self_attn=False): super().__init__() attn_mode = "softmax-xformers" if XFORMERS_IS_AVAILBLE else "softmax" assert attn_mode in self.ATTENTION_MODES attn_cls = self.ATTENTION_MODES[attn_mode] self.disable_self_attn = disable_self_attn self.attn1 = attn_cls(query_dim=dim, heads=n_heads, dim_head=d_head, dropout=dropout, context_dim=context_dim if self.disable_self_attn else None) # is a self-attention if not self.disable_self_attn self.ff = FeedForward(dim, dropout=dropout, glu=gated_ff) self.attn2 = attn_cls(query_dim=dim, context_dim=context_dim, heads=n_heads, dim_head=d_head, dropout=dropout) # is self-attn if context is none self.norm1 = nn.LayerNorm(dim) self.norm2 = nn.LayerNorm(dim) self.norm3 = nn.LayerNorm(dim) self.checkpoint = checkpoint def forward(self, x, context=None): return checkpoint(self._forward, (x, context), self.parameters(), self.checkpoint) def _forward(self, x, context=None): x = self.attn1(self.norm1(x), context=context if self.disable_self_attn else None) + x x = self.attn2(self.norm2(x), context=context) + x x = self.ff(self.norm3(x)) + x return x class SpatialTransformer(nn.Module): """ Transformer block for image-like data. First, project the input (aka embedding) and reshape to b, t, d. Then apply standard transformer action. Finally, reshape to image NEW: use_linear for more efficiency instead of the 1x1 convs """ def __init__(self, in_channels, n_heads, d_head, depth=1, dropout=0., context_dim=None, disable_self_attn=False, use_linear=False, use_checkpoint=True): super().__init__() if exists(context_dim) and not isinstance(context_dim, list): context_dim = [context_dim] self.in_channels = in_channels inner_dim = n_heads * d_head self.norm = Normalize(in_channels) if not use_linear: self.proj_in = nn.Conv2d(in_channels, inner_dim, kernel_size=1, stride=1, padding=0) else: self.proj_in = nn.Linear(in_channels, inner_dim) self.transformer_blocks = nn.ModuleList( [BasicTransformerBlock(inner_dim, n_heads, d_head, dropout=dropout, context_dim=context_dim[d], disable_self_attn=disable_self_attn, checkpoint=use_checkpoint) for d in range(depth)] ) if not use_linear: self.proj_out = zero_module(nn.Conv2d(inner_dim, in_channels, kernel_size=1, stride=1, padding=0)) else: self.proj_out = zero_module(nn.Linear(in_channels, inner_dim)) self.use_linear = use_linear def forward(self, x, context=None): # note: if no context is given, cross-attention defaults to self-attention if not isinstance(context, list): context = [context] b, c, h, w = x.shape x_in = x x = self.norm(x) if not self.use_linear: x = self.proj_in(x) x = rearrange(x, 'b c h w -> b (h w) c').contiguous() if self.use_linear: x = self.proj_in(x) for i, block in enumerate(self.transformer_blocks): x = block(x, context=context[i]) if self.use_linear: x = self.proj_out(x) x = rearrange(x, 'b (h w) c -> b c h w', h=h, w=w).contiguous() if not self.use_linear: x = self.proj_out(x) return x + x_in ================================================ FILE: ToonCrafter/ldm/modules/diffusionmodules/__init__.py ================================================ ================================================ FILE: ToonCrafter/ldm/modules/diffusionmodules/model.py ================================================ # pytorch_diffusion + derived encoder decoder import math import torch import torch.nn as nn import numpy as np from einops import rearrange from typing import Optional, Any from ToonCrafter.ldm.modules.attention import MemoryEfficientCrossAttention try: import xformers import xformers.ops XFORMERS_IS_AVAILBLE = True except: XFORMERS_IS_AVAILBLE = False print("No module 'xformers'. Proceeding without it.") def get_timestep_embedding(timesteps, embedding_dim): """ This matches the implementation in Denoising Diffusion Probabilistic Models: From Fairseq. Build sinusoidal embeddings. This matches the implementation in tensor2tensor, but differs slightly from the description in Section 3.5 of "Attention Is All You Need". """ assert len(timesteps.shape) == 1 half_dim = embedding_dim // 2 emb = math.log(10000) / (half_dim - 1) emb = torch.exp(torch.arange(half_dim, dtype=torch.float32) * -emb) emb = emb.to(device=timesteps.device) emb = timesteps.float()[:, None] * emb[None, :] emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1) if embedding_dim % 2 == 1: # zero pad emb = torch.nn.functional.pad(emb, (0,1,0,0)) return emb def nonlinearity(x): # swish return x*torch.sigmoid(x) def Normalize(in_channels, num_groups=32): return torch.nn.GroupNorm(num_groups=num_groups, num_channels=in_channels, eps=1e-6, affine=True) class Upsample(nn.Module): def __init__(self, in_channels, with_conv): super().__init__() self.with_conv = with_conv if self.with_conv: self.conv = torch.nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=1, padding=1) def forward(self, x): x = torch.nn.functional.interpolate(x, scale_factor=2.0, mode="nearest") if self.with_conv: x = self.conv(x) return x class Downsample(nn.Module): def __init__(self, in_channels, with_conv): super().__init__() self.with_conv = with_conv if self.with_conv: # no asymmetric padding in torch conv, must do it ourselves self.conv = torch.nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=2, padding=0) def forward(self, x): if self.with_conv: pad = (0,1,0,1) x = torch.nn.functional.pad(x, pad, mode="constant", value=0) x = self.conv(x) else: x = torch.nn.functional.avg_pool2d(x, kernel_size=2, stride=2) return x class ResnetBlock(nn.Module): def __init__(self, *, in_channels, out_channels=None, conv_shortcut=False, dropout, temb_channels=512): super().__init__() self.in_channels = in_channels out_channels = in_channels if out_channels is None else out_channels self.out_channels = out_channels self.use_conv_shortcut = conv_shortcut self.norm1 = Normalize(in_channels) self.conv1 = torch.nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1) if temb_channels > 0: self.temb_proj = torch.nn.Linear(temb_channels, out_channels) self.norm2 = Normalize(out_channels) self.dropout = torch.nn.Dropout(dropout) self.conv2 = torch.nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1) if self.in_channels != self.out_channels: if self.use_conv_shortcut: self.conv_shortcut = torch.nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1) else: self.nin_shortcut = torch.nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, padding=0) def forward(self, x, temb): h = x h = self.norm1(h) h = nonlinearity(h) h = self.conv1(h) if temb is not None: h = h + self.temb_proj(nonlinearity(temb))[:,:,None,None] h = self.norm2(h) h = nonlinearity(h) h = self.dropout(h) h = self.conv2(h) if self.in_channels != self.out_channels: if self.use_conv_shortcut: x = self.conv_shortcut(x) else: x = self.nin_shortcut(x) return x+h class AttnBlock(nn.Module): def __init__(self, in_channels): super().__init__() self.in_channels = in_channels self.norm = Normalize(in_channels) self.q = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) self.k = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) self.v = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) self.proj_out = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) def forward(self, x): h_ = x h_ = self.norm(h_) q = self.q(h_) k = self.k(h_) v = self.v(h_) # compute attention b,c,h,w = q.shape q = q.reshape(b,c,h*w) q = q.permute(0,2,1) # b,hw,c k = k.reshape(b,c,h*w) # b,c,hw w_ = torch.bmm(q,k) # b,hw,hw w[b,i,j]=sum_c q[b,i,c]k[b,c,j] w_ = w_ * (int(c)**(-0.5)) w_ = torch.nn.functional.softmax(w_, dim=2) # attend to values v = v.reshape(b,c,h*w) w_ = w_.permute(0,2,1) # b,hw,hw (first hw of k, second of q) h_ = torch.bmm(v,w_) # b, c,hw (hw of q) h_[b,c,j] = sum_i v[b,c,i] w_[b,i,j] h_ = h_.reshape(b,c,h,w) h_ = self.proj_out(h_) return x+h_ class MemoryEfficientAttnBlock(nn.Module): """ Uses xformers efficient implementation, see https://github.com/MatthieuTPHR/diffusers/blob/d80b531ff8060ec1ea982b65a1b8df70f73aa67c/src/diffusers/models/attention.py#L223 Note: this is a single-head self-attention operation """ # def __init__(self, in_channels): super().__init__() self.in_channels = in_channels self.norm = Normalize(in_channels) self.q = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) self.k = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) self.v = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) self.proj_out = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) self.attention_op: Optional[Any] = None def forward(self, x): h_ = x h_ = self.norm(h_) q = self.q(h_) k = self.k(h_) v = self.v(h_) # compute attention B, C, H, W = q.shape q, k, v = map(lambda x: rearrange(x, 'b c h w -> b (h w) c'), (q, k, v)) q, k, v = map( lambda t: t.unsqueeze(3) .reshape(B, t.shape[1], 1, C) .permute(0, 2, 1, 3) .reshape(B * 1, t.shape[1], C) .contiguous(), (q, k, v), ) out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None, op=self.attention_op) out = ( out.unsqueeze(0) .reshape(B, 1, out.shape[1], C) .permute(0, 2, 1, 3) .reshape(B, out.shape[1], C) ) out = rearrange(out, 'b (h w) c -> b c h w', b=B, h=H, w=W, c=C) out = self.proj_out(out) return x+out class MemoryEfficientCrossAttentionWrapper(MemoryEfficientCrossAttention): def forward(self, x, context=None, mask=None): b, c, h, w = x.shape x = rearrange(x, 'b c h w -> b (h w) c') out = super().forward(x, context=context, mask=mask) out = rearrange(out, 'b (h w) c -> b c h w', h=h, w=w, c=c) return x + out def make_attn(in_channels, attn_type="vanilla", attn_kwargs=None): assert attn_type in ["vanilla", "vanilla-xformers", "memory-efficient-cross-attn", "linear", "none"], f'attn_type {attn_type} unknown' if XFORMERS_IS_AVAILBLE and attn_type == "vanilla": attn_type = "vanilla-xformers" print(f"making attention of type '{attn_type}' with {in_channels} in_channels") if attn_type == "vanilla": assert attn_kwargs is None return AttnBlock(in_channels) elif attn_type == "vanilla-xformers": print(f"building MemoryEfficientAttnBlock with {in_channels} in_channels...") return MemoryEfficientAttnBlock(in_channels) elif type == "memory-efficient-cross-attn": attn_kwargs["query_dim"] = in_channels return MemoryEfficientCrossAttentionWrapper(**attn_kwargs) elif attn_type == "none": return nn.Identity(in_channels) else: raise NotImplementedError() class Model(nn.Module): def __init__(self, *, ch, out_ch, ch_mult=(1,2,4,8), num_res_blocks, attn_resolutions, dropout=0.0, resamp_with_conv=True, in_channels, resolution, use_timestep=True, use_linear_attn=False, attn_type="vanilla"): super().__init__() if use_linear_attn: attn_type = "linear" self.ch = ch self.temb_ch = self.ch*4 self.num_resolutions = len(ch_mult) self.num_res_blocks = num_res_blocks self.resolution = resolution self.in_channels = in_channels self.use_timestep = use_timestep if self.use_timestep: # timestep embedding self.temb = nn.Module() self.temb.dense = nn.ModuleList([ torch.nn.Linear(self.ch, self.temb_ch), torch.nn.Linear(self.temb_ch, self.temb_ch), ]) # downsampling self.conv_in = torch.nn.Conv2d(in_channels, self.ch, kernel_size=3, stride=1, padding=1) curr_res = resolution in_ch_mult = (1,)+tuple(ch_mult) self.down = nn.ModuleList() for i_level in range(self.num_resolutions): block = nn.ModuleList() attn = nn.ModuleList() block_in = ch*in_ch_mult[i_level] block_out = ch*ch_mult[i_level] for i_block in range(self.num_res_blocks): block.append(ResnetBlock(in_channels=block_in, out_channels=block_out, temb_channels=self.temb_ch, dropout=dropout)) block_in = block_out if curr_res in attn_resolutions: attn.append(make_attn(block_in, attn_type=attn_type)) down = nn.Module() down.block = block down.attn = attn if i_level != self.num_resolutions-1: down.downsample = Downsample(block_in, resamp_with_conv) curr_res = curr_res // 2 self.down.append(down) # middle self.mid = nn.Module() self.mid.block_1 = ResnetBlock(in_channels=block_in, out_channels=block_in, temb_channels=self.temb_ch, dropout=dropout) self.mid.attn_1 = make_attn(block_in, attn_type=attn_type) self.mid.block_2 = ResnetBlock(in_channels=block_in, out_channels=block_in, temb_channels=self.temb_ch, dropout=dropout) # upsampling self.up = nn.ModuleList() for i_level in reversed(range(self.num_resolutions)): block = nn.ModuleList() attn = nn.ModuleList() block_out = ch*ch_mult[i_level] skip_in = ch*ch_mult[i_level] for i_block in range(self.num_res_blocks+1): if i_block == self.num_res_blocks: skip_in = ch*in_ch_mult[i_level] block.append(ResnetBlock(in_channels=block_in+skip_in, out_channels=block_out, temb_channels=self.temb_ch, dropout=dropout)) block_in = block_out if curr_res in attn_resolutions: attn.append(make_attn(block_in, attn_type=attn_type)) up = nn.Module() up.block = block up.attn = attn if i_level != 0: up.upsample = Upsample(block_in, resamp_with_conv) curr_res = curr_res * 2 self.up.insert(0, up) # prepend to get consistent order # end self.norm_out = Normalize(block_in) self.conv_out = torch.nn.Conv2d(block_in, out_ch, kernel_size=3, stride=1, padding=1) def forward(self, x, t=None, context=None): #assert x.shape[2] == x.shape[3] == self.resolution if context is not None: # assume aligned context, cat along channel axis x = torch.cat((x, context), dim=1) if self.use_timestep: # timestep embedding assert t is not None temb = get_timestep_embedding(t, self.ch) temb = self.temb.dense[0](temb) temb = nonlinearity(temb) temb = self.temb.dense[1](temb) else: temb = None # downsampling hs = [self.conv_in(x)] for i_level in range(self.num_resolutions): for i_block in range(self.num_res_blocks): h = self.down[i_level].block[i_block](hs[-1], temb) if len(self.down[i_level].attn) > 0: h = self.down[i_level].attn[i_block](h) hs.append(h) if i_level != self.num_resolutions-1: hs.append(self.down[i_level].downsample(hs[-1])) # middle h = hs[-1] h = self.mid.block_1(h, temb) h = self.mid.attn_1(h) h = self.mid.block_2(h, temb) # upsampling for i_level in reversed(range(self.num_resolutions)): for i_block in range(self.num_res_blocks+1): h = self.up[i_level].block[i_block]( torch.cat([h, hs.pop()], dim=1), temb) if len(self.up[i_level].attn) > 0: h = self.up[i_level].attn[i_block](h) if i_level != 0: h = self.up[i_level].upsample(h) # end h = self.norm_out(h) h = nonlinearity(h) h = self.conv_out(h) return h def get_last_layer(self): return self.conv_out.weight class Encoder(nn.Module): def __init__(self, *, ch, out_ch, ch_mult=(1,2,4,8), num_res_blocks, attn_resolutions, dropout=0.0, resamp_with_conv=True, in_channels, resolution, z_channels, double_z=True, use_linear_attn=False, attn_type="vanilla", **ignore_kwargs): super().__init__() if use_linear_attn: attn_type = "linear" self.ch = ch self.temb_ch = 0 self.num_resolutions = len(ch_mult) self.num_res_blocks = num_res_blocks self.resolution = resolution self.in_channels = in_channels # downsampling self.conv_in = torch.nn.Conv2d(in_channels, self.ch, kernel_size=3, stride=1, padding=1) curr_res = resolution in_ch_mult = (1,)+tuple(ch_mult) self.in_ch_mult = in_ch_mult self.down = nn.ModuleList() for i_level in range(self.num_resolutions): block = nn.ModuleList() attn = nn.ModuleList() block_in = ch*in_ch_mult[i_level] block_out = ch*ch_mult[i_level] for i_block in range(self.num_res_blocks): block.append(ResnetBlock(in_channels=block_in, out_channels=block_out, temb_channels=self.temb_ch, dropout=dropout)) block_in = block_out if curr_res in attn_resolutions: attn.append(make_attn(block_in, attn_type=attn_type)) down = nn.Module() down.block = block down.attn = attn if i_level != self.num_resolutions-1: down.downsample = Downsample(block_in, resamp_with_conv) curr_res = curr_res // 2 self.down.append(down) # middle self.mid = nn.Module() self.mid.block_1 = ResnetBlock(in_channels=block_in, out_channels=block_in, temb_channels=self.temb_ch, dropout=dropout) self.mid.attn_1 = make_attn(block_in, attn_type=attn_type) self.mid.block_2 = ResnetBlock(in_channels=block_in, out_channels=block_in, temb_channels=self.temb_ch, dropout=dropout) # end self.norm_out = Normalize(block_in) self.conv_out = torch.nn.Conv2d(block_in, 2*z_channels if double_z else z_channels, kernel_size=3, stride=1, padding=1) def forward(self, x): # timestep embedding temb = None # downsampling hs = [self.conv_in(x)] for i_level in range(self.num_resolutions): for i_block in range(self.num_res_blocks): h = self.down[i_level].block[i_block](hs[-1], temb) if len(self.down[i_level].attn) > 0: h = self.down[i_level].attn[i_block](h) hs.append(h) if i_level != self.num_resolutions-1: hs.append(self.down[i_level].downsample(hs[-1])) # middle h = hs[-1] h = self.mid.block_1(h, temb) h = self.mid.attn_1(h) h = self.mid.block_2(h, temb) # end h = self.norm_out(h) h = nonlinearity(h) h = self.conv_out(h) return h class Decoder(nn.Module): def __init__(self, *, ch, out_ch, ch_mult=(1,2,4,8), num_res_blocks, attn_resolutions, dropout=0.0, resamp_with_conv=True, in_channels, resolution, z_channels, give_pre_end=False, tanh_out=False, use_linear_attn=False, attn_type="vanilla", **ignorekwargs): super().__init__() if use_linear_attn: attn_type = "linear" self.ch = ch self.temb_ch = 0 self.num_resolutions = len(ch_mult) self.num_res_blocks = num_res_blocks self.resolution = resolution self.in_channels = in_channels self.give_pre_end = give_pre_end self.tanh_out = tanh_out # compute in_ch_mult, block_in and curr_res at lowest res in_ch_mult = (1,)+tuple(ch_mult) block_in = ch*ch_mult[self.num_resolutions-1] curr_res = resolution // 2**(self.num_resolutions-1) self.z_shape = (1,z_channels,curr_res,curr_res) print("Working with z of shape {} = {} dimensions.".format( self.z_shape, np.prod(self.z_shape))) # z to block_in self.conv_in = torch.nn.Conv2d(z_channels, block_in, kernel_size=3, stride=1, padding=1) # middle self.mid = nn.Module() self.mid.block_1 = ResnetBlock(in_channels=block_in, out_channels=block_in, temb_channels=self.temb_ch, dropout=dropout) self.mid.attn_1 = make_attn(block_in, attn_type=attn_type) self.mid.block_2 = ResnetBlock(in_channels=block_in, out_channels=block_in, temb_channels=self.temb_ch, dropout=dropout) # upsampling self.up = nn.ModuleList() for i_level in reversed(range(self.num_resolutions)): block = nn.ModuleList() attn = nn.ModuleList() block_out = ch*ch_mult[i_level] for i_block in range(self.num_res_blocks+1): block.append(ResnetBlock(in_channels=block_in, out_channels=block_out, temb_channels=self.temb_ch, dropout=dropout)) block_in = block_out if curr_res in attn_resolutions: attn.append(make_attn(block_in, attn_type=attn_type)) up = nn.Module() up.block = block up.attn = attn if i_level != 0: up.upsample = Upsample(block_in, resamp_with_conv) curr_res = curr_res * 2 self.up.insert(0, up) # prepend to get consistent order # end self.norm_out = Normalize(block_in) self.conv_out = torch.nn.Conv2d(block_in, out_ch, kernel_size=3, stride=1, padding=1) def forward(self, z): #assert z.shape[1:] == self.z_shape[1:] self.last_z_shape = z.shape # timestep embedding temb = None # z to block_in h = self.conv_in(z) # middle h = self.mid.block_1(h, temb) h = self.mid.attn_1(h) h = self.mid.block_2(h, temb) # upsampling for i_level in reversed(range(self.num_resolutions)): for i_block in range(self.num_res_blocks+1): h = self.up[i_level].block[i_block](h, temb) if len(self.up[i_level].attn) > 0: h = self.up[i_level].attn[i_block](h) if i_level != 0: h = self.up[i_level].upsample(h) # end if self.give_pre_end: return h h = self.norm_out(h) h = nonlinearity(h) h = self.conv_out(h) if self.tanh_out: h = torch.tanh(h) return h class SimpleDecoder(nn.Module): def __init__(self, in_channels, out_channels, *args, **kwargs): super().__init__() self.model = nn.ModuleList([nn.Conv2d(in_channels, in_channels, 1), ResnetBlock(in_channels=in_channels, out_channels=2 * in_channels, temb_channels=0, dropout=0.0), ResnetBlock(in_channels=2 * in_channels, out_channels=4 * in_channels, temb_channels=0, dropout=0.0), ResnetBlock(in_channels=4 * in_channels, out_channels=2 * in_channels, temb_channels=0, dropout=0.0), nn.Conv2d(2*in_channels, in_channels, 1), Upsample(in_channels, with_conv=True)]) # end self.norm_out = Normalize(in_channels) self.conv_out = torch.nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1) def forward(self, x): for i, layer in enumerate(self.model): if i in [1,2,3]: x = layer(x, None) else: x = layer(x) h = self.norm_out(x) h = nonlinearity(h) x = self.conv_out(h) return x class UpsampleDecoder(nn.Module): def __init__(self, in_channels, out_channels, ch, num_res_blocks, resolution, ch_mult=(2,2), dropout=0.0): super().__init__() # upsampling self.temb_ch = 0 self.num_resolutions = len(ch_mult) self.num_res_blocks = num_res_blocks block_in = in_channels curr_res = resolution // 2 ** (self.num_resolutions - 1) self.res_blocks = nn.ModuleList() self.upsample_blocks = nn.ModuleList() for i_level in range(self.num_resolutions): res_block = [] block_out = ch * ch_mult[i_level] for i_block in range(self.num_res_blocks + 1): res_block.append(ResnetBlock(in_channels=block_in, out_channels=block_out, temb_channels=self.temb_ch, dropout=dropout)) block_in = block_out self.res_blocks.append(nn.ModuleList(res_block)) if i_level != self.num_resolutions - 1: self.upsample_blocks.append(Upsample(block_in, True)) curr_res = curr_res * 2 # end self.norm_out = Normalize(block_in) self.conv_out = torch.nn.Conv2d(block_in, out_channels, kernel_size=3, stride=1, padding=1) def forward(self, x): # upsampling h = x for k, i_level in enumerate(range(self.num_resolutions)): for i_block in range(self.num_res_blocks + 1): h = self.res_blocks[i_level][i_block](h, None) if i_level != self.num_resolutions - 1: h = self.upsample_blocks[k](h) h = self.norm_out(h) h = nonlinearity(h) h = self.conv_out(h) return h class LatentRescaler(nn.Module): def __init__(self, factor, in_channels, mid_channels, out_channels, depth=2): super().__init__() # residual block, interpolate, residual block self.factor = factor self.conv_in = nn.Conv2d(in_channels, mid_channels, kernel_size=3, stride=1, padding=1) self.res_block1 = nn.ModuleList([ResnetBlock(in_channels=mid_channels, out_channels=mid_channels, temb_channels=0, dropout=0.0) for _ in range(depth)]) self.attn = AttnBlock(mid_channels) self.res_block2 = nn.ModuleList([ResnetBlock(in_channels=mid_channels, out_channels=mid_channels, temb_channels=0, dropout=0.0) for _ in range(depth)]) self.conv_out = nn.Conv2d(mid_channels, out_channels, kernel_size=1, ) def forward(self, x): x = self.conv_in(x) for block in self.res_block1: x = block(x, None) x = torch.nn.functional.interpolate(x, size=(int(round(x.shape[2]*self.factor)), int(round(x.shape[3]*self.factor)))) x = self.attn(x) for block in self.res_block2: x = block(x, None) x = self.conv_out(x) return x class MergedRescaleEncoder(nn.Module): def __init__(self, in_channels, ch, resolution, out_ch, num_res_blocks, attn_resolutions, dropout=0.0, resamp_with_conv=True, ch_mult=(1,2,4,8), rescale_factor=1.0, rescale_module_depth=1): super().__init__() intermediate_chn = ch * ch_mult[-1] self.encoder = Encoder(in_channels=in_channels, num_res_blocks=num_res_blocks, ch=ch, ch_mult=ch_mult, z_channels=intermediate_chn, double_z=False, resolution=resolution, attn_resolutions=attn_resolutions, dropout=dropout, resamp_with_conv=resamp_with_conv, out_ch=None) self.rescaler = LatentRescaler(factor=rescale_factor, in_channels=intermediate_chn, mid_channels=intermediate_chn, out_channels=out_ch, depth=rescale_module_depth) def forward(self, x): x = self.encoder(x) x = self.rescaler(x) return x class MergedRescaleDecoder(nn.Module): def __init__(self, z_channels, out_ch, resolution, num_res_blocks, attn_resolutions, ch, ch_mult=(1,2,4,8), dropout=0.0, resamp_with_conv=True, rescale_factor=1.0, rescale_module_depth=1): super().__init__() tmp_chn = z_channels*ch_mult[-1] self.decoder = Decoder(out_ch=out_ch, z_channels=tmp_chn, attn_resolutions=attn_resolutions, dropout=dropout, resamp_with_conv=resamp_with_conv, in_channels=None, num_res_blocks=num_res_blocks, ch_mult=ch_mult, resolution=resolution, ch=ch) self.rescaler = LatentRescaler(factor=rescale_factor, in_channels=z_channels, mid_channels=tmp_chn, out_channels=tmp_chn, depth=rescale_module_depth) def forward(self, x): x = self.rescaler(x) x = self.decoder(x) return x class Upsampler(nn.Module): def __init__(self, in_size, out_size, in_channels, out_channels, ch_mult=2): super().__init__() assert out_size >= in_size num_blocks = int(np.log2(out_size//in_size))+1 factor_up = 1.+ (out_size % in_size) print(f"Building {self.__class__.__name__} with in_size: {in_size} --> out_size {out_size} and factor {factor_up}") self.rescaler = LatentRescaler(factor=factor_up, in_channels=in_channels, mid_channels=2*in_channels, out_channels=in_channels) self.decoder = Decoder(out_ch=out_channels, resolution=out_size, z_channels=in_channels, num_res_blocks=2, attn_resolutions=[], in_channels=None, ch=in_channels, ch_mult=[ch_mult for _ in range(num_blocks)]) def forward(self, x): x = self.rescaler(x) x = self.decoder(x) return x class Resize(nn.Module): def __init__(self, in_channels=None, learned=False, mode="bilinear"): super().__init__() self.with_conv = learned self.mode = mode if self.with_conv: print(f"Note: {self.__class__.__name} uses learned downsampling and will ignore the fixed {mode} mode") raise NotImplementedError() assert in_channels is not None # no asymmetric padding in torch conv, must do it ourselves self.conv = torch.nn.Conv2d(in_channels, in_channels, kernel_size=4, stride=2, padding=1) def forward(self, x, scale_factor=1.0): if scale_factor==1.0: return x else: x = torch.nn.functional.interpolate(x, mode=self.mode, align_corners=False, scale_factor=scale_factor) return x ================================================ FILE: ToonCrafter/ldm/modules/diffusionmodules/openaimodel.py ================================================ from abc import abstractmethod import math import numpy as np import torch as th import torch.nn as nn import torch.nn.functional as F from ToonCrafter.ldm.modules.diffusionmodules.util import ( checkpoint, conv_nd, linear, avg_pool_nd, zero_module, normalization, timestep_embedding, ) from ToonCrafter.ldm.modules.attention import SpatialTransformer from ToonCrafter.ldm.util import exists # dummy replace def convert_module_to_f16(x): pass def convert_module_to_f32(x): pass ## go class AttentionPool2d(nn.Module): """ Adapted from CLIP: https://github.com/openai/CLIP/blob/main/clip/model.py """ def __init__( self, spacial_dim: int, embed_dim: int, num_heads_channels: int, output_dim: int = None, ): super().__init__() self.positional_embedding = nn.Parameter(th.randn(embed_dim, spacial_dim ** 2 + 1) / embed_dim ** 0.5) self.qkv_proj = conv_nd(1, embed_dim, 3 * embed_dim, 1) self.c_proj = conv_nd(1, embed_dim, output_dim or embed_dim, 1) self.num_heads = embed_dim // num_heads_channels self.attention = QKVAttention(self.num_heads) def forward(self, x): b, c, *_spatial = x.shape x = x.reshape(b, c, -1) # NC(HW) x = th.cat([x.mean(dim=-1, keepdim=True), x], dim=-1) # NC(HW+1) x = x + self.positional_embedding[None, :, :].to(x.dtype) # NC(HW+1) x = self.qkv_proj(x) x = self.attention(x) x = self.c_proj(x) return x[:, :, 0] class TimestepBlock(nn.Module): """ Any module where forward() takes timestep embeddings as a second argument. """ @abstractmethod def forward(self, x, emb): """ Apply the module to `x` given `emb` timestep embeddings. """ class TimestepEmbedSequential(nn.Sequential, TimestepBlock): """ A sequential module that passes timestep embeddings to the children that support it as an extra input. """ def forward(self, x, emb, context=None, check=False): for layer in self: if isinstance(layer, TimestepBlock): x = layer(x, emb) elif isinstance(layer, SpatialTransformer): x = layer(x, context) else: x = layer(x) return x class Upsample(nn.Module): """ An upsampling layer with an optional convolution. :param channels: channels in the inputs and outputs. :param use_conv: a bool determining if a convolution is applied. :param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then upsampling occurs in the inner-two dimensions. """ def __init__(self, channels, use_conv, dims=2, out_channels=None, padding=1): super().__init__() self.channels = channels self.out_channels = out_channels or channels self.use_conv = use_conv self.dims = dims if use_conv: self.conv = conv_nd(dims, self.channels, self.out_channels, 3, padding=padding) def forward(self, x): assert x.shape[1] == self.channels if self.dims == 3: x = F.interpolate( x, (x.shape[2], x.shape[3] * 2, x.shape[4] * 2), mode="nearest" ) else: x = F.interpolate(x, scale_factor=2, mode="nearest") if self.use_conv: x = self.conv(x) return x class TransposedUpsample(nn.Module): 'Learned 2x upsampling without padding' def __init__(self, channels, out_channels=None, ks=5): super().__init__() self.channels = channels self.out_channels = out_channels or channels self.up = nn.ConvTranspose2d(self.channels,self.out_channels,kernel_size=ks,stride=2) def forward(self,x): return self.up(x) class Downsample(nn.Module): """ A downsampling layer with an optional convolution. :param channels: channels in the inputs and outputs. :param use_conv: a bool determining if a convolution is applied. :param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then downsampling occurs in the inner-two dimensions. """ def __init__(self, channels, use_conv, dims=2, out_channels=None,padding=1): super().__init__() self.channels = channels self.out_channels = out_channels or channels self.use_conv = use_conv self.dims = dims stride = 2 if dims != 3 else (1, 2, 2) if use_conv: self.op = conv_nd( dims, self.channels, self.out_channels, 3, stride=stride, padding=padding ) else: assert self.channels == self.out_channels self.op = avg_pool_nd(dims, kernel_size=stride, stride=stride) def forward(self, x): assert x.shape[1] == self.channels return self.op(x) class ResBlock(TimestepBlock): """ A residual block that can optionally change the number of channels. :param channels: the number of input channels. :param emb_channels: the number of timestep embedding channels. :param dropout: the rate of dropout. :param out_channels: if specified, the number of out channels. :param use_conv: if True and out_channels is specified, use a spatial convolution instead of a smaller 1x1 convolution to change the channels in the skip connection. :param dims: determines if the signal is 1D, 2D, or 3D. :param use_checkpoint: if True, use gradient checkpointing on this module. :param up: if True, use this block for upsampling. :param down: if True, use this block for downsampling. """ def __init__( self, channels, emb_channels, dropout, out_channels=None, use_conv=False, use_scale_shift_norm=False, dims=2, use_checkpoint=False, up=False, down=False, ): super().__init__() self.channels = channels self.emb_channels = emb_channels self.dropout = dropout self.out_channels = out_channels or channels self.use_conv = use_conv self.use_checkpoint = use_checkpoint self.use_scale_shift_norm = use_scale_shift_norm self.in_layers = nn.Sequential( normalization(channels), nn.SiLU(), conv_nd(dims, channels, self.out_channels, 3, padding=1), ) self.updown = up or down if up: self.h_upd = Upsample(channels, False, dims) self.x_upd = Upsample(channels, False, dims) elif down: self.h_upd = Downsample(channels, False, dims) self.x_upd = Downsample(channels, False, dims) else: self.h_upd = self.x_upd = nn.Identity() self.emb_layers = nn.Sequential( nn.SiLU(), linear( emb_channels, 2 * self.out_channels if use_scale_shift_norm else self.out_channels, ), ) self.out_layers = nn.Sequential( normalization(self.out_channels), nn.SiLU(), nn.Dropout(p=dropout), zero_module( conv_nd(dims, self.out_channels, self.out_channels, 3, padding=1) ), ) if self.out_channels == channels: self.skip_connection = nn.Identity() elif use_conv: self.skip_connection = conv_nd( dims, channels, self.out_channels, 3, padding=1 ) else: self.skip_connection = conv_nd(dims, channels, self.out_channels, 1) def forward(self, x, emb): """ Apply the block to a Tensor, conditioned on a timestep embedding. :param x: an [N x C x ...] Tensor of features. :param emb: an [N x emb_channels] Tensor of timestep embeddings. :return: an [N x C x ...] Tensor of outputs. """ return checkpoint( self._forward, (x, emb), self.parameters(), self.use_checkpoint ) def _forward(self, x, emb): if self.updown: in_rest, in_conv = self.in_layers[:-1], self.in_layers[-1] h = in_rest(x) h = self.h_upd(h) x = self.x_upd(x) h = in_conv(h) else: h = self.in_layers(x) emb_out = self.emb_layers(emb).type(h.dtype) while len(emb_out.shape) < len(h.shape): emb_out = emb_out[..., None] if self.use_scale_shift_norm: out_norm, out_rest = self.out_layers[0], self.out_layers[1:] scale, shift = th.chunk(emb_out, 2, dim=1) h = out_norm(h) * (1 + scale) + shift h = out_rest(h) else: h = h + emb_out h = self.out_layers(h) return self.skip_connection(x) + h class AttentionBlock(nn.Module): """ An attention block that allows spatial positions to attend to each other. Originally ported from here, but adapted to the N-d case. https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/models/unet.py#L66. """ def __init__( self, channels, num_heads=1, num_head_channels=-1, use_checkpoint=False, use_new_attention_order=False, ): super().__init__() self.channels = channels if num_head_channels == -1: self.num_heads = num_heads else: assert ( channels % num_head_channels == 0 ), f"q,k,v channels {channels} is not divisible by num_head_channels {num_head_channels}" self.num_heads = channels // num_head_channels self.use_checkpoint = use_checkpoint self.norm = normalization(channels) self.qkv = conv_nd(1, channels, channels * 3, 1) if use_new_attention_order: # split qkv before split heads self.attention = QKVAttention(self.num_heads) else: # split heads before split qkv self.attention = QKVAttentionLegacy(self.num_heads) self.proj_out = zero_module(conv_nd(1, channels, channels, 1)) def forward(self, x): return checkpoint(self._forward, (x,), self.parameters(), True) # TODO: check checkpoint usage, is True # TODO: fix the .half call!!! #return pt_checkpoint(self._forward, x) # pytorch def _forward(self, x): b, c, *spatial = x.shape x = x.reshape(b, c, -1) qkv = self.qkv(self.norm(x)) h = self.attention(qkv) h = self.proj_out(h) return (x + h).reshape(b, c, *spatial) def count_flops_attn(model, _x, y): """ A counter for the `thop` package to count the operations in an attention operation. Meant to be used like: macs, params = thop.profile( model, inputs=(inputs, timestamps), custom_ops={QKVAttention: QKVAttention.count_flops}, ) """ b, c, *spatial = y[0].shape num_spatial = int(np.prod(spatial)) # We perform two matmuls with the same number of ops. # The first computes the weight matrix, the second computes # the combination of the value vectors. matmul_ops = 2 * b * (num_spatial ** 2) * c model.total_ops += th.DoubleTensor([matmul_ops]) class QKVAttentionLegacy(nn.Module): """ A module which performs QKV attention. Matches legacy QKVAttention + input/ouput heads shaping """ def __init__(self, n_heads): super().__init__() self.n_heads = n_heads def forward(self, qkv): """ Apply QKV attention. :param qkv: an [N x (H * 3 * C) x T] tensor of Qs, Ks, and Vs. :return: an [N x (H * C) x T] tensor after attention. """ bs, width, length = qkv.shape assert width % (3 * self.n_heads) == 0 ch = width // (3 * self.n_heads) q, k, v = qkv.reshape(bs * self.n_heads, ch * 3, length).split(ch, dim=1) scale = 1 / math.sqrt(math.sqrt(ch)) weight = th.einsum( "bct,bcs->bts", q * scale, k * scale ) # More stable with f16 than dividing afterwards weight = th.softmax(weight.float(), dim=-1).type(weight.dtype) a = th.einsum("bts,bcs->bct", weight, v) return a.reshape(bs, -1, length) @staticmethod def count_flops(model, _x, y): return count_flops_attn(model, _x, y) class QKVAttention(nn.Module): """ A module which performs QKV attention and splits in a different order. """ def __init__(self, n_heads): super().__init__() self.n_heads = n_heads def forward(self, qkv): """ Apply QKV attention. :param qkv: an [N x (3 * H * C) x T] tensor of Qs, Ks, and Vs. :return: an [N x (H * C) x T] tensor after attention. """ bs, width, length = qkv.shape assert width % (3 * self.n_heads) == 0 ch = width // (3 * self.n_heads) q, k, v = qkv.chunk(3, dim=1) scale = 1 / math.sqrt(math.sqrt(ch)) weight = th.einsum( "bct,bcs->bts", (q * scale).view(bs * self.n_heads, ch, length), (k * scale).view(bs * self.n_heads, ch, length), ) # More stable with f16 than dividing afterwards weight = th.softmax(weight.float(), dim=-1).type(weight.dtype) a = th.einsum("bts,bcs->bct", weight, v.reshape(bs * self.n_heads, ch, length)) return a.reshape(bs, -1, length) @staticmethod def count_flops(model, _x, y): return count_flops_attn(model, _x, y) class UNetModel(nn.Module): """ The full UNet model with attention and timestep embedding. :param in_channels: channels in the input Tensor. :param model_channels: base channel count for the model. :param out_channels: channels in the output Tensor. :param num_res_blocks: number of residual blocks per downsample. :param attention_resolutions: a collection of downsample rates at which attention will take place. May be a set, list, or tuple. For example, if this contains 4, then at 4x downsampling, attention will be used. :param dropout: the dropout probability. :param channel_mult: channel multiplier for each level of the UNet. :param conv_resample: if True, use learned convolutions for upsampling and downsampling. :param dims: determines if the signal is 1D, 2D, or 3D. :param num_classes: if specified (as an int), then this model will be class-conditional with `num_classes` classes. :param use_checkpoint: use gradient checkpointing to reduce memory usage. :param num_heads: the number of attention heads in each attention layer. :param num_heads_channels: if specified, ignore num_heads and instead use a fixed channel width per attention head. :param num_heads_upsample: works with num_heads to set a different number of heads for upsampling. Deprecated. :param use_scale_shift_norm: use a FiLM-like conditioning mechanism. :param resblock_updown: use residual blocks for up/downsampling. :param use_new_attention_order: use a different attention pattern for potentially increased efficiency. """ def __init__( self, image_size, in_channels, model_channels, out_channels, num_res_blocks, attention_resolutions, dropout=0, channel_mult=(1, 2, 4, 8), conv_resample=True, dims=2, num_classes=None, use_checkpoint=False, use_fp16=False, num_heads=-1, num_head_channels=-1, num_heads_upsample=-1, use_scale_shift_norm=False, resblock_updown=False, use_new_attention_order=False, use_spatial_transformer=False, # custom transformer support transformer_depth=1, # custom transformer support context_dim=None, # custom transformer support n_embed=None, # custom support for prediction of discrete ids into codebook of first stage vq model legacy=True, disable_self_attentions=None, num_attention_blocks=None, disable_middle_self_attn=False, use_linear_in_transformer=False, ): super().__init__() if use_spatial_transformer: assert context_dim is not None, 'Fool!! You forgot to include the dimension of your cross-attention conditioning...' if context_dim is not None: assert use_spatial_transformer, 'Fool!! You forgot to use the spatial transformer for your cross-attention conditioning...' from omegaconf.listconfig import ListConfig if type(context_dim) == ListConfig: context_dim = list(context_dim) if num_heads_upsample == -1: num_heads_upsample = num_heads if num_heads == -1: assert num_head_channels != -1, 'Either num_heads or num_head_channels has to be set' if num_head_channels == -1: assert num_heads != -1, 'Either num_heads or num_head_channels has to be set' self.image_size = image_size self.in_channels = in_channels self.model_channels = model_channels self.out_channels = out_channels if isinstance(num_res_blocks, int): self.num_res_blocks = len(channel_mult) * [num_res_blocks] else: if len(num_res_blocks) != len(channel_mult): raise ValueError("provide num_res_blocks either as an int (globally constant) or " "as a list/tuple (per-level) with the same length as channel_mult") self.num_res_blocks = num_res_blocks if disable_self_attentions is not None: # should be a list of booleans, indicating whether to disable self-attention in TransformerBlocks or not assert len(disable_self_attentions) == len(channel_mult) if num_attention_blocks is not None: assert len(num_attention_blocks) == len(self.num_res_blocks) assert all(map(lambda i: self.num_res_blocks[i] >= num_attention_blocks[i], range(len(num_attention_blocks)))) print(f"Constructor of UNetModel received num_attention_blocks={num_attention_blocks}. " f"This option has LESS priority than attention_resolutions {attention_resolutions}, " f"i.e., in cases where num_attention_blocks[i] > 0 but 2**i not in attention_resolutions, " f"attention will still not be set.") self.attention_resolutions = attention_resolutions self.dropout = dropout self.channel_mult = channel_mult self.conv_resample = conv_resample self.num_classes = num_classes self.use_checkpoint = use_checkpoint self.dtype = th.float16 if use_fp16 else th.float32 self.num_heads = num_heads self.num_head_channels = num_head_channels self.num_heads_upsample = num_heads_upsample self.predict_codebook_ids = n_embed is not None time_embed_dim = model_channels * 4 self.time_embed = nn.Sequential( linear(model_channels, time_embed_dim), nn.SiLU(), linear(time_embed_dim, time_embed_dim), ) if self.num_classes is not None: if isinstance(self.num_classes, int): self.label_emb = nn.Embedding(num_classes, time_embed_dim) elif self.num_classes == "continuous": print("setting up linear c_adm embedding layer") self.label_emb = nn.Linear(1, time_embed_dim) else: raise ValueError() self.input_blocks = nn.ModuleList( [ TimestepEmbedSequential( conv_nd(dims, in_channels, model_channels, 3, padding=1) ) ] ) self._feature_size = model_channels input_block_chans = [model_channels] ch = model_channels ds = 1 for level, mult in enumerate(channel_mult): for nr in range(self.num_res_blocks[level]): layers = [ ResBlock( ch, time_embed_dim, dropout, out_channels=mult * model_channels, dims=dims, use_checkpoint=use_checkpoint, use_scale_shift_norm=use_scale_shift_norm, ) ] ch = mult * model_channels if ds in attention_resolutions: if num_head_channels == -1: dim_head = ch // num_heads else: num_heads = ch // num_head_channels dim_head = num_head_channels if legacy: #num_heads = 1 dim_head = ch // num_heads if use_spatial_transformer else num_head_channels if exists(disable_self_attentions): disabled_sa = disable_self_attentions[level] else: disabled_sa = False if not exists(num_attention_blocks) or nr < num_attention_blocks[level]: layers.append( AttentionBlock( ch, use_checkpoint=use_checkpoint, num_heads=num_heads, num_head_channels=dim_head, use_new_attention_order=use_new_attention_order, ) if not use_spatial_transformer else SpatialTransformer( ch, num_heads, dim_head, depth=transformer_depth, context_dim=context_dim, disable_self_attn=disabled_sa, use_linear=use_linear_in_transformer, use_checkpoint=use_checkpoint ) ) self.input_blocks.append(TimestepEmbedSequential(*layers)) self._feature_size += ch input_block_chans.append(ch) if level != len(channel_mult) - 1: out_ch = ch self.input_blocks.append( TimestepEmbedSequential( ResBlock( ch, time_embed_dim, dropout, out_channels=out_ch, dims=dims, use_checkpoint=use_checkpoint, use_scale_shift_norm=use_scale_shift_norm, down=True, ) if resblock_updown else Downsample( ch, conv_resample, dims=dims, out_channels=out_ch ) ) ) ch = out_ch input_block_chans.append(ch) ds *= 2 self._feature_size += ch if num_head_channels == -1: dim_head = ch // num_heads else: num_heads = ch // num_head_channels dim_head = num_head_channels if legacy: #num_heads = 1 dim_head = ch // num_heads if use_spatial_transformer else num_head_channels self.middle_block = TimestepEmbedSequential( ResBlock( ch, time_embed_dim, dropout, dims=dims, use_checkpoint=use_checkpoint, use_scale_shift_norm=use_scale_shift_norm, ), AttentionBlock( ch, use_checkpoint=use_checkpoint, num_heads=num_heads, num_head_channels=dim_head, use_new_attention_order=use_new_attention_order, ) if not use_spatial_transformer else SpatialTransformer( # always uses a self-attn ch, num_heads, dim_head, depth=transformer_depth, context_dim=context_dim, disable_self_attn=disable_middle_self_attn, use_linear=use_linear_in_transformer, use_checkpoint=use_checkpoint ), ResBlock( ch, time_embed_dim, dropout, dims=dims, use_checkpoint=use_checkpoint, use_scale_shift_norm=use_scale_shift_norm, ), ) self._feature_size += ch self.output_blocks = nn.ModuleList([]) for level, mult in list(enumerate(channel_mult))[::-1]: for i in range(self.num_res_blocks[level] + 1): ich = input_block_chans.pop() layers = [ ResBlock( ch + ich, time_embed_dim, dropout, out_channels=model_channels * mult, dims=dims, use_checkpoint=use_checkpoint, use_scale_shift_norm=use_scale_shift_norm, ) ] ch = model_channels * mult if ds in attention_resolutions: if num_head_channels == -1: dim_head = ch // num_heads else: num_heads = ch // num_head_channels dim_head = num_head_channels if legacy: #num_heads = 1 dim_head = ch // num_heads if use_spatial_transformer else num_head_channels if exists(disable_self_attentions): disabled_sa = disable_self_attentions[level] else: disabled_sa = False if not exists(num_attention_blocks) or i < num_attention_blocks[level]: layers.append( AttentionBlock( ch, use_checkpoint=use_checkpoint, num_heads=num_heads_upsample, num_head_channels=dim_head, use_new_attention_order=use_new_attention_order, ) if not use_spatial_transformer else SpatialTransformer( ch, num_heads, dim_head, depth=transformer_depth, context_dim=context_dim, disable_self_attn=disabled_sa, use_linear=use_linear_in_transformer, use_checkpoint=use_checkpoint ) ) if level and i == self.num_res_blocks[level]: out_ch = ch layers.append( ResBlock( ch, time_embed_dim, dropout, out_channels=out_ch, dims=dims, use_checkpoint=use_checkpoint, use_scale_shift_norm=use_scale_shift_norm, up=True, ) if resblock_updown else Upsample(ch, conv_resample, dims=dims, out_channels=out_ch) ) ds //= 2 self.output_blocks.append(TimestepEmbedSequential(*layers)) self._feature_size += ch self.out = nn.Sequential( normalization(ch), nn.SiLU(), zero_module(conv_nd(dims, model_channels, out_channels, 3, padding=1)), ) if self.predict_codebook_ids: self.id_predictor = nn.Sequential( normalization(ch), conv_nd(dims, model_channels, n_embed, 1), #nn.LogSoftmax(dim=1) # change to cross_entropy and produce non-normalized logits ) def convert_to_fp16(self): """ Convert the torso of the model to float16. """ self.input_blocks.apply(convert_module_to_f16) self.middle_block.apply(convert_module_to_f16) self.output_blocks.apply(convert_module_to_f16) def convert_to_fp32(self): """ Convert the torso of the model to float32. """ self.input_blocks.apply(convert_module_to_f32) self.middle_block.apply(convert_module_to_f32) self.output_blocks.apply(convert_module_to_f32) def forward(self, x, timesteps=None, context=None, y=None,**kwargs): """ Apply the model to an input batch. :param x: an [N x C x ...] Tensor of inputs. :param timesteps: a 1-D batch of timesteps. :param context: conditioning plugged in via crossattn :param y: an [N] Tensor of labels, if class-conditional. :return: an [N x C x ...] Tensor of outputs. """ assert (y is not None) == ( self.num_classes is not None ), "must specify y if and only if the model is class-conditional" hs = [] t_emb = timestep_embedding(timesteps, self.model_channels, repeat_only=False) emb = self.time_embed(t_emb) if self.num_classes is not None: assert y.shape[0] == x.shape[0] emb = emb + self.label_emb(y) h = x.type(self.dtype) for module in self.input_blocks: h = module(h, emb, context) hs.append(h) h = self.middle_block(h, emb, context) for module in self.output_blocks: h = th.cat([h, hs.pop()], dim=1) h = module(h, emb, context) h = h.type(x.dtype) if self.predict_codebook_ids: return self.id_predictor(h) else: return self.out(h) ================================================ FILE: ToonCrafter/ldm/modules/diffusionmodules/upscaling.py ================================================ import torch import torch.nn as nn import numpy as np from functools import partial from ToonCrafter.ldm.modules.diffusionmodules.util import extract_into_tensor, make_beta_schedule from ToonCrafter.ldm.util import default class AbstractLowScaleModel(nn.Module): # for concatenating a downsampled image to the latent representation def __init__(self, noise_schedule_config=None): super(AbstractLowScaleModel, self).__init__() if noise_schedule_config is not None: self.register_schedule(**noise_schedule_config) def register_schedule(self, beta_schedule="linear", timesteps=1000, linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3): betas = make_beta_schedule(beta_schedule, timesteps, linear_start=linear_start, linear_end=linear_end, cosine_s=cosine_s) alphas = 1. - betas alphas_cumprod = np.cumprod(alphas, axis=0) alphas_cumprod_prev = np.append(1., alphas_cumprod[:-1]) timesteps, = betas.shape self.num_timesteps = int(timesteps) self.linear_start = linear_start self.linear_end = linear_end assert alphas_cumprod.shape[0] == self.num_timesteps, 'alphas have to be defined for each timestep' to_torch = partial(torch.tensor, dtype=torch.float32) self.register_buffer('betas', to_torch(betas)) self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod)) self.register_buffer('alphas_cumprod_prev', to_torch(alphas_cumprod_prev)) # calculations for diffusion q(x_t | x_{t-1}) and others self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod))) self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod))) self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod))) self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod))) self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod - 1))) def q_sample(self, x_start, t, noise=None): noise = default(noise, lambda: torch.randn_like(x_start)) return (extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start + extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise) def forward(self, x): return x, None def decode(self, x): return x class SimpleImageConcat(AbstractLowScaleModel): # no noise level conditioning def __init__(self): super(SimpleImageConcat, self).__init__(noise_schedule_config=None) self.max_noise_level = 0 def forward(self, x): # fix to constant noise level return x, torch.zeros(x.shape[0], device=x.device).long() class ImageConcatWithNoiseAugmentation(AbstractLowScaleModel): def __init__(self, noise_schedule_config, max_noise_level=1000, to_cuda=False): super().__init__(noise_schedule_config=noise_schedule_config) self.max_noise_level = max_noise_level def forward(self, x, noise_level=None): if noise_level is None: noise_level = torch.randint(0, self.max_noise_level, (x.shape[0],), device=x.device).long() else: assert isinstance(noise_level, torch.Tensor) z = self.q_sample(x, noise_level) return z, noise_level ================================================ FILE: ToonCrafter/ldm/modules/diffusionmodules/util.py ================================================ # adopted from # https://github.com/openai/improved-diffusion/blob/main/improved_diffusion/gaussian_diffusion.py # and # https://github.com/lucidrains/denoising-diffusion-pytorch/blob/7706bdfc6f527f58d33f84b7b522e61e6e3164b3/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py # and # https://github.com/openai/guided-diffusion/blob/0ba878e517b276c45d1195eb29f6f5f72659a05b/guided_diffusion/nn.py # # thanks! import os import math import torch import torch.nn as nn import numpy as np from einops import repeat from ToonCrafter.ldm.util import instantiate_from_config def make_beta_schedule(schedule, n_timestep, linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3): if schedule == "linear": betas = ( torch.linspace(linear_start ** 0.5, linear_end ** 0.5, n_timestep, dtype=torch.float64) ** 2 ) elif schedule == "cosine": timesteps = ( torch.arange(n_timestep + 1, dtype=torch.float64) / n_timestep + cosine_s ) alphas = timesteps / (1 + cosine_s) * np.pi / 2 alphas = torch.cos(alphas).pow(2) alphas = alphas / alphas[0] betas = 1 - alphas[1:] / alphas[:-1] betas = np.clip(betas, a_min=0, a_max=0.999) elif schedule == "sqrt_linear": betas = torch.linspace(linear_start, linear_end, n_timestep, dtype=torch.float64) elif schedule == "sqrt": betas = torch.linspace(linear_start, linear_end, n_timestep, dtype=torch.float64) ** 0.5 else: raise ValueError(f"schedule '{schedule}' unknown.") return betas.numpy() def make_ddim_timesteps(ddim_discr_method, num_ddim_timesteps, num_ddpm_timesteps, verbose=True): if ddim_discr_method == 'uniform': c = num_ddpm_timesteps // num_ddim_timesteps ddim_timesteps = np.asarray(list(range(0, num_ddpm_timesteps, c))) elif ddim_discr_method == 'quad': ddim_timesteps = ((np.linspace(0, np.sqrt(num_ddpm_timesteps * .8), num_ddim_timesteps)) ** 2).astype(int) else: raise NotImplementedError(f'There is no ddim discretization method called "{ddim_discr_method}"') # assert ddim_timesteps.shape[0] == num_ddim_timesteps # add one to get the final alpha values right (the ones from first scale to data during sampling) steps_out = ddim_timesteps + 1 if verbose: print(f'Selected timesteps for ddim sampler: {steps_out}') return steps_out def make_ddim_sampling_parameters(alphacums, ddim_timesteps, eta, verbose=True): # select alphas for computing the variance schedule alphas = alphacums[ddim_timesteps] alphas_prev = np.asarray([alphacums[0]] + alphacums[ddim_timesteps[:-1]].tolist()) # according the the formula provided in https://arxiv.org/abs/2010.02502 sigmas = eta * np.sqrt((1 - alphas_prev) / (1 - alphas) * (1 - alphas / alphas_prev)) if verbose: print(f'Selected alphas for ddim sampler: a_t: {alphas}; a_(t-1): {alphas_prev}') print(f'For the chosen value of eta, which is {eta}, ' f'this results in the following sigma_t schedule for ddim sampler {sigmas}') return sigmas, alphas, alphas_prev def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.999): """ Create a beta schedule that discretizes the given alpha_t_bar function, which defines the cumulative product of (1-beta) over time from t = [0,1]. :param num_diffusion_timesteps: the number of betas to produce. :param alpha_bar: a lambda that takes an argument t from 0 to 1 and produces the cumulative product of (1-beta) up to that part of the diffusion process. :param max_beta: the maximum beta to use; use values lower than 1 to prevent singularities. """ betas = [] for i in range(num_diffusion_timesteps): t1 = i / num_diffusion_timesteps t2 = (i + 1) / num_diffusion_timesteps betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta)) return np.array(betas) def extract_into_tensor(a, t, x_shape): b, *_ = t.shape out = a.gather(-1, t) return out.reshape(b, *((1,) * (len(x_shape) - 1))) def checkpoint(func, inputs, params, flag): """ Evaluate a function without caching intermediate activations, allowing for reduced memory at the expense of extra compute in the backward pass. :param func: the function to evaluate. :param inputs: the argument sequence to pass to `func`. :param params: a sequence of parameters `func` depends on but does not explicitly take as arguments. :param flag: if False, disable gradient checkpointing. """ if flag: args = tuple(inputs) + tuple(params) return CheckpointFunction.apply(func, len(inputs), *args) else: return func(*inputs) class CheckpointFunction(torch.autograd.Function): @staticmethod def forward(ctx, run_function, length, *args): ctx.run_function = run_function ctx.input_tensors = list(args[:length]) ctx.input_params = list(args[length:]) ctx.gpu_autocast_kwargs = {"enabled": torch.is_autocast_enabled(), "dtype": torch.get_autocast_gpu_dtype(), "cache_enabled": torch.is_autocast_cache_enabled()} with torch.no_grad(): output_tensors = ctx.run_function(*ctx.input_tensors) return output_tensors @staticmethod def backward(ctx, *output_grads): ctx.input_tensors = [x.detach().requires_grad_(True) for x in ctx.input_tensors] with torch.enable_grad(), \ torch.cuda.amp.autocast(**ctx.gpu_autocast_kwargs): # Fixes a bug where the first op in run_function modifies the # Tensor storage in place, which is not allowed for detach()'d # Tensors. shallow_copies = [x.view_as(x) for x in ctx.input_tensors] output_tensors = ctx.run_function(*shallow_copies) input_grads = torch.autograd.grad( output_tensors, ctx.input_tensors + ctx.input_params, output_grads, allow_unused=True, ) del ctx.input_tensors del ctx.input_params del output_tensors return (None, None) + input_grads def timestep_embedding(timesteps, dim, max_period=10000, repeat_only=False): """ Create sinusoidal timestep embeddings. :param timesteps: a 1-D Tensor of N indices, one per batch element. These may be fractional. :param dim: the dimension of the output. :param max_period: controls the minimum frequency of the embeddings. :return: an [N x dim] Tensor of positional embeddings. """ if not repeat_only: half = dim // 2 freqs = torch.exp( -math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half ).to(device=timesteps.device) args = timesteps[:, None].float() * freqs[None] embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1) if dim % 2: embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1) else: embedding = repeat(timesteps, 'b -> b d', d=dim) return embedding def zero_module(module): """ Zero out the parameters of a module and return it. """ for p in module.parameters(): p.detach().zero_() return module def scale_module(module, scale): """ Scale the parameters of a module and return it. """ for p in module.parameters(): p.detach().mul_(scale) return module def mean_flat(tensor): """ Take the mean over all non-batch dimensions. """ return tensor.mean(dim=list(range(1, len(tensor.shape)))) def normalization(channels): """ Make a standard normalization layer. :param channels: number of input channels. :return: an nn.Module for normalization. """ return GroupNorm32(32, channels) # PyTorch 1.7 has SiLU, but we support PyTorch 1.5. class SiLU(nn.Module): def forward(self, x): return x * torch.sigmoid(x) class GroupNorm32(nn.GroupNorm): def forward(self, x): return super().forward(x.float()).type(x.dtype) def conv_nd(dims, *args, **kwargs): """ Create a 1D, 2D, or 3D convolution module. """ if dims == 1: return nn.Conv1d(*args, **kwargs) elif dims == 2: return nn.Conv2d(*args, **kwargs) elif dims == 3: return nn.Conv3d(*args, **kwargs) raise ValueError(f"unsupported dimensions: {dims}") def linear(*args, **kwargs): """ Create a linear module. """ return nn.Linear(*args, **kwargs) def avg_pool_nd(dims, *args, **kwargs): """ Create a 1D, 2D, or 3D average pooling module. """ if dims == 1: return nn.AvgPool1d(*args, **kwargs) elif dims == 2: return nn.AvgPool2d(*args, **kwargs) elif dims == 3: return nn.AvgPool3d(*args, **kwargs) raise ValueError(f"unsupported dimensions: {dims}") class HybridConditioner(nn.Module): def __init__(self, c_concat_config, c_crossattn_config): super().__init__() self.concat_conditioner = instantiate_from_config(c_concat_config) self.crossattn_conditioner = instantiate_from_config(c_crossattn_config) def forward(self, c_concat, c_crossattn): c_concat = self.concat_conditioner(c_concat) c_crossattn = self.crossattn_conditioner(c_crossattn) return {'c_concat': [c_concat], 'c_crossattn': [c_crossattn]} def noise_like(shape, device, repeat=False): repeat_noise = lambda: torch.randn((1, *shape[1:]), device=device).repeat(shape[0], *((1,) * (len(shape) - 1))) noise = lambda: torch.randn(shape, device=device) return repeat_noise() if repeat else noise() ================================================ FILE: ToonCrafter/ldm/modules/distributions/__init__.py ================================================ ================================================ FILE: ToonCrafter/ldm/modules/distributions/distributions.py ================================================ import torch import numpy as np class AbstractDistribution: def sample(self): raise NotImplementedError() def mode(self): raise NotImplementedError() class DiracDistribution(AbstractDistribution): def __init__(self, value): self.value = value def sample(self): return self.value def mode(self): return self.value class DiagonalGaussianDistribution(object): def __init__(self, parameters, deterministic=False): self.parameters = parameters self.mean, self.logvar = torch.chunk(parameters, 2, dim=1) self.logvar = torch.clamp(self.logvar, -30.0, 20.0) self.deterministic = deterministic self.std = torch.exp(0.5 * self.logvar) self.var = torch.exp(self.logvar) if self.deterministic: self.var = self.std = torch.zeros_like(self.mean).to(device=self.parameters.device) def sample(self): x = self.mean + self.std * torch.randn(self.mean.shape).to(device=self.parameters.device) return x def kl(self, other=None): if self.deterministic: return torch.Tensor([0.]) else: if other is None: return 0.5 * torch.sum(torch.pow(self.mean, 2) + self.var - 1.0 - self.logvar, dim=[1, 2, 3]) else: return 0.5 * torch.sum( torch.pow(self.mean - other.mean, 2) / other.var + self.var / other.var - 1.0 - self.logvar + other.logvar, dim=[1, 2, 3]) def nll(self, sample, dims=[1,2,3]): if self.deterministic: return torch.Tensor([0.]) logtwopi = np.log(2.0 * np.pi) return 0.5 * torch.sum( logtwopi + self.logvar + torch.pow(sample - self.mean, 2) / self.var, dim=dims) def mode(self): return self.mean def normal_kl(mean1, logvar1, mean2, logvar2): """ source: https://github.com/openai/guided-diffusion/blob/27c20a8fab9cb472df5d6bdd6c8d11c8f430b924/guided_diffusion/losses.py#L12 Compute the KL divergence between two gaussians. Shapes are automatically broadcasted, so batches can be compared to scalars, among other use cases. """ tensor = None for obj in (mean1, logvar1, mean2, logvar2): if isinstance(obj, torch.Tensor): tensor = obj break assert tensor is not None, "at least one argument must be a Tensor" # Force variances to be Tensors. Broadcasting helps convert scalars to # Tensors, but it does not work for torch.exp(). logvar1, logvar2 = [ x if isinstance(x, torch.Tensor) else torch.tensor(x).to(tensor) for x in (logvar1, logvar2) ] return 0.5 * ( -1.0 + logvar2 - logvar1 + torch.exp(logvar1 - logvar2) + ((mean1 - mean2) ** 2) * torch.exp(-logvar2) ) ================================================ FILE: ToonCrafter/ldm/modules/ema.py ================================================ import torch from torch import nn class LitEma(nn.Module): def __init__(self, model, decay=0.9999, use_num_upates=True): super().__init__() if decay < 0.0 or decay > 1.0: raise ValueError('Decay must be between 0 and 1') self.m_name2s_name = {} self.register_buffer('decay', torch.tensor(decay, dtype=torch.float32)) self.register_buffer('num_updates', torch.tensor(0, dtype=torch.int) if use_num_upates else torch.tensor(-1, dtype=torch.int)) for name, p in model.named_parameters(): if p.requires_grad: # remove as '.'-character is not allowed in buffers s_name = name.replace('.', '') self.m_name2s_name.update({name: s_name}) self.register_buffer(s_name, p.clone().detach().data) self.collected_params = [] def reset_num_updates(self): del self.num_updates self.register_buffer('num_updates', torch.tensor(0, dtype=torch.int)) def forward(self, model): decay = self.decay if self.num_updates >= 0: self.num_updates += 1 decay = min(self.decay, (1 + self.num_updates) / (10 + self.num_updates)) one_minus_decay = 1.0 - decay with torch.no_grad(): m_param = dict(model.named_parameters()) shadow_params = dict(self.named_buffers()) for key in m_param: if m_param[key].requires_grad: sname = self.m_name2s_name[key] shadow_params[sname] = shadow_params[sname].type_as(m_param[key]) shadow_params[sname].sub_(one_minus_decay * (shadow_params[sname] - m_param[key])) else: assert not key in self.m_name2s_name def copy_to(self, model): m_param = dict(model.named_parameters()) shadow_params = dict(self.named_buffers()) for key in m_param: if m_param[key].requires_grad: m_param[key].data.copy_(shadow_params[self.m_name2s_name[key]].data) else: assert not key in self.m_name2s_name def store(self, parameters): """ Save the current parameters for restoring later. Args: parameters: Iterable of `torch.nn.Parameter`; the parameters to be temporarily stored. """ self.collected_params = [param.clone() for param in parameters] def restore(self, parameters): """ Restore the parameters stored with the `store` method. Useful to validate the model with EMA parameters without affecting the original optimization process. Store the parameters before the `copy_to` method. After validation (or model saving), use this to restore the former parameters. Args: parameters: Iterable of `torch.nn.Parameter`; the parameters to be updated with the stored parameters. """ for c_param, param in zip(self.collected_params, parameters): param.data.copy_(c_param.data) ================================================ FILE: ToonCrafter/ldm/modules/encoders/__init__.py ================================================ ================================================ FILE: ToonCrafter/ldm/modules/encoders/modules.py ================================================ import torch import torch.nn as nn from torch.utils.checkpoint import checkpoint from transformers import T5Tokenizer, T5EncoderModel, CLIPTokenizer, CLIPTextModel import open_clip from ToonCrafter.ldm.util import default, count_params class AbstractEncoder(nn.Module): def __init__(self): super().__init__() def encode(self, *args, **kwargs): raise NotImplementedError class IdentityEncoder(AbstractEncoder): def encode(self, x): return x class ClassEmbedder(nn.Module): def __init__(self, embed_dim, n_classes=1000, key='class', ucg_rate=0.1): super().__init__() self.key = key self.embedding = nn.Embedding(n_classes, embed_dim) self.n_classes = n_classes self.ucg_rate = ucg_rate def forward(self, batch, key=None, disable_dropout=False): if key is None: key = self.key # this is for use in crossattn c = batch[key][:, None] if self.ucg_rate > 0. and not disable_dropout: mask = 1. - torch.bernoulli(torch.ones_like(c) * self.ucg_rate) c = mask * c + (1-mask) * torch.ones_like(c)*(self.n_classes-1) c = c.long() c = self.embedding(c) return c def get_unconditional_conditioning(self, bs, device="cuda"): uc_class = self.n_classes - 1 # 1000 classes --> 0 ... 999, one extra class for ucg (class 1000) uc = torch.ones((bs,), device=device) * uc_class uc = {self.key: uc} return uc def disabled_train(self, mode=True): """Overwrite model.train with this function to make sure train/eval mode does not change anymore.""" return self class FrozenT5Embedder(AbstractEncoder): """Uses the T5 transformer encoder for text""" def __init__(self, version="google/t5-v1_1-large", device="cuda", max_length=77, freeze=True): # others are google/t5-v1_1-xl and google/t5-v1_1-xxl super().__init__() self.tokenizer = T5Tokenizer.from_pretrained(version) self.transformer = T5EncoderModel.from_pretrained(version) self.device = device self.max_length = max_length # TODO: typical value? if freeze: self.freeze() def freeze(self): self.transformer = self.transformer.eval() #self.train = disabled_train for param in self.parameters(): param.requires_grad = False def forward(self, text): batch_encoding = self.tokenizer(text, truncation=True, max_length=self.max_length, return_length=True, return_overflowing_tokens=False, padding="max_length", return_tensors="pt") tokens = batch_encoding["input_ids"].to(self.device) outputs = self.transformer(input_ids=tokens) z = outputs.last_hidden_state return z def encode(self, text): return self(text) class FrozenCLIPEmbedder(AbstractEncoder): """Uses the CLIP transformer encoder for text (from huggingface)""" LAYERS = [ "last", "pooled", "hidden" ] def __init__(self, version="openai/clip-vit-large-patch14", device="cuda", max_length=77, freeze=True, layer="last", layer_idx=None): # clip-vit-base-patch32 super().__init__() assert layer in self.LAYERS self.tokenizer = CLIPTokenizer.from_pretrained(version) self.transformer = CLIPTextModel.from_pretrained(version) self.device = device self.max_length = max_length if freeze: self.freeze() self.layer = layer self.layer_idx = layer_idx if layer == "hidden": assert layer_idx is not None assert 0 <= abs(layer_idx) <= 12 def freeze(self): self.transformer = self.transformer.eval() #self.train = disabled_train for param in self.parameters(): param.requires_grad = False def forward(self, text): batch_encoding = self.tokenizer(text, truncation=True, max_length=self.max_length, return_length=True, return_overflowing_tokens=False, padding="max_length", return_tensors="pt") tokens = batch_encoding["input_ids"].to(self.device) outputs = self.transformer(input_ids=tokens, output_hidden_states=self.layer=="hidden") if self.layer == "last": z = outputs.last_hidden_state elif self.layer == "pooled": z = outputs.pooler_output[:, None, :] else: z = outputs.hidden_states[self.layer_idx] return z def encode(self, text): return self(text) class FrozenOpenCLIPEmbedder(AbstractEncoder): """ Uses the OpenCLIP transformer encoder for text """ LAYERS = [ #"pooled", "last", "penultimate" ] def __init__(self, arch="ViT-H-14", version="laion2b_s32b_b79k", device="cuda", max_length=77, freeze=True, layer="last"): super().__init__() assert layer in self.LAYERS model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'), pretrained=version) del model.visual self.model = model self.device = device self.max_length = max_length if freeze: self.freeze() self.layer = layer if self.layer == "last": self.layer_idx = 0 elif self.layer == "penultimate": self.layer_idx = 1 else: raise NotImplementedError() def freeze(self): self.model = self.model.eval() for param in self.parameters(): param.requires_grad = False def forward(self, text): tokens = open_clip.tokenize(text) z = self.encode_with_transformer(tokens.to(self.device)) return z def encode_with_transformer(self, text): x = self.model.token_embedding(text) # [batch_size, n_ctx, d_model] x = x + self.model.positional_embedding x = x.permute(1, 0, 2) # NLD -> LND x = self.text_transformer_forward(x, attn_mask=self.model.attn_mask) x = x.permute(1, 0, 2) # LND -> NLD x = self.model.ln_final(x) return x def text_transformer_forward(self, x: torch.Tensor, attn_mask = None): for i, r in enumerate(self.model.transformer.resblocks): if i == len(self.model.transformer.resblocks) - self.layer_idx: break if self.model.transformer.grad_checkpointing and not torch.jit.is_scripting(): x = checkpoint(r, x, attn_mask) else: x = r(x, attn_mask=attn_mask) return x def encode(self, text): return self(text) class FrozenCLIPT5Encoder(AbstractEncoder): def __init__(self, clip_version="openai/clip-vit-large-patch14", t5_version="google/t5-v1_1-xl", device="cuda", clip_max_length=77, t5_max_length=77): super().__init__() self.clip_encoder = FrozenCLIPEmbedder(clip_version, device, max_length=clip_max_length) self.t5_encoder = FrozenT5Embedder(t5_version, device, max_length=t5_max_length) print(f"{self.clip_encoder.__class__.__name__} has {count_params(self.clip_encoder)*1.e-6:.2f} M parameters, " f"{self.t5_encoder.__class__.__name__} comes with {count_params(self.t5_encoder)*1.e-6:.2f} M params.") def encode(self, text): return self(text) def forward(self, text): clip_z = self.clip_encoder.encode(text) t5_z = self.t5_encoder.encode(text) return [clip_z, t5_z] ================================================ FILE: ToonCrafter/ldm/modules/image_degradation/__init__.py ================================================ from ToonCrafter.ldm.modules.image_degradation.bsrgan import degradation_bsrgan_variant as degradation_fn_bsr from ToonCrafter.ldm.modules.image_degradation.bsrgan_light import degradation_bsrgan_variant as degradation_fn_bsr_light ================================================ FILE: ToonCrafter/ldm/modules/image_degradation/bsrgan.py ================================================ # -*- coding: utf-8 -*- """ # -------------------------------------------- # Super-Resolution # -------------------------------------------- # # Kai Zhang (cskaizhang@gmail.com) # https://github.com/cszn # From 2019/03--2021/08 # -------------------------------------------- """ import numpy as np import cv2 import torch from functools import partial import random from scipy import ndimage import scipy import scipy.stats as ss from scipy.interpolate import interp2d from scipy.linalg import orth import albumentations import ldm.modules.image_degradation.utils_image as util def modcrop_np(img, sf): ''' Args: img: numpy image, WxH or WxHxC sf: scale factor Return: cropped image ''' w, h = img.shape[:2] im = np.copy(img) return im[:w - w % sf, :h - h % sf, ...] """ # -------------------------------------------- # anisotropic Gaussian kernels # -------------------------------------------- """ def analytic_kernel(k): """Calculate the X4 kernel from the X2 kernel (for proof see appendix in paper)""" k_size = k.shape[0] # Calculate the big kernels size big_k = np.zeros((3 * k_size - 2, 3 * k_size - 2)) # Loop over the small kernel to fill the big one for r in range(k_size): for c in range(k_size): big_k[2 * r:2 * r + k_size, 2 * c:2 * c + k_size] += k[r, c] * k # Crop the edges of the big kernel to ignore very small values and increase run time of SR crop = k_size // 2 cropped_big_k = big_k[crop:-crop, crop:-crop] # Normalize to 1 return cropped_big_k / cropped_big_k.sum() def anisotropic_Gaussian(ksize=15, theta=np.pi, l1=6, l2=6): """ generate an anisotropic Gaussian kernel Args: ksize : e.g., 15, kernel size theta : [0, pi], rotation angle range l1 : [0.1,50], scaling of eigenvalues l2 : [0.1,l1], scaling of eigenvalues If l1 = l2, will get an isotropic Gaussian kernel. Returns: k : kernel """ v = np.dot(np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]]), np.array([1., 0.])) V = np.array([[v[0], v[1]], [v[1], -v[0]]]) D = np.array([[l1, 0], [0, l2]]) Sigma = np.dot(np.dot(V, D), np.linalg.inv(V)) k = gm_blur_kernel(mean=[0, 0], cov=Sigma, size=ksize) return k def gm_blur_kernel(mean, cov, size=15): center = size / 2.0 + 0.5 k = np.zeros([size, size]) for y in range(size): for x in range(size): cy = y - center + 1 cx = x - center + 1 k[y, x] = ss.multivariate_normal.pdf([cx, cy], mean=mean, cov=cov) k = k / np.sum(k) return k def shift_pixel(x, sf, upper_left=True): """shift pixel for super-resolution with different scale factors Args: x: WxHxC or WxH sf: scale factor upper_left: shift direction """ h, w = x.shape[:2] shift = (sf - 1) * 0.5 xv, yv = np.arange(0, w, 1.0), np.arange(0, h, 1.0) if upper_left: x1 = xv + shift y1 = yv + shift else: x1 = xv - shift y1 = yv - shift x1 = np.clip(x1, 0, w - 1) y1 = np.clip(y1, 0, h - 1) if x.ndim == 2: x = interp2d(xv, yv, x)(x1, y1) if x.ndim == 3: for i in range(x.shape[-1]): x[:, :, i] = interp2d(xv, yv, x[:, :, i])(x1, y1) return x def blur(x, k): ''' x: image, NxcxHxW k: kernel, Nx1xhxw ''' n, c = x.shape[:2] p1, p2 = (k.shape[-2] - 1) // 2, (k.shape[-1] - 1) // 2 x = torch.nn.functional.pad(x, pad=(p1, p2, p1, p2), mode='replicate') k = k.repeat(1, c, 1, 1) k = k.view(-1, 1, k.shape[2], k.shape[3]) x = x.view(1, -1, x.shape[2], x.shape[3]) x = torch.nn.functional.conv2d(x, k, bias=None, stride=1, padding=0, groups=n * c) x = x.view(n, c, x.shape[2], x.shape[3]) return x def gen_kernel(k_size=np.array([15, 15]), scale_factor=np.array([4, 4]), min_var=0.6, max_var=10., noise_level=0): """" # modified version of https://github.com/assafshocher/BlindSR_dataset_generator # Kai Zhang # min_var = 0.175 * sf # variance of the gaussian kernel will be sampled between min_var and max_var # max_var = 2.5 * sf """ # Set random eigen-vals (lambdas) and angle (theta) for COV matrix lambda_1 = min_var + np.random.rand() * (max_var - min_var) lambda_2 = min_var + np.random.rand() * (max_var - min_var) theta = np.random.rand() * np.pi # random theta noise = -noise_level + np.random.rand(*k_size) * noise_level * 2 # Set COV matrix using Lambdas and Theta LAMBDA = np.diag([lambda_1, lambda_2]) Q = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]]) SIGMA = Q @ LAMBDA @ Q.T INV_SIGMA = np.linalg.inv(SIGMA)[None, None, :, :] # Set expectation position (shifting kernel for aligned image) MU = k_size // 2 - 0.5 * (scale_factor - 1) # - 0.5 * (scale_factor - k_size % 2) MU = MU[None, None, :, None] # Create meshgrid for Gaussian [X, Y] = np.meshgrid(range(k_size[0]), range(k_size[1])) Z = np.stack([X, Y], 2)[:, :, :, None] # Calcualte Gaussian for every pixel of the kernel ZZ = Z - MU ZZ_t = ZZ.transpose(0, 1, 3, 2) raw_kernel = np.exp(-0.5 * np.squeeze(ZZ_t @ INV_SIGMA @ ZZ)) * (1 + noise) # shift the kernel so it will be centered # raw_kernel_centered = kernel_shift(raw_kernel, scale_factor) # Normalize the kernel and return # kernel = raw_kernel_centered / np.sum(raw_kernel_centered) kernel = raw_kernel / np.sum(raw_kernel) return kernel def fspecial_gaussian(hsize, sigma): hsize = [hsize, hsize] siz = [(hsize[0] - 1.0) / 2.0, (hsize[1] - 1.0) / 2.0] std = sigma [x, y] = np.meshgrid(np.arange(-siz[1], siz[1] + 1), np.arange(-siz[0], siz[0] + 1)) arg = -(x * x + y * y) / (2 * std * std) h = np.exp(arg) h[h < scipy.finfo(float).eps * h.max()] = 0 sumh = h.sum() if sumh != 0: h = h / sumh return h def fspecial_laplacian(alpha): alpha = max([0, min([alpha, 1])]) h1 = alpha / (alpha + 1) h2 = (1 - alpha) / (alpha + 1) h = [[h1, h2, h1], [h2, -4 / (alpha + 1), h2], [h1, h2, h1]] h = np.array(h) return h def fspecial(filter_type, *args, **kwargs): ''' python code from: https://github.com/ronaldosena/imagens-medicas-2/blob/40171a6c259edec7827a6693a93955de2bd39e76/Aulas/aula_2_-_uniform_filter/matlab_fspecial.py ''' if filter_type == 'gaussian': return fspecial_gaussian(*args, **kwargs) if filter_type == 'laplacian': return fspecial_laplacian(*args, **kwargs) """ # -------------------------------------------- # degradation models # -------------------------------------------- """ def bicubic_degradation(x, sf=3): ''' Args: x: HxWxC image, [0, 1] sf: down-scale factor Return: bicubicly downsampled LR image ''' x = util.imresize_np(x, scale=1 / sf) return x def srmd_degradation(x, k, sf=3): ''' blur + bicubic downsampling Args: x: HxWxC image, [0, 1] k: hxw, double sf: down-scale factor Return: downsampled LR image Reference: @inproceedings{zhang2018learning, title={Learning a single convolutional super-resolution network for multiple degradations}, author={Zhang, Kai and Zuo, Wangmeng and Zhang, Lei}, booktitle={IEEE Conference on Computer Vision and Pattern Recognition}, pages={3262--3271}, year={2018} } ''' x = ndimage.filters.convolve(x, np.expand_dims(k, axis=2), mode='wrap') # 'nearest' | 'mirror' x = bicubic_degradation(x, sf=sf) return x def dpsr_degradation(x, k, sf=3): ''' bicubic downsampling + blur Args: x: HxWxC image, [0, 1] k: hxw, double sf: down-scale factor Return: downsampled LR image Reference: @inproceedings{zhang2019deep, title={Deep Plug-and-Play Super-Resolution for Arbitrary Blur Kernels}, author={Zhang, Kai and Zuo, Wangmeng and Zhang, Lei}, booktitle={IEEE Conference on Computer Vision and Pattern Recognition}, pages={1671--1681}, year={2019} } ''' x = bicubic_degradation(x, sf=sf) x = ndimage.filters.convolve(x, np.expand_dims(k, axis=2), mode='wrap') return x def classical_degradation(x, k, sf=3): ''' blur + downsampling Args: x: HxWxC image, [0, 1]/[0, 255] k: hxw, double sf: down-scale factor Return: downsampled LR image ''' x = ndimage.filters.convolve(x, np.expand_dims(k, axis=2), mode='wrap') # x = filters.correlate(x, np.expand_dims(np.flip(k), axis=2)) st = 0 return x[st::sf, st::sf, ...] def add_sharpening(img, weight=0.5, radius=50, threshold=10): """USM sharpening. borrowed from real-ESRGAN Input image: I; Blurry image: B. 1. K = I + weight * (I - B) 2. Mask = 1 if abs(I - B) > threshold, else: 0 3. Blur mask: 4. Out = Mask * K + (1 - Mask) * I Args: img (Numpy array): Input image, HWC, BGR; float32, [0, 1]. weight (float): Sharp weight. Default: 1. radius (float): Kernel size of Gaussian blur. Default: 50. threshold (int): """ if radius % 2 == 0: radius += 1 blur = cv2.GaussianBlur(img, (radius, radius), 0) residual = img - blur mask = np.abs(residual) * 255 > threshold mask = mask.astype('float32') soft_mask = cv2.GaussianBlur(mask, (radius, radius), 0) K = img + weight * residual K = np.clip(K, 0, 1) return soft_mask * K + (1 - soft_mask) * img def add_blur(img, sf=4): wd2 = 4.0 + sf wd = 2.0 + 0.2 * sf if random.random() < 0.5: l1 = wd2 * random.random() l2 = wd2 * random.random() k = anisotropic_Gaussian(ksize=2 * random.randint(2, 11) + 3, theta=random.random() * np.pi, l1=l1, l2=l2) else: k = fspecial('gaussian', 2 * random.randint(2, 11) + 3, wd * random.random()) img = ndimage.filters.convolve(img, np.expand_dims(k, axis=2), mode='mirror') return img def add_resize(img, sf=4): rnum = np.random.rand() if rnum > 0.8: # up sf1 = random.uniform(1, 2) elif rnum < 0.7: # down sf1 = random.uniform(0.5 / sf, 1) else: sf1 = 1.0 img = cv2.resize(img, (int(sf1 * img.shape[1]), int(sf1 * img.shape[0])), interpolation=random.choice([1, 2, 3])) img = np.clip(img, 0.0, 1.0) return img # def add_Gaussian_noise(img, noise_level1=2, noise_level2=25): # noise_level = random.randint(noise_level1, noise_level2) # rnum = np.random.rand() # if rnum > 0.6: # add color Gaussian noise # img += np.random.normal(0, noise_level / 255.0, img.shape).astype(np.float32) # elif rnum < 0.4: # add grayscale Gaussian noise # img += np.random.normal(0, noise_level / 255.0, (*img.shape[:2], 1)).astype(np.float32) # else: # add noise # L = noise_level2 / 255. # D = np.diag(np.random.rand(3)) # U = orth(np.random.rand(3, 3)) # conv = np.dot(np.dot(np.transpose(U), D), U) # img += np.random.multivariate_normal([0, 0, 0], np.abs(L ** 2 * conv), img.shape[:2]).astype(np.float32) # img = np.clip(img, 0.0, 1.0) # return img def add_Gaussian_noise(img, noise_level1=2, noise_level2=25): noise_level = random.randint(noise_level1, noise_level2) rnum = np.random.rand() if rnum > 0.6: # add color Gaussian noise img = img + np.random.normal(0, noise_level / 255.0, img.shape).astype(np.float32) elif rnum < 0.4: # add grayscale Gaussian noise img = img + np.random.normal(0, noise_level / 255.0, (*img.shape[:2], 1)).astype(np.float32) else: # add noise L = noise_level2 / 255. D = np.diag(np.random.rand(3)) U = orth(np.random.rand(3, 3)) conv = np.dot(np.dot(np.transpose(U), D), U) img = img + np.random.multivariate_normal([0, 0, 0], np.abs(L ** 2 * conv), img.shape[:2]).astype(np.float32) img = np.clip(img, 0.0, 1.0) return img def add_speckle_noise(img, noise_level1=2, noise_level2=25): noise_level = random.randint(noise_level1, noise_level2) img = np.clip(img, 0.0, 1.0) rnum = random.random() if rnum > 0.6: img += img * np.random.normal(0, noise_level / 255.0, img.shape).astype(np.float32) elif rnum < 0.4: img += img * np.random.normal(0, noise_level / 255.0, (*img.shape[:2], 1)).astype(np.float32) else: L = noise_level2 / 255. D = np.diag(np.random.rand(3)) U = orth(np.random.rand(3, 3)) conv = np.dot(np.dot(np.transpose(U), D), U) img += img * np.random.multivariate_normal([0, 0, 0], np.abs(L ** 2 * conv), img.shape[:2]).astype(np.float32) img = np.clip(img, 0.0, 1.0) return img def add_Poisson_noise(img): img = np.clip((img * 255.0).round(), 0, 255) / 255. vals = 10 ** (2 * random.random() + 2.0) # [2, 4] if random.random() < 0.5: img = np.random.poisson(img * vals).astype(np.float32) / vals else: img_gray = np.dot(img[..., :3], [0.299, 0.587, 0.114]) img_gray = np.clip((img_gray * 255.0).round(), 0, 255) / 255. noise_gray = np.random.poisson(img_gray * vals).astype(np.float32) / vals - img_gray img += noise_gray[:, :, np.newaxis] img = np.clip(img, 0.0, 1.0) return img def add_JPEG_noise(img): quality_factor = random.randint(30, 95) img = cv2.cvtColor(util.single2uint(img), cv2.COLOR_RGB2BGR) result, encimg = cv2.imencode('.jpg', img, [int(cv2.IMWRITE_JPEG_QUALITY), quality_factor]) img = cv2.imdecode(encimg, 1) img = cv2.cvtColor(util.uint2single(img), cv2.COLOR_BGR2RGB) return img def random_crop(lq, hq, sf=4, lq_patchsize=64): h, w = lq.shape[:2] rnd_h = random.randint(0, h - lq_patchsize) rnd_w = random.randint(0, w - lq_patchsize) lq = lq[rnd_h:rnd_h + lq_patchsize, rnd_w:rnd_w + lq_patchsize, :] rnd_h_H, rnd_w_H = int(rnd_h * sf), int(rnd_w * sf) hq = hq[rnd_h_H:rnd_h_H + lq_patchsize * sf, rnd_w_H:rnd_w_H + lq_patchsize * sf, :] return lq, hq def degradation_bsrgan(img, sf=4, lq_patchsize=72, isp_model=None): """ This is the degradation model of BSRGAN from the paper "Designing a Practical Degradation Model for Deep Blind Image Super-Resolution" ---------- img: HXWXC, [0, 1], its size should be large than (lq_patchsizexsf)x(lq_patchsizexsf) sf: scale factor isp_model: camera ISP model Returns ------- img: low-quality patch, size: lq_patchsizeXlq_patchsizeXC, range: [0, 1] hq: corresponding high-quality patch, size: (lq_patchsizexsf)X(lq_patchsizexsf)XC, range: [0, 1] """ isp_prob, jpeg_prob, scale2_prob = 0.25, 0.9, 0.25 sf_ori = sf h1, w1 = img.shape[:2] img = img.copy()[:w1 - w1 % sf, :h1 - h1 % sf, ...] # mod crop h, w = img.shape[:2] if h < lq_patchsize * sf or w < lq_patchsize * sf: raise ValueError(f'img size ({h1}X{w1}) is too small!') hq = img.copy() if sf == 4 and random.random() < scale2_prob: # downsample1 if np.random.rand() < 0.5: img = cv2.resize(img, (int(1 / 2 * img.shape[1]), int(1 / 2 * img.shape[0])), interpolation=random.choice([1, 2, 3])) else: img = util.imresize_np(img, 1 / 2, True) img = np.clip(img, 0.0, 1.0) sf = 2 shuffle_order = random.sample(range(7), 7) idx1, idx2 = shuffle_order.index(2), shuffle_order.index(3) if idx1 > idx2: # keep downsample3 last shuffle_order[idx1], shuffle_order[idx2] = shuffle_order[idx2], shuffle_order[idx1] for i in shuffle_order: if i == 0: img = add_blur(img, sf=sf) elif i == 1: img = add_blur(img, sf=sf) elif i == 2: a, b = img.shape[1], img.shape[0] # downsample2 if random.random() < 0.75: sf1 = random.uniform(1, 2 * sf) img = cv2.resize(img, (int(1 / sf1 * img.shape[1]), int(1 / sf1 * img.shape[0])), interpolation=random.choice([1, 2, 3])) else: k = fspecial('gaussian', 25, random.uniform(0.1, 0.6 * sf)) k_shifted = shift_pixel(k, sf) k_shifted = k_shifted / k_shifted.sum() # blur with shifted kernel img = ndimage.filters.convolve(img, np.expand_dims(k_shifted, axis=2), mode='mirror') img = img[0::sf, 0::sf, ...] # nearest downsampling img = np.clip(img, 0.0, 1.0) elif i == 3: # downsample3 img = cv2.resize(img, (int(1 / sf * a), int(1 / sf * b)), interpolation=random.choice([1, 2, 3])) img = np.clip(img, 0.0, 1.0) elif i == 4: # add Gaussian noise img = add_Gaussian_noise(img, noise_level1=2, noise_level2=25) elif i == 5: # add JPEG noise if random.random() < jpeg_prob: img = add_JPEG_noise(img) elif i == 6: # add processed camera sensor noise if random.random() < isp_prob and isp_model is not None: with torch.no_grad(): img, hq = isp_model.forward(img.copy(), hq) # add final JPEG compression noise img = add_JPEG_noise(img) # random crop img, hq = random_crop(img, hq, sf_ori, lq_patchsize) return img, hq # todo no isp_model? def degradation_bsrgan_variant(image, sf=4, isp_model=None): """ This is the degradation model of BSRGAN from the paper "Designing a Practical Degradation Model for Deep Blind Image Super-Resolution" ---------- sf: scale factor isp_model: camera ISP model Returns ------- img: low-quality patch, size: lq_patchsizeXlq_patchsizeXC, range: [0, 1] hq: corresponding high-quality patch, size: (lq_patchsizexsf)X(lq_patchsizexsf)XC, range: [0, 1] """ image = util.uint2single(image) isp_prob, jpeg_prob, scale2_prob = 0.25, 0.9, 0.25 sf_ori = sf h1, w1 = image.shape[:2] image = image.copy()[:w1 - w1 % sf, :h1 - h1 % sf, ...] # mod crop h, w = image.shape[:2] hq = image.copy() if sf == 4 and random.random() < scale2_prob: # downsample1 if np.random.rand() < 0.5: image = cv2.resize(image, (int(1 / 2 * image.shape[1]), int(1 / 2 * image.shape[0])), interpolation=random.choice([1, 2, 3])) else: image = util.imresize_np(image, 1 / 2, True) image = np.clip(image, 0.0, 1.0) sf = 2 shuffle_order = random.sample(range(7), 7) idx1, idx2 = shuffle_order.index(2), shuffle_order.index(3) if idx1 > idx2: # keep downsample3 last shuffle_order[idx1], shuffle_order[idx2] = shuffle_order[idx2], shuffle_order[idx1] for i in shuffle_order: if i == 0: image = add_blur(image, sf=sf) elif i == 1: image = add_blur(image, sf=sf) elif i == 2: a, b = image.shape[1], image.shape[0] # downsample2 if random.random() < 0.75: sf1 = random.uniform(1, 2 * sf) image = cv2.resize(image, (int(1 / sf1 * image.shape[1]), int(1 / sf1 * image.shape[0])), interpolation=random.choice([1, 2, 3])) else: k = fspecial('gaussian', 25, random.uniform(0.1, 0.6 * sf)) k_shifted = shift_pixel(k, sf) k_shifted = k_shifted / k_shifted.sum() # blur with shifted kernel image = ndimage.filters.convolve(image, np.expand_dims(k_shifted, axis=2), mode='mirror') image = image[0::sf, 0::sf, ...] # nearest downsampling image = np.clip(image, 0.0, 1.0) elif i == 3: # downsample3 image = cv2.resize(image, (int(1 / sf * a), int(1 / sf * b)), interpolation=random.choice([1, 2, 3])) image = np.clip(image, 0.0, 1.0) elif i == 4: # add Gaussian noise image = add_Gaussian_noise(image, noise_level1=2, noise_level2=25) elif i == 5: # add JPEG noise if random.random() < jpeg_prob: image = add_JPEG_noise(image) # elif i == 6: # # add processed camera sensor noise # if random.random() < isp_prob and isp_model is not None: # with torch.no_grad(): # img, hq = isp_model.forward(img.copy(), hq) # add final JPEG compression noise image = add_JPEG_noise(image) image = util.single2uint(image) example = {"image":image} return example # TODO incase there is a pickle error one needs to replace a += x with a = a + x in add_speckle_noise etc... def degradation_bsrgan_plus(img, sf=4, shuffle_prob=0.5, use_sharp=True, lq_patchsize=64, isp_model=None): """ This is an extended degradation model by combining the degradation models of BSRGAN and Real-ESRGAN ---------- img: HXWXC, [0, 1], its size should be large than (lq_patchsizexsf)x(lq_patchsizexsf) sf: scale factor use_shuffle: the degradation shuffle use_sharp: sharpening the img Returns ------- img: low-quality patch, size: lq_patchsizeXlq_patchsizeXC, range: [0, 1] hq: corresponding high-quality patch, size: (lq_patchsizexsf)X(lq_patchsizexsf)XC, range: [0, 1] """ h1, w1 = img.shape[:2] img = img.copy()[:w1 - w1 % sf, :h1 - h1 % sf, ...] # mod crop h, w = img.shape[:2] if h < lq_patchsize * sf or w < lq_patchsize * sf: raise ValueError(f'img size ({h1}X{w1}) is too small!') if use_sharp: img = add_sharpening(img) hq = img.copy() if random.random() < shuffle_prob: shuffle_order = random.sample(range(13), 13) else: shuffle_order = list(range(13)) # local shuffle for noise, JPEG is always the last one shuffle_order[2:6] = random.sample(shuffle_order[2:6], len(range(2, 6))) shuffle_order[9:13] = random.sample(shuffle_order[9:13], len(range(9, 13))) poisson_prob, speckle_prob, isp_prob = 0.1, 0.1, 0.1 for i in shuffle_order: if i == 0: img = add_blur(img, sf=sf) elif i == 1: img = add_resize(img, sf=sf) elif i == 2: img = add_Gaussian_noise(img, noise_level1=2, noise_level2=25) elif i == 3: if random.random() < poisson_prob: img = add_Poisson_noise(img) elif i == 4: if random.random() < speckle_prob: img = add_speckle_noise(img) elif i == 5: if random.random() < isp_prob and isp_model is not None: with torch.no_grad(): img, hq = isp_model.forward(img.copy(), hq) elif i == 6: img = add_JPEG_noise(img) elif i == 7: img = add_blur(img, sf=sf) elif i == 8: img = add_resize(img, sf=sf) elif i == 9: img = add_Gaussian_noise(img, noise_level1=2, noise_level2=25) elif i == 10: if random.random() < poisson_prob: img = add_Poisson_noise(img) elif i == 11: if random.random() < speckle_prob: img = add_speckle_noise(img) elif i == 12: if random.random() < isp_prob and isp_model is not None: with torch.no_grad(): img, hq = isp_model.forward(img.copy(), hq) else: print('check the shuffle!') # resize to desired size img = cv2.resize(img, (int(1 / sf * hq.shape[1]), int(1 / sf * hq.shape[0])), interpolation=random.choice([1, 2, 3])) # add final JPEG compression noise img = add_JPEG_noise(img) # random crop img, hq = random_crop(img, hq, sf, lq_patchsize) return img, hq if __name__ == '__main__': print("hey") img = util.imread_uint('utils/test.png', 3) print(img) img = util.uint2single(img) print(img) img = img[:448, :448] h = img.shape[0] // 4 print("resizing to", h) sf = 4 deg_fn = partial(degradation_bsrgan_variant, sf=sf) for i in range(20): print(i) img_lq = deg_fn(img) print(img_lq) img_lq_bicubic = albumentations.SmallestMaxSize(max_size=h, interpolation=cv2.INTER_CUBIC)(image=img)["image"] print(img_lq.shape) print("bicubic", img_lq_bicubic.shape) print(img_hq.shape) lq_nearest = cv2.resize(util.single2uint(img_lq), (int(sf * img_lq.shape[1]), int(sf * img_lq.shape[0])), interpolation=0) lq_bicubic_nearest = cv2.resize(util.single2uint(img_lq_bicubic), (int(sf * img_lq.shape[1]), int(sf * img_lq.shape[0])), interpolation=0) img_concat = np.concatenate([lq_bicubic_nearest, lq_nearest, util.single2uint(img_hq)], axis=1) util.imsave(img_concat, str(i) + '.png') ================================================ FILE: ToonCrafter/ldm/modules/image_degradation/bsrgan_light.py ================================================ # -*- coding: utf-8 -*- import numpy as np import cv2 import torch from functools import partial import random from scipy import ndimage import scipy import scipy.stats as ss from scipy.interpolate import interp2d from scipy.linalg import orth import albumentations import ldm.modules.image_degradation.utils_image as util """ # -------------------------------------------- # Super-Resolution # -------------------------------------------- # # Kai Zhang (cskaizhang@gmail.com) # https://github.com/cszn # From 2019/03--2021/08 # -------------------------------------------- """ def modcrop_np(img, sf): ''' Args: img: numpy image, WxH or WxHxC sf: scale factor Return: cropped image ''' w, h = img.shape[:2] im = np.copy(img) return im[:w - w % sf, :h - h % sf, ...] """ # -------------------------------------------- # anisotropic Gaussian kernels # -------------------------------------------- """ def analytic_kernel(k): """Calculate the X4 kernel from the X2 kernel (for proof see appendix in paper)""" k_size = k.shape[0] # Calculate the big kernels size big_k = np.zeros((3 * k_size - 2, 3 * k_size - 2)) # Loop over the small kernel to fill the big one for r in range(k_size): for c in range(k_size): big_k[2 * r:2 * r + k_size, 2 * c:2 * c + k_size] += k[r, c] * k # Crop the edges of the big kernel to ignore very small values and increase run time of SR crop = k_size // 2 cropped_big_k = big_k[crop:-crop, crop:-crop] # Normalize to 1 return cropped_big_k / cropped_big_k.sum() def anisotropic_Gaussian(ksize=15, theta=np.pi, l1=6, l2=6): """ generate an anisotropic Gaussian kernel Args: ksize : e.g., 15, kernel size theta : [0, pi], rotation angle range l1 : [0.1,50], scaling of eigenvalues l2 : [0.1,l1], scaling of eigenvalues If l1 = l2, will get an isotropic Gaussian kernel. Returns: k : kernel """ v = np.dot(np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]]), np.array([1., 0.])) V = np.array([[v[0], v[1]], [v[1], -v[0]]]) D = np.array([[l1, 0], [0, l2]]) Sigma = np.dot(np.dot(V, D), np.linalg.inv(V)) k = gm_blur_kernel(mean=[0, 0], cov=Sigma, size=ksize) return k def gm_blur_kernel(mean, cov, size=15): center = size / 2.0 + 0.5 k = np.zeros([size, size]) for y in range(size): for x in range(size): cy = y - center + 1 cx = x - center + 1 k[y, x] = ss.multivariate_normal.pdf([cx, cy], mean=mean, cov=cov) k = k / np.sum(k) return k def shift_pixel(x, sf, upper_left=True): """shift pixel for super-resolution with different scale factors Args: x: WxHxC or WxH sf: scale factor upper_left: shift direction """ h, w = x.shape[:2] shift = (sf - 1) * 0.5 xv, yv = np.arange(0, w, 1.0), np.arange(0, h, 1.0) if upper_left: x1 = xv + shift y1 = yv + shift else: x1 = xv - shift y1 = yv - shift x1 = np.clip(x1, 0, w - 1) y1 = np.clip(y1, 0, h - 1) if x.ndim == 2: x = interp2d(xv, yv, x)(x1, y1) if x.ndim == 3: for i in range(x.shape[-1]): x[:, :, i] = interp2d(xv, yv, x[:, :, i])(x1, y1) return x def blur(x, k): ''' x: image, NxcxHxW k: kernel, Nx1xhxw ''' n, c = x.shape[:2] p1, p2 = (k.shape[-2] - 1) // 2, (k.shape[-1] - 1) // 2 x = torch.nn.functional.pad(x, pad=(p1, p2, p1, p2), mode='replicate') k = k.repeat(1, c, 1, 1) k = k.view(-1, 1, k.shape[2], k.shape[3]) x = x.view(1, -1, x.shape[2], x.shape[3]) x = torch.nn.functional.conv2d(x, k, bias=None, stride=1, padding=0, groups=n * c) x = x.view(n, c, x.shape[2], x.shape[3]) return x def gen_kernel(k_size=np.array([15, 15]), scale_factor=np.array([4, 4]), min_var=0.6, max_var=10., noise_level=0): """" # modified version of https://github.com/assafshocher/BlindSR_dataset_generator # Kai Zhang # min_var = 0.175 * sf # variance of the gaussian kernel will be sampled between min_var and max_var # max_var = 2.5 * sf """ # Set random eigen-vals (lambdas) and angle (theta) for COV matrix lambda_1 = min_var + np.random.rand() * (max_var - min_var) lambda_2 = min_var + np.random.rand() * (max_var - min_var) theta = np.random.rand() * np.pi # random theta noise = -noise_level + np.random.rand(*k_size) * noise_level * 2 # Set COV matrix using Lambdas and Theta LAMBDA = np.diag([lambda_1, lambda_2]) Q = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]]) SIGMA = Q @ LAMBDA @ Q.T INV_SIGMA = np.linalg.inv(SIGMA)[None, None, :, :] # Set expectation position (shifting kernel for aligned image) MU = k_size // 2 - 0.5 * (scale_factor - 1) # - 0.5 * (scale_factor - k_size % 2) MU = MU[None, None, :, None] # Create meshgrid for Gaussian [X, Y] = np.meshgrid(range(k_size[0]), range(k_size[1])) Z = np.stack([X, Y], 2)[:, :, :, None] # Calcualte Gaussian for every pixel of the kernel ZZ = Z - MU ZZ_t = ZZ.transpose(0, 1, 3, 2) raw_kernel = np.exp(-0.5 * np.squeeze(ZZ_t @ INV_SIGMA @ ZZ)) * (1 + noise) # shift the kernel so it will be centered # raw_kernel_centered = kernel_shift(raw_kernel, scale_factor) # Normalize the kernel and return # kernel = raw_kernel_centered / np.sum(raw_kernel_centered) kernel = raw_kernel / np.sum(raw_kernel) return kernel def fspecial_gaussian(hsize, sigma): hsize = [hsize, hsize] siz = [(hsize[0] - 1.0) / 2.0, (hsize[1] - 1.0) / 2.0] std = sigma [x, y] = np.meshgrid(np.arange(-siz[1], siz[1] + 1), np.arange(-siz[0], siz[0] + 1)) arg = -(x * x + y * y) / (2 * std * std) h = np.exp(arg) h[h < scipy.finfo(float).eps * h.max()] = 0 sumh = h.sum() if sumh != 0: h = h / sumh return h def fspecial_laplacian(alpha): alpha = max([0, min([alpha, 1])]) h1 = alpha / (alpha + 1) h2 = (1 - alpha) / (alpha + 1) h = [[h1, h2, h1], [h2, -4 / (alpha + 1), h2], [h1, h2, h1]] h = np.array(h) return h def fspecial(filter_type, *args, **kwargs): ''' python code from: https://github.com/ronaldosena/imagens-medicas-2/blob/40171a6c259edec7827a6693a93955de2bd39e76/Aulas/aula_2_-_uniform_filter/matlab_fspecial.py ''' if filter_type == 'gaussian': return fspecial_gaussian(*args, **kwargs) if filter_type == 'laplacian': return fspecial_laplacian(*args, **kwargs) """ # -------------------------------------------- # degradation models # -------------------------------------------- """ def bicubic_degradation(x, sf=3): ''' Args: x: HxWxC image, [0, 1] sf: down-scale factor Return: bicubicly downsampled LR image ''' x = util.imresize_np(x, scale=1 / sf) return x def srmd_degradation(x, k, sf=3): ''' blur + bicubic downsampling Args: x: HxWxC image, [0, 1] k: hxw, double sf: down-scale factor Return: downsampled LR image Reference: @inproceedings{zhang2018learning, title={Learning a single convolutional super-resolution network for multiple degradations}, author={Zhang, Kai and Zuo, Wangmeng and Zhang, Lei}, booktitle={IEEE Conference on Computer Vision and Pattern Recognition}, pages={3262--3271}, year={2018} } ''' x = ndimage.convolve(x, np.expand_dims(k, axis=2), mode='wrap') # 'nearest' | 'mirror' x = bicubic_degradation(x, sf=sf) return x def dpsr_degradation(x, k, sf=3): ''' bicubic downsampling + blur Args: x: HxWxC image, [0, 1] k: hxw, double sf: down-scale factor Return: downsampled LR image Reference: @inproceedings{zhang2019deep, title={Deep Plug-and-Play Super-Resolution for Arbitrary Blur Kernels}, author={Zhang, Kai and Zuo, Wangmeng and Zhang, Lei}, booktitle={IEEE Conference on Computer Vision and Pattern Recognition}, pages={1671--1681}, year={2019} } ''' x = bicubic_degradation(x, sf=sf) x = ndimage.convolve(x, np.expand_dims(k, axis=2), mode='wrap') return x def classical_degradation(x, k, sf=3): ''' blur + downsampling Args: x: HxWxC image, [0, 1]/[0, 255] k: hxw, double sf: down-scale factor Return: downsampled LR image ''' x = ndimage.convolve(x, np.expand_dims(k, axis=2), mode='wrap') # x = filters.correlate(x, np.expand_dims(np.flip(k), axis=2)) st = 0 return x[st::sf, st::sf, ...] def add_sharpening(img, weight=0.5, radius=50, threshold=10): """USM sharpening. borrowed from real-ESRGAN Input image: I; Blurry image: B. 1. K = I + weight * (I - B) 2. Mask = 1 if abs(I - B) > threshold, else: 0 3. Blur mask: 4. Out = Mask * K + (1 - Mask) * I Args: img (Numpy array): Input image, HWC, BGR; float32, [0, 1]. weight (float): Sharp weight. Default: 1. radius (float): Kernel size of Gaussian blur. Default: 50. threshold (int): """ if radius % 2 == 0: radius += 1 blur = cv2.GaussianBlur(img, (radius, radius), 0) residual = img - blur mask = np.abs(residual) * 255 > threshold mask = mask.astype('float32') soft_mask = cv2.GaussianBlur(mask, (radius, radius), 0) K = img + weight * residual K = np.clip(K, 0, 1) return soft_mask * K + (1 - soft_mask) * img def add_blur(img, sf=4): wd2 = 4.0 + sf wd = 2.0 + 0.2 * sf wd2 = wd2/4 wd = wd/4 if random.random() < 0.5: l1 = wd2 * random.random() l2 = wd2 * random.random() k = anisotropic_Gaussian(ksize=random.randint(2, 11) + 3, theta=random.random() * np.pi, l1=l1, l2=l2) else: k = fspecial('gaussian', random.randint(2, 4) + 3, wd * random.random()) img = ndimage.convolve(img, np.expand_dims(k, axis=2), mode='mirror') return img def add_resize(img, sf=4): rnum = np.random.rand() if rnum > 0.8: # up sf1 = random.uniform(1, 2) elif rnum < 0.7: # down sf1 = random.uniform(0.5 / sf, 1) else: sf1 = 1.0 img = cv2.resize(img, (int(sf1 * img.shape[1]), int(sf1 * img.shape[0])), interpolation=random.choice([1, 2, 3])) img = np.clip(img, 0.0, 1.0) return img # def add_Gaussian_noise(img, noise_level1=2, noise_level2=25): # noise_level = random.randint(noise_level1, noise_level2) # rnum = np.random.rand() # if rnum > 0.6: # add color Gaussian noise # img += np.random.normal(0, noise_level / 255.0, img.shape).astype(np.float32) # elif rnum < 0.4: # add grayscale Gaussian noise # img += np.random.normal(0, noise_level / 255.0, (*img.shape[:2], 1)).astype(np.float32) # else: # add noise # L = noise_level2 / 255. # D = np.diag(np.random.rand(3)) # U = orth(np.random.rand(3, 3)) # conv = np.dot(np.dot(np.transpose(U), D), U) # img += np.random.multivariate_normal([0, 0, 0], np.abs(L ** 2 * conv), img.shape[:2]).astype(np.float32) # img = np.clip(img, 0.0, 1.0) # return img def add_Gaussian_noise(img, noise_level1=2, noise_level2=25): noise_level = random.randint(noise_level1, noise_level2) rnum = np.random.rand() if rnum > 0.6: # add color Gaussian noise img = img + np.random.normal(0, noise_level / 255.0, img.shape).astype(np.float32) elif rnum < 0.4: # add grayscale Gaussian noise img = img + np.random.normal(0, noise_level / 255.0, (*img.shape[:2], 1)).astype(np.float32) else: # add noise L = noise_level2 / 255. D = np.diag(np.random.rand(3)) U = orth(np.random.rand(3, 3)) conv = np.dot(np.dot(np.transpose(U), D), U) img = img + np.random.multivariate_normal([0, 0, 0], np.abs(L ** 2 * conv), img.shape[:2]).astype(np.float32) img = np.clip(img, 0.0, 1.0) return img def add_speckle_noise(img, noise_level1=2, noise_level2=25): noise_level = random.randint(noise_level1, noise_level2) img = np.clip(img, 0.0, 1.0) rnum = random.random() if rnum > 0.6: img += img * np.random.normal(0, noise_level / 255.0, img.shape).astype(np.float32) elif rnum < 0.4: img += img * np.random.normal(0, noise_level / 255.0, (*img.shape[:2], 1)).astype(np.float32) else: L = noise_level2 / 255. D = np.diag(np.random.rand(3)) U = orth(np.random.rand(3, 3)) conv = np.dot(np.dot(np.transpose(U), D), U) img += img * np.random.multivariate_normal([0, 0, 0], np.abs(L ** 2 * conv), img.shape[:2]).astype(np.float32) img = np.clip(img, 0.0, 1.0) return img def add_Poisson_noise(img): img = np.clip((img * 255.0).round(), 0, 255) / 255. vals = 10 ** (2 * random.random() + 2.0) # [2, 4] if random.random() < 0.5: img = np.random.poisson(img * vals).astype(np.float32) / vals else: img_gray = np.dot(img[..., :3], [0.299, 0.587, 0.114]) img_gray = np.clip((img_gray * 255.0).round(), 0, 255) / 255. noise_gray = np.random.poisson(img_gray * vals).astype(np.float32) / vals - img_gray img += noise_gray[:, :, np.newaxis] img = np.clip(img, 0.0, 1.0) return img def add_JPEG_noise(img): quality_factor = random.randint(80, 95) img = cv2.cvtColor(util.single2uint(img), cv2.COLOR_RGB2BGR) result, encimg = cv2.imencode('.jpg', img, [int(cv2.IMWRITE_JPEG_QUALITY), quality_factor]) img = cv2.imdecode(encimg, 1) img = cv2.cvtColor(util.uint2single(img), cv2.COLOR_BGR2RGB) return img def random_crop(lq, hq, sf=4, lq_patchsize=64): h, w = lq.shape[:2] rnd_h = random.randint(0, h - lq_patchsize) rnd_w = random.randint(0, w - lq_patchsize) lq = lq[rnd_h:rnd_h + lq_patchsize, rnd_w:rnd_w + lq_patchsize, :] rnd_h_H, rnd_w_H = int(rnd_h * sf), int(rnd_w * sf) hq = hq[rnd_h_H:rnd_h_H + lq_patchsize * sf, rnd_w_H:rnd_w_H + lq_patchsize * sf, :] return lq, hq def degradation_bsrgan(img, sf=4, lq_patchsize=72, isp_model=None): """ This is the degradation model of BSRGAN from the paper "Designing a Practical Degradation Model for Deep Blind Image Super-Resolution" ---------- img: HXWXC, [0, 1], its size should be large than (lq_patchsizexsf)x(lq_patchsizexsf) sf: scale factor isp_model: camera ISP model Returns ------- img: low-quality patch, size: lq_patchsizeXlq_patchsizeXC, range: [0, 1] hq: corresponding high-quality patch, size: (lq_patchsizexsf)X(lq_patchsizexsf)XC, range: [0, 1] """ isp_prob, jpeg_prob, scale2_prob = 0.25, 0.9, 0.25 sf_ori = sf h1, w1 = img.shape[:2] img = img.copy()[:w1 - w1 % sf, :h1 - h1 % sf, ...] # mod crop h, w = img.shape[:2] if h < lq_patchsize * sf or w < lq_patchsize * sf: raise ValueError(f'img size ({h1}X{w1}) is too small!') hq = img.copy() if sf == 4 and random.random() < scale2_prob: # downsample1 if np.random.rand() < 0.5: img = cv2.resize(img, (int(1 / 2 * img.shape[1]), int(1 / 2 * img.shape[0])), interpolation=random.choice([1, 2, 3])) else: img = util.imresize_np(img, 1 / 2, True) img = np.clip(img, 0.0, 1.0) sf = 2 shuffle_order = random.sample(range(7), 7) idx1, idx2 = shuffle_order.index(2), shuffle_order.index(3) if idx1 > idx2: # keep downsample3 last shuffle_order[idx1], shuffle_order[idx2] = shuffle_order[idx2], shuffle_order[idx1] for i in shuffle_order: if i == 0: img = add_blur(img, sf=sf) elif i == 1: img = add_blur(img, sf=sf) elif i == 2: a, b = img.shape[1], img.shape[0] # downsample2 if random.random() < 0.75: sf1 = random.uniform(1, 2 * sf) img = cv2.resize(img, (int(1 / sf1 * img.shape[1]), int(1 / sf1 * img.shape[0])), interpolation=random.choice([1, 2, 3])) else: k = fspecial('gaussian', 25, random.uniform(0.1, 0.6 * sf)) k_shifted = shift_pixel(k, sf) k_shifted = k_shifted / k_shifted.sum() # blur with shifted kernel img = ndimage.convolve(img, np.expand_dims(k_shifted, axis=2), mode='mirror') img = img[0::sf, 0::sf, ...] # nearest downsampling img = np.clip(img, 0.0, 1.0) elif i == 3: # downsample3 img = cv2.resize(img, (int(1 / sf * a), int(1 / sf * b)), interpolation=random.choice([1, 2, 3])) img = np.clip(img, 0.0, 1.0) elif i == 4: # add Gaussian noise img = add_Gaussian_noise(img, noise_level1=2, noise_level2=8) elif i == 5: # add JPEG noise if random.random() < jpeg_prob: img = add_JPEG_noise(img) elif i == 6: # add processed camera sensor noise if random.random() < isp_prob and isp_model is not None: with torch.no_grad(): img, hq = isp_model.forward(img.copy(), hq) # add final JPEG compression noise img = add_JPEG_noise(img) # random crop img, hq = random_crop(img, hq, sf_ori, lq_patchsize) return img, hq # todo no isp_model? def degradation_bsrgan_variant(image, sf=4, isp_model=None, up=False): """ This is the degradation model of BSRGAN from the paper "Designing a Practical Degradation Model for Deep Blind Image Super-Resolution" ---------- sf: scale factor isp_model: camera ISP model Returns ------- img: low-quality patch, size: lq_patchsizeXlq_patchsizeXC, range: [0, 1] hq: corresponding high-quality patch, size: (lq_patchsizexsf)X(lq_patchsizexsf)XC, range: [0, 1] """ image = util.uint2single(image) isp_prob, jpeg_prob, scale2_prob = 0.25, 0.9, 0.25 sf_ori = sf h1, w1 = image.shape[:2] image = image.copy()[:w1 - w1 % sf, :h1 - h1 % sf, ...] # mod crop h, w = image.shape[:2] hq = image.copy() if sf == 4 and random.random() < scale2_prob: # downsample1 if np.random.rand() < 0.5: image = cv2.resize(image, (int(1 / 2 * image.shape[1]), int(1 / 2 * image.shape[0])), interpolation=random.choice([1, 2, 3])) else: image = util.imresize_np(image, 1 / 2, True) image = np.clip(image, 0.0, 1.0) sf = 2 shuffle_order = random.sample(range(7), 7) idx1, idx2 = shuffle_order.index(2), shuffle_order.index(3) if idx1 > idx2: # keep downsample3 last shuffle_order[idx1], shuffle_order[idx2] = shuffle_order[idx2], shuffle_order[idx1] for i in shuffle_order: if i == 0: image = add_blur(image, sf=sf) # elif i == 1: # image = add_blur(image, sf=sf) if i == 0: pass elif i == 2: a, b = image.shape[1], image.shape[0] # downsample2 if random.random() < 0.8: sf1 = random.uniform(1, 2 * sf) image = cv2.resize(image, (int(1 / sf1 * image.shape[1]), int(1 / sf1 * image.shape[0])), interpolation=random.choice([1, 2, 3])) else: k = fspecial('gaussian', 25, random.uniform(0.1, 0.6 * sf)) k_shifted = shift_pixel(k, sf) k_shifted = k_shifted / k_shifted.sum() # blur with shifted kernel image = ndimage.convolve(image, np.expand_dims(k_shifted, axis=2), mode='mirror') image = image[0::sf, 0::sf, ...] # nearest downsampling image = np.clip(image, 0.0, 1.0) elif i == 3: # downsample3 image = cv2.resize(image, (int(1 / sf * a), int(1 / sf * b)), interpolation=random.choice([1, 2, 3])) image = np.clip(image, 0.0, 1.0) elif i == 4: # add Gaussian noise image = add_Gaussian_noise(image, noise_level1=1, noise_level2=2) elif i == 5: # add JPEG noise if random.random() < jpeg_prob: image = add_JPEG_noise(image) # # elif i == 6: # # add processed camera sensor noise # if random.random() < isp_prob and isp_model is not None: # with torch.no_grad(): # img, hq = isp_model.forward(img.copy(), hq) # add final JPEG compression noise image = add_JPEG_noise(image) image = util.single2uint(image) if up: image = cv2.resize(image, (w1, h1), interpolation=cv2.INTER_CUBIC) # todo: random, as above? want to condition on it then example = {"image": image} return example if __name__ == '__main__': print("hey") img = util.imread_uint('utils/test.png', 3) img = img[:448, :448] h = img.shape[0] // 4 print("resizing to", h) sf = 4 deg_fn = partial(degradation_bsrgan_variant, sf=sf) for i in range(20): print(i) img_hq = img img_lq = deg_fn(img)["image"] img_hq, img_lq = util.uint2single(img_hq), util.uint2single(img_lq) print(img_lq) img_lq_bicubic = albumentations.SmallestMaxSize(max_size=h, interpolation=cv2.INTER_CUBIC)(image=img_hq)["image"] print(img_lq.shape) print("bicubic", img_lq_bicubic.shape) print(img_hq.shape) lq_nearest = cv2.resize(util.single2uint(img_lq), (int(sf * img_lq.shape[1]), int(sf * img_lq.shape[0])), interpolation=0) lq_bicubic_nearest = cv2.resize(util.single2uint(img_lq_bicubic), (int(sf * img_lq.shape[1]), int(sf * img_lq.shape[0])), interpolation=0) img_concat = np.concatenate([lq_bicubic_nearest, lq_nearest, util.single2uint(img_hq)], axis=1) util.imsave(img_concat, str(i) + '.png') ================================================ FILE: ToonCrafter/ldm/modules/image_degradation/utils_image.py ================================================ import os import math import random import numpy as np import torch import cv2 from torchvision.utils import make_grid from datetime import datetime #import matplotlib.pyplot as plt # TODO: check with Dominik, also bsrgan.py vs bsrgan_light.py os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE" ''' # -------------------------------------------- # Kai Zhang (github: https://github.com/cszn) # 03/Mar/2019 # -------------------------------------------- # https://github.com/twhui/SRGAN-pyTorch # https://github.com/xinntao/BasicSR # -------------------------------------------- ''' IMG_EXTENSIONS = ['.jpg', '.JPG', '.jpeg', '.JPEG', '.png', '.PNG', '.ppm', '.PPM', '.bmp', '.BMP', '.tif'] def is_image_file(filename): return any(filename.endswith(extension) for extension in IMG_EXTENSIONS) def get_timestamp(): return datetime.now().strftime('%y%m%d-%H%M%S') def imshow(x, title=None, cbar=False, figsize=None): plt.figure(figsize=figsize) plt.imshow(np.squeeze(x), interpolation='nearest', cmap='gray') if title: plt.title(title) if cbar: plt.colorbar() plt.show() def surf(Z, cmap='rainbow', figsize=None): plt.figure(figsize=figsize) ax3 = plt.axes(projection='3d') w, h = Z.shape[:2] xx = np.arange(0,w,1) yy = np.arange(0,h,1) X, Y = np.meshgrid(xx, yy) ax3.plot_surface(X,Y,Z,cmap=cmap) #ax3.contour(X,Y,Z, zdim='z',offset=-2,cmap=cmap) plt.show() ''' # -------------------------------------------- # get image pathes # -------------------------------------------- ''' def get_image_paths(dataroot): paths = None # return None if dataroot is None if dataroot is not None: paths = sorted(_get_paths_from_images(dataroot)) return paths def _get_paths_from_images(path): assert os.path.isdir(path), '{:s} is not a valid directory'.format(path) images = [] for dirpath, _, fnames in sorted(os.walk(path)): for fname in sorted(fnames): if is_image_file(fname): img_path = os.path.join(dirpath, fname) images.append(img_path) assert images, '{:s} has no valid image file'.format(path) return images ''' # -------------------------------------------- # split large images into small images # -------------------------------------------- ''' def patches_from_image(img, p_size=512, p_overlap=64, p_max=800): w, h = img.shape[:2] patches = [] if w > p_max and h > p_max: w1 = list(np.arange(0, w-p_size, p_size-p_overlap, dtype=np.int)) h1 = list(np.arange(0, h-p_size, p_size-p_overlap, dtype=np.int)) w1.append(w-p_size) h1.append(h-p_size) # print(w1) # print(h1) for i in w1: for j in h1: patches.append(img[i:i+p_size, j:j+p_size,:]) else: patches.append(img) return patches def imssave(imgs, img_path): """ imgs: list, N images of size WxHxC """ img_name, ext = os.path.splitext(os.path.basename(img_path)) for i, img in enumerate(imgs): if img.ndim == 3: img = img[:, :, [2, 1, 0]] new_path = os.path.join(os.path.dirname(img_path), img_name+str('_s{:04d}'.format(i))+'.png') cv2.imwrite(new_path, img) def split_imageset(original_dataroot, taget_dataroot, n_channels=3, p_size=800, p_overlap=96, p_max=1000): """ split the large images from original_dataroot into small overlapped images with size (p_size)x(p_size), and save them into taget_dataroot; only the images with larger size than (p_max)x(p_max) will be splitted. Args: original_dataroot: taget_dataroot: p_size: size of small images p_overlap: patch size in training is a good choice p_max: images with smaller size than (p_max)x(p_max) keep unchanged. """ paths = get_image_paths(original_dataroot) for img_path in paths: # img_name, ext = os.path.splitext(os.path.basename(img_path)) img = imread_uint(img_path, n_channels=n_channels) patches = patches_from_image(img, p_size, p_overlap, p_max) imssave(patches, os.path.join(taget_dataroot,os.path.basename(img_path))) #if original_dataroot == taget_dataroot: #del img_path ''' # -------------------------------------------- # makedir # -------------------------------------------- ''' def mkdir(path): if not os.path.exists(path): os.makedirs(path) def mkdirs(paths): if isinstance(paths, str): mkdir(paths) else: for path in paths: mkdir(path) def mkdir_and_rename(path): if os.path.exists(path): new_name = path + '_archived_' + get_timestamp() print('Path already exists. Rename it to [{:s}]'.format(new_name)) os.rename(path, new_name) os.makedirs(path) ''' # -------------------------------------------- # read image from path # opencv is fast, but read BGR numpy image # -------------------------------------------- ''' # -------------------------------------------- # get uint8 image of size HxWxn_channles (RGB) # -------------------------------------------- def imread_uint(path, n_channels=3): # input: path # output: HxWx3(RGB or GGG), or HxWx1 (G) if n_channels == 1: img = cv2.imread(path, 0) # cv2.IMREAD_GRAYSCALE img = np.expand_dims(img, axis=2) # HxWx1 elif n_channels == 3: img = cv2.imread(path, cv2.IMREAD_UNCHANGED) # BGR or G if img.ndim == 2: img = cv2.cvtColor(img, cv2.COLOR_GRAY2RGB) # GGG else: img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) # RGB return img # -------------------------------------------- # matlab's imwrite # -------------------------------------------- def imsave(img, img_path): img = np.squeeze(img) if img.ndim == 3: img = img[:, :, [2, 1, 0]] cv2.imwrite(img_path, img) def imwrite(img, img_path): img = np.squeeze(img) if img.ndim == 3: img = img[:, :, [2, 1, 0]] cv2.imwrite(img_path, img) # -------------------------------------------- # get single image of size HxWxn_channles (BGR) # -------------------------------------------- def read_img(path): # read image by cv2 # return: Numpy float32, HWC, BGR, [0,1] img = cv2.imread(path, cv2.IMREAD_UNCHANGED) # cv2.IMREAD_GRAYSCALE img = img.astype(np.float32) / 255. if img.ndim == 2: img = np.expand_dims(img, axis=2) # some images have 4 channels if img.shape[2] > 3: img = img[:, :, :3] return img ''' # -------------------------------------------- # image format conversion # -------------------------------------------- # numpy(single) <---> numpy(unit) # numpy(single) <---> tensor # numpy(unit) <---> tensor # -------------------------------------------- ''' # -------------------------------------------- # numpy(single) [0, 1] <---> numpy(unit) # -------------------------------------------- def uint2single(img): return np.float32(img/255.) def single2uint(img): return np.uint8((img.clip(0, 1)*255.).round()) def uint162single(img): return np.float32(img/65535.) def single2uint16(img): return np.uint16((img.clip(0, 1)*65535.).round()) # -------------------------------------------- # numpy(unit) (HxWxC or HxW) <---> tensor # -------------------------------------------- # convert uint to 4-dimensional torch tensor def uint2tensor4(img): if img.ndim == 2: img = np.expand_dims(img, axis=2) return torch.from_numpy(np.ascontiguousarray(img)).permute(2, 0, 1).float().div(255.).unsqueeze(0) # convert uint to 3-dimensional torch tensor def uint2tensor3(img): if img.ndim == 2: img = np.expand_dims(img, axis=2) return torch.from_numpy(np.ascontiguousarray(img)).permute(2, 0, 1).float().div(255.) # convert 2/3/4-dimensional torch tensor to uint def tensor2uint(img): img = img.data.squeeze().float().clamp_(0, 1).cpu().numpy() if img.ndim == 3: img = np.transpose(img, (1, 2, 0)) return np.uint8((img*255.0).round()) # -------------------------------------------- # numpy(single) (HxWxC) <---> tensor # -------------------------------------------- # convert single (HxWxC) to 3-dimensional torch tensor def single2tensor3(img): return torch.from_numpy(np.ascontiguousarray(img)).permute(2, 0, 1).float() # convert single (HxWxC) to 4-dimensional torch tensor def single2tensor4(img): return torch.from_numpy(np.ascontiguousarray(img)).permute(2, 0, 1).float().unsqueeze(0) # convert torch tensor to single def tensor2single(img): img = img.data.squeeze().float().cpu().numpy() if img.ndim == 3: img = np.transpose(img, (1, 2, 0)) return img # convert torch tensor to single def tensor2single3(img): img = img.data.squeeze().float().cpu().numpy() if img.ndim == 3: img = np.transpose(img, (1, 2, 0)) elif img.ndim == 2: img = np.expand_dims(img, axis=2) return img def single2tensor5(img): return torch.from_numpy(np.ascontiguousarray(img)).permute(2, 0, 1, 3).float().unsqueeze(0) def single32tensor5(img): return torch.from_numpy(np.ascontiguousarray(img)).float().unsqueeze(0).unsqueeze(0) def single42tensor4(img): return torch.from_numpy(np.ascontiguousarray(img)).permute(2, 0, 1, 3).float() # from skimage.io import imread, imsave def tensor2img(tensor, out_type=np.uint8, min_max=(0, 1)): ''' Converts a torch Tensor into an image Numpy array of BGR channel order Input: 4D(B,(3/1),H,W), 3D(C,H,W), or 2D(H,W), any range, RGB channel order Output: 3D(H,W,C) or 2D(H,W), [0,255], np.uint8 (default) ''' tensor = tensor.squeeze().float().cpu().clamp_(*min_max) # squeeze first, then clamp tensor = (tensor - min_max[0]) / (min_max[1] - min_max[0]) # to range [0,1] n_dim = tensor.dim() if n_dim == 4: n_img = len(tensor) img_np = make_grid(tensor, nrow=int(math.sqrt(n_img)), normalize=False).numpy() img_np = np.transpose(img_np[[2, 1, 0], :, :], (1, 2, 0)) # HWC, BGR elif n_dim == 3: img_np = tensor.numpy() img_np = np.transpose(img_np[[2, 1, 0], :, :], (1, 2, 0)) # HWC, BGR elif n_dim == 2: img_np = tensor.numpy() else: raise TypeError( 'Only support 4D, 3D and 2D tensor. But received with dimension: {:d}'.format(n_dim)) if out_type == np.uint8: img_np = (img_np * 255.0).round() # Important. Unlike matlab, numpy.unit8() WILL NOT round by default. return img_np.astype(out_type) ''' # -------------------------------------------- # Augmentation, flipe and/or rotate # -------------------------------------------- # The following two are enough. # (1) augmet_img: numpy image of WxHxC or WxH # (2) augment_img_tensor4: tensor image 1xCxWxH # -------------------------------------------- ''' def augment_img(img, mode=0): '''Kai Zhang (github: https://github.com/cszn) ''' if mode == 0: return img elif mode == 1: return np.flipud(np.rot90(img)) elif mode == 2: return np.flipud(img) elif mode == 3: return np.rot90(img, k=3) elif mode == 4: return np.flipud(np.rot90(img, k=2)) elif mode == 5: return np.rot90(img) elif mode == 6: return np.rot90(img, k=2) elif mode == 7: return np.flipud(np.rot90(img, k=3)) def augment_img_tensor4(img, mode=0): '''Kai Zhang (github: https://github.com/cszn) ''' if mode == 0: return img elif mode == 1: return img.rot90(1, [2, 3]).flip([2]) elif mode == 2: return img.flip([2]) elif mode == 3: return img.rot90(3, [2, 3]) elif mode == 4: return img.rot90(2, [2, 3]).flip([2]) elif mode == 5: return img.rot90(1, [2, 3]) elif mode == 6: return img.rot90(2, [2, 3]) elif mode == 7: return img.rot90(3, [2, 3]).flip([2]) def augment_img_tensor(img, mode=0): '''Kai Zhang (github: https://github.com/cszn) ''' img_size = img.size() img_np = img.data.cpu().numpy() if len(img_size) == 3: img_np = np.transpose(img_np, (1, 2, 0)) elif len(img_size) == 4: img_np = np.transpose(img_np, (2, 3, 1, 0)) img_np = augment_img(img_np, mode=mode) img_tensor = torch.from_numpy(np.ascontiguousarray(img_np)) if len(img_size) == 3: img_tensor = img_tensor.permute(2, 0, 1) elif len(img_size) == 4: img_tensor = img_tensor.permute(3, 2, 0, 1) return img_tensor.type_as(img) def augment_img_np3(img, mode=0): if mode == 0: return img elif mode == 1: return img.transpose(1, 0, 2) elif mode == 2: return img[::-1, :, :] elif mode == 3: img = img[::-1, :, :] img = img.transpose(1, 0, 2) return img elif mode == 4: return img[:, ::-1, :] elif mode == 5: img = img[:, ::-1, :] img = img.transpose(1, 0, 2) return img elif mode == 6: img = img[:, ::-1, :] img = img[::-1, :, :] return img elif mode == 7: img = img[:, ::-1, :] img = img[::-1, :, :] img = img.transpose(1, 0, 2) return img def augment_imgs(img_list, hflip=True, rot=True): # horizontal flip OR rotate hflip = hflip and random.random() < 0.5 vflip = rot and random.random() < 0.5 rot90 = rot and random.random() < 0.5 def _augment(img): if hflip: img = img[:, ::-1, :] if vflip: img = img[::-1, :, :] if rot90: img = img.transpose(1, 0, 2) return img return [_augment(img) for img in img_list] ''' # -------------------------------------------- # modcrop and shave # -------------------------------------------- ''' def modcrop(img_in, scale): # img_in: Numpy, HWC or HW img = np.copy(img_in) if img.ndim == 2: H, W = img.shape H_r, W_r = H % scale, W % scale img = img[:H - H_r, :W - W_r] elif img.ndim == 3: H, W, C = img.shape H_r, W_r = H % scale, W % scale img = img[:H - H_r, :W - W_r, :] else: raise ValueError('Wrong img ndim: [{:d}].'.format(img.ndim)) return img def shave(img_in, border=0): # img_in: Numpy, HWC or HW img = np.copy(img_in) h, w = img.shape[:2] img = img[border:h-border, border:w-border] return img ''' # -------------------------------------------- # image processing process on numpy image # channel_convert(in_c, tar_type, img_list): # rgb2ycbcr(img, only_y=True): # bgr2ycbcr(img, only_y=True): # ycbcr2rgb(img): # -------------------------------------------- ''' def rgb2ycbcr(img, only_y=True): '''same as matlab rgb2ycbcr only_y: only return Y channel Input: uint8, [0, 255] float, [0, 1] ''' in_img_type = img.dtype img.astype(np.float32) if in_img_type != np.uint8: img *= 255. # convert if only_y: rlt = np.dot(img, [65.481, 128.553, 24.966]) / 255.0 + 16.0 else: rlt = np.matmul(img, [[65.481, -37.797, 112.0], [128.553, -74.203, -93.786], [24.966, 112.0, -18.214]]) / 255.0 + [16, 128, 128] if in_img_type == np.uint8: rlt = rlt.round() else: rlt /= 255. return rlt.astype(in_img_type) def ycbcr2rgb(img): '''same as matlab ycbcr2rgb Input: uint8, [0, 255] float, [0, 1] ''' in_img_type = img.dtype img.astype(np.float32) if in_img_type != np.uint8: img *= 255. # convert rlt = np.matmul(img, [[0.00456621, 0.00456621, 0.00456621], [0, -0.00153632, 0.00791071], [0.00625893, -0.00318811, 0]]) * 255.0 + [-222.921, 135.576, -276.836] if in_img_type == np.uint8: rlt = rlt.round() else: rlt /= 255. return rlt.astype(in_img_type) def bgr2ycbcr(img, only_y=True): '''bgr version of rgb2ycbcr only_y: only return Y channel Input: uint8, [0, 255] float, [0, 1] ''' in_img_type = img.dtype img.astype(np.float32) if in_img_type != np.uint8: img *= 255. # convert if only_y: rlt = np.dot(img, [24.966, 128.553, 65.481]) / 255.0 + 16.0 else: rlt = np.matmul(img, [[24.966, 112.0, -18.214], [128.553, -74.203, -93.786], [65.481, -37.797, 112.0]]) / 255.0 + [16, 128, 128] if in_img_type == np.uint8: rlt = rlt.round() else: rlt /= 255. return rlt.astype(in_img_type) def channel_convert(in_c, tar_type, img_list): # conversion among BGR, gray and y if in_c == 3 and tar_type == 'gray': # BGR to gray gray_list = [cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) for img in img_list] return [np.expand_dims(img, axis=2) for img in gray_list] elif in_c == 3 and tar_type == 'y': # BGR to y y_list = [bgr2ycbcr(img, only_y=True) for img in img_list] return [np.expand_dims(img, axis=2) for img in y_list] elif in_c == 1 and tar_type == 'RGB': # gray/y to BGR return [cv2.cvtColor(img, cv2.COLOR_GRAY2BGR) for img in img_list] else: return img_list ''' # -------------------------------------------- # metric, PSNR and SSIM # -------------------------------------------- ''' # -------------------------------------------- # PSNR # -------------------------------------------- def calculate_psnr(img1, img2, border=0): # img1 and img2 have range [0, 255] #img1 = img1.squeeze() #img2 = img2.squeeze() if not img1.shape == img2.shape: raise ValueError('Input images must have the same dimensions.') h, w = img1.shape[:2] img1 = img1[border:h-border, border:w-border] img2 = img2[border:h-border, border:w-border] img1 = img1.astype(np.float64) img2 = img2.astype(np.float64) mse = np.mean((img1 - img2)**2) if mse == 0: return float('inf') return 20 * math.log10(255.0 / math.sqrt(mse)) # -------------------------------------------- # SSIM # -------------------------------------------- def calculate_ssim(img1, img2, border=0): '''calculate SSIM the same outputs as MATLAB's img1, img2: [0, 255] ''' #img1 = img1.squeeze() #img2 = img2.squeeze() if not img1.shape == img2.shape: raise ValueError('Input images must have the same dimensions.') h, w = img1.shape[:2] img1 = img1[border:h-border, border:w-border] img2 = img2[border:h-border, border:w-border] if img1.ndim == 2: return ssim(img1, img2) elif img1.ndim == 3: if img1.shape[2] == 3: ssims = [] for i in range(3): ssims.append(ssim(img1[:,:,i], img2[:,:,i])) return np.array(ssims).mean() elif img1.shape[2] == 1: return ssim(np.squeeze(img1), np.squeeze(img2)) else: raise ValueError('Wrong input image dimensions.') def ssim(img1, img2): C1 = (0.01 * 255)**2 C2 = (0.03 * 255)**2 img1 = img1.astype(np.float64) img2 = img2.astype(np.float64) kernel = cv2.getGaussianKernel(11, 1.5) window = np.outer(kernel, kernel.transpose()) mu1 = cv2.filter2D(img1, -1, window)[5:-5, 5:-5] # valid mu2 = cv2.filter2D(img2, -1, window)[5:-5, 5:-5] mu1_sq = mu1**2 mu2_sq = mu2**2 mu1_mu2 = mu1 * mu2 sigma1_sq = cv2.filter2D(img1**2, -1, window)[5:-5, 5:-5] - mu1_sq sigma2_sq = cv2.filter2D(img2**2, -1, window)[5:-5, 5:-5] - mu2_sq sigma12 = cv2.filter2D(img1 * img2, -1, window)[5:-5, 5:-5] - mu1_mu2 ssim_map = ((2 * mu1_mu2 + C1) * (2 * sigma12 + C2)) / ((mu1_sq + mu2_sq + C1) * (sigma1_sq + sigma2_sq + C2)) return ssim_map.mean() ''' # -------------------------------------------- # matlab's bicubic imresize (numpy and torch) [0, 1] # -------------------------------------------- ''' # matlab 'imresize' function, now only support 'bicubic' def cubic(x): absx = torch.abs(x) absx2 = absx**2 absx3 = absx**3 return (1.5*absx3 - 2.5*absx2 + 1) * ((absx <= 1).type_as(absx)) + \ (-0.5*absx3 + 2.5*absx2 - 4*absx + 2) * (((absx > 1)*(absx <= 2)).type_as(absx)) def calculate_weights_indices(in_length, out_length, scale, kernel, kernel_width, antialiasing): if (scale < 1) and (antialiasing): # Use a modified kernel to simultaneously interpolate and antialias- larger kernel width kernel_width = kernel_width / scale # Output-space coordinates x = torch.linspace(1, out_length, out_length) # Input-space coordinates. Calculate the inverse mapping such that 0.5 # in output space maps to 0.5 in input space, and 0.5+scale in output # space maps to 1.5 in input space. u = x / scale + 0.5 * (1 - 1 / scale) # What is the left-most pixel that can be involved in the computation? left = torch.floor(u - kernel_width / 2) # What is the maximum number of pixels that can be involved in the # computation? Note: it's OK to use an extra pixel here; if the # corresponding weights are all zero, it will be eliminated at the end # of this function. P = math.ceil(kernel_width) + 2 # The indices of the input pixels involved in computing the k-th output # pixel are in row k of the indices matrix. indices = left.view(out_length, 1).expand(out_length, P) + torch.linspace(0, P - 1, P).view( 1, P).expand(out_length, P) # The weights used to compute the k-th output pixel are in row k of the # weights matrix. distance_to_center = u.view(out_length, 1).expand(out_length, P) - indices # apply cubic kernel if (scale < 1) and (antialiasing): weights = scale * cubic(distance_to_center * scale) else: weights = cubic(distance_to_center) # Normalize the weights matrix so that each row sums to 1. weights_sum = torch.sum(weights, 1).view(out_length, 1) weights = weights / weights_sum.expand(out_length, P) # If a column in weights is all zero, get rid of it. only consider the first and last column. weights_zero_tmp = torch.sum((weights == 0), 0) if not math.isclose(weights_zero_tmp[0], 0, rel_tol=1e-6): indices = indices.narrow(1, 1, P - 2) weights = weights.narrow(1, 1, P - 2) if not math.isclose(weights_zero_tmp[-1], 0, rel_tol=1e-6): indices = indices.narrow(1, 0, P - 2) weights = weights.narrow(1, 0, P - 2) weights = weights.contiguous() indices = indices.contiguous() sym_len_s = -indices.min() + 1 sym_len_e = indices.max() - in_length indices = indices + sym_len_s - 1 return weights, indices, int(sym_len_s), int(sym_len_e) # -------------------------------------------- # imresize for tensor image [0, 1] # -------------------------------------------- def imresize(img, scale, antialiasing=True): # Now the scale should be the same for H and W # input: img: pytorch tensor, CHW or HW [0,1] # output: CHW or HW [0,1] w/o round need_squeeze = True if img.dim() == 2 else False if need_squeeze: img.unsqueeze_(0) in_C, in_H, in_W = img.size() out_C, out_H, out_W = in_C, math.ceil(in_H * scale), math.ceil(in_W * scale) kernel_width = 4 kernel = 'cubic' # Return the desired dimension order for performing the resize. The # strategy is to perform the resize first along the dimension with the # smallest scale factor. # Now we do not support this. # get weights and indices weights_H, indices_H, sym_len_Hs, sym_len_He = calculate_weights_indices( in_H, out_H, scale, kernel, kernel_width, antialiasing) weights_W, indices_W, sym_len_Ws, sym_len_We = calculate_weights_indices( in_W, out_W, scale, kernel, kernel_width, antialiasing) # process H dimension # symmetric copying img_aug = torch.FloatTensor(in_C, in_H + sym_len_Hs + sym_len_He, in_W) img_aug.narrow(1, sym_len_Hs, in_H).copy_(img) sym_patch = img[:, :sym_len_Hs, :] inv_idx = torch.arange(sym_patch.size(1) - 1, -1, -1).long() sym_patch_inv = sym_patch.index_select(1, inv_idx) img_aug.narrow(1, 0, sym_len_Hs).copy_(sym_patch_inv) sym_patch = img[:, -sym_len_He:, :] inv_idx = torch.arange(sym_patch.size(1) - 1, -1, -1).long() sym_patch_inv = sym_patch.index_select(1, inv_idx) img_aug.narrow(1, sym_len_Hs + in_H, sym_len_He).copy_(sym_patch_inv) out_1 = torch.FloatTensor(in_C, out_H, in_W) kernel_width = weights_H.size(1) for i in range(out_H): idx = int(indices_H[i][0]) for j in range(out_C): out_1[j, i, :] = img_aug[j, idx:idx + kernel_width, :].transpose(0, 1).mv(weights_H[i]) # process W dimension # symmetric copying out_1_aug = torch.FloatTensor(in_C, out_H, in_W + sym_len_Ws + sym_len_We) out_1_aug.narrow(2, sym_len_Ws, in_W).copy_(out_1) sym_patch = out_1[:, :, :sym_len_Ws] inv_idx = torch.arange(sym_patch.size(2) - 1, -1, -1).long() sym_patch_inv = sym_patch.index_select(2, inv_idx) out_1_aug.narrow(2, 0, sym_len_Ws).copy_(sym_patch_inv) sym_patch = out_1[:, :, -sym_len_We:] inv_idx = torch.arange(sym_patch.size(2) - 1, -1, -1).long() sym_patch_inv = sym_patch.index_select(2, inv_idx) out_1_aug.narrow(2, sym_len_Ws + in_W, sym_len_We).copy_(sym_patch_inv) out_2 = torch.FloatTensor(in_C, out_H, out_W) kernel_width = weights_W.size(1) for i in range(out_W): idx = int(indices_W[i][0]) for j in range(out_C): out_2[j, :, i] = out_1_aug[j, :, idx:idx + kernel_width].mv(weights_W[i]) if need_squeeze: out_2.squeeze_() return out_2 # -------------------------------------------- # imresize for numpy image [0, 1] # -------------------------------------------- def imresize_np(img, scale, antialiasing=True): # Now the scale should be the same for H and W # input: img: Numpy, HWC or HW [0,1] # output: HWC or HW [0,1] w/o round img = torch.from_numpy(img) need_squeeze = True if img.dim() == 2 else False if need_squeeze: img.unsqueeze_(2) in_H, in_W, in_C = img.size() out_C, out_H, out_W = in_C, math.ceil(in_H * scale), math.ceil(in_W * scale) kernel_width = 4 kernel = 'cubic' # Return the desired dimension order for performing the resize. The # strategy is to perform the resize first along the dimension with the # smallest scale factor. # Now we do not support this. # get weights and indices weights_H, indices_H, sym_len_Hs, sym_len_He = calculate_weights_indices( in_H, out_H, scale, kernel, kernel_width, antialiasing) weights_W, indices_W, sym_len_Ws, sym_len_We = calculate_weights_indices( in_W, out_W, scale, kernel, kernel_width, antialiasing) # process H dimension # symmetric copying img_aug = torch.FloatTensor(in_H + sym_len_Hs + sym_len_He, in_W, in_C) img_aug.narrow(0, sym_len_Hs, in_H).copy_(img) sym_patch = img[:sym_len_Hs, :, :] inv_idx = torch.arange(sym_patch.size(0) - 1, -1, -1).long() sym_patch_inv = sym_patch.index_select(0, inv_idx) img_aug.narrow(0, 0, sym_len_Hs).copy_(sym_patch_inv) sym_patch = img[-sym_len_He:, :, :] inv_idx = torch.arange(sym_patch.size(0) - 1, -1, -1).long() sym_patch_inv = sym_patch.index_select(0, inv_idx) img_aug.narrow(0, sym_len_Hs + in_H, sym_len_He).copy_(sym_patch_inv) out_1 = torch.FloatTensor(out_H, in_W, in_C) kernel_width = weights_H.size(1) for i in range(out_H): idx = int(indices_H[i][0]) for j in range(out_C): out_1[i, :, j] = img_aug[idx:idx + kernel_width, :, j].transpose(0, 1).mv(weights_H[i]) # process W dimension # symmetric copying out_1_aug = torch.FloatTensor(out_H, in_W + sym_len_Ws + sym_len_We, in_C) out_1_aug.narrow(1, sym_len_Ws, in_W).copy_(out_1) sym_patch = out_1[:, :sym_len_Ws, :] inv_idx = torch.arange(sym_patch.size(1) - 1, -1, -1).long() sym_patch_inv = sym_patch.index_select(1, inv_idx) out_1_aug.narrow(1, 0, sym_len_Ws).copy_(sym_patch_inv) sym_patch = out_1[:, -sym_len_We:, :] inv_idx = torch.arange(sym_patch.size(1) - 1, -1, -1).long() sym_patch_inv = sym_patch.index_select(1, inv_idx) out_1_aug.narrow(1, sym_len_Ws + in_W, sym_len_We).copy_(sym_patch_inv) out_2 = torch.FloatTensor(out_H, out_W, in_C) kernel_width = weights_W.size(1) for i in range(out_W): idx = int(indices_W[i][0]) for j in range(out_C): out_2[:, i, j] = out_1_aug[:, idx:idx + kernel_width, j].mv(weights_W[i]) if need_squeeze: out_2.squeeze_() return out_2.numpy() if __name__ == '__main__': print('---') # img = imread_uint('test.bmp', 3) # img = uint2single(img) # img_bicubic = imresize_np(img, 1/4) ================================================ FILE: ToonCrafter/ldm/modules/midas/__init__.py ================================================ ================================================ FILE: ToonCrafter/ldm/modules/midas/api.py ================================================ # based on https://github.com/isl-org/MiDaS import cv2 import torch import torch.nn as nn from torchvision.transforms import Compose from ToonCrafter.ldm.modules.midas.midas.dpt_depth import DPTDepthModel from ToonCrafter.ldm.modules.midas.midas.midas_net import MidasNet from ToonCrafter.ldm.modules.midas.midas.midas_net_custom import MidasNet_small from ToonCrafter.ldm.modules.midas.midas.transforms import Resize, NormalizeImage, PrepareForNet ISL_PATHS = { "dpt_large": "midas_models/dpt_large-midas-2f21e586.pt", "dpt_hybrid": "midas_models/dpt_hybrid-midas-501f0c75.pt", "midas_v21": "", "midas_v21_small": "", } def disabled_train(self, mode=True): """Overwrite model.train with this function to make sure train/eval mode does not change anymore.""" return self def load_midas_transform(model_type): # https://github.com/isl-org/MiDaS/blob/master/run.py # load transform only if model_type == "dpt_large": # DPT-Large net_w, net_h = 384, 384 resize_mode = "minimal" normalization = NormalizeImage(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]) elif model_type == "dpt_hybrid": # DPT-Hybrid net_w, net_h = 384, 384 resize_mode = "minimal" normalization = NormalizeImage(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]) elif model_type == "midas_v21": net_w, net_h = 384, 384 resize_mode = "upper_bound" normalization = NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) elif model_type == "midas_v21_small": net_w, net_h = 256, 256 resize_mode = "upper_bound" normalization = NormalizeImage(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) else: assert False, f"model_type '{model_type}' not implemented, use: --model_type large" transform = Compose( [ Resize( net_w, net_h, resize_target=None, keep_aspect_ratio=True, ensure_multiple_of=32, resize_method=resize_mode, image_interpolation_method=cv2.INTER_CUBIC, ), normalization, PrepareForNet(), ] ) return transform def load_model(model_type): # https://github.com/isl-org/MiDaS/blob/master/run.py # load network model_path = ISL_PATHS[model_type] if model_type == "dpt_large": # DPT-Large model = DPTDepthModel( path=model_path, backbone="vitl16_384", non_negative=True, ) net_w, net_h = 384, 384 resize_mode = "minimal" normalization = NormalizeImage(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]) elif model_type == "dpt_hybrid": # DPT-Hybrid model = DPTDepthModel( path=model_path, backbone="vitb_rn50_384", non_negative=True, ) net_w, net_h = 384, 384 resize_mode = "minimal" normalization = NormalizeImage(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]) elif model_type == "midas_v21": model = MidasNet(model_path, non_negative=True) net_w, net_h = 384, 384 resize_mode = "upper_bound" normalization = NormalizeImage( mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225] ) elif model_type == "midas_v21_small": model = MidasNet_small(model_path, features=64, backbone="efficientnet_lite3", exportable=True, non_negative=True, blocks={'expand': True}) net_w, net_h = 256, 256 resize_mode = "upper_bound" normalization = NormalizeImage( mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225] ) else: print(f"model_type '{model_type}' not implemented, use: --model_type large") assert False transform = Compose( [ Resize( net_w, net_h, resize_target=None, keep_aspect_ratio=True, ensure_multiple_of=32, resize_method=resize_mode, image_interpolation_method=cv2.INTER_CUBIC, ), normalization, PrepareForNet(), ] ) return model.eval(), transform class MiDaSInference(nn.Module): MODEL_TYPES_TORCH_HUB = [ "DPT_Large", "DPT_Hybrid", "MiDaS_small" ] MODEL_TYPES_ISL = [ "dpt_large", "dpt_hybrid", "midas_v21", "midas_v21_small", ] def __init__(self, model_type): super().__init__() assert (model_type in self.MODEL_TYPES_ISL) model, _ = load_model(model_type) self.model = model self.model.train = disabled_train def forward(self, x): # x in 0..1 as produced by calling self.transform on a 0..1 float64 numpy array # NOTE: we expect that the correct transform has been called during dataloading. with torch.no_grad(): prediction = self.model(x) prediction = torch.nn.functional.interpolate( prediction.unsqueeze(1), size=x.shape[2:], mode="bicubic", align_corners=False, ) assert prediction.shape == (x.shape[0], 1, x.shape[2], x.shape[3]) return prediction ================================================ FILE: ToonCrafter/ldm/modules/midas/midas/__init__.py ================================================ ================================================ FILE: ToonCrafter/ldm/modules/midas/midas/base_model.py ================================================ import torch class BaseModel(torch.nn.Module): def load(self, path): """Load model from file. Args: path (str): file path """ parameters = torch.load(path, map_location=torch.device('cpu')) if "optimizer" in parameters: parameters = parameters["model"] self.load_state_dict(parameters) ================================================ FILE: ToonCrafter/ldm/modules/midas/midas/blocks.py ================================================ import torch import torch.nn as nn from .vit import ( _make_pretrained_vitb_rn50_384, _make_pretrained_vitl16_384, _make_pretrained_vitb16_384, forward_vit, ) def _make_encoder(backbone, features, use_pretrained, groups=1, expand=False, exportable=True, hooks=None, use_vit_only=False, use_readout="ignore",): if backbone == "vitl16_384": pretrained = _make_pretrained_vitl16_384( use_pretrained, hooks=hooks, use_readout=use_readout ) scratch = _make_scratch( [256, 512, 1024, 1024], features, groups=groups, expand=expand ) # ViT-L/16 - 85.0% Top1 (backbone) elif backbone == "vitb_rn50_384": pretrained = _make_pretrained_vitb_rn50_384( use_pretrained, hooks=hooks, use_vit_only=use_vit_only, use_readout=use_readout, ) scratch = _make_scratch( [256, 512, 768, 768], features, groups=groups, expand=expand ) # ViT-H/16 - 85.0% Top1 (backbone) elif backbone == "vitb16_384": pretrained = _make_pretrained_vitb16_384( use_pretrained, hooks=hooks, use_readout=use_readout ) scratch = _make_scratch( [96, 192, 384, 768], features, groups=groups, expand=expand ) # ViT-B/16 - 84.6% Top1 (backbone) elif backbone == "resnext101_wsl": pretrained = _make_pretrained_resnext101_wsl(use_pretrained) scratch = _make_scratch([256, 512, 1024, 2048], features, groups=groups, expand=expand) # efficientnet_lite3 elif backbone == "efficientnet_lite3": pretrained = _make_pretrained_efficientnet_lite3(use_pretrained, exportable=exportable) scratch = _make_scratch([32, 48, 136, 384], features, groups=groups, expand=expand) # efficientnet_lite3 else: print(f"Backbone '{backbone}' not implemented") assert False return pretrained, scratch def _make_scratch(in_shape, out_shape, groups=1, expand=False): scratch = nn.Module() out_shape1 = out_shape out_shape2 = out_shape out_shape3 = out_shape out_shape4 = out_shape if expand==True: out_shape1 = out_shape out_shape2 = out_shape*2 out_shape3 = out_shape*4 out_shape4 = out_shape*8 scratch.layer1_rn = nn.Conv2d( in_shape[0], out_shape1, kernel_size=3, stride=1, padding=1, bias=False, groups=groups ) scratch.layer2_rn = nn.Conv2d( in_shape[1], out_shape2, kernel_size=3, stride=1, padding=1, bias=False, groups=groups ) scratch.layer3_rn = nn.Conv2d( in_shape[2], out_shape3, kernel_size=3, stride=1, padding=1, bias=False, groups=groups ) scratch.layer4_rn = nn.Conv2d( in_shape[3], out_shape4, kernel_size=3, stride=1, padding=1, bias=False, groups=groups ) return scratch def _make_pretrained_efficientnet_lite3(use_pretrained, exportable=False): efficientnet = torch.hub.load( "rwightman/gen-efficientnet-pytorch", "tf_efficientnet_lite3", pretrained=use_pretrained, exportable=exportable ) return _make_efficientnet_backbone(efficientnet) def _make_efficientnet_backbone(effnet): pretrained = nn.Module() pretrained.layer1 = nn.Sequential( effnet.conv_stem, effnet.bn1, effnet.act1, *effnet.blocks[0:2] ) pretrained.layer2 = nn.Sequential(*effnet.blocks[2:3]) pretrained.layer3 = nn.Sequential(*effnet.blocks[3:5]) pretrained.layer4 = nn.Sequential(*effnet.blocks[5:9]) return pretrained def _make_resnet_backbone(resnet): pretrained = nn.Module() pretrained.layer1 = nn.Sequential( resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool, resnet.layer1 ) pretrained.layer2 = resnet.layer2 pretrained.layer3 = resnet.layer3 pretrained.layer4 = resnet.layer4 return pretrained def _make_pretrained_resnext101_wsl(use_pretrained): resnet = torch.hub.load("facebookresearch/WSL-Images", "resnext101_32x8d_wsl") return _make_resnet_backbone(resnet) class Interpolate(nn.Module): """Interpolation module. """ def __init__(self, scale_factor, mode, align_corners=False): """Init. Args: scale_factor (float): scaling mode (str): interpolation mode """ super(Interpolate, self).__init__() self.interp = nn.functional.interpolate self.scale_factor = scale_factor self.mode = mode self.align_corners = align_corners def forward(self, x): """Forward pass. Args: x (tensor): input Returns: tensor: interpolated data """ x = self.interp( x, scale_factor=self.scale_factor, mode=self.mode, align_corners=self.align_corners ) return x class ResidualConvUnit(nn.Module): """Residual convolution module. """ def __init__(self, features): """Init. Args: features (int): number of features """ super().__init__() self.conv1 = nn.Conv2d( features, features, kernel_size=3, stride=1, padding=1, bias=True ) self.conv2 = nn.Conv2d( features, features, kernel_size=3, stride=1, padding=1, bias=True ) self.relu = nn.ReLU(inplace=True) def forward(self, x): """Forward pass. Args: x (tensor): input Returns: tensor: output """ out = self.relu(x) out = self.conv1(out) out = self.relu(out) out = self.conv2(out) return out + x class FeatureFusionBlock(nn.Module): """Feature fusion block. """ def __init__(self, features): """Init. Args: features (int): number of features """ super(FeatureFusionBlock, self).__init__() self.resConfUnit1 = ResidualConvUnit(features) self.resConfUnit2 = ResidualConvUnit(features) def forward(self, *xs): """Forward pass. Returns: tensor: output """ output = xs[0] if len(xs) == 2: output += self.resConfUnit1(xs[1]) output = self.resConfUnit2(output) output = nn.functional.interpolate( output, scale_factor=2, mode="bilinear", align_corners=True ) return output class ResidualConvUnit_custom(nn.Module): """Residual convolution module. """ def __init__(self, features, activation, bn): """Init. Args: features (int): number of features """ super().__init__() self.bn = bn self.groups=1 self.conv1 = nn.Conv2d( features, features, kernel_size=3, stride=1, padding=1, bias=True, groups=self.groups ) self.conv2 = nn.Conv2d( features, features, kernel_size=3, stride=1, padding=1, bias=True, groups=self.groups ) if self.bn==True: self.bn1 = nn.BatchNorm2d(features) self.bn2 = nn.BatchNorm2d(features) self.activation = activation self.skip_add = nn.quantized.FloatFunctional() def forward(self, x): """Forward pass. Args: x (tensor): input Returns: tensor: output """ out = self.activation(x) out = self.conv1(out) if self.bn==True: out = self.bn1(out) out = self.activation(out) out = self.conv2(out) if self.bn==True: out = self.bn2(out) if self.groups > 1: out = self.conv_merge(out) return self.skip_add.add(out, x) # return out + x class FeatureFusionBlock_custom(nn.Module): """Feature fusion block. """ def __init__(self, features, activation, deconv=False, bn=False, expand=False, align_corners=True): """Init. Args: features (int): number of features """ super(FeatureFusionBlock_custom, self).__init__() self.deconv = deconv self.align_corners = align_corners self.groups=1 self.expand = expand out_features = features if self.expand==True: out_features = features//2 self.out_conv = nn.Conv2d(features, out_features, kernel_size=1, stride=1, padding=0, bias=True, groups=1) self.resConfUnit1 = ResidualConvUnit_custom(features, activation, bn) self.resConfUnit2 = ResidualConvUnit_custom(features, activation, bn) self.skip_add = nn.quantized.FloatFunctional() def forward(self, *xs): """Forward pass. Returns: tensor: output """ output = xs[0] if len(xs) == 2: res = self.resConfUnit1(xs[1]) output = self.skip_add.add(output, res) # output += res output = self.resConfUnit2(output) output = nn.functional.interpolate( output, scale_factor=2, mode="bilinear", align_corners=self.align_corners ) output = self.out_conv(output) return output ================================================ FILE: ToonCrafter/ldm/modules/midas/midas/dpt_depth.py ================================================ import torch import torch.nn as nn import torch.nn.functional as F from .base_model import BaseModel from .blocks import ( FeatureFusionBlock, FeatureFusionBlock_custom, Interpolate, _make_encoder, forward_vit, ) def _make_fusion_block(features, use_bn): return FeatureFusionBlock_custom( features, nn.ReLU(False), deconv=False, bn=use_bn, expand=False, align_corners=True, ) class DPT(BaseModel): def __init__( self, head, features=256, backbone="vitb_rn50_384", readout="project", channels_last=False, use_bn=False, ): super(DPT, self).__init__() self.channels_last = channels_last hooks = { "vitb_rn50_384": [0, 1, 8, 11], "vitb16_384": [2, 5, 8, 11], "vitl16_384": [5, 11, 17, 23], } # Instantiate backbone and reassemble blocks self.pretrained, self.scratch = _make_encoder( backbone, features, False, # Set to true of you want to train from scratch, uses ImageNet weights groups=1, expand=False, exportable=False, hooks=hooks[backbone], use_readout=readout, ) self.scratch.refinenet1 = _make_fusion_block(features, use_bn) self.scratch.refinenet2 = _make_fusion_block(features, use_bn) self.scratch.refinenet3 = _make_fusion_block(features, use_bn) self.scratch.refinenet4 = _make_fusion_block(features, use_bn) self.scratch.output_conv = head def forward(self, x): if self.channels_last == True: x.contiguous(memory_format=torch.channels_last) layer_1, layer_2, layer_3, layer_4 = forward_vit(self.pretrained, x) layer_1_rn = self.scratch.layer1_rn(layer_1) layer_2_rn = self.scratch.layer2_rn(layer_2) layer_3_rn = self.scratch.layer3_rn(layer_3) layer_4_rn = self.scratch.layer4_rn(layer_4) path_4 = self.scratch.refinenet4(layer_4_rn) path_3 = self.scratch.refinenet3(path_4, layer_3_rn) path_2 = self.scratch.refinenet2(path_3, layer_2_rn) path_1 = self.scratch.refinenet1(path_2, layer_1_rn) out = self.scratch.output_conv(path_1) return out class DPTDepthModel(DPT): def __init__(self, path=None, non_negative=True, **kwargs): features = kwargs["features"] if "features" in kwargs else 256 head = nn.Sequential( nn.Conv2d(features, features // 2, kernel_size=3, stride=1, padding=1), Interpolate(scale_factor=2, mode="bilinear", align_corners=True), nn.Conv2d(features // 2, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(True), nn.Conv2d(32, 1, kernel_size=1, stride=1, padding=0), nn.ReLU(True) if non_negative else nn.Identity(), nn.Identity(), ) super().__init__(head, **kwargs) if path is not None: self.load(path) def forward(self, x): return super().forward(x).squeeze(dim=1) ================================================ FILE: ToonCrafter/ldm/modules/midas/midas/midas_net.py ================================================ """MidashNet: Network for monocular depth estimation trained by mixing several datasets. This file contains code that is adapted from https://github.com/thomasjpfan/pytorch_refinenet/blob/master/pytorch_refinenet/refinenet/refinenet_4cascade.py """ import torch import torch.nn as nn from .base_model import BaseModel from .blocks import FeatureFusionBlock, Interpolate, _make_encoder class MidasNet(BaseModel): """Network for monocular depth estimation. """ def __init__(self, path=None, features=256, non_negative=True): """Init. Args: path (str, optional): Path to saved model. Defaults to None. features (int, optional): Number of features. Defaults to 256. backbone (str, optional): Backbone network for encoder. Defaults to resnet50 """ print("Loading weights: ", path) super(MidasNet, self).__init__() use_pretrained = False if path is None else True self.pretrained, self.scratch = _make_encoder(backbone="resnext101_wsl", features=features, use_pretrained=use_pretrained) self.scratch.refinenet4 = FeatureFusionBlock(features) self.scratch.refinenet3 = FeatureFusionBlock(features) self.scratch.refinenet2 = FeatureFusionBlock(features) self.scratch.refinenet1 = FeatureFusionBlock(features) self.scratch.output_conv = nn.Sequential( nn.Conv2d(features, 128, kernel_size=3, stride=1, padding=1), Interpolate(scale_factor=2, mode="bilinear"), nn.Conv2d(128, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(True), nn.Conv2d(32, 1, kernel_size=1, stride=1, padding=0), nn.ReLU(True) if non_negative else nn.Identity(), ) if path: self.load(path) def forward(self, x): """Forward pass. Args: x (tensor): input data (image) Returns: tensor: depth """ layer_1 = self.pretrained.layer1(x) layer_2 = self.pretrained.layer2(layer_1) layer_3 = self.pretrained.layer3(layer_2) layer_4 = self.pretrained.layer4(layer_3) layer_1_rn = self.scratch.layer1_rn(layer_1) layer_2_rn = self.scratch.layer2_rn(layer_2) layer_3_rn = self.scratch.layer3_rn(layer_3) layer_4_rn = self.scratch.layer4_rn(layer_4) path_4 = self.scratch.refinenet4(layer_4_rn) path_3 = self.scratch.refinenet3(path_4, layer_3_rn) path_2 = self.scratch.refinenet2(path_3, layer_2_rn) path_1 = self.scratch.refinenet1(path_2, layer_1_rn) out = self.scratch.output_conv(path_1) return torch.squeeze(out, dim=1) ================================================ FILE: ToonCrafter/ldm/modules/midas/midas/midas_net_custom.py ================================================ """MidashNet: Network for monocular depth estimation trained by mixing several datasets. This file contains code that is adapted from https://github.com/thomasjpfan/pytorch_refinenet/blob/master/pytorch_refinenet/refinenet/refinenet_4cascade.py """ import torch import torch.nn as nn from .base_model import BaseModel from .blocks import FeatureFusionBlock, FeatureFusionBlock_custom, Interpolate, _make_encoder class MidasNet_small(BaseModel): """Network for monocular depth estimation. """ def __init__(self, path=None, features=64, backbone="efficientnet_lite3", non_negative=True, exportable=True, channels_last=False, align_corners=True, blocks={'expand': True}): """Init. Args: path (str, optional): Path to saved model. Defaults to None. features (int, optional): Number of features. Defaults to 256. backbone (str, optional): Backbone network for encoder. Defaults to resnet50 """ print("Loading weights: ", path) super(MidasNet_small, self).__init__() use_pretrained = False if path else True self.channels_last = channels_last self.blocks = blocks self.backbone = backbone self.groups = 1 features1=features features2=features features3=features features4=features self.expand = False if "expand" in self.blocks and self.blocks['expand'] == True: self.expand = True features1=features features2=features*2 features3=features*4 features4=features*8 self.pretrained, self.scratch = _make_encoder(self.backbone, features, use_pretrained, groups=self.groups, expand=self.expand, exportable=exportable) self.scratch.activation = nn.ReLU(False) self.scratch.refinenet4 = FeatureFusionBlock_custom(features4, self.scratch.activation, deconv=False, bn=False, expand=self.expand, align_corners=align_corners) self.scratch.refinenet3 = FeatureFusionBlock_custom(features3, self.scratch.activation, deconv=False, bn=False, expand=self.expand, align_corners=align_corners) self.scratch.refinenet2 = FeatureFusionBlock_custom(features2, self.scratch.activation, deconv=False, bn=False, expand=self.expand, align_corners=align_corners) self.scratch.refinenet1 = FeatureFusionBlock_custom(features1, self.scratch.activation, deconv=False, bn=False, align_corners=align_corners) self.scratch.output_conv = nn.Sequential( nn.Conv2d(features, features//2, kernel_size=3, stride=1, padding=1, groups=self.groups), Interpolate(scale_factor=2, mode="bilinear"), nn.Conv2d(features//2, 32, kernel_size=3, stride=1, padding=1), self.scratch.activation, nn.Conv2d(32, 1, kernel_size=1, stride=1, padding=0), nn.ReLU(True) if non_negative else nn.Identity(), nn.Identity(), ) if path: self.load(path) def forward(self, x): """Forward pass. Args: x (tensor): input data (image) Returns: tensor: depth """ if self.channels_last==True: print("self.channels_last = ", self.channels_last) x.contiguous(memory_format=torch.channels_last) layer_1 = self.pretrained.layer1(x) layer_2 = self.pretrained.layer2(layer_1) layer_3 = self.pretrained.layer3(layer_2) layer_4 = self.pretrained.layer4(layer_3) layer_1_rn = self.scratch.layer1_rn(layer_1) layer_2_rn = self.scratch.layer2_rn(layer_2) layer_3_rn = self.scratch.layer3_rn(layer_3) layer_4_rn = self.scratch.layer4_rn(layer_4) path_4 = self.scratch.refinenet4(layer_4_rn) path_3 = self.scratch.refinenet3(path_4, layer_3_rn) path_2 = self.scratch.refinenet2(path_3, layer_2_rn) path_1 = self.scratch.refinenet1(path_2, layer_1_rn) out = self.scratch.output_conv(path_1) return torch.squeeze(out, dim=1) def fuse_model(m): prev_previous_type = nn.Identity() prev_previous_name = '' previous_type = nn.Identity() previous_name = '' for name, module in m.named_modules(): if prev_previous_type == nn.Conv2d and previous_type == nn.BatchNorm2d and type(module) == nn.ReLU: # print("FUSED ", prev_previous_name, previous_name, name) torch.quantization.fuse_modules(m, [prev_previous_name, previous_name, name], inplace=True) elif prev_previous_type == nn.Conv2d and previous_type == nn.BatchNorm2d: # print("FUSED ", prev_previous_name, previous_name) torch.quantization.fuse_modules(m, [prev_previous_name, previous_name], inplace=True) # elif previous_type == nn.Conv2d and type(module) == nn.ReLU: # print("FUSED ", previous_name, name) # torch.quantization.fuse_modules(m, [previous_name, name], inplace=True) prev_previous_type = previous_type prev_previous_name = previous_name previous_type = type(module) previous_name = name ================================================ FILE: ToonCrafter/ldm/modules/midas/midas/transforms.py ================================================ import numpy as np import cv2 import math def apply_min_size(sample, size, image_interpolation_method=cv2.INTER_AREA): """Rezise the sample to ensure the given size. Keeps aspect ratio. Args: sample (dict): sample size (tuple): image size Returns: tuple: new size """ shape = list(sample["disparity"].shape) if shape[0] >= size[0] and shape[1] >= size[1]: return sample scale = [0, 0] scale[0] = size[0] / shape[0] scale[1] = size[1] / shape[1] scale = max(scale) shape[0] = math.ceil(scale * shape[0]) shape[1] = math.ceil(scale * shape[1]) # resize sample["image"] = cv2.resize( sample["image"], tuple(shape[::-1]), interpolation=image_interpolation_method ) sample["disparity"] = cv2.resize( sample["disparity"], tuple(shape[::-1]), interpolation=cv2.INTER_NEAREST ) sample["mask"] = cv2.resize( sample["mask"].astype(np.float32), tuple(shape[::-1]), interpolation=cv2.INTER_NEAREST, ) sample["mask"] = sample["mask"].astype(bool) return tuple(shape) class Resize(object): """Resize sample to given size (width, height). """ def __init__( self, width, height, resize_target=True, keep_aspect_ratio=False, ensure_multiple_of=1, resize_method="lower_bound", image_interpolation_method=cv2.INTER_AREA, ): """Init. Args: width (int): desired output width height (int): desired output height resize_target (bool, optional): True: Resize the full sample (image, mask, target). False: Resize image only. Defaults to True. keep_aspect_ratio (bool, optional): True: Keep the aspect ratio of the input sample. Output sample might not have the given width and height, and resize behaviour depends on the parameter 'resize_method'. Defaults to False. ensure_multiple_of (int, optional): Output width and height is constrained to be multiple of this parameter. Defaults to 1. resize_method (str, optional): "lower_bound": Output will be at least as large as the given size. "upper_bound": Output will be at max as large as the given size. (Output size might be smaller than given size.) "minimal": Scale as least as possible. (Output size might be smaller than given size.) Defaults to "lower_bound". """ self.__width = width self.__height = height self.__resize_target = resize_target self.__keep_aspect_ratio = keep_aspect_ratio self.__multiple_of = ensure_multiple_of self.__resize_method = resize_method self.__image_interpolation_method = image_interpolation_method def constrain_to_multiple_of(self, x, min_val=0, max_val=None): y = (np.round(x / self.__multiple_of) * self.__multiple_of).astype(int) if max_val is not None and y > max_val: y = (np.floor(x / self.__multiple_of) * self.__multiple_of).astype(int) if y < min_val: y = (np.ceil(x / self.__multiple_of) * self.__multiple_of).astype(int) return y def get_size(self, width, height): # determine new height and width scale_height = self.__height / height scale_width = self.__width / width if self.__keep_aspect_ratio: if self.__resize_method == "lower_bound": # scale such that output size is lower bound if scale_width > scale_height: # fit width scale_height = scale_width else: # fit height scale_width = scale_height elif self.__resize_method == "upper_bound": # scale such that output size is upper bound if scale_width < scale_height: # fit width scale_height = scale_width else: # fit height scale_width = scale_height elif self.__resize_method == "minimal": # scale as least as possbile if abs(1 - scale_width) < abs(1 - scale_height): # fit width scale_height = scale_width else: # fit height scale_width = scale_height else: raise ValueError( f"resize_method {self.__resize_method} not implemented" ) if self.__resize_method == "lower_bound": new_height = self.constrain_to_multiple_of( scale_height * height, min_val=self.__height ) new_width = self.constrain_to_multiple_of( scale_width * width, min_val=self.__width ) elif self.__resize_method == "upper_bound": new_height = self.constrain_to_multiple_of( scale_height * height, max_val=self.__height ) new_width = self.constrain_to_multiple_of( scale_width * width, max_val=self.__width ) elif self.__resize_method == "minimal": new_height = self.constrain_to_multiple_of(scale_height * height) new_width = self.constrain_to_multiple_of(scale_width * width) else: raise ValueError(f"resize_method {self.__resize_method} not implemented") return (new_width, new_height) def __call__(self, sample): width, height = self.get_size( sample["image"].shape[1], sample["image"].shape[0] ) # resize sample sample["image"] = cv2.resize( sample["image"], (width, height), interpolation=self.__image_interpolation_method, ) if self.__resize_target: if "disparity" in sample: sample["disparity"] = cv2.resize( sample["disparity"], (width, height), interpolation=cv2.INTER_NEAREST, ) if "depth" in sample: sample["depth"] = cv2.resize( sample["depth"], (width, height), interpolation=cv2.INTER_NEAREST ) sample["mask"] = cv2.resize( sample["mask"].astype(np.float32), (width, height), interpolation=cv2.INTER_NEAREST, ) sample["mask"] = sample["mask"].astype(bool) return sample class NormalizeImage(object): """Normlize image by given mean and std. """ def __init__(self, mean, std): self.__mean = mean self.__std = std def __call__(self, sample): sample["image"] = (sample["image"] - self.__mean) / self.__std return sample class PrepareForNet(object): """Prepare sample for usage as network input. """ def __init__(self): pass def __call__(self, sample): image = np.transpose(sample["image"], (2, 0, 1)) sample["image"] = np.ascontiguousarray(image).astype(np.float32) if "mask" in sample: sample["mask"] = sample["mask"].astype(np.float32) sample["mask"] = np.ascontiguousarray(sample["mask"]) if "disparity" in sample: disparity = sample["disparity"].astype(np.float32) sample["disparity"] = np.ascontiguousarray(disparity) if "depth" in sample: depth = sample["depth"].astype(np.float32) sample["depth"] = np.ascontiguousarray(depth) return sample ================================================ FILE: ToonCrafter/ldm/modules/midas/midas/vit.py ================================================ import torch import torch.nn as nn import timm import types import math import torch.nn.functional as F class Slice(nn.Module): def __init__(self, start_index=1): super(Slice, self).__init__() self.start_index = start_index def forward(self, x): return x[:, self.start_index :] class AddReadout(nn.Module): def __init__(self, start_index=1): super(AddReadout, self).__init__() self.start_index = start_index def forward(self, x): if self.start_index == 2: readout = (x[:, 0] + x[:, 1]) / 2 else: readout = x[:, 0] return x[:, self.start_index :] + readout.unsqueeze(1) class ProjectReadout(nn.Module): def __init__(self, in_features, start_index=1): super(ProjectReadout, self).__init__() self.start_index = start_index self.project = nn.Sequential(nn.Linear(2 * in_features, in_features), nn.GELU()) def forward(self, x): readout = x[:, 0].unsqueeze(1).expand_as(x[:, self.start_index :]) features = torch.cat((x[:, self.start_index :], readout), -1) return self.project(features) class Transpose(nn.Module): def __init__(self, dim0, dim1): super(Transpose, self).__init__() self.dim0 = dim0 self.dim1 = dim1 def forward(self, x): x = x.transpose(self.dim0, self.dim1) return x def forward_vit(pretrained, x): b, c, h, w = x.shape glob = pretrained.model.forward_flex(x) layer_1 = pretrained.activations["1"] layer_2 = pretrained.activations["2"] layer_3 = pretrained.activations["3"] layer_4 = pretrained.activations["4"] layer_1 = pretrained.act_postprocess1[0:2](layer_1) layer_2 = pretrained.act_postprocess2[0:2](layer_2) layer_3 = pretrained.act_postprocess3[0:2](layer_3) layer_4 = pretrained.act_postprocess4[0:2](layer_4) unflatten = nn.Sequential( nn.Unflatten( 2, torch.Size( [ h // pretrained.model.patch_size[1], w // pretrained.model.patch_size[0], ] ), ) ) if layer_1.ndim == 3: layer_1 = unflatten(layer_1) if layer_2.ndim == 3: layer_2 = unflatten(layer_2) if layer_3.ndim == 3: layer_3 = unflatten(layer_3) if layer_4.ndim == 3: layer_4 = unflatten(layer_4) layer_1 = pretrained.act_postprocess1[3 : len(pretrained.act_postprocess1)](layer_1) layer_2 = pretrained.act_postprocess2[3 : len(pretrained.act_postprocess2)](layer_2) layer_3 = pretrained.act_postprocess3[3 : len(pretrained.act_postprocess3)](layer_3) layer_4 = pretrained.act_postprocess4[3 : len(pretrained.act_postprocess4)](layer_4) return layer_1, layer_2, layer_3, layer_4 def _resize_pos_embed(self, posemb, gs_h, gs_w): posemb_tok, posemb_grid = ( posemb[:, : self.start_index], posemb[0, self.start_index :], ) gs_old = int(math.sqrt(len(posemb_grid))) posemb_grid = posemb_grid.reshape(1, gs_old, gs_old, -1).permute(0, 3, 1, 2) posemb_grid = F.interpolate(posemb_grid, size=(gs_h, gs_w), mode="bilinear") posemb_grid = posemb_grid.permute(0, 2, 3, 1).reshape(1, gs_h * gs_w, -1) posemb = torch.cat([posemb_tok, posemb_grid], dim=1) return posemb def forward_flex(self, x): b, c, h, w = x.shape pos_embed = self._resize_pos_embed( self.pos_embed, h // self.patch_size[1], w // self.patch_size[0] ) B = x.shape[0] if hasattr(self.patch_embed, "backbone"): x = self.patch_embed.backbone(x) if isinstance(x, (list, tuple)): x = x[-1] # last feature if backbone outputs list/tuple of features x = self.patch_embed.proj(x).flatten(2).transpose(1, 2) if getattr(self, "dist_token", None) is not None: cls_tokens = self.cls_token.expand( B, -1, -1 ) # stole cls_tokens impl from Phil Wang, thanks dist_token = self.dist_token.expand(B, -1, -1) x = torch.cat((cls_tokens, dist_token, x), dim=1) else: cls_tokens = self.cls_token.expand( B, -1, -1 ) # stole cls_tokens impl from Phil Wang, thanks x = torch.cat((cls_tokens, x), dim=1) x = x + pos_embed x = self.pos_drop(x) for blk in self.blocks: x = blk(x) x = self.norm(x) return x activations = {} def get_activation(name): def hook(model, input, output): activations[name] = output return hook def get_readout_oper(vit_features, features, use_readout, start_index=1): if use_readout == "ignore": readout_oper = [Slice(start_index)] * len(features) elif use_readout == "add": readout_oper = [AddReadout(start_index)] * len(features) elif use_readout == "project": readout_oper = [ ProjectReadout(vit_features, start_index) for out_feat in features ] else: assert ( False ), "wrong operation for readout token, use_readout can be 'ignore', 'add', or 'project'" return readout_oper def _make_vit_b16_backbone( model, features=[96, 192, 384, 768], size=[384, 384], hooks=[2, 5, 8, 11], vit_features=768, use_readout="ignore", start_index=1, ): pretrained = nn.Module() pretrained.model = model pretrained.model.blocks[hooks[0]].register_forward_hook(get_activation("1")) pretrained.model.blocks[hooks[1]].register_forward_hook(get_activation("2")) pretrained.model.blocks[hooks[2]].register_forward_hook(get_activation("3")) pretrained.model.blocks[hooks[3]].register_forward_hook(get_activation("4")) pretrained.activations = activations readout_oper = get_readout_oper(vit_features, features, use_readout, start_index) # 32, 48, 136, 384 pretrained.act_postprocess1 = nn.Sequential( readout_oper[0], Transpose(1, 2), nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])), nn.Conv2d( in_channels=vit_features, out_channels=features[0], kernel_size=1, stride=1, padding=0, ), nn.ConvTranspose2d( in_channels=features[0], out_channels=features[0], kernel_size=4, stride=4, padding=0, bias=True, dilation=1, groups=1, ), ) pretrained.act_postprocess2 = nn.Sequential( readout_oper[1], Transpose(1, 2), nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])), nn.Conv2d( in_channels=vit_features, out_channels=features[1], kernel_size=1, stride=1, padding=0, ), nn.ConvTranspose2d( in_channels=features[1], out_channels=features[1], kernel_size=2, stride=2, padding=0, bias=True, dilation=1, groups=1, ), ) pretrained.act_postprocess3 = nn.Sequential( readout_oper[2], Transpose(1, 2), nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])), nn.Conv2d( in_channels=vit_features, out_channels=features[2], kernel_size=1, stride=1, padding=0, ), ) pretrained.act_postprocess4 = nn.Sequential( readout_oper[3], Transpose(1, 2), nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])), nn.Conv2d( in_channels=vit_features, out_channels=features[3], kernel_size=1, stride=1, padding=0, ), nn.Conv2d( in_channels=features[3], out_channels=features[3], kernel_size=3, stride=2, padding=1, ), ) pretrained.model.start_index = start_index pretrained.model.patch_size = [16, 16] # We inject this function into the VisionTransformer instances so that # we can use it with interpolated position embeddings without modifying the library source. pretrained.model.forward_flex = types.MethodType(forward_flex, pretrained.model) pretrained.model._resize_pos_embed = types.MethodType( _resize_pos_embed, pretrained.model ) return pretrained def _make_pretrained_vitl16_384(pretrained, use_readout="ignore", hooks=None): model = timm.create_model("vit_large_patch16_384", pretrained=pretrained) hooks = [5, 11, 17, 23] if hooks == None else hooks return _make_vit_b16_backbone( model, features=[256, 512, 1024, 1024], hooks=hooks, vit_features=1024, use_readout=use_readout, ) def _make_pretrained_vitb16_384(pretrained, use_readout="ignore", hooks=None): model = timm.create_model("vit_base_patch16_384", pretrained=pretrained) hooks = [2, 5, 8, 11] if hooks == None else hooks return _make_vit_b16_backbone( model, features=[96, 192, 384, 768], hooks=hooks, use_readout=use_readout ) def _make_pretrained_deitb16_384(pretrained, use_readout="ignore", hooks=None): model = timm.create_model("vit_deit_base_patch16_384", pretrained=pretrained) hooks = [2, 5, 8, 11] if hooks == None else hooks return _make_vit_b16_backbone( model, features=[96, 192, 384, 768], hooks=hooks, use_readout=use_readout ) def _make_pretrained_deitb16_distil_384(pretrained, use_readout="ignore", hooks=None): model = timm.create_model( "vit_deit_base_distilled_patch16_384", pretrained=pretrained ) hooks = [2, 5, 8, 11] if hooks == None else hooks return _make_vit_b16_backbone( model, features=[96, 192, 384, 768], hooks=hooks, use_readout=use_readout, start_index=2, ) def _make_vit_b_rn50_backbone( model, features=[256, 512, 768, 768], size=[384, 384], hooks=[0, 1, 8, 11], vit_features=768, use_vit_only=False, use_readout="ignore", start_index=1, ): pretrained = nn.Module() pretrained.model = model if use_vit_only == True: pretrained.model.blocks[hooks[0]].register_forward_hook(get_activation("1")) pretrained.model.blocks[hooks[1]].register_forward_hook(get_activation("2")) else: pretrained.model.patch_embed.backbone.stages[0].register_forward_hook( get_activation("1") ) pretrained.model.patch_embed.backbone.stages[1].register_forward_hook( get_activation("2") ) pretrained.model.blocks[hooks[2]].register_forward_hook(get_activation("3")) pretrained.model.blocks[hooks[3]].register_forward_hook(get_activation("4")) pretrained.activations = activations readout_oper = get_readout_oper(vit_features, features, use_readout, start_index) if use_vit_only == True: pretrained.act_postprocess1 = nn.Sequential( readout_oper[0], Transpose(1, 2), nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])), nn.Conv2d( in_channels=vit_features, out_channels=features[0], kernel_size=1, stride=1, padding=0, ), nn.ConvTranspose2d( in_channels=features[0], out_channels=features[0], kernel_size=4, stride=4, padding=0, bias=True, dilation=1, groups=1, ), ) pretrained.act_postprocess2 = nn.Sequential( readout_oper[1], Transpose(1, 2), nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])), nn.Conv2d( in_channels=vit_features, out_channels=features[1], kernel_size=1, stride=1, padding=0, ), nn.ConvTranspose2d( in_channels=features[1], out_channels=features[1], kernel_size=2, stride=2, padding=0, bias=True, dilation=1, groups=1, ), ) else: pretrained.act_postprocess1 = nn.Sequential( nn.Identity(), nn.Identity(), nn.Identity() ) pretrained.act_postprocess2 = nn.Sequential( nn.Identity(), nn.Identity(), nn.Identity() ) pretrained.act_postprocess3 = nn.Sequential( readout_oper[2], Transpose(1, 2), nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])), nn.Conv2d( in_channels=vit_features, out_channels=features[2], kernel_size=1, stride=1, padding=0, ), ) pretrained.act_postprocess4 = nn.Sequential( readout_oper[3], Transpose(1, 2), nn.Unflatten(2, torch.Size([size[0] // 16, size[1] // 16])), nn.Conv2d( in_channels=vit_features, out_channels=features[3], kernel_size=1, stride=1, padding=0, ), nn.Conv2d( in_channels=features[3], out_channels=features[3], kernel_size=3, stride=2, padding=1, ), ) pretrained.model.start_index = start_index pretrained.model.patch_size = [16, 16] # We inject this function into the VisionTransformer instances so that # we can use it with interpolated position embeddings without modifying the library source. pretrained.model.forward_flex = types.MethodType(forward_flex, pretrained.model) # We inject this function into the VisionTransformer instances so that # we can use it with interpolated position embeddings without modifying the library source. pretrained.model._resize_pos_embed = types.MethodType( _resize_pos_embed, pretrained.model ) return pretrained def _make_pretrained_vitb_rn50_384( pretrained, use_readout="ignore", hooks=None, use_vit_only=False ): model = timm.create_model("vit_base_resnet50_384", pretrained=pretrained) hooks = [0, 1, 8, 11] if hooks == None else hooks return _make_vit_b_rn50_backbone( model, features=[256, 512, 768, 768], size=[384, 384], hooks=hooks, use_vit_only=use_vit_only, use_readout=use_readout, ) ================================================ FILE: ToonCrafter/ldm/modules/midas/utils.py ================================================ """Utils for monoDepth.""" import sys import re import numpy as np import cv2 import torch def read_pfm(path): """Read pfm file. Args: path (str): path to file Returns: tuple: (data, scale) """ with open(path, "rb") as file: color = None width = None height = None scale = None endian = None header = file.readline().rstrip() if header.decode("ascii") == "PF": color = True elif header.decode("ascii") == "Pf": color = False else: raise Exception("Not a PFM file: " + path) dim_match = re.match(r"^(\d+)\s(\d+)\s$", file.readline().decode("ascii")) if dim_match: width, height = list(map(int, dim_match.groups())) else: raise Exception("Malformed PFM header.") scale = float(file.readline().decode("ascii").rstrip()) if scale < 0: # little-endian endian = "<" scale = -scale else: # big-endian endian = ">" data = np.fromfile(file, endian + "f") shape = (height, width, 3) if color else (height, width) data = np.reshape(data, shape) data = np.flipud(data) return data, scale def write_pfm(path, image, scale=1): """Write pfm file. Args: path (str): pathto file image (array): data scale (int, optional): Scale. Defaults to 1. """ with open(path, "wb") as file: color = None if image.dtype.name != "float32": raise Exception("Image dtype must be float32.") image = np.flipud(image) if len(image.shape) == 3 and image.shape[2] == 3: # color image color = True elif ( len(image.shape) == 2 or len(image.shape) == 3 and image.shape[2] == 1 ): # greyscale color = False else: raise Exception("Image must have H x W x 3, H x W x 1 or H x W dimensions.") file.write("PF\n" if color else "Pf\n".encode()) file.write("%d %d\n".encode() % (image.shape[1], image.shape[0])) endian = image.dtype.byteorder if endian == "<" or endian == "=" and sys.byteorder == "little": scale = -scale file.write("%f\n".encode() % scale) image.tofile(file) def read_image(path): """Read image and output RGB image (0-1). Args: path (str): path to file Returns: array: RGB image (0-1) """ img = cv2.imread(path) if img.ndim == 2: img = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR) img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) / 255.0 return img def resize_image(img): """Resize image and make it fit for network. Args: img (array): image Returns: tensor: data ready for network """ height_orig = img.shape[0] width_orig = img.shape[1] if width_orig > height_orig: scale = width_orig / 384 else: scale = height_orig / 384 height = (np.ceil(height_orig / scale / 32) * 32).astype(int) width = (np.ceil(width_orig / scale / 32) * 32).astype(int) img_resized = cv2.resize(img, (width, height), interpolation=cv2.INTER_AREA) img_resized = ( torch.from_numpy(np.transpose(img_resized, (2, 0, 1))).contiguous().float() ) img_resized = img_resized.unsqueeze(0) return img_resized def resize_depth(depth, width, height): """Resize depth map and bring to CPU (numpy). Args: depth (tensor): depth width (int): image width height (int): image height Returns: array: processed depth """ depth = torch.squeeze(depth[0, :, :, :]).to("cpu") depth_resized = cv2.resize( depth.numpy(), (width, height), interpolation=cv2.INTER_CUBIC ) return depth_resized def write_depth(path, depth, bits=1): """Write depth map to pfm and png file. Args: path (str): filepath without extension depth (array): depth """ write_pfm(path + ".pfm", depth.astype(np.float32)) depth_min = depth.min() depth_max = depth.max() max_val = (2**(8*bits))-1 if depth_max - depth_min > np.finfo("float").eps: out = max_val * (depth - depth_min) / (depth_max - depth_min) else: out = np.zeros(depth.shape, dtype=depth.type) if bits == 1: cv2.imwrite(path + ".png", out.astype("uint8")) elif bits == 2: cv2.imwrite(path + ".png", out.astype("uint16")) return ================================================ FILE: ToonCrafter/ldm/util.py ================================================ import importlib import torch from torch import optim import numpy as np from inspect import isfunction from PIL import Image, ImageDraw, ImageFont def log_txt_as_img(wh, xc, size=10): # wh a tuple of (width, height) # xc a list of captions to plot b = len(xc) txts = list() for bi in range(b): txt = Image.new("RGB", wh, color="white") draw = ImageDraw.Draw(txt) font = ImageFont.truetype('font/DejaVuSans.ttf', size=size) nc = int(40 * (wh[0] / 256)) lines = "\n".join(xc[bi][start:start + nc] for start in range(0, len(xc[bi]), nc)) try: draw.text((0, 0), lines, fill="black", font=font) except UnicodeEncodeError: print("Cant encode string for logging. Skipping.") txt = np.array(txt).transpose(2, 0, 1) / 127.5 - 1.0 txts.append(txt) txts = np.stack(txts) txts = torch.tensor(txts) return txts def ismap(x): if not isinstance(x, torch.Tensor): return False return (len(x.shape) == 4) and (x.shape[1] > 3) def isimage(x): if not isinstance(x,torch.Tensor): return False return (len(x.shape) == 4) and (x.shape[1] == 3 or x.shape[1] == 1) def exists(x): return x is not None def default(val, d): if exists(val): return val return d() if isfunction(d) else d def mean_flat(tensor): """ https://github.com/openai/guided-diffusion/blob/27c20a8fab9cb472df5d6bdd6c8d11c8f430b924/guided_diffusion/nn.py#L86 Take the mean over all non-batch dimensions. """ return tensor.mean(dim=list(range(1, len(tensor.shape)))) def count_params(model, verbose=False): total_params = sum(p.numel() for p in model.parameters()) if verbose: print(f"{model.__class__.__name__} has {total_params*1.e-6:.2f} M params.") return total_params def instantiate_from_config(config): if not "target" in config: if config == '__is_first_stage__': return None elif config == "__is_unconditional__": return None raise KeyError("Expected key `target` to instantiate.") return get_obj_from_str(config["target"])(**config.get("params", dict())) def get_obj_from_str(string, reload=False): module, cls = string.rsplit(".", 1) if reload: module_imp = importlib.import_module(module) importlib.reload(module_imp) return getattr(importlib.import_module(module, package=None), cls) class AdamWwithEMAandWings(optim.Optimizer): # credit to https://gist.github.com/crowsonkb/65f7265353f403714fce3b2595e0b298 def __init__(self, params, lr=1.e-3, betas=(0.9, 0.999), eps=1.e-8, # TODO: check hyperparameters before using weight_decay=1.e-2, amsgrad=False, ema_decay=0.9999, # ema decay to match previous code ema_power=1., param_names=()): """AdamW that saves EMA versions of the parameters.""" if not 0.0 <= lr: raise ValueError("Invalid learning rate: {}".format(lr)) if not 0.0 <= eps: raise ValueError("Invalid epsilon value: {}".format(eps)) if not 0.0 <= betas[0] < 1.0: raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0])) if not 0.0 <= betas[1] < 1.0: raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1])) if not 0.0 <= weight_decay: raise ValueError("Invalid weight_decay value: {}".format(weight_decay)) if not 0.0 <= ema_decay <= 1.0: raise ValueError("Invalid ema_decay value: {}".format(ema_decay)) defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay, amsgrad=amsgrad, ema_decay=ema_decay, ema_power=ema_power, param_names=param_names) super().__init__(params, defaults) def __setstate__(self, state): super().__setstate__(state) for group in self.param_groups: group.setdefault('amsgrad', False) @torch.no_grad() def step(self, closure=None): """Performs a single optimization step. Args: closure (callable, optional): A closure that reevaluates the model and returns the loss. """ loss = None if closure is not None: with torch.enable_grad(): loss = closure() for group in self.param_groups: params_with_grad = [] grads = [] exp_avgs = [] exp_avg_sqs = [] ema_params_with_grad = [] state_sums = [] max_exp_avg_sqs = [] state_steps = [] amsgrad = group['amsgrad'] beta1, beta2 = group['betas'] ema_decay = group['ema_decay'] ema_power = group['ema_power'] for p in group['params']: if p.grad is None: continue params_with_grad.append(p) if p.grad.is_sparse: raise RuntimeError('AdamW does not support sparse gradients') grads.append(p.grad) state = self.state[p] # State initialization if len(state) == 0: state['step'] = 0 # Exponential moving average of gradient values state['exp_avg'] = torch.zeros_like(p, memory_format=torch.preserve_format) # Exponential moving average of squared gradient values state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format) if amsgrad: # Maintains max of all exp. moving avg. of sq. grad. values state['max_exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format) # Exponential moving average of parameter values state['param_exp_avg'] = p.detach().float().clone() exp_avgs.append(state['exp_avg']) exp_avg_sqs.append(state['exp_avg_sq']) ema_params_with_grad.append(state['param_exp_avg']) if amsgrad: max_exp_avg_sqs.append(state['max_exp_avg_sq']) # update the steps for each param group update state['step'] += 1 # record the step after step update state_steps.append(state['step']) optim._functional.adamw(params_with_grad, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad=amsgrad, beta1=beta1, beta2=beta2, lr=group['lr'], weight_decay=group['weight_decay'], eps=group['eps'], maximize=False) cur_ema_decay = min(ema_decay, 1 - state['step'] ** -ema_power) for param, ema_param in zip(params_with_grad, ema_params_with_grad): ema_param.mul_(cur_ema_decay).add_(param.float(), alpha=1 - cur_ema_decay) return loss ================================================ FILE: ToonCrafter/lvdm/__init__.py ================================================ ================================================ FILE: ToonCrafter/lvdm/basics.py ================================================ # adopted from # https://github.com/openai/improved-diffusion/blob/main/improved_diffusion/gaussian_diffusion.py # and # https://github.com/lucidrains/denoising-diffusion-pytorch/blob/7706bdfc6f527f58d33f84b7b522e61e6e3164b3/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py # and # https://github.com/openai/guided-diffusion/blob/0ba878e517b276c45d1195eb29f6f5f72659a05b/guided_diffusion/nn.py # # thanks! import torch.nn as nn from ToonCrafter.utils.utils import instantiate_from_config def disabled_train(self, mode=True): """Overwrite model.train with this function to make sure train/eval mode does not change anymore.""" return self def zero_module(module): """ Zero out the parameters of a module and return it. """ for p in module.parameters(): p.detach().zero_() return module def scale_module(module, scale): """ Scale the parameters of a module and return it. """ for p in module.parameters(): p.detach().mul_(scale) return module def conv_nd(dims, *args, **kwargs): """ Create a 1D, 2D, or 3D convolution module. """ if dims == 1: return nn.Conv1d(*args, **kwargs) elif dims == 2: return nn.Conv2d(*args, **kwargs) elif dims == 3: return nn.Conv3d(*args, **kwargs) raise ValueError(f"unsupported dimensions: {dims}") def linear(*args, **kwargs): """ Create a linear module. """ return nn.Linear(*args, **kwargs) def avg_pool_nd(dims, *args, **kwargs): """ Create a 1D, 2D, or 3D average pooling module. """ if dims == 1: return nn.AvgPool1d(*args, **kwargs) elif dims == 2: return nn.AvgPool2d(*args, **kwargs) elif dims == 3: return nn.AvgPool3d(*args, **kwargs) raise ValueError(f"unsupported dimensions: {dims}") def nonlinearity(type='silu'): if type == 'silu': return nn.SiLU() elif type == 'leaky_relu': return nn.LeakyReLU() class GroupNormSpecific(nn.GroupNorm): def forward(self, x): return super().forward(x.float()).type(x.dtype) def normalization(channels, num_groups=32): """ Make a standard normalization layer. :param channels: number of input channels. :return: an nn.Module for normalization. """ return GroupNormSpecific(num_groups, channels) class HybridConditioner(nn.Module): def __init__(self, c_concat_config, c_crossattn_config): super().__init__() self.concat_conditioner = instantiate_from_config(c_concat_config) self.crossattn_conditioner = instantiate_from_config(c_crossattn_config) def forward(self, c_concat, c_crossattn): c_concat = self.concat_conditioner(c_concat) c_crossattn = self.crossattn_conditioner(c_crossattn) return {'c_concat': [c_concat], 'c_crossattn': [c_crossattn]} ================================================ FILE: ToonCrafter/lvdm/common.py ================================================ import math from inspect import isfunction import torch from torch import nn import torch.distributed as dist def gather_data(data, return_np=True): ''' gather data from multiple processes to one list ''' data_list = [torch.zeros_like(data) for _ in range(dist.get_world_size())] dist.all_gather(data_list, data) # gather not supported with NCCL if return_np: data_list = [data.cpu().numpy() for data in data_list] return data_list def autocast(f): def do_autocast(*args, **kwargs): with torch.cuda.amp.autocast(enabled=True, dtype=torch.get_autocast_gpu_dtype(), cache_enabled=torch.is_autocast_cache_enabled()): return f(*args, **kwargs) return do_autocast def extract_into_tensor(a, t, x_shape): b, *_ = t.shape out = a.gather(-1, t) return out.reshape(b, *((1,) * (len(x_shape) - 1))) def noise_like(shape, device, repeat=False): repeat_noise = lambda: torch.randn((1, *shape[1:]), device=device).repeat(shape[0], *((1,) * (len(shape) - 1))) noise = lambda: torch.randn(shape, device=device) return repeat_noise() if repeat else noise() def default(val, d): if exists(val): return val return d() if isfunction(d) else d def exists(val): return val is not None def identity(*args, **kwargs): return nn.Identity() def uniq(arr): return{el: True for el in arr}.keys() def mean_flat(tensor): """ Take the mean over all non-batch dimensions. """ return tensor.mean(dim=list(range(1, len(tensor.shape)))) def ismap(x): if not isinstance(x, torch.Tensor): return False return (len(x.shape) == 4) and (x.shape[1] > 3) def isimage(x): if not isinstance(x,torch.Tensor): return False return (len(x.shape) == 4) and (x.shape[1] == 3 or x.shape[1] == 1) def max_neg_value(t): return -torch.finfo(t.dtype).max def shape_to_str(x): shape_str = "x".join([str(x) for x in x.shape]) return shape_str def init_(tensor): dim = tensor.shape[-1] std = 1 / math.sqrt(dim) tensor.uniform_(-std, std) return tensor ckpt = torch.utils.checkpoint.checkpoint def checkpoint(func, inputs, params, flag): """ Evaluate a function without caching intermediate activations, allowing for reduced memory at the expense of extra compute in the backward pass. :param func: the function to evaluate. :param inputs: the argument sequence to pass to `func`. :param params: a sequence of parameters `func` depends on but does not explicitly take as arguments. :param flag: if False, disable gradient checkpointing. """ if flag: return ckpt(func, *inputs, use_reentrant=False) else: return func(*inputs) ================================================ FILE: ToonCrafter/lvdm/data/base.py ================================================ from abc import abstractmethod from torch.utils.data import IterableDataset class Txt2ImgIterableBaseDataset(IterableDataset): ''' Define an interface to make the IterableDatasets for text2img data chainable ''' def __init__(self, num_records=0, valid_ids=None, size=256): super().__init__() self.num_records = num_records self.valid_ids = valid_ids self.sample_ids = valid_ids self.size = size print(f'{self.__class__.__name__} dataset contains {self.__len__()} examples.') def __len__(self): return self.num_records @abstractmethod def __iter__(self): pass ================================================ FILE: ToonCrafter/lvdm/data/webvid.py ================================================ import os import random from tqdm import tqdm import pandas as pd from decord import VideoReader, cpu import torch from torch.utils.data import Dataset from torch.utils.data import DataLoader from torchvision import transforms class WebVid(Dataset): """ WebVid Dataset. Assumes webvid data is structured as follows. Webvid/ videos/ 000001_000050/ ($page_dir) 1.mp4 (videoid.mp4) ... 5000.mp4 ... """ def __init__(self, meta_path, data_dir, subsample=None, video_length=16, resolution=[256, 512], frame_stride=1, frame_stride_min=1, spatial_transform=None, crop_resolution=None, fps_max=None, load_raw_resolution=False, fixed_fps=None, random_fs=False, ): self.meta_path = meta_path self.data_dir = data_dir self.subsample = subsample self.video_length = video_length self.resolution = [resolution, resolution] if isinstance(resolution, int) else resolution self.fps_max = fps_max self.frame_stride = frame_stride self.frame_stride_min = frame_stride_min self.fixed_fps = fixed_fps self.load_raw_resolution = load_raw_resolution self.random_fs = random_fs self._load_metadata() if spatial_transform is not None: if spatial_transform == "random_crop": self.spatial_transform = transforms.RandomCrop(crop_resolution) elif spatial_transform == "center_crop": self.spatial_transform = transforms.Compose([ transforms.CenterCrop(resolution), ]) elif spatial_transform == "resize_center_crop": # assert(self.resolution[0] == self.resolution[1]) self.spatial_transform = transforms.Compose([ transforms.Resize(min(self.resolution)), transforms.CenterCrop(self.resolution), ]) elif spatial_transform == "resize": self.spatial_transform = transforms.Resize(self.resolution) else: raise NotImplementedError else: self.spatial_transform = None def _load_metadata(self): metadata = pd.read_csv(self.meta_path) print(f'>>> {len(metadata)} data samples loaded.') if self.subsample is not None: metadata = metadata.sample(self.subsample, random_state=0) metadata['caption'] = metadata['name'] del metadata['name'] self.metadata = metadata self.metadata.dropna(inplace=True) def _get_video_path(self, sample): rel_video_fp = os.path.join(sample['page_dir'], str(sample['videoid']) + '.mp4') full_video_fp = os.path.join(self.data_dir, 'videos', rel_video_fp) return full_video_fp def __getitem__(self, index): if self.random_fs: frame_stride = random.randint(self.frame_stride_min, self.frame_stride) else: frame_stride = self.frame_stride ## get frames until success while True: index = index % len(self.metadata) sample = self.metadata.iloc[index] video_path = self._get_video_path(sample) ## video_path should be in the format of "....../WebVid/videos/$page_dir/$videoid.mp4" caption = sample['caption'] try: if self.load_raw_resolution: video_reader = VideoReader(video_path, ctx=cpu(0)) else: video_reader = VideoReader(video_path, ctx=cpu(0), width=530, height=300) if len(video_reader) < self.video_length: print(f"video length ({len(video_reader)}) is smaller than target length({self.video_length})") index += 1 continue else: pass except: index += 1 print(f"Load video failed! path = {video_path}") continue fps_ori = video_reader.get_avg_fps() if self.fixed_fps is not None: frame_stride = int(frame_stride * (1.0 * fps_ori / self.fixed_fps)) ## to avoid extreme cases when fixed_fps is used frame_stride = max(frame_stride, 1) ## get valid range (adapting case by case) required_frame_num = frame_stride * (self.video_length-1) + 1 frame_num = len(video_reader) if frame_num < required_frame_num: ## drop extra samples if fixed fps is required if self.fixed_fps is not None and frame_num < required_frame_num * 0.5: index += 1 continue else: frame_stride = frame_num // self.video_length required_frame_num = frame_stride * (self.video_length-1) + 1 ## select a random clip random_range = frame_num - required_frame_num start_idx = random.randint(0, random_range) if random_range > 0 else 0 ## calculate frame indices frame_indices = [start_idx + frame_stride*i for i in range(self.video_length)] try: frames = video_reader.get_batch(frame_indices) break except: print(f"Get frames failed! path = {video_path}; [max_ind vs frame_total:{max(frame_indices)} / {frame_num}]") index += 1 continue ## process data assert(frames.shape[0] == self.video_length),f'{len(frames)}, self.video_length={self.video_length}' frames = torch.tensor(frames.asnumpy()).permute(3, 0, 1, 2).float() # [t,h,w,c] -> [c,t,h,w] if self.spatial_transform is not None: frames = self.spatial_transform(frames) if self.resolution is not None: assert (frames.shape[2], frames.shape[3]) == (self.resolution[0], self.resolution[1]), f'frames={frames.shape}, self.resolution={self.resolution}' ## turn frames tensors to [-1,1] frames = (frames / 255 - 0.5) * 2 fps_clip = fps_ori // frame_stride if self.fps_max is not None and fps_clip > self.fps_max: fps_clip = self.fps_max data = {'video': frames, 'caption': caption, 'path': video_path, 'fps': fps_clip, 'frame_stride': frame_stride} return data def __len__(self): return len(self.metadata) if __name__== "__main__": meta_path = "" ## path to the meta file data_dir = "" ## path to the data directory save_dir = "" ## path to the save directory dataset = WebVid(meta_path, data_dir, subsample=None, video_length=16, resolution=[256,448], frame_stride=4, spatial_transform="resize_center_crop", crop_resolution=None, fps_max=None, load_raw_resolution=True ) dataloader = DataLoader(dataset, batch_size=1, num_workers=0, shuffle=False) import sys sys.path.insert(1, os.path.join(sys.path[0], '..', '..')) from utils.save_video import tensor_to_mp4 for i, batch in tqdm(enumerate(dataloader), desc="Data Batch"): video = batch['video'] name = batch['path'][0].split('videos/')[-1].replace('/','_') tensor_to_mp4(video, save_dir+'/'+name, fps=8) ================================================ FILE: ToonCrafter/lvdm/distributions.py ================================================ import torch import numpy as np class AbstractDistribution: def sample(self): raise NotImplementedError() def mode(self): raise NotImplementedError() class DiracDistribution(AbstractDistribution): def __init__(self, value): self.value = value def sample(self): return self.value def mode(self): return self.value class DiagonalGaussianDistribution(object): def __init__(self, parameters, deterministic=False): self.parameters = parameters self.mean, self.logvar = torch.chunk(parameters, 2, dim=1) self.logvar = torch.clamp(self.logvar, -30.0, 20.0) self.deterministic = deterministic self.std = torch.exp(0.5 * self.logvar) self.var = torch.exp(self.logvar) if self.deterministic: self.var = self.std = torch.zeros_like(self.mean).to(device=self.parameters.device) def sample(self, noise=None): if noise is None: noise = torch.randn(self.mean.shape) x = self.mean + self.std * noise.to(device=self.parameters.device) return x def kl(self, other=None): if self.deterministic: return torch.Tensor([0.]) else: if other is None: return 0.5 * torch.sum(torch.pow(self.mean, 2) + self.var - 1.0 - self.logvar, dim=[1, 2, 3]) else: return 0.5 * torch.sum( torch.pow(self.mean - other.mean, 2) / other.var + self.var / other.var - 1.0 - self.logvar + other.logvar, dim=[1, 2, 3]) def nll(self, sample, dims=[1,2,3]): if self.deterministic: return torch.Tensor([0.]) logtwopi = np.log(2.0 * np.pi) return 0.5 * torch.sum( logtwopi + self.logvar + torch.pow(sample - self.mean, 2) / self.var, dim=dims) def mode(self): return self.mean def normal_kl(mean1, logvar1, mean2, logvar2): """ source: https://github.com/openai/guided-diffusion/blob/27c20a8fab9cb472df5d6bdd6c8d11c8f430b924/guided_diffusion/losses.py#L12 Compute the KL divergence between two gaussians. Shapes are automatically broadcasted, so batches can be compared to scalars, among other use cases. """ tensor = None for obj in (mean1, logvar1, mean2, logvar2): if isinstance(obj, torch.Tensor): tensor = obj break assert tensor is not None, "at least one argument must be a Tensor" # Force variances to be Tensors. Broadcasting helps convert scalars to # Tensors, but it does not work for torch.exp(). logvar1, logvar2 = [ x if isinstance(x, torch.Tensor) else torch.tensor(x).to(tensor) for x in (logvar1, logvar2) ] return 0.5 * ( -1.0 + logvar2 - logvar1 + torch.exp(logvar1 - logvar2) + ((mean1 - mean2) ** 2) * torch.exp(-logvar2) ) ================================================ FILE: ToonCrafter/lvdm/ema.py ================================================ import torch from torch import nn class LitEma(nn.Module): def __init__(self, model, decay=0.9999, use_num_upates=True): super().__init__() if decay < 0.0 or decay > 1.0: raise ValueError('Decay must be between 0 and 1') self.m_name2s_name = {} self.register_buffer('decay', torch.tensor(decay, dtype=torch.float32)) self.register_buffer('num_updates', torch.tensor(0,dtype=torch.int) if use_num_upates else torch.tensor(-1,dtype=torch.int)) for name, p in model.named_parameters(): if p.requires_grad: #remove as '.'-character is not allowed in buffers s_name = name.replace('.','') self.m_name2s_name.update({name:s_name}) self.register_buffer(s_name,p.clone().detach().data) self.collected_params = [] def forward(self,model): decay = self.decay if self.num_updates >= 0: self.num_updates += 1 decay = min(self.decay,(1 + self.num_updates) / (10 + self.num_updates)) one_minus_decay = 1.0 - decay with torch.no_grad(): m_param = dict(model.named_parameters()) shadow_params = dict(self.named_buffers()) for key in m_param: if m_param[key].requires_grad: sname = self.m_name2s_name[key] shadow_params[sname] = shadow_params[sname].type_as(m_param[key]) shadow_params[sname].sub_(one_minus_decay * (shadow_params[sname] - m_param[key])) else: assert not key in self.m_name2s_name def copy_to(self, model): m_param = dict(model.named_parameters()) shadow_params = dict(self.named_buffers()) for key in m_param: if m_param[key].requires_grad: m_param[key].data.copy_(shadow_params[self.m_name2s_name[key]].data) else: assert not key in self.m_name2s_name def store(self, parameters): """ Save the current parameters for restoring later. Args: parameters: Iterable of `torch.nn.Parameter`; the parameters to be temporarily stored. """ self.collected_params = [param.clone() for param in parameters] def restore(self, parameters): """ Restore the parameters stored with the `store` method. Useful to validate the model with EMA parameters without affecting the original optimization process. Store the parameters before the `copy_to` method. After validation (or model saving), use this to restore the former parameters. Args: parameters: Iterable of `torch.nn.Parameter`; the parameters to be updated with the stored parameters. """ for c_param, param in zip(self.collected_params, parameters): param.data.copy_(c_param.data) ================================================ FILE: ToonCrafter/lvdm/models/autoencoder.py ================================================ import os from contextlib import contextmanager import torch import numpy as np from einops import rearrange import torch.nn.functional as F import pytorch_lightning as pl from lvdm.modules.networks.ae_modules import Encoder, Decoder from lvdm.distributions import DiagonalGaussianDistribution from ToonCrafter.utils.utils import instantiate_from_config TIMESTEPS = 16 class AutoencoderKL(pl.LightningModule): def __init__(self, ddconfig, lossconfig, embed_dim, ckpt_path=None, ignore_keys=[], image_key="image", colorize_nlabels=None, monitor=None, test=False, logdir=None, input_dim=4, test_args=None, additional_decode_keys=None, use_checkpoint=False, diff_boost_factor=3.0, ): super().__init__() self.image_key = image_key self.encoder = Encoder(**ddconfig) self.decoder = Decoder(**ddconfig) self.loss = instantiate_from_config(lossconfig) assert ddconfig["double_z"] self.quant_conv = torch.nn.Conv2d(2 * ddconfig["z_channels"], 2 * embed_dim, 1) self.post_quant_conv = torch.nn.Conv2d(embed_dim, ddconfig["z_channels"], 1) self.embed_dim = embed_dim self.input_dim = input_dim self.test = test self.test_args = test_args self.logdir = logdir if colorize_nlabels is not None: assert isinstance(colorize_nlabels, int) self.register_buffer("colorize", torch.randn(3, colorize_nlabels, 1, 1)) if monitor is not None: self.monitor = monitor if ckpt_path is not None: self.init_from_ckpt(ckpt_path, ignore_keys=ignore_keys) if self.test: self.init_test() if torch.backends.mps.is_available(): self._device = torch.device("mps") def init_test(self,): self.test = True save_dir = os.path.join(self.logdir, "test") if 'ckpt' in self.test_args: ckpt_name = os.path.basename(self.test_args.ckpt).split('.ckpt')[0] + f'_epoch{self._cur_epoch}' self.root = os.path.join(save_dir, ckpt_name) else: self.root = save_dir if 'test_subdir' in self.test_args: self.root = os.path.join(save_dir, self.test_args.test_subdir) self.root_zs = os.path.join(self.root, "zs") self.root_dec = os.path.join(self.root, "reconstructions") self.root_inputs = os.path.join(self.root, "inputs") os.makedirs(self.root, exist_ok=True) if self.test_args.save_z: os.makedirs(self.root_zs, exist_ok=True) if self.test_args.save_reconstruction: os.makedirs(self.root_dec, exist_ok=True) if self.test_args.save_input: os.makedirs(self.root_inputs, exist_ok=True) assert (self.test_args is not None) self.test_maximum = getattr(self.test_args, 'test_maximum', None) self.count = 0 self.eval_metrics = {} self.decodes = [] self.save_decode_samples = 2048 def init_from_ckpt(self, path, ignore_keys=list()): sd = torch.load(path, map_location="cpu") try: self._cur_epoch = sd['epoch'] sd = sd["state_dict"] except BaseException: self._cur_epoch = 'null' keys = list(sd.keys()) for k in keys: for ik in ignore_keys: if k.startswith(ik): print("Deleting key {} from state_dict.".format(k)) del sd[k] self.load_state_dict(sd, strict=False) # self.load_state_dict(sd, strict=True) print(f"Restored from {path}") def encode(self, x, return_hidden_states=False, **kwargs): if return_hidden_states: h, hidden = self.encoder(x, return_hidden_states) moments = self.quant_conv(h) posterior = DiagonalGaussianDistribution(moments) return posterior, hidden else: h = self.encoder(x) moments = self.quant_conv(h) posterior = DiagonalGaussianDistribution(moments) return posterior def decode(self, z, **kwargs): if len(kwargs) == 0: # use the original decoder in AutoencoderKL z = self.post_quant_conv(z) dec = self.decoder(z, **kwargs) # change for SVD decoder by adding **kwargs return dec def forward(self, input, sample_posterior=True, **additional_decode_kwargs): input_tuple = (input, ) forward_temp = partial(self._forward, sample_posterior=sample_posterior, **additional_decode_kwargs) return checkpoint(forward_temp, input_tuple, self.parameters(), self.use_checkpoint) def _forward(self, input, sample_posterior=True, **additional_decode_kwargs): posterior = self.encode(input) if sample_posterior: z = posterior.sample() else: z = posterior.mode() dec = self.decode(z, **additional_decode_kwargs) # print(input.shape, dec.shape) torch.Size([16, 3, 256, 256]) torch.Size([16, 3, 256, 256]) return dec, posterior def get_input(self, batch, k): x = batch[k] if x.dim() == 5 and self.input_dim == 4: b, c, t, h, w = x.shape self.b = b self.t = t x = rearrange(x, 'b c t h w -> (b t) c h w') return x def training_step(self, batch, batch_idx, optimizer_idx): inputs = self.get_input(batch, self.image_key) reconstructions, posterior = self(inputs) if optimizer_idx == 0: # train encoder+decoder+logvar aeloss, log_dict_ae = self.loss(inputs, reconstructions, posterior, optimizer_idx, self.global_step, last_layer=self.get_last_layer(), split="train") self.log("aeloss", aeloss, prog_bar=True, logger=True, on_step=True, on_epoch=True) self.log_dict(log_dict_ae, prog_bar=False, logger=True, on_step=True, on_epoch=False) return aeloss if optimizer_idx == 1: # train the discriminator discloss, log_dict_disc = self.loss(inputs, reconstructions, posterior, optimizer_idx, self.global_step, last_layer=self.get_last_layer(), split="train") self.log("discloss", discloss, prog_bar=True, logger=True, on_step=True, on_epoch=True) self.log_dict(log_dict_disc, prog_bar=False, logger=True, on_step=True, on_epoch=False) return discloss def validation_step(self, batch, batch_idx): inputs = self.get_input(batch, self.image_key) reconstructions, posterior = self(inputs) aeloss, log_dict_ae = self.loss(inputs, reconstructions, posterior, 0, self.global_step, last_layer=self.get_last_layer(), split="val") discloss, log_dict_disc = self.loss(inputs, reconstructions, posterior, 1, self.global_step, last_layer=self.get_last_layer(), split="val") self.log("val/rec_loss", log_dict_ae["val/rec_loss"]) self.log_dict(log_dict_ae) self.log_dict(log_dict_disc) return self.log_dict def configure_optimizers(self): lr = self.learning_rate opt_ae = torch.optim.Adam(list(self.encoder.parameters()) + list(self.decoder.parameters()) + list(self.quant_conv.parameters()) + list(self.post_quant_conv.parameters()), lr=lr, betas=(0.5, 0.9)) opt_disc = torch.optim.Adam(self.loss.discriminator.parameters(), lr=lr, betas=(0.5, 0.9)) return [opt_ae, opt_disc], [] def get_last_layer(self): return self.decoder.conv_out.weight @torch.no_grad() def log_images(self, batch, only_inputs=False, **kwargs): log = dict() x = self.get_input(batch, self.image_key) x = x.to(self.device) if not only_inputs: xrec, posterior = self(x) if x.shape[1] > 3: # colorize with random projection assert xrec.shape[1] > 3 x = self.to_rgb(x) xrec = self.to_rgb(xrec) log["samples"] = self.decode(torch.randn_like(posterior.sample())) log["reconstructions"] = xrec log["inputs"] = x return log def to_rgb(self, x): assert self.image_key == "segmentation" if not hasattr(self, "colorize"): self.register_buffer("colorize", torch.randn(3, x.shape[1], 1, 1).to(x)) x = F.conv2d(x, weight=self.colorize) x = 2. * (x - x.min()) / (x.max() - x.min()) - 1. return x class IdentityFirstStage(torch.nn.Module): def __init__(self, *args, vq_interface=False, **kwargs): self.vq_interface = vq_interface # TODO: Should be true by default but check to not break older stuff super().__init__() def encode(self, x, *args, **kwargs): return x def decode(self, x, *args, **kwargs): return x def quantize(self, x, *args, **kwargs): if self.vq_interface: return x, None, [None, None, None] return x def forward(self, x, *args, **kwargs): return x from lvdm.models.autoencoder_dualref import VideoDecoder class AutoencoderKL_Dualref(AutoencoderKL): def __init__(self, ddconfig, lossconfig, embed_dim, ckpt_path=None, ignore_keys=[], image_key="image", colorize_nlabels=None, monitor=None, test=False, logdir=None, input_dim=4, test_args=None, additional_decode_keys=None, use_checkpoint=False, diff_boost_factor=3.0, ): super().__init__(ddconfig, lossconfig, embed_dim, ckpt_path, ignore_keys, image_key, colorize_nlabels, monitor, test, logdir, input_dim, test_args, additional_decode_keys, use_checkpoint, diff_boost_factor) self.decoder = VideoDecoder(**ddconfig) def _forward(self, input, sample_posterior=True, **additional_decode_kwargs): posterior, hidden_states = self.encode(input, return_hidden_states=True) hidden_states_first_last = [] # use only the first and last hidden states for hid in hidden_states: hid = rearrange(hid, '(b t) c h w -> b c t h w', t=TIMESTEPS) hid_new = torch.cat([hid[:, :, 0:1], hid[:, :, -1:]], dim=2) hidden_states_first_last.append(hid_new) if sample_posterior: z = posterior.sample() else: z = posterior.mode() dec = self.decode(z, ref_context=hidden_states_first_last, **additional_decode_kwargs) # print(input.shape, dec.shape) torch.Size([16, 3, 256, 256]) torch.Size([16, 3, 256, 256]) return dec, posterior ================================================ FILE: ToonCrafter/lvdm/models/autoencoder_dualref.py ================================================ #### https://github.com/Stability-AI/generative-models from einops import rearrange, repeat import logging from typing import Any, Callable, Optional, Iterable, Union import os import numpy as np import torch import torch.nn as nn from packaging import version logpy = logging.getLogger(__name__) try: import xformers import xformers.ops XFORMERS_IS_AVAILABLE = True except BaseException: XFORMERS_IS_AVAILABLE = False logpy.warning("no module 'xformers'. Processing without...") from lvdm.modules.attention_svd import LinearAttention, CrossAttention, MemoryEfficientCrossAttention def nonlinearity(x): # swish return x * torch.sigmoid(x) def Normalize(in_channels, num_groups=32): return torch.nn.GroupNorm( num_groups=num_groups, num_channels=in_channels, eps=1e-6, affine=True ) class ResnetBlock(nn.Module): def __init__( self, *, in_channels, out_channels=None, conv_shortcut=False, dropout, temb_channels=512, ): super().__init__() self.in_channels = in_channels out_channels = in_channels if out_channels is None else out_channels self.out_channels = out_channels self.use_conv_shortcut = conv_shortcut self.norm1 = Normalize(in_channels) self.conv1 = torch.nn.Conv2d( in_channels, out_channels, kernel_size=3, stride=1, padding=1 ) if temb_channels > 0: self.temb_proj = torch.nn.Linear(temb_channels, out_channels) self.norm2 = Normalize(out_channels) self.dropout = torch.nn.Dropout(dropout) self.conv2 = torch.nn.Conv2d( out_channels, out_channels, kernel_size=3, stride=1, padding=1 ) if self.in_channels != self.out_channels: if self.use_conv_shortcut: self.conv_shortcut = torch.nn.Conv2d( in_channels, out_channels, kernel_size=3, stride=1, padding=1 ) else: self.nin_shortcut = torch.nn.Conv2d( in_channels, out_channels, kernel_size=1, stride=1, padding=0 ) def forward(self, x, temb): h = x h = self.norm1(h) h = nonlinearity(h) h = self.conv1(h) if temb is not None: h = h + self.temb_proj(nonlinearity(temb))[:, :, None, None] h = self.norm2(h) h = nonlinearity(h) h = self.dropout(h) h = self.conv2(h) if self.in_channels != self.out_channels: if self.use_conv_shortcut: x = self.conv_shortcut(x) else: x = self.nin_shortcut(x) return x + h class LinAttnBlock(LinearAttention): """to match AttnBlock usage""" def __init__(self, in_channels): super().__init__(dim=in_channels, heads=1, dim_head=in_channels) class AttnBlock(nn.Module): def __init__(self, in_channels): super().__init__() self.in_channels = in_channels self.norm = Normalize(in_channels) self.q = torch.nn.Conv2d( in_channels, in_channels, kernel_size=1, stride=1, padding=0 ) self.k = torch.nn.Conv2d( in_channels, in_channels, kernel_size=1, stride=1, padding=0 ) self.v = torch.nn.Conv2d( in_channels, in_channels, kernel_size=1, stride=1, padding=0 ) self.proj_out = torch.nn.Conv2d( in_channels, in_channels, kernel_size=1, stride=1, padding=0 ) def attention(self, h_: torch.Tensor) -> torch.Tensor: h_ = self.norm(h_) q = self.q(h_) k = self.k(h_) v = self.v(h_) b, c, h, w = q.shape q, k, v = map( lambda x: rearrange(x, "b c h w -> b 1 (h w) c").contiguous(), (q, k, v) ) h_ = torch.nn.functional.scaled_dot_product_attention( q, k, v ) # scale is dim ** -0.5 per default # compute attention return rearrange(h_, "b 1 (h w) c -> b c h w", h=h, w=w, c=c, b=b) def forward(self, x, **kwargs): h_ = x h_ = self.attention(h_) h_ = self.proj_out(h_) return x + h_ class MemoryEfficientAttnBlock(nn.Module): """ Uses xformers efficient implementation, see https://github.com/MatthieuTPHR/diffusers/blob/d80b531ff8060ec1ea982b65a1b8df70f73aa67c/src/diffusers/models/attention.py#L223 Note: this is a single-head self-attention operation """ # def __init__(self, in_channels): super().__init__() self.in_channels = in_channels self.norm = Normalize(in_channels) self.q = torch.nn.Conv2d( in_channels, in_channels, kernel_size=1, stride=1, padding=0 ) self.k = torch.nn.Conv2d( in_channels, in_channels, kernel_size=1, stride=1, padding=0 ) self.v = torch.nn.Conv2d( in_channels, in_channels, kernel_size=1, stride=1, padding=0 ) self.proj_out = torch.nn.Conv2d( in_channels, in_channels, kernel_size=1, stride=1, padding=0 ) self.attention_op: Optional[Any] = None def attention(self, h_: torch.Tensor) -> torch.Tensor: h_ = self.norm(h_) q = self.q(h_) k = self.k(h_) v = self.v(h_) # compute attention B, C, H, W = q.shape q, k, v = map(lambda x: rearrange(x, "b c h w -> b (h w) c"), (q, k, v)) q, k, v = map( lambda t: t.unsqueeze(3) .reshape(B, t.shape[1], 1, C) .permute(0, 2, 1, 3) .reshape(B * 1, t.shape[1], C) .contiguous(), (q, k, v), ) out = xformers.ops.memory_efficient_attention( q, k, v, attn_bias=None, op=self.attention_op ) out = ( out.unsqueeze(0) .reshape(B, 1, out.shape[1], C) .permute(0, 2, 1, 3) .reshape(B, out.shape[1], C) ) return rearrange(out, "b (h w) c -> b c h w", b=B, h=H, w=W, c=C) def forward(self, x, **kwargs): h_ = x h_ = self.attention(h_) h_ = self.proj_out(h_) return x + h_ class CrossAttentionWrapper(CrossAttention): def forward(self, x, context=None, mask=None, **unused_kwargs): b, c, h, w = x.shape x = rearrange(x, "b c h w -> b 1 (h w) c").contiguous() out = super().forward(x, context=context, mask=mask) out = rearrange(out, "b 1 (h w) c -> b c h w", h=h, w=w, c=c, b=b) return x + out class MemoryEfficientCrossAttentionWrapper(MemoryEfficientCrossAttention): def forward(self, x, context=None, mask=None, **unused_kwargs): b, c, h, w = x.shape x = rearrange(x, "b c h w -> b (h w) c") out = super().forward(x, context=context, mask=mask) out = rearrange(out, "b (h w) c -> b c h w", h=h, w=w, c=c) return x + out def make_attn(in_channels, attn_type="vanilla", attn_kwargs=None): assert attn_type in [ "vanilla", "vanilla-xformers", "cross-attn", "memory-efficient-cross-attn", "linear", "none", "cross-attn-fusion", "memory-efficient-cross-attn-fusion", ], f"attn_type {attn_type} unknown" if ( version.parse(torch.__version__) < version.parse("2.0.0") and attn_type != "none" ): assert XFORMERS_IS_AVAILABLE, ( f"We do not support vanilla attention in {torch.__version__} anymore, " f"as it is too expensive. Please install xformers via e.g. 'pip install xformers==0.0.16'" ) # attn_type = "vanilla-xformers" logpy.info(f"making attention of type '{attn_type}' with {in_channels} in_channels") if attn_type == "vanilla": assert attn_kwargs is None return AttnBlock(in_channels) elif attn_type == "vanilla-xformers": logpy.info( f"building MemoryEfficientAttnBlock with {in_channels} in_channels..." ) return MemoryEfficientAttnBlock(in_channels) elif attn_type == "cross-attn": attn_kwargs["query_dim"] = in_channels return CrossAttentionWrapper(**attn_kwargs) elif attn_type == "memory-efficient-cross-attn": attn_kwargs["query_dim"] = in_channels return MemoryEfficientCrossAttentionWrapper(**attn_kwargs) elif attn_type == "cross-attn-fusion": attn_kwargs["query_dim"] = in_channels return CrossAttentionWrapperFusion(**attn_kwargs) elif attn_type == "memory-efficient-cross-attn-fusion": attn_kwargs["query_dim"] = in_channels return MemoryEfficientCrossAttentionWrapperFusion(**attn_kwargs) elif attn_type == "none": return nn.Identity(in_channels) else: return LinAttnBlock(in_channels) class CrossAttentionWrapperFusion(CrossAttention): def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0, **kwargs): super().__init__(query_dim, context_dim, heads, dim_head, dropout, **kwargs) self.dim_head = dim_head self.norm = Normalize(query_dim) nn.init.zeros_(self.to_out[0].weight) nn.init.zeros_(self.to_out[0].bias) def forward(self, x, context=None, mask=None): if self.training: return checkpoint(self._forward, x, context, mask, use_reentrant=False) else: return self._forward(x, context, mask) def _forward( self, x, context=None, mask=None, ): bt, c, h, w = x.shape h_ = self.norm(x) h_ = rearrange(h_, "b c h w -> b (h w) c") q = self.to_q(h_) b, c, l, h, w = context.shape context = rearrange(context, "b c l h w -> (b l) (h w) c") k = self.to_k(context) v = self.to_v(context) k = rearrange(k, "(b l) d c -> b l d c", l=l) k = torch.cat([k[:, [0] * (bt // b)], k[:, [1] * (bt // b)]], dim=2) k = rearrange(k, "b l d c -> (b l) d c") v = rearrange(v, "(b l) d c -> b l d c", l=l) v = torch.cat([v[:, [0] * (bt // b)], v[:, [1] * (bt // b)]], dim=2) v = rearrange(v, "b l d c -> (b l) d c") b, _, _ = q.shape # actually bt q, k, v = map( lambda t: t.unsqueeze(3) .reshape(b, t.shape[1], self.heads, self.dim_head) .permute(0, 2, 1, 3) .reshape(b * self.heads, t.shape[1], self.dim_head) .contiguous(), (q, k, v), ) sdpa = torch.nn.functional.scaled_dot_product_attention def slow_sdpa(q, k, v): out_list = [] step = 10 for i in range(0, q.shape[0], step): out_i = sdpa(q[i:i + step], k[i:i + step], v[i:i + step]) out_list.append(out_i) return torch.cat(out_list, dim=0) # mps 将 qkv 分十次处理, 避免内存溢出 if os.environ.get("TOON_MEM_STRATEGY", "none") == "low": out = slow_sdpa(q, k, v) else: out = sdpa(q, k, v) out = ( out.unsqueeze(0) .reshape(b, self.heads, out.shape[1], self.dim_head) .permute(0, 2, 1, 3) .reshape(b, out.shape[1], self.heads * self.dim_head) ) out = self.to_out(out) out = rearrange(out, "bt (h w) c -> bt c h w", h=h, w=w, c=c) return x + out class MemoryEfficientCrossAttentionWrapperFusion(MemoryEfficientCrossAttention): # print('x.shape: ',x.shape, 'context.shape: ',context.shape) ##torch.Size([8, 128, 256, 256]) torch.Size([1, 128, 2, 256, 256]) def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0, **kwargs): super().__init__(query_dim, context_dim, heads, dim_head, dropout, **kwargs) self.norm = Normalize(query_dim) nn.init.zeros_(self.to_out[0].weight) nn.init.zeros_(self.to_out[0].bias) def forward(self, x, context=None, mask=None): if self.training: return checkpoint(self._forward, x, context, mask, use_reentrant=False) else: return self._forward(x, context, mask) def _forward( self, x, context=None, mask=None, ): bt, c, h, w = x.shape h_ = self.norm(x) h_ = rearrange(h_, "b c h w -> b (h w) c") q = self.to_q(h_) b, c, l, h, w = context.shape context = rearrange(context, "b c l h w -> (b l) (h w) c") k = self.to_k(context) v = self.to_v(context) k = rearrange(k, "(b l) d c -> b l d c", l=l) k = torch.cat([k[:, [0] * (bt // b)], k[:, [1] * (bt // b)]], dim=2) k = rearrange(k, "b l d c -> (b l) d c") v = rearrange(v, "(b l) d c -> b l d c", l=l) v = torch.cat([v[:, [0] * (bt // b)], v[:, [1] * (bt // b)]], dim=2) v = rearrange(v, "b l d c -> (b l) d c") b, _, _ = q.shape # actually bt q, k, v = map( lambda t: t.unsqueeze(3) .reshape(b, t.shape[1], self.heads, self.dim_head) .permute(0, 2, 1, 3) .reshape(b * self.heads, t.shape[1], self.dim_head) .contiguous(), (q, k, v), ) # actually compute the attention, what we cannot get enough of if version.parse(xformers.__version__) >= version.parse("0.0.21"): # NOTE: workaround for # https://github.com/facebookresearch/xformers/issues/845 max_bs = 32768 N = q.shape[0] n_batches = math.ceil(N / max_bs) out = list() for i_batch in range(n_batches): batch = slice(i_batch * max_bs, (i_batch + 1) * max_bs) out.append( xformers.ops.memory_efficient_attention( q[batch], k[batch], v[batch], attn_bias=None, op=self.attention_op, ) ) out = torch.cat(out, 0) else: out = xformers.ops.memory_efficient_attention( q, k, v, attn_bias=None, op=self.attention_op ) # TODO: Use this directly in the attention operation, as a bias if exists(mask): raise NotImplementedError out = ( out.unsqueeze(0) .reshape(b, self.heads, out.shape[1], self.dim_head) .permute(0, 2, 1, 3) .reshape(b, out.shape[1], self.heads * self.dim_head) ) out = self.to_out(out) out = rearrange(out, "bt (h w) c -> bt c h w", h=h, w=w, c=c) return x + out class Combiner(nn.Module): def __init__(self, ch) -> None: super().__init__() self.conv = nn.Conv2d(ch, ch, 1, padding=0) nn.init.zeros_(self.conv.weight) nn.init.zeros_(self.conv.bias) def forward(self, x, context): if self.training: return checkpoint(self._forward, x, context, use_reentrant=False) else: return self._forward(x, context) def _forward(self, x, context): # x: b c h w, context: b c 2 h w b, c, l, h, w = context.shape bt, c, h, w = x.shape context = rearrange(context, "b c l h w -> (b l) c h w") context = self.conv(context) context = rearrange(context, "(b l) c h w -> b c l h w", l=l) x = rearrange(x, "(b t) c h w -> b c t h w", t=bt // b) x[:, :, 0] = x[:, :, 0] + context[:, :, 0] x[:, :, -1] = x[:, :, -1] + context[:, :, 1] x = rearrange(x, "b c t h w -> (b t) c h w") return x class Decoder(nn.Module): def __init__( self, *, ch, out_ch, ch_mult=(1, 2, 4, 8), num_res_blocks, attn_resolutions, dropout=0.0, resamp_with_conv=True, in_channels, resolution, z_channels, give_pre_end=False, tanh_out=False, use_linear_attn=False, attn_type="vanilla-xformers", attn_level=[2, 3], **ignorekwargs, ): if "vanilla-xformers" and XFORMERS_IS_AVAILABLE is False: attn_type = "vanilla" super().__init__() if use_linear_attn: attn_type = "linear" self.ch = ch self.temb_ch = 0 self.num_resolutions = len(ch_mult) self.num_res_blocks = num_res_blocks self.resolution = resolution self.in_channels = in_channels self.give_pre_end = give_pre_end self.tanh_out = tanh_out self.attn_level = attn_level # compute in_ch_mult, block_in and curr_res at lowest res in_ch_mult = (1,) + tuple(ch_mult) block_in = ch * ch_mult[self.num_resolutions - 1] curr_res = resolution // 2 ** (self.num_resolutions - 1) self.z_shape = (1, z_channels, curr_res, curr_res) logpy.info( "Working with z of shape {} = {} dimensions.".format( self.z_shape, np.prod(self.z_shape) ) ) make_attn_cls = self._make_attn() make_resblock_cls = self._make_resblock() make_conv_cls = self._make_conv() # z to block_in self.conv_in = torch.nn.Conv2d( z_channels, block_in, kernel_size=3, stride=1, padding=1 ) # middle self.mid = nn.Module() self.mid.block_1 = make_resblock_cls( in_channels=block_in, out_channels=block_in, temb_channels=self.temb_ch, dropout=dropout, ) self.mid.attn_1 = make_attn_cls(block_in, attn_type=attn_type) self.mid.block_2 = make_resblock_cls( in_channels=block_in, out_channels=block_in, temb_channels=self.temb_ch, dropout=dropout, ) # upsampling self.up = nn.ModuleList() self.attn_refinement = nn.ModuleList() for i_level in reversed(range(self.num_resolutions)): block = nn.ModuleList() attn = nn.ModuleList() block_out = ch * ch_mult[i_level] for i_block in range(self.num_res_blocks + 1): block.append( make_resblock_cls( in_channels=block_in, out_channels=block_out, temb_channels=self.temb_ch, dropout=dropout, ) ) block_in = block_out if curr_res in attn_resolutions: attn.append(make_attn_cls(block_in, attn_type=attn_type)) up = nn.Module() up.block = block up.attn = attn if i_level != 0: up.upsample = Upsample(block_in, resamp_with_conv) curr_res = curr_res * 2 self.up.insert(0, up) # prepend to get consistent order if i_level in self.attn_level: _attn_type = 'memory-efficient-cross-attn-fusion' if XFORMERS_IS_AVAILABLE else 'cross-attn-fusion' self.attn_refinement.insert(0, make_attn_cls(block_in, attn_type=_attn_type, attn_kwargs={})) else: self.attn_refinement.insert(0, Combiner(block_in)) # end self.norm_out = Normalize(block_in) self.attn_refinement.append(Combiner(block_in)) self.conv_out = make_conv_cls( block_in, out_ch, kernel_size=3, stride=1, padding=1 ) def _make_attn(self) -> Callable: return make_attn def _make_resblock(self) -> Callable: return ResnetBlock def _make_conv(self) -> Callable: return torch.nn.Conv2d def get_last_layer(self, **kwargs): return self.conv_out.weight def forward(self, z, ref_context=None, **kwargs): # ref_context: b c 2 h w, 2 means starting and ending frame # assert z.shape[1:] == self.z_shape[1:] self.last_z_shape = z.shape # timestep embedding temb = None # z to block_in h = self.conv_in(z) # middle h = self.mid.block_1(h, temb, **kwargs) h = self.mid.attn_1(h, **kwargs) h = self.mid.block_2(h, temb, **kwargs) # upsampling for i_level in reversed(range(self.num_resolutions)): for i_block in range(self.num_res_blocks + 1): h = self.up[i_level].block[i_block](h, temb, **kwargs) if len(self.up[i_level].attn) > 0: h = self.up[i_level].attn[i_block](h, **kwargs) if ref_context: h = self.attn_refinement[i_level](x=h, context=ref_context[i_level]) if i_level != 0: h = self.up[i_level].upsample(h) # end if self.give_pre_end: return h h = self.norm_out(h) h = nonlinearity(h) if ref_context: # print(h.shape, ref_context[i_level].shape) #torch.Size([8, 128, 256, 256]) torch.Size([1, 128, 2, 256, 256]) h = self.attn_refinement[-1](x=h, context=ref_context[-1]) h = self.conv_out(h, **kwargs) if self.tanh_out: h = torch.tanh(h) return h ##### from abc import abstractmethod from lvdm.models.utils_diffusion import timestep_embedding from torch.utils.checkpoint import checkpoint from lvdm.basics import ( zero_module, conv_nd, linear, normalization, ) from lvdm.modules.networks.openaimodel3d import Upsample, Downsample class TimestepBlock(nn.Module): """ Any module where forward() takes timestep embeddings as a second argument. """ @abstractmethod def forward(self, x: torch.Tensor, emb: torch.Tensor): """ Apply the module to `x` given `emb` timestep embeddings. """ class ResBlock(TimestepBlock): """ A residual block that can optionally change the number of channels. :param channels: the number of input channels. :param emb_channels: the number of timestep embedding channels. :param dropout: the rate of dropout. :param out_channels: if specified, the number of out channels. :param use_conv: if True and out_channels is specified, use a spatial convolution instead of a smaller 1x1 convolution to change the channels in the skip connection. :param dims: determines if the signal is 1D, 2D, or 3D. :param use_checkpoint: if True, use gradient checkpointing on this module. :param up: if True, use this block for upsampling. :param down: if True, use this block for downsampling. """ def __init__( self, channels: int, emb_channels: int, dropout: float, out_channels: Optional[int] = None, use_conv: bool = False, use_scale_shift_norm: bool = False, dims: int = 2, use_checkpoint: bool = False, up: bool = False, down: bool = False, kernel_size: int = 3, exchange_temb_dims: bool = False, skip_t_emb: bool = False, ): super().__init__() self.channels = channels self.emb_channels = emb_channels self.dropout = dropout self.out_channels = out_channels or channels self.use_conv = use_conv self.use_checkpoint = use_checkpoint self.use_scale_shift_norm = use_scale_shift_norm self.exchange_temb_dims = exchange_temb_dims if isinstance(kernel_size, Iterable): padding = [k // 2 for k in kernel_size] else: padding = kernel_size // 2 self.in_layers = nn.Sequential( normalization(channels), nn.SiLU(), conv_nd(dims, channels, self.out_channels, kernel_size, padding=padding), ) self.updown = up or down if up: self.h_upd = Upsample(channels, False, dims) self.x_upd = Upsample(channels, False, dims) elif down: self.h_upd = Downsample(channels, False, dims) self.x_upd = Downsample(channels, False, dims) else: self.h_upd = self.x_upd = nn.Identity() self.skip_t_emb = skip_t_emb self.emb_out_channels = ( 2 * self.out_channels if use_scale_shift_norm else self.out_channels ) if self.skip_t_emb: # print(f"Skipping timestep embedding in {self.__class__.__name__}") assert not self.use_scale_shift_norm self.emb_layers = None self.exchange_temb_dims = False else: self.emb_layers = nn.Sequential( nn.SiLU(), linear( emb_channels, self.emb_out_channels, ), ) self.out_layers = nn.Sequential( normalization(self.out_channels), nn.SiLU(), nn.Dropout(p=dropout), zero_module( conv_nd( dims, self.out_channels, self.out_channels, kernel_size, padding=padding, ) ), ) if self.out_channels == channels: self.skip_connection = nn.Identity() elif use_conv: self.skip_connection = conv_nd( dims, channels, self.out_channels, kernel_size, padding=padding ) else: self.skip_connection = conv_nd(dims, channels, self.out_channels, 1) def forward(self, x: torch.Tensor, emb: torch.Tensor) -> torch.Tensor: """ Apply the block to a Tensor, conditioned on a timestep embedding. :param x: an [N x C x ...] Tensor of features. :param emb: an [N x emb_channels] Tensor of timestep embeddings. :return: an [N x C x ...] Tensor of outputs. """ if self.use_checkpoint: return checkpoint(self._forward, x, emb, use_reentrant=False) else: return self._forward(x, emb) def _forward(self, x: torch.Tensor, emb: torch.Tensor) -> torch.Tensor: if self.updown: in_rest, in_conv = self.in_layers[:-1], self.in_layers[-1] h = in_rest(x) h = self.h_upd(h) x = self.x_upd(x) h = in_conv(h) else: h = self.in_layers(x) if self.skip_t_emb: emb_out = torch.zeros_like(h) else: emb_out = self.emb_layers(emb).type(h.dtype) while len(emb_out.shape) < len(h.shape): emb_out = emb_out[..., None] if self.use_scale_shift_norm: out_norm, out_rest = self.out_layers[0], self.out_layers[1:] scale, shift = torch.chunk(emb_out, 2, dim=1) h = out_norm(h) * (1 + scale) + shift h = out_rest(h) else: if self.exchange_temb_dims: emb_out = rearrange(emb_out, "b t c ... -> b c t ...") h = h + emb_out h = self.out_layers(h) return self.skip_connection(x) + h ##### ##### from lvdm.modules.attention_svd import * class VideoTransformerBlock(nn.Module): ATTENTION_MODES = { "softmax": CrossAttention, "softmax-xformers": MemoryEfficientCrossAttention, } def __init__( self, dim, n_heads, d_head, dropout=0.0, context_dim=None, gated_ff=True, checkpoint=True, timesteps=None, ff_in=False, inner_dim=None, attn_mode="softmax", disable_self_attn=False, disable_temporal_crossattention=False, switch_temporal_ca_to_sa=False, ): super().__init__() attn_cls = self.ATTENTION_MODES[attn_mode] self.ff_in = ff_in or inner_dim is not None if inner_dim is None: inner_dim = dim assert int(n_heads * d_head) == inner_dim self.is_res = inner_dim == dim if self.ff_in: self.norm_in = nn.LayerNorm(dim) self.ff_in = FeedForward( dim, dim_out=inner_dim, dropout=dropout, glu=gated_ff ) self.timesteps = timesteps self.disable_self_attn = disable_self_attn if self.disable_self_attn: self.attn1 = attn_cls( query_dim=inner_dim, heads=n_heads, dim_head=d_head, context_dim=context_dim, dropout=dropout, ) # is a cross-attention else: self.attn1 = attn_cls( query_dim=inner_dim, heads=n_heads, dim_head=d_head, dropout=dropout ) # is a self-attention self.ff = FeedForward(inner_dim, dim_out=dim, dropout=dropout, glu=gated_ff) if disable_temporal_crossattention: if switch_temporal_ca_to_sa: raise ValueError else: self.attn2 = None else: self.norm2 = nn.LayerNorm(inner_dim) if switch_temporal_ca_to_sa: self.attn2 = attn_cls( query_dim=inner_dim, heads=n_heads, dim_head=d_head, dropout=dropout ) # is a self-attention else: self.attn2 = attn_cls( query_dim=inner_dim, context_dim=context_dim, heads=n_heads, dim_head=d_head, dropout=dropout, ) # is self-attn if context is none self.norm1 = nn.LayerNorm(inner_dim) self.norm3 = nn.LayerNorm(inner_dim) self.switch_temporal_ca_to_sa = switch_temporal_ca_to_sa self.checkpoint = checkpoint if self.checkpoint: print(f"====>{self.__class__.__name__} is using checkpointing") else: print(f"====>{self.__class__.__name__} is NOT using checkpointing") def forward( self, x: torch.Tensor, context: torch.Tensor = None, timesteps: int = None ) -> torch.Tensor: if self.checkpoint: return checkpoint(self._forward, x, context, timesteps, use_reentrant=False) else: return self._forward(x, context, timesteps=timesteps) def _forward(self, x, context=None, timesteps=None): assert self.timesteps or timesteps assert not (self.timesteps and timesteps) or self.timesteps == timesteps timesteps = self.timesteps or timesteps B, S, C = x.shape x = rearrange(x, "(b t) s c -> (b s) t c", t=timesteps) if self.ff_in: x_skip = x x = self.ff_in(self.norm_in(x)) if self.is_res: x += x_skip if self.disable_self_attn: x = self.attn1(self.norm1(x), context=context) + x else: x = self.attn1(self.norm1(x)) + x if self.attn2 is not None: if self.switch_temporal_ca_to_sa: x = self.attn2(self.norm2(x)) + x else: x = self.attn2(self.norm2(x), context=context) + x x_skip = x x = self.ff(self.norm3(x)) if self.is_res: x += x_skip x = rearrange( x, "(b s) t c -> (b t) s c", s=S, b=B // timesteps, c=C, t=timesteps ) return x def get_last_layer(self): return self.ff.net[-1].weight ##### ##### import functools def partialclass(cls, *args, **kwargs): class NewCls(cls): __init__ = functools.partialmethod(cls.__init__, *args, **kwargs) return NewCls ###### class VideoResBlock(ResnetBlock): def __init__( self, out_channels, *args, dropout=0.0, video_kernel_size=3, alpha=0.0, merge_strategy="learned", **kwargs, ): super().__init__(out_channels=out_channels, dropout=dropout, *args, **kwargs) if video_kernel_size is None: video_kernel_size = [3, 1, 1] self.time_stack = ResBlock( channels=out_channels, emb_channels=0, dropout=dropout, dims=3, use_scale_shift_norm=False, use_conv=False, up=False, down=False, kernel_size=video_kernel_size, use_checkpoint=True, skip_t_emb=True, ) self.merge_strategy = merge_strategy if self.merge_strategy == "fixed": self.register_buffer("mix_factor", torch.Tensor([alpha])) elif self.merge_strategy == "learned": self.register_parameter( "mix_factor", torch.nn.Parameter(torch.Tensor([alpha])) ) else: raise ValueError(f"unknown merge strategy {self.merge_strategy}") def get_alpha(self, bs): if self.merge_strategy == "fixed": return self.mix_factor elif self.merge_strategy == "learned": return torch.sigmoid(self.mix_factor) else: raise NotImplementedError() def forward(self, x, temb, skip_video=False, timesteps=None): if timesteps is None: timesteps = self.timesteps b, c, h, w = x.shape x = super().forward(x, temb) if not skip_video: x_mix = rearrange(x, "(b t) c h w -> b c t h w", t=timesteps) x = rearrange(x, "(b t) c h w -> b c t h w", t=timesteps) x = self.time_stack(x, temb) alpha = self.get_alpha(bs=b // timesteps) x = alpha * x + (1.0 - alpha) * x_mix x = rearrange(x, "b c t h w -> (b t) c h w") return x class AE3DConv(torch.nn.Conv2d): def __init__(self, in_channels, out_channels, video_kernel_size=3, *args, **kwargs): super().__init__(in_channels, out_channels, *args, **kwargs) if isinstance(video_kernel_size, Iterable): padding = [int(k // 2) for k in video_kernel_size] else: padding = int(video_kernel_size // 2) self.time_mix_conv = torch.nn.Conv3d( in_channels=out_channels, out_channels=out_channels, kernel_size=video_kernel_size, padding=padding, ) def forward(self, input, timesteps, skip_video=False): x = super().forward(input) if skip_video: return x x = rearrange(x, "(b t) c h w -> b c t h w", t=timesteps) x = self.time_mix_conv(x) return rearrange(x, "b c t h w -> (b t) c h w") class VideoBlock(AttnBlock): def __init__( self, in_channels: int, alpha: float = 0, merge_strategy: str = "learned" ): super().__init__(in_channels) # no context, single headed, as in base class self.time_mix_block = VideoTransformerBlock( dim=in_channels, n_heads=1, d_head=in_channels, checkpoint=True, ff_in=True, attn_mode="softmax", ) time_embed_dim = self.in_channels * 4 self.video_time_embed = torch.nn.Sequential( torch.nn.Linear(self.in_channels, time_embed_dim), torch.nn.SiLU(), torch.nn.Linear(time_embed_dim, self.in_channels), ) self.merge_strategy = merge_strategy if self.merge_strategy == "fixed": self.register_buffer("mix_factor", torch.Tensor([alpha])) elif self.merge_strategy == "learned": self.register_parameter( "mix_factor", torch.nn.Parameter(torch.Tensor([alpha])) ) else: raise ValueError(f"unknown merge strategy {self.merge_strategy}") def forward(self, x, timesteps, skip_video=False): if skip_video: return super().forward(x) x_in = x x = self.attention(x) h, w = x.shape[2:] x = rearrange(x, "b c h w -> b (h w) c") x_mix = x num_frames = torch.arange(timesteps, device=x.device) num_frames = repeat(num_frames, "t -> b t", b=x.shape[0] // timesteps) num_frames = rearrange(num_frames, "b t -> (b t)") t_emb = timestep_embedding(num_frames, self.in_channels, repeat_only=False) emb = self.video_time_embed(t_emb) # b, n_channels emb = emb[:, None, :] x_mix = x_mix + emb alpha = self.get_alpha() x_mix = self.time_mix_block(x_mix, timesteps=timesteps) x = alpha * x + (1.0 - alpha) * x_mix # alpha merge x = rearrange(x, "b (h w) c -> b c h w", h=h, w=w) x = self.proj_out(x) return x_in + x def get_alpha( self, ): if self.merge_strategy == "fixed": return self.mix_factor elif self.merge_strategy == "learned": return torch.sigmoid(self.mix_factor) else: raise NotImplementedError(f"unknown merge strategy {self.merge_strategy}") class MemoryEfficientVideoBlock(MemoryEfficientAttnBlock): def __init__( self, in_channels: int, alpha: float = 0, merge_strategy: str = "learned" ): super().__init__(in_channels) # no context, single headed, as in base class self.time_mix_block = VideoTransformerBlock( dim=in_channels, n_heads=1, d_head=in_channels, checkpoint=True, ff_in=True, attn_mode="softmax-xformers", ) time_embed_dim = self.in_channels * 4 self.video_time_embed = torch.nn.Sequential( torch.nn.Linear(self.in_channels, time_embed_dim), torch.nn.SiLU(), torch.nn.Linear(time_embed_dim, self.in_channels), ) self.merge_strategy = merge_strategy if self.merge_strategy == "fixed": self.register_buffer("mix_factor", torch.Tensor([alpha])) elif self.merge_strategy == "learned": self.register_parameter( "mix_factor", torch.nn.Parameter(torch.Tensor([alpha])) ) else: raise ValueError(f"unknown merge strategy {self.merge_strategy}") def forward(self, x, timesteps, skip_time_block=False): if skip_time_block: return super().forward(x) x_in = x x = self.attention(x) h, w = x.shape[2:] x = rearrange(x, "b c h w -> b (h w) c") x_mix = x num_frames = torch.arange(timesteps, device=x.device) num_frames = repeat(num_frames, "t -> b t", b=x.shape[0] // timesteps) num_frames = rearrange(num_frames, "b t -> (b t)") t_emb = timestep_embedding(num_frames, self.in_channels, repeat_only=False) emb = self.video_time_embed(t_emb) # b, n_channels emb = emb[:, None, :] x_mix = x_mix + emb alpha = self.get_alpha() x_mix = self.time_mix_block(x_mix, timesteps=timesteps) x = alpha * x + (1.0 - alpha) * x_mix # alpha merge x = rearrange(x, "b (h w) c -> b c h w", h=h, w=w) x = self.proj_out(x) return x_in + x def get_alpha( self, ): if self.merge_strategy == "fixed": return self.mix_factor elif self.merge_strategy == "learned": return torch.sigmoid(self.mix_factor) else: raise NotImplementedError(f"unknown merge strategy {self.merge_strategy}") def make_time_attn( in_channels, attn_type="vanilla", attn_kwargs=None, alpha: float = 0, merge_strategy: str = "learned", ): assert attn_type in [ "vanilla", "vanilla-xformers", ], f"attn_type {attn_type} not supported for spatio-temporal attention" print( f"making spatial and temporal attention of type '{attn_type}' with {in_channels} in_channels" ) if not XFORMERS_IS_AVAILABLE and attn_type == "vanilla-xformers": print( f"Attention mode '{attn_type}' is not available. Falling back to vanilla attention. " f"This is not a problem in Pytorch >= 2.0. FYI, you are running with PyTorch version {torch.__version__}" ) attn_type = "vanilla" if attn_type == "vanilla": assert attn_kwargs is None return partialclass( VideoBlock, in_channels, alpha=alpha, merge_strategy=merge_strategy ) elif attn_type == "vanilla-xformers": print(f"building MemoryEfficientAttnBlock with {in_channels} in_channels...") return partialclass( MemoryEfficientVideoBlock, in_channels, alpha=alpha, merge_strategy=merge_strategy, ) else: return NotImplementedError() class Conv2DWrapper(torch.nn.Conv2d): def forward(self, input: torch.Tensor, **kwargs) -> torch.Tensor: return super().forward(input) class VideoDecoder(Decoder): available_time_modes = ["all", "conv-only", "attn-only"] def __init__( self, *args, video_kernel_size: Union[int, list] = [3, 1, 1], alpha: float = 0.0, merge_strategy: str = "learned", time_mode: str = "conv-only", **kwargs, ): self.video_kernel_size = video_kernel_size self.alpha = alpha self.merge_strategy = merge_strategy self.time_mode = time_mode assert ( self.time_mode in self.available_time_modes ), f"time_mode parameter has to be in {self.available_time_modes}" super().__init__(*args, **kwargs) def get_last_layer(self, skip_time_mix=False, **kwargs): if self.time_mode == "attn-only": raise NotImplementedError("TODO") else: return ( self.conv_out.time_mix_conv.weight if not skip_time_mix else self.conv_out.weight ) def _make_attn(self) -> Callable: if self.time_mode not in ["conv-only", "only-last-conv"]: return partialclass( make_time_attn, alpha=self.alpha, merge_strategy=self.merge_strategy, ) else: return super()._make_attn() def _make_conv(self) -> Callable: if self.time_mode != "attn-only": return partialclass(AE3DConv, video_kernel_size=self.video_kernel_size) else: return Conv2DWrapper def _make_resblock(self) -> Callable: if self.time_mode not in ["attn-only", "only-last-conv"]: return partialclass( VideoResBlock, video_kernel_size=self.video_kernel_size, alpha=self.alpha, merge_strategy=self.merge_strategy, ) else: return super()._make_resblock() ================================================ FILE: ToonCrafter/lvdm/models/ddpm3d.py ================================================ """ wild mixture of https://github.com/openai/improved-diffusion/blob/e94489283bb876ac1477d5dd7709bbbd2d9902ce/improved_diffusion/gaussian_diffusion.py https://github.com/lucidrains/denoising-diffusion-pytorch/blob/7706bdfc6f527f58d33f84b7b522e61e6e3164b3/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py https://github.com/CompVis/taming-transformers -- merci """ import os from functools import partial from contextlib import contextmanager import numpy as np from tqdm import tqdm from einops import rearrange, repeat import logging mainlogger = logging.getLogger('mainlogger') import random import torch import torch.nn as nn from torch.optim.lr_scheduler import LambdaLR, CosineAnnealingLR from torchvision.utils import make_grid import pytorch_lightning as pl from pytorch_lightning.utilities import rank_zero_only from ToonCrafter.utils.utils import instantiate_from_config from lvdm.ema import LitEma from lvdm.models.samplers.ddim import DDIMSampler from lvdm.distributions import DiagonalGaussianDistribution from lvdm.models.utils_diffusion import make_beta_schedule, rescale_zero_terminal_snr from lvdm.basics import disabled_train from lvdm.common import ( extract_into_tensor, noise_like, exists, default ) import math from lvdm.models.autoencoder_dualref import VideoDecoder __conditioning_keys__ = {'concat': 'c_concat', 'crossattn': 'c_crossattn', 'adm': 'y'} class DDPM(pl.LightningModule): # classic DDPM with Gaussian diffusion, in image space def __init__(self, unet_config, timesteps=1000, beta_schedule="linear", loss_type="l2", ckpt_path=None, ignore_keys=[], load_only_unet=False, monitor=None, use_ema=True, first_stage_key="image", image_size=256, channels=3, log_every_t=100, clip_denoised=True, linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3, given_betas=None, original_elbo_weight=0., v_posterior=0., # weight for choosing posterior variance as sigma = (1-v) * beta_tilde + v * beta l_simple_weight=1., conditioning_key=None, parameterization="eps", # all assuming fixed variance schedules scheduler_config=None, use_positional_encodings=False, learn_logvar=False, logvar_init=0., rescale_betas_zero_snr=False, ): super().__init__() assert parameterization in ["eps", "x0", "v"], 'currently only supporting "eps" and "x0" and "v"' self.parameterization = parameterization mainlogger.info(f"{self.__class__.__name__}: Running in {self.parameterization}-prediction mode") self.cond_stage_model = None self.clip_denoised = clip_denoised self.log_every_t = log_every_t self.first_stage_key = first_stage_key self.channels = channels self.temporal_length = unet_config.params.temporal_length self.image_size = image_size # try conv? if isinstance(self.image_size, int): self.image_size = [self.image_size, self.image_size] self.use_positional_encodings = use_positional_encodings self.model = DiffusionWrapper(unet_config, conditioning_key) #count_params(self.model, verbose=True) self.use_ema = use_ema self.rescale_betas_zero_snr = rescale_betas_zero_snr if self.use_ema: self.model_ema = LitEma(self.model) mainlogger.info(f"Keeping EMAs of {len(list(self.model_ema.buffers()))}.") self.use_scheduler = scheduler_config is not None if self.use_scheduler: self.scheduler_config = scheduler_config self.v_posterior = v_posterior self.original_elbo_weight = original_elbo_weight self.l_simple_weight = l_simple_weight if monitor is not None: self.monitor = monitor if ckpt_path is not None: self.init_from_ckpt(ckpt_path, ignore_keys=ignore_keys, only_model=load_only_unet) self.register_schedule(given_betas=given_betas, beta_schedule=beta_schedule, timesteps=timesteps, linear_start=linear_start, linear_end=linear_end, cosine_s=cosine_s) ## for reschedule self.given_betas = given_betas self.beta_schedule = beta_schedule self.timesteps = timesteps self.cosine_s = cosine_s self.loss_type = loss_type self.learn_logvar = learn_logvar self.logvar = torch.full(fill_value=logvar_init, size=(self.num_timesteps,)) if self.learn_logvar: self.logvar = nn.Parameter(self.logvar, requires_grad=True) def register_schedule(self, given_betas=None, beta_schedule="linear", timesteps=1000, linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3): if exists(given_betas): betas = given_betas else: betas = make_beta_schedule(beta_schedule, timesteps, linear_start=linear_start, linear_end=linear_end, cosine_s=cosine_s) if self.rescale_betas_zero_snr: betas = rescale_zero_terminal_snr(betas) alphas = 1. - betas alphas_cumprod = np.cumprod(alphas, axis=0) alphas_cumprod_prev = np.append(1., alphas_cumprod[:-1]) timesteps, = betas.shape self.num_timesteps = int(timesteps) self.linear_start = linear_start self.linear_end = linear_end assert alphas_cumprod.shape[0] == self.num_timesteps, 'alphas have to be defined for each timestep' to_torch = partial(torch.tensor, dtype=torch.float32) self.register_buffer('betas', to_torch(betas)) self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod)) self.register_buffer('alphas_cumprod_prev', to_torch(alphas_cumprod_prev)) # calculations for diffusion q(x_t | x_{t-1}) and others self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod))) self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod))) self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod))) if self.parameterization != 'v': self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod))) self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod - 1))) else: self.register_buffer('sqrt_recip_alphas_cumprod', torch.zeros_like(to_torch(alphas_cumprod))) self.register_buffer('sqrt_recipm1_alphas_cumprod', torch.zeros_like(to_torch(alphas_cumprod))) # calculations for posterior q(x_{t-1} | x_t, x_0) posterior_variance = (1 - self.v_posterior) * betas * (1. - alphas_cumprod_prev) / ( 1. - alphas_cumprod) + self.v_posterior * betas # above: equal to 1. / (1. / (1. - alpha_cumprod_tm1) + alpha_t / beta_t) self.register_buffer('posterior_variance', to_torch(posterior_variance)) # below: log calculation clipped because the posterior variance is 0 at the beginning of the diffusion chain self.register_buffer('posterior_log_variance_clipped', to_torch(np.log(np.maximum(posterior_variance, 1e-20)))) self.register_buffer('posterior_mean_coef1', to_torch( betas * np.sqrt(alphas_cumprod_prev) / (1. - alphas_cumprod))) self.register_buffer('posterior_mean_coef2', to_torch( (1. - alphas_cumprod_prev) * np.sqrt(alphas) / (1. - alphas_cumprod))) if self.parameterization == "eps": lvlb_weights = self.betas ** 2 / ( 2 * self.posterior_variance * to_torch(alphas) * (1 - self.alphas_cumprod)) elif self.parameterization == "x0": lvlb_weights = 0.5 * np.sqrt(torch.Tensor(alphas_cumprod)) / (2. * 1 - torch.Tensor(alphas_cumprod)) elif self.parameterization == "v": lvlb_weights = torch.ones_like(self.betas ** 2 / ( 2 * self.posterior_variance * to_torch(alphas) * (1 - self.alphas_cumprod))) else: raise NotImplementedError("mu not supported") # TODO how to choose this term lvlb_weights[0] = lvlb_weights[1] self.register_buffer('lvlb_weights', lvlb_weights, persistent=False) assert not torch.isnan(self.lvlb_weights).all() @contextmanager def ema_scope(self, context=None): if self.use_ema: self.model_ema.store(self.model.parameters()) self.model_ema.copy_to(self.model) if context is not None: mainlogger.info(f"{context}: Switched to EMA weights") try: yield None finally: if self.use_ema: self.model_ema.restore(self.model.parameters()) if context is not None: mainlogger.info(f"{context}: Restored training weights") def init_from_ckpt(self, path, ignore_keys=list(), only_model=False): sd = torch.load(path, map_location="cpu") if "state_dict" in list(sd.keys()): sd = sd["state_dict"] keys = list(sd.keys()) for k in keys: for ik in ignore_keys: if k.startswith(ik): mainlogger.info("Deleting key {} from state_dict.".format(k)) del sd[k] missing, unexpected = self.load_state_dict(sd, strict=False) if not only_model else self.model.load_state_dict( sd, strict=False) mainlogger.info(f"Restored from {path} with {len(missing)} missing and {len(unexpected)} unexpected keys") if len(missing) > 0: mainlogger.info(f"Missing Keys: {missing}") if len(unexpected) > 0: mainlogger.info(f"Unexpected Keys: {unexpected}") def q_mean_variance(self, x_start, t): """ Get the distribution q(x_t | x_0). :param x_start: the [N x C x ...] tensor of noiseless inputs. :param t: the number of diffusion steps (minus 1). Here, 0 means one step. :return: A tuple (mean, variance, log_variance), all of x_start's shape. """ mean = (extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start) variance = extract_into_tensor(1.0 - self.alphas_cumprod, t, x_start.shape) log_variance = extract_into_tensor(self.log_one_minus_alphas_cumprod, t, x_start.shape) return mean, variance, log_variance def predict_start_from_noise(self, x_t, t, noise): return ( extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t - extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape) * noise ) def predict_start_from_z_and_v(self, x_t, t, v): # self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod))) # self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod))) return ( extract_into_tensor(self.sqrt_alphas_cumprod, t, x_t.shape) * x_t - extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_t.shape) * v ) def predict_eps_from_z_and_v(self, x_t, t, v): return ( extract_into_tensor(self.sqrt_alphas_cumprod, t, x_t.shape) * v + extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_t.shape) * x_t ) def q_posterior(self, x_start, x_t, t): posterior_mean = ( extract_into_tensor(self.posterior_mean_coef1, t, x_t.shape) * x_start + extract_into_tensor(self.posterior_mean_coef2, t, x_t.shape) * x_t ) posterior_variance = extract_into_tensor(self.posterior_variance, t, x_t.shape) posterior_log_variance_clipped = extract_into_tensor(self.posterior_log_variance_clipped, t, x_t.shape) return posterior_mean, posterior_variance, posterior_log_variance_clipped def p_mean_variance(self, x, t, clip_denoised: bool): model_out = self.model(x, t) if self.parameterization == "eps": x_recon = self.predict_start_from_noise(x, t=t, noise=model_out) elif self.parameterization == "x0": x_recon = model_out if clip_denoised: x_recon.clamp_(-1., 1.) model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t) return model_mean, posterior_variance, posterior_log_variance @torch.no_grad() def p_sample(self, x, t, clip_denoised=True, repeat_noise=False): b, *_, device = *x.shape, x.device model_mean, _, model_log_variance = self.p_mean_variance(x=x, t=t, clip_denoised=clip_denoised) noise = noise_like(x.shape, device, repeat_noise) # no noise when t == 0 nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1))) return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise @torch.no_grad() def p_sample_loop(self, shape, return_intermediates=False): device = self.betas.device b = shape[0] img = torch.randn(shape, device=device) intermediates = [img] for i in tqdm(reversed(range(0, self.num_timesteps)), desc='Sampling t', total=self.num_timesteps): img = self.p_sample(img, torch.full((b,), i, device=device, dtype=torch.long), clip_denoised=self.clip_denoised) if i % self.log_every_t == 0 or i == self.num_timesteps - 1: intermediates.append(img) if return_intermediates: return img, intermediates return img @torch.no_grad() def sample(self, batch_size=16, return_intermediates=False): image_size = self.image_size channels = self.channels return self.p_sample_loop((batch_size, channels, image_size, image_size), return_intermediates=return_intermediates) def q_sample(self, x_start, t, noise=None): noise = default(noise, lambda: torch.randn_like(x_start)) return (extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start + extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise) def get_v(self, x, noise, t): return ( extract_into_tensor(self.sqrt_alphas_cumprod, t, x.shape) * noise - extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x.shape) * x ) def get_loss(self, pred, target, mean=True): if self.loss_type == 'l1': loss = (target - pred).abs() if mean: loss = loss.mean() elif self.loss_type == 'l2': if mean: loss = torch.nn.functional.mse_loss(target, pred) else: loss = torch.nn.functional.mse_loss(target, pred, reduction='none') else: raise NotImplementedError("unknown loss type '{loss_type}'") return loss def p_losses(self, x_start, t, noise=None): noise = default(noise, lambda: torch.randn_like(x_start)) x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise) model_out = self.model(x_noisy, t) loss_dict = {} if self.parameterization == "eps": target = noise elif self.parameterization == "x0": target = x_start elif self.parameterization == "v": target = self.get_v(x_start, noise, t) else: raise NotImplementedError(f"Paramterization {self.parameterization} not yet supported") loss = self.get_loss(model_out, target, mean=False).mean(dim=[1, 2, 3]) log_prefix = 'train' if self.training else 'val' loss_dict.update({f'{log_prefix}/loss_simple': loss.mean()}) loss_simple = loss.mean() * self.l_simple_weight loss_vlb = (self.lvlb_weights[t] * loss).mean() loss_dict.update({f'{log_prefix}/loss_vlb': loss_vlb}) loss = loss_simple + self.original_elbo_weight * loss_vlb loss_dict.update({f'{log_prefix}/loss': loss}) return loss, loss_dict def forward(self, x, *args, **kwargs): # b, c, h, w, device, img_size, = *x.shape, x.device, self.image_size # assert h == img_size and w == img_size, f'height and width of image must be {img_size}' t = torch.randint(0, self.num_timesteps, (x.shape[0],), device=self.device).long() return self.p_losses(x, t, *args, **kwargs) def get_input(self, batch, k): x = batch[k] ''' if len(x.shape) == 3: x = x[..., None] x = rearrange(x, 'b h w c -> b c h w') ''' x = x.to(memory_format=torch.contiguous_format).float() return x def shared_step(self, batch): x = self.get_input(batch, self.first_stage_key) loss, loss_dict = self(x) return loss, loss_dict def training_step(self, batch, batch_idx): loss, loss_dict = self.shared_step(batch) self.log_dict(loss_dict, prog_bar=True, logger=True, on_step=True, on_epoch=True) self.log("global_step", self.global_step, prog_bar=True, logger=True, on_step=True, on_epoch=False) if self.use_scheduler: lr = self.optimizers().param_groups[0]['lr'] self.log('lr_abs', lr, prog_bar=True, logger=True, on_step=True, on_epoch=False) return loss @torch.no_grad() def validation_step(self, batch, batch_idx): _, loss_dict_no_ema = self.shared_step(batch) with self.ema_scope(): _, loss_dict_ema = self.shared_step(batch) loss_dict_ema = {key + '_ema': loss_dict_ema[key] for key in loss_dict_ema} self.log_dict(loss_dict_no_ema, prog_bar=False, logger=True, on_step=False, on_epoch=True) self.log_dict(loss_dict_ema, prog_bar=False, logger=True, on_step=False, on_epoch=True) def on_train_batch_end(self, *args, **kwargs): if self.use_ema: self.model_ema(self.model) def _get_rows_from_list(self, samples): n_imgs_per_row = len(samples) denoise_grid = rearrange(samples, 'n b c h w -> b n c h w') denoise_grid = rearrange(denoise_grid, 'b n c h w -> (b n) c h w') denoise_grid = make_grid(denoise_grid, nrow=n_imgs_per_row) return denoise_grid @torch.no_grad() def log_images(self, batch, N=8, n_row=2, sample=True, return_keys=None, **kwargs): log = dict() x = self.get_input(batch, self.first_stage_key) N = min(x.shape[0], N) n_row = min(x.shape[0], n_row) x = x.to(self.device)[:N] log["inputs"] = x # get diffusion row diffusion_row = list() x_start = x[:n_row] for t in range(self.num_timesteps): if t % self.log_every_t == 0 or t == self.num_timesteps - 1: t = repeat(torch.tensor([t]), '1 -> b', b=n_row) t = t.to(self.device).long() noise = torch.randn_like(x_start) x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise) diffusion_row.append(x_noisy) log["diffusion_row"] = self._get_rows_from_list(diffusion_row) if sample: # get denoise row with self.ema_scope("Plotting"): samples, denoise_row = self.sample(batch_size=N, return_intermediates=True) log["samples"] = samples log["denoise_row"] = self._get_rows_from_list(denoise_row) if return_keys: if np.intersect1d(list(log.keys()), return_keys).shape[0] == 0: return log else: return {key: log[key] for key in return_keys} return log def configure_optimizers(self): lr = self.learning_rate params = list(self.model.parameters()) if self.learn_logvar: params = params + [self.logvar] opt = torch.optim.AdamW(params, lr=lr) return opt class LatentDiffusion(DDPM): """main class""" def __init__(self, first_stage_config, cond_stage_config, num_timesteps_cond=None, cond_stage_key="caption", cond_stage_trainable=False, cond_stage_forward=None, conditioning_key=None, uncond_prob=0.2, uncond_type="empty_seq", scale_factor=1.0, scale_by_std=False, encoder_type="2d", only_model=False, noise_strength=0, use_dynamic_rescale=False, base_scale=0.7, turning_step=400, loop_video=False, fps_condition_type='fs', perframe_ae=False, # added logdir=None, rand_cond_frame=False, en_and_decode_n_samples_a_time=None, *args, **kwargs): self.num_timesteps_cond = default(num_timesteps_cond, 1) self.scale_by_std = scale_by_std assert self.num_timesteps_cond <= kwargs['timesteps'] # for backwards compatibility after implementation of DiffusionWrapper ckpt_path = kwargs.pop("ckpt_path", None) ignore_keys = kwargs.pop("ignore_keys", []) conditioning_key = default(conditioning_key, 'crossattn') super().__init__(conditioning_key=conditioning_key, *args, **kwargs) self.cond_stage_trainable = cond_stage_trainable self.cond_stage_key = cond_stage_key self.noise_strength = noise_strength self.use_dynamic_rescale = use_dynamic_rescale self.loop_video = loop_video self.fps_condition_type = fps_condition_type self.perframe_ae = perframe_ae self.logdir = logdir self.rand_cond_frame = rand_cond_frame self.en_and_decode_n_samples_a_time = en_and_decode_n_samples_a_time try: self.num_downs = len(first_stage_config.params.ddconfig.ch_mult) - 1 except: self.num_downs = 0 if not scale_by_std: self.scale_factor = scale_factor else: self.register_buffer('scale_factor', torch.tensor(scale_factor)) if use_dynamic_rescale: scale_arr1 = np.linspace(1.0, base_scale, turning_step) scale_arr2 = np.full(self.num_timesteps, base_scale) scale_arr = np.concatenate((scale_arr1, scale_arr2)) to_torch = partial(torch.tensor, dtype=torch.float32) self.register_buffer('scale_arr', to_torch(scale_arr)) self.instantiate_first_stage(first_stage_config) self.instantiate_cond_stage(cond_stage_config) self.first_stage_config = first_stage_config self.cond_stage_config = cond_stage_config self.clip_denoised = False self.cond_stage_forward = cond_stage_forward self.encoder_type = encoder_type assert(encoder_type in ["2d", "3d"]) self.uncond_prob = uncond_prob self.classifier_free_guidance = True if uncond_prob > 0 else False assert(uncond_type in ["zero_embed", "empty_seq"]) self.uncond_type = uncond_type self.restarted_from_ckpt = False if ckpt_path is not None: self.init_from_ckpt(ckpt_path, ignore_keys, only_model=only_model) self.restarted_from_ckpt = True def make_cond_schedule(self, ): self.cond_ids = torch.full(size=(self.num_timesteps,), fill_value=self.num_timesteps - 1, dtype=torch.long) ids = torch.round(torch.linspace(0, self.num_timesteps - 1, self.num_timesteps_cond)).long() self.cond_ids[:self.num_timesteps_cond] = ids @rank_zero_only @torch.no_grad() def on_train_batch_start(self, batch, batch_idx, dataloader_idx=None): # only for very first batch, reset the self.scale_factor if self.scale_by_std and self.current_epoch == 0 and self.global_step == 0 and batch_idx == 0 and \ not self.restarted_from_ckpt: assert self.scale_factor == 1., 'rather not use custom rescaling and std-rescaling simultaneously' # set rescale weight to 1./std of encodings mainlogger.info("### USING STD-RESCALING ###") x = super().get_input(batch, self.first_stage_key) x = x.to(self.device) encoder_posterior = self.encode_first_stage(x) z = self.get_first_stage_encoding(encoder_posterior).detach() del self.scale_factor self.register_buffer('scale_factor', 1. / z.flatten().std()) mainlogger.info(f"setting self.scale_factor to {self.scale_factor}") mainlogger.info("### USING STD-RESCALING ###") mainlogger.info(f"std={z.flatten().std()}") def register_schedule(self, given_betas=None, beta_schedule="linear", timesteps=1000, linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3): super().register_schedule(given_betas, beta_schedule, timesteps, linear_start, linear_end, cosine_s) self.shorten_cond_schedule = self.num_timesteps_cond > 1 if self.shorten_cond_schedule: self.make_cond_schedule() def instantiate_first_stage(self, config): model = instantiate_from_config(config) self.first_stage_model = model.eval() self.first_stage_model.train = disabled_train for param in self.first_stage_model.parameters(): param.requires_grad = False def instantiate_cond_stage(self, config): if not self.cond_stage_trainable: model = instantiate_from_config(config) self.cond_stage_model = model.eval() self.cond_stage_model.train = disabled_train for param in self.cond_stage_model.parameters(): param.requires_grad = False else: model = instantiate_from_config(config) self.cond_stage_model = model def get_learned_conditioning(self, c): if self.cond_stage_forward is None: if hasattr(self.cond_stage_model, 'encode') and callable(self.cond_stage_model.encode): c = self.cond_stage_model.encode(c) if isinstance(c, DiagonalGaussianDistribution): c = c.mode() else: c = self.cond_stage_model(c) else: assert hasattr(self.cond_stage_model, self.cond_stage_forward) c = getattr(self.cond_stage_model, self.cond_stage_forward)(c) return c def get_first_stage_encoding(self, encoder_posterior, noise=None): if isinstance(encoder_posterior, DiagonalGaussianDistribution): z = encoder_posterior.sample(noise=noise) elif isinstance(encoder_posterior, torch.Tensor): z = encoder_posterior else: raise NotImplementedError(f"encoder_posterior of type '{type(encoder_posterior)}' not yet implemented") return self.scale_factor * z @torch.no_grad() def encode_first_stage(self, x): if self.encoder_type == "2d" and x.dim() == 5: b, _, t, _, _ = x.shape x = rearrange(x, 'b c t h w -> (b t) c h w') reshape_back = True else: reshape_back = False ## consume more GPU memory but faster if not self.perframe_ae: encoder_posterior = self.first_stage_model.encode(x) results = self.get_first_stage_encoding(encoder_posterior).detach() else: ## consume less GPU memory but slower results = [] for index in range(x.shape[0]): frame_batch = self.first_stage_model.encode(x[index:index+1,:,:,:]) frame_result = self.get_first_stage_encoding(frame_batch).detach() results.append(frame_result) results = torch.cat(results, dim=0) if reshape_back: results = rearrange(results, '(b t) c h w -> b c t h w', b=b,t=t) return results def decode_core(self, z, **kwargs): if self.encoder_type == "2d" and z.dim() == 5: b, _, t, _, _ = z.shape z = rearrange(z, 'b c t h w -> (b t) c h w') reshape_back = True else: reshape_back = False z = 1. / self.scale_factor * z if not self.perframe_ae: results = self.first_stage_model.decode(z, **kwargs) else: results = [] n_samples = default(self.en_and_decode_n_samples_a_time, self.temporal_length) n_rounds = math.ceil(z.shape[0] / n_samples) from contextlib import nullcontext with torch.autocast("cuda", enabled=True) if torch.cuda.is_available() else nullcontext(): for n in range(n_rounds): if isinstance(self.first_stage_model.decoder, VideoDecoder): kwargs.update({"timesteps": len(z[n * n_samples : (n + 1) * n_samples])}) else: kwargs = {} if os.environ.get("TOON_MEM_STRATEGY", "none") == "low": kwargs2 = kwargs.copy() kwargs2.update({"timesteps": 3}) out_list = [] for k in range(z.shape[0]): out = self.first_stage_model.decode(z[[0, k, z.shape[0] - 1]], **kwargs2) out_list.append(out[1:2]) out = torch.cat(out_list, dim=0) else: out = self.first_stage_model.decode( z[n * n_samples : (n + 1) * n_samples], **kwargs ) results.append(out) results = torch.cat(results, dim=0) if reshape_back: results = rearrange(results, '(b t) c h w -> b c t h w', b=b,t=t) return results @torch.no_grad() def decode_first_stage(self, z, **kwargs): return self.decode_core(z, **kwargs) # same as above but without decorator def differentiable_decode_first_stage(self, z, **kwargs): return self.decode_core(z, **kwargs) @torch.no_grad() def get_batch_input(self, batch, random_uncond, return_first_stage_outputs=False, return_original_cond=False): ## video shape: b, c, t, h, w x = super().get_input(batch, self.first_stage_key) ## encode video frames x to z via a 2D encoder z = self.encode_first_stage(x) ## get caption condition cond = batch[self.cond_stage_key] if random_uncond and self.uncond_type == 'empty_seq': for i, ci in enumerate(cond): if random.random() < self.uncond_prob: cond[i] = "" if isinstance(cond, dict) or isinstance(cond, list): cond_emb = self.get_learned_conditioning(cond) else: cond_emb = self.get_learned_conditioning(cond.to(self.device)) if random_uncond and self.uncond_type == 'zero_embed': for i, ci in enumerate(cond): if random.random() < self.uncond_prob: cond_emb[i] = torch.zeros_like(cond_emb[i]) out = [z, cond_emb] ## optional output: self-reconst or caption if return_first_stage_outputs: xrec = self.decode_first_stage(z) out.extend([xrec]) if return_original_cond: out.append(cond) return out def forward(self, x, c, **kwargs): t = torch.randint(0, self.num_timesteps, (x.shape[0],), device=self.device).long() if self.use_dynamic_rescale: x = x * extract_into_tensor(self.scale_arr, t, x.shape) return self.p_losses(x, c, t, **kwargs) def shared_step(self, batch, random_uncond, **kwargs): x, c = self.get_batch_input(batch, random_uncond=random_uncond) loss, loss_dict = self(x, c, **kwargs) return loss, loss_dict def apply_model(self, x_noisy, t, cond, **kwargs): if isinstance(cond, dict): # hybrid case, cond is exptected to be a dict pass else: if not isinstance(cond, list): cond = [cond] key = 'c_concat' if self.model.conditioning_key == 'concat' else 'c_crossattn' cond = {key: cond} x_recon = self.model(x_noisy, t, **cond, **kwargs) if isinstance(x_recon, tuple): return x_recon[0] else: return x_recon def p_losses(self, x_start, cond, t, noise=None, **kwargs): if self.noise_strength > 0: b, c, f, _, _ = x_start.shape offset_noise = torch.randn(b, c, f, 1, 1, device=x_start.device) noise = default(noise, lambda: torch.randn_like(x_start) + self.noise_strength * offset_noise) else: noise = default(noise, lambda: torch.randn_like(x_start)) x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise) model_output = self.apply_model(x_noisy, t, cond, **kwargs) loss_dict = {} prefix = 'train' if self.training else 'val' if self.parameterization == "x0": target = x_start elif self.parameterization == "eps": target = noise elif self.parameterization == "v": target = self.get_v(x_start, noise, t) else: raise NotImplementedError() loss_simple = self.get_loss(model_output, target, mean=False).mean([1, 2, 3, 4]) loss_dict.update({f'{prefix}/loss_simple': loss_simple.mean()}) if self.logvar.device is not self.device: self.logvar = self.logvar.to(self.device) logvar_t = self.logvar[t] # logvar_t = self.logvar[t.item()].to(self.device) # device conflict when ddp shared loss = loss_simple / torch.exp(logvar_t) + logvar_t # loss = loss_simple / torch.exp(self.logvar) + self.logvar if self.learn_logvar: loss_dict.update({f'{prefix}/loss_gamma': loss.mean()}) loss_dict.update({'logvar': self.logvar.data.mean()}) loss = self.l_simple_weight * loss.mean() loss_vlb = self.get_loss(model_output, target, mean=False).mean(dim=(1, 2, 3, 4)) loss_vlb = (self.lvlb_weights[t] * loss_vlb).mean() loss_dict.update({f'{prefix}/loss_vlb': loss_vlb}) loss += (self.original_elbo_weight * loss_vlb) loss_dict.update({f'{prefix}/loss': loss}) return loss, loss_dict def training_step(self, batch, batch_idx): loss, loss_dict = self.shared_step(batch, random_uncond=self.classifier_free_guidance) ## sync_dist | rank_zero_only self.log_dict(loss_dict, prog_bar=True, logger=True, on_step=True, on_epoch=True, sync_dist=False) #self.log("epoch/global_step", self.global_step.float(), prog_bar=True, logger=True, on_step=True, on_epoch=False) ''' if self.use_scheduler: lr = self.optimizers().param_groups[0]['lr'] self.log('lr_abs', lr, prog_bar=True, logger=True, on_step=True, on_epoch=False, rank_zero_only=True) ''' if (batch_idx+1) % self.log_every_t == 0: mainlogger.info(f"batch:{batch_idx}|epoch:{self.current_epoch} [globalstep:{self.global_step}]: loss={loss}") return loss def _get_denoise_row_from_list(self, samples, desc=''): denoise_row = [] for zd in tqdm(samples, desc=desc): denoise_row.append(self.decode_first_stage(zd.to(self.device))) n_log_timesteps = len(denoise_row) denoise_row = torch.stack(denoise_row) # n_log_timesteps, b, C, H, W if denoise_row.dim() == 5: denoise_grid = rearrange(denoise_row, 'n b c h w -> b n c h w') denoise_grid = rearrange(denoise_grid, 'b n c h w -> (b n) c h w') denoise_grid = make_grid(denoise_grid, nrow=n_log_timesteps) elif denoise_row.dim() == 6: # video, grid_size=[n_log_timesteps*bs, t] video_length = denoise_row.shape[3] denoise_grid = rearrange(denoise_row, 'n b c t h w -> b n c t h w') denoise_grid = rearrange(denoise_grid, 'b n c t h w -> (b n) c t h w') denoise_grid = rearrange(denoise_grid, 'n c t h w -> (n t) c h w') denoise_grid = make_grid(denoise_grid, nrow=video_length) else: raise ValueError return denoise_grid @torch.no_grad() def log_images(self, batch, sample=True, ddim_steps=200, ddim_eta=1., plot_denoise_rows=False, \ unconditional_guidance_scale=1.0, **kwargs): """ log images for LatentDiffusion """ ##### control sampled imgae for logging, larger value may cause OOM sampled_img_num = 2 for key in batch.keys(): batch[key] = batch[key][:sampled_img_num] ## TBD: currently, classifier_free_guidance sampling is only supported by DDIM use_ddim = ddim_steps is not None log = dict() z, c, xrec, xc = self.get_batch_input(batch, random_uncond=False, return_first_stage_outputs=True, return_original_cond=True) N = xrec.shape[0] log["reconst"] = xrec log["condition"] = xc if sample: # get uncond embedding for classifier-free guidance sampling if unconditional_guidance_scale != 1.0: if isinstance(c, dict): c_cat, c_emb = c["c_concat"][0], c["c_crossattn"][0] log["condition_cat"] = c_cat else: c_emb = c if self.uncond_type == "empty_seq": prompts = N * [""] uc = self.get_learned_conditioning(prompts) elif self.uncond_type == "zero_embed": uc = torch.zeros_like(c_emb) ## hybrid case if isinstance(c, dict): uc_hybrid = {"c_concat": [c_cat], "c_crossattn": [uc]} uc = uc_hybrid else: uc = None with self.ema_scope("Plotting"): samples, z_denoise_row = self.sample_log(cond=c, batch_size=N, ddim=use_ddim, ddim_steps=ddim_steps,eta=ddim_eta, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=uc, x0=z, **kwargs) x_samples = self.decode_first_stage(samples) log["samples"] = x_samples if plot_denoise_rows: denoise_grid = self._get_denoise_row_from_list(z_denoise_row) log["denoise_row"] = denoise_grid return log def p_mean_variance(self, x, c, t, clip_denoised: bool, return_x0=False, score_corrector=None, corrector_kwargs=None, **kwargs): t_in = t model_out = self.apply_model(x, t_in, c, **kwargs) if score_corrector is not None: assert self.parameterization == "eps" model_out = score_corrector.modify_score(self, model_out, x, t, c, **corrector_kwargs) if self.parameterization == "eps": x_recon = self.predict_start_from_noise(x, t=t, noise=model_out) elif self.parameterization == "x0": x_recon = model_out else: raise NotImplementedError() if clip_denoised: x_recon.clamp_(-1., 1.) model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t) if return_x0: return model_mean, posterior_variance, posterior_log_variance, x_recon else: return model_mean, posterior_variance, posterior_log_variance @torch.no_grad() def p_sample(self, x, c, t, clip_denoised=False, repeat_noise=False, return_x0=False, \ temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None, **kwargs): b, *_, device = *x.shape, x.device outputs = self.p_mean_variance(x=x, c=c, t=t, clip_denoised=clip_denoised, return_x0=return_x0, \ score_corrector=score_corrector, corrector_kwargs=corrector_kwargs, **kwargs) if return_x0: model_mean, _, model_log_variance, x0 = outputs else: model_mean, _, model_log_variance = outputs noise = noise_like(x.shape, device, repeat_noise) * temperature if noise_dropout > 0.: noise = torch.nn.functional.dropout(noise, p=noise_dropout) # no noise when t == 0 nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1))) if return_x0: return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise, x0 else: return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise @torch.no_grad() def p_sample_loop(self, cond, shape, return_intermediates=False, x_T=None, verbose=True, callback=None, \ timesteps=None, mask=None, x0=None, img_callback=None, start_T=None, log_every_t=None, **kwargs): if not log_every_t: log_every_t = self.log_every_t device = self.betas.device b = shape[0] # sample an initial noise if x_T is None: img = torch.randn(shape, device=device) else: img = x_T intermediates = [img] if timesteps is None: timesteps = self.num_timesteps if start_T is not None: timesteps = min(timesteps, start_T) iterator = tqdm(reversed(range(0, timesteps)), desc='Sampling t', total=timesteps) if verbose else reversed(range(0, timesteps)) if mask is not None: assert x0 is not None assert x0.shape[2:3] == mask.shape[2:3] # spatial size has to match for i in iterator: ts = torch.full((b,), i, device=device, dtype=torch.long) if self.shorten_cond_schedule: assert self.model.conditioning_key != 'hybrid' tc = self.cond_ids[ts].to(cond.device) cond = self.q_sample(x_start=cond, t=tc, noise=torch.randn_like(cond)) img = self.p_sample(img, cond, ts, clip_denoised=self.clip_denoised, **kwargs) if mask is not None: img_orig = self.q_sample(x0, ts) img = img_orig * mask + (1. - mask) * img if i % log_every_t == 0 or i == timesteps - 1: intermediates.append(img) if callback: callback(i) if img_callback: img_callback(img, i) if return_intermediates: return img, intermediates return img @torch.no_grad() def sample(self, cond, batch_size=16, return_intermediates=False, x_T=None, \ verbose=True, timesteps=None, mask=None, x0=None, shape=None, **kwargs): if shape is None: shape = (batch_size, self.channels, self.temporal_length, *self.image_size) if cond is not None: if isinstance(cond, dict): cond = {key: cond[key][:batch_size] if not isinstance(cond[key], list) else list(map(lambda x: x[:batch_size], cond[key])) for key in cond} else: cond = [c[:batch_size] for c in cond] if isinstance(cond, list) else cond[:batch_size] return self.p_sample_loop(cond, shape, return_intermediates=return_intermediates, x_T=x_T, verbose=verbose, timesteps=timesteps, mask=mask, x0=x0, **kwargs) @torch.no_grad() def sample_log(self, cond, batch_size, ddim, ddim_steps, **kwargs): if ddim: ddim_sampler = DDIMSampler(self) shape = (self.channels, self.temporal_length, *self.image_size) samples, intermediates = ddim_sampler.sample(ddim_steps, batch_size, shape, cond, verbose=False, **kwargs) else: samples, intermediates = self.sample(cond=cond, batch_size=batch_size, return_intermediates=True, **kwargs) return samples, intermediates def configure_schedulers(self, optimizer): assert 'target' in self.scheduler_config scheduler_name = self.scheduler_config.target.split('.')[-1] interval = self.scheduler_config.interval frequency = self.scheduler_config.frequency if scheduler_name == "LambdaLRScheduler": scheduler = instantiate_from_config(self.scheduler_config) scheduler.start_step = self.global_step lr_scheduler = { 'scheduler': LambdaLR(optimizer, lr_lambda=scheduler.schedule), 'interval': interval, 'frequency': frequency } elif scheduler_name == "CosineAnnealingLRScheduler": scheduler = instantiate_from_config(self.scheduler_config) decay_steps = scheduler.decay_steps last_step = -1 if self.global_step == 0 else scheduler.start_step lr_scheduler = { 'scheduler': CosineAnnealingLR(optimizer, T_max=decay_steps, last_epoch=last_step), 'interval': interval, 'frequency': frequency } else: raise NotImplementedError return lr_scheduler class LatentVisualDiffusion(LatentDiffusion): def __init__(self, img_cond_stage_config, image_proj_stage_config, freeze_embedder=True, image_proj_model_trainable=True, *args, **kwargs): super().__init__(*args, **kwargs) self.image_proj_model_trainable = image_proj_model_trainable self._init_embedder(img_cond_stage_config, freeze_embedder) self._init_img_ctx_projector(image_proj_stage_config, image_proj_model_trainable) def _init_img_ctx_projector(self, config, trainable): self.image_proj_model = instantiate_from_config(config) if not trainable: self.image_proj_model.eval() self.image_proj_model.train = disabled_train for param in self.image_proj_model.parameters(): param.requires_grad = False def _init_embedder(self, config, freeze=True): self.embedder = instantiate_from_config(config) if freeze: self.embedder.eval() self.embedder.train = disabled_train for param in self.embedder.parameters(): param.requires_grad = False def shared_step(self, batch, random_uncond, **kwargs): x, c, fs = self.get_batch_input(batch, random_uncond=random_uncond, return_fs=True) kwargs.update({"fs": fs.long()}) loss, loss_dict = self(x, c, **kwargs) return loss, loss_dict def get_batch_input(self, batch, random_uncond, return_first_stage_outputs=False, return_original_cond=False, return_fs=False, return_cond_frame=False, return_original_input=False, **kwargs): ## x: b c t h w x = super().get_input(batch, self.first_stage_key) ## encode video frames x to z via a 2D encoder z = self.encode_first_stage(x) ## get caption condition cond_input = batch[self.cond_stage_key] if isinstance(cond_input, dict) or isinstance(cond_input, list): cond_emb = self.get_learned_conditioning(cond_input) else: cond_emb = self.get_learned_conditioning(cond_input.to(self.device)) cond = {} ## to support classifier-free guidance, randomly drop out only text conditioning 5%, only image conditioning 5%, and both 5%. if random_uncond: random_num = torch.rand(x.size(0), device=x.device) else: random_num = torch.ones(x.size(0), device=x.device) ## by doning so, we can get text embedding and complete img emb for inference prompt_mask = rearrange(random_num < 2 * self.uncond_prob, "n -> n 1 1") input_mask = 1 - rearrange((random_num >= self.uncond_prob).float() * (random_num < 3 * self.uncond_prob).float(), "n -> n 1 1 1") null_prompt = self.get_learned_conditioning([""]) prompt_imb = torch.where(prompt_mask, null_prompt, cond_emb.detach()) ## get conditioning frame cond_frame_index = 0 if self.rand_cond_frame: cond_frame_index = random.randint(0, self.model.diffusion_model.temporal_length-1) img = x[:,:,cond_frame_index,...] img = input_mask * img ## img: b c h w img_emb = self.embedder(img) ## b l c img_emb = self.image_proj_model(img_emb) if self.model.conditioning_key == 'hybrid': ## simply repeat the cond_frame to match the seq_len of z img_cat_cond = z[:,:,cond_frame_index,:,:] img_cat_cond = img_cat_cond.unsqueeze(2) img_cat_cond = repeat(img_cat_cond, 'b c t h w -> b c (repeat t) h w', repeat=z.shape[2]) cond["c_concat"] = [img_cat_cond] # b c t h w cond["c_crossattn"] = [torch.cat([prompt_imb, img_emb], dim=1)] ## concat in the seq_len dim out = [z, cond] if return_first_stage_outputs: xrec = self.decode_first_stage(z) out.extend([xrec]) if return_original_cond: out.append(cond_input) if return_fs: if self.fps_condition_type == 'fs': fs = super().get_input(batch, 'frame_stride') elif self.fps_condition_type == 'fps': fs = super().get_input(batch, 'fps') out.append(fs) if return_cond_frame: out.append(x[:,:,cond_frame_index,...].unsqueeze(2)) if return_original_input: out.append(x) return out @torch.no_grad() def log_images(self, batch, sample=True, ddim_steps=50, ddim_eta=1., plot_denoise_rows=False, \ unconditional_guidance_scale=1.0, mask=None, **kwargs): """ log images for LatentVisualDiffusion """ ##### sampled_img_num: control sampled imgae for logging, larger value may cause OOM sampled_img_num = 1 for key in batch.keys(): batch[key] = batch[key][:sampled_img_num] ## TBD: currently, classifier_free_guidance sampling is only supported by DDIM use_ddim = ddim_steps is not None log = dict() z, c, xrec, xc, fs, cond_x = self.get_batch_input(batch, random_uncond=False, return_first_stage_outputs=True, return_original_cond=True, return_fs=True, return_cond_frame=True) N = xrec.shape[0] log["image_condition"] = cond_x log["reconst"] = xrec xc_with_fs = [] for idx, content in enumerate(xc): xc_with_fs.append(content + '_fs=' + str(fs[idx].item())) log["condition"] = xc_with_fs kwargs.update({"fs": fs.long()}) c_cat = None if sample: # get uncond embedding for classifier-free guidance sampling if unconditional_guidance_scale != 1.0: if isinstance(c, dict): c_emb = c["c_crossattn"][0] if 'c_concat' in c.keys(): c_cat = c["c_concat"][0] else: c_emb = c if self.uncond_type == "empty_seq": prompts = N * [""] uc_prompt = self.get_learned_conditioning(prompts) elif self.uncond_type == "zero_embed": uc_prompt = torch.zeros_like(c_emb) img = torch.zeros_like(xrec[:,:,0]) ## b c h w ## img: b c h w img_emb = self.embedder(img) ## b l c uc_img = self.image_proj_model(img_emb) uc = torch.cat([uc_prompt, uc_img], dim=1) ## hybrid case if isinstance(c, dict): uc_hybrid = {"c_concat": [c_cat], "c_crossattn": [uc]} uc = uc_hybrid else: uc = None with self.ema_scope("Plotting"): samples, z_denoise_row = self.sample_log(cond=c, batch_size=N, ddim=use_ddim, ddim_steps=ddim_steps,eta=ddim_eta, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=uc, x0=z, **kwargs) x_samples = self.decode_first_stage(samples) log["samples"] = x_samples if plot_denoise_rows: denoise_grid = self._get_denoise_row_from_list(z_denoise_row) log["denoise_row"] = denoise_grid return log def configure_optimizers(self): """ configure_optimizers for LatentDiffusion """ lr = self.learning_rate params = list(self.model.parameters()) mainlogger.info(f"@Training [{len(params)}] Full Paramters.") if self.cond_stage_trainable: params_cond_stage = [p for p in self.cond_stage_model.parameters() if p.requires_grad == True] mainlogger.info(f"@Training [{len(params_cond_stage)}] Paramters for Cond_stage_model.") params.extend(params_cond_stage) if self.image_proj_model_trainable: mainlogger.info(f"@Training [{len(list(self.image_proj_model.parameters()))}] Paramters for Image_proj_model.") params.extend(list(self.image_proj_model.parameters())) if self.learn_logvar: mainlogger.info('Diffusion model optimizing logvar') if isinstance(params[0], dict): params.append({"params": [self.logvar]}) else: params.append(self.logvar) ## optimizer optimizer = torch.optim.AdamW(params, lr=lr) ## lr scheduler if self.use_scheduler: mainlogger.info("Setting up scheduler...") lr_scheduler = self.configure_schedulers(optimizer) return [optimizer], [lr_scheduler] return optimizer class DiffusionWrapper(pl.LightningModule): def __init__(self, diff_model_config, conditioning_key): super().__init__() self.diffusion_model = instantiate_from_config(diff_model_config) self.conditioning_key = conditioning_key def forward(self, x, t, c_concat: list = None, c_crossattn: list = None, c_adm=None, s=None, mask=None, **kwargs): # temporal_context = fps is foNone if self.conditioning_key is None: out = self.diffusion_model(x, t) elif self.conditioning_key == 'concat': xc = torch.cat([x] + c_concat, dim=1) out = self.diffusion_model(xc, t, **kwargs) elif self.conditioning_key == 'crossattn': cc = torch.cat(c_crossattn, 1) out = self.diffusion_model(x, t, context=cc, **kwargs) elif self.conditioning_key == 'hybrid': ## it is just right [b,c,t,h,w]: concatenate in channel dim xc = torch.cat([x] + c_concat, dim=1) cc = torch.cat(c_crossattn, 1) out = self.diffusion_model(xc, t, context=cc, **kwargs) elif self.conditioning_key == 'resblockcond': cc = c_crossattn[0] out = self.diffusion_model(x, t, context=cc) elif self.conditioning_key == 'adm': cc = c_crossattn[0] out = self.diffusion_model(x, t, y=cc) elif self.conditioning_key == 'hybrid-adm': assert c_adm is not None xc = torch.cat([x] + c_concat, dim=1) cc = torch.cat(c_crossattn, 1) out = self.diffusion_model(xc, t, context=cc, y=c_adm, **kwargs) elif self.conditioning_key == 'hybrid-time': assert s is not None xc = torch.cat([x] + c_concat, dim=1) cc = torch.cat(c_crossattn, 1) out = self.diffusion_model(xc, t, context=cc, s=s) elif self.conditioning_key == 'concat-time-mask': # assert s is not None xc = torch.cat([x] + c_concat, dim=1) out = self.diffusion_model(xc, t, context=None, s=s, mask=mask) elif self.conditioning_key == 'concat-adm-mask': # assert s is not None if c_concat is not None: xc = torch.cat([x] + c_concat, dim=1) else: xc = x out = self.diffusion_model(xc, t, context=None, y=s, mask=mask) elif self.conditioning_key == 'hybrid-adm-mask': cc = torch.cat(c_crossattn, 1) if c_concat is not None: xc = torch.cat([x] + c_concat, dim=1) else: xc = x out = self.diffusion_model(xc, t, context=cc, y=s, mask=mask) elif self.conditioning_key == 'hybrid-time-adm': # adm means y, e.g., class index # assert s is not None assert c_adm is not None xc = torch.cat([x] + c_concat, dim=1) cc = torch.cat(c_crossattn, 1) out = self.diffusion_model(xc, t, context=cc, s=s, y=c_adm) elif self.conditioning_key == 'crossattn-adm': assert c_adm is not None cc = torch.cat(c_crossattn, 1) out = self.diffusion_model(x, t, context=cc, y=c_adm) else: raise NotImplementedError() return out ================================================ FILE: ToonCrafter/lvdm/models/samplers/ddim.py ================================================ import numpy as np from tqdm import tqdm import torch from lvdm.models.utils_diffusion import make_ddim_sampling_parameters, make_ddim_timesteps, rescale_noise_cfg from lvdm.common import noise_like from lvdm.common import extract_into_tensor import copy class DDIMSampler(object): def __init__(self, model, schedule="linear", **kwargs): super().__init__() self.model = model self.ddpm_num_timesteps = model.num_timesteps self.schedule = schedule self.counter = 0 def register_buffer(self, name, attr): if isinstance(attr, torch.Tensor): if attr.device != torch.device("cuda") and torch.cuda.is_available(): attr = attr.to(torch.device("cuda")) setattr(self, name, attr) def make_schedule(self, ddim_num_steps, ddim_discretize="uniform", ddim_eta=0., verbose=True): self.ddim_timesteps = make_ddim_timesteps(ddim_discr_method=ddim_discretize, num_ddim_timesteps=ddim_num_steps, num_ddpm_timesteps=self.ddpm_num_timesteps, verbose=verbose) alphas_cumprod = self.model.alphas_cumprod assert alphas_cumprod.shape[0] == self.ddpm_num_timesteps, 'alphas have to be defined for each timestep' def to_torch(x): return x.clone().detach().to(torch.float32).to(self.model.device) if self.model.use_dynamic_rescale: self.ddim_scale_arr = self.model.scale_arr[self.ddim_timesteps] self.ddim_scale_arr_prev = torch.cat([self.ddim_scale_arr[0:1], self.ddim_scale_arr[:-1]]) self.register_buffer('betas', to_torch(self.model.betas)) self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod)) self.register_buffer('alphas_cumprod_prev', to_torch(self.model.alphas_cumprod_prev)) # calculations for diffusion q(x_t | x_{t-1}) and others self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod.cpu()))) self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod.cpu()))) self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod.cpu()))) self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu()))) self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu() - 1))) # ddim sampling parameters ddim_sigmas, ddim_alphas, ddim_alphas_prev = make_ddim_sampling_parameters(alphacums=alphas_cumprod.cpu(), ddim_timesteps=self.ddim_timesteps, eta=ddim_eta, verbose=verbose) self.register_buffer('ddim_sigmas', ddim_sigmas) self.register_buffer('ddim_alphas', ddim_alphas) self.register_buffer('ddim_alphas_prev', ddim_alphas_prev) self.register_buffer('ddim_sqrt_one_minus_alphas', np.sqrt(1. - ddim_alphas)) sigmas_for_original_sampling_steps = ddim_eta * torch.sqrt( (1 - self.alphas_cumprod_prev) / (1 - self.alphas_cumprod) * ( 1 - self.alphas_cumprod / self.alphas_cumprod_prev)) self.register_buffer('ddim_sigmas_for_original_num_steps', sigmas_for_original_sampling_steps) @torch.no_grad() def sample(self, S, batch_size, shape, conditioning=None, callback=None, normals_sequence=None, img_callback=None, quantize_x0=False, eta=0., mask=None, x0=None, temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None, verbose=True, schedule_verbose=False, x_T=None, log_every_t=100, unconditional_guidance_scale=1., unconditional_conditioning=None, precision=None, fs=None, timestep_spacing='uniform', # uniform_trailing for starting from last timestep guidance_rescale=0.0, **kwargs ): # check condition bs if conditioning is not None: if isinstance(conditioning, dict): try: cbs = conditioning[list(conditioning.keys())[0]].shape[0] except BaseException: cbs = conditioning[list(conditioning.keys())[0]][0].shape[0] if cbs != batch_size: print(f"Warning: Got {cbs} conditionings but batch-size is {batch_size}") else: if conditioning.shape[0] != batch_size: print(f"Warning: Got {conditioning.shape[0]} conditionings but batch-size is {batch_size}") self.make_schedule(ddim_num_steps=S, ddim_discretize=timestep_spacing, ddim_eta=eta, verbose=schedule_verbose) # make shape if len(shape) == 3: C, H, W = shape size = (batch_size, C, H, W) elif len(shape) == 4: C, T, H, W = shape size = (batch_size, C, T, H, W) samples, intermediates = self.ddim_sampling(conditioning, size, callback=callback, img_callback=img_callback, quantize_denoised=quantize_x0, mask=mask, x0=x0, ddim_use_original_steps=False, noise_dropout=noise_dropout, temperature=temperature, score_corrector=score_corrector, corrector_kwargs=corrector_kwargs, x_T=x_T, log_every_t=log_every_t, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=unconditional_conditioning, verbose=verbose, precision=precision, fs=fs, guidance_rescale=guidance_rescale, **kwargs) return samples, intermediates @torch.no_grad() def ddim_sampling(self, cond, shape, x_T=None, ddim_use_original_steps=False, callback=None, timesteps=None, quantize_denoised=False, mask=None, x0=None, img_callback=None, log_every_t=100, temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None, unconditional_guidance_scale=1., unconditional_conditioning=None, verbose=True, precision=None, fs=None, guidance_rescale=0.0, **kwargs): device = self.model.betas.device b = shape[0] if x_T is None: img = torch.randn(shape, device=device) else: img = x_T if precision is not None: if precision == 16: img = img.to(dtype=torch.float16) if timesteps is None: timesteps = self.ddpm_num_timesteps if ddim_use_original_steps else self.ddim_timesteps elif timesteps is not None and not ddim_use_original_steps: subset_end = int(min(timesteps / self.ddim_timesteps.shape[0], 1) * self.ddim_timesteps.shape[0]) - 1 timesteps = self.ddim_timesteps[:subset_end] intermediates = {'x_inter': [img], 'pred_x0': [img]} time_range = reversed(range(0, timesteps)) if ddim_use_original_steps else np.flip(timesteps) total_steps = timesteps if ddim_use_original_steps else timesteps.shape[0] if verbose: iterator = tqdm(time_range, desc='DDIM Sampler', total=total_steps) else: iterator = time_range clean_cond = kwargs.pop("clean_cond", False) # cond_copy, unconditional_conditioning_copy = copy.deepcopy(cond), copy.deepcopy(unconditional_conditioning) for i, step in enumerate(iterator): index = total_steps - i - 1 ts = torch.full((b,), step, device=device, dtype=torch.long) # use mask to blend noised original latent (img_orig) & new sampled latent (img) if mask is not None: assert x0 is not None if clean_cond: img_orig = x0 else: img_orig = self.model.q_sample(x0, ts) # TODO: deterministic forward pass? img = img_orig * mask + (1. - mask) * img # keep original & modify use img outs = self.p_sample_ddim(img, cond, ts, index=index, use_original_steps=ddim_use_original_steps, quantize_denoised=quantize_denoised, temperature=temperature, noise_dropout=noise_dropout, score_corrector=score_corrector, corrector_kwargs=corrector_kwargs, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=unconditional_conditioning, mask=mask, x0=x0, fs=fs, guidance_rescale=guidance_rescale, **kwargs) img, pred_x0 = outs if precision == 16: img = img.to(dtype=torch.float16) if callback: callback(i) if img_callback: img_callback(pred_x0, i) if index % log_every_t == 0 or index == total_steps - 1: intermediates['x_inter'].append(img) intermediates['pred_x0'].append(pred_x0) return img, intermediates @torch.no_grad() def p_sample_ddim(self, x, c, t, index, repeat_noise=False, use_original_steps=False, quantize_denoised=False, temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None, unconditional_guidance_scale=1., unconditional_conditioning=None, uc_type=None, conditional_guidance_scale_temporal=None, mask=None, x0=None, guidance_rescale=0.0, **kwargs): b, *_, device = *x.shape, x.device if x.dim() == 5: is_video = True else: is_video = False if unconditional_conditioning is None or unconditional_guidance_scale == 1.: model_output = self.model.apply_model(x, t, c, **kwargs) # unet denoiser else: # do_classifier_free_guidance if isinstance(c, torch.Tensor) or isinstance(c, dict): e_t_cond = self.model.apply_model(x, t, c, **kwargs) e_t_uncond = self.model.apply_model(x, t, unconditional_conditioning, **kwargs) else: raise NotImplementedError model_output = e_t_uncond + unconditional_guidance_scale * (e_t_cond - e_t_uncond) if guidance_rescale > 0.0: model_output = rescale_noise_cfg(model_output, e_t_cond, guidance_rescale=guidance_rescale) if self.model.parameterization == "v": e_t = self.model.predict_eps_from_z_and_v(x, t, model_output) else: e_t = model_output if score_corrector is not None: assert self.model.parameterization == "eps", 'not implemented' e_t = score_corrector.modify_score(self.model, e_t, x, t, c, **corrector_kwargs) alphas = self.model.alphas_cumprod if use_original_steps else self.ddim_alphas alphas_prev = self.model.alphas_cumprod_prev if use_original_steps else self.ddim_alphas_prev sqrt_one_minus_alphas = self.model.sqrt_one_minus_alphas_cumprod if use_original_steps else self.ddim_sqrt_one_minus_alphas # sigmas = self.model.ddim_sigmas_for_original_num_steps if use_original_steps else self.ddim_sigmas sigmas = self.ddim_sigmas_for_original_num_steps if use_original_steps else self.ddim_sigmas # select parameters corresponding to the currently considered timestep if is_video: size = (b, 1, 1, 1, 1) else: size = (b, 1, 1, 1) a_t = torch.full(size, alphas[index], device=device) a_prev = torch.full(size, alphas_prev[index], device=device) sigma_t = torch.full(size, sigmas[index], device=device) sqrt_one_minus_at = torch.full(size, sqrt_one_minus_alphas[index], device=device) # current prediction for x_0 if self.model.parameterization != "v": pred_x0 = (x - sqrt_one_minus_at * e_t) / a_t.sqrt() else: pred_x0 = self.model.predict_start_from_z_and_v(x, t, model_output) if self.model.use_dynamic_rescale: scale_t = torch.full(size, self.ddim_scale_arr[index], device=device) prev_scale_t = torch.full(size, self.ddim_scale_arr_prev[index], device=device) rescale = (prev_scale_t / scale_t) pred_x0 *= rescale if quantize_denoised: pred_x0, _, *_ = self.model.first_stage_model.quantize(pred_x0) # direction pointing to x_t dir_xt = (1. - a_prev - sigma_t**2).sqrt() * e_t noise = sigma_t * noise_like(x.shape, device, repeat_noise) * temperature if noise_dropout > 0.: noise = torch.nn.functional.dropout(noise, p=noise_dropout) x_prev = a_prev.sqrt() * pred_x0 + dir_xt + noise return x_prev, pred_x0 @torch.no_grad() def decode(self, x_latent, cond, t_start, unconditional_guidance_scale=1.0, unconditional_conditioning=None, use_original_steps=False, callback=None): timesteps = np.arange(self.ddpm_num_timesteps) if use_original_steps else self.ddim_timesteps timesteps = timesteps[:t_start] time_range = np.flip(timesteps) total_steps = timesteps.shape[0] print(f"Running DDIM Sampling with {total_steps} timesteps") iterator = tqdm(time_range, desc='Decoding image', total=total_steps) x_dec = x_latent for i, step in enumerate(iterator): index = total_steps - i - 1 ts = torch.full((x_latent.shape[0],), step, device=x_latent.device, dtype=torch.long) x_dec, _ = self.p_sample_ddim(x_dec, cond, ts, index=index, use_original_steps=use_original_steps, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=unconditional_conditioning) if callback: callback(i) return x_dec @torch.no_grad() def stochastic_encode(self, x0, t, use_original_steps=False, noise=None): # fast, but does not allow for exact reconstruction # t serves as an index to gather the correct alphas if use_original_steps: sqrt_alphas_cumprod = self.sqrt_alphas_cumprod sqrt_one_minus_alphas_cumprod = self.sqrt_one_minus_alphas_cumprod else: sqrt_alphas_cumprod = torch.sqrt(self.ddim_alphas) sqrt_one_minus_alphas_cumprod = self.ddim_sqrt_one_minus_alphas if noise is None: noise = torch.randn_like(x0) return (extract_into_tensor(sqrt_alphas_cumprod, t, x0.shape) * x0 + extract_into_tensor(sqrt_one_minus_alphas_cumprod, t, x0.shape) * noise) ================================================ FILE: ToonCrafter/lvdm/models/samplers/ddim_multiplecond.py ================================================ import numpy as np from tqdm import tqdm import torch from lvdm.models.utils_diffusion import make_ddim_sampling_parameters, make_ddim_timesteps, rescale_noise_cfg from lvdm.common import noise_like from lvdm.common import extract_into_tensor import copy class DDIMSampler(object): def __init__(self, model, schedule="linear", **kwargs): super().__init__() self.model = model self.ddpm_num_timesteps = model.num_timesteps self.schedule = schedule self.counter = 0 def register_buffer(self, name, attr): if type(attr) == torch.Tensor: if attr.device != torch.device("cuda"): attr = attr.to(torch.device("cuda")) setattr(self, name, attr) def make_schedule(self, ddim_num_steps, ddim_discretize="uniform", ddim_eta=0., verbose=True): self.ddim_timesteps = make_ddim_timesteps(ddim_discr_method=ddim_discretize, num_ddim_timesteps=ddim_num_steps, num_ddpm_timesteps=self.ddpm_num_timesteps,verbose=verbose) alphas_cumprod = self.model.alphas_cumprod assert alphas_cumprod.shape[0] == self.ddpm_num_timesteps, 'alphas have to be defined for each timestep' to_torch = lambda x: x.clone().detach().to(torch.float32).to(self.model.device) if self.model.use_dynamic_rescale: self.ddim_scale_arr = self.model.scale_arr[self.ddim_timesteps] self.ddim_scale_arr_prev = torch.cat([self.ddim_scale_arr[0:1], self.ddim_scale_arr[:-1]]) self.register_buffer('betas', to_torch(self.model.betas)) self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod)) self.register_buffer('alphas_cumprod_prev', to_torch(self.model.alphas_cumprod_prev)) # calculations for diffusion q(x_t | x_{t-1}) and others self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod.cpu()))) self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod.cpu()))) self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod.cpu()))) self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu()))) self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu() - 1))) # ddim sampling parameters ddim_sigmas, ddim_alphas, ddim_alphas_prev = make_ddim_sampling_parameters(alphacums=alphas_cumprod.cpu(), ddim_timesteps=self.ddim_timesteps, eta=ddim_eta,verbose=verbose) self.register_buffer('ddim_sigmas', ddim_sigmas) self.register_buffer('ddim_alphas', ddim_alphas) self.register_buffer('ddim_alphas_prev', ddim_alphas_prev) self.register_buffer('ddim_sqrt_one_minus_alphas', np.sqrt(1. - ddim_alphas)) sigmas_for_original_sampling_steps = ddim_eta * torch.sqrt( (1 - self.alphas_cumprod_prev) / (1 - self.alphas_cumprod) * ( 1 - self.alphas_cumprod / self.alphas_cumprod_prev)) self.register_buffer('ddim_sigmas_for_original_num_steps', sigmas_for_original_sampling_steps) @torch.no_grad() def sample(self, S, batch_size, shape, conditioning=None, callback=None, normals_sequence=None, img_callback=None, quantize_x0=False, eta=0., mask=None, x0=None, temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None, verbose=True, schedule_verbose=False, x_T=None, log_every_t=100, unconditional_guidance_scale=1., unconditional_conditioning=None, precision=None, fs=None, timestep_spacing='uniform', #uniform_trailing for starting from last timestep guidance_rescale=0.0, # this has to come in the same format as the conditioning, # e.g. as encoded tokens, ... **kwargs ): # check condition bs if conditioning is not None: if isinstance(conditioning, dict): try: cbs = conditioning[list(conditioning.keys())[0]].shape[0] except: cbs = conditioning[list(conditioning.keys())[0]][0].shape[0] if cbs != batch_size: print(f"Warning: Got {cbs} conditionings but batch-size is {batch_size}") else: if conditioning.shape[0] != batch_size: print(f"Warning: Got {conditioning.shape[0]} conditionings but batch-size is {batch_size}") # print('==> timestep_spacing: ', timestep_spacing, guidance_rescale) self.make_schedule(ddim_num_steps=S, ddim_discretize=timestep_spacing, ddim_eta=eta, verbose=schedule_verbose) # make shape if len(shape) == 3: C, H, W = shape size = (batch_size, C, H, W) elif len(shape) == 4: C, T, H, W = shape size = (batch_size, C, T, H, W) # print(f'Data shape for DDIM sampling is {size}, eta {eta}') samples, intermediates = self.ddim_sampling(conditioning, size, callback=callback, img_callback=img_callback, quantize_denoised=quantize_x0, mask=mask, x0=x0, ddim_use_original_steps=False, noise_dropout=noise_dropout, temperature=temperature, score_corrector=score_corrector, corrector_kwargs=corrector_kwargs, x_T=x_T, log_every_t=log_every_t, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=unconditional_conditioning, verbose=verbose, precision=precision, fs=fs, guidance_rescale=guidance_rescale, **kwargs) return samples, intermediates @torch.no_grad() def ddim_sampling(self, cond, shape, x_T=None, ddim_use_original_steps=False, callback=None, timesteps=None, quantize_denoised=False, mask=None, x0=None, img_callback=None, log_every_t=100, temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None, unconditional_guidance_scale=1., unconditional_conditioning=None, verbose=True,precision=None,fs=None,guidance_rescale=0.0, **kwargs): device = self.model.betas.device b = shape[0] if x_T is None: img = torch.randn(shape, device=device) else: img = x_T if precision is not None: if precision == 16: img = img.to(dtype=torch.float16) if timesteps is None: timesteps = self.ddpm_num_timesteps if ddim_use_original_steps else self.ddim_timesteps elif timesteps is not None and not ddim_use_original_steps: subset_end = int(min(timesteps / self.ddim_timesteps.shape[0], 1) * self.ddim_timesteps.shape[0]) - 1 timesteps = self.ddim_timesteps[:subset_end] intermediates = {'x_inter': [img], 'pred_x0': [img]} time_range = reversed(range(0,timesteps)) if ddim_use_original_steps else np.flip(timesteps) total_steps = timesteps if ddim_use_original_steps else timesteps.shape[0] if verbose: iterator = tqdm(time_range, desc='DDIM Sampler', total=total_steps) else: iterator = time_range clean_cond = kwargs.pop("clean_cond", False) # cond_copy, unconditional_conditioning_copy = copy.deepcopy(cond), copy.deepcopy(unconditional_conditioning) for i, step in enumerate(iterator): index = total_steps - i - 1 ts = torch.full((b,), step, device=device, dtype=torch.long) ## use mask to blend noised original latent (img_orig) & new sampled latent (img) if mask is not None: assert x0 is not None if clean_cond: img_orig = x0 else: img_orig = self.model.q_sample(x0, ts) # TODO: deterministic forward pass? img = img_orig * mask + (1. - mask) * img # keep original & modify use img outs = self.p_sample_ddim(img, cond, ts, index=index, use_original_steps=ddim_use_original_steps, quantize_denoised=quantize_denoised, temperature=temperature, noise_dropout=noise_dropout, score_corrector=score_corrector, corrector_kwargs=corrector_kwargs, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=unconditional_conditioning, mask=mask,x0=x0,fs=fs,guidance_rescale=guidance_rescale, **kwargs) img, pred_x0 = outs if callback: callback(i) if img_callback: img_callback(pred_x0, i) if index % log_every_t == 0 or index == total_steps - 1: intermediates['x_inter'].append(img) intermediates['pred_x0'].append(pred_x0) return img, intermediates @torch.no_grad() def p_sample_ddim(self, x, c, t, index, repeat_noise=False, use_original_steps=False, quantize_denoised=False, temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None, unconditional_guidance_scale=1., unconditional_conditioning=None, uc_type=None, cfg_img=None,mask=None,x0=None,guidance_rescale=0.0, **kwargs): b, *_, device = *x.shape, x.device if x.dim() == 5: is_video = True else: is_video = False if cfg_img is None: cfg_img = unconditional_guidance_scale unconditional_conditioning_img_nonetext = kwargs['unconditional_conditioning_img_nonetext'] if unconditional_conditioning is None or unconditional_guidance_scale == 1.: model_output = self.model.apply_model(x, t, c, **kwargs) # unet denoiser else: ### with unconditional condition e_t_cond = self.model.apply_model(x, t, c, **kwargs) e_t_uncond = self.model.apply_model(x, t, unconditional_conditioning, **kwargs) e_t_uncond_img = self.model.apply_model(x, t, unconditional_conditioning_img_nonetext, **kwargs) # text cfg model_output = e_t_uncond + cfg_img * (e_t_uncond_img - e_t_uncond) + unconditional_guidance_scale * (e_t_cond - e_t_uncond_img) if guidance_rescale > 0.0: model_output = rescale_noise_cfg(model_output, e_t_cond, guidance_rescale=guidance_rescale) if self.model.parameterization == "v": e_t = self.model.predict_eps_from_z_and_v(x, t, model_output) else: e_t = model_output if score_corrector is not None: assert self.model.parameterization == "eps", 'not implemented' e_t = score_corrector.modify_score(self.model, e_t, x, t, c, **corrector_kwargs) alphas = self.model.alphas_cumprod if use_original_steps else self.ddim_alphas alphas_prev = self.model.alphas_cumprod_prev if use_original_steps else self.ddim_alphas_prev sqrt_one_minus_alphas = self.model.sqrt_one_minus_alphas_cumprod if use_original_steps else self.ddim_sqrt_one_minus_alphas sigmas = self.ddim_sigmas_for_original_num_steps if use_original_steps else self.ddim_sigmas # select parameters corresponding to the currently considered timestep if is_video: size = (b, 1, 1, 1, 1) else: size = (b, 1, 1, 1) a_t = torch.full(size, alphas[index], device=device) a_prev = torch.full(size, alphas_prev[index], device=device) sigma_t = torch.full(size, sigmas[index], device=device) sqrt_one_minus_at = torch.full(size, sqrt_one_minus_alphas[index],device=device) # current prediction for x_0 if self.model.parameterization != "v": pred_x0 = (x - sqrt_one_minus_at * e_t) / a_t.sqrt() else: pred_x0 = self.model.predict_start_from_z_and_v(x, t, model_output) if self.model.use_dynamic_rescale: scale_t = torch.full(size, self.ddim_scale_arr[index], device=device) prev_scale_t = torch.full(size, self.ddim_scale_arr_prev[index], device=device) rescale = (prev_scale_t / scale_t) pred_x0 *= rescale if quantize_denoised: pred_x0, _, *_ = self.model.first_stage_model.quantize(pred_x0) # direction pointing to x_t dir_xt = (1. - a_prev - sigma_t**2).sqrt() * e_t noise = sigma_t * noise_like(x.shape, device, repeat_noise) * temperature if noise_dropout > 0.: noise = torch.nn.functional.dropout(noise, p=noise_dropout) x_prev = a_prev.sqrt() * pred_x0 + dir_xt + noise return x_prev, pred_x0 @torch.no_grad() def decode(self, x_latent, cond, t_start, unconditional_guidance_scale=1.0, unconditional_conditioning=None, use_original_steps=False, callback=None): timesteps = np.arange(self.ddpm_num_timesteps) if use_original_steps else self.ddim_timesteps timesteps = timesteps[:t_start] time_range = np.flip(timesteps) total_steps = timesteps.shape[0] print(f"Running DDIM Sampling with {total_steps} timesteps") iterator = tqdm(time_range, desc='Decoding image', total=total_steps) x_dec = x_latent for i, step in enumerate(iterator): index = total_steps - i - 1 ts = torch.full((x_latent.shape[0],), step, device=x_latent.device, dtype=torch.long) x_dec, _ = self.p_sample_ddim(x_dec, cond, ts, index=index, use_original_steps=use_original_steps, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=unconditional_conditioning) if callback: callback(i) return x_dec @torch.no_grad() def stochastic_encode(self, x0, t, use_original_steps=False, noise=None): # fast, but does not allow for exact reconstruction # t serves as an index to gather the correct alphas if use_original_steps: sqrt_alphas_cumprod = self.sqrt_alphas_cumprod sqrt_one_minus_alphas_cumprod = self.sqrt_one_minus_alphas_cumprod else: sqrt_alphas_cumprod = torch.sqrt(self.ddim_alphas) sqrt_one_minus_alphas_cumprod = self.ddim_sqrt_one_minus_alphas if noise is None: noise = torch.randn_like(x0) return (extract_into_tensor(sqrt_alphas_cumprod, t, x0.shape) * x0 + extract_into_tensor(sqrt_one_minus_alphas_cumprod, t, x0.shape) * noise) ================================================ FILE: ToonCrafter/lvdm/models/utils_diffusion.py ================================================ import math import numpy as np import torch import torch.nn.functional as F from einops import repeat def timestep_embedding(timesteps, dim, max_period=10000, repeat_only=False): """ Create sinusoidal timestep embeddings. :param timesteps: a 1-D Tensor of N indices, one per batch element. These may be fractional. :param dim: the dimension of the output. :param max_period: controls the minimum frequency of the embeddings. :return: an [N x dim] Tensor of positional embeddings. """ if not repeat_only: half = dim // 2 freqs = torch.exp( -math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half ).to(device=timesteps.device) args = timesteps[:, None].float() * freqs[None] embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1) if dim % 2: embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1) else: embedding = repeat(timesteps, 'b -> b d', d=dim) return embedding def make_beta_schedule(schedule, n_timestep, linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3): if schedule == "linear": betas = ( torch.linspace(linear_start ** 0.5, linear_end ** 0.5, n_timestep, dtype=torch.float64) ** 2 ) elif schedule == "cosine": timesteps = ( torch.arange(n_timestep + 1, dtype=torch.float64) / n_timestep + cosine_s ) alphas = timesteps / (1 + cosine_s) * np.pi / 2 alphas = torch.cos(alphas).pow(2) alphas = alphas / alphas[0] betas = 1 - alphas[1:] / alphas[:-1] betas = np.clip(betas, a_min=0, a_max=0.999) elif schedule == "sqrt_linear": betas = torch.linspace(linear_start, linear_end, n_timestep, dtype=torch.float64) elif schedule == "sqrt": betas = torch.linspace(linear_start, linear_end, n_timestep, dtype=torch.float64) ** 0.5 else: raise ValueError(f"schedule '{schedule}' unknown.") return betas.numpy() def make_ddim_timesteps(ddim_discr_method, num_ddim_timesteps, num_ddpm_timesteps, verbose=True): if ddim_discr_method == 'uniform': c = num_ddpm_timesteps // num_ddim_timesteps ddim_timesteps = np.asarray(list(range(0, num_ddpm_timesteps, c))) steps_out = ddim_timesteps + 1 elif ddim_discr_method == 'uniform_trailing': c = num_ddpm_timesteps / num_ddim_timesteps ddim_timesteps = np.flip(np.round(np.arange(num_ddpm_timesteps, 0, -c))).astype(np.int64) steps_out = ddim_timesteps - 1 elif ddim_discr_method == 'quad': ddim_timesteps = ((np.linspace(0, np.sqrt(num_ddpm_timesteps * .8), num_ddim_timesteps)) ** 2).astype(int) steps_out = ddim_timesteps + 1 else: raise NotImplementedError(f'There is no ddim discretization method called "{ddim_discr_method}"') # assert ddim_timesteps.shape[0] == num_ddim_timesteps # add one to get the final alpha values right (the ones from first scale to data during sampling) # steps_out = ddim_timesteps + 1 if verbose: print(f'Selected timesteps for ddim sampler: {steps_out}') return steps_out def make_ddim_sampling_parameters(alphacums, ddim_timesteps, eta, verbose=True): # select alphas for computing the variance schedule # print(f'ddim_timesteps={ddim_timesteps}, len_alphacums={len(alphacums)}') alphas = alphacums[ddim_timesteps] alphas_prev = np.asarray([alphacums[0]] + alphacums[ddim_timesteps[:-1]].tolist()) # according the formula provided in https://arxiv.org/abs/2010.02502 sigmas = eta * np.sqrt((1 - alphas_prev) / (1 - alphas) * (1 - alphas / alphas_prev)) if verbose: print(f'Selected alphas for ddim sampler: a_t: {alphas}; a_(t-1): {alphas_prev}') print(f'For the chosen value of eta, which is {eta}, ' f'this results in the following sigma_t schedule for ddim sampler {sigmas}') return sigmas, alphas, alphas_prev def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.999): """ Create a beta schedule that discretizes the given alpha_t_bar function, which defines the cumulative product of (1-beta) over time from t = [0,1]. :param num_diffusion_timesteps: the number of betas to produce. :param alpha_bar: a lambda that takes an argument t from 0 to 1 and produces the cumulative product of (1-beta) up to that part of the diffusion process. :param max_beta: the maximum beta to use; use values lower than 1 to prevent singularities. """ betas = [] for i in range(num_diffusion_timesteps): t1 = i / num_diffusion_timesteps t2 = (i + 1) / num_diffusion_timesteps betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta)) return np.array(betas) def rescale_zero_terminal_snr(betas): """ Rescales betas to have zero terminal SNR Based on https://arxiv.org/pdf/2305.08891.pdf (Algorithm 1) Args: betas (`numpy.ndarray`): the betas that the scheduler is being initialized with. Returns: `numpy.ndarray`: rescaled betas with zero terminal SNR """ # Convert betas to alphas_bar_sqrt alphas = 1.0 - betas alphas_cumprod = np.cumprod(alphas, axis=0) alphas_bar_sqrt = np.sqrt(alphas_cumprod) # Store old values. alphas_bar_sqrt_0 = alphas_bar_sqrt[0].copy() alphas_bar_sqrt_T = alphas_bar_sqrt[-1].copy() # Shift so the last timestep is zero. alphas_bar_sqrt -= alphas_bar_sqrt_T # Scale so the first timestep is back to the old value. alphas_bar_sqrt *= alphas_bar_sqrt_0 / (alphas_bar_sqrt_0 - alphas_bar_sqrt_T) # Convert alphas_bar_sqrt to betas alphas_bar = alphas_bar_sqrt**2 # Revert sqrt alphas = alphas_bar[1:] / alphas_bar[:-1] # Revert cumprod alphas = np.concatenate([alphas_bar[0:1], alphas]) betas = 1 - alphas return betas def rescale_noise_cfg(noise_cfg, noise_pred_text, guidance_rescale=0.0): """ Rescale `noise_cfg` according to `guidance_rescale`. Based on findings of [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/pdf/2305.08891.pdf). See Section 3.4 """ std_text = noise_pred_text.std(dim=list(range(1, noise_pred_text.ndim)), keepdim=True) std_cfg = noise_cfg.std(dim=list(range(1, noise_cfg.ndim)), keepdim=True) # rescale the results from guidance (fixes overexposure) noise_pred_rescaled = noise_cfg * (std_text / std_cfg) # mix with the original results from guidance by factor guidance_rescale to avoid "plain looking" images noise_cfg = guidance_rescale * noise_pred_rescaled + (1 - guidance_rescale) * noise_cfg return noise_cfg ================================================ FILE: ToonCrafter/lvdm/modules/attention.py ================================================ import torch from torch import nn, einsum import torch.nn.functional as F from einops import rearrange, repeat from functools import partial try: import xformers import xformers.ops XFORMERS_IS_AVAILABLE = True except: XFORMERS_IS_AVAILABLE = False from lvdm.common import ( checkpoint, exists, default, ) from lvdm.basics import zero_module class RelativePosition(nn.Module): """ https://github.com/evelinehong/Transformer_Relative_Position_PyTorch/blob/master/relative_position.py """ def __init__(self, num_units, max_relative_position): super().__init__() self.num_units = num_units self.max_relative_position = max_relative_position self.embeddings_table = nn.Parameter(torch.Tensor(max_relative_position * 2 + 1, num_units)) nn.init.xavier_uniform_(self.embeddings_table) def forward(self, length_q, length_k): device = self.embeddings_table.device range_vec_q = torch.arange(length_q, device=device) range_vec_k = torch.arange(length_k, device=device) distance_mat = range_vec_k[None, :] - range_vec_q[:, None] distance_mat_clipped = torch.clamp(distance_mat, -self.max_relative_position, self.max_relative_position) final_mat = distance_mat_clipped + self.max_relative_position final_mat = final_mat.long() embeddings = self.embeddings_table[final_mat] return embeddings class CrossAttention(nn.Module): def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0., relative_position=False, temporal_length=None, video_length=None, image_cross_attention=False, image_cross_attention_scale=1.0, image_cross_attention_scale_learnable=False, text_context_len=77): super().__init__() inner_dim = dim_head * heads context_dim = default(context_dim, query_dim) self.scale = dim_head**-0.5 self.heads = heads self.dim_head = dim_head self.to_q = nn.Linear(query_dim, inner_dim, bias=False) self.to_k = nn.Linear(context_dim, inner_dim, bias=False) self.to_v = nn.Linear(context_dim, inner_dim, bias=False) self.to_out = nn.Sequential(nn.Linear(inner_dim, query_dim), nn.Dropout(dropout)) self.relative_position = relative_position if self.relative_position: assert(temporal_length is not None) self.relative_position_k = RelativePosition(num_units=dim_head, max_relative_position=temporal_length) self.relative_position_v = RelativePosition(num_units=dim_head, max_relative_position=temporal_length) else: ## only used for spatial attention, while NOT for temporal attention if XFORMERS_IS_AVAILABLE and temporal_length is None: self.forward = self.efficient_forward self.video_length = video_length self.image_cross_attention = image_cross_attention self.image_cross_attention_scale = image_cross_attention_scale self.text_context_len = text_context_len self.image_cross_attention_scale_learnable = image_cross_attention_scale_learnable if self.image_cross_attention: self.to_k_ip = nn.Linear(context_dim, inner_dim, bias=False) self.to_v_ip = nn.Linear(context_dim, inner_dim, bias=False) if image_cross_attention_scale_learnable: self.register_parameter('alpha', nn.Parameter(torch.tensor(0.)) ) def forward(self, x, context=None, mask=None): spatial_self_attn = (context is None) k_ip, v_ip, out_ip = None, None, None h = self.heads q = self.to_q(x) context = default(context, x) if self.image_cross_attention and not spatial_self_attn: context, context_image = context[:,:self.text_context_len,:], context[:,self.text_context_len:,:] k = self.to_k(context) v = self.to_v(context) k_ip = self.to_k_ip(context_image) v_ip = self.to_v_ip(context_image) else: if not spatial_self_attn: context = context[:,:self.text_context_len,:] k = self.to_k(context) v = self.to_v(context) q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=h), (q, k, v)) sim = torch.einsum('b i d, b j d -> b i j', q, k) * self.scale if self.relative_position: len_q, len_k, len_v = q.shape[1], k.shape[1], v.shape[1] k2 = self.relative_position_k(len_q, len_k) sim2 = einsum('b t d, t s d -> b t s', q, k2) * self.scale # TODO check sim += sim2 del k if exists(mask): ## feasible for causal attention mask only max_neg_value = -torch.finfo(sim.dtype).max mask = repeat(mask, 'b i j -> (b h) i j', h=h) sim.masked_fill_(~(mask>0.5), max_neg_value) # attention, what we cannot get enough of sim = sim.softmax(dim=-1) out = torch.einsum('b i j, b j d -> b i d', sim, v) if self.relative_position: v2 = self.relative_position_v(len_q, len_v) out2 = einsum('b t s, t s d -> b t d', sim, v2) # TODO check out += out2 out = rearrange(out, '(b h) n d -> b n (h d)', h=h) ## for image cross-attention if k_ip is not None: k_ip, v_ip = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=h), (k_ip, v_ip)) sim_ip = torch.einsum('b i d, b j d -> b i j', q, k_ip) * self.scale del k_ip sim_ip = sim_ip.softmax(dim=-1) out_ip = torch.einsum('b i j, b j d -> b i d', sim_ip, v_ip) out_ip = rearrange(out_ip, '(b h) n d -> b n (h d)', h=h) if out_ip is not None: if self.image_cross_attention_scale_learnable: out = out + self.image_cross_attention_scale * out_ip * (torch.tanh(self.alpha)+1) else: out = out + self.image_cross_attention_scale * out_ip return self.to_out(out) def efficient_forward(self, x, context=None, mask=None): spatial_self_attn = (context is None) k_ip, v_ip, out_ip = None, None, None q = self.to_q(x) context = default(context, x) if self.image_cross_attention and not spatial_self_attn: context, context_image = context[:,:self.text_context_len,:], context[:,self.text_context_len:,:] k = self.to_k(context) v = self.to_v(context) k_ip = self.to_k_ip(context_image) v_ip = self.to_v_ip(context_image) else: if not spatial_self_attn: context = context[:,:self.text_context_len,:] k = self.to_k(context) v = self.to_v(context) b, _, _ = q.shape q, k, v = map( lambda t: t.unsqueeze(3) .reshape(b, t.shape[1], self.heads, self.dim_head) .permute(0, 2, 1, 3) .reshape(b * self.heads, t.shape[1], self.dim_head) .contiguous(), (q, k, v), ) # actually compute the attention, what we cannot get enough of out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None, op=None) ## for image cross-attention if k_ip is not None: k_ip, v_ip = map( lambda t: t.unsqueeze(3) .reshape(b, t.shape[1], self.heads, self.dim_head) .permute(0, 2, 1, 3) .reshape(b * self.heads, t.shape[1], self.dim_head) .contiguous(), (k_ip, v_ip), ) out_ip = xformers.ops.memory_efficient_attention(q, k_ip, v_ip, attn_bias=None, op=None) out_ip = ( out_ip.unsqueeze(0) .reshape(b, self.heads, out.shape[1], self.dim_head) .permute(0, 2, 1, 3) .reshape(b, out.shape[1], self.heads * self.dim_head) ) if exists(mask): raise NotImplementedError out = ( out.unsqueeze(0) .reshape(b, self.heads, out.shape[1], self.dim_head) .permute(0, 2, 1, 3) .reshape(b, out.shape[1], self.heads * self.dim_head) ) if out_ip is not None: if self.image_cross_attention_scale_learnable: out = out + self.image_cross_attention_scale * out_ip * (torch.tanh(self.alpha)+1) else: out = out + self.image_cross_attention_scale * out_ip return self.to_out(out) class BasicTransformerBlock(nn.Module): def __init__(self, dim, n_heads, d_head, dropout=0., context_dim=None, gated_ff=True, checkpoint=True, disable_self_attn=False, attention_cls=None, video_length=None, image_cross_attention=False, image_cross_attention_scale=1.0, image_cross_attention_scale_learnable=False, text_context_len=77): super().__init__() attn_cls = CrossAttention if attention_cls is None else attention_cls self.disable_self_attn = disable_self_attn self.attn1 = attn_cls(query_dim=dim, heads=n_heads, dim_head=d_head, dropout=dropout, context_dim=context_dim if self.disable_self_attn else None) self.ff = FeedForward(dim, dropout=dropout, glu=gated_ff) self.attn2 = attn_cls(query_dim=dim, context_dim=context_dim, heads=n_heads, dim_head=d_head, dropout=dropout, video_length=video_length, image_cross_attention=image_cross_attention, image_cross_attention_scale=image_cross_attention_scale, image_cross_attention_scale_learnable=image_cross_attention_scale_learnable,text_context_len=text_context_len) self.image_cross_attention = image_cross_attention self.norm1 = nn.LayerNorm(dim) self.norm2 = nn.LayerNorm(dim) self.norm3 = nn.LayerNorm(dim) self.checkpoint = checkpoint def forward(self, x, context=None, mask=None, **kwargs): ## implementation tricks: because checkpointing doesn't support non-tensor (e.g. None or scalar) arguments input_tuple = (x,) ## should not be (x), otherwise *input_tuple will decouple x into multiple arguments if context is not None: input_tuple = (x, context) if mask is not None: forward_mask = partial(self._forward, mask=mask) return checkpoint(forward_mask, (x,), self.parameters(), self.checkpoint) return checkpoint(self._forward, input_tuple, self.parameters(), self.checkpoint) def _forward(self, x, context=None, mask=None): x = self.attn1(self.norm1(x), context=context if self.disable_self_attn else None, mask=mask) + x x = self.attn2(self.norm2(x), context=context, mask=mask) + x x = self.ff(self.norm3(x)) + x return x class SpatialTransformer(nn.Module): """ Transformer block for image-like data in spatial axis. First, project the input (aka embedding) and reshape to b, t, d. Then apply standard transformer action. Finally, reshape to image NEW: use_linear for more efficiency instead of the 1x1 convs """ def __init__(self, in_channels, n_heads, d_head, depth=1, dropout=0., context_dim=None, use_checkpoint=True, disable_self_attn=False, use_linear=False, video_length=None, image_cross_attention=False, image_cross_attention_scale_learnable=False): super().__init__() self.in_channels = in_channels inner_dim = n_heads * d_head self.norm = torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True) if not use_linear: self.proj_in = nn.Conv2d(in_channels, inner_dim, kernel_size=1, stride=1, padding=0) else: self.proj_in = nn.Linear(in_channels, inner_dim) attention_cls = None self.transformer_blocks = nn.ModuleList([ BasicTransformerBlock( inner_dim, n_heads, d_head, dropout=dropout, context_dim=context_dim, disable_self_attn=disable_self_attn, checkpoint=use_checkpoint, attention_cls=attention_cls, video_length=video_length, image_cross_attention=image_cross_attention, image_cross_attention_scale_learnable=image_cross_attention_scale_learnable, ) for d in range(depth) ]) if not use_linear: self.proj_out = zero_module(nn.Conv2d(inner_dim, in_channels, kernel_size=1, stride=1, padding=0)) else: self.proj_out = zero_module(nn.Linear(inner_dim, in_channels)) self.use_linear = use_linear def forward(self, x, context=None, **kwargs): b, c, h, w = x.shape x_in = x x = self.norm(x) if not self.use_linear: x = self.proj_in(x) x = rearrange(x, 'b c h w -> b (h w) c').contiguous() if self.use_linear: x = self.proj_in(x) for i, block in enumerate(self.transformer_blocks): x = block(x, context=context, **kwargs) if self.use_linear: x = self.proj_out(x) x = rearrange(x, 'b (h w) c -> b c h w', h=h, w=w).contiguous() if not self.use_linear: x = self.proj_out(x) return x + x_in class TemporalTransformer(nn.Module): """ Transformer block for image-like data in temporal axis. First, reshape to b, t, d. Then apply standard transformer action. Finally, reshape to image """ def __init__(self, in_channels, n_heads, d_head, depth=1, dropout=0., context_dim=None, use_checkpoint=True, use_linear=False, only_self_att=True, causal_attention=False, causal_block_size=1, relative_position=False, temporal_length=None): super().__init__() self.only_self_att = only_self_att self.relative_position = relative_position self.causal_attention = causal_attention self.causal_block_size = causal_block_size self.in_channels = in_channels inner_dim = n_heads * d_head self.norm = torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True) self.proj_in = nn.Conv1d(in_channels, inner_dim, kernel_size=1, stride=1, padding=0) if not use_linear: self.proj_in = nn.Conv1d(in_channels, inner_dim, kernel_size=1, stride=1, padding=0) else: self.proj_in = nn.Linear(in_channels, inner_dim) if relative_position: assert(temporal_length is not None) attention_cls = partial(CrossAttention, relative_position=True, temporal_length=temporal_length) else: attention_cls = partial(CrossAttention, temporal_length=temporal_length) if self.causal_attention: assert(temporal_length is not None) self.mask = torch.tril(torch.ones([1, temporal_length, temporal_length])) if self.only_self_att: context_dim = None self.transformer_blocks = nn.ModuleList([ BasicTransformerBlock( inner_dim, n_heads, d_head, dropout=dropout, context_dim=context_dim, attention_cls=attention_cls, checkpoint=use_checkpoint) for d in range(depth) ]) if not use_linear: self.proj_out = zero_module(nn.Conv1d(inner_dim, in_channels, kernel_size=1, stride=1, padding=0)) else: self.proj_out = zero_module(nn.Linear(inner_dim, in_channels)) self.use_linear = use_linear def forward(self, x, context=None): b, c, t, h, w = x.shape x_in = x x = self.norm(x) x = rearrange(x, 'b c t h w -> (b h w) c t').contiguous() if not self.use_linear: x = self.proj_in(x) x = rearrange(x, 'bhw c t -> bhw t c').contiguous() if self.use_linear: x = self.proj_in(x) temp_mask = None if self.causal_attention: # slice the from mask map temp_mask = self.mask[:,:t,:t].to(x.device) if temp_mask is not None: mask = temp_mask.to(x.device) mask = repeat(mask, 'l i j -> (l bhw) i j', bhw=b*h*w) else: mask = None if self.only_self_att: ## note: if no context is given, cross-attention defaults to self-attention for i, block in enumerate(self.transformer_blocks): x = block(x, mask=mask) x = rearrange(x, '(b hw) t c -> b hw t c', b=b).contiguous() else: x = rearrange(x, '(b hw) t c -> b hw t c', b=b).contiguous() context = rearrange(context, '(b t) l con -> b t l con', t=t).contiguous() for i, block in enumerate(self.transformer_blocks): # calculate each batch one by one (since number in shape could not greater then 65,535 for some package) for j in range(b): context_j = repeat( context[j], 't l con -> (t r) l con', r=(h * w) // t, t=t).contiguous() ## note: causal mask will not applied in cross-attention case x[j] = block(x[j], context=context_j) if self.use_linear: x = self.proj_out(x) x = rearrange(x, 'b (h w) t c -> b c t h w', h=h, w=w).contiguous() if not self.use_linear: x = rearrange(x, 'b hw t c -> (b hw) c t').contiguous() x = self.proj_out(x) x = rearrange(x, '(b h w) c t -> b c t h w', b=b, h=h, w=w).contiguous() return x + x_in class GEGLU(nn.Module): def __init__(self, dim_in, dim_out): super().__init__() self.proj = nn.Linear(dim_in, dim_out * 2) def forward(self, x): x, gate = self.proj(x).chunk(2, dim=-1) return x * F.gelu(gate) class FeedForward(nn.Module): def __init__(self, dim, dim_out=None, mult=4, glu=False, dropout=0.): super().__init__() inner_dim = int(dim * mult) dim_out = default(dim_out, dim) project_in = nn.Sequential( nn.Linear(dim, inner_dim), nn.GELU() ) if not glu else GEGLU(dim, inner_dim) self.net = nn.Sequential( project_in, nn.Dropout(dropout), nn.Linear(inner_dim, dim_out) ) def forward(self, x): return self.net(x) class LinearAttention(nn.Module): def __init__(self, dim, heads=4, dim_head=32): super().__init__() self.heads = heads hidden_dim = dim_head * heads self.to_qkv = nn.Conv2d(dim, hidden_dim * 3, 1, bias = False) self.to_out = nn.Conv2d(hidden_dim, dim, 1) def forward(self, x): b, c, h, w = x.shape qkv = self.to_qkv(x) q, k, v = rearrange(qkv, 'b (qkv heads c) h w -> qkv b heads c (h w)', heads = self.heads, qkv=3) k = k.softmax(dim=-1) context = torch.einsum('bhdn,bhen->bhde', k, v) out = torch.einsum('bhde,bhdn->bhen', context, q) out = rearrange(out, 'b heads c (h w) -> b (heads c) h w', heads=self.heads, h=h, w=w) return self.to_out(out) class SpatialSelfAttention(nn.Module): def __init__(self, in_channels): super().__init__() self.in_channels = in_channels self.norm = torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True) self.q = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) self.k = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) self.v = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) self.proj_out = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) def forward(self, x): h_ = x h_ = self.norm(h_) q = self.q(h_) k = self.k(h_) v = self.v(h_) # compute attention b,c,h,w = q.shape q = rearrange(q, 'b c h w -> b (h w) c') k = rearrange(k, 'b c h w -> b c (h w)') w_ = torch.einsum('bij,bjk->bik', q, k) w_ = w_ * (int(c)**(-0.5)) w_ = torch.nn.functional.softmax(w_, dim=2) # attend to values v = rearrange(v, 'b c h w -> b c (h w)') w_ = rearrange(w_, 'b i j -> b j i') h_ = torch.einsum('bij,bjk->bik', v, w_) h_ = rearrange(h_, 'b c (h w) -> b c h w', h=h) h_ = self.proj_out(h_) return x+h_ ================================================ FILE: ToonCrafter/lvdm/modules/attention_svd.py ================================================ import logging import math from inspect import isfunction from typing import Any, Optional import torch import torch.nn.functional as F from einops import rearrange, repeat from packaging import version from torch import nn from torch.utils.checkpoint import checkpoint logpy = logging.getLogger(__name__) if version.parse(torch.__version__) >= version.parse("2.0.0"): SDP_IS_AVAILABLE = True from torch.backends.cuda import SDPBackend, sdp_kernel BACKEND_MAP = { SDPBackend.MATH: { "enable_math": True, "enable_flash": False, "enable_mem_efficient": False, }, SDPBackend.FLASH_ATTENTION: { "enable_math": False, "enable_flash": True, "enable_mem_efficient": False, }, SDPBackend.EFFICIENT_ATTENTION: { "enable_math": False, "enable_flash": False, "enable_mem_efficient": True, }, None: {"enable_math": True, "enable_flash": True, "enable_mem_efficient": True}, } else: from contextlib import nullcontext SDP_IS_AVAILABLE = False sdp_kernel = nullcontext BACKEND_MAP = {} logpy.warn( f"No SDP backend available, likely because you are running in pytorch " f"versions < 2.0. In fact, you are using PyTorch {torch.__version__}. " f"You might want to consider upgrading." ) try: import xformers import xformers.ops XFORMERS_IS_AVAILABLE = True except: XFORMERS_IS_AVAILABLE = False logpy.warn("no module 'xformers'. Processing without...") # from .diffusionmodules.util import mixed_checkpoint as checkpoint def exists(val): return val is not None def uniq(arr): return {el: True for el in arr}.keys() def default(val, d): if exists(val): return val return d() if isfunction(d) else d def max_neg_value(t): return -torch.finfo(t.dtype).max def init_(tensor): dim = tensor.shape[-1] std = 1 / math.sqrt(dim) tensor.uniform_(-std, std) return tensor # feedforward class GEGLU(nn.Module): def __init__(self, dim_in, dim_out): super().__init__() self.proj = nn.Linear(dim_in, dim_out * 2) def forward(self, x): x, gate = self.proj(x).chunk(2, dim=-1) return x * F.gelu(gate) class FeedForward(nn.Module): def __init__(self, dim, dim_out=None, mult=4, glu=False, dropout=0.0): super().__init__() inner_dim = int(dim * mult) dim_out = default(dim_out, dim) project_in = ( nn.Sequential(nn.Linear(dim, inner_dim), nn.GELU()) if not glu else GEGLU(dim, inner_dim) ) self.net = nn.Sequential( project_in, nn.Dropout(dropout), nn.Linear(inner_dim, dim_out) ) def forward(self, x): return self.net(x) def zero_module(module): """ Zero out the parameters of a module and return it. """ for p in module.parameters(): p.detach().zero_() return module def Normalize(in_channels): return torch.nn.GroupNorm( num_groups=32, num_channels=in_channels, eps=1e-6, affine=True ) class LinearAttention(nn.Module): def __init__(self, dim, heads=4, dim_head=32): super().__init__() self.heads = heads hidden_dim = dim_head * heads self.to_qkv = nn.Conv2d(dim, hidden_dim * 3, 1, bias=False) self.to_out = nn.Conv2d(hidden_dim, dim, 1) def forward(self, x): b, c, h, w = x.shape qkv = self.to_qkv(x) q, k, v = rearrange( qkv, "b (qkv heads c) h w -> qkv b heads c (h w)", heads=self.heads, qkv=3 ) k = k.softmax(dim=-1) context = torch.einsum("bhdn,bhen->bhde", k, v) out = torch.einsum("bhde,bhdn->bhen", context, q) out = rearrange( out, "b heads c (h w) -> b (heads c) h w", heads=self.heads, h=h, w=w ) return self.to_out(out) class SelfAttention(nn.Module): ATTENTION_MODES = ("xformers", "torch", "math") def __init__( self, dim: int, num_heads: int = 8, qkv_bias: bool = False, qk_scale: Optional[float] = None, attn_drop: float = 0.0, proj_drop: float = 0.0, attn_mode: str = "xformers", ): if attn_mode == "xformers" and XFORMERS_IS_AVAILABLE is False: attn_mode = "torch" logpy.warn("no module 'xformers'. Processing without...") super().__init__() self.num_heads = num_heads head_dim = dim // num_heads self.scale = qk_scale or head_dim**-0.5 self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias) self.attn_drop = nn.Dropout(attn_drop) self.proj = nn.Linear(dim, dim) self.proj_drop = nn.Dropout(proj_drop) assert attn_mode in self.ATTENTION_MODES self.attn_mode = attn_mode def forward(self, x: torch.Tensor) -> torch.Tensor: B, L, C = x.shape qkv = self.qkv(x) if self.attn_mode == "torch": qkv = rearrange(qkv, "B L (K H D) -> K B H L D", K=3, H=self.num_heads).float() q, k, v = qkv[0], qkv[1], qkv[2] # B H L D x = torch.nn.functional.scaled_dot_product_attention(q, k, v) x = rearrange(x, "B H L D -> B L (H D)") elif self.attn_mode == "xformers": qkv = rearrange(qkv, "B L (K H D) -> K B L H D", K=3, H=self.num_heads) q, k, v = qkv[0], qkv[1], qkv[2] # B L H D x = xformers.ops.memory_efficient_attention(q, k, v) x = rearrange(x, "B L H D -> B L (H D)", H=self.num_heads) elif self.attn_mode == "math": qkv = rearrange(qkv, "B L (K H D) -> K B H L D", K=3, H=self.num_heads) q, k, v = qkv[0], qkv[1], qkv[2] # B H L D attn = (q @ k.transpose(-2, -1)) * self.scale attn = attn.softmax(dim=-1) attn = self.attn_drop(attn) x = (attn @ v).transpose(1, 2).reshape(B, L, C) else: raise NotImplemented x = self.proj(x) x = self.proj_drop(x) return x class SpatialSelfAttention(nn.Module): def __init__(self, in_channels): super().__init__() self.in_channels = in_channels self.norm = Normalize(in_channels) self.q = torch.nn.Conv2d( in_channels, in_channels, kernel_size=1, stride=1, padding=0 ) self.k = torch.nn.Conv2d( in_channels, in_channels, kernel_size=1, stride=1, padding=0 ) self.v = torch.nn.Conv2d( in_channels, in_channels, kernel_size=1, stride=1, padding=0 ) self.proj_out = torch.nn.Conv2d( in_channels, in_channels, kernel_size=1, stride=1, padding=0 ) def forward(self, x): h_ = x h_ = self.norm(h_) q = self.q(h_) k = self.k(h_) v = self.v(h_) # compute attention b, c, h, w = q.shape q = rearrange(q, "b c h w -> b (h w) c") k = rearrange(k, "b c h w -> b c (h w)") w_ = torch.einsum("bij,bjk->bik", q, k) w_ = w_ * (int(c) ** (-0.5)) w_ = torch.nn.functional.softmax(w_, dim=2) # attend to values v = rearrange(v, "b c h w -> b c (h w)") w_ = rearrange(w_, "b i j -> b j i") h_ = torch.einsum("bij,bjk->bik", v, w_) h_ = rearrange(h_, "b c (h w) -> b c h w", h=h) h_ = self.proj_out(h_) return x + h_ class CrossAttention(nn.Module): def __init__( self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0.0, backend=None, ): super().__init__() inner_dim = dim_head * heads context_dim = default(context_dim, query_dim) self.scale = dim_head**-0.5 self.heads = heads self.to_q = nn.Linear(query_dim, inner_dim, bias=False) self.to_k = nn.Linear(context_dim, inner_dim, bias=False) self.to_v = nn.Linear(context_dim, inner_dim, bias=False) self.to_out = nn.Sequential( nn.Linear(inner_dim, query_dim), nn.Dropout(dropout) ) self.backend = backend def forward( self, x, context=None, mask=None, additional_tokens=None, n_times_crossframe_attn_in_self=0, ): h = self.heads if additional_tokens is not None: # get the number of masked tokens at the beginning of the output sequence n_tokens_to_mask = additional_tokens.shape[1] # add additional token x = torch.cat([additional_tokens, x], dim=1) q = self.to_q(x) context = default(context, x) k = self.to_k(context) v = self.to_v(context) if n_times_crossframe_attn_in_self: # reprogramming cross-frame attention as in https://arxiv.org/abs/2303.13439 assert x.shape[0] % n_times_crossframe_attn_in_self == 0 n_cp = x.shape[0] // n_times_crossframe_attn_in_self k = repeat( k[::n_times_crossframe_attn_in_self], "b ... -> (b n) ...", n=n_cp ) v = repeat( v[::n_times_crossframe_attn_in_self], "b ... -> (b n) ...", n=n_cp ) q, k, v = map(lambda t: rearrange(t, "b n (h d) -> b h n d", h=h), (q, k, v)) ## old """ sim = einsum('b i d, b j d -> b i j', q, k) * self.scale del q, k if exists(mask): mask = rearrange(mask, 'b ... -> b (...)') max_neg_value = -torch.finfo(sim.dtype).max mask = repeat(mask, 'b j -> (b h) () j', h=h) sim.masked_fill_(~mask, max_neg_value) # attention, what we cannot get enough of sim = sim.softmax(dim=-1) out = einsum('b i j, b j d -> b i d', sim, v) """ ## new with sdp_kernel(**BACKEND_MAP[self.backend]): # print("dispatching into backend", self.backend, "q/k/v shape: ", q.shape, k.shape, v.shape) out = F.scaled_dot_product_attention( q, k, v, attn_mask=mask ) # scale is dim_head ** -0.5 per default del q, k, v out = rearrange(out, "b h n d -> b n (h d)", h=h) if additional_tokens is not None: # remove additional token out = out[:, n_tokens_to_mask:] return self.to_out(out) class MemoryEfficientCrossAttention(nn.Module): # https://github.com/MatthieuTPHR/diffusers/blob/d80b531ff8060ec1ea982b65a1b8df70f73aa67c/src/diffusers/models/attention.py#L223 def __init__( self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0.0, **kwargs ): super().__init__() logpy.debug( f"Setting up {self.__class__.__name__}. Query dim is {query_dim}, " f"context_dim is {context_dim} and using {heads} heads with a " f"dimension of {dim_head}." ) inner_dim = dim_head * heads context_dim = default(context_dim, query_dim) self.heads = heads self.dim_head = dim_head self.to_q = nn.Linear(query_dim, inner_dim, bias=False) self.to_k = nn.Linear(context_dim, inner_dim, bias=False) self.to_v = nn.Linear(context_dim, inner_dim, bias=False) self.to_out = nn.Sequential( nn.Linear(inner_dim, query_dim), nn.Dropout(dropout) ) self.attention_op: Optional[Any] = None def forward( self, x, context=None, mask=None, additional_tokens=None, n_times_crossframe_attn_in_self=0, ): if additional_tokens is not None: # get the number of masked tokens at the beginning of the output sequence n_tokens_to_mask = additional_tokens.shape[1] # add additional token x = torch.cat([additional_tokens, x], dim=1) q = self.to_q(x) context = default(context, x) k = self.to_k(context) v = self.to_v(context) if n_times_crossframe_attn_in_self: # reprogramming cross-frame attention as in https://arxiv.org/abs/2303.13439 assert x.shape[0] % n_times_crossframe_attn_in_self == 0 # n_cp = x.shape[0]//n_times_crossframe_attn_in_self k = repeat( k[::n_times_crossframe_attn_in_self], "b ... -> (b n) ...", n=n_times_crossframe_attn_in_self, ) v = repeat( v[::n_times_crossframe_attn_in_self], "b ... -> (b n) ...", n=n_times_crossframe_attn_in_self, ) b, _, _ = q.shape q, k, v = map( lambda t: t.unsqueeze(3) .reshape(b, t.shape[1], self.heads, self.dim_head) .permute(0, 2, 1, 3) .reshape(b * self.heads, t.shape[1], self.dim_head) .contiguous(), (q, k, v), ) # actually compute the attention, what we cannot get enough of if version.parse(xformers.__version__) >= version.parse("0.0.21"): # NOTE: workaround for # https://github.com/facebookresearch/xformers/issues/845 max_bs = 32768 N = q.shape[0] n_batches = math.ceil(N / max_bs) out = list() for i_batch in range(n_batches): batch = slice(i_batch * max_bs, (i_batch + 1) * max_bs) out.append( xformers.ops.memory_efficient_attention( q[batch], k[batch], v[batch], attn_bias=None, op=self.attention_op, ) ) out = torch.cat(out, 0) else: out = xformers.ops.memory_efficient_attention( q, k, v, attn_bias=None, op=self.attention_op ) # TODO: Use this directly in the attention operation, as a bias if exists(mask): raise NotImplementedError out = ( out.unsqueeze(0) .reshape(b, self.heads, out.shape[1], self.dim_head) .permute(0, 2, 1, 3) .reshape(b, out.shape[1], self.heads * self.dim_head) ) if additional_tokens is not None: # remove additional token out = out[:, n_tokens_to_mask:] return self.to_out(out) class BasicTransformerBlock(nn.Module): ATTENTION_MODES = { "softmax": CrossAttention, # vanilla attention "softmax-xformers": MemoryEfficientCrossAttention, # ampere } def __init__( self, dim, n_heads, d_head, dropout=0.0, context_dim=None, gated_ff=True, checkpoint=True, disable_self_attn=False, attn_mode="softmax", sdp_backend=None, ): super().__init__() assert attn_mode in self.ATTENTION_MODES if attn_mode != "softmax" and not XFORMERS_IS_AVAILABLE: logpy.warn( f"Attention mode '{attn_mode}' is not available. Falling " f"back to native attention. This is not a problem in " f"Pytorch >= 2.0. FYI, you are running with PyTorch " f"version {torch.__version__}." ) attn_mode = "softmax" elif attn_mode == "softmax" and not SDP_IS_AVAILABLE: logpy.warn( "We do not support vanilla attention anymore, as it is too " "expensive. Sorry." ) if not XFORMERS_IS_AVAILABLE: assert ( False ), "Please install xformers via e.g. 'pip install xformers==0.0.16'" else: logpy.info("Falling back to xformers efficient attention.") attn_mode = "softmax-xformers" attn_cls = self.ATTENTION_MODES[attn_mode] if version.parse(torch.__version__) >= version.parse("2.0.0"): assert sdp_backend is None or isinstance(sdp_backend, SDPBackend) else: assert sdp_backend is None self.disable_self_attn = disable_self_attn self.attn1 = attn_cls( query_dim=dim, heads=n_heads, dim_head=d_head, dropout=dropout, context_dim=context_dim if self.disable_self_attn else None, backend=sdp_backend, ) # is a self-attention if not self.disable_self_attn self.ff = FeedForward(dim, dropout=dropout, glu=gated_ff) self.attn2 = attn_cls( query_dim=dim, context_dim=context_dim, heads=n_heads, dim_head=d_head, dropout=dropout, backend=sdp_backend, ) # is self-attn if context is none self.norm1 = nn.LayerNorm(dim) self.norm2 = nn.LayerNorm(dim) self.norm3 = nn.LayerNorm(dim) self.checkpoint = checkpoint if self.checkpoint: logpy.debug(f"{self.__class__.__name__} is using checkpointing") def forward( self, x, context=None, additional_tokens=None, n_times_crossframe_attn_in_self=0 ): kwargs = {"x": x} if context is not None: kwargs.update({"context": context}) if additional_tokens is not None: kwargs.update({"additional_tokens": additional_tokens}) if n_times_crossframe_attn_in_self: kwargs.update( {"n_times_crossframe_attn_in_self": n_times_crossframe_attn_in_self} ) # return mixed_checkpoint(self._forward, kwargs, self.parameters(), self.checkpoint) if self.checkpoint: # inputs = {"x": x, "context": context} return checkpoint(self._forward, x, context) # return checkpoint(self._forward, inputs, self.parameters(), self.checkpoint) else: return self._forward(**kwargs) def _forward( self, x, context=None, additional_tokens=None, n_times_crossframe_attn_in_self=0 ): x = ( self.attn1( self.norm1(x), context=context if self.disable_self_attn else None, additional_tokens=additional_tokens, n_times_crossframe_attn_in_self=n_times_crossframe_attn_in_self if not self.disable_self_attn else 0, ) + x ) x = ( self.attn2( self.norm2(x), context=context, additional_tokens=additional_tokens ) + x ) x = self.ff(self.norm3(x)) + x return x class BasicTransformerSingleLayerBlock(nn.Module): ATTENTION_MODES = { "softmax": CrossAttention, # vanilla attention "softmax-xformers": MemoryEfficientCrossAttention # on the A100s not quite as fast as the above version # (todo might depend on head_dim, check, falls back to semi-optimized kernels for dim!=[16,32,64,128]) } def __init__( self, dim, n_heads, d_head, dropout=0.0, context_dim=None, gated_ff=True, checkpoint=True, attn_mode="softmax", ): super().__init__() assert attn_mode in self.ATTENTION_MODES attn_cls = self.ATTENTION_MODES[attn_mode] self.attn1 = attn_cls( query_dim=dim, heads=n_heads, dim_head=d_head, dropout=dropout, context_dim=context_dim, ) self.ff = FeedForward(dim, dropout=dropout, glu=gated_ff) self.norm1 = nn.LayerNorm(dim) self.norm2 = nn.LayerNorm(dim) self.checkpoint = checkpoint def forward(self, x, context=None): # inputs = {"x": x, "context": context} # return checkpoint(self._forward, inputs, self.parameters(), self.checkpoint) return checkpoint(self._forward, x, context) def _forward(self, x, context=None): x = self.attn1(self.norm1(x), context=context) + x x = self.ff(self.norm2(x)) + x return x class SpatialTransformer(nn.Module): """ Transformer block for image-like data. First, project the input (aka embedding) and reshape to b, t, d. Then apply standard transformer action. Finally, reshape to image NEW: use_linear for more efficiency instead of the 1x1 convs """ def __init__( self, in_channels, n_heads, d_head, depth=1, dropout=0.0, context_dim=None, disable_self_attn=False, use_linear=False, attn_type="softmax", use_checkpoint=True, # sdp_backend=SDPBackend.FLASH_ATTENTION sdp_backend=None, ): super().__init__() logpy.debug( f"constructing {self.__class__.__name__} of depth {depth} w/ " f"{in_channels} channels and {n_heads} heads." ) if exists(context_dim) and not isinstance(context_dim, list): context_dim = [context_dim] if exists(context_dim) and isinstance(context_dim, list): if depth != len(context_dim): logpy.warn( f"{self.__class__.__name__}: Found context dims " f"{context_dim} of depth {len(context_dim)}, which does not " f"match the specified 'depth' of {depth}. Setting context_dim " f"to {depth * [context_dim[0]]} now." ) # depth does not match context dims. assert all( map(lambda x: x == context_dim[0], context_dim) ), "need homogenous context_dim to match depth automatically" context_dim = depth * [context_dim[0]] elif context_dim is None: context_dim = [None] * depth self.in_channels = in_channels inner_dim = n_heads * d_head self.norm = Normalize(in_channels) if not use_linear: self.proj_in = nn.Conv2d( in_channels, inner_dim, kernel_size=1, stride=1, padding=0 ) else: self.proj_in = nn.Linear(in_channels, inner_dim) self.transformer_blocks = nn.ModuleList( [ BasicTransformerBlock( inner_dim, n_heads, d_head, dropout=dropout, context_dim=context_dim[d], disable_self_attn=disable_self_attn, attn_mode=attn_type, checkpoint=use_checkpoint, sdp_backend=sdp_backend, ) for d in range(depth) ] ) if not use_linear: self.proj_out = zero_module( nn.Conv2d(inner_dim, in_channels, kernel_size=1, stride=1, padding=0) ) else: # self.proj_out = zero_module(nn.Linear(in_channels, inner_dim)) self.proj_out = zero_module(nn.Linear(inner_dim, in_channels)) self.use_linear = use_linear def forward(self, x, context=None): # note: if no context is given, cross-attention defaults to self-attention if not isinstance(context, list): context = [context] b, c, h, w = x.shape x_in = x x = self.norm(x) if not self.use_linear: x = self.proj_in(x) x = rearrange(x, "b c h w -> b (h w) c").contiguous() if self.use_linear: x = self.proj_in(x) for i, block in enumerate(self.transformer_blocks): if i > 0 and len(context) == 1: i = 0 # use same context for each block x = block(x, context=context[i]) if self.use_linear: x = self.proj_out(x) x = rearrange(x, "b (h w) c -> b c h w", h=h, w=w).contiguous() if not self.use_linear: x = self.proj_out(x) return x + x_in class SimpleTransformer(nn.Module): def __init__( self, dim: int, depth: int, heads: int, dim_head: int, context_dim: Optional[int] = None, dropout: float = 0.0, checkpoint: bool = True, ): super().__init__() self.layers = nn.ModuleList([]) for _ in range(depth): self.layers.append( BasicTransformerBlock( dim, heads, dim_head, dropout=dropout, context_dim=context_dim, attn_mode="softmax-xformers", checkpoint=checkpoint, ) ) def forward( self, x: torch.Tensor, context: Optional[torch.Tensor] = None, ) -> torch.Tensor: for layer in self.layers: x = layer(x, context) return x ================================================ FILE: ToonCrafter/lvdm/modules/encoders/condition.py ================================================ import torch import torch.nn as nn import kornia import open_clip import os from torch.utils.checkpoint import checkpoint from transformers import T5Tokenizer, T5EncoderModel, CLIPTokenizer, CLIPTextModel from lvdm.common import autocast from ToonCrafter.utils.utils import count_params class AbstractEncoder(nn.Module): def __init__(self): super().__init__() def encode(self, *args, **kwargs): raise NotImplementedError class IdentityEncoder(AbstractEncoder): def encode(self, x): return x class ClassEmbedder(nn.Module): def __init__(self, embed_dim, n_classes=1000, key='class', ucg_rate=0.1): super().__init__() self.key = key self.embedding = nn.Embedding(n_classes, embed_dim) self.n_classes = n_classes self.ucg_rate = ucg_rate def forward(self, batch, key=None, disable_dropout=False): if key is None: key = self.key # this is for use in crossattn c = batch[key][:, None] if self.ucg_rate > 0. and not disable_dropout: mask = 1. - torch.bernoulli(torch.ones_like(c) * self.ucg_rate) c = mask * c + (1 - mask) * torch.ones_like(c) * (self.n_classes - 1) c = c.long() c = self.embedding(c) return c def get_unconditional_conditioning(self, bs, device="cuda"): uc_class = self.n_classes - 1 # 1000 classes --> 0 ... 999, one extra class for ucg (class 1000) uc = torch.ones((bs,), device=device) * uc_class uc = {self.key: uc} return uc def disabled_train(self, mode=True): """Overwrite model.train with this function to make sure train/eval mode does not change anymore.""" return self def get_available_devices(): devices = [] if torch.cuda.is_available(): devices.append("cuda") elif torch.backends.mps.is_available(): devices.append("mps") devices.append(torch.device("cpu")) return devices def get_device(device): devices = get_available_devices() if device in devices: return device return devices[0] class FrozenT5Embedder(AbstractEncoder): """Uses the T5 transformer encoder for text""" def __init__(self, version="google/t5-v1_1-large", device="cuda", max_length=77, freeze=True): # others are google/t5-v1_1-xl and google/t5-v1_1-xxl super().__init__() self.tokenizer = T5Tokenizer.from_pretrained(version) self.transformer = T5EncoderModel.from_pretrained(version) self.device = get_device(device) self.max_length = max_length # TODO: typical value? if freeze: self.freeze() def freeze(self): self.transformer = self.transformer.eval() # self.train = disabled_train for param in self.parameters(): param.requires_grad = False def forward(self, text): batch_encoding = self.tokenizer(text, truncation=True, max_length=self.max_length, return_length=True, return_overflowing_tokens=False, padding="max_length", return_tensors="pt") tokens = batch_encoding["input_ids"].to(self.device) outputs = self.transformer(input_ids=tokens) z = outputs.last_hidden_state return z def encode(self, text): return self(text) class FrozenCLIPEmbedder(AbstractEncoder): """Uses the CLIP transformer encoder for text (from huggingface)""" LAYERS = [ "last", "pooled", "hidden" ] def __init__(self, version="openai/clip-vit-large-patch14", device="cuda", max_length=77, freeze=True, layer="last", layer_idx=None): # clip-vit-base-patch32 super().__init__() assert layer in self.LAYERS self.tokenizer = CLIPTokenizer.from_pretrained(version) self.transformer = CLIPTextModel.from_pretrained(version) self.device = get_device(device) self.max_length = max_length if freeze: self.freeze() self.layer = layer self.layer_idx = layer_idx if layer == "hidden": assert layer_idx is not None assert 0 <= abs(layer_idx) <= 12 def freeze(self): self.transformer = self.transformer.eval() # self.train = disabled_train for param in self.parameters(): param.requires_grad = False def forward(self, text): batch_encoding = self.tokenizer(text, truncation=True, max_length=self.max_length, return_length=True, return_overflowing_tokens=False, padding="max_length", return_tensors="pt") tokens = batch_encoding["input_ids"].to(self.device) outputs = self.transformer(input_ids=tokens, output_hidden_states=self.layer == "hidden") if self.layer == "last": z = outputs.last_hidden_state elif self.layer == "pooled": z = outputs.pooler_output[:, None, :] else: z = outputs.hidden_states[self.layer_idx] return z def encode(self, text): return self(text) class ClipImageEmbedder(nn.Module): def __init__( self, model, jit=False, device='cuda' if torch.cuda.is_available() else 'cpu', antialias=True, ucg_rate=0. ): super().__init__() from clip import load as load_clip self.model, _ = load_clip(name=model, device=device, jit=jit) self.antialias = antialias self.register_buffer('mean', torch.Tensor([0.48145466, 0.4578275, 0.40821073]), persistent=False) self.register_buffer('std', torch.Tensor([0.26862954, 0.26130258, 0.27577711]), persistent=False) self.ucg_rate = ucg_rate def preprocess(self, x): # normalize to [0,1] x = kornia.geometry.resize(x, (224, 224), interpolation='bicubic', align_corners=True, antialias=self.antialias) x = (x + 1.) / 2. # re-normalize according to clip x = kornia.enhance.normalize(x, self.mean, self.std) return x def forward(self, x, no_dropout=False): # x is assumed to be in range [-1,1] out = self.model.encode_image(self.preprocess(x)) out = out.to(x.dtype) if self.ucg_rate > 0. and not no_dropout: out = torch.bernoulli((1. - self.ucg_rate) * torch.ones(out.shape[0], device=out.device))[:, None] * out return out class FrozenOpenCLIPEmbedder(AbstractEncoder): """ Uses the OpenCLIP transformer encoder for text """ LAYERS = [ # "pooled", "last", "penultimate" ] def __init__(self, arch="ViT-H-14", version="laion2b_s32b_b79k", device="cuda", max_length=77, freeze=True, layer="last"): super().__init__() assert layer in self.LAYERS version = os.environ.get("USER_DEF_CLIP", version) model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'), pretrained=version) del model.visual self.model = model self.device = get_device(device) self.max_length = max_length if freeze: self.freeze() self.layer = layer if self.layer == "last": self.layer_idx = 0 elif self.layer == "penultimate": self.layer_idx = 1 else: raise NotImplementedError() def freeze(self): self.model = self.model.eval() for param in self.parameters(): param.requires_grad = False def forward(self, text): tokens = open_clip.tokenize(text) # all clip models use 77 as context length z = self.encode_with_transformer(tokens.to(self.device)) return z def encode_with_transformer(self, text): x = self.model.token_embedding(text) # [batch_size, n_ctx, d_model] x = x + self.model.positional_embedding x = x.permute(1, 0, 2) # NLD -> LND x = self.text_transformer_forward(x, attn_mask=self.model.attn_mask) x = x.permute(1, 0, 2) # LND -> NLD x = self.model.ln_final(x) return x def text_transformer_forward(self, x: torch.Tensor, attn_mask=None): for i, r in enumerate(self.model.transformer.resblocks): if i == len(self.model.transformer.resblocks) - self.layer_idx: break if self.model.transformer.grad_checkpointing and not torch.jit.is_scripting(): x = checkpoint(r, x, attn_mask) else: x = r(x, attn_mask=attn_mask) return x def encode(self, text): return self(text) class FrozenOpenCLIPImageEmbedder(AbstractEncoder): """ Uses the OpenCLIP vision transformer encoder for images """ def __init__(self, arch="ViT-H-14", version="laion2b_s32b_b79k", device="cuda", max_length=77, freeze=True, layer="pooled", antialias=True, ucg_rate=0.): super().__init__() version = os.environ.get("USER_DEF_CLIP", version) model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'), pretrained=version, ) del model.transformer self.model = model # self.mapper = torch.nn.Linear(1280, 1024) self.device = get_device(device) self.max_length = max_length if freeze: self.freeze() self.layer = layer if self.layer == "penultimate": raise NotImplementedError() self.layer_idx = 1 self.antialias = antialias self.register_buffer('mean', torch.Tensor([0.48145466, 0.4578275, 0.40821073]), persistent=False) self.register_buffer('std', torch.Tensor([0.26862954, 0.26130258, 0.27577711]), persistent=False) self.ucg_rate = ucg_rate def preprocess(self, x): # normalize to [0,1] x = kornia.geometry.resize(x, (224, 224), interpolation='bicubic', align_corners=True, antialias=self.antialias) x = (x + 1.) / 2. # renormalize according to clip x = kornia.enhance.normalize(x, self.mean, self.std) return x def freeze(self): self.model = self.model.eval() for param in self.model.parameters(): param.requires_grad = False @autocast def forward(self, image, no_dropout=False): z = self.encode_with_vision_transformer(image) if self.ucg_rate > 0. and not no_dropout: z = torch.bernoulli((1. - self.ucg_rate) * torch.ones(z.shape[0], device=z.device))[:, None] * z return z def encode_with_vision_transformer(self, img): img = self.preprocess(img) x = self.model.visual(img) return x def encode(self, text): return self(text) class FrozenOpenCLIPImageEmbedderV2(AbstractEncoder): """ Uses the OpenCLIP vision transformer encoder for images """ def __init__(self, arch="ViT-H-14", version="laion2b_s32b_b79k", device="cuda", freeze=True, layer="pooled", antialias=True): super().__init__() version = os.environ.get("USER_DEF_CLIP", version) model, _, _ = open_clip.create_model_and_transforms(arch, device=torch.device('cpu'), pretrained=version, ) del model.transformer self.model = model self.device = get_device(device) if freeze: self.freeze() self.layer = layer if self.layer == "penultimate": raise NotImplementedError() self.layer_idx = 1 self.antialias = antialias self.register_buffer('mean', torch.Tensor([0.48145466, 0.4578275, 0.40821073]), persistent=False) self.register_buffer('std', torch.Tensor([0.26862954, 0.26130258, 0.27577711]), persistent=False) def preprocess(self, x): # normalize to [0,1] x = kornia.geometry.resize(x, (224, 224), interpolation='bicubic', align_corners=True, antialias=self.antialias) x = (x + 1.) / 2. # renormalize according to clip x = kornia.enhance.normalize(x, self.mean, self.std) return x def freeze(self): self.model = self.model.eval() for param in self.model.parameters(): param.requires_grad = False def forward(self, image, no_dropout=False): # image: b c h w z = self.encode_with_vision_transformer(image) return z def encode_with_vision_transformer(self, x): x = self.preprocess(x) # to patches - whether to use dual patchnorm - https://arxiv.org/abs/2302.01327v1 # if self.model.visual.input_patchnorm: # # einops - rearrange(x, 'b c (h p1) (w p2) -> b (h w) (c p1 p2)') # x = x.reshape(x.shape[0], x.shape[1], self.model.visual.grid_size[0], self.model.visual.patch_size[0], self.model.visual.grid_size[1], self.model.visual.patch_size[1]) # x = x.permute(0, 2, 4, 1, 3, 5) # x = x.reshape(x.shape[0], self.model.visual.grid_size[0] * self.model.visual.grid_size[1], -1) # x = self.model.visual.patchnorm_pre_ln(x) # x = self.model.visual.conv1(x) # else: x = self.model.visual.conv1(x) # shape = [*, width, grid, grid] x = x.reshape(x.shape[0], x.shape[1], -1) # shape = [*, width, grid ** 2] x = x.permute(0, 2, 1) # shape = [*, grid ** 2, width] # class embeddings and positional embeddings x = torch.cat( [self.model.visual.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1) # shape = [*, grid ** 2 + 1, width] x = x + self.model.visual.positional_embedding.to(x.dtype) # a patch_dropout of 0. would mean it is disabled and this function would do nothing but return what was passed in x = self.model.visual.patch_dropout(x) x = self.model.visual.ln_pre(x) x = x.permute(1, 0, 2) # NLD -> LND x = self.model.visual.transformer(x) x = x.permute(1, 0, 2) # LND -> NLD return x class FrozenCLIPT5Encoder(AbstractEncoder): def __init__(self, clip_version="openai/clip-vit-large-patch14", t5_version="google/t5-v1_1-xl", device="cuda", clip_max_length=77, t5_max_length=77): super().__init__() self.clip_encoder = FrozenCLIPEmbedder(clip_version, device, max_length=clip_max_length) self.t5_encoder = FrozenT5Embedder(t5_version, device, max_length=t5_max_length) print(f"{self.clip_encoder.__class__.__name__} has {count_params(self.clip_encoder) * 1.e-6:.2f} M parameters, " f"{self.t5_encoder.__class__.__name__} comes with {count_params(self.t5_encoder) * 1.e-6:.2f} M params.") def encode(self, text): return self(text) def forward(self, text): clip_z = self.clip_encoder.encode(text) t5_z = self.t5_encoder.encode(text) return [clip_z, t5_z] ================================================ FILE: ToonCrafter/lvdm/modules/encoders/resampler.py ================================================ # modified from https://github.com/mlfoundations/open_flamingo/blob/main/open_flamingo/src/helpers.py # and https://github.com/lucidrains/imagen-pytorch/blob/main/imagen_pytorch/imagen_pytorch.py # and https://github.com/tencent-ailab/IP-Adapter/blob/main/ip_adapter/resampler.py import math import torch import torch.nn as nn class ImageProjModel(nn.Module): """Projection Model""" def __init__(self, cross_attention_dim=1024, clip_embeddings_dim=1024, clip_extra_context_tokens=4): super().__init__() self.cross_attention_dim = cross_attention_dim self.clip_extra_context_tokens = clip_extra_context_tokens self.proj = nn.Linear(clip_embeddings_dim, self.clip_extra_context_tokens * cross_attention_dim) self.norm = nn.LayerNorm(cross_attention_dim) def forward(self, image_embeds): #embeds = image_embeds embeds = image_embeds.type(list(self.proj.parameters())[0].dtype) clip_extra_context_tokens = self.proj(embeds).reshape(-1, self.clip_extra_context_tokens, self.cross_attention_dim) clip_extra_context_tokens = self.norm(clip_extra_context_tokens) return clip_extra_context_tokens # FFN def FeedForward(dim, mult=4): inner_dim = int(dim * mult) return nn.Sequential( nn.LayerNorm(dim), nn.Linear(dim, inner_dim, bias=False), nn.GELU(), nn.Linear(inner_dim, dim, bias=False), ) def reshape_tensor(x, heads): bs, length, width = x.shape #(bs, length, width) --> (bs, length, n_heads, dim_per_head) x = x.view(bs, length, heads, -1) # (bs, length, n_heads, dim_per_head) --> (bs, n_heads, length, dim_per_head) x = x.transpose(1, 2) # (bs, n_heads, length, dim_per_head) --> (bs*n_heads, length, dim_per_head) x = x.reshape(bs, heads, length, -1) return x class PerceiverAttention(nn.Module): def __init__(self, *, dim, dim_head=64, heads=8): super().__init__() self.scale = dim_head**-0.5 self.dim_head = dim_head self.heads = heads inner_dim = dim_head * heads self.norm1 = nn.LayerNorm(dim) self.norm2 = nn.LayerNorm(dim) self.to_q = nn.Linear(dim, inner_dim, bias=False) self.to_kv = nn.Linear(dim, inner_dim * 2, bias=False) self.to_out = nn.Linear(inner_dim, dim, bias=False) def forward(self, x, latents): """ Args: x (torch.Tensor): image features shape (b, n1, D) latent (torch.Tensor): latent features shape (b, n2, D) """ x = self.norm1(x) latents = self.norm2(latents) b, l, _ = latents.shape q = self.to_q(latents) kv_input = torch.cat((x, latents), dim=-2) k, v = self.to_kv(kv_input).chunk(2, dim=-1) q = reshape_tensor(q, self.heads) k = reshape_tensor(k, self.heads) v = reshape_tensor(v, self.heads) # attention scale = 1 / math.sqrt(math.sqrt(self.dim_head)) weight = (q * scale) @ (k * scale).transpose(-2, -1) # More stable with f16 than dividing afterwards weight = torch.softmax(weight.float(), dim=-1).type(weight.dtype) out = weight @ v out = out.permute(0, 2, 1, 3).reshape(b, l, -1) return self.to_out(out) class Resampler(nn.Module): def __init__( self, dim=1024, depth=8, dim_head=64, heads=16, num_queries=8, embedding_dim=768, output_dim=1024, ff_mult=4, video_length=None, # using frame-wise version or not ): super().__init__() ## queries for a single frame / image self.num_queries = num_queries self.video_length = video_length ## queries for each frame if video_length is not None: num_queries = num_queries * video_length self.latents = nn.Parameter(torch.randn(1, num_queries, dim) / dim**0.5) self.proj_in = nn.Linear(embedding_dim, dim) self.proj_out = nn.Linear(dim, output_dim) self.norm_out = nn.LayerNorm(output_dim) self.layers = nn.ModuleList([]) for _ in range(depth): self.layers.append( nn.ModuleList( [ PerceiverAttention(dim=dim, dim_head=dim_head, heads=heads), FeedForward(dim=dim, mult=ff_mult), ] ) ) def forward(self, x): latents = self.latents.repeat(x.size(0), 1, 1) ## B (T L) C x = self.proj_in(x) for attn, ff in self.layers: latents = attn(x, latents) + latents latents = ff(latents) + latents latents = self.proj_out(latents) latents = self.norm_out(latents) # B L C or B (T L) C return latents ================================================ FILE: ToonCrafter/lvdm/modules/networks/ae_modules.py ================================================ # pytorch_diffusion + derived encoder decoder import math import torch import numpy as np import torch.nn as nn from einops import rearrange from ToonCrafter.utils.utils import instantiate_from_config from lvdm.modules.attention import LinearAttention def nonlinearity(x): # swish return x * torch.sigmoid(x) def Normalize(in_channels, num_groups=32): return torch.nn.GroupNorm(num_groups=num_groups, num_channels=in_channels, eps=1e-6, affine=True) class LinAttnBlock(LinearAttention): """to match AttnBlock usage""" def __init__(self, in_channels): super().__init__(dim=in_channels, heads=1, dim_head=in_channels) class AttnBlock(nn.Module): def __init__(self, in_channels): super().__init__() self.in_channels = in_channels self.norm = Normalize(in_channels) self.q = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) self.k = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) self.v = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) self.proj_out = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0) def forward(self, x): h_ = x h_ = self.norm(h_) q = self.q(h_) k = self.k(h_) v = self.v(h_) # compute attention b, c, h, w = q.shape q = q.reshape(b, c, h * w) # bcl q = q.permute(0, 2, 1) # bcl -> blc l=hw k = k.reshape(b, c, h * w) # bcl w_ = torch.bmm(q, k) # b,hw,hw w[b,i,j]=sum_c q[b,i,c]k[b,c,j] w_ = w_ * (int(c)**(-0.5)) w_ = torch.nn.functional.softmax(w_, dim=2) # attend to values v = v.reshape(b, c, h * w) w_ = w_.permute(0, 2, 1) # b,hw,hw (first hw of k, second of q) h_ = torch.bmm(v, w_) # b, c,hw (hw of q) h_[b,c,j] = sum_i v[b,c,i] w_[b,i,j] h_ = h_.reshape(b, c, h, w) h_ = self.proj_out(h_) return x + h_ def make_attn(in_channels, attn_type="vanilla"): assert attn_type in ["vanilla", "linear", "none"], f'attn_type {attn_type} unknown' # print(f"making attention of type '{attn_type}' with {in_channels} in_channels") if attn_type == "vanilla": return AttnBlock(in_channels) elif attn_type == "none": return nn.Identity(in_channels) else: return LinAttnBlock(in_channels) class Downsample(nn.Module): def __init__(self, in_channels, with_conv): super().__init__() self.with_conv = with_conv self.in_channels = in_channels if self.with_conv: # no asymmetric padding in torch conv, must do it ourselves self.conv = torch.nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=2, padding=0) def forward(self, x): if self.with_conv: pad = (0, 1, 0, 1) x = torch.nn.functional.pad(x, pad, mode="constant", value=0) x = self.conv(x) else: x = torch.nn.functional.avg_pool2d(x, kernel_size=2, stride=2) return x class Upsample(nn.Module): def __init__(self, in_channels, with_conv): super().__init__() self.with_conv = with_conv self.in_channels = in_channels if self.with_conv: self.conv = torch.nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=1, padding=1) def forward(self, x): x = torch.nn.functional.interpolate(x, scale_factor=2.0, mode="nearest") if self.with_conv: x = self.conv(x) return x def get_timestep_embedding(timesteps, embedding_dim): """ This matches the implementation in Denoising Diffusion Probabilistic Models: From Fairseq. Build sinusoidal embeddings. This matches the implementation in tensor2tensor, but differs slightly from the description in Section 3.5 of "Attention Is All You Need". """ assert len(timesteps.shape) == 1 half_dim = embedding_dim // 2 emb = math.log(10000) / (half_dim - 1) emb = torch.exp(torch.arange(half_dim, dtype=torch.float32) * -emb) emb = emb.to(device=timesteps.device) emb = timesteps.float()[:, None] * emb[None, :] emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1) if embedding_dim % 2 == 1: # zero pad emb = torch.nn.functional.pad(emb, (0, 1, 0, 0)) return emb class ResnetBlock(nn.Module): def __init__(self, *, in_channels, out_channels=None, conv_shortcut=False, dropout, temb_channels=512): super().__init__() self.in_channels = in_channels out_channels = in_channels if out_channels is None else out_channels self.out_channels = out_channels self.use_conv_shortcut = conv_shortcut self.norm1 = Normalize(in_channels) self.conv1 = torch.nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1) if temb_channels > 0: self.temb_proj = torch.nn.Linear(temb_channels, out_channels) self.norm2 = Normalize(out_channels) self.dropout = torch.nn.Dropout(dropout) self.conv2 = torch.nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1) if self.in_channels != self.out_channels: if self.use_conv_shortcut: self.conv_shortcut = torch.nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1) else: self.nin_shortcut = torch.nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, padding=0) def forward(self, x, temb): h = x h = self.norm1(h) h = nonlinearity(h) h = self.conv1(h) if temb is not None: h = h + self.temb_proj(nonlinearity(temb))[:, :, None, None] h = self.norm2(h) h = nonlinearity(h) h = self.dropout(h) h = self.conv2(h) if self.in_channels != self.out_channels: if self.use_conv_shortcut: x = self.conv_shortcut(x) else: x = self.nin_shortcut(x) return x + h class Model(nn.Module): def __init__(self, *, ch, out_ch, ch_mult=(1, 2, 4, 8), num_res_blocks, attn_resolutions, dropout=0.0, resamp_with_conv=True, in_channels, resolution, use_timestep=True, use_linear_attn=False, attn_type="vanilla"): super().__init__() if use_linear_attn: attn_type = "linear" self.ch = ch self.temb_ch = self.ch * 4 self.num_resolutions = len(ch_mult) self.num_res_blocks = num_res_blocks self.resolution = resolution self.in_channels = in_channels self.use_timestep = use_timestep if self.use_timestep: # timestep embedding self.temb = nn.Module() self.temb.dense = nn.ModuleList([ torch.nn.Linear(self.ch, self.temb_ch), torch.nn.Linear(self.temb_ch, self.temb_ch), ]) # downsampling self.conv_in = torch.nn.Conv2d(in_channels, self.ch, kernel_size=3, stride=1, padding=1) curr_res = resolution in_ch_mult = (1,) + tuple(ch_mult) self.down = nn.ModuleList() for i_level in range(self.num_resolutions): block = nn.ModuleList() attn = nn.ModuleList() block_in = ch * in_ch_mult[i_level] block_out = ch * ch_mult[i_level] for i_block in range(self.num_res_blocks): block.append(ResnetBlock(in_channels=block_in, out_channels=block_out, temb_channels=self.temb_ch, dropout=dropout)) block_in = block_out if curr_res in attn_resolutions: attn.append(make_attn(block_in, attn_type=attn_type)) down = nn.Module() down.block = block down.attn = attn if i_level != self.num_resolutions - 1: down.downsample = Downsample(block_in, resamp_with_conv) curr_res = curr_res // 2 self.down.append(down) # middle self.mid = nn.Module() self.mid.block_1 = ResnetBlock(in_channels=block_in, out_channels=block_in, temb_channels=self.temb_ch, dropout=dropout) self.mid.attn_1 = make_attn(block_in, attn_type=attn_type) self.mid.block_2 = ResnetBlock(in_channels=block_in, out_channels=block_in, temb_channels=self.temb_ch, dropout=dropout) # upsampling self.up = nn.ModuleList() for i_level in reversed(range(self.num_resolutions)): block = nn.ModuleList() attn = nn.ModuleList() block_out = ch * ch_mult[i_level] skip_in = ch * ch_mult[i_level] for i_block in range(self.num_res_blocks + 1): if i_block == self.num_res_blocks: skip_in = ch * in_ch_mult[i_level] block.append(ResnetBlock(in_channels=block_in + skip_in, out_channels=block_out, temb_channels=self.temb_ch, dropout=dropout)) block_in = block_out if curr_res in attn_resolutions: attn.append(make_attn(block_in, attn_type=attn_type)) up = nn.Module() up.block = block up.attn = attn if i_level != 0: up.upsample = Upsample(block_in, resamp_with_conv) curr_res = curr_res * 2 self.up.insert(0, up) # prepend to get consistent order # end self.norm_out = Normalize(block_in) self.conv_out = torch.nn.Conv2d(block_in, out_ch, kernel_size=3, stride=1, padding=1) def forward(self, x, t=None, context=None): # assert x.shape[2] == x.shape[3] == self.resolution if context is not None: # assume aligned context, cat along channel axis x = torch.cat((x, context), dim=1) if self.use_timestep: # timestep embedding assert t is not None temb = get_timestep_embedding(t, self.ch) temb = self.temb.dense[0](temb) temb = nonlinearity(temb) temb = self.temb.dense[1](temb) else: temb = None # downsampling hs = [self.conv_in(x)] for i_level in range(self.num_resolutions): for i_block in range(self.num_res_blocks): h = self.down[i_level].block[i_block](hs[-1], temb) if len(self.down[i_level].attn) > 0: h = self.down[i_level].attn[i_block](h) hs.append(h) if i_level != self.num_resolutions - 1: hs.append(self.down[i_level].downsample(hs[-1])) # middle h = hs[-1] h = self.mid.block_1(h, temb) h = self.mid.attn_1(h) h = self.mid.block_2(h, temb) # upsampling for i_level in reversed(range(self.num_resolutions)): for i_block in range(self.num_res_blocks + 1): h = self.up[i_level].block[i_block]( torch.cat([h, hs.pop()], dim=1), temb) if len(self.up[i_level].attn) > 0: h = self.up[i_level].attn[i_block](h) if i_level != 0: h = self.up[i_level].upsample(h) # end h = self.norm_out(h) h = nonlinearity(h) h = self.conv_out(h) return h def get_last_layer(self): return self.conv_out.weight class Encoder(nn.Module): def __init__(self, *, ch, out_ch, ch_mult=(1, 2, 4, 8), num_res_blocks, attn_resolutions, dropout=0.0, resamp_with_conv=True, in_channels, resolution, z_channels, double_z=True, use_linear_attn=False, attn_type="vanilla", **ignore_kwargs): super().__init__() if use_linear_attn: attn_type = "linear" self.ch = ch self.temb_ch = 0 self.num_resolutions = len(ch_mult) self.num_res_blocks = num_res_blocks self.resolution = resolution self.in_channels = in_channels # downsampling self.conv_in = torch.nn.Conv2d(in_channels, self.ch, kernel_size=3, stride=1, padding=1) curr_res = resolution in_ch_mult = (1,) + tuple(ch_mult) self.in_ch_mult = in_ch_mult self.down = nn.ModuleList() for i_level in range(self.num_resolutions): block = nn.ModuleList() attn = nn.ModuleList() block_in = ch * in_ch_mult[i_level] block_out = ch * ch_mult[i_level] for i_block in range(self.num_res_blocks): block.append(ResnetBlock(in_channels=block_in, out_channels=block_out, temb_channels=self.temb_ch, dropout=dropout)) block_in = block_out if curr_res in attn_resolutions: attn.append(make_attn(block_in, attn_type=attn_type)) down = nn.Module() down.block = block down.attn = attn if i_level != self.num_resolutions - 1: down.downsample = Downsample(block_in, resamp_with_conv) curr_res = curr_res // 2 self.down.append(down) # middle self.mid = nn.Module() self.mid.block_1 = ResnetBlock(in_channels=block_in, out_channels=block_in, temb_channels=self.temb_ch, dropout=dropout) self.mid.attn_1 = make_attn(block_in, attn_type=attn_type) self.mid.block_2 = ResnetBlock(in_channels=block_in, out_channels=block_in, temb_channels=self.temb_ch, dropout=dropout) # end self.norm_out = Normalize(block_in) self.conv_out = torch.nn.Conv2d(block_in, 2 * z_channels if double_z else z_channels, kernel_size=3, stride=1, padding=1) def forward(self, x, return_hidden_states=False): # timestep embedding temb = None # print(f'encoder-input={x.shape}') # downsampling hs = [self.conv_in(x)] # if we return hidden states for decoder usage, we will store them in a list if return_hidden_states: hidden_states = [] # print(f'encoder-conv in feat={hs[0].shape}') for i_level in range(self.num_resolutions): for i_block in range(self.num_res_blocks): h = self.down[i_level].block[i_block](hs[-1], temb) # print(f'encoder-down feat={h.shape}') if len(self.down[i_level].attn) > 0: h = self.down[i_level].attn[i_block](h) hs.append(h) if return_hidden_states: hidden_states.append(h) if i_level != self.num_resolutions - 1: # print(f'encoder-downsample (input)={hs[-1].shape}') hs.append(self.down[i_level].downsample(hs[-1])) # print(f'encoder-downsample (output)={hs[-1].shape}') if return_hidden_states: hidden_states.append(hs[0]) # middle h = hs[-1] h = self.mid.block_1(h, temb) # print(f'encoder-mid1 feat={h.shape}') h = self.mid.attn_1(h) h = self.mid.block_2(h, temb) # print(f'encoder-mid2 feat={h.shape}') # end h = self.norm_out(h) h = nonlinearity(h) h = self.conv_out(h) # print(f'end feat={h.shape}') if return_hidden_states: return h, hidden_states else: return h class Decoder(nn.Module): def __init__(self, *, ch, out_ch, ch_mult=(1, 2, 4, 8), num_res_blocks, attn_resolutions, dropout=0.0, resamp_with_conv=True, in_channels, resolution, z_channels, give_pre_end=False, tanh_out=False, use_linear_attn=False, attn_type="vanilla", **ignorekwargs): super().__init__() if use_linear_attn: attn_type = "linear" self.ch = ch self.temb_ch = 0 self.num_resolutions = len(ch_mult) self.num_res_blocks = num_res_blocks self.resolution = resolution self.in_channels = in_channels self.give_pre_end = give_pre_end self.tanh_out = tanh_out # compute in_ch_mult, block_in and curr_res at lowest res in_ch_mult = (1,) + tuple(ch_mult) block_in = ch * ch_mult[self.num_resolutions - 1] curr_res = resolution // 2**(self.num_resolutions - 1) self.z_shape = (1, z_channels, curr_res, curr_res) print("AE working on z of shape {} = {} dimensions.".format( self.z_shape, np.prod(self.z_shape))) # z to block_in self.conv_in = torch.nn.Conv2d(z_channels, block_in, kernel_size=3, stride=1, padding=1) # middle self.mid = nn.Module() self.mid.block_1 = ResnetBlock(in_channels=block_in, out_channels=block_in, temb_channels=self.temb_ch, dropout=dropout) self.mid.attn_1 = make_attn(block_in, attn_type=attn_type) self.mid.block_2 = ResnetBlock(in_channels=block_in, out_channels=block_in, temb_channels=self.temb_ch, dropout=dropout) # upsampling self.up = nn.ModuleList() for i_level in reversed(range(self.num_resolutions)): block = nn.ModuleList() attn = nn.ModuleList() block_out = ch * ch_mult[i_level] for i_block in range(self.num_res_blocks + 1): block.append(ResnetBlock(in_channels=block_in, out_channels=block_out, temb_channels=self.temb_ch, dropout=dropout)) block_in = block_out if curr_res in attn_resolutions: attn.append(make_attn(block_in, attn_type=attn_type)) up = nn.Module() up.block = block up.attn = attn if i_level != 0: up.upsample = Upsample(block_in, resamp_with_conv) curr_res = curr_res * 2 self.up.insert(0, up) # prepend to get consistent order # end self.norm_out = Normalize(block_in) self.conv_out = torch.nn.Conv2d(block_in, out_ch, kernel_size=3, stride=1, padding=1) def forward(self, z): # assert z.shape[1:] == self.z_shape[1:] self.last_z_shape = z.shape # print(f'decoder-input={z.shape}') # timestep embedding temb = None # z to block_in h = self.conv_in(z) # print(f'decoder-conv in feat={h.shape}') # middle h = self.mid.block_1(h, temb) h = self.mid.attn_1(h) h = self.mid.block_2(h, temb) # print(f'decoder-mid feat={h.shape}') # upsampling for i_level in reversed(range(self.num_resolutions)): for i_block in range(self.num_res_blocks + 1): h = self.up[i_level].block[i_block](h, temb) if len(self.up[i_level].attn) > 0: h = self.up[i_level].attn[i_block](h) # print(f'decoder-up feat={h.shape}') if i_level != 0: h = self.up[i_level].upsample(h) # print(f'decoder-upsample feat={h.shape}') # end if self.give_pre_end: return h h = self.norm_out(h) h = nonlinearity(h) h = self.conv_out(h) # print(f'decoder-conv_out feat={h.shape}') if self.tanh_out: h = torch.tanh(h) return h class SimpleDecoder(nn.Module): def __init__(self, in_channels, out_channels, *args, **kwargs): super().__init__() self.model = nn.ModuleList([nn.Conv2d(in_channels, in_channels, 1), ResnetBlock(in_channels=in_channels, out_channels=2 * in_channels, temb_channels=0, dropout=0.0), ResnetBlock(in_channels=2 * in_channels, out_channels=4 * in_channels, temb_channels=0, dropout=0.0), ResnetBlock(in_channels=4 * in_channels, out_channels=2 * in_channels, temb_channels=0, dropout=0.0), nn.Conv2d(2 * in_channels, in_channels, 1), Upsample(in_channels, with_conv=True)]) # end self.norm_out = Normalize(in_channels) self.conv_out = torch.nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1) def forward(self, x): for i, layer in enumerate(self.model): if i in [1, 2, 3]: x = layer(x, None) else: x = layer(x) h = self.norm_out(x) h = nonlinearity(h) x = self.conv_out(h) return x class UpsampleDecoder(nn.Module): def __init__(self, in_channels, out_channels, ch, num_res_blocks, resolution, ch_mult=(2, 2), dropout=0.0): super().__init__() # upsampling self.temb_ch = 0 self.num_resolutions = len(ch_mult) self.num_res_blocks = num_res_blocks block_in = in_channels curr_res = resolution // 2 ** (self.num_resolutions - 1) self.res_blocks = nn.ModuleList() self.upsample_blocks = nn.ModuleList() for i_level in range(self.num_resolutions): res_block = [] block_out = ch * ch_mult[i_level] for i_block in range(self.num_res_blocks + 1): res_block.append(ResnetBlock(in_channels=block_in, out_channels=block_out, temb_channels=self.temb_ch, dropout=dropout)) block_in = block_out self.res_blocks.append(nn.ModuleList(res_block)) if i_level != self.num_resolutions - 1: self.upsample_blocks.append(Upsample(block_in, True)) curr_res = curr_res * 2 # end self.norm_out = Normalize(block_in) self.conv_out = torch.nn.Conv2d(block_in, out_channels, kernel_size=3, stride=1, padding=1) def forward(self, x): # upsampling h = x for k, i_level in enumerate(range(self.num_resolutions)): for i_block in range(self.num_res_blocks + 1): h = self.res_blocks[i_level][i_block](h, None) if i_level != self.num_resolutions - 1: h = self.upsample_blocks[k](h) h = self.norm_out(h) h = nonlinearity(h) h = self.conv_out(h) return h class LatentRescaler(nn.Module): def __init__(self, factor, in_channels, mid_channels, out_channels, depth=2): super().__init__() # residual block, interpolate, residual block self.factor = factor self.conv_in = nn.Conv2d(in_channels, mid_channels, kernel_size=3, stride=1, padding=1) self.res_block1 = nn.ModuleList([ResnetBlock(in_channels=mid_channels, out_channels=mid_channels, temb_channels=0, dropout=0.0) for _ in range(depth)]) self.attn = AttnBlock(mid_channels) self.res_block2 = nn.ModuleList([ResnetBlock(in_channels=mid_channels, out_channels=mid_channels, temb_channels=0, dropout=0.0) for _ in range(depth)]) self.conv_out = nn.Conv2d(mid_channels, out_channels, kernel_size=1, ) def forward(self, x): x = self.conv_in(x) for block in self.res_block1: x = block(x, None) x = torch.nn.functional.interpolate(x, size=(int(round(x.shape[2] * self.factor)), int(round(x.shape[3] * self.factor)))) x = self.attn(x) for block in self.res_block2: x = block(x, None) x = self.conv_out(x) return x class MergedRescaleEncoder(nn.Module): def __init__(self, in_channels, ch, resolution, out_ch, num_res_blocks, attn_resolutions, dropout=0.0, resamp_with_conv=True, ch_mult=(1, 2, 4, 8), rescale_factor=1.0, rescale_module_depth=1): super().__init__() intermediate_chn = ch * ch_mult[-1] self.encoder = Encoder(in_channels=in_channels, num_res_blocks=num_res_blocks, ch=ch, ch_mult=ch_mult, z_channels=intermediate_chn, double_z=False, resolution=resolution, attn_resolutions=attn_resolutions, dropout=dropout, resamp_with_conv=resamp_with_conv, out_ch=None) self.rescaler = LatentRescaler(factor=rescale_factor, in_channels=intermediate_chn, mid_channels=intermediate_chn, out_channels=out_ch, depth=rescale_module_depth) def forward(self, x): x = self.encoder(x) x = self.rescaler(x) return x class MergedRescaleDecoder(nn.Module): def __init__(self, z_channels, out_ch, resolution, num_res_blocks, attn_resolutions, ch, ch_mult=(1, 2, 4, 8), dropout=0.0, resamp_with_conv=True, rescale_factor=1.0, rescale_module_depth=1): super().__init__() tmp_chn = z_channels * ch_mult[-1] self.decoder = Decoder(out_ch=out_ch, z_channels=tmp_chn, attn_resolutions=attn_resolutions, dropout=dropout, resamp_with_conv=resamp_with_conv, in_channels=None, num_res_blocks=num_res_blocks, ch_mult=ch_mult, resolution=resolution, ch=ch) self.rescaler = LatentRescaler(factor=rescale_factor, in_channels=z_channels, mid_channels=tmp_chn, out_channels=tmp_chn, depth=rescale_module_depth) def forward(self, x): x = self.rescaler(x) x = self.decoder(x) return x class Upsampler(nn.Module): def __init__(self, in_size, out_size, in_channels, out_channels, ch_mult=2): super().__init__() assert out_size >= in_size num_blocks = int(np.log2(out_size // in_size)) + 1 factor_up = 1. + (out_size % in_size) print(f"Building {self.__class__.__name__} with in_size: {in_size} --> out_size {out_size} and factor {factor_up}") self.rescaler = LatentRescaler(factor=factor_up, in_channels=in_channels, mid_channels=2 * in_channels, out_channels=in_channels) self.decoder = Decoder(out_ch=out_channels, resolution=out_size, z_channels=in_channels, num_res_blocks=2, attn_resolutions=[], in_channels=None, ch=in_channels, ch_mult=[ch_mult for _ in range(num_blocks)]) def forward(self, x): x = self.rescaler(x) x = self.decoder(x) return x class Resize(nn.Module): def __init__(self, in_channels=None, learned=False, mode="bilinear"): super().__init__() self.with_conv = learned self.mode = mode if self.with_conv: print(f"Note: {self.__class__.__name} uses learned downsampling and will ignore the fixed {mode} mode") raise NotImplementedError() assert in_channels is not None # no asymmetric padding in torch conv, must do it ourselves self.conv = torch.nn.Conv2d(in_channels, in_channels, kernel_size=4, stride=2, padding=1) def forward(self, x, scale_factor=1.0): if scale_factor == 1.0: return x else: x = torch.nn.functional.interpolate(x, mode=self.mode, align_corners=False, scale_factor=scale_factor) return x class FirstStagePostProcessor(nn.Module): def __init__(self, ch_mult: list, in_channels, pretrained_model: nn.Module = None, reshape=False, n_channels=None, dropout=0., pretrained_config=None): super().__init__() if pretrained_config is None: assert pretrained_model is not None, 'Either "pretrained_model" or "pretrained_config" must not be None' self.pretrained_model = pretrained_model else: assert pretrained_config is not None, 'Either "pretrained_model" or "pretrained_config" must not be None' self.instantiate_pretrained(pretrained_config) self.do_reshape = reshape if n_channels is None: n_channels = self.pretrained_model.encoder.ch self.proj_norm = Normalize(in_channels, num_groups=in_channels // 2) self.proj = nn.Conv2d(in_channels, n_channels, kernel_size=3, stride=1, padding=1) blocks = [] downs = [] ch_in = n_channels for m in ch_mult: blocks.append(ResnetBlock(in_channels=ch_in, out_channels=m * n_channels, dropout=dropout)) ch_in = m * n_channels downs.append(Downsample(ch_in, with_conv=False)) self.model = nn.ModuleList(blocks) self.downsampler = nn.ModuleList(downs) def instantiate_pretrained(self, config): model = instantiate_from_config(config) self.pretrained_model = model.eval() # self.pretrained_model.train = False for param in self.pretrained_model.parameters(): param.requires_grad = False @torch.no_grad() def encode_with_pretrained(self, x): c = self.pretrained_model.encode(x) if isinstance(c, DiagonalGaussianDistribution): c = c.mode() return c def forward(self, x): z_fs = self.encode_with_pretrained(x) z = self.proj_norm(z_fs) z = self.proj(z) z = nonlinearity(z) for submodel, downmodel in zip(self.model, self.downsampler): z = submodel(z, temb=None) z = downmodel(z) if self.do_reshape: z = rearrange(z, 'b c h w -> b (h w) c') return z ================================================ FILE: ToonCrafter/lvdm/modules/networks/openaimodel3d.py ================================================ from functools import partial from abc import abstractmethod import torch import torch.nn as nn from einops import rearrange import torch.nn.functional as F from lvdm.models.utils_diffusion import timestep_embedding from lvdm.common import checkpoint from lvdm.basics import ( zero_module, conv_nd, linear, avg_pool_nd, normalization ) from lvdm.modules.attention import SpatialTransformer, TemporalTransformer class TimestepBlock(nn.Module): """ Any module where forward() takes timestep embeddings as a second argument. """ @abstractmethod def forward(self, x, emb): """ Apply the module to `x` given `emb` timestep embeddings. """ class TimestepEmbedSequential(nn.Sequential, TimestepBlock): """ A sequential module that passes timestep embeddings to the children that support it as an extra input. """ def forward(self, x, emb, context=None, batch_size=None): for layer in self: if isinstance(layer, TimestepBlock): x = layer(x, emb, batch_size=batch_size) elif isinstance(layer, SpatialTransformer): x = layer(x, context) elif isinstance(layer, TemporalTransformer): x = rearrange(x, '(b f) c h w -> b c f h w', b=batch_size) x = layer(x, context) x = rearrange(x, 'b c f h w -> (b f) c h w') else: x = layer(x) return x class Downsample(nn.Module): """ A downsampling layer with an optional convolution. :param channels: channels in the inputs and outputs. :param use_conv: a bool determining if a convolution is applied. :param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then downsampling occurs in the inner-two dimensions. """ def __init__(self, channels, use_conv, dims=2, out_channels=None, padding=1): super().__init__() self.channels = channels self.out_channels = out_channels or channels self.use_conv = use_conv self.dims = dims stride = 2 if dims != 3 else (1, 2, 2) if use_conv: self.op = conv_nd( dims, self.channels, self.out_channels, 3, stride=stride, padding=padding ) else: assert self.channels == self.out_channels self.op = avg_pool_nd(dims, kernel_size=stride, stride=stride) def forward(self, x): assert x.shape[1] == self.channels return self.op(x) class Upsample(nn.Module): """ An upsampling layer with an optional convolution. :param channels: channels in the inputs and outputs. :param use_conv: a bool determining if a convolution is applied. :param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then upsampling occurs in the inner-two dimensions. """ def __init__(self, channels, use_conv, dims=2, out_channels=None, padding=1): super().__init__() self.channels = channels self.out_channels = out_channels or channels self.use_conv = use_conv self.dims = dims if use_conv: self.conv = conv_nd(dims, self.channels, self.out_channels, 3, padding=padding) def forward(self, x): assert x.shape[1] == self.channels if self.dims == 3: x = F.interpolate(x, (x.shape[2], x.shape[3] * 2, x.shape[4] * 2), mode='nearest') else: x = F.interpolate(x, scale_factor=2, mode='nearest') if self.use_conv: x = self.conv(x) return x class ResBlock(TimestepBlock): """ A residual block that can optionally change the number of channels. :param channels: the number of input channels. :param emb_channels: the number of timestep embedding channels. :param dropout: the rate of dropout. :param out_channels: if specified, the number of out channels. :param use_conv: if True and out_channels is specified, use a spatial convolution instead of a smaller 1x1 convolution to change the channels in the skip connection. :param dims: determines if the signal is 1D, 2D, or 3D. :param up: if True, use this block for upsampling. :param down: if True, use this block for downsampling. :param use_temporal_conv: if True, use the temporal convolution. :param use_image_dataset: if True, the temporal parameters will not be optimized. """ def __init__( self, channels, emb_channels, dropout, out_channels=None, use_scale_shift_norm=False, dims=2, use_checkpoint=False, use_conv=False, up=False, down=False, use_temporal_conv=False, tempspatial_aware=False ): super().__init__() self.channels = channels self.emb_channels = emb_channels self.dropout = dropout self.out_channels = out_channels or channels self.use_conv = use_conv self.use_checkpoint = use_checkpoint self.use_scale_shift_norm = use_scale_shift_norm self.use_temporal_conv = use_temporal_conv self.in_layers = nn.Sequential( normalization(channels), nn.SiLU(), conv_nd(dims, channels, self.out_channels, 3, padding=1), ) self.updown = up or down if up: self.h_upd = Upsample(channels, False, dims) self.x_upd = Upsample(channels, False, dims) elif down: self.h_upd = Downsample(channels, False, dims) self.x_upd = Downsample(channels, False, dims) else: self.h_upd = self.x_upd = nn.Identity() self.emb_layers = nn.Sequential( nn.SiLU(), nn.Linear( emb_channels, 2 * self.out_channels if use_scale_shift_norm else self.out_channels, ), ) self.out_layers = nn.Sequential( normalization(self.out_channels), nn.SiLU(), nn.Dropout(p=dropout), zero_module(nn.Conv2d(self.out_channels, self.out_channels, 3, padding=1)), ) if self.out_channels == channels: self.skip_connection = nn.Identity() elif use_conv: self.skip_connection = conv_nd(dims, channels, self.out_channels, 3, padding=1) else: self.skip_connection = conv_nd(dims, channels, self.out_channels, 1) if self.use_temporal_conv: self.temopral_conv = TemporalConvBlock( self.out_channels, self.out_channels, dropout=0.1, spatial_aware=tempspatial_aware ) def forward(self, x, emb, batch_size=None): """ Apply the block to a Tensor, conditioned on a timestep embedding. :param x: an [N x C x ...] Tensor of features. :param emb: an [N x emb_channels] Tensor of timestep embeddings. :return: an [N x C x ...] Tensor of outputs. """ input_tuple = (x, emb) if batch_size: forward_batchsize = partial(self._forward, batch_size=batch_size) return checkpoint(forward_batchsize, input_tuple, self.parameters(), self.use_checkpoint) return checkpoint(self._forward, input_tuple, self.parameters(), self.use_checkpoint) def _forward(self, x, emb, batch_size=None): if self.updown: in_rest, in_conv = self.in_layers[:-1], self.in_layers[-1] h = in_rest(x) h = self.h_upd(h) x = self.x_upd(x) h = in_conv(h) else: h = self.in_layers(x) emb_out = self.emb_layers(emb).type(h.dtype) while len(emb_out.shape) < len(h.shape): emb_out = emb_out[..., None] if self.use_scale_shift_norm: out_norm, out_rest = self.out_layers[0], self.out_layers[1:] scale, shift = torch.chunk(emb_out, 2, dim=1) h = out_norm(h) * (1 + scale) + shift h = out_rest(h) else: h = h + emb_out h = self.out_layers(h) h = self.skip_connection(x) + h if self.use_temporal_conv and batch_size: h = rearrange(h, '(b t) c h w -> b c t h w', b=batch_size) h = self.temopral_conv(h) h = rearrange(h, 'b c t h w -> (b t) c h w') return h class TemporalConvBlock(nn.Module): """ Adapted from modelscope: https://github.com/modelscope/modelscope/blob/master/modelscope/models/multi_modal/video_synthesis/unet_sd.py """ def __init__(self, in_channels, out_channels=None, dropout=0.0, spatial_aware=False): super(TemporalConvBlock, self).__init__() if out_channels is None: out_channels = in_channels self.in_channels = in_channels self.out_channels = out_channels th_kernel_shape = (3, 1, 1) if not spatial_aware else (3, 3, 1) th_padding_shape = (1, 0, 0) if not spatial_aware else (1, 1, 0) tw_kernel_shape = (3, 1, 1) if not spatial_aware else (3, 1, 3) tw_padding_shape = (1, 0, 0) if not spatial_aware else (1, 0, 1) # conv layers self.conv1 = nn.Sequential( nn.GroupNorm(32, in_channels), nn.SiLU(), nn.Conv3d(in_channels, out_channels, th_kernel_shape, padding=th_padding_shape)) self.conv2 = nn.Sequential( nn.GroupNorm(32, out_channels), nn.SiLU(), nn.Dropout(dropout), nn.Conv3d(out_channels, in_channels, tw_kernel_shape, padding=tw_padding_shape)) self.conv3 = nn.Sequential( nn.GroupNorm(32, out_channels), nn.SiLU(), nn.Dropout(dropout), nn.Conv3d(out_channels, in_channels, th_kernel_shape, padding=th_padding_shape)) self.conv4 = nn.Sequential( nn.GroupNorm(32, out_channels), nn.SiLU(), nn.Dropout(dropout), nn.Conv3d(out_channels, in_channels, tw_kernel_shape, padding=tw_padding_shape)) # zero out the last layer params,so the conv block is identity nn.init.zeros_(self.conv4[-1].weight) nn.init.zeros_(self.conv4[-1].bias) def forward(self, x): identity = x x = self.conv1(x) x = self.conv2(x) x = self.conv3(x) x = self.conv4(x) return identity + x class UNetModel(nn.Module): """ The full UNet model with attention and timestep embedding. :param in_channels: in_channels in the input Tensor. :param model_channels: base channel count for the model. :param out_channels: channels in the output Tensor. :param num_res_blocks: number of residual blocks per downsample. :param attention_resolutions: a collection of downsample rates at which attention will take place. May be a set, list, or tuple. For example, if this contains 4, then at 4x downsampling, attention will be used. :param dropout: the dropout probability. :param channel_mult: channel multiplier for each level of the UNet. :param conv_resample: if True, use learned convolutions for upsampling and downsampling. :param dims: determines if the signal is 1D, 2D, or 3D. :param num_classes: if specified (as an int), then this model will be class-conditional with `num_classes` classes. :param use_checkpoint: use gradient checkpointing to reduce memory usage. :param num_heads: the number of attention heads in each attention layer. :param num_heads_channels: if specified, ignore num_heads and instead use a fixed channel width per attention head. :param num_heads_upsample: works with num_heads to set a different number of heads for upsampling. Deprecated. :param use_scale_shift_norm: use a FiLM-like conditioning mechanism. :param resblock_updown: use residual blocks for up/downsampling. :param use_new_attention_order: use a different attention pattern for potentially increased efficiency. """ def __init__(self, in_channels, model_channels, out_channels, num_res_blocks, attention_resolutions, dropout=0.0, channel_mult=(1, 2, 4, 8), conv_resample=True, dims=2, context_dim=None, use_scale_shift_norm=False, resblock_updown=False, num_heads=-1, num_head_channels=-1, transformer_depth=1, use_linear=False, use_checkpoint=False, temporal_conv=False, tempspatial_aware=False, temporal_attention=True, use_relative_position=True, use_causal_attention=False, temporal_length=None, use_fp16=False, addition_attention=False, temporal_selfatt_only=True, image_cross_attention=False, image_cross_attention_scale_learnable=False, default_fs=4, fs_condition=False, ): super(UNetModel, self).__init__() if num_heads == -1: assert num_head_channels != -1, 'Either num_heads or num_head_channels has to be set' if num_head_channels == -1: assert num_heads != -1, 'Either num_heads or num_head_channels has to be set' self.in_channels = in_channels self.model_channels = model_channels self.out_channels = out_channels self.num_res_blocks = num_res_blocks self.attention_resolutions = attention_resolutions self.dropout = dropout self.channel_mult = channel_mult self.conv_resample = conv_resample self.temporal_attention = temporal_attention time_embed_dim = model_channels * 4 self.use_checkpoint = use_checkpoint self.dtype = torch.float16 if use_fp16 else torch.float32 temporal_self_att_only = True self.addition_attention = addition_attention self.temporal_length = temporal_length self.image_cross_attention = image_cross_attention self.image_cross_attention_scale_learnable = image_cross_attention_scale_learnable self.default_fs = default_fs self.fs_condition = fs_condition ## Time embedding blocks self.time_embed = nn.Sequential( linear(model_channels, time_embed_dim), nn.SiLU(), linear(time_embed_dim, time_embed_dim), ) if fs_condition: self.fps_embedding = nn.Sequential( linear(model_channels, time_embed_dim), nn.SiLU(), linear(time_embed_dim, time_embed_dim), ) nn.init.zeros_(self.fps_embedding[-1].weight) nn.init.zeros_(self.fps_embedding[-1].bias) ## Input Block self.input_blocks = nn.ModuleList( [ TimestepEmbedSequential(conv_nd(dims, in_channels, model_channels, 3, padding=1)) ] ) if self.addition_attention: self.init_attn=TimestepEmbedSequential( TemporalTransformer( model_channels, n_heads=8, d_head=num_head_channels, depth=transformer_depth, context_dim=context_dim, use_checkpoint=use_checkpoint, only_self_att=temporal_selfatt_only, causal_attention=False, relative_position=use_relative_position, temporal_length=temporal_length)) input_block_chans = [model_channels] ch = model_channels ds = 1 for level, mult in enumerate(channel_mult): for _ in range(num_res_blocks): layers = [ ResBlock(ch, time_embed_dim, dropout, out_channels=mult * model_channels, dims=dims, use_checkpoint=use_checkpoint, use_scale_shift_norm=use_scale_shift_norm, tempspatial_aware=tempspatial_aware, use_temporal_conv=temporal_conv ) ] ch = mult * model_channels if ds in attention_resolutions: if num_head_channels == -1: dim_head = ch // num_heads else: num_heads = ch // num_head_channels dim_head = num_head_channels layers.append( SpatialTransformer(ch, num_heads, dim_head, depth=transformer_depth, context_dim=context_dim, use_linear=use_linear, use_checkpoint=use_checkpoint, disable_self_attn=False, video_length=temporal_length, image_cross_attention=self.image_cross_attention, image_cross_attention_scale_learnable=self.image_cross_attention_scale_learnable, ) ) if self.temporal_attention: layers.append( TemporalTransformer(ch, num_heads, dim_head, depth=transformer_depth, context_dim=context_dim, use_linear=use_linear, use_checkpoint=use_checkpoint, only_self_att=temporal_self_att_only, causal_attention=use_causal_attention, relative_position=use_relative_position, temporal_length=temporal_length ) ) self.input_blocks.append(TimestepEmbedSequential(*layers)) input_block_chans.append(ch) if level != len(channel_mult) - 1: out_ch = ch self.input_blocks.append( TimestepEmbedSequential( ResBlock(ch, time_embed_dim, dropout, out_channels=out_ch, dims=dims, use_checkpoint=use_checkpoint, use_scale_shift_norm=use_scale_shift_norm, down=True ) if resblock_updown else Downsample(ch, conv_resample, dims=dims, out_channels=out_ch) ) ) ch = out_ch input_block_chans.append(ch) ds *= 2 if num_head_channels == -1: dim_head = ch // num_heads else: num_heads = ch // num_head_channels dim_head = num_head_channels layers = [ ResBlock(ch, time_embed_dim, dropout, dims=dims, use_checkpoint=use_checkpoint, use_scale_shift_norm=use_scale_shift_norm, tempspatial_aware=tempspatial_aware, use_temporal_conv=temporal_conv ), SpatialTransformer(ch, num_heads, dim_head, depth=transformer_depth, context_dim=context_dim, use_linear=use_linear, use_checkpoint=use_checkpoint, disable_self_attn=False, video_length=temporal_length, image_cross_attention=self.image_cross_attention,image_cross_attention_scale_learnable=self.image_cross_attention_scale_learnable ) ] if self.temporal_attention: layers.append( TemporalTransformer(ch, num_heads, dim_head, depth=transformer_depth, context_dim=context_dim, use_linear=use_linear, use_checkpoint=use_checkpoint, only_self_att=temporal_self_att_only, causal_attention=use_causal_attention, relative_position=use_relative_position, temporal_length=temporal_length ) ) layers.append( ResBlock(ch, time_embed_dim, dropout, dims=dims, use_checkpoint=use_checkpoint, use_scale_shift_norm=use_scale_shift_norm, tempspatial_aware=tempspatial_aware, use_temporal_conv=temporal_conv ) ) ## Middle Block self.middle_block = TimestepEmbedSequential(*layers) ## Output Block self.output_blocks = nn.ModuleList([]) for level, mult in list(enumerate(channel_mult))[::-1]: for i in range(num_res_blocks + 1): ich = input_block_chans.pop() layers = [ ResBlock(ch + ich, time_embed_dim, dropout, out_channels=mult * model_channels, dims=dims, use_checkpoint=use_checkpoint, use_scale_shift_norm=use_scale_shift_norm, tempspatial_aware=tempspatial_aware, use_temporal_conv=temporal_conv ) ] ch = model_channels * mult if ds in attention_resolutions: if num_head_channels == -1: dim_head = ch // num_heads else: num_heads = ch // num_head_channels dim_head = num_head_channels layers.append( SpatialTransformer(ch, num_heads, dim_head, depth=transformer_depth, context_dim=context_dim, use_linear=use_linear, use_checkpoint=use_checkpoint, disable_self_attn=False, video_length=temporal_length, image_cross_attention=self.image_cross_attention,image_cross_attention_scale_learnable=self.image_cross_attention_scale_learnable ) ) if self.temporal_attention: layers.append( TemporalTransformer(ch, num_heads, dim_head, depth=transformer_depth, context_dim=context_dim, use_linear=use_linear, use_checkpoint=use_checkpoint, only_self_att=temporal_self_att_only, causal_attention=use_causal_attention, relative_position=use_relative_position, temporal_length=temporal_length ) ) if level and i == num_res_blocks: out_ch = ch layers.append( ResBlock(ch, time_embed_dim, dropout, out_channels=out_ch, dims=dims, use_checkpoint=use_checkpoint, use_scale_shift_norm=use_scale_shift_norm, up=True ) if resblock_updown else Upsample(ch, conv_resample, dims=dims, out_channels=out_ch) ) ds //= 2 self.output_blocks.append(TimestepEmbedSequential(*layers)) self.out = nn.Sequential( normalization(ch), nn.SiLU(), zero_module(conv_nd(dims, model_channels, out_channels, 3, padding=1)), ) def forward(self, x, timesteps, context=None, features_adapter=None, fs=None, **kwargs): b,_,t,_,_ = x.shape t_emb = timestep_embedding(timesteps, self.model_channels, repeat_only=False).type(x.dtype) emb = self.time_embed(t_emb) ## repeat t times for context [(b t) 77 768] & time embedding ## check if we use per-frame image conditioning _, l_context, _ = context.shape if l_context == 77 + t*16: ## !!! HARD CODE here context_text, context_img = context[:,:77,:], context[:,77:,:] context_text = context_text.repeat_interleave(repeats=t, dim=0) context_img = rearrange(context_img, 'b (t l) c -> (b t) l c', t=t) context = torch.cat([context_text, context_img], dim=1) else: context = context.repeat_interleave(repeats=t, dim=0) emb = emb.repeat_interleave(repeats=t, dim=0) ## always in shape (b t) c h w, except for temporal layer x = rearrange(x, 'b c t h w -> (b t) c h w') ## combine emb if self.fs_condition: if fs is None: fs = torch.tensor( [self.default_fs] * b, dtype=torch.long, device=x.device) fs_emb = timestep_embedding(fs, self.model_channels, repeat_only=False).type(x.dtype) fs_embed = self.fps_embedding(fs_emb) fs_embed = fs_embed.repeat_interleave(repeats=t, dim=0) emb = emb + fs_embed if self.dtype != emb.dtype: self.dtype = emb.dtype h = x.type(self.dtype) adapter_idx = 0 hs = [] for id, module in enumerate(self.input_blocks): h = module(h, emb, context=context, batch_size=b) if id ==0 and self.addition_attention: h = self.init_attn(h, emb, context=context, batch_size=b) ## plug-in adapter features if ((id+1)%3 == 0) and features_adapter is not None: h = h + features_adapter[adapter_idx] adapter_idx += 1 hs.append(h) if features_adapter is not None: assert len(features_adapter)==adapter_idx, 'Wrong features_adapter' h = self.middle_block(h, emb, context=context, batch_size=b) for module in self.output_blocks: h = torch.cat([h, hs.pop()], dim=1) h = module(h, emb, context=context, batch_size=b) h = h.type(x.dtype) y = self.out(h) # reshape back to (b c t h w) y = rearrange(y, '(b t) c h w -> b c t h w', b=b) return y ================================================ FILE: ToonCrafter/lvdm/modules/x_transformer.py ================================================ """shout-out to https://github.com/lucidrains/x-transformers/tree/main/x_transformers""" from functools import partial from inspect import isfunction from collections import namedtuple from einops import rearrange, repeat import torch from torch import nn, einsum import torch.nn.functional as F # constants DEFAULT_DIM_HEAD = 64 Intermediates = namedtuple('Intermediates', [ 'pre_softmax_attn', 'post_softmax_attn' ]) LayerIntermediates = namedtuple('Intermediates', [ 'hiddens', 'attn_intermediates' ]) class AbsolutePositionalEmbedding(nn.Module): def __init__(self, dim, max_seq_len): super().__init__() self.emb = nn.Embedding(max_seq_len, dim) self.init_() def init_(self): nn.init.normal_(self.emb.weight, std=0.02) def forward(self, x): n = torch.arange(x.shape[1], device=x.device) return self.emb(n)[None, :, :] class FixedPositionalEmbedding(nn.Module): def __init__(self, dim): super().__init__() inv_freq = 1. / (10000 ** (torch.arange(0, dim, 2).float() / dim)) self.register_buffer('inv_freq', inv_freq) def forward(self, x, seq_dim=1, offset=0): t = torch.arange(x.shape[seq_dim], device=x.device).type_as(self.inv_freq) + offset sinusoid_inp = torch.einsum('i , j -> i j', t, self.inv_freq) emb = torch.cat((sinusoid_inp.sin(), sinusoid_inp.cos()), dim=-1) return emb[None, :, :] # helpers def exists(val): return val is not None def default(val, d): if exists(val): return val return d() if isfunction(d) else d def always(val): def inner(*args, **kwargs): return val return inner def not_equals(val): def inner(x): return x != val return inner def equals(val): def inner(x): return x == val return inner def max_neg_value(tensor): return -torch.finfo(tensor.dtype).max # keyword argument helpers def pick_and_pop(keys, d): values = list(map(lambda key: d.pop(key), keys)) return dict(zip(keys, values)) def group_dict_by_key(cond, d): return_val = [dict(), dict()] for key in d.keys(): match = bool(cond(key)) ind = int(not match) return_val[ind][key] = d[key] return (*return_val,) def string_begins_with(prefix, str): return str.startswith(prefix) def group_by_key_prefix(prefix, d): return group_dict_by_key(partial(string_begins_with, prefix), d) def groupby_prefix_and_trim(prefix, d): kwargs_with_prefix, kwargs = group_dict_by_key(partial(string_begins_with, prefix), d) kwargs_without_prefix = dict(map(lambda x: (x[0][len(prefix):], x[1]), tuple(kwargs_with_prefix.items()))) return kwargs_without_prefix, kwargs # classes class Scale(nn.Module): def __init__(self, value, fn): super().__init__() self.value = value self.fn = fn def forward(self, x, **kwargs): x, *rest = self.fn(x, **kwargs) return (x * self.value, *rest) class Rezero(nn.Module): def __init__(self, fn): super().__init__() self.fn = fn self.g = nn.Parameter(torch.zeros(1)) def forward(self, x, **kwargs): x, *rest = self.fn(x, **kwargs) return (x * self.g, *rest) class ScaleNorm(nn.Module): def __init__(self, dim, eps=1e-5): super().__init__() self.scale = dim ** -0.5 self.eps = eps self.g = nn.Parameter(torch.ones(1)) def forward(self, x): norm = torch.norm(x, dim=-1, keepdim=True) * self.scale return x / norm.clamp(min=self.eps) * self.g class RMSNorm(nn.Module): def __init__(self, dim, eps=1e-8): super().__init__() self.scale = dim ** -0.5 self.eps = eps self.g = nn.Parameter(torch.ones(dim)) def forward(self, x): norm = torch.norm(x, dim=-1, keepdim=True) * self.scale return x / norm.clamp(min=self.eps) * self.g class Residual(nn.Module): def forward(self, x, residual): return x + residual class GRUGating(nn.Module): def __init__(self, dim): super().__init__() self.gru = nn.GRUCell(dim, dim) def forward(self, x, residual): gated_output = self.gru( rearrange(x, 'b n d -> (b n) d'), rearrange(residual, 'b n d -> (b n) d') ) return gated_output.reshape_as(x) # feedforward class GEGLU(nn.Module): def __init__(self, dim_in, dim_out): super().__init__() self.proj = nn.Linear(dim_in, dim_out * 2) def forward(self, x): x, gate = self.proj(x).chunk(2, dim=-1) return x * F.gelu(gate) class FeedForward(nn.Module): def __init__(self, dim, dim_out=None, mult=4, glu=False, dropout=0.): super().__init__() inner_dim = int(dim * mult) dim_out = default(dim_out, dim) project_in = nn.Sequential( nn.Linear(dim, inner_dim), nn.GELU() ) if not glu else GEGLU(dim, inner_dim) self.net = nn.Sequential( project_in, nn.Dropout(dropout), nn.Linear(inner_dim, dim_out) ) def forward(self, x): return self.net(x) # attention. class Attention(nn.Module): def __init__( self, dim, dim_head=DEFAULT_DIM_HEAD, heads=8, causal=False, mask=None, talking_heads=False, sparse_topk=None, use_entmax15=False, num_mem_kv=0, dropout=0., on_attn=False ): super().__init__() if use_entmax15: raise NotImplementedError("Check out entmax activation instead of softmax activation!") self.scale = dim_head ** -0.5 self.heads = heads self.causal = causal self.mask = mask inner_dim = dim_head * heads self.to_q = nn.Linear(dim, inner_dim, bias=False) self.to_k = nn.Linear(dim, inner_dim, bias=False) self.to_v = nn.Linear(dim, inner_dim, bias=False) self.dropout = nn.Dropout(dropout) # talking heads self.talking_heads = talking_heads if talking_heads: self.pre_softmax_proj = nn.Parameter(torch.randn(heads, heads)) self.post_softmax_proj = nn.Parameter(torch.randn(heads, heads)) # explicit topk sparse attention self.sparse_topk = sparse_topk # entmax #self.attn_fn = entmax15 if use_entmax15 else F.softmax self.attn_fn = F.softmax # add memory key / values self.num_mem_kv = num_mem_kv if num_mem_kv > 0: self.mem_k = nn.Parameter(torch.randn(heads, num_mem_kv, dim_head)) self.mem_v = nn.Parameter(torch.randn(heads, num_mem_kv, dim_head)) # attention on attention self.attn_on_attn = on_attn self.to_out = nn.Sequential(nn.Linear(inner_dim, dim * 2), nn.GLU()) if on_attn else nn.Linear(inner_dim, dim) def forward( self, x, context=None, mask=None, context_mask=None, rel_pos=None, sinusoidal_emb=None, prev_attn=None, mem=None ): b, n, _, h, talking_heads, device = *x.shape, self.heads, self.talking_heads, x.device kv_input = default(context, x) q_input = x k_input = kv_input v_input = kv_input if exists(mem): k_input = torch.cat((mem, k_input), dim=-2) v_input = torch.cat((mem, v_input), dim=-2) if exists(sinusoidal_emb): # in shortformer, the query would start at a position offset depending on the past cached memory offset = k_input.shape[-2] - q_input.shape[-2] q_input = q_input + sinusoidal_emb(q_input, offset=offset) k_input = k_input + sinusoidal_emb(k_input) q = self.to_q(q_input) k = self.to_k(k_input) v = self.to_v(v_input) q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h=h), (q, k, v)) input_mask = None if any(map(exists, (mask, context_mask))): q_mask = default(mask, lambda: torch.ones((b, n), device=device).bool()) k_mask = q_mask if not exists(context) else context_mask k_mask = default(k_mask, lambda: torch.ones((b, k.shape[-2]), device=device).bool()) q_mask = rearrange(q_mask, 'b i -> b () i ()') k_mask = rearrange(k_mask, 'b j -> b () () j') input_mask = q_mask * k_mask if self.num_mem_kv > 0: mem_k, mem_v = map(lambda t: repeat(t, 'h n d -> b h n d', b=b), (self.mem_k, self.mem_v)) k = torch.cat((mem_k, k), dim=-2) v = torch.cat((mem_v, v), dim=-2) if exists(input_mask): input_mask = F.pad(input_mask, (self.num_mem_kv, 0), value=True) dots = einsum('b h i d, b h j d -> b h i j', q, k) * self.scale mask_value = max_neg_value(dots) if exists(prev_attn): dots = dots + prev_attn pre_softmax_attn = dots if talking_heads: dots = einsum('b h i j, h k -> b k i j', dots, self.pre_softmax_proj).contiguous() if exists(rel_pos): dots = rel_pos(dots) if exists(input_mask): dots.masked_fill_(~input_mask, mask_value) del input_mask if self.causal: i, j = dots.shape[-2:] r = torch.arange(i, device=device) mask = rearrange(r, 'i -> () () i ()') < rearrange(r, 'j -> () () () j') mask = F.pad(mask, (j - i, 0), value=False) dots.masked_fill_(mask, mask_value) del mask if exists(self.sparse_topk) and self.sparse_topk < dots.shape[-1]: top, _ = dots.topk(self.sparse_topk, dim=-1) vk = top[..., -1].unsqueeze(-1).expand_as(dots) mask = dots < vk dots.masked_fill_(mask, mask_value) del mask attn = self.attn_fn(dots, dim=-1) post_softmax_attn = attn attn = self.dropout(attn) if talking_heads: attn = einsum('b h i j, h k -> b k i j', attn, self.post_softmax_proj).contiguous() out = einsum('b h i j, b h j d -> b h i d', attn, v) out = rearrange(out, 'b h n d -> b n (h d)') intermediates = Intermediates( pre_softmax_attn=pre_softmax_attn, post_softmax_attn=post_softmax_attn ) return self.to_out(out), intermediates class AttentionLayers(nn.Module): def __init__( self, dim, depth, heads=8, causal=False, cross_attend=False, only_cross=False, use_scalenorm=False, use_rmsnorm=False, use_rezero=False, rel_pos_num_buckets=32, rel_pos_max_distance=128, position_infused_attn=False, custom_layers=None, sandwich_coef=None, par_ratio=None, residual_attn=False, cross_residual_attn=False, macaron=False, pre_norm=True, gate_residual=False, **kwargs ): super().__init__() ff_kwargs, kwargs = groupby_prefix_and_trim('ff_', kwargs) attn_kwargs, _ = groupby_prefix_and_trim('attn_', kwargs) dim_head = attn_kwargs.get('dim_head', DEFAULT_DIM_HEAD) self.dim = dim self.depth = depth self.layers = nn.ModuleList([]) self.has_pos_emb = position_infused_attn self.pia_pos_emb = FixedPositionalEmbedding(dim) if position_infused_attn else None self.rotary_pos_emb = always(None) assert rel_pos_num_buckets <= rel_pos_max_distance, 'number of relative position buckets must be less than the relative position max distance' self.rel_pos = None self.pre_norm = pre_norm self.residual_attn = residual_attn self.cross_residual_attn = cross_residual_attn norm_class = ScaleNorm if use_scalenorm else nn.LayerNorm norm_class = RMSNorm if use_rmsnorm else norm_class norm_fn = partial(norm_class, dim) norm_fn = nn.Identity if use_rezero else norm_fn branch_fn = Rezero if use_rezero else None if cross_attend and not only_cross: default_block = ('a', 'c', 'f') elif cross_attend and only_cross: default_block = ('c', 'f') else: default_block = ('a', 'f') if macaron: default_block = ('f',) + default_block if exists(custom_layers): layer_types = custom_layers elif exists(par_ratio): par_depth = depth * len(default_block) assert 1 < par_ratio <= par_depth, 'par ratio out of range' default_block = tuple(filter(not_equals('f'), default_block)) par_attn = par_depth // par_ratio depth_cut = par_depth * 2 // 3 # 2 / 3 attention layer cutoff suggested by PAR paper par_width = (depth_cut + depth_cut // par_attn) // par_attn assert len(default_block) <= par_width, 'default block is too large for par_ratio' par_block = default_block + ('f',) * (par_width - len(default_block)) par_head = par_block * par_attn layer_types = par_head + ('f',) * (par_depth - len(par_head)) elif exists(sandwich_coef): assert sandwich_coef > 0 and sandwich_coef <= depth, 'sandwich coefficient should be less than the depth' layer_types = ('a',) * sandwich_coef + default_block * (depth - sandwich_coef) + ('f',) * sandwich_coef else: layer_types = default_block * depth self.layer_types = layer_types self.num_attn_layers = len(list(filter(equals('a'), layer_types))) for layer_type in self.layer_types: if layer_type == 'a': layer = Attention(dim, heads=heads, causal=causal, **attn_kwargs) elif layer_type == 'c': layer = Attention(dim, heads=heads, **attn_kwargs) elif layer_type == 'f': layer = FeedForward(dim, **ff_kwargs) layer = layer if not macaron else Scale(0.5, layer) else: raise Exception(f'invalid layer type {layer_type}') if isinstance(layer, Attention) and exists(branch_fn): layer = branch_fn(layer) if gate_residual: residual_fn = GRUGating(dim) else: residual_fn = Residual() self.layers.append(nn.ModuleList([ norm_fn(), layer, residual_fn ])) def forward( self, x, context=None, mask=None, context_mask=None, mems=None, return_hiddens=False ): hiddens = [] intermediates = [] prev_attn = None prev_cross_attn = None mems = mems.copy() if exists(mems) else [None] * self.num_attn_layers for ind, (layer_type, (norm, block, residual_fn)) in enumerate(zip(self.layer_types, self.layers)): is_last = ind == (len(self.layers) - 1) if layer_type == 'a': hiddens.append(x) layer_mem = mems.pop(0) residual = x if self.pre_norm: x = norm(x) if layer_type == 'a': out, inter = block(x, mask=mask, sinusoidal_emb=self.pia_pos_emb, rel_pos=self.rel_pos, prev_attn=prev_attn, mem=layer_mem) elif layer_type == 'c': out, inter = block(x, context=context, mask=mask, context_mask=context_mask, prev_attn=prev_cross_attn) elif layer_type == 'f': out = block(x) x = residual_fn(out, residual) if layer_type in ('a', 'c'): intermediates.append(inter) if layer_type == 'a' and self.residual_attn: prev_attn = inter.pre_softmax_attn elif layer_type == 'c' and self.cross_residual_attn: prev_cross_attn = inter.pre_softmax_attn if not self.pre_norm and not is_last: x = norm(x) if return_hiddens: intermediates = LayerIntermediates( hiddens=hiddens, attn_intermediates=intermediates ) return x, intermediates return x class Encoder(AttentionLayers): def __init__(self, **kwargs): assert 'causal' not in kwargs, 'cannot set causality on encoder' super().__init__(causal=False, **kwargs) class TransformerWrapper(nn.Module): def __init__( self, *, num_tokens, max_seq_len, attn_layers, emb_dim=None, max_mem_len=0., emb_dropout=0., num_memory_tokens=None, tie_embedding=False, use_pos_emb=True ): super().__init__() assert isinstance(attn_layers, AttentionLayers), 'attention layers must be one of Encoder or Decoder' dim = attn_layers.dim emb_dim = default(emb_dim, dim) self.max_seq_len = max_seq_len self.max_mem_len = max_mem_len self.num_tokens = num_tokens self.token_emb = nn.Embedding(num_tokens, emb_dim) self.pos_emb = AbsolutePositionalEmbedding(emb_dim, max_seq_len) if ( use_pos_emb and not attn_layers.has_pos_emb) else always(0) self.emb_dropout = nn.Dropout(emb_dropout) self.project_emb = nn.Linear(emb_dim, dim) if emb_dim != dim else nn.Identity() self.attn_layers = attn_layers self.norm = nn.LayerNorm(dim) self.init_() self.to_logits = nn.Linear(dim, num_tokens) if not tie_embedding else lambda t: t @ self.token_emb.weight.t() # memory tokens (like [cls]) from Memory Transformers paper num_memory_tokens = default(num_memory_tokens, 0) self.num_memory_tokens = num_memory_tokens if num_memory_tokens > 0: self.memory_tokens = nn.Parameter(torch.randn(num_memory_tokens, dim)) # let funnel encoder know number of memory tokens, if specified if hasattr(attn_layers, 'num_memory_tokens'): attn_layers.num_memory_tokens = num_memory_tokens def init_(self): nn.init.normal_(self.token_emb.weight, std=0.02) def forward( self, x, return_embeddings=False, mask=None, return_mems=False, return_attn=False, mems=None, **kwargs ): b, n, device, num_mem = *x.shape, x.device, self.num_memory_tokens x = self.token_emb(x) x += self.pos_emb(x) x = self.emb_dropout(x) x = self.project_emb(x) if num_mem > 0: mem = repeat(self.memory_tokens, 'n d -> b n d', b=b) x = torch.cat((mem, x), dim=1) # auto-handle masking after appending memory tokens if exists(mask): mask = F.pad(mask, (num_mem, 0), value=True) x, intermediates = self.attn_layers(x, mask=mask, mems=mems, return_hiddens=True, **kwargs) x = self.norm(x) mem, x = x[:, :num_mem], x[:, num_mem:] out = self.to_logits(x) if not return_embeddings else x if return_mems: hiddens = intermediates.hiddens new_mems = list(map(lambda pair: torch.cat(pair, dim=-2), zip(mems, hiddens))) if exists(mems) else hiddens new_mems = list(map(lambda t: t[..., -self.max_mem_len:, :].detach(), new_mems)) return out, new_mems if return_attn: attn_maps = list(map(lambda t: t.post_softmax_attn, intermediates.attn_intermediates)) return out, attn_maps return out ================================================ FILE: ToonCrafter/main/__init__.py ================================================ ================================================ FILE: ToonCrafter/main/callbacks.py ================================================ import os import time import logging mainlogger = logging.getLogger('mainlogger') import torch import torchvision import pytorch_lightning as pl from pytorch_lightning.callbacks import Callback from pytorch_lightning.utilities import rank_zero_only from pytorch_lightning.utilities import rank_zero_info from utils.save_video import log_local, prepare_to_log class ImageLogger(Callback): def __init__(self, batch_frequency, max_images=8, clamp=True, rescale=True, save_dir=None, \ to_local=False, log_images_kwargs=None): super().__init__() self.rescale = rescale self.batch_freq = batch_frequency self.max_images = max_images self.to_local = to_local self.clamp = clamp self.log_images_kwargs = log_images_kwargs if log_images_kwargs else {} if self.to_local: ## default save dir self.save_dir = os.path.join(save_dir, "images") os.makedirs(os.path.join(self.save_dir, "train"), exist_ok=True) os.makedirs(os.path.join(self.save_dir, "val"), exist_ok=True) def log_to_tensorboard(self, pl_module, batch_logs, filename, split, save_fps=8): """ log images and videos to tensorboard """ global_step = pl_module.global_step for key in batch_logs: value = batch_logs[key] tag = "gs%d-%s/%s-%s"%(global_step, split, filename, key) if isinstance(value, list) and isinstance(value[0], str): captions = ' |------| '.join(value) pl_module.logger.experiment.add_text(tag, captions, global_step=global_step) elif isinstance(value, torch.Tensor) and value.dim() == 5: video = value n = video.shape[0] video = video.permute(2, 0, 1, 3, 4) # t,n,c,h,w frame_grids = [torchvision.utils.make_grid(framesheet, nrow=int(n), padding=0) for framesheet in video] #[3, n*h, 1*w] grid = torch.stack(frame_grids, dim=0) # stack in temporal dim [t, 3, n*h, w] grid = (grid + 1.0) / 2.0 grid = grid.unsqueeze(dim=0) pl_module.logger.experiment.add_video(tag, grid, fps=save_fps, global_step=global_step) elif isinstance(value, torch.Tensor) and value.dim() == 4: img = value grid = torchvision.utils.make_grid(img, nrow=int(n), padding=0) grid = (grid + 1.0) / 2.0 # -1,1 -> 0,1; c,h,w pl_module.logger.experiment.add_image(tag, grid, global_step=global_step) else: pass @rank_zero_only def log_batch_imgs(self, pl_module, batch, batch_idx, split="train"): """ generate images, then save and log to tensorboard """ skip_freq = self.batch_freq if split == "train" else 5 if (batch_idx+1) % skip_freq == 0: is_train = pl_module.training if is_train: pl_module.eval() torch.cuda.empty_cache() with torch.no_grad(): log_func = pl_module.log_images batch_logs = log_func(batch, split=split, **self.log_images_kwargs) ## process: move to CPU and clamp batch_logs = prepare_to_log(batch_logs, self.max_images, self.clamp) torch.cuda.empty_cache() filename = "ep{}_idx{}_rank{}".format( pl_module.current_epoch, batch_idx, pl_module.global_rank) if self.to_local: mainlogger.info("Log [%s] batch <%s> to local ..."%(split, filename)) filename = "gs{}_".format(pl_module.global_step) + filename log_local(batch_logs, os.path.join(self.save_dir, split), filename, save_fps=10) else: mainlogger.info("Log [%s] batch <%s> to tensorboard ..."%(split, filename)) self.log_to_tensorboard(pl_module, batch_logs, filename, split, save_fps=10) mainlogger.info('Finish!') if is_train: pl_module.train() def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=None): if self.batch_freq != -1 and pl_module.logdir: self.log_batch_imgs(pl_module, batch, batch_idx, split="train") def on_validation_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=None): ## different with validation_step() that saving the whole validation set and only keep the latest, ## it records the performance of every validation (without overwritten) by only keep a subset if self.batch_freq != -1 and pl_module.logdir: self.log_batch_imgs(pl_module, batch, batch_idx, split="val") if hasattr(pl_module, 'calibrate_grad_norm'): if (pl_module.calibrate_grad_norm and batch_idx % 25 == 0) and batch_idx > 0: self.log_gradients(trainer, pl_module, batch_idx=batch_idx) class CUDACallback(Callback): # see https://github.com/SeanNaren/minGPT/blob/master/mingpt/callback.py def on_train_epoch_start(self, trainer, pl_module): # Reset the memory use counter # lightning update if int((pl.__version__).split('.')[1])>=7: gpu_index = trainer.strategy.root_device.index else: gpu_index = trainer.root_gpu torch.cuda.reset_peak_memory_stats(gpu_index) torch.cuda.synchronize(gpu_index) self.start_time = time.time() def on_train_epoch_end(self, trainer, pl_module): if int((pl.__version__).split('.')[1])>=7: gpu_index = trainer.strategy.root_device.index else: gpu_index = trainer.root_gpu torch.cuda.synchronize(gpu_index) max_memory = torch.cuda.max_memory_allocated(gpu_index) / 2 ** 20 epoch_time = time.time() - self.start_time try: max_memory = trainer.training_type_plugin.reduce(max_memory) epoch_time = trainer.training_type_plugin.reduce(epoch_time) rank_zero_info(f"Average Epoch time: {epoch_time:.2f} seconds") rank_zero_info(f"Average Peak memory {max_memory:.2f}MiB") except AttributeError: pass ================================================ FILE: ToonCrafter/main/trainer.py ================================================ import argparse, os, sys, datetime from omegaconf import OmegaConf from transformers import logging as transf_logging import pytorch_lightning as pl from pytorch_lightning import seed_everything from pytorch_lightning.trainer import Trainer import torch sys.path.insert(1, os.path.join(sys.path[0], '..')) from ToonCrafter.utils.utils import instantiate_from_config from utils_train import get_trainer_callbacks, get_trainer_logger, get_trainer_strategy from utils_train import set_logger, init_workspace, load_checkpoints def get_parser(**parser_kwargs): parser = argparse.ArgumentParser(**parser_kwargs) parser.add_argument("--seed", "-s", type=int, default=20230211, help="seed for seed_everything") parser.add_argument("--name", "-n", type=str, default="", help="experiment name, as saving folder") parser.add_argument("--base", "-b", nargs="*", metavar="base_config.yaml", help="paths to base configs. Loaded from left-to-right. " "Parameters can be overwritten or added with command-line options of the form `--key value`.", default=list()) parser.add_argument("--train", "-t", action='store_true', default=False, help='train') parser.add_argument("--val", "-v", action='store_true', default=False, help='val') parser.add_argument("--test", action='store_true', default=False, help='test') parser.add_argument("--logdir", "-l", type=str, default="logs", help="directory for logging dat shit") parser.add_argument("--auto_resume", action='store_true', default=False, help="resume from full-info checkpoint") parser.add_argument("--auto_resume_weight_only", action='store_true', default=False, help="resume from weight-only checkpoint") parser.add_argument("--debug", "-d", action='store_true', default=False, help="enable post-mortem debugging") return parser def get_nondefault_trainer_args(args): parser = argparse.ArgumentParser() parser = Trainer.add_argparse_args(parser) default_trainer_args = parser.parse_args([]) return sorted(k for k in vars(default_trainer_args) if getattr(args, k) != getattr(default_trainer_args, k)) if __name__ == "__main__": now = datetime.datetime.now().strftime("%Y-%m-%dT%H-%M-%S") local_rank = int(os.environ.get('LOCAL_RANK')) global_rank = int(os.environ.get('RANK')) num_rank = int(os.environ.get('WORLD_SIZE')) parser = get_parser() ## Extends existing argparse by default Trainer attributes parser = Trainer.add_argparse_args(parser) args, unknown = parser.parse_known_args() ## disable transformer warning transf_logging.set_verbosity_error() seed_everything(args.seed) ## yaml configs: "model" | "data" | "lightning" configs = [OmegaConf.load(cfg) for cfg in args.base] cli = OmegaConf.from_dotlist(unknown) config = OmegaConf.merge(*configs, cli) lightning_config = config.pop("lightning", OmegaConf.create()) trainer_config = lightning_config.get("trainer", OmegaConf.create()) ## setup workspace directories workdir, ckptdir, cfgdir, loginfo = init_workspace(args.name, args.logdir, config, lightning_config, global_rank) logger = set_logger(logfile=os.path.join(loginfo, 'log_%d:%s.txt'%(global_rank, now))) logger.info("@lightning version: %s [>=1.8 required]"%(pl.__version__)) ## MODEL CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> logger.info("***** Configing Model *****") config.model.params.logdir = workdir model = instantiate_from_config(config.model) ## load checkpoints model = load_checkpoints(model, config.model) ## register_schedule again to make ZTSNR work if model.rescale_betas_zero_snr: model.register_schedule(given_betas=model.given_betas, beta_schedule=model.beta_schedule, timesteps=model.timesteps, linear_start=model.linear_start, linear_end=model.linear_end, cosine_s=model.cosine_s) ## update trainer config for k in get_nondefault_trainer_args(args): trainer_config[k] = getattr(args, k) num_nodes = trainer_config.num_nodes ngpu_per_node = trainer_config.devices logger.info(f"Running on {num_rank}={num_nodes}x{ngpu_per_node} GPUs") ## setup learning rate base_lr = config.model.base_learning_rate bs = config.data.params.batch_size if getattr(config.model, 'scale_lr', True): model.learning_rate = num_rank * bs * base_lr else: model.learning_rate = base_lr ## DATA CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> logger.info("***** Configing Data *****") data = instantiate_from_config(config.data) data.setup() for k in data.datasets: logger.info(f"{k}, {data.datasets[k].__class__.__name__}, {len(data.datasets[k])}") ## TRAINER CONFIG >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> logger.info("***** Configing Trainer *****") if "accelerator" not in trainer_config: trainer_config["accelerator"] = "gpu" ## setup trainer args: pl-logger and callbacks trainer_kwargs = dict() trainer_kwargs["num_sanity_val_steps"] = 0 logger_cfg = get_trainer_logger(lightning_config, workdir, args.debug) trainer_kwargs["logger"] = instantiate_from_config(logger_cfg) ## setup callbacks callbacks_cfg = get_trainer_callbacks(lightning_config, config, workdir, ckptdir, logger) trainer_kwargs["callbacks"] = [instantiate_from_config(callbacks_cfg[k]) for k in callbacks_cfg] strategy_cfg = get_trainer_strategy(lightning_config) trainer_kwargs["strategy"] = strategy_cfg if type(strategy_cfg) == str else instantiate_from_config(strategy_cfg) trainer_kwargs['precision'] = lightning_config.get('precision', 32) trainer_kwargs["sync_batchnorm"] = False ## trainer config: others trainer_args = argparse.Namespace(**trainer_config) trainer = Trainer.from_argparse_args(trainer_args, **trainer_kwargs) ## allow checkpointing via USR1 def melk(*args, **kwargs): ## run all checkpoint hooks if trainer.global_rank == 0: print("Summoning checkpoint.") ckpt_path = os.path.join(ckptdir, "last_summoning.ckpt") trainer.save_checkpoint(ckpt_path) def divein(*args, **kwargs): if trainer.global_rank == 0: import pudb; pudb.set_trace() import signal signal.signal(signal.SIGUSR1, melk) signal.signal(signal.SIGUSR2, divein) ## Running LOOP >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> logger.info("***** Running the Loop *****") if args.train: try: if "strategy" in lightning_config and lightning_config['strategy'].startswith('deepspeed'): logger.info("") ## deepspeed if trainer_kwargs['precision'] == 16: with torch.cuda.amp.autocast(): trainer.fit(model, data) else: trainer.fit(model, data) else: logger.info("") ## this is default ## ddpsharded trainer.fit(model, data) except Exception: #melk() raise # if args.val: # trainer.validate(model, data) # if args.test or not trainer.interrupted: # trainer.test(model, data) ================================================ FILE: ToonCrafter/main/utils_data.py ================================================ from functools import partial import numpy as np import torch import pytorch_lightning as pl from torch.utils.data import DataLoader, Dataset import os, sys os.chdir(sys.path[0]) sys.path.append("..") from lvdm.data.base import Txt2ImgIterableBaseDataset from ToonCrafter.utils.utils import instantiate_from_config def worker_init_fn(_): worker_info = torch.utils.data.get_worker_info() dataset = worker_info.dataset worker_id = worker_info.id if isinstance(dataset, Txt2ImgIterableBaseDataset): split_size = dataset.num_records // worker_info.num_workers # reset num_records to the true number to retain reliable length information dataset.sample_ids = dataset.valid_ids[worker_id * split_size:(worker_id + 1) * split_size] current_id = np.random.choice(len(np.random.get_state()[1]), 1) return np.random.seed(np.random.get_state()[1][current_id] + worker_id) else: return np.random.seed(np.random.get_state()[1][0] + worker_id) class WrappedDataset(Dataset): """Wraps an arbitrary object with __len__ and __getitem__ into a pytorch dataset""" def __init__(self, dataset): self.data = dataset def __len__(self): return len(self.data) def __getitem__(self, idx): return self.data[idx] class DataModuleFromConfig(pl.LightningDataModule): def __init__(self, batch_size, train=None, validation=None, test=None, predict=None, wrap=False, num_workers=None, shuffle_test_loader=False, use_worker_init_fn=False, shuffle_val_dataloader=False, train_img=None, test_max_n_samples=None): super().__init__() self.batch_size = batch_size self.dataset_configs = dict() self.num_workers = num_workers if num_workers is not None else batch_size * 2 self.use_worker_init_fn = use_worker_init_fn if train is not None: self.dataset_configs["train"] = train self.train_dataloader = self._train_dataloader if validation is not None: self.dataset_configs["validation"] = validation self.val_dataloader = partial(self._val_dataloader, shuffle=shuffle_val_dataloader) if test is not None: self.dataset_configs["test"] = test self.test_dataloader = partial(self._test_dataloader, shuffle=shuffle_test_loader) if predict is not None: self.dataset_configs["predict"] = predict self.predict_dataloader = self._predict_dataloader self.img_loader = None self.wrap = wrap self.test_max_n_samples = test_max_n_samples self.collate_fn = None def prepare_data(self): pass def setup(self, stage=None): self.datasets = dict((k, instantiate_from_config(self.dataset_configs[k])) for k in self.dataset_configs) if self.wrap: for k in self.datasets: self.datasets[k] = WrappedDataset(self.datasets[k]) def _train_dataloader(self): is_iterable_dataset = isinstance(self.datasets['train'], Txt2ImgIterableBaseDataset) if is_iterable_dataset or self.use_worker_init_fn: init_fn = worker_init_fn else: init_fn = None loader = DataLoader(self.datasets["train"], batch_size=self.batch_size, num_workers=self.num_workers, shuffle=False if is_iterable_dataset else True, worker_init_fn=init_fn, collate_fn=self.collate_fn, ) return loader def _val_dataloader(self, shuffle=False): if isinstance(self.datasets['validation'], Txt2ImgIterableBaseDataset) or self.use_worker_init_fn: init_fn = worker_init_fn else: init_fn = None return DataLoader(self.datasets["validation"], batch_size=self.batch_size, num_workers=self.num_workers, worker_init_fn=init_fn, shuffle=shuffle, collate_fn=self.collate_fn, ) def _test_dataloader(self, shuffle=False): try: is_iterable_dataset = isinstance(self.datasets['train'], Txt2ImgIterableBaseDataset) except: is_iterable_dataset = isinstance(self.datasets['test'], Txt2ImgIterableBaseDataset) if is_iterable_dataset or self.use_worker_init_fn: init_fn = worker_init_fn else: init_fn = None # do not shuffle dataloader for iterable dataset shuffle = shuffle and (not is_iterable_dataset) if self.test_max_n_samples is not None: dataset = torch.utils.data.Subset(self.datasets["test"], list(range(self.test_max_n_samples))) else: dataset = self.datasets["test"] return DataLoader(dataset, batch_size=self.batch_size, num_workers=self.num_workers, worker_init_fn=init_fn, shuffle=shuffle, collate_fn=self.collate_fn, ) def _predict_dataloader(self, shuffle=False): if isinstance(self.datasets['predict'], Txt2ImgIterableBaseDataset) or self.use_worker_init_fn: init_fn = worker_init_fn else: init_fn = None return DataLoader(self.datasets["predict"], batch_size=self.batch_size, num_workers=self.num_workers, worker_init_fn=init_fn, collate_fn=self.collate_fn, ) ================================================ FILE: ToonCrafter/main/utils_train.py ================================================ import os, re from omegaconf import OmegaConf import logging mainlogger = logging.getLogger('mainlogger') import torch from collections import OrderedDict def init_workspace(name, logdir, model_config, lightning_config, rank=0): workdir = os.path.join(logdir, name) ckptdir = os.path.join(workdir, "checkpoints") cfgdir = os.path.join(workdir, "configs") loginfo = os.path.join(workdir, "loginfo") # Create logdirs and save configs (all ranks will do to avoid missing directory error if rank:0 is slower) os.makedirs(workdir, exist_ok=True) os.makedirs(ckptdir, exist_ok=True) os.makedirs(cfgdir, exist_ok=True) os.makedirs(loginfo, exist_ok=True) if rank == 0: if "callbacks" in lightning_config and 'metrics_over_trainsteps_checkpoint' in lightning_config.callbacks: os.makedirs(os.path.join(ckptdir, 'trainstep_checkpoints'), exist_ok=True) OmegaConf.save(model_config, os.path.join(cfgdir, "model.yaml")) OmegaConf.save(OmegaConf.create({"lightning": lightning_config}), os.path.join(cfgdir, "lightning.yaml")) return workdir, ckptdir, cfgdir, loginfo def check_config_attribute(config, name): if name in config: value = getattr(config, name) return value else: return None def get_trainer_callbacks(lightning_config, config, logdir, ckptdir, logger): default_callbacks_cfg = { "model_checkpoint": { "target": "pytorch_lightning.callbacks.ModelCheckpoint", "params": { "dirpath": ckptdir, "filename": "{epoch}", "verbose": True, "save_last": False, } }, "batch_logger": { "target": "callbacks.ImageLogger", "params": { "save_dir": logdir, "batch_frequency": 1000, "max_images": 4, "clamp": True, } }, "learning_rate_logger": { "target": "pytorch_lightning.callbacks.LearningRateMonitor", "params": { "logging_interval": "step", "log_momentum": False } }, "cuda_callback": { "target": "callbacks.CUDACallback" }, } ## optional setting for saving checkpoints monitor_metric = check_config_attribute(config.model.params, "monitor") if monitor_metric is not None: mainlogger.info(f"Monitoring {monitor_metric} as checkpoint metric.") default_callbacks_cfg["model_checkpoint"]["params"]["monitor"] = monitor_metric default_callbacks_cfg["model_checkpoint"]["params"]["save_top_k"] = 3 default_callbacks_cfg["model_checkpoint"]["params"]["mode"] = "min" if 'metrics_over_trainsteps_checkpoint' in lightning_config.callbacks: mainlogger.info('Caution: Saving checkpoints every n train steps without deleting. This might require some free space.') default_metrics_over_trainsteps_ckpt_dict = { 'metrics_over_trainsteps_checkpoint': {"target": 'pytorch_lightning.callbacks.ModelCheckpoint', 'params': { "dirpath": os.path.join(ckptdir, 'trainstep_checkpoints'), "filename": "{epoch}-{step}", "verbose": True, 'save_top_k': -1, 'every_n_train_steps': 10000, 'save_weights_only': True } } } default_callbacks_cfg.update(default_metrics_over_trainsteps_ckpt_dict) if "callbacks" in lightning_config: callbacks_cfg = lightning_config.callbacks else: callbacks_cfg = OmegaConf.create() callbacks_cfg = OmegaConf.merge(default_callbacks_cfg, callbacks_cfg) return callbacks_cfg def get_trainer_logger(lightning_config, logdir, on_debug): default_logger_cfgs = { "tensorboard": { "target": "pytorch_lightning.loggers.TensorBoardLogger", "params": { "save_dir": logdir, "name": "tensorboard", } }, "testtube": { "target": "pytorch_lightning.loggers.CSVLogger", "params": { "name": "testtube", "save_dir": logdir, } }, } os.makedirs(os.path.join(logdir, "tensorboard"), exist_ok=True) default_logger_cfg = default_logger_cfgs["tensorboard"] if "logger" in lightning_config: logger_cfg = lightning_config.logger else: logger_cfg = OmegaConf.create() logger_cfg = OmegaConf.merge(default_logger_cfg, logger_cfg) return logger_cfg def get_trainer_strategy(lightning_config): default_strategy_dict = { "target": "pytorch_lightning.strategies.DDPShardedStrategy" } if "strategy" in lightning_config: strategy_cfg = lightning_config.strategy return strategy_cfg else: strategy_cfg = OmegaConf.create() strategy_cfg = OmegaConf.merge(default_strategy_dict, strategy_cfg) return strategy_cfg def load_checkpoints(model, model_cfg): if check_config_attribute(model_cfg, "pretrained_checkpoint"): pretrained_ckpt = model_cfg.pretrained_checkpoint assert os.path.exists(pretrained_ckpt), "Error: Pre-trained checkpoint NOT found at:%s"%pretrained_ckpt mainlogger.info(">>> Load weights from pretrained checkpoint") pl_sd = torch.load(pretrained_ckpt, map_location="cpu") try: if 'state_dict' in pl_sd.keys(): model.load_state_dict(pl_sd["state_dict"], strict=True) mainlogger.info(">>> Loaded weights from pretrained checkpoint: %s"%pretrained_ckpt) else: # deepspeed new_pl_sd = OrderedDict() for key in pl_sd['module'].keys(): new_pl_sd[key[16:]]=pl_sd['module'][key] model.load_state_dict(new_pl_sd, strict=True) except: model.load_state_dict(pl_sd) else: mainlogger.info(">>> Start training from scratch") return model def set_logger(logfile, name='mainlogger'): logger = logging.getLogger(name) logger.setLevel(logging.INFO) fh = logging.FileHandler(logfile, mode='w') fh.setLevel(logging.INFO) ch = logging.StreamHandler() ch.setLevel(logging.DEBUG) fh.setFormatter(logging.Formatter("%(asctime)s-%(levelname)s: %(message)s")) ch.setFormatter(logging.Formatter("%(message)s")) logger.addHandler(fh) logger.addHandler(ch) return logger ================================================ FILE: ToonCrafter/prompts/512_interp/prompts.txt ================================================ walking man an anime scene an anime scene ================================================ FILE: ToonCrafter/requirements.txt ================================================ decord==0.6.0 einops==0.3.0 imageio==2.9.0 numpy==1.24.2 omegaconf==2.1.1 opencv_python pandas==2.0.0 Pillow==9.5.0 pytorch_lightning==1.9.3 PyYAML==6.0 setuptools==65.6.3 torch==2.0.0 torchvision tqdm==4.65.0 transformers==4.25.1 moviepy av xformers gradio timm scikit-learn open_clip_torch==2.22.0 kornia ================================================ FILE: ToonCrafter/scripts/evaluation/ddp_wrapper.py ================================================ import datetime import argparse, importlib from pytorch_lightning import seed_everything import torch import torch.distributed as dist def setup_dist(local_rank): if dist.is_initialized(): return torch.cuda.set_device(local_rank) torch.distributed.init_process_group('nccl', init_method='env://') def get_dist_info(): if dist.is_available(): initialized = dist.is_initialized() else: initialized = False if initialized: rank = dist.get_rank() world_size = dist.get_world_size() else: rank = 0 world_size = 1 return rank, world_size if __name__ == '__main__': now = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S") parser = argparse.ArgumentParser() parser.add_argument("--module", type=str, help="module name", default="inference") parser.add_argument("--local_rank", type=int, nargs="?", help="for ddp", default=0) args, unknown = parser.parse_known_args() inference_api = importlib.import_module(args.module, package=None) inference_parser = inference_api.get_parser() inference_args, unknown = inference_parser.parse_known_args() seed_everything(inference_args.seed) setup_dist(args.local_rank) torch.backends.cudnn.benchmark = True rank, gpu_num = get_dist_info() # inference_args.savedir = inference_args.savedir+str('_seed')+str(inference_args.seed) print("@DynamiCrafter Inference [rank%d]: %s"%(rank, now)) inference_api.run_inference(inference_args, gpu_num, rank) ================================================ FILE: ToonCrafter/scripts/evaluation/funcs.py ================================================ import os import sys import glob import numpy as np from collections import OrderedDict from decord import VideoReader, cpu import cv2 import torch import torchvision sys.path.insert(1, os.path.join(sys.path[0], '..', '..')) from lvdm.models.samplers.ddim import DDIMSampler from lvdm.models.ddpm3d import LatentDiffusion from einops import rearrange def batch_ddim_sampling(model: LatentDiffusion, cond, noise_shape, n_samples=1, ddim_steps=50, ddim_eta=1.0, cfg_scale=1.0, hs=None, temporal_cfg_scale=None, **kwargs): ddim_sampler = DDIMSampler(model) uncond_type = model.uncond_type batch_size = noise_shape[0] fs = cond["fs"] del cond["fs"] if noise_shape[-1] == 32: timestep_spacing = "uniform" guidance_rescale = 0.0 else: timestep_spacing = "uniform_trailing" guidance_rescale = 0.7 # construct unconditional guidance if cfg_scale != 1.0: if uncond_type == "empty_seq": prompts = batch_size * [""] # prompts = N * T * [""] ## if is_imgbatch=True uc_emb = model.get_learned_conditioning(prompts) elif uncond_type == "zero_embed": c_emb = cond["c_crossattn"][0] if isinstance(cond, dict) else cond uc_emb = torch.zeros_like(c_emb) # process image embedding token if hasattr(model, 'embedder'): uc_img = torch.zeros(noise_shape[0], 3, 224, 224).to(model.device) if uc_img.dtype != model.dtype: uc_img = uc_img.to(model.dtype) # img: b c h w >> b l c uc_img = model.embedder(uc_img) uc_img = model.image_proj_model(uc_img) uc_emb = torch.cat([uc_emb, uc_img], dim=1) if isinstance(cond, dict): uc = {key: cond[key] for key in cond.keys()} uc.update({'c_crossattn': [uc_emb]}) else: uc = uc_emb else: uc = None additional_decode_kwargs = {'ref_context': hs} x_T = None batch_variants = [] for _ in range(n_samples): if ddim_sampler is not None: kwargs.update({"clean_cond": True}) samples, _ = ddim_sampler.sample(S=ddim_steps, conditioning=cond, batch_size=noise_shape[0], shape=noise_shape[1:], verbose=False, unconditional_guidance_scale=cfg_scale, unconditional_conditioning=uc, eta=ddim_eta, temporal_length=noise_shape[2], conditional_guidance_scale_temporal=temporal_cfg_scale, x_T=x_T, fs=fs, precision=16 if model.dtype == torch.float16 else 32, timestep_spacing=timestep_spacing, guidance_rescale=guidance_rescale, **kwargs ) # reconstruct from latent to pixel space batch_images = model.decode_first_stage(samples, **additional_decode_kwargs) index = list(range(samples.shape[2])) del index[1] del index[-2] samples = samples[:, :, index, :, :] # reconstruct from latent to pixel space batch_images_middle = model.decode_first_stage(samples, **additional_decode_kwargs) batch_images[:, :, batch_images.shape[2] // 2 - 1:batch_images.shape[2] // 2 + 1] = batch_images_middle[:, :, batch_images.shape[2] // 2 - 2:batch_images.shape[2] // 2] batch_variants.append(batch_images) # batch, , c, t, h, w batch_variants = torch.stack(batch_variants, dim=1) return batch_variants def get_filelist(data_dir, ext='*'): file_list = glob.glob(os.path.join(data_dir, '*.%s' % ext)) file_list.sort() return file_list def get_dirlist(path): list = [] if (os.path.exists(path)): files = os.listdir(path) for file in files: m = os.path.join(path, file) if (os.path.isdir(m)): list.append(m) list.sort() return list def load_model_checkpoint(model, ckpt): def load_checkpoint(model, ckpt, full_strict): try: state_dict = torch.load(ckpt, map_location="cpu") except BaseException: # Read fp16 version from safetensors.torch import load_file state_dict = load_file(ckpt) if "state_dict" in list(state_dict.keys()): state_dict = state_dict["state_dict"] try: model.load_state_dict(state_dict, strict=full_strict) except BaseException: # rename the keys for 256x256 model new_pl_sd = OrderedDict() for k, v in state_dict.items(): new_pl_sd[k] = v for k in list(new_pl_sd.keys()): if "framestride_embed" in k: new_key = k.replace("framestride_embed", "fps_embedding") new_pl_sd[new_key] = new_pl_sd[k] del new_pl_sd[k] model.load_state_dict(new_pl_sd, strict=full_strict) # No module key in state_dict # else: # # deepspeed # new_pl_sd = OrderedDict() # for key in state_dict['module'].keys(): # new_pl_sd[key[16:]] = state_dict['module'][key] # model.load_state_dict(new_pl_sd, strict=full_strict) return model load_checkpoint(model, ckpt, full_strict=True) print('>>> model checkpoint loaded.') return model def load_prompts(prompt_file): f = open(prompt_file, 'r') prompt_list = [] for idx, line in enumerate(f.readlines()): l = line.strip() if len(l) != 0: prompt_list.append(l) f.close() return prompt_list def load_video_batch(filepath_list, frame_stride, video_size=(256, 256), video_frames=16): ''' Notice about some special cases: 1. video_frames=-1 means to take all the frames (with fs=1) 2. when the total video frames is less than required, padding strategy will be used (repeated last frame) ''' fps_list = [] batch_tensor = [] assert frame_stride > 0, "valid frame stride should be a positive interge!" for filepath in filepath_list: padding_num = 0 vidreader = VideoReader(filepath, ctx=cpu(0), width=video_size[1], height=video_size[0]) fps = vidreader.get_avg_fps() total_frames = len(vidreader) max_valid_frames = (total_frames - 1) // frame_stride + 1 if video_frames < 0: # all frames are collected: fs=1 is a must required_frames = total_frames frame_stride = 1 else: required_frames = video_frames query_frames = min(required_frames, max_valid_frames) frame_indices = [frame_stride * i for i in range(query_frames)] # [t,h,w,c] -> [c,t,h,w] frames = vidreader.get_batch(frame_indices) frame_tensor = torch.tensor(frames.asnumpy()).permute(3, 0, 1, 2).float() frame_tensor = (frame_tensor / 255. - 0.5) * 2 if max_valid_frames < required_frames: padding_num = required_frames - max_valid_frames frame_tensor = torch.cat([frame_tensor, *([frame_tensor[:, -1:, :, :]] * padding_num)], dim=1) print(f'{os.path.split(filepath)[1]} is not long enough: {padding_num} frames padded.') batch_tensor.append(frame_tensor) sample_fps = int(fps / frame_stride) fps_list.append(sample_fps) return torch.stack(batch_tensor, dim=0) from PIL import Image def load_image_batch(filepath_list, image_size=(256, 256)): batch_tensor = [] for filepath in filepath_list: _, filename = os.path.split(filepath) _, ext = os.path.splitext(filename) if ext == '.mp4': vidreader = VideoReader(filepath, ctx=cpu(0), width=image_size[1], height=image_size[0]) frame = vidreader.get_batch([0]) img_tensor = torch.tensor(frame.asnumpy()).squeeze(0).permute(2, 0, 1).float() elif ext == '.png' or ext == '.jpg': img = Image.open(filepath).convert("RGB") rgb_img = np.array(img, np.float32) # bgr_img = cv2.imread(filepath, cv2.IMREAD_COLOR) # bgr_img = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2RGB) rgb_img = cv2.resize(rgb_img, (image_size[1], image_size[0]), interpolation=cv2.INTER_LINEAR) img_tensor = torch.from_numpy(rgb_img).permute(2, 0, 1).float() else: print(f'ERROR: <{ext}> image loading only support format: [mp4], [png], [jpg]') raise NotImplementedError img_tensor = (img_tensor / 255. - 0.5) * 2 batch_tensor.append(img_tensor) return torch.stack(batch_tensor, dim=0) def save_videos(batch_tensors, savedir, filenames, fps=10): # b,samples,c,t,h,w n_samples = batch_tensors.shape[1] for idx, vid_tensor in enumerate(batch_tensors): video = vid_tensor.detach().cpu() video = torch.clamp(video.float(), -1., 1.) video = video.permute(2, 0, 1, 3, 4) # t,n,c,h,w frame_grids = [torchvision.utils.make_grid(framesheet, nrow=int(n_samples)) for framesheet in video] # [3, 1*h, n*w] grid = torch.stack(frame_grids, dim=0) # stack in temporal dim [t, 3, n*h, w] grid = (grid + 1.0) / 2.0 grid = (grid * 255).to(torch.uint8).permute(0, 2, 3, 1) savepath = os.path.join(savedir, f"{filenames[idx]}.mp4") torchvision.io.write_video(savepath, grid, fps=fps, video_codec='h264', options={'crf': '10'}) def get_latent_z(model, videos): b, c, t, h, w = videos.shape x = rearrange(videos, 'b c t h w -> (b t) c h w') z = model.encode_first_stage(x) z = rearrange(z, '(b t) c h w -> b c t h w', b=b, t=t) return z ================================================ FILE: ToonCrafter/scripts/evaluation/inference.py ================================================ import argparse, os, sys, glob import datetime, time from omegaconf import OmegaConf from tqdm import tqdm from einops import rearrange, repeat from collections import OrderedDict import torch import torchvision import torchvision.transforms as transforms from pytorch_lightning import seed_everything from PIL import Image sys.path.insert(1, os.path.join(sys.path[0], '..', '..')) from lvdm.models.samplers.ddim import DDIMSampler from lvdm.models.samplers.ddim_multiplecond import DDIMSampler as DDIMSampler_multicond from ToonCrafter.utils.utils import instantiate_from_config def get_filelist(data_dir, postfixes): patterns = [os.path.join(data_dir, f"*.{postfix}") for postfix in postfixes] file_list = [] for pattern in patterns: file_list.extend(glob.glob(pattern)) file_list.sort() return file_list def load_model_checkpoint(model, ckpt): state_dict = torch.load(ckpt, map_location="cpu") if "state_dict" in list(state_dict.keys()): state_dict = state_dict["state_dict"] try: model.load_state_dict(state_dict, strict=True) except: ## rename the keys for 256x256 model new_pl_sd = OrderedDict() for k,v in state_dict.items(): new_pl_sd[k] = v for k in list(new_pl_sd.keys()): if "framestride_embed" in k: new_key = k.replace("framestride_embed", "fps_embedding") new_pl_sd[new_key] = new_pl_sd[k] del new_pl_sd[k] model.load_state_dict(new_pl_sd, strict=True) else: # deepspeed new_pl_sd = OrderedDict() for key in state_dict['module'].keys(): new_pl_sd[key[16:]]=state_dict['module'][key] model.load_state_dict(new_pl_sd) print('>>> model checkpoint loaded.') return model def load_prompts(prompt_file): f = open(prompt_file, 'r') prompt_list = [] for idx, line in enumerate(f.readlines()): l = line.strip() if len(l) != 0: prompt_list.append(l) f.close() return prompt_list def load_data_prompts(data_dir, video_size=(256,256), video_frames=16, interp=False): transform = transforms.Compose([ transforms.Resize(min(video_size)), transforms.CenterCrop(video_size), transforms.ToTensor(), transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))]) ## load prompts prompt_file = get_filelist(data_dir, ['txt']) assert len(prompt_file) > 0, "Error: found NO prompt file!" ###### default prompt default_idx = 0 default_idx = min(default_idx, len(prompt_file)-1) if len(prompt_file) > 1: print(f"Warning: multiple prompt files exist. The one {os.path.split(prompt_file[default_idx])[1]} is used.") ## only use the first one (sorted by name) if multiple exist ## load video file_list = get_filelist(data_dir, ['jpg', 'png', 'jpeg', 'JPEG', 'PNG']) # assert len(file_list) == n_samples, "Error: data and prompts are NOT paired!" data_list = [] filename_list = [] prompt_list = load_prompts(prompt_file[default_idx]) n_samples = len(prompt_list) for idx in range(n_samples): if interp: image1 = Image.open(file_list[2*idx]).convert('RGB') image_tensor1 = transform(image1).unsqueeze(1) # [c,1,h,w] image2 = Image.open(file_list[2*idx+1]).convert('RGB') image_tensor2 = transform(image2).unsqueeze(1) # [c,1,h,w] frame_tensor1 = repeat(image_tensor1, 'c t h w -> c (repeat t) h w', repeat=video_frames//2) frame_tensor2 = repeat(image_tensor2, 'c t h w -> c (repeat t) h w', repeat=video_frames//2) frame_tensor = torch.cat([frame_tensor1, frame_tensor2], dim=1) _, filename = os.path.split(file_list[idx*2]) else: image = Image.open(file_list[idx]).convert('RGB') image_tensor = transform(image).unsqueeze(1) # [c,1,h,w] frame_tensor = repeat(image_tensor, 'c t h w -> c (repeat t) h w', repeat=video_frames) _, filename = os.path.split(file_list[idx]) data_list.append(frame_tensor) filename_list.append(filename) return filename_list, data_list, prompt_list def save_results(prompt, samples, filename, fakedir, fps=8, loop=False): filename = filename.split('.')[0]+'.mp4' prompt = prompt[0] if isinstance(prompt, list) else prompt ## save video videos = [samples] savedirs = [fakedir] for idx, video in enumerate(videos): if video is None: continue # b,c,t,h,w video = video.detach().cpu() video = torch.clamp(video.float(), -1., 1.) n = video.shape[0] video = video.permute(2, 0, 1, 3, 4) # t,n,c,h,w if loop: video = video[:-1,...] frame_grids = [torchvision.utils.make_grid(framesheet, nrow=int(n), padding=0) for framesheet in video] #[3, 1*h, n*w] grid = torch.stack(frame_grids, dim=0) # stack in temporal dim [t, 3, h, n*w] grid = (grid + 1.0) / 2.0 grid = (grid * 255).to(torch.uint8).permute(0, 2, 3, 1) path = os.path.join(savedirs[idx], filename) torchvision.io.write_video(path, grid, fps=fps, video_codec='h264', options={'crf': '10'}) ## crf indicates the quality def save_results_seperate(prompt, samples, filename, fakedir, fps=10, loop=False): prompt = prompt[0] if isinstance(prompt, list) else prompt ## save video videos = [samples] savedirs = [fakedir] for idx, video in enumerate(videos): if video is None: continue # b,c,t,h,w video = video.detach().cpu() if loop: # remove the last frame video = video[:,:,:-1,...] video = torch.clamp(video.float(), -1., 1.) n = video.shape[0] for i in range(n): grid = video[i,...] grid = (grid + 1.0) / 2.0 grid = (grid * 255).to(torch.uint8).permute(1, 2, 3, 0) #thwc path = os.path.join(savedirs[idx].replace('samples', 'samples_separate'), f'{filename.split(".")[0]}_sample{i}.mp4') torchvision.io.write_video(path, grid, fps=fps, video_codec='h264', options={'crf': '10'}) def get_latent_z(model, videos): b, c, t, h, w = videos.shape x = rearrange(videos, 'b c t h w -> (b t) c h w') z = model.encode_first_stage(x) z = rearrange(z, '(b t) c h w -> b c t h w', b=b, t=t) return z def get_latent_z_with_hidden_states(model, videos): b, c, t, h, w = videos.shape x = rearrange(videos, 'b c t h w -> (b t) c h w') encoder_posterior, hidden_states = model.first_stage_model.encode(x, return_hidden_states=True) hidden_states_first_last = [] ### use only the first and last hidden states for hid in hidden_states: hid = rearrange(hid, '(b t) c h w -> b c t h w', t=t) hid_new = torch.cat([hid[:, :, 0:1], hid[:, :, -1:]], dim=2) hidden_states_first_last.append(hid_new) z = model.get_first_stage_encoding(encoder_posterior).detach() z = rearrange(z, '(b t) c h w -> b c t h w', b=b, t=t) return z, hidden_states_first_last def image_guided_synthesis(model, prompts, videos, noise_shape, n_samples=1, ddim_steps=50, ddim_eta=1., \ unconditional_guidance_scale=1.0, cfg_img=None, fs=None, text_input=False, multiple_cond_cfg=False, loop=False, interp=False, timestep_spacing='uniform', guidance_rescale=0.0, **kwargs): ddim_sampler = DDIMSampler(model) if not multiple_cond_cfg else DDIMSampler_multicond(model) batch_size = noise_shape[0] fs = torch.tensor([fs] * batch_size, dtype=torch.long, device=model.device) if not text_input: prompts = [""]*batch_size img = videos[:,:,0] #bchw img_emb = model.embedder(img) ## blc img_emb = model.image_proj_model(img_emb) cond_emb = model.get_learned_conditioning(prompts) cond = {"c_crossattn": [torch.cat([cond_emb,img_emb], dim=1)]} if model.model.conditioning_key == 'hybrid': z, hs = get_latent_z_with_hidden_states(model, videos) # b c t h w if loop or interp: img_cat_cond = torch.zeros_like(z) img_cat_cond[:,:,0,:,:] = z[:,:,0,:,:] img_cat_cond[:,:,-1,:,:] = z[:,:,-1,:,:] else: img_cat_cond = z[:,:,:1,:,:] img_cat_cond = repeat(img_cat_cond, 'b c t h w -> b c (repeat t) h w', repeat=z.shape[2]) cond["c_concat"] = [img_cat_cond] # b c 1 h w if unconditional_guidance_scale != 1.0: if model.uncond_type == "empty_seq": prompts = batch_size * [""] uc_emb = model.get_learned_conditioning(prompts) elif model.uncond_type == "zero_embed": uc_emb = torch.zeros_like(cond_emb) uc_img_emb = model.embedder(torch.zeros_like(img)) ## b l c uc_img_emb = model.image_proj_model(uc_img_emb) uc = {"c_crossattn": [torch.cat([uc_emb,uc_img_emb],dim=1)]} if model.model.conditioning_key == 'hybrid': uc["c_concat"] = [img_cat_cond] else: uc = None additional_decode_kwargs = {'ref_context': hs} ## we need one more unconditioning image=yes, text="" if multiple_cond_cfg and cfg_img != 1.0: uc_2 = {"c_crossattn": [torch.cat([uc_emb,img_emb],dim=1)]} if model.model.conditioning_key == 'hybrid': uc_2["c_concat"] = [img_cat_cond] kwargs.update({"unconditional_conditioning_img_nonetext": uc_2}) else: kwargs.update({"unconditional_conditioning_img_nonetext": None}) z0 = None cond_mask = None batch_variants = [] for _ in range(n_samples): if z0 is not None: cond_z0 = z0.clone() kwargs.update({"clean_cond": True}) else: cond_z0 = None if ddim_sampler is not None: samples, _ = ddim_sampler.sample(S=ddim_steps, conditioning=cond, batch_size=batch_size, shape=noise_shape[1:], verbose=False, unconditional_guidance_scale=unconditional_guidance_scale, unconditional_conditioning=uc, eta=ddim_eta, cfg_img=cfg_img, mask=cond_mask, x0=cond_z0, fs=fs, timestep_spacing=timestep_spacing, guidance_rescale=guidance_rescale, **kwargs ) ## reconstruct from latent to pixel space batch_images = model.decode_first_stage(samples, **additional_decode_kwargs) index = list(range(samples.shape[2])) del index[1] del index[-2] samples = samples[:, :, index, :, :] ## reconstruct from latent to pixel space batch_images_middle = model.decode_first_stage(samples, **additional_decode_kwargs) batch_images[:, :, batch_images.shape[2] // 2 - 1:batch_images.shape[2] // 2 + 1] = batch_images_middle[:, :, batch_images.shape[2] // 2 - 2:batch_images.shape[2] // 2] batch_variants.append(batch_images) ## variants, batch, c, t, h, w batch_variants = torch.stack(batch_variants) return batch_variants.permute(1, 0, 2, 3, 4, 5) def run_inference(args, gpu_num, gpu_no): ## model config config = OmegaConf.load(args.config) model_config = config.pop("model", OmegaConf.create()) ## set use_checkpoint as False as when using deepspeed, it encounters an error "deepspeed backend not set" model_config['params']['unet_config']['params']['use_checkpoint'] = False model = instantiate_from_config(model_config) model = model.cuda(gpu_no) model.perframe_ae = args.perframe_ae assert os.path.exists(args.ckpt_path), "Error: checkpoint Not Found!" model = load_model_checkpoint(model, args.ckpt_path) model.eval() ## run over data assert (args.height % 16 == 0) and (args.width % 16 == 0), "Error: image size [h,w] should be multiples of 16!" assert args.bs == 1, "Current implementation only support [batch size = 1]!" ## latent noise shape h, w = args.height // 8, args.width // 8 channels = model.model.diffusion_model.out_channels n_frames = args.video_length print(f'Inference with {n_frames} frames') noise_shape = [args.bs, channels, n_frames, h, w] fakedir = os.path.join(args.savedir, "samples") fakedir_separate = os.path.join(args.savedir, "samples_separate") # os.makedirs(fakedir, exist_ok=True) os.makedirs(fakedir_separate, exist_ok=True) ## prompt file setting assert os.path.exists(args.prompt_dir), "Error: prompt file Not Found!" filename_list, data_list, prompt_list = load_data_prompts(args.prompt_dir, video_size=(args.height, args.width), video_frames=n_frames, interp=args.interp) num_samples = len(prompt_list) samples_split = num_samples // gpu_num print('Prompts testing [rank:%d] %d/%d samples loaded.'%(gpu_no, samples_split, num_samples)) #indices = random.choices(list(range(0, num_samples)), k=samples_per_device) indices = list(range(samples_split*gpu_no, samples_split*(gpu_no+1))) prompt_list_rank = [prompt_list[i] for i in indices] data_list_rank = [data_list[i] for i in indices] filename_list_rank = [filename_list[i] for i in indices] start = time.time() with torch.no_grad(), torch.cuda.amp.autocast(): for idx, indice in tqdm(enumerate(range(0, len(prompt_list_rank), args.bs)), desc='Sample Batch'): prompts = prompt_list_rank[indice:indice+args.bs] videos = data_list_rank[indice:indice+args.bs] filenames = filename_list_rank[indice:indice+args.bs] if isinstance(videos, list): videos = torch.stack(videos, dim=0).to("cuda") else: videos = videos.unsqueeze(0).to("cuda") batch_samples = image_guided_synthesis(model, prompts, videos, noise_shape, args.n_samples, args.ddim_steps, args.ddim_eta, \ args.unconditional_guidance_scale, args.cfg_img, args.frame_stride, args.text_input, args.multiple_cond_cfg, args.loop, args.interp, args.timestep_spacing, args.guidance_rescale) ## save each example individually for nn, samples in enumerate(batch_samples): ## samples : [n_samples,c,t,h,w] prompt = prompts[nn] filename = filenames[nn] # save_results(prompt, samples, filename, fakedir, fps=8, loop=args.loop) save_results_seperate(prompt, samples, filename, fakedir, fps=8, loop=args.loop) print(f"Saved in {args.savedir}. Time used: {(time.time() - start):.2f} seconds") def get_parser(): parser = argparse.ArgumentParser() parser.add_argument("--savedir", type=str, default=None, help="results saving path") parser.add_argument("--ckpt_path", type=str, default=None, help="checkpoint path") parser.add_argument("--config", type=str, help="config (yaml) path") parser.add_argument("--prompt_dir", type=str, default=None, help="a data dir containing videos and prompts") parser.add_argument("--n_samples", type=int, default=1, help="num of samples per prompt",) parser.add_argument("--ddim_steps", type=int, default=50, help="steps of ddim if positive, otherwise use DDPM",) parser.add_argument("--ddim_eta", type=float, default=1.0, help="eta for ddim sampling (0.0 yields deterministic sampling)",) parser.add_argument("--bs", type=int, default=1, help="batch size for inference, should be one") parser.add_argument("--height", type=int, default=512, help="image height, in pixel space") parser.add_argument("--width", type=int, default=512, help="image width, in pixel space") parser.add_argument("--frame_stride", type=int, default=3, help="frame stride control for 256 model (larger->larger motion), FPS control for 512 or 1024 model (smaller->larger motion)") parser.add_argument("--unconditional_guidance_scale", type=float, default=1.0, help="prompt classifier-free guidance") parser.add_argument("--seed", type=int, default=123, help="seed for seed_everything") parser.add_argument("--video_length", type=int, default=16, help="inference video length") parser.add_argument("--negative_prompt", action='store_true', default=False, help="negative prompt") parser.add_argument("--text_input", action='store_true', default=False, help="input text to I2V model or not") parser.add_argument("--multiple_cond_cfg", action='store_true', default=False, help="use multi-condition cfg or not") parser.add_argument("--cfg_img", type=float, default=None, help="guidance scale for image conditioning") parser.add_argument("--timestep_spacing", type=str, default="uniform", help="The way the timesteps should be scaled. Refer to Table 2 of the [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://huggingface.co/papers/2305.08891) for more information.") parser.add_argument("--guidance_rescale", type=float, default=0.0, help="guidance rescale in [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://huggingface.co/papers/2305.08891)") parser.add_argument("--perframe_ae", action='store_true', default=False, help="if we use per-frame AE decoding, set it to True to save GPU memory, especially for the model of 576x1024") ## currently not support looping video and generative frame interpolation parser.add_argument("--loop", action='store_true', default=False, help="generate looping videos or not") parser.add_argument("--interp", action='store_true', default=False, help="generate generative frame interpolation or not") return parser if __name__ == '__main__': now = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S") print("@DynamiCrafter cond-Inference: %s"%now) parser = get_parser() args = parser.parse_args() seed_everything(args.seed) rank, gpu_num = 0, 1 run_inference(args, gpu_num, rank) ================================================ FILE: ToonCrafter/scripts/gradio/i2v_test.py ================================================ import os import time from omegaconf import OmegaConf import torch from scripts.evaluation.funcs import load_model_checkpoint, save_videos, batch_ddim_sampling, get_latent_z from ToonCrafter.utils.utils import instantiate_from_config from huggingface_hub import hf_hub_download from einops import repeat import torchvision.transforms as transforms from pytorch_lightning import seed_everything class Image2Video(): def __init__(self,result_dir='./tmp/',gpu_num=1,resolution='256_256') -> None: self.resolution = (int(resolution.split('_')[0]), int(resolution.split('_')[1])) #hw self.download_model() self.result_dir = result_dir if not os.path.exists(self.result_dir): os.mkdir(self.result_dir) ckpt_path='checkpoints/dynamicrafter_'+resolution.split('_')[1]+'_v1/model.ckpt' config_file='configs/inference_'+resolution.split('_')[1]+'_v1.0.yaml' config = OmegaConf.load(config_file) model_config = config.pop("model", OmegaConf.create()) model_config['params']['unet_config']['params']['use_checkpoint']=False model_list = [] for gpu_id in range(gpu_num): model = instantiate_from_config(model_config) # model = model.cuda(gpu_id) assert os.path.exists(ckpt_path), "Error: checkpoint Not Found!" model = load_model_checkpoint(model, ckpt_path) model.eval() model_list.append(model) self.model_list = model_list self.save_fps = 8 def get_image(self, image, prompt, steps=50, cfg_scale=7.5, eta=1.0, fs=3, seed=123): seed_everything(seed) transform = transforms.Compose([ transforms.Resize(min(self.resolution)), transforms.CenterCrop(self.resolution), ]) torch.cuda.empty_cache() print('start:', prompt, time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time()))) start = time.time() gpu_id=0 if steps > 60: steps = 60 model = self.model_list[gpu_id] model = model.cuda() batch_size=1 channels = model.model.diffusion_model.out_channels frames = model.temporal_length h, w = self.resolution[0] // 8, self.resolution[1] // 8 noise_shape = [batch_size, channels, frames, h, w] # text cond with torch.no_grad(), torch.cuda.amp.autocast(): text_emb = model.get_learned_conditioning([prompt]) # img cond img_tensor = torch.from_numpy(image).permute(2, 0, 1).float().to(model.device) img_tensor = (img_tensor / 255. - 0.5) * 2 image_tensor_resized = transform(img_tensor) #3,h,w videos = image_tensor_resized.unsqueeze(0) # bchw z = get_latent_z(model, videos.unsqueeze(2)) #bc,1,hw img_tensor_repeat = repeat(z, 'b c t h w -> b c (repeat t) h w', repeat=frames) cond_images = model.embedder(img_tensor.unsqueeze(0)) ## blc img_emb = model.image_proj_model(cond_images) imtext_cond = torch.cat([text_emb, img_emb], dim=1) fs = torch.tensor([fs], dtype=torch.long, device=model.device) cond = {"c_crossattn": [imtext_cond], "fs": fs, "c_concat": [img_tensor_repeat]} ## inference batch_samples = batch_ddim_sampling(model, cond, noise_shape, n_samples=1, ddim_steps=steps, ddim_eta=eta, cfg_scale=cfg_scale) ## b,samples,c,t,h,w prompt_str = prompt.replace("/", "_slash_") if "/" in prompt else prompt prompt_str = prompt_str.replace(" ", "_") if " " in prompt else prompt_str prompt_str=prompt_str[:40] if len(prompt_str) == 0: prompt_str = 'empty_prompt' save_videos(batch_samples, self.result_dir, filenames=[prompt_str], fps=self.save_fps) print(f"Saved in {prompt_str}. Time used: {(time.time() - start):.2f} seconds") model = model.cpu() return os.path.join(self.result_dir, f"{prompt_str}.mp4") def download_model(self): REPO_ID = 'Doubiiu/DynamiCrafter_'+str(self.resolution[1]) if self.resolution[1]!=256 else 'Doubiiu/DynamiCrafter' filename_list = ['model.ckpt'] if not os.path.exists('./checkpoints/dynamicrafter_'+str(self.resolution[1])+'_v1/'): os.makedirs('./checkpoints/dynamicrafter_'+str(self.resolution[1])+'_v1/') for filename in filename_list: local_file = os.path.join('./checkpoints/dynamicrafter_'+str(self.resolution[1])+'_v1/', filename) if not os.path.exists(local_file): hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir='./checkpoints/dynamicrafter_'+str(self.resolution[1])+'_v1/', local_dir_use_symlinks=False) if __name__ == '__main__': i2v = Image2Video() video_path = i2v.get_image('prompts/art.png','man fishing in a boat at sunset') print('done', video_path) ================================================ FILE: ToonCrafter/scripts/gradio/i2v_test_application.py ================================================ import os import time from omegaconf import OmegaConf import torch from scripts.evaluation.funcs import load_model_checkpoint, save_videos, batch_ddim_sampling, get_latent_z from ToonCrafter.utils.utils import instantiate_from_config from huggingface_hub import hf_hub_download from einops import repeat import torchvision.transforms as transforms from pytorch_lightning import seed_everything from einops import rearrange class Image2Video(): def __init__(self,result_dir='./tmp/',gpu_num=1,resolution='256_256') -> None: self.resolution = (int(resolution.split('_')[0]), int(resolution.split('_')[1])) #hw self.download_model() self.result_dir = result_dir if not os.path.exists(self.result_dir): os.mkdir(self.result_dir) ckpt_path='checkpoints/tooncrafter_'+resolution.split('_')[1]+'_interp_v1/model.ckpt' config_file='configs/inference_'+resolution.split('_')[1]+'_v1.0.yaml' config = OmegaConf.load(config_file) model_config = config.pop("model", OmegaConf.create()) model_config['params']['unet_config']['params']['use_checkpoint']=False model_list = [] for gpu_id in range(gpu_num): model = instantiate_from_config(model_config) # model = model.cuda(gpu_id) print(ckpt_path) assert os.path.exists(ckpt_path), "Error: checkpoint Not Found!" model = load_model_checkpoint(model, ckpt_path) model.eval() model_list.append(model) self.model_list = model_list self.save_fps = 8 def get_image(self, image, prompt, steps=50, cfg_scale=7.5, eta=1.0, fs=3, seed=123, image2=None): seed_everything(seed) transform = transforms.Compose([ transforms.Resize(min(self.resolution)), transforms.CenterCrop(self.resolution), ]) torch.cuda.empty_cache() print('start:', prompt, time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time()))) start = time.time() gpu_id=0 if steps > 60: steps = 60 model = self.model_list[gpu_id] model = model.cuda() batch_size=1 channels = model.model.diffusion_model.out_channels frames = model.temporal_length h, w = self.resolution[0] // 8, self.resolution[1] // 8 noise_shape = [batch_size, channels, frames, h, w] # text cond with torch.no_grad(), torch.cuda.amp.autocast(): text_emb = model.get_learned_conditioning([prompt]) # img cond img_tensor = torch.from_numpy(image).permute(2, 0, 1).float().to(model.device) img_tensor = (img_tensor / 255. - 0.5) * 2 image_tensor_resized = transform(img_tensor) #3,h,w videos = image_tensor_resized.unsqueeze(0).unsqueeze(2) # bc1hw # z = get_latent_z(model, videos) #bc,1,hw videos = repeat(videos, 'b c t h w -> b c (repeat t) h w', repeat=frames//2) img_tensor2 = torch.from_numpy(image2).permute(2, 0, 1).float().to(model.device) img_tensor2 = (img_tensor2 / 255. - 0.5) * 2 image_tensor_resized2 = transform(img_tensor2) #3,h,w videos2 = image_tensor_resized2.unsqueeze(0).unsqueeze(2) # bchw videos2 = repeat(videos2, 'b c t h w -> b c (repeat t) h w', repeat=frames//2) videos = torch.cat([videos, videos2], dim=2) z, hs = self.get_latent_z_with_hidden_states(model, videos) img_tensor_repeat = torch.zeros_like(z) img_tensor_repeat[:,:,:1,:,:] = z[:,:,:1,:,:] img_tensor_repeat[:,:,-1:,:,:] = z[:,:,-1:,:,:] cond_images = model.embedder(img_tensor.unsqueeze(0)) ## blc img_emb = model.image_proj_model(cond_images) imtext_cond = torch.cat([text_emb, img_emb], dim=1) fs = torch.tensor([fs], dtype=torch.long, device=model.device) cond = {"c_crossattn": [imtext_cond], "fs": fs, "c_concat": [img_tensor_repeat]} ## inference batch_samples = batch_ddim_sampling(model, cond, noise_shape, n_samples=1, ddim_steps=steps, ddim_eta=eta, cfg_scale=cfg_scale, hs=hs) ## remove the last frame if image2 is None: batch_samples = batch_samples[:,:,:,:-1,...] ## b,samples,c,t,h,w prompt_str = prompt.replace("/", "_slash_") if "/" in prompt else prompt prompt_str = prompt_str.replace(" ", "_") if " " in prompt else prompt_str prompt_str=prompt_str[:40] if len(prompt_str) == 0: prompt_str = 'empty_prompt' save_videos(batch_samples, self.result_dir, filenames=[prompt_str], fps=self.save_fps) print(f"Saved in {prompt_str}. Time used: {(time.time() - start):.2f} seconds") model = model.cpu() return os.path.join(self.result_dir, f"{prompt_str}.mp4") def download_model(self): REPO_ID = 'Doubiiu/ToonCrafter' filename_list = ['model.ckpt'] if not os.path.exists('./checkpoints/tooncrafter_'+str(self.resolution[1])+'_interp_v1/'): os.makedirs('./checkpoints/tooncrafter_'+str(self.resolution[1])+'_interp_v1/') for filename in filename_list: local_file = os.path.join('./checkpoints/tooncrafter_'+str(self.resolution[1])+'_interp_v1/', filename) if not os.path.exists(local_file): hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir='./checkpoints/tooncrafter_'+str(self.resolution[1])+'_interp_v1/', local_dir_use_symlinks=False) def get_latent_z_with_hidden_states(self, model, videos): b, c, t, h, w = videos.shape x = rearrange(videos, 'b c t h w -> (b t) c h w') encoder_posterior, hidden_states = model.first_stage_model.encode(x, return_hidden_states=True) hidden_states_first_last = [] ### use only the first and last hidden states for hid in hidden_states: hid = rearrange(hid, '(b t) c h w -> b c t h w', t=t) hid_new = torch.cat([hid[:, :, 0:1], hid[:, :, -1:]], dim=2) hidden_states_first_last.append(hid_new) z = model.get_first_stage_encoding(encoder_posterior).detach() z = rearrange(z, '(b t) c h w -> b c t h w', b=b, t=t) return z, hidden_states_first_last if __name__ == '__main__': i2v = Image2Video() video_path = i2v.get_image('prompts/art.png','man fishing in a boat at sunset') print('done', video_path) ================================================ FILE: ToonCrafter/scripts/run.sh ================================================ ckpt=checkpoints/tooncrafter_512_interp_v1/model.ckpt config=configs/inference_512_v1.0.yaml prompt_dir=prompts/512_interp/ res_dir="results" FS=10 ## This model adopts FPS=5, range recommended: 5-30 (smaller value -> larger motion) seed=123 name=tooncrafter_512_interp_seed${seed} CUDA_VISIBLE_DEVICES=0 python3 scripts/evaluation/inference.py \ --seed ${seed} \ --ckpt_path $ckpt \ --config $config \ --savedir $res_dir/$name \ --n_samples 1 \ --bs 1 --height 320 --width 512 \ --unconditional_guidance_scale 7.5 \ --ddim_steps 50 \ --ddim_eta 1.0 \ --prompt_dir $prompt_dir \ --text_input \ --video_length 16 \ --frame_stride ${FS} \ --timestep_spacing 'uniform_trailing' --guidance_rescale 0.7 --perframe_ae --interp ================================================ FILE: ToonCrafter/utils/__init__.py ================================================ ================================================ FILE: ToonCrafter/utils/save_video.py ================================================ import os import numpy as np from tqdm import tqdm from PIL import Image from einops import rearrange import torch import torchvision from torch import Tensor from torchvision.utils import make_grid from torchvision.transforms.functional import to_tensor def frames_to_mp4(frame_dir,output_path,fps): def read_first_n_frames(d: os.PathLike, num_frames: int): if num_frames: images = [Image.open(os.path.join(d, f)) for f in sorted(os.listdir(d))[:num_frames]] else: images = [Image.open(os.path.join(d, f)) for f in sorted(os.listdir(d))] images = [to_tensor(x) for x in images] return torch.stack(images) videos = read_first_n_frames(frame_dir, num_frames=None) videos = videos.mul(255).to(torch.uint8).permute(0, 2, 3, 1) torchvision.io.write_video(output_path, videos, fps=fps, video_codec='h264', options={'crf': '10'}) def tensor_to_mp4(video, savepath, fps, rescale=True, nrow=None): """ video: torch.Tensor, b,c,t,h,w, 0-1 if -1~1, enable rescale=True """ n = video.shape[0] video = video.permute(2, 0, 1, 3, 4) # t,n,c,h,w nrow = int(np.sqrt(n)) if nrow is None else nrow frame_grids = [torchvision.utils.make_grid(framesheet, nrow=nrow, padding=0) for framesheet in video] # [3, grid_h, grid_w] grid = torch.stack(frame_grids, dim=0) # stack in temporal dim [T, 3, grid_h, grid_w] grid = torch.clamp(grid.float(), -1., 1.) if rescale: grid = (grid + 1.0) / 2.0 grid = (grid * 255).to(torch.uint8).permute(0, 2, 3, 1) # [T, 3, grid_h, grid_w] -> [T, grid_h, grid_w, 3] torchvision.io.write_video(savepath, grid, fps=fps, video_codec='h264', options={'crf': '10'}) def tensor2videogrids(video, root, filename, fps, rescale=True, clamp=True): assert(video.dim() == 5) # b,c,t,h,w assert(isinstance(video, torch.Tensor)) video = video.detach().cpu() if clamp: video = torch.clamp(video, -1., 1.) n = video.shape[0] video = video.permute(2, 0, 1, 3, 4) # t,n,c,h,w frame_grids = [torchvision.utils.make_grid(framesheet, nrow=int(np.sqrt(n))) for framesheet in video] # [3, grid_h, grid_w] grid = torch.stack(frame_grids, dim=0) # stack in temporal dim [T, 3, grid_h, grid_w] if rescale: grid = (grid + 1.0) / 2.0 grid = (grid * 255).to(torch.uint8).permute(0, 2, 3, 1) # [T, 3, grid_h, grid_w] -> [T, grid_h, grid_w, 3] path = os.path.join(root, filename) torchvision.io.write_video(path, grid, fps=fps, video_codec='h264', options={'crf': '10'}) def log_local(batch_logs, save_dir, filename, save_fps=10, rescale=True): if batch_logs is None: return None """ save images and videos from images dict """ def save_img_grid(grid, path, rescale): if rescale: grid = (grid + 1.0) / 2.0 # -1,1 -> 0,1; c,h,w grid = grid.transpose(0, 1).transpose(1, 2).squeeze(-1) grid = grid.numpy() grid = (grid * 255).astype(np.uint8) os.makedirs(os.path.split(path)[0], exist_ok=True) Image.fromarray(grid).save(path) for key in batch_logs: value = batch_logs[key] if isinstance(value, list) and isinstance(value[0], str): ## a batch of captions path = os.path.join(save_dir, "%s-%s.txt"%(key, filename)) with open(path, 'w') as f: for i, txt in enumerate(value): f.write(f'idx={i}, txt={txt}\n') f.close() elif isinstance(value, torch.Tensor) and value.dim() == 5: ## save video grids video = value # b,c,t,h,w ## only save grayscale or rgb mode if video.shape[1] != 1 and video.shape[1] != 3: continue n = video.shape[0] video = video.permute(2, 0, 1, 3, 4) # t,n,c,h,w frame_grids = [torchvision.utils.make_grid(framesheet, nrow=int(1), padding=0) for framesheet in video] #[3, n*h, 1*w] grid = torch.stack(frame_grids, dim=0) # stack in temporal dim [t, 3, n*h, w] if rescale: grid = (grid + 1.0) / 2.0 grid = (grid * 255).to(torch.uint8).permute(0, 2, 3, 1) path = os.path.join(save_dir, "%s-%s.mp4"%(key, filename)) torchvision.io.write_video(path, grid, fps=save_fps, video_codec='h264', options={'crf': '10'}) ## save frame sheet img = value video_frames = rearrange(img, 'b c t h w -> (b t) c h w') t = img.shape[2] grid = torchvision.utils.make_grid(video_frames, nrow=t, padding=0) path = os.path.join(save_dir, "%s-%s.jpg"%(key, filename)) #save_img_grid(grid, path, rescale) elif isinstance(value, torch.Tensor) and value.dim() == 4: ## save image grids img = value ## only save grayscale or rgb mode if img.shape[1] != 1 and img.shape[1] != 3: continue n = img.shape[0] grid = torchvision.utils.make_grid(img, nrow=1, padding=0) path = os.path.join(save_dir, "%s-%s.jpg"%(key, filename)) save_img_grid(grid, path, rescale) else: pass def prepare_to_log(batch_logs, max_images=100000, clamp=True): if batch_logs is None: return None # process for key in batch_logs: N = batch_logs[key].shape[0] if hasattr(batch_logs[key], 'shape') else len(batch_logs[key]) N = min(N, max_images) batch_logs[key] = batch_logs[key][:N] ## in batch_logs: images & caption if isinstance(batch_logs[key], torch.Tensor): batch_logs[key] = batch_logs[key].detach().cpu() if clamp: try: batch_logs[key] = torch.clamp(batch_logs[key].float(), -1., 1.) except RuntimeError: print("clamp_scalar_cpu not implemented for Half") return batch_logs # ---------------------------------------------------------------------------------------------- def fill_with_black_squares(video, desired_len: int) -> Tensor: if len(video) >= desired_len: return video return torch.cat([ video, torch.zeros_like(video[0]).unsqueeze(0).repeat(desired_len - len(video), 1, 1, 1), ], dim=0) # ---------------------------------------------------------------------------------------------- def load_num_videos(data_path, num_videos): # first argument can be either data_path of np array if isinstance(data_path, str): videos = np.load(data_path)['arr_0'] # NTHWC elif isinstance(data_path, np.ndarray): videos = data_path else: raise Exception if num_videos is not None: videos = videos[:num_videos, :, :, :, :] return videos def npz_to_video_grid(data_path, out_path, num_frames, fps, num_videos=None, nrow=None, verbose=True): # videos = torch.tensor(np.load(data_path)['arr_0']).permute(0,1,4,2,3).div_(255).mul_(2) - 1.0 # NTHWC->NTCHW, np int -> torch tensor 0-1 if isinstance(data_path, str): videos = load_num_videos(data_path, num_videos) elif isinstance(data_path, np.ndarray): videos = data_path else: raise Exception n,t,h,w,c = videos.shape videos_th = [] for i in range(n): video = videos[i, :,:,:,:] images = [video[j, :,:,:] for j in range(t)] images = [to_tensor(img) for img in images] video = torch.stack(images) videos_th.append(video) if verbose: videos = [fill_with_black_squares(v, num_frames) for v in tqdm(videos_th, desc='Adding empty frames')] # NTCHW else: videos = [fill_with_black_squares(v, num_frames) for v in videos_th] # NTCHW frame_grids = torch.stack(videos).permute(1, 0, 2, 3, 4) # [T, N, C, H, W] if nrow is None: nrow = int(np.ceil(np.sqrt(n))) if verbose: frame_grids = [make_grid(fs, nrow=nrow) for fs in tqdm(frame_grids, desc='Making grids')] else: frame_grids = [make_grid(fs, nrow=nrow) for fs in frame_grids] if os.path.dirname(out_path) != "": os.makedirs(os.path.dirname(out_path), exist_ok=True) frame_grids = (torch.stack(frame_grids) * 255).to(torch.uint8).permute(0, 2, 3, 1) # [T, H, W, C] torchvision.io.write_video(out_path, frame_grids, fps=fps, video_codec='h264', options={'crf': '10'}) ================================================ FILE: ToonCrafter/utils/utils.py ================================================ import os import importlib import numpy as np import cv2 import torch import torch.distributed as dist def count_params(model, verbose=False): total_params = sum(p.numel() for p in model.parameters()) if verbose: print(f"{model.__class__.__name__} has {total_params*1.e-6:.2f} M params.") return total_params def check_istarget(name, para_list): """ name: full name of source para para_list: partial name of target para """ istarget = False for para in para_list: if para in name: return True return istarget def instantiate_from_config(config): if "target" not in config: if config == '__is_first_stage__': return None elif config == "__is_unconditional__": return None raise KeyError("Expected key `target` to instantiate.") return get_obj_from_str(config["target"])(**config.get("params", dict())) def get_obj_from_str(string, reload=False): module, cls = string.rsplit(".", 1) if reload: module_imp = importlib.import_module(module) importlib.reload(module_imp) return getattr(importlib.import_module(module, package=None), cls) def load_npz_from_dir(data_dir): data = [np.load(os.path.join(data_dir, data_name))['arr_0'] for data_name in os.listdir(data_dir)] data = np.concatenate(data, axis=0) return data def load_npz_from_paths(data_paths): data = [np.load(data_path)['arr_0'] for data_path in data_paths] data = np.concatenate(data, axis=0) return data def resize_numpy_image(image, max_resolution=512 * 512, resize_short_edge=None): h, w = image.shape[:2] if resize_short_edge is not None: k = resize_short_edge / min(h, w) else: k = max_resolution / (h * w) k = k**0.5 h = int(np.round(h * k / 64)) * 64 w = int(np.round(w * k / 64)) * 64 image = cv2.resize(image, (w, h), interpolation=cv2.INTER_LANCZOS4) return image def setup_dist(args): if dist.is_initialized(): return torch.cuda.set_device(args.local_rank) torch.distributed.init_process_group( 'nccl', init_method='env://' ) ================================================ FILE: __init__.py ================================================ import os import sys import torch import time import logging as logger import importlib from functools import cache from pathlib import Path from contextlib import contextmanager, ExitStack from omegaconf import OmegaConf from huggingface_hub import hf_hub_download from einops import repeat, rearrange from torchvision import transforms from pytorch_lightning import seed_everything from platform import system from comfy import model_management as mm from comfy.utils import ProgressBar if system() == "Darwin": os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1" os.environ["HF_HOME"] = "~/.cache/huggingface" os.environ["XFORMERS_FORCE_DISABLE_TRITON"] = "1" USER_DEF_CLIP = Path(__file__).parent.joinpath("models/open_clip_pytorch_model.bin") if USER_DEF_CLIP.exists(): os.environ["USER_DEF_CLIP"] = USER_DEF_CLIP.as_posix() # os.environ["no_proxy"] = "localhost, 127.0.0.1, ::1" ROOT = Path(__file__).parent.joinpath("ToonCrafter") sys.path.append(Path(__file__).parent.as_posix()) sys.path.append(ROOT.as_posix()) # from ToonCrafter.utils.utils import instantiate_from_config from ToonCrafter.scripts.evaluation.funcs import load_model_checkpoint, batch_ddim_sampling # from ToonCrafter.cldm.model import load_state_dict def instantiate_from_config(config): if "target" not in config: if config == '__is_first_stage__': return None elif config == "__is_unconditional__": return None raise KeyError("Expected key `target` to instantiate.") return get_obj_from_str(config["target"])(**config.get("params", dict())) def get_obj_from_str(string, reload=False): module, cls = string.rsplit(".", 1) if reload: module_imp = importlib.import_module(module) importlib.reload(module_imp) return getattr(importlib.import_module(module, package=None), cls) def get_state_dict(d): return d.get('state_dict', d) def load_state_dict(ckpt_path, location='cpu'): _, extension = os.path.splitext(ckpt_path) if extension.lower() == ".safetensors": import safetensors.torch state_dict = safetensors.torch.load_file(ckpt_path, device=location) else: state_dict = get_state_dict(torch.load(ckpt_path, map_location=torch.device(location))) state_dict = get_state_dict(state_dict) print(f'Loaded state_dict from [{ckpt_path}]') return state_dict @cache def get_models(root: Path = ROOT.joinpath("checkpoints"), ignoreed: tuple = ("sketch_encoder.ckpt", )): ckpts = [] files = [] for ext in ['ckpt', 'pt', 'bin', 'pth', 'safetensors', 'pkl']: files.extend(root.rglob(f"*.{ext}")) for file in files: if file.name in ignoreed: continue ckpts.append(file.relative_to(root).as_posix()) return sorted(ckpts) class ToonCrafterNode: @classmethod def INPUT_TYPES(s): return { "required": { "image": ("IMAGE", ), "image2": ("IMAGE", ), "ckpt_name": (get_models(), ), "vram_opt_strategy": (["none", "low"], ), "prompt": ("STRING", {"multiline": True, "dynamicPrompts": True}), # "clip": ("CLIP", ), "seed": ("INT", {"default": 123, "min": 0, "max": 0xffffffffffffffff}), "eta": ("FLOAT", {"default": 1.0, "min": 0.0, "max": 15.0, "step": 0.1}), "cfg_scale": ("FLOAT", {"default": 7.5, "min": 1.0, "max": 15.0, "step": 0.5}), "steps": ("INT", {"default": 50, "min": 1, "max": 60, "step": 1}), "frame_count": ("INT", {"default": 10, "min": 5, "max": 30, "step": 1}), "fps": ("INT", {"default": 8, "min": 1, "max": 60, "step": 1}), } } RETURN_TYPES = ("IMAGE", ) FUNCTION = "get_image" OUTPUT_NODE = True CATEGORY = "ToonCrafter" def init(self, ckpt_name="", result_dir=ROOT.joinpath("tmp/"), gpu_num=1, resolution='320_512') -> None: h, w = resolution.split('_') self.resolution = int(h), int(w) # self.download_model() self.result_dir = result_dir Path(self.result_dir).mkdir(parents=True, exist_ok=True) ckpt_path = ROOT.joinpath("checkpoints", ckpt_name) if not ckpt_path.exists(): ckpt_path = ROOT.joinpath(f'checkpoints/tooncrafter_{w}_interp_v1', 'model.ckpt') if not ckpt_path.exists(): raise Exception(f"ToonCrafterNode Error: {ckpt_path} Not Found!") config_file = ROOT.joinpath(f'configs/inference_{w}_v1.0.yaml') config = OmegaConf.load(config_file.as_posix()) model_config = config.pop("model", OmegaConf.create()) model_config['params']['unet_config']['params']['use_checkpoint'] = False model_list = [] # mm.unload_all_models() for gpu_id in range(gpu_num): model = instantiate_from_config(model_config) # model = model.cuda(gpu_id) logger.info(ckpt_path) assert ckpt_path.exists(), "Error: checkpoint Not Found!" model = load_model_checkpoint(model, ckpt_path.as_posix()) model.eval() model_list.append(model) self.model_list = model_list self.save_fps = 8 self.is_cuda = torch.cuda.is_available() self.is_mps = torch.backends.mps.is_available() self.is_cpu = torch.cpu.is_available() @contextmanager def optional_autocast(device): try: with torch.autocast(device.type): yield except Exception as e: print(f"Autocast is not supported: {e}") yield def get_image(self, image: torch.Tensor, ckpt_name, vram_opt_strategy, prompt, steps=50, cfg_scale=7.5, eta=1.0, frame_count=3, fps=8, seed=123, image2: torch.Tensor = None): os.environ["TOON_MEM_STRATEGY"] = vram_opt_strategy self.init(ckpt_name=ckpt_name) self.save_fps = fps seed = seed % 4294967295 seed_everything(seed) transform = transforms.Compose([ transforms.Resize(min(self.resolution)), transforms.CenterCrop(self.resolution), ]) mm.soft_empty_cache() print('start:', prompt, time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))) start = time.time() gpu_id = 0 if steps > 60: steps = 60 model: torch.nn.Module = self.model_list[gpu_id] half = mm.should_use_bf16() or mm.should_use_fp16() or vram_opt_strategy == "low" if half: model = model.half() image = image.half() image2 = image2.half() if self.is_cuda: model = model.to('cuda') elif self.is_mps: model = model.to('mps') elif self.is_cpu: model = model.to('cpu') batch_size = 1 channels = model.model.diffusion_model.out_channels frames = model.temporal_length h, w = self.resolution[0] // 8, self.resolution[1] // 8 noise_shape = [batch_size, channels, frames, h, w] pbar = ProgressBar(steps) # text cond with ExitStack() as stack: stack.enter_context(torch.no_grad()) if self.is_cuda: stack.enter_context(torch.cuda.amp.autocast()) # stack.enter_context(self.optional_autocast(device=model.device)) text_emb = model.get_learned_conditioning([prompt]) model.cond_stage_model.to("cpu") # img cond img_tensor = image[0].permute(2, 0, 1).to(model.device) img_tensor = (img_tensor - 0.5) * 2 image_tensor_resized = transform(img_tensor) # 3,h,w videos = image_tensor_resized.unsqueeze(0).unsqueeze(2) # bc1hw # z = get_latent_z(model, videos) #bc,1,hw videos = repeat(videos, 'b c t h w -> b c (repeat t) h w', repeat=frames // 2) img_tensor2 = image2[0].permute(2, 0, 1).to(model.device) img_tensor2 = (img_tensor2 - 0.5) * 2 image_tensor_resized2 = transform(img_tensor2) # 3,h,w videos2 = image_tensor_resized2.unsqueeze(0).unsqueeze(2) # bchw videos2 = repeat(videos2, 'b c t h w -> b c (repeat t) h w', repeat=frames // 2) videos = torch.cat([videos, videos2], dim=2) # v10 = torch.mps.driver_allocated_memory() / 1024**3 mm.soft_empty_cache() # v11 = torch.mps.driver_allocated_memory() / 1024**3 z, hs = self.get_latent_z_with_hidden_states(model, videos) model.cond_stage_model.to(model.device) # v20 = torch.mps.driver_allocated_memory() / 1024**3 mm.soft_empty_cache() # v21 = torch.mps.driver_allocated_memory() / 1024**3 img_tensor_repeat = torch.zeros_like(z).to(dtype=model.dtype) img_tensor_repeat[:, :, :1, :, :] = z[:, :, :1, :, :] img_tensor_repeat[:, :, -1:, :, :] = z[:, :, -1:, :, :] cond_images = model.embedder(img_tensor.unsqueeze(0)) # blc img_emb = model.image_proj_model(cond_images) imtext_cond = torch.cat([text_emb, img_emb], dim=1) del cond_images, text_emb, img_emb, videos, videos2, image_tensor_resized2, img_tensor2, image_tensor_resized, image fs = torch.tensor([frame_count], dtype=torch.long, device=model.device) cond = {"c_crossattn": [imtext_cond], "fs": fs, "c_concat": [img_tensor_repeat]} def cb(step): print(f"step: {step}", end='\r') pbar.update_absolute(step + 1) mm.soft_empty_cache() # inference batch_samples = batch_ddim_sampling(model, cond, noise_shape, n_samples=1, ddim_steps=steps, ddim_eta=eta, cfg_scale=cfg_scale, hs=hs, callback=cb) # remove the last frame if image2 is None: batch_samples = batch_samples[:, :, :, :-1, ...] # b,samples,c,t,h,w prompt_str = prompt.replace("/", "_slash_") if "/" in prompt else prompt prompt_str = prompt_str.replace(" ", "_") if " " in prompt else prompt_str prompt_str = prompt_str[:40] if len(prompt_str) == 0: prompt_str = 'empty_prompt' # self.save_videos(batch_samples, self.result_dir, filenames=[prompt_str], fps=self.save_fps) print(f"Saved in {prompt_str}. Time used: {(time.time() - start):.2f} seconds") try: # frame_count, width, height, channel batch_samples = batch_samples[0][0].permute(1, 2, 3, 0) if half: batch_samples = batch_samples.to(dtype=torch.float32) except Exception as e: sys.stderr.write(f"{e}\n") return (None, ) batch_samples = (batch_samples + 1.0) * 0.5 mm.soft_empty_cache() model = model.cpu() return (batch_samples, ) def save_videos(self, batch_tensors, savedir, filenames, fps=10): import torchvision # b,samples,c,t,h,w n_samples = batch_tensors.shape[1] for idx, vid_tensor in enumerate(batch_tensors): video = vid_tensor.detach().cpu() video = torch.clamp(video.float(), -1., 1.) video = video.permute(2, 0, 1, 3, 4) # t,n,c,h,w frame_grids = [torchvision.utils.make_grid(framesheet, nrow=int(n_samples)) for framesheet in video] # [3, 1*h, n*w] grid = torch.stack(frame_grids, dim=0) # stack in temporal dim [t, 3, n*h, w] grid = (grid + 1.0) / 2.0 grid = (grid * 255).to(torch.uint8).permute(0, 2, 3, 1) savepath = os.path.join(savedir, f"{filenames[idx]}.mp4") torchvision.io.write_video(savepath, grid, fps=fps, video_codec='h264', options={'crf': '10'}) def download_model(self): REPO_ID = 'Doubiiu/ToonCrafter' filename_list = ['model.ckpt'] model_dir = ROOT.joinpath('checkpoints/tooncrafter_' + str(self.resolution[1]) + '_interp_v1/') model_dir.mkdir(parents=True, exist_ok=True) for filename in filename_list: local_file = model_dir.joinpath(filename) if not local_file.exists(): hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir=model_dir.as_posix(), local_dir_use_symlinks=False) def get_latent_z_with_hidden_states(self, model, videos): b, c, t, h, w = videos.shape x = rearrange(videos, 'b c t h w -> (b t) c h w') encoder_posterior, hidden_states = model.first_stage_model.encode(x, return_hidden_states=True) hidden_states_first_last = [] # use only the first and last hidden states for hid in hidden_states: hid = rearrange(hid, '(b t) c h w -> b c t h w', t=t) hid_new = torch.cat([hid[:, :, 0:1], hid[:, :, -1:]], dim=2) hidden_states_first_last.append(hid_new) z = model.get_first_stage_encoding(encoder_posterior).detach() z = rearrange(z, '(b t) c h w -> b c t h w', b=b, t=t) return z, hidden_states_first_last class ToonCrafterWithSketch(ToonCrafterNode): @classmethod def INPUT_TYPES(s): return { "required": { "image": ("IMAGE", ), "image2": ("IMAGE", ), "frame_guides": ("IMAGE", ), "ckpt_name": (get_models(), ), "vram_opt_strategy": (["none", "low"], ), "prompt": ("STRING", {"multiline": True, "dynamicPrompts": True}), "seed": ("INT", {"default": 123, "min": 0, "max": 0xffffffffffffffff}), "eta": ("FLOAT", {"default": 1.0, "min": 0.0, "max": 15.0, "step": 0.1}), "cfg_scale": ("FLOAT", {"default": 7.5, "min": 1.0, "max": 15.0, "step": 0.5}), "steps": ("INT", {"default": 50, "min": 1, "max": 60, "step": 1}), "frame_count": ("INT", {"default": 10, "min": 5, "max": 30, "step": 1}), "fps": ("INT", {"default": 8, "min": 1, "max": 60, "step": 1}), "control_scale": ("FLOAT", {"default": 0.6, "min": 0, "max": 1.0, "step": 0.1}), } } RETURN_TYPES = ("IMAGE", ) FUNCTION = "get_image" OUTPUT_NODE = True CATEGORY = "ToonCrafter" def init(self, ckpt_name="", result_dir=ROOT.joinpath("tmp/"), gpu_num=1, resolution='320_512') -> None: h, w = resolution.split('_') self.resolution = int(h), int(w) self.result_dir = result_dir Path(self.result_dir).mkdir(parents=True, exist_ok=True) ckpt_path = ROOT.joinpath("checkpoints", ckpt_name) if not ckpt_path.exists(): ckpt_path = ROOT.joinpath(f'checkpoints/tooncrafter_{w}_interp_v1', 'model.ckpt') if not ckpt_path.exists(): raise Exception(f"ToonCrafterWithSketch Error: {ckpt_path} Not Found!") config_file = ROOT.joinpath(f'configs/inference_{w}_v1.0.yaml') config = OmegaConf.load(config_file.as_posix()) model_config = config.pop("model", OmegaConf.create()) model_config['params']['unet_config']['params']['use_checkpoint'] = False model_list = [] # ControlModel cn_ckpt_path = ROOT.joinpath("checkpoints", "sketch_encoder.ckpt") cn_config_file = ROOT.joinpath("configs/cldm_v21.yaml") cn_config = OmegaConf.load(cn_config_file.as_posix()) cn_model_config = cn_config.pop("control_stage_config", OmegaConf.create()) self.is_cuda = torch.cuda.is_available() self.is_mps = torch.backends.mps.is_available() self.is_cpu = torch.cpu.is_available() self.device = "cuda" if self.is_cuda else "mps" if self.is_mps else "cpu" model_list = [] for gpu_id in range(gpu_num): model = instantiate_from_config(model_config) cn_model = instantiate_from_config(cn_model_config) # model = model.cuda(gpu_id) assert ckpt_path.exists(), "Error: checkpoint Not Found!" model = load_model_checkpoint(model, ckpt_path) model.eval() cn_model.load_state_dict(load_state_dict(cn_ckpt_path, location=self.device)) cn_model.eval() model.control_model = cn_model model_list.append(model) self.model_list = model_list self.save_fps = 8 def get_image(self, image: torch.Tensor, ckpt_name, vram_opt_strategy, prompt, steps=50, cfg_scale=7.5, eta=1.0, frame_count=3, fps=8, seed=123, image2: torch.Tensor = None, frame_guides=None, control_scale=0.6): os.environ["TOON_MEM_STRATEGY"] = vram_opt_strategy self.init(ckpt_name=ckpt_name) control_frames = frame_guides self.save_fps = fps seed = seed % 4294967295 seed_everything(seed) transform = transforms.Compose([ transforms.Resize(min(self.resolution)), transforms.CenterCrop(self.resolution), ]) mm.soft_empty_cache() print('start:', prompt, time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))) start = time.time() gpu_id = 0 if steps > 60: steps = 60 model: torch.nn.Module = self.model_list[gpu_id] if self.is_cuda: model = model.to('cuda') elif self.is_mps: model = model.to('mps') elif self.is_cpu: model = model.to('cpu') half = mm.should_use_bf16() or mm.should_use_fp16() or vram_opt_strategy == "low" if half: model = model.half() model.control_model.dtype = model.dtype image = image.half() image2 = image2.half() control_frames = control_frames.half() batch_size = 1 channels = model.model.diffusion_model.out_channels frames = model.temporal_length h, w = self.resolution[0] // 8, self.resolution[1] // 8 noise_shape = [batch_size, channels, frames, h, w] pbar = ProgressBar(steps) # text cond with ExitStack() as stack: stack.enter_context(torch.no_grad()) if self.is_cuda: stack.enter_context(torch.cuda.amp.autocast()) text_emb = model.get_learned_conditioning([prompt]) # control cond if frame_guides is not None: cn_videos = [] for frame in control_frames: # frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) # frame = cv2.bitwise_not(frame) cn_tensor = frame.permute(2, 0, 1).to(model.device) # cn_tensor = (cn_tensor / 255. - 0.5) * 2 # cn_tensor = (cn_tensor / 255.0) cn_tensor_resized = transform(cn_tensor) # 3,h,w cn_video = cn_tensor_resized.unsqueeze(0).unsqueeze(2) # bc1hw cn_videos.append(cn_video) cn_videos = torch.cat(cn_videos, dim=2) del control_frames model_list = [] for model in self.model_list: model.control_scale = control_scale model_list.append(model) self.model_list = model_list else: cn_videos = None # img cond img_tensor = image[0].permute(2, 0, 1).to(model.device) img_tensor = (img_tensor - 0.5) * 2 # img_tensor = torch.from_numpy(image).permute(2, 0, 1).float().to(model.device) # img_tensor = (img_tensor / 255. - 0.5) * 2 image_tensor_resized = transform(img_tensor) # 3,h,w videos = image_tensor_resized.unsqueeze(0).unsqueeze(2) # bc1hw # z = get_latent_z(model, videos) #bc,1,hw videos = repeat(videos, 'b c t h w -> b c (repeat t) h w', repeat=frames // 2) img_tensor2 = image2[0].permute(2, 0, 1).to(model.device) img_tensor2 = (img_tensor2 - 0.5) * 2 # img_tensor2 = torch.from_numpy(image2).permute(2, 0, 1).float().to(model.device) # img_tensor2 = (img_tensor2 / 255. - 0.5) * 2 image_tensor_resized2 = transform(img_tensor2) # 3,h,w videos2 = image_tensor_resized2.unsqueeze(0).unsqueeze(2) # bchw videos2 = repeat(videos2, 'b c t h w -> b c (repeat t) h w', repeat=frames // 2) videos = torch.cat([videos, videos2], dim=2) mm.soft_empty_cache() z, hs = self.get_latent_z_with_hidden_states(model, videos) model.cond_stage_model.to(model.device) mm.soft_empty_cache() # img_tensor_repeat = torch.zeros_like(z) img_tensor_repeat = torch.zeros_like(z).to(dtype=model.dtype) img_tensor_repeat[:, :, :1, :, :] = z[:, :, :1, :, :] img_tensor_repeat[:, :, -1:, :, :] = z[:, :, -1:, :, :] cond_images = model.embedder(img_tensor.unsqueeze(0)) # blc img_emb = model.image_proj_model(cond_images) imtext_cond = torch.cat([text_emb, img_emb], dim=1) del cond_images, text_emb, img_emb, videos, videos2, image_tensor_resized2, img_tensor2, image_tensor_resized, image fs = torch.tensor([frame_count], dtype=torch.long, device=model.device) cond = {"c_crossattn": [imtext_cond], "fs": fs, "c_concat": [img_tensor_repeat], "control_cond": cn_videos} def cb(step): print(f"step: {step}", end='\r') pbar.update_absolute(step + 1) mm.soft_empty_cache() # inference batch_samples = batch_ddim_sampling(model, cond, noise_shape, n_samples=1, ddim_steps=steps, ddim_eta=eta, cfg_scale=cfg_scale, hs=hs, callback=cb) # remove the last frame if image2 is None: batch_samples = batch_samples[:, :, :, :-1, ...] # b,samples,c,t,h,w prompt_str = prompt.replace("/", "_slash_") if "/" in prompt else prompt prompt_str = prompt_str.replace(" ", "_") if " " in prompt else prompt_str prompt_str = prompt_str[:40] if len(prompt_str) == 0: prompt_str = 'empty_prompt' # self.save_videos(batch_samples, self.result_dir, filenames=[prompt_str], fps=self.save_fps) print(f"Saved in {prompt_str}. Time used: {(time.time() - start):.2f} seconds") try: # frame_count, width, height, channel batch_samples = batch_samples[0][0].permute(1, 2, 3, 0) if half: batch_samples = batch_samples.to(dtype=torch.float32) except Exception as e: sys.stderr.write(f"{e}\n") return (None, ) batch_samples = (batch_samples + 1.0) * 0.5 mm.soft_empty_cache() model = model.cpu() return (batch_samples, ) NODE_CLASS_MAPPINGS = { "ToonCrafterNode": ToonCrafterNode, "ToonCrafterWithSketch": ToonCrafterWithSketch, } NODE_DISPLAY_NAME_MAPPINGS = { "ToonCrafterNode": "ToonCrafter", "ToonCrafterWithSketch": "ToonCrafterWithSketch", } WEB_DIRECTORY = "./" ================================================ FILE: pre_run.py ================================================ import sys import argparse import os from pathlib import Path sys.path.append(Path(__file__).parent.as_posix()) from scripts.evaluation.inference import seed_everything, run_inference, get_parser def run(): old_dir = os.getcwd() os.chdir(Path(__file__).parent.as_posix()) parser = get_parser() ckpt = "checkpoints/tooncrafter_512_interp_v1/model.ckpt" config = "configs/inference_512_v1.0.yaml" prompt_dir = "prompts/512_interp/" res_dir = "results" FS = 10 # This model adopts FPS=5, range recommended: 5-30 (smaller value -> larger motion) seed = 123 name = f"tooncrafter_512_interp_seed{seed}" os.environ["CUDA_VISIBLE_DEVICES"] = "0" namespace = argparse.Namespace( seed=123, ckpt_path=ckpt, config=config, savedir=f"{res_dir}/{name}", n_samples=1, bs=1, height=320, width=512, unconditional_guidance_scale=7.5, ddim_steps=50, ddim_eta=1.0, prompt_dir=prompt_dir, text_input=True, video_length=16, frame_stride=FS, timestep_spacing='uniform_trailing', guidance_rescale=0.7, perframe_ae=True, interp=True ) args = parser.parse_args(args=[], namespace=namespace) seed_everything(args.seed) rank, gpu_num = 0, 1 run_inference(args, gpu_num, rank) os.chdir(old_dir) if __name__ == "__main__": run() ================================================ FILE: pyproject.toml ================================================ [project] name = "comfyui-tooncrafter" description = "This project is used to enable [a/ToonCrafter](https://github.com/ToonCrafter/ToonCrafter) to be used in ComfyUI.\nYou can use it to achieve generative keyframe animation\nAnd use it in Blender for animation rendering and prediction" version = "1.0.0" license = "LICENSE" dependencies = ["imageio==2.9.0", "numpy", "omegaconf==2.1.1", "opencv_python", "pandas", "pytorch_lightning==1.9.3", "pyyaml", "setuptools", "moviepy", "av", "xformers", "timm", "scikit-learn", "open_clip_torch==2.22.0", "decord==0.6.0"] [project.urls] Repository = "https://github.com/AIGODLIKE/ComfyUI-ToonCrafter" # Used by Comfy Registry https://comfyregistry.org [tool.comfy] PublisherId = "" DisplayName = "ComfyUI-ToonCrafter" Icon = "" ================================================ FILE: readme.md ================================================ # Introduction This project is used to enable [ToonCrafter](https://github.com/ToonCrafter/ToonCrafter) to be used in ComfyUI. You can use it to achieve generative keyframe animation(RTX 4090,26s) https://github.com/AIGODLIKE/ComfyUI-ToonCrafter/assets/116185401/68edb789-5a8e-418f-ae35-e3cfe6ab1300 https://github.com/AIGODLIKE/ComfyUI-ToonCrafter/assets/116185401/86553c22-9395-4b0a-9d8d-0c29c7467bd3 And use it in Blender for animation rendering and prediction **Additionally, it can be used completely without a network** ## Installation 1. ComfyUI Custom Node ```bash cd ComfyUI/custom_nodes git clone https://github.com/AIGODLIKE/ComfyUI-ToonCrafter cd ComfyUI-ToonCrafter # install dependencies ..\..\..\python_embeded\python.exe -m pip install -r requirements.txt ``` 2. Model Prepare - Download the weights: - [512 full weights](https://github.com/ToonCrafter/ToonCrafter?tab=readme-ov-file#-models) *High VRAM usage, fp16 reccomended* - [512 fp16 weights](https://huggingface.co/Kijai/DynamiCrafter_pruned/resolve/main/tooncrafter_512_interp-fp16.safetensors) - Put it in into `ComfyUI-ToonCrafter\ToonCrafter\checkpoints\tooncrafter_512_interp_v1` for example 512x512. 3. Enjoy it! ## Showcases ### Blender You can even use it directly in Blender!([ComfyUI-BlenderAI-node](https://github.com/AIGODLIKE/ComfyUI-BlenderAI-node)) https://github.com/AIGODLIKE/ComfyUI-ToonCrafter/assets/116185401/ca8ec681-b5bc-40a1-b12a-ad185acff477
Input starting frame Input ending frame Generated video
================================================ FILE: requirements.txt ================================================ imageio==2.9.0 numpy omegaconf==2.1.1 opencv_python pandas pytorch_lightning==1.9.3 pyyaml setuptools moviepy av xformers timm scikit-learn open_clip_torch==2.22.0 decord==0.6.0