[
  {
    "path": ".gitignore",
    "content": "*.pth\n*.ckpt\n*__pycache__*\n*.pyc\n*egg*\n*src/*\n*.ipynb\nlogs/*\n*delete*\neval_results*\n*.idea*\n*.pytorch"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2023 danielism97\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "# LDMVFI: Video Frame Interpolation with Latent Diffusion Models\n\n[**Duolikun Danier**](https://danier97.github.io/), [**Fan Zhang**](https://fan-aaron-zhang.github.io/), [**David Bull**](https://david-bull.github.io/)\n\n[Project](TODO) | [arXiv](https://arxiv.org/abs/2303.09508) | [Video](https://drive.google.com/file/d/1oL6j_l3b2QEqsL0iO7qSZrGUXJaTpRWN/view?usp=share_link)\n\n![Demo gif](assets/ldmvfi.gif)\n\n\n## Overview\nWe observe that most existing learning-based VFI models are trained to minimise the L1/L2/VGG loss between their outputs and the ground-truth frames. However, it was shown in previous works that these metrics do not correlate well with the **perceptual quality** of VFI. On the other hand, generative models, especially diffusion models, are showing remarkable results in generating visual content with high perceptual quality. In this work, we leverage the high-fidelity image/video generation capabilities of **latent diffusion models** to perform generative VFI.\n<p align=\"center\">\n<img src=\"https://danier97.github.io/LDMVFI/overall.svg\" alt=\"Paper\" width=\"60%\">\n</p>\n\n## Dependencies and Installation\nSee [environment.yaml](./environment.yaml) for requirements on packages. 
Simple installation:\n```\nconda env create -f environment.yaml\n```\n\n## Pre-trained Model\nThe pre-trained model can be downloaded from [here](https://drive.google.com/file/d/1_Xx2fBYQT9O-6O3zjzX76O9XduGnCh_7/view?usp=share_link), and its corresponding config file is [this yaml](./configs/ldm/ldmvfi-vqflow-f32-c256-concat_max.yaml).\n\n\n## Preparing datasets\n### Training sets:\n[[Vimeo-90K]](http://toflow.csail.mit.edu/) | [[BVI-DVC quintuplets]](https://drive.google.com/file/d/1i_CoqiNrZ2AU8DKjU8aHM1jIaDGW0fE5/view?usp=sharing)\n\n### Test sets: \n[[Middlebury]](https://vision.middlebury.edu/flow/data/) | [[UCF101]](https://sites.google.com/view/xiangyuxu/qvi_nips19) | [[DAVIS]](https://sites.google.com/view/xiangyuxu/qvi_nips19) | [[SNU-FILM]](https://myungsub.github.io/CAIN/)\n\n\nTo make use of the [evaluate.py](evaluate.py) and the files in [ldm/data/](./ldm/data/), the dataset folder names should be lower-case and structured as follows.\n```\n└──── <data directory>/\n    ├──── middlebury_others/\n    |   ├──── input/\n    |   |   ├──── Beanbags/\n    |   |   ├──── ...\n    |   |   └──── Walking/\n    |   └──── gt/\n    |       ├──── Beanbags/\n    |       ├──── ...\n    |       └──── Walking/\n    ├──── ucf101/\n    |   ├──── 0/\n    |   ├──── ...\n    |   └──── 99/\n    ├──── davis90/\n    |   ├──── bear/\n    |   ├──── ...\n    |   └──── walking/\n    ├──── snufilm/\n    |   ├──── test-easy.txt\n    |   ├──── ...\n    |   └──── data/SNU-FILM/test/...\n    ├──── bvidvc/quintuplets\n    |   ├──── 00000/\n    |   ├──── ...\n    |   └──── 17599/\n    └──── vimeo_septuplet/\n        ├──── sequences/\n        ├──── sep_testlist.txt\n        └──── sep_trainlist.txt\n```\n\n## Evaluation\n\nTo evaluate LDMVFI (with DDIM sampler), for example, on the Middlebury dataset, using PSNR/SSIM/LPIPS, run the following command.\n```\npython evaluate.py \\\n--config configs/ldm/ldmvfi-vqflow-f32-c256-concat_max.yaml \\\n--ckpt 
<path/to/ldmvfi-vqflow-f32-c256-concat_max.ckpt> \\\n--dataset Middlebury_others \\\n--metrics PSNR SSIM LPIPS \\\n--data_dir <path/to/data/dir> \\\n--out_dir eval_results/ldmvfi-vqflow-f32-c256-concat_max/ \\\n--use_ddim\n```\nThis creates the directory `eval_results/ldmvfi-vqflow-f32-c256-concat_max/Middlebury_others/` and stores the interpolated frames and a `results.txt` file in it. For other test sets, replace `Middlebury_others` with the corresponding class name defined in [ldm/data/testsets.py](ldm/data/testsets.py) (e.g. `Ucf101_triplet`).\n\n\\\nTo evaluate the model on the perceptual video metric FloLPIPS, first evaluate the image metrics using the command above (so that the interpolated frames are saved in `eval_results/ldmvfi-vqflow-f32-c256-concat_max`), then run the following command.\n```\npython evaluate_vqm.py \\\n--exp ldmvfi-vqflow-f32-c256-concat_max \\\n--dataset Middlebury_others \\\n--metrics FloLPIPS \\\n--data_dir <path/to/data/dir> \\\n--out_dir eval_results/ldmvfi-vqflow-f32-c256-concat_max/\n```\nThis reads the interpolated frames previously stored in `eval_results/ldmvfi-vqflow-f32-c256-concat_max/Middlebury_others/` and writes the evaluation results to `results_vqm.txt` in the same folder.\n\n\\\nTo interpolate a video (in .yuv format), use the following command.\n```\npython interpolate_yuv.py \\\n--net LDMVFI \\\n--config configs/ldm/ldmvfi-vqflow-f32-c256-concat_max.yaml \\\n--ckpt <path/to/ldmvfi-vqflow-f32-c256-concat_max.ckpt> \\\n--input_yuv <path/to/input/yuv> \\\n--size <spatial res of video, e.g. 
1920x1080> \\\n--out_fps <output fps, should be 2 x original fps> \\\n--out_dir <desired/output/dir> \\\n--use_ddim\n```\n\n## Training\nLDMVFI is trained in two stages, where the VQ-FIGAN and the denoising U-Net are trained separately.\n### VQ-FIGAN\n```\npython main.py --base configs/autoencoder/vqflow-f32.yaml -t --gpus 0,\n```\n### Denoising U-Net\nBefore running this stage, set `ckpt_path` under `first_stage_config` in the config file to the pre-trained VQ-FIGAN checkpoint.\n```\npython main.py --base configs/ldm/ldmvfi-vqflow-f32-c256-concat_max.yaml -t --gpus 0,\n```\nThese commands create a `logs/` folder containing one directory per experiment. The training logs include checkpoints, images and TensorBoard logs.\n\nTo resume from a checkpoint file, use the `--resume` argument in [main.py](main.py) to specify the checkpoint.\n\n\n## Citation\n```\n@article{danier2023ldmvfi,\n  title={LDMVFI: Video Frame Interpolation with Latent Diffusion Models},\n  author={Danier, Duolikun and Zhang, Fan and Bull, David},\n  journal={arXiv preprint arXiv:2303.09508},\n  year={2023}\n}\n```\n\n## Acknowledgement\nOur code is adapted from the original [latent-diffusion](https://github.com/CompVis/latent-diffusion) repository. We thank the authors for sharing their code."
  },
  {
    "path": "configs/autoencoder/vqflow-f32.yaml",
    "content": "model:\n  base_learning_rate: 1.0e-5\n  target: ldm.models.autoencoder.VQFlowNet\n  params:\n    monitor: val/total_loss\n    embed_dim: 3\n    n_embed: 8192\n    ddconfig:\n      double_z: False\n      z_channels: 3\n      resolution: 256\n      in_channels: 3\n      out_ch: 3\n      ch: 64\n      ch_mult: [1,2,2,2,4]  # f = 2 ^ len(ch_mult)\n      num_res_blocks: 1\n      cond_type: max_cross_attn\n      attn_type: max\n      attn_resolutions: []\n      dropout: 0.0\n\n    lossconfig:\n      target: ldm.modules.losses.vqperceptual.VQLPIPSWithDiscriminator\n      params:\n        disc_conditional: False\n        disc_in_channels: 3\n        disc_start: 10000\n        disc_weight: 0.8\n        codebook_weight: 1.0\n\ndata:\n  target: main.DataModuleFromConfig\n  params:\n    batch_size: 10\n    num_workers: 0\n    wrap: false\n    train:\n      target: ldm.data.bvi_vimeo.BVI_Vimeo_triplet\n      params:\n        db_dir: C:/data_tmp/\n        crop_sz: [256,256]\n        iter: True\n    validation:\n      target: ldm.data.bvi_vimeo.Vimeo90k_triplet\n      params:\n        db_dir: C:/data_tmp/vimeo_septuplet/\n        train: False\n        crop_sz: [256,256]\n        augment_s: False\n        augment_t: False\n\n\nlightning:\n  callbacks:\n    image_logger:\n      target: main.ImageLogger\n      params:\n        batch_frequency: 8000\n        val_batch_frequency: 800\n        max_images: 8\n        increase_log_steps: False\n        log_images_kwargs: {'N': 1}\n\n  trainer:\n    benchmark: True\n    max_epochs: -1\n"
  },
  {
    "path": "configs/ldm/ldmvfi-vqflow-f32-c256-concat_max.yaml",
    "content": "model:\n  base_learning_rate: 1.0e-06\n  target: ldm.models.diffusion.ddpm.LatentDiffusionVFI\n  params:\n    linear_start: 0.0015\n    linear_end: 0.0195\n    num_timesteps_cond: 1\n    log_every_t: 200\n    timesteps: 1000\n    first_stage_key: image\n    cond_stage_key: past_future_frames\n    image_size: 8\n    channels: 3\n    cond_stage_trainable: False\n    concat_mode: True\n    monitor: val/loss_simple_ema\n    unet_config:\n      target: ldm.modules.diffusionmodules.openaimodel.UNetModel\n      params:\n        image_size: 8 # img size of latent, used during training, determines some model params, so don't change for inference\n        in_channels: 9\n        out_channels: 3\n        model_channels: 256\n        attention_resolutions:\n        #note: this isn\\t actually the resolution but\n        # the downsampling factor, i.e. this corresnponds to\n        # attention on spatial resolution 8,16,32, as the\n        # spatial reolution of the latents is 32 for f8\n        - 4\n        - 2\n        - 1\n        num_res_blocks: 2\n        channel_mult:\n        - 1\n        - 2\n        - 4\n        num_head_channels: 32\n        use_max_self_attn: True # replace all full self-attention with MaxViT\n    first_stage_config:\n      target: ldm.models.autoencoder.VQFlowNetInterface\n      params:\n        ckpt_path: null # must specify pre-trained autoencoding model ckpt to train the denoising UNet\n        embed_dim: 3\n        n_embed: 8192\n        ddconfig:\n          double_z: False\n          z_channels: 3\n          resolution: 256\n          in_channels: 3\n          out_ch: 3\n          ch: 64\n          ch_mult: [1,2,2,2,4]  # f = 2 ^ len(ch_mult)\n          num_res_blocks: 1\n          cond_type: max_cross_attn\n          attn_type: max\n          attn_resolutions: [ ]\n          dropout: 0.0\n        lossconfig:\n          target: torch.nn.Identity\n    cond_stage_config: __is_first_stage__\n\n\ndata:\n  target: 
main.DataModuleFromConfig\n  params:\n    batch_size: 64\n    num_workers: 0\n    wrap: false\n    train:\n      target: ldm.data.bvi_vimeo.BVI_Vimeo_triplet\n      params:\n        db_dir: C:/data_tmp/\n        crop_sz: [256,256]\n        iter: True\n    validation:\n      target: ldm.data.bvi_vimeo.Vimeo90k_triplet\n      params:\n        db_dir: C:/data_tmp/vimeo_septuplet/\n        train: False\n        crop_sz: [256,256]\n        augment_s: False\n        augment_t: False\n\n\nlightning:\n  callbacks:\n    image_logger:\n      target: main.ImageLogger\n      params:\n        batch_frequency: 1250\n        val_batch_frequency: 125\n        max_images: 8\n        increase_log_steps: False\n        log_images_kwargs: {'N': 1}\n\n  trainer:\n    benchmark: True\n    max_epochs: -1\n"
  },
  {
    "path": "cupy_module/__init__.py",
    "content": ""
  },
  {
    "path": "cupy_module/dsepconv.py",
    "content": "import torch\n\nimport cupy\nimport re\n\n\nclass Stream:\n    ptr = torch.cuda.current_stream().cuda_stream\n\n\n# end\n\nkernel_DSepconv_updateOutput = '''\n\textern \"C\" __global__ void kernel_DSepconv_updateOutput(\n\t\tconst int n,\n\t\tconst float* input,\n\t\tconst float* vertical,\n\t\tconst float* horizontal,\n\t\tconst float* offset_x,\n\t\tconst float* offset_y,\n\t\tconst float* mask,\n\t\tfloat* output\n\t) { for (int intIndex = (blockIdx.x * blockDim.x) + threadIdx.x; intIndex < n; intIndex += blockDim.x * gridDim.x) {\n\t\tfloat dblOutput = 0.0;\n\n\t\tconst int intSample = ( intIndex / SIZE_3(output) / SIZE_2(output) / SIZE_1(output) ) % SIZE_0(output);\n\t\tconst int intDepth  = ( intIndex / SIZE_3(output) / SIZE_2(output)                  ) % SIZE_1(output);\n\t\tconst int intY      = ( intIndex / SIZE_3(output)                                   ) % SIZE_2(output);\n\t\tconst int intX      = ( intIndex                                                    ) % SIZE_3(output);\n\t\t\n\n\t\tfor (int intFilterY = 0; intFilterY < SIZE_1(vertical); intFilterY += 1) {\n\t\t\tfor (int intFilterX = 0; intFilterX < SIZE_1(horizontal); intFilterX += 1) {\n\t\t\t    float delta_x = OFFSET_4(offset_y, intSample, intFilterY*SIZE_1(vertical) + intFilterX, intY, intX);\n\t\t\t    float delta_y = OFFSET_4(offset_x, intSample, intFilterY*SIZE_1(vertical) + intFilterX, intY, intX);\n\t\t\t    \n\t\t\t    float position_x = delta_x + intX + intFilterX - (SIZE_1(horizontal) - 1) / 2 + 1;\n\t\t\t    float position_y = delta_y + intY + intFilterY - (SIZE_1(vertical) - 1) / 2 + 1;\n\t\t\t    if (position_x < 0)\n\t\t\t        position_x = 0;\n\t\t\t    if (position_x > SIZE_3(input) - 1)\n\t\t\t        position_x = SIZE_3(input) - 1;\n\t\t\t    if (position_y < 0)\n\t\t\t        position_y = 0;\n\t\t\t    if (position_y > SIZE_2(input) - 1)\n\t\t\t        position_y =  SIZE_2(input) - 1;\n\t\t\t    \n\t\t\t    int left = floor(delta_x + intX + intFilterX - 
(SIZE_1(horizontal) - 1) / 2 + 1);\n\t\t\t    int right = left + 1;\n\t\t\t    if (left < 0)\n\t\t\t        left = 0;\n\t\t\t    if (left > SIZE_3(input) - 1)\n\t\t\t        left = SIZE_3(input) - 1;\n\t\t\t    if (right < 0)\n\t\t\t        right = 0;\n\t\t\t    if (right > SIZE_3(input) - 1)\n\t\t\t        right = SIZE_3(input) - 1;\n\t\t\t    \n\t\t\t    int top = floor(delta_y + intY + intFilterY - (SIZE_1(vertical) - 1) / 2 + 1);\n\t\t\t    int bottom = top + 1;\n\t\t\t    if (top < 0)\n\t\t\t        top = 0;\n\t\t\t    if (top > SIZE_2(input) - 1)\n\t\t\t        top =  SIZE_2(input) - 1;\n\t\t\t    if (bottom < 0)\n\t\t\t        bottom = 0;   \n\t\t\t    if (bottom > SIZE_2(input) - 1)\n\t\t\t        bottom = SIZE_2(input) - 1;\n\t\t\t    \n\t\t\t    float floatValue = VALUE_4(input, intSample, intDepth, top, left) * (1 + (left - position_x)) * (1 + (top - position_y)) + \n\t\t\t                       VALUE_4(input, intSample, intDepth, top, right) * (1 - (right - position_x)) *  (1 + (top - position_y)) + \n\t\t\t                       VALUE_4(input, intSample, intDepth, bottom, left) * (1 + (left - position_x)) * (1 - (bottom - position_y)) + \n\t\t\t                       VALUE_4(input, intSample, intDepth, bottom, right) * (1 - (right - position_x)) * (1 - (bottom - position_y));\n\t\t\t                       \n\t\t\t\tdblOutput += floatValue * VALUE_4(vertical, intSample, intFilterY, intY, intX) * VALUE_4(horizontal, intSample, intFilterX, intY, intX) * VALUE_4(mask, intSample, SIZE_1(vertical)*intFilterY + intFilterX, intY, intX);\n\t\t\t}\n\t\t}\n\t\toutput[intIndex] = dblOutput;\n\t} }\n'''\n\nkernel_DSepconv_updateGradVertical = '''\n\textern \"C\" __global__ void kernel_DSepconv_updateGradVertical(\n\t\tconst int n,\n\t\tconst float* gradLoss,\n\t\tconst float* input,\n\t\tconst float* horizontal,\n\t\tconst float* offset_x,\n\t\tconst float* offset_y,\n\t\tconst float* mask,\n\t\tfloat* gradVertical\n\t) { for (int intIndex = (blockIdx.x * 
blockDim.x) + threadIdx.x; intIndex < n; intIndex += blockDim.x * gridDim.x) {\n\t\tfloat floatOutput = 0.0;\n\n\t\tconst int intSample   = ( intIndex / SIZE_3(gradVertical) / SIZE_2(gradVertical) / SIZE_1(gradVertical) ) % SIZE_0(gradVertical);\n\t\tconst int intFilterY  = ( intIndex / SIZE_3(gradVertical) / SIZE_2(gradVertical)                        ) % SIZE_1(gradVertical);\n\t\tconst int intY        = ( intIndex / SIZE_3(gradVertical)                                               ) % SIZE_2(gradVertical);\n\t\tconst int intX        = ( intIndex                                                                      ) % SIZE_3(gradVertical);\n\n\t\tfor (int intFilterX = 0; intFilterX < SIZE_1(horizontal); intFilterX += 1){\n\t\t    int intDepth = intFilterY * SIZE_1(horizontal) + intFilterX;\n\t\t    float delta_x = OFFSET_4(offset_y, intSample, intDepth, intY, intX);\n\t\t\tfloat delta_y = OFFSET_4(offset_x, intSample, intDepth, intY, intX);\n\t\t\t\n\t\t\tfloat position_x = delta_x + intX + intFilterX - (SIZE_1(horizontal) - 1) / 2 + 1;\n\t\t\tfloat position_y = delta_y + intY + intFilterY - (SIZE_1(horizontal) - 1) / 2 + 1;\n\t\t\tif (position_x < 0)\n\t\t\t    position_x = 0;\n\t\t\tif (position_x > SIZE_3(input) - 1)\n\t\t\t    position_x = SIZE_3(input) - 1;\n\t\t\tif (position_y < 0)\n\t\t\t    position_y = 0;\n\t\t\tif (position_y > SIZE_2(input) - 1)\n\t\t\t    position_y =  SIZE_2(input) - 1;\n\t\t\n\t\t\tint left = floor(delta_x + intX + intFilterX - (SIZE_1(horizontal) - 1) / 2 + 1);\n\t\t\tint right = left + 1;\n\t\t\tif (left < 0)\n\t\t\t    left = 0;\n\t\t\tif (left > SIZE_3(input) - 1)\n\t\t\t    left = SIZE_3(input) - 1;\n\t\t\tif (right < 0)\n\t\t\t    right = 0;\n\t\t\tif (right > SIZE_3(input) - 1)\n\t\t\t    right = SIZE_3(input) - 1;\n\n\t\t\tint top = floor(delta_y + intY + intFilterY - (SIZE_1(horizontal) - 1) / 2 + 1);\n\t\t\tint bottom = top + 1;\n\t\t\tif (top < 0)\n\t\t\t    top = 0;\n\t\t\tif (top > SIZE_2(input) - 1)\n\t\t\t    top =  
SIZE_2(input) - 1;\n\t\t\tif (bottom < 0)\n\t\t\t    bottom = 0;   \n\t\t\tif (bottom > SIZE_2(input) - 1)\n\t\t\t    bottom = SIZE_2(input) - 1;\n\t\t\t\n\t\t\tfloat floatSampled0 = VALUE_4(input, intSample, 0, top, left) * (1 + (left - position_x)) * (1 + (top - position_y)) + \n\t\t\t               VALUE_4(input, intSample, 0, top, right) * (1 - (right - position_x)) *  (1 + (top - position_y)) + \n\t\t\t               VALUE_4(input, intSample, 0, bottom, left) * (1 + (left - position_x)) * (1 - (bottom - position_y)) + \n\t\t\t               VALUE_4(input, intSample, 0, bottom, right) * (1 - (right - position_x)) * (1 - (bottom - position_y));\n\t\t\tfloat floatSampled1 = VALUE_4(input, intSample, 1, top, left) * (1 + (left - position_x)) * (1 + (top - position_y)) + \n\t\t\t               VALUE_4(input, intSample, 1, top, right) * (1 - (right - position_x)) *  (1 + (top - position_y)) + \n\t\t\t               VALUE_4(input, intSample, 1, bottom, left) * (1 + (left - position_x)) * (1 - (bottom - position_y)) + \n\t\t\t               VALUE_4(input, intSample, 1, bottom, right) * (1 - (right - position_x)) * (1 - (bottom - position_y));\n\t\t\tfloat floatSampled2 = VALUE_4(input, intSample, 2, top, left) * (1 + (left - position_x)) * (1 + (top - position_y)) + \n\t\t\t               VALUE_4(input, intSample, 2, top, right) * (1 - (right - position_x)) *  (1 + (top - position_y)) + \n\t\t\t               VALUE_4(input, intSample, 2, bottom, left) * (1 + (left - position_x)) * (1 - (bottom - position_y)) + \n\t\t\t               VALUE_4(input, intSample, 2, bottom, right) * (1 - (right - position_x)) * (1 - (bottom - position_y));\n\t\t\t\n\t\t\tfloatOutput += VALUE_4(gradLoss, intSample, 0, intY, intX) * floatSampled0 * VALUE_4(horizontal, intSample, intFilterX, intY, intX) * VALUE_4(mask, intSample, intDepth, intY, intX) +\n\t\t\t\t       VALUE_4(gradLoss, intSample, 1, intY, intX) * floatSampled1 * VALUE_4(horizontal, intSample, intFilterX, intY, intX) * 
VALUE_4(mask, intSample, intDepth, intY, intX) +\n\t\t\t\t       VALUE_4(gradLoss, intSample, 2, intY, intX) * floatSampled2 * VALUE_4(horizontal, intSample, intFilterX, intY, intX) * VALUE_4(mask, intSample, intDepth, intY, intX);\n\t\t}\n\t\tgradVertical[intIndex] = floatOutput;\n\t} }\n\n'''\n\nkernel_DSepconv_updateGradHorizontal = '''\n\textern \"C\" __global__ void kernel_DSepconv_updateGradHorizontal(\n\t\tconst int n,\n\t\tconst float* gradLoss,\n\t\tconst float* input,\n\t\tconst float* vertical,\n\t\tconst float* offset_x,\n\t\tconst float* offset_y,\n\t\tconst float* mask,\n\t\tfloat* gradHorizontal\n\t) { for (int intIndex = (blockIdx.x * blockDim.x) + threadIdx.x; intIndex < n; intIndex += blockDim.x * gridDim.x) {\n\t\tfloat floatOutput = 0.0;\n\n\t\tconst int intSample   = ( intIndex / SIZE_3(gradHorizontal) / SIZE_2(gradHorizontal) / SIZE_1(gradHorizontal) ) % SIZE_0(gradHorizontal);\n\t\tconst int intFilterX  = ( intIndex / SIZE_3(gradHorizontal) / SIZE_2(gradHorizontal)                          ) % SIZE_1(gradHorizontal);\n\t\tconst int intY        = ( intIndex / SIZE_3(gradHorizontal)                                                   ) % SIZE_2(gradHorizontal);\n\t\tconst int intX        = ( intIndex                                                                            ) % SIZE_3(gradHorizontal);\n\n\t\tfor (int intFilterY = 0; intFilterY < SIZE_1(vertical); intFilterY += 1){\n\t\t    int intDepth = intFilterY * SIZE_1(vertical) + intFilterX;\n\t\t    float delta_x = OFFSET_4(offset_y, intSample, intDepth, intY, intX);\n\t\t\tfloat delta_y = OFFSET_4(offset_x, intSample, intDepth, intY, intX);\n\t\t\n\t\t\tfloat position_x = delta_x + intX + intFilterX - (SIZE_1(vertical) - 1) / 2 + 1;\n\t\t\tfloat position_y = delta_y + intY + intFilterY - (SIZE_1(vertical) - 1) / 2 + 1;\n\t\t\tif (position_x < 0)\n\t\t\t    position_x = 0;\n\t\t\tif (position_x > SIZE_3(input) - 1)\n\t\t\t    position_x = SIZE_3(input) - 1;\n\t\t\tif (position_y < 
0)\n\t\t\t    position_y = 0;\n\t\t\tif (position_y > SIZE_2(input) - 1)\n\t\t\t    position_y =  SIZE_2(input) - 1;\n\t\t\n\t\t\tint left = floor(delta_x + intX + intFilterX - (SIZE_1(vertical) - 1) / 2 + 1);\n\t\t\tint right = left + 1;\n\t\t\tif (left < 0)\n\t\t\t    left = 0;\n\t\t\tif (left > SIZE_3(input) - 1)\n\t\t\t    left = SIZE_3(input) - 1;\n\t\t\tif (right < 0)\n\t\t\t    right = 0;\n\t\t\tif (right > SIZE_3(input) - 1)\n\t\t\t    right = SIZE_3(input) - 1;\n\n\t\t\tint top = floor(delta_y + intY + intFilterY - (SIZE_1(vertical) - 1) / 2 + 1);\n\t\t\tint bottom = top + 1;\n\t\t\tif (top < 0)\n\t\t\t    top = 0;\n\t\t\tif (top > SIZE_2(input) - 1)\n\t\t\t    top =  SIZE_2(input) - 1;\n\t\t\tif (bottom < 0)\n\t\t\t    bottom = 0;   \n\t\t\tif (bottom > SIZE_2(input) - 1)\n\t\t\t    bottom = SIZE_2(input) - 1;\n\t\t\t\n\t\t\tfloat floatSampled0 = VALUE_4(input, intSample, 0, top, left) * (1 + (left - position_x)) * (1 + (top - position_y)) + \n\t\t\t               VALUE_4(input, intSample, 0, top, right) * (1 - (right - position_x)) *  (1 + (top - position_y)) + \n\t\t\t               VALUE_4(input, intSample, 0, bottom, left) * (1 + (left - position_x)) * (1 - (bottom - position_y)) + \n\t\t\t               VALUE_4(input, intSample, 0, bottom, right) * (1 - (right - position_x)) * (1 - (bottom - position_y));\n\t\t\tfloat floatSampled1 = VALUE_4(input, intSample, 1, top, left) * (1 + (left - position_x)) * (1 + (top - position_y)) + \n\t\t\t               VALUE_4(input, intSample, 1, top, right) * (1 - (right - position_x)) *  (1 + (top - position_y)) + \n\t\t\t               VALUE_4(input, intSample, 1, bottom, left) * (1 + (left - position_x)) * (1 - (bottom - position_y)) + \n\t\t\t               VALUE_4(input, intSample, 1, bottom, right) * (1 - (right - position_x)) * (1 - (bottom - position_y));\n\t\t\tfloat floatSampled2 = VALUE_4(input, intSample, 2, top, left) * (1 + (left - position_x)) * (1 + (top - position_y)) + \n\t\t\t               
VALUE_4(input, intSample, 2, top, right) * (1 - (right - position_x)) *  (1 + (top - position_y)) + \n\t\t\t               VALUE_4(input, intSample, 2, bottom, left) * (1 + (left - position_x)) * (1 - (bottom - position_y)) + \n\t\t\t               VALUE_4(input, intSample, 2, bottom, right) * (1 - (right - position_x)) * (1 - (bottom - position_y));\n\t\t\t\t\n\t\t\tfloatOutput += VALUE_4(gradLoss, intSample, 0, intY, intX) * floatSampled0 * VALUE_4(vertical, intSample, intFilterY, intY, intX) * VALUE_4(mask, intSample, intDepth, intY, intX) +\n\t\t\t\t       VALUE_4(gradLoss, intSample, 1, intY, intX) * floatSampled1 * VALUE_4(vertical, intSample, intFilterY, intY, intX) * VALUE_4(mask, intSample, intDepth, intY, intX) +\n\t\t\t\t       VALUE_4(gradLoss, intSample, 2, intY, intX) * floatSampled2 * VALUE_4(vertical, intSample, intFilterY, intY, intX) * VALUE_4(mask, intSample, intDepth, intY, intX);\n\t\t}\n\t\tgradHorizontal[intIndex] = floatOutput;\n\t} }\n'''\n\nkernel_DSepconv_updateGradMask = '''\n\textern \"C\" __global__ void kernel_DSepconv_updateGradMask(\n\t\tconst int n,\n\t\tconst float* gradLoss,\n\t\tconst float* input,\n\t\tconst float* vertical,\n\t\tconst float* horizontal,\n\t\tconst float* offset_x,\n\t\tconst float* offset_y,\n\t\tfloat* gradMask\n\t) { for (int intIndex = (blockIdx.x * blockDim.x) + threadIdx.x; intIndex < n; intIndex += blockDim.x * gridDim.x) {\n\t    float floatOutput = 0.0;\n\n\t\tconst int intSample   = ( intIndex / SIZE_3(gradMask) / SIZE_2(gradMask) / SIZE_1(gradMask) ) % SIZE_0(gradMask);\n\t\tconst int intDepth    = ( intIndex / SIZE_3(gradMask) / SIZE_2(gradMask)                    ) % SIZE_1(gradMask);\n\t\tconst int intY        = ( intIndex / SIZE_3(gradMask)                                       ) % SIZE_2(gradMask);\n\t\tconst int intX        = ( intIndex                                                          ) % SIZE_3(gradMask);\n\t\t\n\t\tint intFilterY = intDepth / SIZE_1(vertical);\n        int intFilterX 
= intDepth % SIZE_1(vertical);\n        \n        float delta_x = OFFSET_4(offset_y, intSample, intDepth, intY, intX);\n\t\tfloat delta_y = OFFSET_4(offset_x, intSample, intDepth, intY, intX);\n\t\t\n\t\tfloat position_x = delta_x + intX + intFilterX - (SIZE_1(vertical) - 1) / 2 + 1;\n\t\tfloat position_y = delta_y + intY + intFilterY - (SIZE_1(vertical) - 1) / 2 + 1;\n\t\tif (position_x < 0)\n\t\t\tposition_x = 0;\n\t\tif (position_x > SIZE_3(input) - 1)\n\t\t\tposition_x = SIZE_3(input) - 1;\n\t\tif (position_y < 0)\n\t\t\tposition_y = 0;\n\t\tif (position_y > SIZE_2(input) - 1)\n\t\t\tposition_y =  SIZE_2(input) - 1;\n\t\t\n\t\tint left = floor(delta_x + intX + intFilterX - (SIZE_1(vertical) - 1) / 2 + 1);\n\t\tint right = left + 1;\n\t\tif (left < 0)\n\t\t\tleft = 0;\n\t\tif (left > SIZE_3(input) - 1)\n\t\t\tleft = SIZE_3(input) - 1;\n\t\tif (right < 0)\n\t\t\tright = 0;\n\t\tif (right > SIZE_3(input) - 1)\n\t\t\tright = SIZE_3(input) - 1;\n\n\t\tint top = floor(delta_y + intY + intFilterY - (SIZE_1(vertical) - 1) / 2 + 1);\n\t\tint bottom = top + 1;\n\t\tif (top < 0)\n\t\t\ttop = 0;\n\t\tif (top > SIZE_2(input) - 1)\n\t\t\ttop =  SIZE_2(input) - 1;\n\t\tif (bottom < 0)\n\t\t\tbottom = 0;   \n\t\tif (bottom > SIZE_2(input) - 1)\n\t\t\tbottom = SIZE_2(input) - 1;\n\t\t\n\t\tfor (int intChannel = 0; intChannel < 3; intChannel++){\n\t\t    floatOutput += VALUE_4(gradLoss, intSample, intChannel, intY, intX) * (\n\t\t                   VALUE_4(input, intSample, intChannel, top, left) * (1 + (left - position_x)) * (1 + (top - position_y)) + \n\t\t\t               VALUE_4(input, intSample, intChannel, top, right) * (1 - (right - position_x)) *  (1 + (top - position_y)) + \n\t\t\t               VALUE_4(input, intSample, intChannel, bottom, left) * (1 + (left - position_x)) * (1 - (bottom - position_y)) + \n\t\t\t               VALUE_4(input, intSample, intChannel, bottom, right) * (1 - (right - position_x)) * (1 - (bottom - position_y))\n\t\t                   ) * 
VALUE_4(vertical, intSample, intFilterY, intY, intX) * VALUE_4(horizontal, intSample, intFilterX, intY, intX);\n\t\t} \n\t\tgradMask[intIndex] = floatOutput;\n\t} }\n'''\n\nkernel_DSepconv_updateGradOffsetX = '''\n\textern \"C\" __global__ void kernel_DSepconv_updateGradOffsetX(\n\t\tconst int n,\n\t\tconst float* gradLoss,\n\t\tconst float* input,\n\t\tconst float* vertical,\n\t\tconst float* horizontal,\n\t\tconst float* offset_x,\n\t\tconst float* offset_y,\n\t\tconst float* mask,\n\t\tfloat* gradOffsetX\n\t) { for (int intIndex = (blockIdx.x * blockDim.x) + threadIdx.x; intIndex < n; intIndex += blockDim.x * gridDim.x) {\n\t    float floatOutput = 0.0;\n\n\t\tconst int intSample   = ( intIndex / SIZE_3(gradOffsetX) / SIZE_2(gradOffsetX) / SIZE_1(gradOffsetX) ) % SIZE_0(gradOffsetX);\n\t\tconst int intDepth    = ( intIndex / SIZE_3(gradOffsetX) / SIZE_2(gradOffsetX)                       ) % SIZE_1(gradOffsetX);\n\t\tconst int intY        = ( intIndex / SIZE_3(gradOffsetX)                                             ) % SIZE_2(gradOffsetX);\n\t\tconst int intX        = ( intIndex                                                                   ) % SIZE_3(gradOffsetX);\n\n\t\tint intFilterY = intDepth / SIZE_1(vertical);\n        int intFilterX = intDepth % SIZE_1(vertical);\n\n        float delta_x = OFFSET_4(offset_y, intSample, intDepth, intY, intX);\n\t\tfloat delta_y = OFFSET_4(offset_x, intSample, intDepth, intY, intX);\n\n\t\tfloat position_x = delta_x + intX + intFilterX - (SIZE_1(vertical) - 1) / 2 + 1;\n\t\tfloat position_y = delta_y + intY + intFilterY - (SIZE_1(vertical) - 1) / 2 + 1;\n\t\tif (position_x < 0)\n\t\t\tposition_x = 0;\n\t\tif (position_x > SIZE_3(input) - 1)\n\t\t\tposition_x = SIZE_3(input) - 1;\n\t\tif (position_y < 0)\n\t\t\tposition_y = 0;\n\t\tif (position_y > SIZE_2(input) - 1)\n\t\t\tposition_y =  SIZE_2(input) - 1;\n\t\t\n\t\tint left = floor(delta_x + intX + intFilterX - (SIZE_1(vertical) - 1) / 2 + 1);\n\t\tint right = left + 
1;\n\t\tif (left < 0)\n\t\t\tleft = 0;\n\t\tif (left > SIZE_3(input) - 1)\n\t\t\tleft = SIZE_3(input) - 1;\n\t\tif (right < 0)\n\t\t\tright = 0;\n\t\tif (right > SIZE_3(input) - 1)\n\t\t\tright = SIZE_3(input) - 1;\n\n\t\tint top = floor(delta_y + intY + intFilterY - (SIZE_1(vertical) - 1) / 2 + 1);\n\t\tint bottom = top + 1;\n\t\tif (top < 0)\n\t\t\ttop = 0;\n\t\tif (top > SIZE_2(input) - 1)\n\t\t\ttop =  SIZE_2(input) - 1;\n\t\tif (bottom < 0)\n\t\t\tbottom = 0;   \n\t\tif (bottom > SIZE_2(input) - 1)\n\t\t\tbottom = SIZE_2(input) - 1;\n\n\t\tfor (int intChannel = 0; intChannel < 3; intChannel++){\n\t\t\tfloatOutput += VALUE_4(gradLoss, intSample, intChannel, intY, intX) * (\n\t\t                   - VALUE_4(input, intSample, intChannel, top, left)  * (1 + (left - position_x))\n\t\t                   - VALUE_4(input, intSample, intChannel, top, right)  *  (1 - (right - position_x))\n\t\t\t               + VALUE_4(input, intSample, intChannel, bottom, left) * (1 + (left - position_x))\n\t\t\t               + VALUE_4(input, intSample, intChannel, bottom, right) * (1 - (right - position_x))\n\t\t\t               )\n\t\t                   * VALUE_4(vertical, intSample, intFilterY, intY, intX) * VALUE_4(horizontal, intSample, intFilterX, intY, intX)\n\t\t                   * VALUE_4(mask, intSample, intDepth, intY, intX);\n\t\t} \n\t\tgradOffsetX[intIndex] = floatOutput;\n\t} }\n'''\n\nkernel_DSepconv_updateGradOffsetY = '''\n\textern \"C\" __global__ void kernel_DSepconv_updateGradOffsetY(\n\t\tconst int n,\n\t\tconst float* gradLoss,\n\t\tconst float* input,\n\t\tconst float* vertical,\n\t\tconst float* horizontal,\n\t\tconst float* offset_x,\n\t\tconst float* offset_y,\n\t\tconst float* mask,\n\t\tfloat* gradOffsetY\n\t) { for (int intIndex = (blockIdx.x * blockDim.x) + threadIdx.x; intIndex < n; intIndex += blockDim.x * gridDim.x) {\n\t    float floatOutput = 0.0;\n\n\t\tconst int intSample   = ( intIndex / SIZE_3(gradOffsetX) / SIZE_2(gradOffsetX) / 
SIZE_1(gradOffsetX) ) % SIZE_0(gradOffsetX);\n\t\tconst int intDepth    = ( intIndex / SIZE_3(gradOffsetX) / SIZE_2(gradOffsetX)                       ) % SIZE_1(gradOffsetX);\n\t\tconst int intY        = ( intIndex / SIZE_3(gradOffsetX)                                             ) % SIZE_2(gradOffsetX);\n\t\tconst int intX        = ( intIndex                                                                   ) % SIZE_3(gradOffsetX);\n\n\t\tint intFilterY = intDepth / SIZE_1(vertical);\n        int intFilterX = intDepth % SIZE_1(vertical);\n\n        float delta_x = OFFSET_4(offset_y, intSample, intDepth, intY, intX);\n\t\tfloat delta_y = OFFSET_4(offset_x, intSample, intDepth, intY, intX);\n\n\t\tfloat position_x = delta_x + intX + intFilterX - (SIZE_1(vertical) - 1) / 2 + 1;\n\t\tfloat position_y = delta_y + intY + intFilterY - (SIZE_1(vertical) - 1) / 2 + 1;\n\t\tif (position_x < 0)\n\t\t\tposition_x = 0;\n\t\tif (position_x > SIZE_3(input) - 1)\n\t\t\tposition_x = SIZE_3(input) - 1;\n\t\tif (position_y < 0)\n\t\t\tposition_y = 0;\n\t\tif (position_y > SIZE_2(input) - 1)\n\t\t\tposition_y =  SIZE_2(input) - 1;\n\t\t\n\t\tint left = floor(delta_x + intX + intFilterX - (SIZE_1(vertical) - 1) / 2 + 1);\n\t\tint right = left + 1;\n\t\tif (left < 0)\n\t\t\tleft = 0;\n\t\tif (left > SIZE_3(input) - 1)\n\t\t\tleft = SIZE_3(input) - 1;\n\t\tif (right < 0)\n\t\t\tright = 0;\n\t\tif (right > SIZE_3(input) - 1)\n\t\t\tright = SIZE_3(input) - 1;\n\n\t\tint top = floor(delta_y + intY + intFilterY - (SIZE_1(vertical) - 1) / 2 + 1);\n\t\tint bottom = top + 1;\n\t\tif (top < 0)\n\t\t\ttop = 0;\n\t\tif (top > SIZE_2(input) - 1)\n\t\t\ttop =  SIZE_2(input) - 1;\n\t\tif (bottom < 0)\n\t\t\tbottom = 0;   \n\t\tif (bottom > SIZE_2(input) - 1)\n\t\t\tbottom = SIZE_2(input) - 1;\n\n\t\tfor (int intChannel = 0; intChannel < 3; intChannel++){\n\t\t    floatOutput += VALUE_4(gradLoss, intSample, intChannel, intY, intX) * (\n\t\t                   - VALUE_4(input, intSample, intChannel, 
top, left)  * (1 + (top - position_y)) \n\t\t                   + VALUE_4(input, intSample, intChannel, top, right)  *  (1 + (top - position_y)) \n\t\t\t               - VALUE_4(input, intSample, intChannel, bottom, left) * (1 - (bottom - position_y)) \n\t\t\t               + VALUE_4(input, intSample, intChannel, bottom, right) * (1 - (bottom - position_y))\n\t\t\t               )\n\t\t                   * VALUE_4(vertical, intSample, intFilterY, intY, intX) * VALUE_4(horizontal, intSample, intFilterX, intY, intX)\n\t\t                   * VALUE_4(mask, intSample, intDepth, intY, intX);\n\t\t} \n\t\tgradOffsetY[intIndex] = floatOutput;\n\t} }\n'''\n\n\ndef cupy_kernel(strFunction, objectVariables):\n    strKernel = globals()[strFunction]\n\n    while True:\n        objectMatch = re.search('(SIZE_)([0-4])(\\()([^\\)]*)(\\))', strKernel)\n\n        if objectMatch is None:\n            break\n        # end\n\n        intArg = int(objectMatch.group(2))\n\n        strTensor = objectMatch.group(4)\n        intSizes = objectVariables[strTensor].size()\n\n        strKernel = strKernel.replace(objectMatch.group(), str(intSizes[intArg]))\n    # end\n\n    while True:\n        objectMatch = re.search('(VALUE_)([0-4])(\\()([^\\)]+)(\\))', strKernel)\n\n        if objectMatch is None:\n            break\n        # end\n\n        intArgs = int(objectMatch.group(2))\n        strArgs = objectMatch.group(4).split(',')\n\n        strTensor = strArgs[0]\n        intStrides = objectVariables[strTensor].stride()\n        strIndex = ['((' + strArgs[intArg + 1].replace('{', '(').replace('}', ')').strip() + ')*' + str(\n            intStrides[intArg]) + ')' for intArg in range(intArgs)]\n\n        strKernel = strKernel.replace(objectMatch.group(0), strTensor + '[' + str.join('+', strIndex) + ']')\n    # end\n\n    while True:\n        objectMatch = re.search('(OFFSET_)([0-4])(\\()([^\\)]+)(\\))', strKernel)\n\n        if objectMatch is None:\n            break\n        # end\n\n        
intArgs = int(objectMatch.group(2))\n        strArgs = objectMatch.group(4).split(',')\n\n        strTensor = strArgs[0]\n        intStrides = objectVariables[strTensor].stride()\n        strIndex = ['((' + strArgs[intArg + 1].replace('{', '(').replace('}', ')').strip() + ')*' + str(\n            intStrides[intArg]) + ')' for intArg in range(intArgs)]\n\n        strKernel = strKernel.replace(objectMatch.group(0), strTensor + '[' + str.join('+', strIndex) + ']')\n    # end\n\n    return strKernel\n\n\n# end\n\n@cupy.memoize(for_each_device=True)\ndef cupy_launch(strFunction, strKernel):\n    # return cupy.cuda.compile_with_cache(strKernel).get_function(strFunction)\n    return cupy.RawKernel(strKernel, strFunction)\n\n\n# end\n\nclass _FunctionDSepconv(torch.autograd.Function):\n    @staticmethod\n    def forward(self, input, vertical, horizontal, offset_x, offset_y, mask):\n        self.save_for_backward(input, vertical, horizontal, offset_x, offset_y, mask)\n\n        intSample = input.size(0)\n        intInputDepth = input.size(1)\n        intInputHeight = input.size(2)\n        intInputWidth = input.size(3)\n        intFilterSize = min(vertical.size(1), horizontal.size(1))\n        intOutputHeight = min(vertical.size(2), horizontal.size(2))\n        intOutputWidth = min(vertical.size(3), horizontal.size(3))\n\n        assert (intInputHeight == intOutputHeight + intFilterSize - 1)\n        assert (intInputWidth == intOutputWidth + intFilterSize - 1)\n\n        assert (input.is_contiguous() == True)\n        assert (vertical.is_contiguous() == True)\n        assert (horizontal.is_contiguous() == True)\n        assert (offset_x.is_contiguous() == True)\n        assert (offset_y.is_contiguous() == True)\n        assert (mask.is_contiguous() == True)\n\n        output = input.new_zeros([intSample, intInputDepth, intOutputHeight, intOutputWidth])\n\n        if input.is_cuda == True:\n            n = output.nelement()\n            
cupy_launch('kernel_DSepconv_updateOutput', cupy_kernel('kernel_DSepconv_updateOutput', {\n                'input': input,\n                'vertical': vertical,\n                'horizontal': horizontal,\n                'offset_x': offset_x,\n                'offset_y': offset_y,\n                'mask': mask,\n                'output': output\n            }))(\n                grid=tuple([int((n + 512 - 1) / 512), 1, 1]),\n                block=tuple([512, 1, 1]),\n                args=[n, input.data_ptr(), vertical.data_ptr(), horizontal.data_ptr(), offset_x.data_ptr(), offset_y.data_ptr(),\n                      mask.data_ptr(), output.data_ptr()],\n                stream=Stream\n            )\n\n        elif input.is_cuda == False:\n            raise NotImplementedError()\n\n        # end\n\n        return output\n\n    # end\n\n    @staticmethod\n    def backward(self, gradOutput):\n        input, vertical, horizontal, offset_x, offset_y, mask = self.saved_tensors\n\n        intSample = input.size(0)\n        intInputDepth = input.size(1)\n        intInputHeight = input.size(2)\n        intInputWidth = input.size(3)\n        intFilterSize = min(vertical.size(1), horizontal.size(1))\n        intOutputHeight = min(vertical.size(2), horizontal.size(2))\n        intOutputWidth = min(vertical.size(3), horizontal.size(3))\n\n        assert (intInputHeight == intOutputHeight + intFilterSize - 1)\n        assert (intInputWidth == intOutputWidth + intFilterSize - 1)\n\n        assert (gradOutput.is_contiguous() == True)\n\n        gradInput = input.new_zeros([intSample, intInputDepth, intInputHeight, intInputWidth]) if \\\n            self.needs_input_grad[0] == True else None\n        gradVertical = input.new_zeros([intSample, intFilterSize, intOutputHeight, intOutputWidth]) if \\\n            self.needs_input_grad[1] == True else None\n        gradHorizontal = input.new_zeros([intSample, intFilterSize, intOutputHeight, intOutputWidth]) if \\\n            
self.needs_input_grad[2] == True else None\n        gradOffsetX = input.new_zeros([intSample, intFilterSize * intFilterSize, intOutputHeight, intOutputWidth]) if \\\n            self.needs_input_grad[3] == True else None\n        gradOffsetY = input.new_zeros([intSample, intFilterSize * intFilterSize, intOutputHeight, intOutputWidth]) if \\\n            self.needs_input_grad[4] == True else None\n        gradMask = input.new_zeros([intSample, intFilterSize * intFilterSize, intOutputHeight, intOutputWidth]) if \\\n            self.needs_input_grad[5] == True else None\n\n        if input.is_cuda == True:\n            nv = gradVertical.nelement()\n            cupy_launch('kernel_DSepconv_updateGradVertical', cupy_kernel('kernel_DSepconv_updateGradVertical', {\n                'gradLoss': gradOutput,\n                'input': input,\n                'horizontal': horizontal,\n                'offset_x': offset_x,\n                'offset_y': offset_y,\n                'mask': mask,\n                'gradVertical': gradVertical\n            }))(\n                grid=tuple([int((nv + 512 - 1) / 512), 1, 1]),\n                block=tuple([512, 1, 1]),\n                args=[nv, gradOutput.data_ptr(), input.data_ptr(), horizontal.data_ptr(), offset_x.data_ptr(),\n                      offset_y.data_ptr(), mask.data_ptr(), gradVertical.data_ptr()],\n                stream=Stream\n            )\n\n            nh = gradHorizontal.nelement()\n            cupy_launch('kernel_DSepconv_updateGradHorizontal', cupy_kernel('kernel_DSepconv_updateGradHorizontal', {\n                'gradLoss': gradOutput,\n                'input': input,\n                'vertical': vertical,\n                'offset_x': offset_x,\n                'offset_y': offset_y,\n                'mask': mask,\n                'gradHorizontal': gradHorizontal\n            }))(\n                grid=tuple([int((nh + 512 - 1) / 512), 1, 1]),\n                block=tuple([512, 1, 1]),\n                args=[nh, 
gradOutput.data_ptr(), input.data_ptr(), vertical.data_ptr(), offset_x.data_ptr(),\n                      offset_y.data_ptr(), mask.data_ptr(), gradHorizontal.data_ptr()],\n                stream=Stream\n            )\n\n            nx = gradOffsetX.nelement()\n            cupy_launch('kernel_DSepconv_updateGradOffsetX', cupy_kernel('kernel_DSepconv_updateGradOffsetX', {\n                'gradLoss': gradOutput,\n                'input': input,\n                'vertical': vertical,\n                'horizontal': horizontal,\n                'offset_x': offset_x,\n                'offset_y': offset_y,\n                'mask': mask,\n                'gradOffsetX': gradOffsetX\n            }))(\n                grid=tuple([int((nx + 512 - 1) / 512), 1, 1]),\n                block=tuple([512, 1, 1]),\n                args=[nx, gradOutput.data_ptr(), input.data_ptr(), vertical.data_ptr(), horizontal.data_ptr(), offset_x.data_ptr(),\n                      offset_y.data_ptr(), mask.data_ptr(), gradOffsetX.data_ptr()],\n                stream=Stream\n            )\n\n            ny = gradOffsetY.nelement()\n            cupy_launch('kernel_DSepconv_updateGradOffsetY', cupy_kernel('kernel_DSepconv_updateGradOffsetY', {\n                'gradLoss': gradOutput,\n                'input': input,\n                'vertical': vertical,\n                'horizontal': horizontal,\n                'offset_x': offset_x,\n                'offset_y': offset_y,\n                'mask': mask,\n                'gradOffsetX': gradOffsetY\n            }))(\n                grid=tuple([int((ny + 512 - 1) / 512), 1, 1]),\n                block=tuple([512, 1, 1]),\n                args=[ny, gradOutput.data_ptr(), input.data_ptr(), vertical.data_ptr(), horizontal.data_ptr(),\n                      offset_x.data_ptr(),\n                      offset_y.data_ptr(), mask.data_ptr(), gradOffsetY.data_ptr()],\n                stream=Stream\n            )\n\n            nm = gradMask.nelement()\n        
    cupy_launch('kernel_DSepconv_updateGradMask', cupy_kernel('kernel_DSepconv_updateGradMask', {\n                'gradLoss': gradOutput,\n                'input': input,\n                'vertical': vertical,\n                'horizontal': horizontal,\n                'offset_x': offset_x,\n                'offset_y': offset_y,\n                'gradMask': gradMask\n            }))(\n                grid=tuple([int((nm + 512 - 1) / 512), 1, 1]),\n                block=tuple([512, 1, 1]),\n                args=[nm, gradOutput.data_ptr(), input.data_ptr(), vertical.data_ptr(), horizontal.data_ptr(),\n                      offset_x.data_ptr(),\n                      offset_y.data_ptr(), gradMask.data_ptr()],\n                stream=Stream\n            )\n\n        elif input.is_cuda == False:\n            raise NotImplementedError()\n\n        # end\n\n        return gradInput, gradVertical, gradHorizontal, gradOffsetX, gradOffsetY, gradMask\n\n\n# end\n# end\n\ndef FunctionDSepconv(tensorInput, tensorVertical, tensorHorizontal, tensorOffsetX, tensorOffsetY, tensorMask):\n    return _FunctionDSepconv.apply(tensorInput, tensorVertical, tensorHorizontal, tensorOffsetX, tensorOffsetY, tensorMask)\n\n\n# end\n\nclass ModuleDSepconv(torch.nn.Module):\n    def __init__(self):\n        super(ModuleDSepconv, self).__init__()\n\n    # end\n\n    def forward(self, tensorInput, tensorVertical, tensorHorizontal, tensorOffsetX, tensorOffsetY, tensorMask):\n        return _FunctionDSepconv.apply(tensorInput, tensorVertical, tensorHorizontal, tensorOffsetX, tensorOffsetY, tensorMask)\n# end\n# end\n\n# float floatValue = VALUE_4(input, intSample, intDepth, top, left) * (1 - (delta_x - floor(delta_x))) * (1 - (delta_y - floor(delta_y))) +\n# \t\t\t                       VALUE_4(input, intSample, intDepth, top, right) * (delta_x - floor(delta_x)) *  (1 - (delta_y - floor(delta_y))) +\n# \t\t\t                       VALUE_4(input, intSample, intDepth, bottom, left) * (1 - (delta_x - 
floor(delta_x))) * (delta_y - floor(delta_y)) +\n# \t\t\t                       VALUE_4(input, intSample, intDepth, bottom, right) * (delta_x - floor(delta_x)) * (delta_y - floor(delta_y));"
  },
  {
    "path": "environment.yaml",
    "content": "name: ldmvfi\nchannels:\n  - pytorch\n  - defaults\n  - conda-forge\ndependencies:\n  - python=3.9.13\n  - pytorch=1.11.0\n  - torchvision=0.12.0\n  - cudatoolkit=11.3\n  - pip:\n    - opencv-python==4.6.0.66\n    - pudb==2022.1.3\n    - imageio==2.22.3\n    - imageio-ffmpeg==0.4.7\n    - pytorch-lightning==1.7.7\n    - omegaconf==2.2.3\n    - test-tube==0.7.5\n    - streamlit==1.14.0\n    - einops==0.5.0\n    - torch-fidelity==0.3.0\n    - transformers==4.23.1\n    - timm==0.6.12\n    - cupy\n    - -e git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers\n    - -e git+https://github.com/openai/CLIP.git@main#egg=clip\n    - -e .\n\n# conda create -n ldmvfi python=3.9\n# conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch\n# pip install opencv-python==4.6.0.66 pudb==2022.1.3 imageio==2.22.3 imageio-ffmpeg==0.4.7 pytorch-lightning==1.7.7 omegaconf==2.2.3  test-tube==0.7.5 streamlit==1.14.0  einops==0.5.0 torch-fidelity==0.3.0 transformers==4.23.1\n# pip install -e git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers\n# pip install -e git+https://github.com/openai/CLIP.git@main#egg=clip\n# pip install -e .\n# pip install timm"
  },
  {
    "path": "evaluate.py",
    "content": "import argparse\nimport os\nimport torch\nfrom functools import partial\nfrom omegaconf import OmegaConf\nfrom main import instantiate_from_config\nfrom ldm.models.diffusion.ddim import DDIMSampler\nfrom ldm.data import testsets\n\n\nparser = argparse.ArgumentParser(description='Frame Interpolation Evaluation')\n\nparser.add_argument('--config', type=str, default=None)\nparser.add_argument('--ckpt', type=str, default=None)\nparser.add_argument('--dataset', type=str, default='Middlebury_others')\nparser.add_argument('--metrics', nargs='+', type=str, default=['PSNR', 'SSIM', 'LPIPS'])\nparser.add_argument('--data_dir', type=str, default='D:\\\\')\nparser.add_argument('--out_dir', type=str, default='eval_results')\nparser.add_argument('--resume', dest='resume', default=False, action='store_true')\n\n# sampler args\nparser.add_argument('--use_ddim', dest='use_ddim', default=False, action='store_true')\nparser.add_argument('--ddim_eta', type=float, default=1.0)\nparser.add_argument('--ddim_steps', type=int, default=200)\n\ndef main():\n\n    args = parser.parse_args()\n    \n    # initialise model\n    config = OmegaConf.load(args.config)\n    model = instantiate_from_config(config.model)\n    model.load_state_dict(torch.load(args.ckpt)['state_dict'])\n    device = torch.device(\"cuda\") if torch.cuda.is_available() else torch.device(\"cpu\")\n    model = model.to(device)\n    model = model.eval()\n    print('Model loaded successfully')\n\n    # set up sampler\n    if args.use_ddim:\n        ddim = DDIMSampler(model)\n        sample_func = partial(ddim.sample, S=args.ddim_steps, eta=args.ddim_eta, verbose=False)\n    else:\n        sample_func = partial(model.sample_ddpm, return_intermediates=False, verbose=False)\n\n    # setup output dirs\n    if not os.path.exists(args.out_dir):\n        os.makedirs(args.out_dir)\n\n    # initialise test set\n    print('Testing on dataset: ', args.dataset)\n    test_dir = os.path.join(args.out_dir, args.dataset)\n    
if args.dataset.split('_')[0] in ['VFITex', 'Ucf101', 'Davis90']:\n        db_folder = args.dataset.split('_')[0].lower()\n    else:\n        db_folder = args.dataset.lower()\n    test_db = getattr(testsets, args.dataset)(os.path.join(args.data_dir, db_folder))\n    if not os.path.exists(test_dir):\n        os.mkdir(test_dir)\n    test_db.eval(model, sample_func, metrics=args.metrics, output_dir=test_dir, resume=args.resume)\n\n\n\nif __name__ == '__main__':\n    main()"
  },
  {
    "path": "evaluate_vqm.py",
    "content": "import argparse\nimport os\nfrom ldm.data import testsets_vqm\n\n\nparser = argparse.ArgumentParser(description='Frame Interpolation Evaluation')\n\nparser.add_argument('--exp', type=str, default=None)\nparser.add_argument('--dataset', type=str, default='Middlebury_others')\nparser.add_argument('--metrics', nargs='+', type=str, default=['FloLPIPS'])\nparser.add_argument('--data_dir', type=str, default='D:\\\\')\nparser.add_argument('--out_dir', type=str, default='eval_results')\nparser.add_argument('--resume', dest='resume', default=False, action='store_true')\n\n\ndef main():\n\n    args = parser.parse_args()\n    \n    # initialise model\n    model = args.exp\n    print('Evaluating model:', model)\n\n    # setup output dirs\n    assert os.path.exists(args.out_dir), 'Frames not previously interpolated!'\n    \n    # initialise test set\n    print('Testing on dataset: ', args.dataset)\n    test_dir = os.path.join(args.out_dir, args.dataset)\n    assert os.path.exists(test_dir), f'{args.dataset} not pre-computed!'\n\n    if args.dataset.split('_')[0] in ['VFITex', 'Ucf101', 'Davis90']:\n        db_folder = args.dataset.split('_')[0].lower()\n    else:\n        db_folder = args.dataset.lower()\n\n    test_db = getattr(testsets_vqm, args.dataset)(os.path.join(args.data_dir, db_folder))\n    test_db.eval(metrics=args.metrics, output_dir=test_dir, resume=args.resume)\n\n\n\nif __name__ == '__main__':\n    main()"
  },
  {
    "path": "interpolate_yuv.py",
    "content": "import argparse\nimport torch\nimport torchvision.transforms.functional as TF\nimport os\nfrom PIL import Image\nfrom tqdm import tqdm\nimport skvideo.io\nfrom functools import partial\nfrom utility import read_frame_yuv2rgb, tensor2rgb\nfrom omegaconf import OmegaConf\nfrom main import instantiate_from_config\nfrom ldm.models.diffusion.ddim import DDIMSampler\n\n\nparser = argparse.ArgumentParser(description='Frame Interpolation Evaluation')\n\nparser.add_argument('--net', type=str, default='LDMVFI')\nparser.add_argument('--config', type=str, default='configs/ldm/ldmvfi-vqflow-f32-c256-concat_max.yaml')\nparser.add_argument('--ckpt', type=str, default='ckpt.pth')\nparser.add_argument('--input_yuv', type=str, default='D:\\\\')\nparser.add_argument('--size', type=str, default='1920x1080')\nparser.add_argument('--out_fps', type=int, default=60)\nparser.add_argument('--out_dir', type=str, default='.')\n\n# sampler args\nparser.add_argument('--use_ddim', dest='use_ddim', default=False, action='store_true')\nparser.add_argument('--ddim_eta', type=float, default=1.0)\nparser.add_argument('--ddim_steps', type=int, default=200)\n\n\ndef main():\n    args = parser.parse_args()\n\n    # initialise model\n    config = OmegaConf.load(args.config)\n    model = instantiate_from_config(config.model)\n    model.load_state_dict(torch.load(args.ckpt)['state_dict'])\n    device = torch.device(\"cuda\") if torch.cuda.is_available() else torch.device(\"cpu\")\n    model = model.to(device)\n    model = model.eval()\n    print('Model loaded successfully')\n\n    # set up sampler\n    if args.use_ddim:\n        ddim = DDIMSampler(model)\n        sample_func = partial(ddim.sample, S=args.ddim_steps, eta=args.ddim_eta, verbose=False)\n    else:\n        sample_func = partial(model.sample_ddpm, return_intermediates=False, verbose=False)\n\n    # Setup output file\n    if not os.path.exists(args.out_dir):\n        os.makedirs(args.out_dir)\n    _, fname = 
os.path.split(args.input_yuv)\n    seq_name = os.path.splitext(fname)[0]  # str.strip('.yuv') would strip characters, not the suffix\n    width, height = args.size.split('x')\n    bit_depth = 16 if '16bit' in fname else 10 if '10bit' in fname else 8\n    pix_fmt = '444' if '444' in fname else '420'\n    try:\n        width = int(width)\n        height = int(height)\n    except ValueError:\n        print('Invalid size, should be \\'<width>x<height>\\'')\n        return \n\n    outname = '{}_{}x{}_{}fps_{}.mp4'.format(seq_name, width, height, args.out_fps, args.net)\n    writer = skvideo.io.FFmpegWriter(os.path.join(args.out_dir, outname), \n        inputdict={\n            '-r': str(args.out_fps)\n        },\n        outputdict={\n            '-pix_fmt': 'yuv420p',\n            '-s': '{}x{}'.format(width,height),\n            '-r': str(args.out_fps),\n            '-vcodec': 'libx264',  #use the h.264 codec\n            '-crf': '0',           #set the constant rate factor to 0, which is lossless\n            '-preset':'veryslow'   #the slower the better compression, in principle, try \n                                #other options see https://trac.ffmpeg.org/wiki/Encode/H.264\n        }\n    ) \n\n    # Start interpolation\n    print('Using model {} to upsample file {}'.format(args.net, fname))\n    stream = open(args.input_yuv, 'rb')  # raw YUV is binary data\n    file_size = os.path.getsize(args.input_yuv)\n\n    # YUV reading setup\n    bytes_per_frame = width*height*1.5\n    if pix_fmt == '444':\n        bytes_per_frame *= 2\n    if bit_depth != 8:\n        bytes_per_frame *= 2\n\n    num_frames = int(file_size // bytes_per_frame)\n    rawFrame0 = Image.fromarray(read_frame_yuv2rgb(stream, width, height, 0, bit_depth, pix_fmt))\n    frame0 = TF.normalize(TF.to_tensor(rawFrame0), (0.5, 0.5, 0.5), (0.5, 0.5, 0.5))[None,...].cuda()\n    for t in tqdm(range(1, num_frames)):\n        rawFrame1 = Image.fromarray(read_frame_yuv2rgb(stream, width, height, t, bit_depth, pix_fmt))\n        frame1 = TF.normalize(TF.to_tensor(rawFrame1), (0.5, 0.5, 0.5), (0.5, 0.5, 
0.5))[None,...].cuda()\n\n        with torch.no_grad():\n            with model.ema_scope():\n                # form condition tensor and define shape of latent rep\n                xc = {'prev_frame': frame0, 'next_frame': frame1}\n                c, phi_prev_list, phi_next_list = model.get_learned_conditioning(xc)\n                shape = (model.channels, c.shape[2], c.shape[3])\n                # run sampling and get denoised latent\n                out = sample_func(conditioning=c, batch_size=c.shape[0], shape=shape)\n                if isinstance(out, tuple): # using ddim\n                    out = out[0]\n                # reconstruct interpolated frame from latent\n                out = model.decode_first_stage(out, xc, phi_prev_list, phi_next_list)\n                out =  torch.clamp(out, min=-1., max=1.) # interpolated frame in [-1,1]\n\n        # write to output video\n        writer.writeFrame(tensor2rgb(frame0)[0])\n        writer.writeFrame(tensor2rgb(out)[0])\n\n        # update frame0\n        frame0 = frame1\n    \n    # write the last frame\n    writer.writeFrame(tensor2rgb(frame1)[0])\n\n    stream.close()\n    writer.close() # close the writer\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "ldm/data/__init__.py",
    "content": ""
  },
  {
    "path": "ldm/data/bvi_vimeo.py",
    "content": "import numpy as np\nimport random\nfrom os import listdir\nfrom os.path import join, isdir, split, getsize\nfrom torch.utils.data import Dataset\nimport torchvision.transforms.functional as TF\nfrom PIL import Image\nimport ldm.data.vfitransforms as vt\nfrom functools import partial\n\nclass Vimeo90k_triplet(Dataset):\n    def __init__(self, db_dir, train=True,  crop_sz=(256,256), augment_s=True, augment_t=True):\n        seq_dir = join(db_dir, 'sequences')\n        self.crop_sz = crop_sz\n        self.augment_s = augment_s\n        self.augment_t = augment_t\n\n        if train:\n            seq_list_txt = join(db_dir, 'sep_trainlist.txt')\n        else:\n            seq_list_txt = join(db_dir, 'sep_testlist.txt')\n\n        with open(seq_list_txt) as f:\n            contents = f.readlines()\n            seq_path = [line.strip() for line in contents if line != '\\n']\n\n        self.seq_path_list = [join(seq_dir, *line.split('/')) for line in seq_path]\n\n    def __getitem__(self, index):\n        rawFrame3 = Image.open(join(self.seq_path_list[index],  \"im3.png\"))\n        rawFrame4 = Image.open(join(self.seq_path_list[index],  \"im4.png\"))\n        rawFrame5 = Image.open(join(self.seq_path_list[index],  \"im5.png\"))\n\n        if self.crop_sz is not None:\n            rawFrame3, rawFrame4, rawFrame5 = vt.rand_crop(rawFrame3, rawFrame4, rawFrame5, sz=self.crop_sz)\n\n        if self.augment_s:\n            rawFrame3, rawFrame4, rawFrame5 = vt.rand_flip(rawFrame3, rawFrame4, rawFrame5, p=0.5)\n        \n        if self.augment_t:\n            rawFrame3, rawFrame4, rawFrame5 = vt.rand_reverse(rawFrame3, rawFrame4, rawFrame5, p=0.5)\n\n        to_array = partial(np.array, dtype=np.float32)\n        frame3, frame4, frame5 = map(to_array, (rawFrame3, rawFrame4, rawFrame5)) #(256,256,3), 0-255\n\n        frame3 = frame3/127.5 - 1.0\n        frame4 = frame4/127.5 - 1.0\n        frame5 = frame5/127.5 - 1.0\n\n        return {'image': frame4, 
'prev_frame': frame3, 'next_frame': frame5}\n\n    def __len__(self):\n        return len(self.seq_path_list)\n\n\nclass Vimeo90k_quintuplet(Dataset):\n    def __init__(self, db_dir, train=True,  crop_sz=(256,256), augment_s=True, augment_t=True):\n        seq_dir = join(db_dir, 'sequences')\n        self.crop_sz = crop_sz\n        self.augment_s = augment_s\n        self.augment_t = augment_t\n\n        if train:\n            seq_list_txt = join(db_dir, 'sep_trainlist.txt')\n        else:\n            seq_list_txt = join(db_dir, 'sep_testlist.txt')\n\n        with open(seq_list_txt) as f:\n            contents = f.readlines()\n            seq_path = [line.strip() for line in contents if line != '\\n']\n\n        self.seq_path_list = [join(seq_dir, *line.split('/')) for line in seq_path]\n\n    def __getitem__(self, index):\n        rawFrame1 = Image.open(join(self.seq_path_list[index],  \"im1.png\"))\n        rawFrame3 = Image.open(join(self.seq_path_list[index],  \"im3.png\"))\n        rawFrame4 = Image.open(join(self.seq_path_list[index],  \"im4.png\"))\n        rawFrame5 = Image.open(join(self.seq_path_list[index],  \"im5.png\"))\n        rawFrame7 = Image.open(join(self.seq_path_list[index],  \"im7.png\"))\n\n        if self.crop_sz is not None:\n            rawFrame1, rawFrame3, rawFrame4, rawFrame5, rawFrame7 = vt.rand_crop(rawFrame1, rawFrame3, rawFrame4, rawFrame5, rawFrame7, sz=self.crop_sz)\n\n        if self.augment_s:\n            rawFrame1, rawFrame3, rawFrame4, rawFrame5, rawFrame7 = vt.rand_flip(rawFrame1, rawFrame3, rawFrame4, rawFrame5, rawFrame7, p=0.5)\n        \n        if self.augment_t:\n            rawFrame1, rawFrame3, rawFrame4, rawFrame5, rawFrame7 = vt.rand_reverse(rawFrame1, rawFrame3, rawFrame4, rawFrame5, rawFrame7, p=0.5)\n\n        frame1, frame3, frame4, frame5, frame7 = map(TF.to_tensor, (rawFrame1, rawFrame3, rawFrame4, rawFrame5, rawFrame7))\n\n        return frame1, frame3, frame4, frame5, frame7\n\n    def __len__(self):\n     
   return len(self.seq_path_list)\n\n    \nclass BVIDVC_triplet(Dataset):\n    def __init__(self, db_dir, res=None, crop_sz=(256,256), augment_s=True, augment_t=True):\n\n        db_dir = join(db_dir, 'quintuplets')\n        self.crop_sz = crop_sz\n        self.augment_s = augment_s\n        self.augment_t = augment_t\n        self.seq_path_list = [join(db_dir, f) for f in listdir(db_dir)]\n\n    def __getitem__(self, index):\n\n        cat = Image.open(join(self.seq_path_list[index], 'quintuplet.png'))\n\n        rawFrame3 = cat.crop((256, 0, 256*2, 256))\n        rawFrame5 = cat.crop((256*2, 0, 256*3, 256))\n        rawFrame4 = cat.crop((256*4, 0, 256*5, 256))\n\n        if self.crop_sz is not None:\n            rawFrame3, rawFrame4, rawFrame5 = vt.rand_crop(rawFrame3, rawFrame4, rawFrame5, sz=self.crop_sz)\n\n        if self.augment_s:\n            rawFrame3, rawFrame4, rawFrame5 = vt.rand_flip(rawFrame3, rawFrame4, rawFrame5, p=0.5)\n        \n        if self.augment_t:\n            rawFrame3, rawFrame4, rawFrame5 = vt.rand_reverse(rawFrame3, rawFrame4, rawFrame5, p=0.5)\n\n        to_array = partial(np.array, dtype=np.float32)\n        frame3, frame4, frame5 = map(to_array, (rawFrame3, rawFrame4, rawFrame5)) #(256,256,3), 0-255\n\n        frame3 = frame3/127.5 - 1.0\n        frame4 = frame4/127.5 - 1.0\n        frame5 = frame5/127.5 - 1.0\n\n        return {'image': frame4, 'prev_frame': frame3, 'next_frame': frame5}\n\n    def __len__(self):\n        return len(self.seq_path_list)\n\n\nclass BVIDVC_quintuplet(Dataset):\n    def __init__(self, db_dir, res=None, crop_sz=(256,256), augment_s=True, augment_t=True):\n\n        db_dir = join(db_dir, 'quintuplets')\n        self.crop_sz = crop_sz\n        self.augment_s = augment_s\n        self.augment_t = augment_t\n        self.seq_path_list = [join(db_dir, f) for f in listdir(db_dir)]\n\n    def __getitem__(self, index):\n\n        cat = Image.open(join(self.seq_path_list[index], 'quintuplet.png'))\n\n        
rawFrame1 = cat.crop((0, 0, 256, 256))\n        rawFrame3 = cat.crop((256, 0, 256*2, 256))\n        rawFrame5 = cat.crop((256*2, 0, 256*3, 256))\n        rawFrame7 = cat.crop((256*3, 0, 256*4, 256))\n        rawFrame4 = cat.crop((256*4, 0, 256*5, 256))\n\n        if self.augment_s:\n            rawFrame1, rawFrame3, rawFrame4, rawFrame5, rawFrame7 = vt.rand_flip(rawFrame1, rawFrame3, rawFrame4, rawFrame5, rawFrame7, p=0.5)\n        \n        if self.augment_t:\n            rawFrame1, rawFrame3, rawFrame4, rawFrame5, rawFrame7 = vt.rand_reverse(rawFrame1, rawFrame3, rawFrame4, rawFrame5, rawFrame7, p=0.5)\n\n        frame1, frame3, frame4, frame5, frame7 = map(TF.to_tensor, (rawFrame1, rawFrame3, rawFrame4, rawFrame5, rawFrame7))\n\n        return frame1, frame3, frame4, frame5, frame7\n\n    def __len__(self):\n        return len(self.seq_path_list)\n\n\nclass Sampler(Dataset):\n    def __init__(self, datasets, p_datasets=None, iter=False, samples_per_epoch=1000):\n        self.datasets = datasets\n        self.len_datasets = np.array([len(dataset) for dataset in self.datasets])\n        self.p_datasets = p_datasets\n        self.iter = iter\n\n        if p_datasets is None:\n            self.p_datasets = self.len_datasets / np.sum(self.len_datasets)\n\n        self.samples_per_epoch = samples_per_epoch\n\n        self.accum = [0,]\n        for i, length in enumerate(self.len_datasets):\n            self.accum.append(self.accum[-1] + self.len_datasets[i])\n\n    def __getitem__(self, index):\n        if self.iter:\n            # iterate through all datasets\n            for i in range(len(self.accum)):\n                if index < self.accum[i]:\n                    return self.datasets[i-1].__getitem__(index-self.accum[i-1])\n        else:\n            # first sample a dataset\n            dataset = random.choices(self.datasets, self.p_datasets)[0]\n            # sample a sequence from the dataset\n            return 
dataset.__getitem__(random.randint(0,len(dataset)-1))\n            \n\n    def __len__(self):\n        if self.iter:\n            return int(np.sum(self.len_datasets))\n        else:\n            return self.samples_per_epoch\n\n\nclass BVI_Vimeo_triplet(Dataset):\n    def __init__(self, db_dir, crop_sz=(256,256), p_datasets=None, iter=False, samples_per_epoch=1000):\n        vimeo90k_train = Vimeo90k_triplet(join(db_dir, 'vimeo_septuplet'), train=True,  crop_sz=crop_sz)\n        bvidvc_train = BVIDVC_triplet(join(db_dir, 'bvidvc'), crop_sz=crop_sz)\n\n        self.datasets = [vimeo90k_train, bvidvc_train]\n        self.len_datasets = np.array([len(dataset) for dataset in self.datasets])\n        self.p_datasets = p_datasets\n        self.iter = iter\n\n        if p_datasets is None:\n            self.p_datasets = self.len_datasets / np.sum(self.len_datasets)\n\n        self.samples_per_epoch = samples_per_epoch\n\n        self.accum = [0,]\n        for i, length in enumerate(self.len_datasets):\n            self.accum.append(self.accum[-1] + self.len_datasets[i])\n\n    def __getitem__(self, index):\n        if self.iter:\n            # iterate through all datasets\n            for i in range(len(self.accum)):\n                if index < self.accum[i]:\n                    return self.datasets[i-1].__getitem__(index-self.accum[i-1])\n        else:\n            # first sample a dataset\n            dataset = random.choices(self.datasets, self.p_datasets)[0]\n            # sample a sequence from the dataset\n            return dataset.__getitem__(random.randint(0,len(dataset)-1))\n            \n\n    def __len__(self):\n        if self.iter:\n            return int(np.sum(self.len_datasets))\n        else:\n            return self.samples_per_epoch"
  },
  {
    "path": "ldm/data/testsets.py",
    "content": "import glob\nfrom typing import List\nfrom PIL import Image\nimport torch\nfrom torchvision import transforms\nimport torchvision.transforms.functional as TF\nfrom torchvision.utils import save_image as imwrite\nimport os\nfrom os.path import join, exists\nimport utility\nimport numpy as np\nimport ast\nimport time\nfrom ldm.models.autoencoder import * \n\n\nclass TripletTestSet:\n    def __init__(self):\n        self.transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]) #outptu tensor in [-1,1]\n\n    def eval(self, model, sample_func, metrics=['PSNR', 'SSIM'], output_dir=None, output_name='output.png', resume=False):\n        results_dict = {k : [] for k in metrics}\n\n        start_idx = 0\n        if resume:\n            # fill in results_dict with prev results and find where to start from\n            assert os.path.exists(join(output_dir, 'results.txt')), 'no res file found to resume from!'\n            with open(join(output_dir, 'results.txt'), 'r') as f:\n                prev_lines = f.readlines()\n                for line in prev_lines:\n                    if len(line) < 2:\n                        continue\n                    cur_res = ast.literal_eval(line.strip().split('-- ')[1].split('time')[0]) #parse dict from string\n                    for k in metrics:\n                        results_dict[k].append(float(cur_res[k]))\n                    start_idx += 1\n        \n        logfile = open(join(output_dir, 'results.txt'), 'a')\n        for idx in range(len(self.im_list)):\n            if resume and idx < start_idx:\n                assert os.path.exists(join(output_dir, self.im_list[idx], output_name)), f'skipping idx {idx} but output not found!'\n                continue\n\n            print(f'Evaluating {self.im_list[idx]}')\n            t0 = time.time()\n            if not exists(join(output_dir, self.im_list[idx])):\n                os.makedirs(join(output_dir, 
self.im_list[idx]))\n\n            with torch.no_grad():\n                with model.ema_scope():\n                    # form condition tensor and define shape of latent rep\n                    xc = {'prev_frame': self.input0_list[idx], 'next_frame': self.input1_list[idx]}\n                    c, phi_prev_list, phi_next_list = model.get_learned_conditioning(xc)\n                    shape = (model.channels, c.shape[2], c.shape[3])\n                    # run sampling and get denoised latent rep\n                    out = sample_func(conditioning=c, batch_size=c.shape[0], shape=shape, x_T=None)\n                    if isinstance(out, tuple): # using ddim\n                        out = out[0]\n                    # reconstruct interpolated frame from latent\n                    out = model.decode_first_stage(out, xc, phi_prev_list, phi_next_list)\n                    out =  torch.clamp(out, min=-1., max=1.) # interpolated frame in [-1,1]\n\n            gt = self.gt_list[idx]\n\n            for metric in metrics:\n                score = getattr(utility, 'calc_{}'.format(metric.lower()))(gt, out, [self.input0_list[idx], self.input1_list[idx]])[0].item()\n                results_dict[metric].append(score)\n\n            imwrite(out, join(output_dir, self.im_list[idx], output_name), value_range=(-1, 1), normalize=True)\n\n            msg = '{:<15s} -- {}'.format(self.im_list[idx], {k: round(results_dict[k][-1],3) for k in metrics}) + f'    time taken: {round(time.time()-t0,2)}' + '\\n'\n            print(msg, end='')\n            logfile.write(msg)\n\n        msg = '{:<15s} -- {}'.format('Average', {k: round(np.mean(results_dict[k]),3) for k in metrics}) + '\\n\\n'\n        print(msg, end='')\n        logfile.write(msg)\n        logfile.close()\n\nclass Middlebury_others(TripletTestSet):\n    def __init__(self, db_dir):\n        super(Middlebury_others, self).__init__()\n        self.im_list = ['Beanbags', 'Dimetrodon', 'DogDance', 'Grove2', 'Grove3', 'Hydrangea', 
'MiniCooper', 'RubberWhale', 'Urban2', 'Urban3', 'Venus', 'Walking']\n\n        self.input0_list = []\n        self.input1_list = []\n        self.gt_list = []\n        for item in self.im_list:\n            self.input0_list.append(self.transform(Image.open(join(db_dir, 'input', item , 'frame10.png'))).cuda().unsqueeze(0)) # [1,3,H,W] in [-1,1]\n            self.input1_list.append(self.transform(Image.open(join(db_dir, 'input', item , 'frame11.png'))).cuda().unsqueeze(0)) # [1,3,H,W] in [-1,1]\n            self.gt_list.append(self.transform(Image.open(join(db_dir, 'gt', item , 'frame10i11.png'))).cuda().unsqueeze(0))\n\nclass Davis(TripletTestSet):\n    def __init__(self, db_dir):\n        super(Davis, self).__init__()\n        self.im_list = ['bike-trial', 'boxing', 'burnout', 'choreography', 'demolition', 'dive-in', 'dolphins', 'e-bike', 'grass-chopper', 'hurdles', 'inflatable', 'juggle', 'kart-turn', 'kids-turning', 'lions', 'mbike-santa', 'monkeys', 'ocean-birds', 'pole-vault', 'running', 'selfie', 'skydive', 'speed-skating', 'swing-boy', 'tackle', 'turtle', 'varanus-tree', 'vietnam', 'wings-turn']\n\n        self.input0_list = []\n        self.input1_list = []\n        self.gt_list = []\n        for item in self.im_list:\n            self.input0_list.append(self.transform(Image.open(join(db_dir, 'input', item , 'frame10.png'))).cuda().unsqueeze(0))\n            self.input1_list.append(self.transform(Image.open(join(db_dir, 'input', item , 'frame11.png'))).cuda().unsqueeze(0))\n            self.gt_list.append(self.transform(Image.open(join(db_dir, 'gt', item , 'frame10i11.png'))).cuda().unsqueeze(0))\n\n\nclass Ucf(TripletTestSet):\n    def __init__(self, db_dir):\n        super(Ucf, self).__init__()\n        self.im_list = os.listdir(db_dir)\n\n        self.input0_list = []\n        self.input1_list = []\n        self.gt_list = []\n        for item in self.im_list:\n            self.input0_list.append(self.transform(Image.open(join(db_dir, item , 
'frame_00.png'))).cuda().unsqueeze(0))\n            self.input1_list.append(self.transform(Image.open(join(db_dir, item , 'frame_02.png'))).cuda().unsqueeze(0))\n            self.gt_list.append(self.transform(Image.open(join(db_dir, item , 'frame_01_gt.png'))).cuda().unsqueeze(0))\n\n\nclass Snufilm(TripletTestSet):\n    def __init__(self, db_dir, mode):\n        super(Snufilm, self).__init__()     \n        self.mode = mode\n        self.input0_list = []\n        self.input1_list = []\n        self.gt_list = []\n        with open(join(db_dir, 'test-{}.txt'.format(mode)), 'r') as f:\n            triplet_list = f.read().splitlines()\n        self.im_list = []\n        for i, triplet in enumerate(triplet_list, 1):\n            self.im_list.append('{}-{}'.format(mode, str(i).zfill(3)))\n            lst = triplet.split(' ')\n            self.input0_list.append(self.transform(Image.open(join(db_dir, lst[0]))).cuda().unsqueeze(0))\n            self.input1_list.append(self.transform(Image.open(join(db_dir, lst[2]))).cuda().unsqueeze(0))\n            self.gt_list.append(self.transform(Image.open(join(db_dir, lst[1]))).cuda().unsqueeze(0))\n\n\nclass Snufilm_easy(Snufilm):\n    def __init__(self, db_dir):\n        db_dir = db_dir[:-5]\n        super(Snufilm_easy, self).__init__(db_dir, 'easy')\n\nclass Snufilm_medium(Snufilm):\n    def __init__(self, db_dir):\n        db_dir = db_dir[:-7]\n        super(Snufilm_medium, self).__init__(db_dir, 'medium')\n\nclass Snufilm_hard(Snufilm):\n    def __init__(self, db_dir):\n        db_dir = db_dir[:-5]\n        super(Snufilm_hard, self).__init__(db_dir, 'hard')\n\nclass Snufilm_extreme(Snufilm):\n    def __init__(self, db_dir):\n        db_dir = db_dir[:-8]\n        super(Snufilm_extreme, self).__init__(db_dir, 'extreme')\n\nclass VFITex_triplet:\n    def __init__(self, db_dir):\n        self.seq_list = os.listdir(db_dir)\n        self.db_dir = db_dir\n        self.transform = transforms.Compose([transforms.ToTensor(), 
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]) # output tensor in [-1,1]\n\n\n    def eval(self, model, sample_func, metrics=['PSNR', 'SSIM'], output_dir=None, output_name=None, resume=False):\n        model.eval()\n        results_dict = {k : [] for k in metrics}\n\n        start_idx = 0\n        if resume:\n            # fill in results_dict with prev results and find where to start from\n            assert os.path.exists(join(output_dir, 'results.txt')), 'no res file found to resume from!'\n            with open(join(output_dir, 'results.txt'), 'r') as f:\n                prev_lines = f.readlines()\n                for line in prev_lines:\n                    if len(line) < 2:\n                        continue\n                    cur_res = ast.literal_eval(line.strip().split('-- ')[1].split('time')[0]) #parse dict from string\n                    for k in metrics:\n                        results_dict[k].append(float(cur_res[k]))\n                    start_idx += 1\n        \n        logfile = open(join(output_dir, 'results.txt'), 'a')\n\n        for idx, seq in enumerate(self.seq_list):\n            if resume and idx < start_idx:\n                assert len(glob.glob(join(output_dir, seq, '*.png'))) > 0, f'skipping idx {idx} but output not found!'\n                continue\n\n            print(f'Evaluating {seq}')\n            t0 = time.time()\n\n            seqpath = join(self.db_dir, seq)\n            if not exists(join(output_dir, seq)):\n                os.makedirs(join(output_dir, seq))\n\n            # interpolate between every 2 frames\n            gt_list, out_list, inputs_list = [], [], []\n            tmp_dict = {k : [] for k in metrics}\n            num_frames = len([f for f in os.listdir(seqpath) if f.endswith('.png')])\n            for t in range(1, num_frames-5, 2):\n                im0 = Image.open(join(seqpath, str(t+2).zfill(3)+'.png'))\n                im1 = Image.open(join(seqpath, str(t+3).zfill(3)+'.png'))\n                im2 = 
Image.open(join(seqpath, str(t+4).zfill(3)+'.png'))\n                # center crop if 4K\n                if '4K' in seq:\n                    w, h  = im0.size\n                    im0 = TF.center_crop(im0, (h//2, w//2))\n                    im1 = TF.center_crop(im1, (h//2, w//2))\n                    im2 = TF.center_crop(im2, (h//2, w//2))\n                im0 = self.transform(im0).cuda().unsqueeze(0)\n                im1 = self.transform(im1).cuda().unsqueeze(0)\n                im2 = self.transform(im2).cuda().unsqueeze(0)\n\n                with torch.no_grad():\n                    with model.ema_scope():\n                        # form condition tensor and define shape of latent rep\n                        xc = {'prev_frame': im0, 'next_frame': im2}\n                        c, phi_prev_list, phi_next_list = model.get_learned_conditioning(xc)\n                        shape = (model.channels, c.shape[2], c.shape[3])\n                        # run sampling and get denoised latent rep\n                        out = sample_func(conditioning=c, batch_size=c.shape[0], shape=shape, x_T=None)\n                        if isinstance(out, tuple): # using ddim\n                            out = out[0]\n                        # reconstruct interpolated frame from latent\n                        out = model.decode_first_stage(out, xc, phi_prev_list, phi_next_list)\n                        out =  torch.clamp(out, min=-1., max=1.) 
# interpolated frame in [-1,1]\n\n                for metric in metrics:\n                    score = getattr(utility, 'calc_{}'.format(metric.lower()))(im1, out, [im0, im2])[0].item()\n                    tmp_dict[metric].append(score)\n\n                imwrite(out, join(output_dir, seq, 'frame{}.png'.format(t+3)), value_range=(-1, 1), normalize=True)\n\n            # compute sequence-level scores\n            for metric in metrics:\n                results_dict[metric].append(np.mean(tmp_dict[metric]))\n\n            # log\n            msg = '{:<15s} -- {}'.format(seq, {k: round(results_dict[k][-1], 3) for k in metrics}) + f'    time taken: {round(time.time()-t0,2)}' + '\\n'\n            print(msg, end='')\n            logfile.write(msg)\n        \n        msg = '{:<15s} -- {}'.format('Average', {k: round(np.mean(results_dict[k]), 3) for k in metrics}) + '\\n'\n        print(msg, end='')\n        logfile.write(msg)\n        logfile.close()\n\n\n\n\nclass Davis90_triplet:\n    def __init__(self, db_dir):\n        self.seq_list = sorted(os.listdir(db_dir))\n        self.db_dir = db_dir\n        self.transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]) # output tensor in [-1,1]\n\n\n    def eval(self, model, sample_func, metrics=['PSNR', 'SSIM'], output_dir=None, output_name=None, resume=False):\n        model.eval()\n        results_dict = {k : [] for k in metrics}\n\n        start_idx = 0\n        if resume:\n            # fill in results_dict with prev results and find where to start from\n            assert os.path.exists(join(output_dir, 'results.txt')), 'no res file found to resume from!'\n            with open(join(output_dir, 'results.txt'), 'r') as f:\n                prev_lines = f.readlines()\n                for line in prev_lines:\n                    if len(line) < 2:\n                        continue\n                    cur_res = ast.literal_eval(line.strip().split('-- ')[1].split('time')[0]) 
#parse dict from string\n                    for k in metrics:\n                        results_dict[k].append(float(cur_res[k]))\n                    start_idx += 1\n        \n        logfile = open(join(output_dir, 'results.txt'), 'a')\n\n        for idx, seq in enumerate(self.seq_list):\n            if resume and idx < start_idx:\n                assert len(glob.glob(join(output_dir, seq, '*.png'))) > 0, f'skipping idx {idx} but output not found!'\n                continue\n\n            print(f'Evaluating {seq}')\n            t0 = time.time()\n\n            seqpath = join(self.db_dir, seq)\n            if not exists(join(output_dir, seq)):\n                os.makedirs(join(output_dir, seq))\n\n            # interpolate between every 2 frames\n            gt_list, out_list, inputs_list = [], [], []\n            tmp_dict = {k : [] for k in metrics}\n            num_frames = len(os.listdir(seqpath))\n            for t in range(0, num_frames-6, 2):\n                im3 = Image.open(join(seqpath, str(t+2).zfill(5)+'.jpg'))\n                im4 = Image.open(join(seqpath, str(t+3).zfill(5)+'.jpg'))\n                im5 = Image.open(join(seqpath, str(t+4).zfill(5)+'.jpg'))\n\n                im3 = self.transform(im3).cuda().unsqueeze(0)\n                im4 = self.transform(im4).cuda().unsqueeze(0)\n                im5 = self.transform(im5).cuda().unsqueeze(0)\n\n                with torch.no_grad():\n                    with model.ema_scope():\n                        # form condition tensor and define shape of latent rep\n                        xc = {'prev_frame': im3, 'next_frame': im5}\n                        c, phi_prev_list, phi_next_list = model.get_learned_conditioning(xc)\n                        shape = (model.channels, c.shape[2], c.shape[3])\n                        # run sampling and get denoised latent rep\n                        out = sample_func(conditioning=c, batch_size=c.shape[0], shape=shape, x_T=None)\n                        if isinstance(out, 
tuple): # using ddim\n                            out = out[0]\n                        # reconstruct interpolated frame from latent\n                        out = model.decode_first_stage(out, xc, phi_prev_list, phi_next_list)\n                        out =  torch.clamp(out, min=-1., max=1.) # interpolated frame in [-1,1]\n\n                for metric in metrics:\n                    score = getattr(utility, 'calc_{}'.format(metric.lower()))(im4, out, [im3, im5])[0].item()\n                    tmp_dict[metric].append(score)\n\n                imwrite(out, join(output_dir, seq, 'frame{}.png'.format(t+3)), value_range=(-1, 1), normalize=True)\n\n            # compute sequence-level scores\n            for metric in metrics:\n                results_dict[metric].append(np.mean(tmp_dict[metric]))\n\n            # log\n            msg = '{:<15s} -- {}'.format(seq, {k: round(results_dict[k][-1], 3) for k in metrics}) + f'    time taken: {round(time.time()-t0,2)}' + '\\n'\n            print(msg, end='')\n            logfile.write(msg)\n        \n        msg = '{:<15s} -- {}'.format('Average', {k: round(np.mean(results_dict[k]), 3) for k in metrics}) + '\\n'\n        print(msg, end='')\n        logfile.write(msg)\n        logfile.close()\n\n\n\nclass Ucf101_triplet:\n    def __init__(self, db_dir):\n        self.db_dir = db_dir\n        self.transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]) # output tensor in [-1,1]\n\n        self.im_list = os.listdir(db_dir)\n\n        self.input3_list = []\n        self.input5_list = []\n        self.gt_list = []\n        for item in self.im_list:\n            self.input3_list.append(self.transform(Image.open(join(db_dir, item , 'frame1.png'))).cuda().unsqueeze(0))\n            self.input5_list.append(self.transform(Image.open(join(db_dir, item , 'frame2.png'))).cuda().unsqueeze(0))\n            self.gt_list.append(self.transform(Image.open(join(db_dir, item , 
'framet.png'))).cuda().unsqueeze(0))\n\n    def eval(self, model, sample_func, metrics=['PSNR', 'SSIM'], output_dir=None, output_name='output.png', resume=False):\n        model.eval()\n        results_dict = {k : [] for k in metrics}\n\n        start_idx = 0\n        if resume:\n            # fill in results_dict with prev results and find where to start from\n            assert os.path.exists(join(output_dir, 'results.txt')), 'no res file found to resume from!'\n            with open(join(output_dir, 'results.txt'), 'r') as f:\n                prev_lines = f.readlines()\n                for line in prev_lines:\n                    if len(line) < 2:\n                        continue\n                    cur_res = ast.literal_eval(line.strip().split('-- ')[1].split('time')[0]) #parse dict from string\n                    for k in metrics:\n                        results_dict[k].append(float(cur_res[k]))\n                    start_idx += 1\n        \n        logfile = open(join(output_dir, 'results.txt'), 'a')\n\n        for idx in range(len(self.im_list)):\n            if resume and idx < start_idx:\n                assert os.path.exists(join(output_dir, self.im_list[idx], output_name)), f'skipping idx {idx} but output not found!'\n                continue\n\n            print(f'Evaluating {self.im_list[idx]}')\n            t0 = time.time()\n\n            if not exists(join(output_dir, self.im_list[idx])):\n                os.makedirs(join(output_dir, self.im_list[idx]))\n\n            with torch.no_grad():\n                with model.ema_scope():\n                    # form condition tensor and define shape of latent rep\n                    xc = {'prev_frame': self.input3_list[idx], 'next_frame': self.input5_list[idx]}\n                    c, phi_prev_list, phi_next_list = model.get_learned_conditioning(xc)\n                    shape = (model.channels, c.shape[2], c.shape[3])\n                    # run sampling and get denoised latent rep\n                    
out = sample_func(conditioning=c, batch_size=c.shape[0], shape=shape, x_T=None)\n                    if isinstance(out, tuple): # using ddim\n                        out = out[0]\n                    # reconstruct interpolated frame from latent\n                    out = model.decode_first_stage(out, xc, phi_prev_list, phi_next_list)\n                    out =  torch.clamp(out, min=-1., max=1.) # interpolated frame in [-1,1]\n\n            gt = self.gt_list[idx]\n\n            for metric in metrics:\n                score = getattr(utility, 'calc_{}'.format(metric.lower()))(gt, out, [self.input3_list[idx], self.input5_list[idx]])[0].item()\n                results_dict[metric].append(score)\n\n            imwrite(out, join(output_dir, self.im_list[idx], output_name), value_range=(-1, 1), normalize=True)\n\n            msg = '{:<15s} -- {}'.format(self.im_list[idx], {k: round(results_dict[k][-1],3) for k in metrics}) + f'    time taken: {round(time.time()-t0,2)}' + '\\n'\n            print(msg, end='')\n            logfile.write(msg)\n\n        msg = '{:<15s} -- {}'.format('Average', {k: round(np.mean(results_dict[k]),3) for k in metrics}) + '\\n\\n'\n        print(msg, end='')\n        logfile.write(msg)\n        logfile.close()"
  },
  {
    "path": "ldm/data/testsets_vqm.py",
    "content": "import glob\nfrom typing import List\nfrom PIL import Image\nimport torch\nfrom torchvision import transforms\nimport torchvision.transforms.functional as TF\nfrom torchvision.utils import save_image as imwrite\nimport os\nfrom os.path import join, exists\nimport utility\nimport numpy as np\nimport ast\nimport time\n\n\nclass TripletTestSet:\n    def __init__(self):\n        self.transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]) #outptu tensor in [-1,1]\n\n    def eval(self, metrics=['FloLPIPS'], output_dir=None, output_name='output.png', resume=False):\n        results_dict = {k : [] for k in metrics}\n\n        start_idx = 0\n        if resume:\n            # fill in results_dict with prev results and find where to start from\n            assert os.path.exists(join(output_dir, 'results_vqm.txt')), 'No res file found to resume from!'\n            with open(join(output_dir, 'results_vqm.txt'), 'r') as f:\n                prev_lines = f.readlines()\n                for line in prev_lines:\n                    if len(line) < 2:\n                        continue\n                    cur_res = ast.literal_eval(line.strip().split('-- ')[1].split('time')[0]) #parse dict from string\n                    for k in metrics:\n                        results_dict[k].append(float(cur_res[k]))\n                    start_idx += 1\n        \n        logfile = open(join(output_dir, 'results_vqm.txt'), 'a')\n\n        for idx in range(len(self.im_list)):\n            if resume and idx < start_idx:\n                assert os.path.exists(join(output_dir, self.im_list[idx], output_name)), f'skipping idx {idx} but output not found!'\n                continue\n\n            print(f'Evaluating {self.im_list[idx]}')\n            t0 = time.time()\n            assert exists(join(output_dir, self.im_list[idx], output_name)), f'No interpolated frames found for {self.im_list[idx]}'\n\n            out = 
self.transform(Image.open(join(output_dir, self.im_list[idx], output_name))).cuda().unsqueeze(0)\n            gt = self.gt_list[idx]\n\n            for metric in metrics:\n                score = getattr(utility, 'calc_{}'.format(metric.lower()))([gt], [out], [self.input0_list[idx], self.input1_list[idx]])\n                results_dict[metric].append(score)\n\n\n            msg = '{:<15s} -- {}'.format(self.im_list[idx], {k: round(results_dict[k][-1],3) for k in metrics}) + f'    time taken: {round(time.time()-t0,2)}' + '\\n'\n            print(msg, end='')\n            logfile.write(msg)\n\n        msg = '{:<15s} -- {}'.format('Average', {k: round(np.mean(results_dict[k]),3) for k in metrics}) + '\\n\\n'\n        print(msg, end='')\n        logfile.write(msg)\n        logfile.close()\n\nclass Middlebury_others(TripletTestSet):\n    def __init__(self, db_dir):\n        super(Middlebury_others, self).__init__()\n        self.im_list = ['Beanbags', 'Dimetrodon', 'DogDance', 'Grove2', 'Grove3', 'Hydrangea', 'MiniCooper', 'RubberWhale', 'Urban2', 'Urban3', 'Venus', 'Walking']\n\n        self.input0_list = []\n        self.input1_list = []\n        self.gt_list = []\n        for item in self.im_list:\n            self.input0_list.append(self.transform(Image.open(join(db_dir, 'input', item , 'frame10.png'))).cuda().unsqueeze(0)) # [1,3,H,W] in [-1,1]\n            self.input1_list.append(self.transform(Image.open(join(db_dir, 'input', item , 'frame11.png'))).cuda().unsqueeze(0)) # [1,3,H,W] in [-1,1]\n            self.gt_list.append(self.transform(Image.open(join(db_dir, 'gt', item , 'frame10i11.png'))).cuda().unsqueeze(0))\n\nclass Davis(TripletTestSet):\n    def __init__(self, db_dir):\n        super(Davis, self).__init__()\n        self.im_list = ['bike-trial', 'boxing', 'burnout', 'choreography', 'demolition', 'dive-in', 'dolphins', 'e-bike', 'grass-chopper', 'hurdles', 'inflatable', 'juggle', 'kart-turn', 'kids-turning', 'lions', 'mbike-santa', 'monkeys', 
'ocean-birds', 'pole-vault', 'running', 'selfie', 'skydive', 'speed-skating', 'swing-boy', 'tackle', 'turtle', 'varanus-tree', 'vietnam', 'wings-turn']\n\n        self.input0_list = []\n        self.input1_list = []\n        self.gt_list = []\n        for item in self.im_list:\n            self.input0_list.append(self.transform(Image.open(join(db_dir, 'input', item , 'frame10.png'))).cuda().unsqueeze(0))\n            self.input1_list.append(self.transform(Image.open(join(db_dir, 'input', item , 'frame11.png'))).cuda().unsqueeze(0))\n            self.gt_list.append(self.transform(Image.open(join(db_dir, 'gt', item , 'frame10i11.png'))).cuda().unsqueeze(0))\n\n\nclass Ucf(TripletTestSet):\n    def __init__(self, db_dir):\n        super(Ucf, self).__init__()\n        self.im_list = os.listdir(db_dir)\n\n        self.input0_list = []\n        self.input1_list = []\n        self.gt_list = []\n        for item in self.im_list:\n            self.input0_list.append(self.transform(Image.open(join(db_dir, item , 'frame_00.png'))).cuda().unsqueeze(0))\n            self.input1_list.append(self.transform(Image.open(join(db_dir, item , 'frame_02.png'))).cuda().unsqueeze(0))\n            self.gt_list.append(self.transform(Image.open(join(db_dir, item , 'frame_01_gt.png'))).cuda().unsqueeze(0))\n\n\nclass Snufilm(TripletTestSet):\n    def __init__(self, db_dir, mode):\n        super(Snufilm, self).__init__()     \n        self.mode = mode\n        self.input0_list = []\n        self.input1_list = []\n        self.gt_list = []\n        with open(join(db_dir, 'test-{}.txt'.format(mode)), 'r') as f:\n            triplet_list = f.read().splitlines()\n        self.im_list = []\n        for i, triplet in enumerate(triplet_list, 1):\n            self.im_list.append('{}-{}'.format(mode, str(i).zfill(3)))\n            lst = triplet.split(' ')\n            self.input0_list.append(self.transform(Image.open(join(db_dir, lst[0]))).cuda().unsqueeze(0))\n            
self.input1_list.append(self.transform(Image.open(join(db_dir, lst[2]))).cuda().unsqueeze(0))\n            self.gt_list.append(self.transform(Image.open(join(db_dir, lst[1]))).cuda().unsqueeze(0))\n\n\nclass Snufilm_easy(Snufilm):\n    def __init__(self, db_dir):\n        db_dir = db_dir[:-5]\n        super(Snufilm_easy, self).__init__(db_dir, 'easy')\n\nclass Snufilm_medium(Snufilm):\n    def __init__(self, db_dir):\n        db_dir = db_dir[:-7]\n        super(Snufilm_medium, self).__init__(db_dir, 'medium')\n\nclass Snufilm_hard(Snufilm):\n    def __init__(self, db_dir):\n        db_dir = db_dir[:-5]\n        super(Snufilm_hard, self).__init__(db_dir, 'hard')\n\nclass Snufilm_extreme(Snufilm):\n    def __init__(self, db_dir):\n        db_dir = db_dir[:-8]\n        super(Snufilm_extreme, self).__init__(db_dir, 'extreme')\n\nclass VFITex_triplet:\n    def __init__(self, db_dir):\n        self.seq_list = os.listdir(db_dir)\n        self.db_dir = db_dir\n        self.transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]) # output tensor in [-1,1]\n\n\n    def eval(self, metrics=['FloLPIPS'], output_dir=None, output_name=None, resume=False):\n        results_dict = {k : [] for k in metrics}\n\n        start_idx = 0\n        if resume:\n            # fill in results_dict with prev results and find where to start from\n            assert os.path.exists(join(output_dir, 'results_vqm.txt')), 'no res file found to resume from!'\n            with open(join(output_dir, 'results_vqm.txt'), 'r') as f:\n                prev_lines = f.readlines()\n                for line in prev_lines:\n                    if len(line) < 2:\n                        continue\n                    cur_res = ast.literal_eval(line.strip().split('-- ')[1].split('time')[0]) #parse dict from string\n                    for k in metrics:\n                        results_dict[k].append(float(cur_res[k]))\n                    start_idx += 1\n        
\n        logfile = open(join(output_dir, 'results_vqm.txt'), 'a')\n\n        for idx, seq in enumerate(self.seq_list):\n            if resume and idx < start_idx:\n                assert len(glob.glob(join(output_dir, seq, '*.png'))) > 0, f'skipping idx {idx} but output not found!'\n                continue\n\n            print(f'Evaluating {seq}')\n            t0 = time.time()\n\n            seqpath = join(self.db_dir, seq)\n            assert len(glob.glob(join(output_dir, seq, '*.png'))) > 0, f'No interpolated frames found for {seq}'\n\n            # interpolate between every 2 frames\n            gt_list, out_list, inputs_list = [], [], []\n            num_frames = len([f for f in os.listdir(seqpath) if f.endswith('.png')])\n            for t in range(1, num_frames-5, 2):\n                im0 = Image.open(join(seqpath, str(t+2).zfill(3)+'.png'))\n                im1 = Image.open(join(seqpath, str(t+3).zfill(3)+'.png'))\n                im2 = Image.open(join(seqpath, str(t+4).zfill(3)+'.png'))\n                # center crop if 4K\n                if '4K' in seq:\n                    w, h  = im0.size\n                    im0 = TF.center_crop(im0, (h//2, w//2))\n                    im1 = TF.center_crop(im1, (h//2, w//2))\n                    im2 = TF.center_crop(im2, (h//2, w//2))\n                im0 = self.transform(im0).cuda().unsqueeze(0)\n                im1 = self.transform(im1).cuda().unsqueeze(0)\n                im2 = self.transform(im2).cuda().unsqueeze(0)\n\n                out = self.transform(Image.open(join(output_dir, seq, 'frame{}.png'.format(t+3)))).cuda().unsqueeze(0)\n                gt_list.append(im1)\n                out_list.append(out)\n                if t == 1:\n                    inputs_list.append(im0)\n                inputs_list.append(im2)\n\n            # compute sequence-level scores\n            for metric in metrics:\n                score = getattr(utility, 'calc_{}'.format(metric.lower()))(gt_list, out_list, inputs_list)\n    
            results_dict[metric].append(score)\n\n            # log\n            msg = '{:<15s} -- {}'.format(seq, {k: round(results_dict[k][-1], 3) for k in metrics}) + f'    time taken: {round(time.time()-t0,2)}' + '\\n'\n            print(msg, end='')\n            logfile.write(msg)\n        \n        msg = '{:<15s} -- {}'.format('Average', {k: round(np.mean(results_dict[k]), 3) for k in metrics}) + '\\n'\n        print(msg, end='')\n        logfile.write(msg)\n        logfile.close()\n\n\n\n\nclass Davis90_triplet:\n    def __init__(self, db_dir):\n        self.seq_list = sorted(os.listdir(db_dir))\n        self.db_dir = db_dir\n        self.transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]) #outptu tensor in [-1,1]\n\n\n    def eval(self, metrics=['FloLPIPS'], output_dir=None, output_name=None, resume=False):\n        results_dict = {k : [] for k in metrics}\n\n        start_idx = 0\n        if resume:\n            # fill in results_dict with prev results and find where to start from\n            assert os.path.exists(join(output_dir, 'results_vqm.txt')), 'no res file found to resume from!'\n            with open(join(output_dir, 'results_vqm.txt'), 'r') as f:\n                prev_lines = f.readlines()\n                for line in prev_lines:\n                    if len(line) < 2:\n                        continue\n                    cur_res = ast.literal_eval(line.strip().split('-- ')[1].split('time')[0]) #parse dict from string\n                    for k in metrics:\n                        results_dict[k].append(float(cur_res[k]))\n                    start_idx += 1\n        \n        logfile = open(join(output_dir, 'results_vqm.txt'), 'a')\n\n        for idx, seq in enumerate(self.seq_list):\n            if resume and idx < start_idx:\n                assert len(glob.glob(join(output_dir, seq, '*.png'))) > 0, f'skipping idx {idx} but output not found!'\n                continue\n\n            
print(f'Evaluating {seq}')\n            t0 = time.time()\n\n            seqpath = join(self.db_dir, seq)\n            assert len(glob.glob(join(output_dir, seq, '*.png'))) > 0, f'No interpolated frames found for {seq}'\n\n            # interpolate between every 2 frames\n            gt_list, out_list, inputs_list = [], [], []\n            num_frames = len(os.listdir(seqpath))\n            for t in range(0, num_frames-6, 2):\n                im3 = Image.open(join(seqpath, str(t+2).zfill(5)+'.jpg'))\n                im4 = Image.open(join(seqpath, str(t+3).zfill(5)+'.jpg'))\n                im5 = Image.open(join(seqpath, str(t+4).zfill(5)+'.jpg'))\n\n                im3 = self.transform(im3).cuda().unsqueeze(0)\n                im4 = self.transform(im4).cuda().unsqueeze(0)\n                im5 = self.transform(im5).cuda().unsqueeze(0)\n\n                out = self.transform(Image.open(join(output_dir, seq, 'frame{}.png'.format(t+3)))).cuda().unsqueeze(0)\n                gt_list.append(im4)\n                out_list.append(out)\n                if t == 0:\n                    inputs_list.append(im3)\n                inputs_list.append(im5)\n\n            # compute sequence-level scores\n            for metric in metrics:\n                score = getattr(utility, 'calc_{}'.format(metric.lower()))(gt_list, out_list, inputs_list)\n                results_dict[metric].append(score)\n\n            # log\n            msg = '{:<15s} -- {}'.format(seq, {k: round(results_dict[k][-1], 3) for k in metrics}) + f'    time taken: {round(time.time()-t0,2)}' + '\\n'\n            print(msg, end='')\n            logfile.write(msg)\n        \n        msg = '{:<15s} -- {}'.format('Average', {k: round(np.mean(results_dict[k]), 3) for k in metrics}) + '\\n'\n        print(msg, end='')\n        logfile.write(msg)\n        logfile.close()\n\n\n\nclass Ucf101_triplet:\n    def __init__(self, db_dir):\n        self.db_dir = db_dir\n        self.transform = 
transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]) # output tensor in [-1,1]\n\n        self.im_list = os.listdir(db_dir)\n\n        self.input3_list = []\n        self.input5_list = []\n        self.gt_list = []\n        for item in self.im_list:\n            self.input3_list.append(self.transform(Image.open(join(db_dir, item , 'frame1.png'))).cuda().unsqueeze(0))\n            self.input5_list.append(self.transform(Image.open(join(db_dir, item , 'frame2.png'))).cuda().unsqueeze(0))\n            self.gt_list.append(self.transform(Image.open(join(db_dir, item , 'framet.png'))).cuda().unsqueeze(0))\n\n    def eval(self, metrics=['FloLPIPS'], output_dir=None, output_name='output.png', resume=False):\n        results_dict = {k : [] for k in metrics}\n\n        start_idx = 0\n        if resume:\n            # fill in results_dict with prev results and find where to start from\n            assert os.path.exists(join(output_dir, 'results_vqm.txt')), 'no res file found to resume from!'\n            with open(join(output_dir, 'results_vqm.txt'), 'r') as f:\n                prev_lines = f.readlines()\n                for line in prev_lines:\n                    if len(line) < 2:\n                        continue\n                    cur_res = ast.literal_eval(line.strip().split('-- ')[1].split('time')[0]) #parse dict from string\n                    for k in metrics:\n                        results_dict[k].append(float(cur_res[k]))\n                    start_idx += 1\n        \n        logfile = open(join(output_dir, 'results_vqm.txt'), 'a')\n\n        for idx in range(len(self.im_list)):\n            if resume and idx < start_idx:\n                assert os.path.exists(join(output_dir, self.im_list[idx], output_name)), f'skipping idx {idx} but output not found!'\n                continue\n\n            print(f'Evaluating {self.im_list[idx]}')\n            t0 = time.time()\n\n            assert exists(join(output_dir, 
self.im_list[idx], output_name)), f'No interpolated frames found for {self.im_list[idx]}'\n\n            out = self.transform(Image.open(join(output_dir, self.im_list[idx], output_name))).cuda().unsqueeze(0)\n            gt = self.gt_list[idx]\n\n            for metric in metrics:\n                score = getattr(utility, 'calc_{}'.format(metric.lower()))([gt], [out], [self.input3_list[idx], self.input5_list[idx]])\n                results_dict[metric].append(score)\n\n\n            msg = '{:<15s} -- {}'.format(self.im_list[idx], {k: round(results_dict[k][-1],3) for k in metrics}) + f'    time taken: {round(time.time()-t0,2)}' + '\\n'\n            print(msg, end='')\n            logfile.write(msg)\n\n        msg = '{:<15s} -- {}'.format('Average', {k: round(np.mean(results_dict[k]),3) for k in metrics}) + '\\n\\n'\n        print(msg, end='')\n        logfile.write(msg)\n        logfile.close()"
  },
  {
    "path": "ldm/data/vfitransforms.py",
    "content": "import random\nimport torch\nimport torchvision\nimport torchvision.transforms.functional as TF\n\n\ndef rand_crop(*args, sz):\n    i, j, h, w = torchvision.transforms.RandomCrop.get_params(args[0], output_size=sz)\n    out = []\n    for im in args:\n        out.append(TF.crop(im, i, j, h, w))\n    return out\n\n\ndef rand_flip(*args, p):\n    out = list(args)\n    if random.random() < p:\n        for i, im in enumerate(out):\n            out[i] = TF.hflip(im)\n    if random.random() < p:\n        for i, im in enumerate(out):\n            out[i] = TF.vflip(im)\n    return out\n\n\ndef rand_reverse(*args, p):\n    if random.random() < p:\n        return args[::-1]\n    else:\n        return args"
  },
  {
    "path": "ldm/lr_scheduler.py",
    "content": "import numpy as np\n\n\nclass LambdaWarmUpCosineScheduler:\n    \"\"\"\n    note: use with a base_lr of 1.0\n    \"\"\"\n    def __init__(self, warm_up_steps, lr_min, lr_max, lr_start, max_decay_steps, verbosity_interval=0):\n        self.lr_warm_up_steps = warm_up_steps\n        self.lr_start = lr_start\n        self.lr_min = lr_min\n        self.lr_max = lr_max\n        self.lr_max_decay_steps = max_decay_steps\n        self.last_lr = 0.\n        self.verbosity_interval = verbosity_interval\n\n    def schedule(self, n, **kwargs):\n        if self.verbosity_interval > 0:\n            if n % self.verbosity_interval == 0: print(f\"current step: {n}, recent lr-multiplier: {self.last_lr}\")\n        if n < self.lr_warm_up_steps:\n            lr = (self.lr_max - self.lr_start) / self.lr_warm_up_steps * n + self.lr_start\n            self.last_lr = lr\n            return lr\n        else:\n            t = (n - self.lr_warm_up_steps) / (self.lr_max_decay_steps - self.lr_warm_up_steps)\n            t = min(t, 1.0)\n            lr = self.lr_min + 0.5 * (self.lr_max - self.lr_min) * (\n                    1 + np.cos(t * np.pi))\n            self.last_lr = lr\n            return lr\n\n    def __call__(self, n, **kwargs):\n        return self.schedule(n,**kwargs)\n\n\nclass LambdaWarmUpCosineScheduler2:\n    \"\"\"\n    supports repeated iterations, configurable via lists\n    note: use with a base_lr of 1.0.\n    \"\"\"\n    def __init__(self, warm_up_steps, f_min, f_max, f_start, cycle_lengths, verbosity_interval=0):\n        assert len(warm_up_steps) == len(f_min) == len(f_max) == len(f_start) == len(cycle_lengths)\n        self.lr_warm_up_steps = warm_up_steps\n        self.f_start = f_start\n        self.f_min = f_min\n        self.f_max = f_max\n        self.cycle_lengths = cycle_lengths\n        self.cum_cycles = np.cumsum([0] + list(self.cycle_lengths))\n        self.last_f = 0.\n        self.verbosity_interval = verbosity_interval\n\n    def 
find_in_interval(self, n):\n        interval = 0\n        for cl in self.cum_cycles[1:]:\n            if n <= cl:\n                return interval\n            interval += 1\n\n    def schedule(self, n, **kwargs):\n        cycle = self.find_in_interval(n)\n        n = n - self.cum_cycles[cycle]\n        if self.verbosity_interval > 0:\n            if n % self.verbosity_interval == 0: print(f\"current step: {n}, recent lr-multiplier: {self.last_f}, \"\n                                                       f\"current cycle {cycle}\")\n        if n < self.lr_warm_up_steps[cycle]:\n            f = (self.f_max[cycle] - self.f_start[cycle]) / self.lr_warm_up_steps[cycle] * n + self.f_start[cycle]\n            self.last_f = f\n            return f\n        else:\n            t = (n - self.lr_warm_up_steps[cycle]) / (self.cycle_lengths[cycle] - self.lr_warm_up_steps[cycle])\n            t = min(t, 1.0)\n            f = self.f_min[cycle] + 0.5 * (self.f_max[cycle] - self.f_min[cycle]) * (\n                    1 + np.cos(t * np.pi))\n            self.last_f = f\n            return f\n\n    def __call__(self, n, **kwargs):\n        return self.schedule(n, **kwargs)\n\n\nclass LambdaLinearScheduler(LambdaWarmUpCosineScheduler2):\n\n    def schedule(self, n, **kwargs):\n        cycle = self.find_in_interval(n)\n        n = n - self.cum_cycles[cycle]\n        if self.verbosity_interval > 0:\n            if n % self.verbosity_interval == 0: print(f\"current step: {n}, recent lr-multiplier: {self.last_f}, \"\n                                                       f\"current cycle {cycle}\")\n\n        if n < self.lr_warm_up_steps[cycle]:\n            f = (self.f_max[cycle] - self.f_start[cycle]) / self.lr_warm_up_steps[cycle] * n + self.f_start[cycle]\n            self.last_f = f\n            return f\n        else:\n            f = self.f_min[cycle] + (self.f_max[cycle] - self.f_min[cycle]) * (self.cycle_lengths[cycle] - n) / (self.cycle_lengths[cycle])\n            self.last_f 
= f\n            return f\n\n"
  },
  {
    "path": "ldm/models/autoencoder.py",
    "content": "import torch\nimport pytorch_lightning as pl\nimport torch.nn.functional as F\nfrom torch.optim.lr_scheduler import LambdaLR\nimport numpy as np\nfrom packaging import version\nfrom ldm.modules.ema import LitEma\nfrom contextlib import contextmanager\n\nfrom taming.modules.vqvae.quantize import VectorQuantizer2 as VectorQuantizer\n\nfrom ldm.modules.diffusionmodules.model import *\n\nfrom ldm.util import instantiate_from_config\n\n\n\nclass VQFlowNet(pl.LightningModule):\n    def __init__(self,\n                 ddconfig,\n                 lossconfig,\n                 n_embed,\n                 embed_dim,\n                 ckpt_path=None,\n                 ignore_keys=[],\n                 image_key=\"image\",\n                 colorize_nlabels=None,\n                 monitor=None,\n                 batch_resize_range=None,\n                 scheduler_config=None,\n                 lr_g_factor=1.0,\n                 remap=None,\n                 sane_index_shape=False, # tell vector quantizer to return indices as bhw\n                 use_ema=False\n                 ):\n        super().__init__()\n        self.embed_dim = embed_dim # 3\n        self.n_embed = n_embed # 8192\n        self.image_key = image_key # 'image'\n        self.encoder = FlowEncoder(**ddconfig)\n        self.decoder = FlowDecoderWithResidual(**ddconfig)\n        self.loss = instantiate_from_config(lossconfig)\n        self.quantize = VectorQuantizer(n_embed, embed_dim, beta=0.25,\n                                        remap=remap,\n                                        sane_index_shape=sane_index_shape)\n        self.quant_conv = torch.nn.Conv2d(ddconfig[\"z_channels\"], embed_dim, 1)\n        self.post_quant_conv = torch.nn.Conv2d(embed_dim, ddconfig[\"z_channels\"], 1)\n        if colorize_nlabels is not None:\n            assert type(colorize_nlabels)==int\n            self.register_buffer(\"colorize\", torch.randn(3, colorize_nlabels, 1, 1))\n        if monitor is not 
None:\n            self.monitor = monitor\n        self.batch_resize_range = batch_resize_range\n        if self.batch_resize_range is not None:\n            print(f\"{self.__class__.__name__}: Using per-batch resizing in range {batch_resize_range}.\")\n\n        self.use_ema = use_ema\n        if self.use_ema:\n            self.model_ema = LitEma(self)\n            print(f\"Keeping EMAs of {len(list(self.model_ema.buffers()))}.\")\n\n        if ckpt_path is not None:\n            self.init_from_ckpt(ckpt_path, ignore_keys=ignore_keys)\n        self.scheduler_config = scheduler_config\n        self.lr_g_factor = lr_g_factor\n        self.h0 = None\n        self.w0 = None\n        self.h_padded = None\n        self.w_padded = None\n        self.pad_h = 0\n        self.pad_w = 0\n\n    @contextmanager\n    def ema_scope(self, context=None):\n        if self.use_ema:\n            self.model_ema.store(self.parameters())\n            self.model_ema.copy_to(self)\n            if context is not None:\n                print(f\"{context}: Switched to EMA weights\")\n        try:\n            yield None\n        finally:\n            if self.use_ema:\n                self.model_ema.restore(self.parameters())\n                if context is not None:\n                    print(f\"{context}: Restored training weights\")\n\n    def init_from_ckpt(self, path, ignore_keys=list()):\n        sd = torch.load(path, map_location=\"cpu\")[\"state_dict\"]\n        keys = list(sd.keys())\n        for k in keys:\n            for ik in ignore_keys:\n                if k.startswith(ik):\n                    print(\"Deleting key {} from state_dict.\".format(k))\n                    del sd[k]\n        missing, unexpected = self.load_state_dict(sd, strict=False)\n        print(f\"Restored from {path} with {len(missing)} missing and {len(unexpected)} unexpected keys\")\n        if len(missing) > 0:\n            print(f\"Missing Keys: {missing}\")\n            print(f\"Unexpected Keys: 
{unexpected}\")\n\n    def on_train_batch_end(self, *args, **kwargs):\n        if self.use_ema:\n            self.model_ema(self)\n\n    def encode(self, x, ret_feature=False):\n        '''\n        Set ret_feature = True when encoding conditions in ddpm\n        '''\n        # Pad the input first so its size is divisible by min_side.\n        # this is to tolerate different f values, various size inputs, \n        # and some operations in the DDPM unet model.\n        self.h0, self.w0 = x.shape[2:]\n        # 8: window size for max vit\n        # 2**(nr-1): f \n        # 4: factor of downsampling in DDPM unet\n        min_side = 8 * 2**(self.encoder.num_resolutions-1) * 4\n        if self.h0 % min_side != 0:\n            pad_h = min_side - (self.h0 % min_side)\n            if pad_h == self.h0: # this is to avoid padding 256 patches\n                pad_h = 0\n            x = F.pad(x, (0, 0, 0, pad_h), mode='reflect')\n            self.h_padded = True\n            self.pad_h = pad_h\n\n        if self.w0 % min_side != 0:\n            pad_w = min_side - (self.w0 % min_side)\n            if pad_w == self.w0:\n                pad_w = 0\n            x = F.pad(x, (0, pad_w, 0, 0), mode='reflect')\n            self.w_padded = True\n            self.pad_w = pad_w\n\n        phi_list = None\n        if ret_feature:\n            h, phi_list = self.encoder(x, ret_feature)\n        else:\n            h = self.encoder(x)\n        h = self.quant_conv(h)\n        quant, emb_loss, info = self.quantize(h)\n        if ret_feature:\n            return quant, emb_loss, info, phi_list\n        return quant, emb_loss, info\n\n    def encode_to_prequant(self, x):\n        h = self.encoder(x)\n        h = self.quant_conv(h)\n        return h\n\n    def decode(self, quant, x_prev, x_next):\n        cond_dict = dict(\n            phi_prev_list = self.encode(x_prev, ret_feature=True)[-1],\n            phi_next_list = self.encode(x_next, ret_feature=True)[-1],\n            frame_prev = 
F.pad(x_prev, (0, self.pad_w, 0, self.pad_h), mode='reflect'),\n            frame_next = F.pad(x_next, (0, self.pad_w, 0, self.pad_h), mode='reflect')\n        )\n        quant = self.post_quant_conv(quant)\n        dec = self.decoder(quant, cond_dict)\n\n        # check if image is padded and return the original part only\n        if self.h_padded:\n            dec = dec[:, :, 0:self.h0, :]\n        if self.w_padded:\n            dec = dec[:, :, :, 0:self.w0]\n        return dec\n\n    def decode_code(self, code_b):\n        quant_b = self.quantize.embed_code(code_b)\n        dec = self.decode(quant_b)\n        return dec\n\n    def forward(self, input, x_prev, x_next, return_pred_indices=False):\n\n        quant, diff, (_,_,ind) = self.encode(input)\n        dec = self.decode(quant, x_prev, x_next)\n        if return_pred_indices:\n            return dec, diff, ind\n        return dec, diff\n\n    def get_input(self, batch, k):\n        x = batch[k]\n        if len(x.shape) == 3:\n            x = x[..., None]\n        x = x.permute(0, 3, 1, 2).to(memory_format=torch.contiguous_format).float()\n        if self.batch_resize_range is not None:\n            lower_size = self.batch_resize_range[0]\n            upper_size = self.batch_resize_range[1]\n            if self.global_step <= 4:\n                # do the first few batches with max size to avoid later oom\n                new_resize = upper_size\n            else:\n                new_resize = np.random.choice(np.arange(lower_size, upper_size+16, 16))\n            if new_resize != x.shape[2]:\n                x = F.interpolate(x, size=new_resize, mode=\"bicubic\")\n            x = x.detach()\n        return x\n\n    def training_step(self, batch, batch_idx, optimizer_idx):\n        # https://github.com/pytorch/pytorch/issues/37142\n        # try not to fool the heuristics\n        x = self.get_input(batch, self.image_key)\n        x_prev = self.get_input(batch, 'prev_frame')\n        x_next = 
self.get_input(batch, 'next_frame')\n        xrec, qloss = self(x, x_prev, x_next)\n\n        if optimizer_idx == 0:\n            # autoencode\n            aeloss, log_dict_ae = self.loss(qloss, x, xrec, optimizer_idx, self.global_step,\n                                            last_layer=self.get_last_layer(), split=\"train\")\n\n            self.log_dict(log_dict_ae, prog_bar=False, logger=True, on_step=True, on_epoch=True)\n            return aeloss\n\n        if optimizer_idx == 1:\n            # discriminator\n            discloss, log_dict_disc = self.loss(qloss, x, xrec, optimizer_idx, self.global_step,\n                                            last_layer=self.get_last_layer(), split=\"train\")\n            self.log_dict(log_dict_disc, prog_bar=False, logger=True, on_step=True, on_epoch=True)\n            return discloss\n\n    def validation_step(self, batch, batch_idx):\n        log_dict = self._validation_step(batch, batch_idx)\n        with self.ema_scope():\n            log_dict_ema = self._validation_step(batch, batch_idx, suffix=\"_ema\")\n        return log_dict\n\n    def _validation_step(self, batch, batch_idx, suffix=\"\"):\n        x = self.get_input(batch, self.image_key)\n        x_prev = self.get_input(batch, 'prev_frame')\n        x_next = self.get_input(batch, 'next_frame')\n        xrec, qloss = self(x, x_prev, x_next)\n        aeloss, log_dict_ae = self.loss(qloss, x, xrec, 0,\n                                        self.global_step,\n                                        last_layer=self.get_last_layer(),\n                                        split=\"val\"+suffix,\n                                        )\n\n        discloss, log_dict_disc = self.loss(qloss, x, xrec, 1,\n                                            self.global_step,\n                                            last_layer=self.get_last_layer(),\n                                            split=\"val\"+suffix,\n                                            )\n     
   rec_loss = log_dict_ae[f\"val{suffix}/rec_loss\"]\n        self.log(f\"val{suffix}/rec_loss\", rec_loss,\n                   prog_bar=True, logger=True, on_step=False, on_epoch=True, sync_dist=True)\n        self.log(f\"val{suffix}/aeloss\", aeloss,\n                   prog_bar=True, logger=True, on_step=False, on_epoch=True, sync_dist=True)\n        if version.parse(pl.__version__) >= version.parse('1.4.0'):\n            del log_dict_ae[f\"val{suffix}/rec_loss\"]\n        self.log_dict(log_dict_ae)\n        self.log_dict(log_dict_disc)\n        # return the logged values rather than the bound self.log_dict method\n        return log_dict_ae\n\n    def configure_optimizers(self):\n        lr_d = self.learning_rate\n        lr_g = self.lr_g_factor*self.learning_rate\n        print(\"lr_d\", lr_d)\n        print(\"lr_g\", lr_g)\n        opt_ae = torch.optim.Adam(list(self.encoder.parameters())+\n                                  list(self.decoder.parameters())+\n                                  list(self.quantize.parameters())+\n                                  list(self.quant_conv.parameters())+\n                                  list(self.post_quant_conv.parameters()),\n                                  lr=lr_g, betas=(0.5, 0.9))\n        opt_disc = torch.optim.Adam(self.loss.discriminator.parameters(),\n                                    lr=lr_d, betas=(0.5, 0.9))\n\n        if self.scheduler_config is not None:\n            scheduler = instantiate_from_config(self.scheduler_config)\n\n            print(\"Setting up LambdaLR scheduler...\")\n            scheduler = [\n                {\n                    'scheduler': LambdaLR(opt_ae, lr_lambda=scheduler.schedule),\n                    'interval': 'step',\n                    'frequency': 1\n                },\n                {\n                    'scheduler': LambdaLR(opt_disc, lr_lambda=scheduler.schedule),\n                    'interval': 'step',\n                    'frequency': 1\n                },\n            ]\n            return [opt_ae, opt_disc], scheduler\n        
return [opt_ae, opt_disc], []\n\n    def get_last_layer(self):\n        return self.decoder.conv_out.weight\n\n    def log_images(self, batch, only_inputs=False, plot_ema=False, **kwargs):\n        log = dict()\n        x = self.get_input(batch, self.image_key)\n        x_prev = self.get_input(batch, 'prev_frame')\n        x_next = self.get_input(batch, 'next_frame')\n        x = x.to(self.device)\n        if only_inputs:\n            log[\"inputs\"] = x\n            return log\n        xrec, _ = self(x, x_prev, x_next)\n        if x.shape[1] > 3:\n            # colorize with random projection\n            assert xrec.shape[1] > 3\n            x = self.to_rgb(x)\n            xrec = self.to_rgb(xrec)\n        log[\"inputs\"] = x\n        log[\"reconstructions\"] = xrec\n        if plot_ema:\n            with self.ema_scope():\n                xrec_ema, _ = self(x, x_prev, x_next)\n                if x.shape[1] > 3: xrec_ema = self.to_rgb(xrec_ema)\n                log[\"reconstructions_ema\"] = xrec_ema\n        return log\n\n    def to_rgb(self, x):\n        assert self.image_key == \"segmentation\"\n        if not hasattr(self, \"colorize\"):\n            self.register_buffer(\"colorize\", torch.randn(3, x.shape[1], 1, 1).to(x))\n        x = F.conv2d(x, weight=self.colorize)\n        x = 2.*(x-x.min())/(x.max()-x.min()) - 1.\n        return x\n\n    \nclass VQFlowNetInterface(VQFlowNet):\n    def __init__(self, **kwargs):\n        super().__init__(**kwargs)\n\n    def encode(self, x, ret_feature=False):\n        '''\n        Set ret_feature = True when encoding conditions in ddpm\n        '''\n        # Pad the input first so its size is divisible by min_side.\n        # this is to tolerate different f values, various size inputs, \n        # and some operations in the DDPM unet model.\n        self.h0, self.w0 = x.shape[2:]\n        # 8: window size for max vit\n        # 2**(nr-1): f \n        # 4: factor of downsampling in DDPM unet\n        min_side = 512#8 * 
2**(self.encoder.num_resolutions-1) * 16\n        min_side = min_side // 2 if self.h0 <= 256 else min_side\n        if self.h0 % min_side != 0:\n            pad_h = min_side - (self.h0 % min_side)\n            if pad_h == self.h0: # this is to avoid padding 256 patches\n                pad_h = 0\n            x = F.pad(x, (0, 0, 0, pad_h), mode='reflect')\n            self.h_padded = True\n            self.pad_h = pad_h\n\n        if self.w0 % min_side != 0:\n            pad_w = min_side - (self.w0 % min_side)\n            if pad_w == self.w0:\n                pad_w = 0\n            x = F.pad(x, (0, pad_w, 0, 0), mode='reflect')\n            self.w_padded = True\n            self.pad_w = pad_w\n\n        phi_list = None\n        if ret_feature:\n            h, phi_list = self.encoder(x, ret_feature)\n        else:\n            h = self.encoder(x)\n        h = self.quant_conv(h)\n        if ret_feature:\n            return h, phi_list\n        return h\n\n    def decode(self, h, x_prev, x_next, phi_prev_list, phi_next_list, force_not_quantize=False):\n        # also go through quantization layer\n        if not force_not_quantize:\n            quant, emb_loss, info = self.quantize(h)\n        else:\n            quant = h\n        cond_dict = dict(\n            phi_prev_list = phi_prev_list,\n            phi_next_list = phi_next_list,\n            frame_prev = F.pad(x_prev, (0, self.pad_w, 0, self.pad_h), mode='reflect'),\n            frame_next = F.pad(x_next, (0, self.pad_w, 0, self.pad_h), mode='reflect')\n        )\n        quant = self.post_quant_conv(quant)\n        dec = self.decoder(quant, cond_dict)\n\n        # check if image is padded and return the original part only\n        if self.h_padded:\n            dec = dec[:, :, 0:self.h0, :]\n        if self.w_padded:\n            dec = dec[:, :, :, 0:self.w0]\n        return dec"
  },
  {
    "path": "ldm/models/diffusion/__init__.py",
    "content": ""
  },
  {
    "path": "ldm/models/diffusion/ddim.py",
    "content": "\"\"\"SAMPLING ONLY.\"\"\"\n\nimport torch\nimport numpy as np\nfrom tqdm import tqdm\nfrom functools import partial\n\nfrom ldm.modules.diffusionmodules.util import make_ddim_sampling_parameters, make_ddim_timesteps, noise_like\n\n\nclass DDIMSampler(object):\n    def __init__(self, model, schedule=\"linear\", **kwargs):\n        super().__init__()\n        self.model = model\n        self.ddpm_num_timesteps = model.num_timesteps\n        self.schedule = schedule\n\n    def register_buffer(self, name, attr):\n        if type(attr) == torch.Tensor:\n            if attr.device != torch.device(\"cuda\"):\n                attr = attr.to(torch.device(\"cuda\"))\n        setattr(self, name, attr)\n\n    def make_schedule(self, ddim_num_steps, ddim_discretize=\"uniform\", ddim_eta=0., verbose=True):\n        self.ddim_timesteps = make_ddim_timesteps(ddim_discr_method=ddim_discretize, num_ddim_timesteps=ddim_num_steps,\n                                                  num_ddpm_timesteps=self.ddpm_num_timesteps,verbose=verbose)\n        alphas_cumprod = self.model.alphas_cumprod\n        assert alphas_cumprod.shape[0] == self.ddpm_num_timesteps, 'alphas have to be defined for each timestep'\n        to_torch = lambda x: x.clone().detach().to(torch.float32).to(self.model.device)\n\n        self.register_buffer('betas', to_torch(self.model.betas))\n        self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod))\n        self.register_buffer('alphas_cumprod_prev', to_torch(self.model.alphas_cumprod_prev))\n\n        # calculations for diffusion q(x_t | x_{t-1}) and others\n        self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod.cpu())))\n        self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. - alphas_cumprod.cpu())))\n        self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. 
- alphas_cumprod.cpu())))\n        self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu())))\n        self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod.cpu() - 1)))\n\n        # ddim sampling parameters\n        ddim_sigmas, ddim_alphas, ddim_alphas_prev = make_ddim_sampling_parameters(alphacums=alphas_cumprod.cpu(),\n                                                                                   ddim_timesteps=self.ddim_timesteps,\n                                                                                   eta=ddim_eta,verbose=verbose)\n        self.register_buffer('ddim_sigmas', ddim_sigmas)\n        self.register_buffer('ddim_alphas', ddim_alphas)\n        self.register_buffer('ddim_alphas_prev', ddim_alphas_prev)\n        self.register_buffer('ddim_sqrt_one_minus_alphas', np.sqrt(1. - ddim_alphas))\n        sigmas_for_original_sampling_steps = ddim_eta * torch.sqrt(\n            (1 - self.alphas_cumprod_prev) / (1 - self.alphas_cumprod) * (\n                        1 - self.alphas_cumprod / self.alphas_cumprod_prev))\n        self.register_buffer('ddim_sigmas_for_original_num_steps', sigmas_for_original_sampling_steps)\n\n    @torch.no_grad()\n    def sample(self,\n               S,\n               batch_size,\n               shape,\n               conditioning=None,\n               callback=None,\n               normals_sequence=None,\n               img_callback=None,\n               quantize_x0=False,\n               eta=0.,\n               mask=None,\n               x0=None,\n               temperature=1.,\n               noise_dropout=0.,\n               score_corrector=None,\n               corrector_kwargs=None,\n               verbose=True,\n               x_T=None,\n               log_every_t=100,\n               unconditional_guidance_scale=1.,\n               unconditional_conditioning=None,\n               # this has to come in the same format as the conditioning, 
# e.g. as encoded tokens, ...\n               **kwargs\n               ):\n        if conditioning is not None:\n            if isinstance(conditioning, dict):\n                cbs = conditioning[list(conditioning.keys())[0]].shape[0]\n                if cbs != batch_size:\n                    print(f\"Warning: Got {cbs} conditionings but batch-size is {batch_size}\")\n            else:\n                if conditioning.shape[0] != batch_size:\n                    print(f\"Warning: Got {conditioning.shape[0]} conditionings but batch-size is {batch_size}\")\n\n        self.make_schedule(ddim_num_steps=S, ddim_eta=eta, verbose=verbose)\n        # sampling\n        C, H, W = shape\n        size = (batch_size, C, H, W)\n        if verbose:\n            print(f'Data shape for DDIM sampling is {size}, eta {eta}')\n\n        samples, intermediates = self.ddim_sampling(conditioning, size,\n                                                    callback=callback,\n                                                    img_callback=img_callback,\n                                                    quantize_denoised=quantize_x0,\n                                                    mask=mask, x0=x0,\n                                                    ddim_use_original_steps=False,\n                                                    noise_dropout=noise_dropout,\n                                                    temperature=temperature,\n                                                    score_corrector=score_corrector,\n                                                    corrector_kwargs=corrector_kwargs,\n                                                    x_T=x_T,\n                                                    log_every_t=log_every_t,\n                                                    unconditional_guidance_scale=unconditional_guidance_scale,\n                                                    unconditional_conditioning=unconditional_conditioning,\n                     
                               verbose=verbose\n                                                    )\n        return samples, intermediates\n\n    @torch.no_grad()\n    def ddim_sampling(self, cond, shape,\n                      x_T=None, ddim_use_original_steps=False,\n                      callback=None, timesteps=None, quantize_denoised=False,\n                      mask=None, x0=None, img_callback=None, log_every_t=100,\n                      temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None,\n                      unconditional_guidance_scale=1., unconditional_conditioning=None,verbose=True):\n        device = self.model.betas.device\n        b = shape[0]\n        if x_T is None:\n            img = torch.randn(shape, device=device)\n        else:\n            img = x_T\n\n        if timesteps is None:\n            timesteps = self.ddpm_num_timesteps if ddim_use_original_steps else self.ddim_timesteps\n        elif timesteps is not None and not ddim_use_original_steps:\n            subset_end = int(min(timesteps / self.ddim_timesteps.shape[0], 1) * self.ddim_timesteps.shape[0]) - 1\n            timesteps = self.ddim_timesteps[:subset_end]\n\n        intermediates = {'x_inter': [img], 'pred_x0': [img]}\n        time_range = reversed(range(0,timesteps)) if ddim_use_original_steps else np.flip(timesteps)\n        total_steps = timesteps if ddim_use_original_steps else timesteps.shape[0]\n        if verbose:\n            print(f\"Running DDIM Sampling with {total_steps} timesteps\")\n\n        iterator = tqdm(time_range, desc='DDIM Sampler', total=total_steps) if verbose else time_range\n\n        for i, step in enumerate(iterator):\n            index = total_steps - i - 1\n            ts = torch.full((b,), step, device=device, dtype=torch.long)\n\n            if mask is not None:\n                assert x0 is not None\n                img_orig = self.model.q_sample(x0, ts)  # TODO: deterministic forward pass?\n                img = 
img_orig * mask + (1. - mask) * img\n\n            outs = self.p_sample_ddim(img, cond, ts, index=index, use_original_steps=ddim_use_original_steps,\n                                      quantize_denoised=quantize_denoised, temperature=temperature,\n                                      noise_dropout=noise_dropout, score_corrector=score_corrector,\n                                      corrector_kwargs=corrector_kwargs,\n                                      unconditional_guidance_scale=unconditional_guidance_scale,\n                                      unconditional_conditioning=unconditional_conditioning)\n            img, pred_x0 = outs\n            if callback: callback(i)\n            if img_callback: img_callback(pred_x0, i)\n\n            if index % log_every_t == 0 or index == total_steps - 1:\n                intermediates['x_inter'].append(img)\n                intermediates['pred_x0'].append(pred_x0)\n\n        return img, intermediates\n\n    @torch.no_grad()\n    def p_sample_ddim(self, x, c, t, index, repeat_noise=False, use_original_steps=False, quantize_denoised=False,\n                      temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None,\n                      unconditional_guidance_scale=1., unconditional_conditioning=None):\n        b, *_, device = *x.shape, x.device\n\n        if unconditional_conditioning is None or unconditional_guidance_scale == 1.:\n            e_t = self.model.apply_model(x, t, c)\n        else:\n            x_in = torch.cat([x] * 2)\n            t_in = torch.cat([t] * 2)\n            c_in = torch.cat([unconditional_conditioning, c])\n            e_t_uncond, e_t = self.model.apply_model(x_in, t_in, c_in).chunk(2)\n            e_t = e_t_uncond + unconditional_guidance_scale * (e_t - e_t_uncond)\n\n        if score_corrector is not None:\n            assert self.model.parameterization == \"eps\"\n            e_t = score_corrector.modify_score(self.model, e_t, x, t, c, **corrector_kwargs)\n\n   
     alphas = self.model.alphas_cumprod if use_original_steps else self.ddim_alphas\n        alphas_prev = self.model.alphas_cumprod_prev if use_original_steps else self.ddim_alphas_prev\n        sqrt_one_minus_alphas = self.model.sqrt_one_minus_alphas_cumprod if use_original_steps else self.ddim_sqrt_one_minus_alphas\n        sigmas = self.model.ddim_sigmas_for_original_num_steps if use_original_steps else self.ddim_sigmas\n        # select parameters corresponding to the currently considered timestep\n        a_t = torch.full((b, 1, 1, 1), alphas[index], device=device)\n        a_prev = torch.full((b, 1, 1, 1), alphas_prev[index], device=device)\n        sigma_t = torch.full((b, 1, 1, 1), sigmas[index], device=device)\n        sqrt_one_minus_at = torch.full((b, 1, 1, 1), sqrt_one_minus_alphas[index],device=device)\n\n        # current prediction for x_0\n        pred_x0 = (x - sqrt_one_minus_at * e_t) / a_t.sqrt()\n        if quantize_denoised:\n            pred_x0, _, *_ = self.model.first_stage_model.quantize(pred_x0)\n        # direction pointing to x_t\n        dir_xt = (1. - a_prev - sigma_t**2).sqrt() * e_t\n        noise = sigma_t * noise_like(x.shape, device, repeat_noise) * temperature\n        if noise_dropout > 0.:\n            noise = torch.nn.functional.dropout(noise, p=noise_dropout)\n        x_prev = a_prev.sqrt() * pred_x0 + dir_xt + noise\n        return x_prev, pred_x0\n"
  },
  {
    "path": "ldm/models/diffusion/ddpm.py",
    "content": "\"\"\"\nwild mixture of\nhttps://github.com/lucidrains/denoising-diffusion-pytorch/blob/7706bdfc6f527f58d33f84b7b522e61e6e3164b3/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py\nhttps://github.com/openai/improved-diffusion/blob/e94489283bb876ac1477d5dd7709bbbd2d9902ce/improved_diffusion/gaussian_diffusion.py\nhttps://github.com/CompVis/taming-transformers\n-- merci\n\"\"\"\n\nimport torch\nimport torch.nn as nn\nimport numpy as np\nimport pytorch_lightning as pl\nfrom torch.optim.lr_scheduler import LambdaLR\nfrom einops import rearrange, repeat\nfrom contextlib import contextmanager\nfrom functools import partial\nfrom tqdm import tqdm\nfrom torchvision.utils import make_grid\nfrom pytorch_lightning.utilities.distributed import rank_zero_only\n\nfrom ldm.util import log_txt_as_img, exists, default, ismap, isimage, mean_flat, count_params, instantiate_from_config\nfrom ldm.modules.ema import LitEma\nfrom ldm.models.autoencoder import * \nfrom ldm.modules.diffusionmodules.util import make_beta_schedule, extract_into_tensor, noise_like\nfrom ldm.models.diffusion.ddim import DDIMSampler\n\n\n__conditioning_keys__ = {'concat': 'c_concat',\n                         'crossattn': 'c_crossattn',\n                         'adm': 'y'}\n\n\ndef disabled_train(self, mode=True):\n    \"\"\"Overwrite model.train with this function to make sure train/eval mode\n    does not change anymore.\"\"\"\n    return self\n\n\n\nclass DDPM(pl.LightningModule):\n    # classic DDPM with Gaussian diffusion, in image space\n    def __init__(self,\n                 unet_config,\n                 timesteps=1000,\n                 beta_schedule=\"linear\",\n                 loss_type=\"l2\",\n                 ckpt_path=None,\n                 ignore_keys=[],\n                 load_only_unet=False,\n                 monitor=\"val/loss\",\n                 use_ema=True,\n                 first_stage_key=\"image\",\n                 image_size=256,\n                 
channels=3,\n                 log_every_t=100,\n                 clip_denoised=True,\n                 linear_start=1e-4,\n                 linear_end=2e-2,\n                 cosine_s=8e-3,\n                 given_betas=None,\n                 original_elbo_weight=0.,\n                 v_posterior=0.,  # weight for choosing posterior variance as sigma = (1-v) * beta_tilde + v * beta\n                 l_simple_weight=1.,\n                 conditioning_key=None,\n                 parameterization=\"eps\",  # all assuming fixed variance schedules\n                 scheduler_config=None,\n                 use_positional_encodings=False,\n                 learn_logvar=False,\n                 logvar_init=0.,\n                 ):\n        super().__init__()\n        assert parameterization in [\"eps\", \"x0\"], 'currently only supporting \"eps\" and \"x0\"'\n        self.parameterization = parameterization\n        print(f\"{self.__class__.__name__}: Running in {self.parameterization}-prediction mode\")\n        self.cond_stage_model = None\n        self.clip_denoised = clip_denoised\n        self.log_every_t = log_every_t\n        self.first_stage_key = first_stage_key\n        self.image_size = image_size  # try conv?\n        self.channels = channels\n        self.use_positional_encodings = use_positional_encodings\n        self.model = DiffusionWrapper(unet_config, conditioning_key)\n        count_params(self.model, verbose=True)\n        self.use_ema = use_ema\n        if self.use_ema:\n            self.model_ema = LitEma(self.model)\n            print(f\"Keeping EMAs of {len(list(self.model_ema.buffers()))}.\")\n\n        self.use_scheduler = scheduler_config is not None\n        if self.use_scheduler:\n            self.scheduler_config = scheduler_config\n\n        self.v_posterior = v_posterior\n        self.original_elbo_weight = original_elbo_weight\n        self.l_simple_weight = l_simple_weight\n\n        if monitor is not None:\n            self.monitor = 
monitor\n        if ckpt_path is not None:\n            self.init_from_ckpt(ckpt_path, ignore_keys=ignore_keys, only_model=load_only_unet)\n\n        self.register_schedule(given_betas=given_betas, beta_schedule=beta_schedule, timesteps=timesteps,\n                               linear_start=linear_start, linear_end=linear_end, cosine_s=cosine_s)\n\n        self.loss_type = loss_type\n\n        self.learn_logvar = learn_logvar\n        self.logvar = torch.full(fill_value=logvar_init, size=(self.num_timesteps,))\n        if self.learn_logvar:\n            self.logvar = nn.Parameter(self.logvar, requires_grad=True)\n\n\n    def register_schedule(self, given_betas=None, beta_schedule=\"linear\", timesteps=1000,\n                          linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3):\n        if exists(given_betas):\n            betas = given_betas\n        else:\n            betas = make_beta_schedule(beta_schedule, timesteps, linear_start=linear_start, linear_end=linear_end,\n                                       cosine_s=cosine_s)\n        alphas = 1. - betas\n        alphas_cumprod = np.cumprod(alphas, axis=0)\n        alphas_cumprod_prev = np.append(1., alphas_cumprod[:-1])\n\n        timesteps, = betas.shape\n        self.num_timesteps = int(timesteps)\n        self.linear_start = linear_start\n        self.linear_end = linear_end\n        assert alphas_cumprod.shape[0] == self.num_timesteps, 'alphas have to be defined for each timestep'\n\n        to_torch = partial(torch.tensor, dtype=torch.float32)\n\n        self.register_buffer('betas', to_torch(betas))\n        self.register_buffer('alphas_cumprod', to_torch(alphas_cumprod))\n        self.register_buffer('alphas_cumprod_prev', to_torch(alphas_cumprod_prev))\n\n        # calculations for diffusion q(x_t | x_{t-1}) and others\n        self.register_buffer('sqrt_alphas_cumprod', to_torch(np.sqrt(alphas_cumprod)))\n        self.register_buffer('sqrt_one_minus_alphas_cumprod', to_torch(np.sqrt(1. 
- alphas_cumprod)))\n        self.register_buffer('log_one_minus_alphas_cumprod', to_torch(np.log(1. - alphas_cumprod)))\n        self.register_buffer('sqrt_recip_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod)))\n        self.register_buffer('sqrt_recipm1_alphas_cumprod', to_torch(np.sqrt(1. / alphas_cumprod - 1)))\n\n        # calculations for posterior q(x_{t-1} | x_t, x_0)\n        posterior_variance = (1 - self.v_posterior) * betas * (1. - alphas_cumprod_prev) / (\n                    1. - alphas_cumprod) + self.v_posterior * betas\n        # above: equal to 1. / (1. / (1. - alpha_cumprod_tm1) + alpha_t / beta_t)\n        self.register_buffer('posterior_variance', to_torch(posterior_variance))\n        # below: log calculation clipped because the posterior variance is 0 at the beginning of the diffusion chain\n        self.register_buffer('posterior_log_variance_clipped', to_torch(np.log(np.maximum(posterior_variance, 1e-20))))\n        self.register_buffer('posterior_mean_coef1', to_torch(\n            betas * np.sqrt(alphas_cumprod_prev) / (1. - alphas_cumprod)))\n        self.register_buffer('posterior_mean_coef2', to_torch(\n            (1. - alphas_cumprod_prev) * np.sqrt(alphas) / (1. - alphas_cumprod)))\n\n        if self.parameterization == \"eps\":\n            lvlb_weights = self.betas ** 2 / (\n                        2 * self.posterior_variance * to_torch(alphas) * (1 - self.alphas_cumprod))\n        elif self.parameterization == \"x0\":\n            lvlb_weights = 0.5 * np.sqrt(torch.Tensor(alphas_cumprod)) / (2. 
* 1 - torch.Tensor(alphas_cumprod))\n        else:\n            raise NotImplementedError(\"mu not supported\")\n        # TODO how to choose this term\n        lvlb_weights[0] = lvlb_weights[1]\n        self.register_buffer('lvlb_weights', lvlb_weights, persistent=False)\n        assert not torch.isnan(self.lvlb_weights).all()\n\n    @contextmanager\n    def ema_scope(self, context=None):\n        if self.use_ema:\n            self.model_ema.store(self.model.parameters())\n            self.model_ema.copy_to(self.model)\n            if context is not None:\n                print(f\"{context}: Switched to EMA weights\")\n        try:\n            yield None\n        finally:\n            if self.use_ema:\n                self.model_ema.restore(self.model.parameters())\n                if context is not None:\n                    print(f\"{context}: Restored training weights\")\n\n    def init_from_ckpt(self, path, ignore_keys=list(), only_model=False):\n        sd = torch.load(path, map_location=\"cpu\")\n        if \"state_dict\" in list(sd.keys()):\n            sd = sd[\"state_dict\"]\n        keys = list(sd.keys())\n        for k in keys:\n            for ik in ignore_keys:\n                if k.startswith(ik):\n                    print(\"Deleting key {} from state_dict.\".format(k))\n                    del sd[k]\n        missing, unexpected = self.load_state_dict(sd, strict=False) if not only_model else self.model.load_state_dict(\n            sd, strict=False)\n        print(f\"Restored from {path} with {len(missing)} missing and {len(unexpected)} unexpected keys\")\n        if len(missing) > 0:\n            print(f\"Missing Keys: {missing}\")\n        if len(unexpected) > 0:\n            print(f\"Unexpected Keys: {unexpected}\")\n\n    def q_mean_variance(self, x_start, t):\n        \"\"\"\n        Get the distribution q(x_t | x_0).\n        :param x_start: the [N x C x ...] 
tensor of noiseless inputs.\n        :param t: the number of diffusion steps (minus 1). Here, 0 means one step.\n        :return: A tuple (mean, variance, log_variance), all of x_start's shape.\n        \"\"\"\n        mean = (extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start)\n        variance = extract_into_tensor(1.0 - self.alphas_cumprod, t, x_start.shape)\n        log_variance = extract_into_tensor(self.log_one_minus_alphas_cumprod, t, x_start.shape)\n        return mean, variance, log_variance\n\n    def predict_start_from_noise(self, x_t, t, noise):\n        return (\n                extract_into_tensor(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t -\n                extract_into_tensor(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape) * noise\n        )\n\n    def q_posterior(self, x_start, x_t, t):\n        posterior_mean = (\n                extract_into_tensor(self.posterior_mean_coef1, t, x_t.shape) * x_start +\n                extract_into_tensor(self.posterior_mean_coef2, t, x_t.shape) * x_t\n        )\n        posterior_variance = extract_into_tensor(self.posterior_variance, t, x_t.shape)\n        posterior_log_variance_clipped = extract_into_tensor(self.posterior_log_variance_clipped, t, x_t.shape)\n        return posterior_mean, posterior_variance, posterior_log_variance_clipped\n\n    def p_mean_variance(self, x, t, clip_denoised: bool):\n        model_out = self.model(x, t)\n        if self.parameterization == \"eps\":\n            x_recon = self.predict_start_from_noise(x, t=t, noise=model_out)\n        elif self.parameterization == \"x0\":\n            x_recon = model_out\n        if clip_denoised:\n            x_recon.clamp_(-1., 1.)\n\n        model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t)\n        return model_mean, posterior_variance, posterior_log_variance\n\n    @torch.no_grad()\n    def p_sample(self, x, t, clip_denoised=True, repeat_noise=False):\n       
 b, *_, device = *x.shape, x.device\n        model_mean, _, model_log_variance = self.p_mean_variance(x=x, t=t, clip_denoised=clip_denoised)\n        noise = noise_like(x.shape, device, repeat_noise)\n        # no noise when t == 0\n        nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))\n        return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise\n\n    @torch.no_grad()\n    def p_sample_loop(self, shape, return_intermediates=False):\n        device = self.betas.device\n        b = shape[0]\n        img = torch.randn(shape, device=device)\n        intermediates = [img]\n        for i in tqdm(reversed(range(0, self.num_timesteps)), desc='Sampling t', total=self.num_timesteps):\n            img = self.p_sample(img, torch.full((b,), i, device=device, dtype=torch.long),\n                                clip_denoised=self.clip_denoised)\n            if i % self.log_every_t == 0 or i == self.num_timesteps - 1:\n                intermediates.append(img)\n        if return_intermediates:\n            return img, intermediates\n        return img\n\n    @torch.no_grad()\n    def sample(self, batch_size=16, return_intermediates=False):\n        image_size = self.image_size\n        channels = self.channels\n        return self.p_sample_loop((batch_size, channels, image_size, image_size),\n                                  return_intermediates=return_intermediates)\n\n    def q_sample(self, x_start, t, noise=None):\n        noise = default(noise, lambda: torch.randn_like(x_start))\n        return (extract_into_tensor(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start +\n                extract_into_tensor(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise)\n\n    def get_loss(self, pred, target, mean=True):\n        if self.loss_type == 'l1':\n            loss = (target - pred).abs()\n            if mean:\n                loss = loss.mean()\n        elif self.loss_type == 'l2':\n            if mean:\n       
         loss = torch.nn.functional.mse_loss(target, pred)\n            else:\n                loss = torch.nn.functional.mse_loss(target, pred, reduction='none')\n        else:\n            raise NotImplementedError(f\"unknown loss type '{self.loss_type}'\")\n\n        return loss\n\n    def p_losses(self, x_start, t, noise=None):\n        noise = default(noise, lambda: torch.randn_like(x_start))\n        x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)\n        model_out = self.model(x_noisy, t)\n\n        loss_dict = {}\n        if self.parameterization == \"eps\":\n            target = noise\n        elif self.parameterization == \"x0\":\n            target = x_start\n        else:\n            raise NotImplementedError(f\"Parameterization {self.parameterization} not yet supported\")\n\n        loss = self.get_loss(model_out, target, mean=False).mean(dim=[1, 2, 3])\n\n        log_prefix = 'train' if self.training else 'val'\n\n        loss_dict.update({f'{log_prefix}/loss_simple': loss.mean()})\n        loss_simple = loss.mean() * self.l_simple_weight\n\n        loss_vlb = (self.lvlb_weights[t] * loss).mean()\n        loss_dict.update({f'{log_prefix}/loss_vlb': loss_vlb})\n\n        loss = loss_simple + self.original_elbo_weight * loss_vlb\n\n        loss_dict.update({f'{log_prefix}/loss': loss})\n\n        return loss, loss_dict\n\n    def forward(self, x, *args, **kwargs):\n        # b, c, h, w, device, img_size, = *x.shape, x.device, self.image_size\n        # assert h == img_size and w == img_size, f'height and width of image must be {img_size}'\n        t = torch.randint(0, self.num_timesteps, (x.shape[0],), device=self.device).long()\n        return self.p_losses(x, t, *args, **kwargs)\n\n    def get_input(self, batch, k):\n        x = batch[k]\n        if len(x.shape) == 3:\n            x = x[..., None]\n        x = rearrange(x, 'b h w c -> b c h w')\n        x = x.to(memory_format=torch.contiguous_format).float()\n        return x\n\n    def 
shared_step(self, batch):\n        x = self.get_input(batch, self.first_stage_key)\n        loss, loss_dict = self(x)\n        return loss, loss_dict\n\n    def training_step(self, batch, batch_idx):\n        loss, loss_dict = self.shared_step(batch)\n\n        self.log_dict(loss_dict, prog_bar=True,\n                      logger=True, on_step=True, on_epoch=True)\n\n        self.log(\"global_step\", float(self.global_step),\n                 prog_bar=True, logger=True, on_step=True, on_epoch=False)\n\n        if self.use_scheduler:\n            lr = self.optimizers().param_groups[0]['lr']\n            self.log('lr_abs', lr, prog_bar=True, logger=True, on_step=True, on_epoch=False)\n\n        return loss\n\n    @torch.no_grad()\n    def validation_step(self, batch, batch_idx):\n        _, loss_dict_no_ema = self.shared_step(batch)\n        with self.ema_scope():\n            _, loss_dict_ema = self.shared_step(batch)\n            loss_dict_ema = {key + '_ema': loss_dict_ema[key] for key in loss_dict_ema}\n        self.log_dict(loss_dict_no_ema, prog_bar=False, logger=True, on_step=False, on_epoch=True)\n        self.log_dict(loss_dict_ema, prog_bar=False, logger=True, on_step=False, on_epoch=True)\n\n    def on_train_batch_end(self, *args, **kwargs):\n        if self.use_ema:\n            self.model_ema(self.model)\n\n    def _get_rows_from_list(self, samples):\n        n_imgs_per_row = len(samples)\n        denoise_grid = rearrange(samples, 'n b c h w -> b n c h w')\n        denoise_grid = rearrange(denoise_grid, 'b n c h w -> (b n) c h w')\n        denoise_grid = make_grid(denoise_grid, nrow=n_imgs_per_row)\n        return denoise_grid\n\n    @torch.no_grad()\n    def log_images(self, batch, N=8, n_row=2, sample=True, return_keys=None, **kwargs):\n        log = dict()\n        x = self.get_input(batch, self.first_stage_key)\n        N = min(x.shape[0], N)\n        n_row = min(x.shape[0], n_row)\n        x = x.to(self.device)[:N]\n        log[\"inputs\"] = x\n\n   
     # get diffusion row\n        diffusion_row = list()\n        x_start = x[:n_row]\n\n        for t in range(self.num_timesteps):\n            if t % self.log_every_t == 0 or t == self.num_timesteps - 1:\n                t = repeat(torch.tensor([t]), '1 -> b', b=n_row)\n                t = t.to(self.device).long()\n                noise = torch.randn_like(x_start)\n                x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)\n                diffusion_row.append(x_noisy)\n\n        log[\"diffusion_row\"] = self._get_rows_from_list(diffusion_row)\n\n        if sample:\n            # get denoise row\n            with self.ema_scope(\"Plotting\"):\n                samples, denoise_row = self.sample(batch_size=N, return_intermediates=True)\n\n            log[\"samples\"] = samples\n            log[\"denoise_row\"] = self._get_rows_from_list(denoise_row)\n\n        if return_keys:\n            if np.intersect1d(list(log.keys()), return_keys).shape[0] == 0:\n                return log\n            else:\n                return {key: log[key] for key in return_keys}\n        return log\n\n    def configure_optimizers(self):\n        lr = self.learning_rate\n        params = list(self.model.parameters())\n        if self.learn_logvar:\n            params = params + [self.logvar]\n        opt = torch.optim.AdamW(params, lr=lr)\n        return opt\n\n\nclass LatentDiffusion(DDPM):\n    \"\"\"main class\"\"\"\n    def __init__(self,\n                 first_stage_config,\n                 cond_stage_config,\n                 num_timesteps_cond=None,\n                 cond_stage_key=\"image\",\n                 cond_stage_trainable=False,\n                 concat_mode=True,\n                 cond_stage_forward=None,\n                 conditioning_key=None,\n                 scale_factor=1.0,\n                 *args, **kwargs):\n        self.num_timesteps_cond = default(num_timesteps_cond, 1)\n        assert self.num_timesteps_cond <= kwargs['timesteps']\n       
 # for backwards compatibility after implementation of DiffusionWrapper\n        if conditioning_key is None:\n            conditioning_key = 'concat' if concat_mode else 'crossattn'\n        if cond_stage_config == '__is_unconditional__':\n            conditioning_key = None\n        ckpt_path = kwargs.pop(\"ckpt_path\", None)\n        ignore_keys = kwargs.pop(\"ignore_keys\", [])\n        super().__init__(conditioning_key=conditioning_key, *args, **kwargs)\n        self.concat_mode = concat_mode\n        self.cond_stage_trainable = cond_stage_trainable\n        self.cond_stage_key = cond_stage_key\n        try:\n            self.num_downs = len(first_stage_config.params.ddconfig.ch_mult) - 1\n        except:\n            self.num_downs = 0\n        self.scale_factor = scale_factor\n\n        self.instantiate_first_stage(first_stage_config)\n        self.instantiate_cond_stage(cond_stage_config)\n        self.cond_stage_forward = cond_stage_forward\n        self.clip_denoised = False\n\n        self.restarted_from_ckpt = False\n        if ckpt_path is not None:\n            self.init_from_ckpt(ckpt_path, ignore_keys)\n            self.restarted_from_ckpt = True\n\n    def make_cond_schedule(self, ):\n        self.cond_ids = torch.full(size=(self.num_timesteps,), fill_value=self.num_timesteps - 1, dtype=torch.long)\n        ids = torch.round(torch.linspace(0, self.num_timesteps - 1, self.num_timesteps_cond)).long()\n        self.cond_ids[:self.num_timesteps_cond] = ids\n\n\n    def register_schedule(self,\n                          given_betas=None, beta_schedule=\"linear\", timesteps=1000,\n                          linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3):\n        super().register_schedule(given_betas, beta_schedule, timesteps, linear_start, linear_end, cosine_s)\n\n        self.shorten_cond_schedule = self.num_timesteps_cond > 1\n        if self.shorten_cond_schedule:\n            self.make_cond_schedule()\n\n    def instantiate_first_stage(self, 
config):\n        model = instantiate_from_config(config)\n        self.first_stage_model = model.eval()\n        self.first_stage_model.train = disabled_train\n        for param in self.first_stage_model.parameters():\n            param.requires_grad = False\n\n    def instantiate_cond_stage(self, config):\n        if not self.cond_stage_trainable:\n            if config == \"__is_first_stage__\":\n                print(\"Using first stage also as cond stage.\")\n                self.cond_stage_model = self.first_stage_model\n            elif config == \"__is_unconditional__\":\n                print(f\"Training {self.__class__.__name__} as an unconditional model.\")\n                self.cond_stage_model = None\n                # self.be_unconditional = True\n            else:\n                model = instantiate_from_config(config)\n                self.cond_stage_model = model.eval()\n                self.cond_stage_model.train = disabled_train\n                for param in self.cond_stage_model.parameters():\n                    param.requires_grad = False\n        else:\n            assert config != '__is_first_stage__'\n            assert config != '__is_unconditional__'\n            model = instantiate_from_config(config)\n            self.cond_stage_model = model\n\n    def _get_denoise_row_from_list(self, samples, desc='', force_no_decoder_quantization=False):\n        denoise_row = []\n        for zd in tqdm(samples, desc=desc):\n            denoise_row.append(self.decode_first_stage(zd.to(self.device),\n                                                            force_not_quantize=force_no_decoder_quantization))\n        n_imgs_per_row = len(denoise_row)\n        denoise_row = torch.stack(denoise_row)  # n_log_step, n_row, C, H, W\n        denoise_grid = rearrange(denoise_row, 'n b c h w -> b n c h w')\n        denoise_grid = rearrange(denoise_grid, 'b n c h w -> (b n) c h w')\n        denoise_grid = make_grid(denoise_grid, nrow=n_imgs_per_row)\n        
return denoise_grid\n\n    def get_first_stage_encoding(self, encoder_posterior):\n        if isinstance(encoder_posterior, torch.Tensor):\n            z = encoder_posterior\n        else:\n            raise NotImplementedError(f\"encoder_posterior of type '{type(encoder_posterior)}' not yet implemented\")\n        return self.scale_factor * z\n\n    def get_learned_conditioning(self, c):\n        if self.cond_stage_forward is None:\n            if hasattr(self.cond_stage_model, 'encode') and callable(self.cond_stage_model.encode):\n                c = self.cond_stage_model.encode(c)\n            else:\n                c = self.cond_stage_model(c)\n        else:\n            assert hasattr(self.cond_stage_model, self.cond_stage_forward)\n            c = getattr(self.cond_stage_model, self.cond_stage_forward)(c)\n        return c\n\n\n    @torch.no_grad()\n    def get_input(self, batch, k, return_first_stage_outputs=False, force_c_encode=False,\n                  cond_key=None, return_original_cond=False, bs=None):\n        x = super().get_input(batch, k)\n        if bs is not None:\n            x = x[:bs]\n        x = x.to(self.device)\n        encoder_posterior = self.encode_first_stage(x)\n        z = self.get_first_stage_encoding(encoder_posterior).detach()\n\n        if self.model.conditioning_key is not None:\n            if cond_key is None:\n                cond_key = self.cond_stage_key\n            if cond_key != self.first_stage_key:\n                if cond_key in ['caption', 'coordinates_bbox']:\n                    xc = batch[cond_key]\n                elif cond_key == 'class_label':\n                    xc = batch\n                else:\n                    xc = super().get_input(batch, cond_key).to(self.device)\n            else:\n                xc = x\n            if not self.cond_stage_trainable or force_c_encode:\n                if isinstance(xc, dict) or isinstance(xc, list):\n                    # import pudb; pudb.set_trace()\n                
    c = self.get_learned_conditioning(xc)\n                else:\n                    c = self.get_learned_conditioning(xc.to(self.device))\n            else:\n                c = xc\n            if bs is not None:\n                c = c[:bs]\n\n            if self.use_positional_encodings:\n                pos_x, pos_y = self.compute_latent_shifts(batch)\n                ckey = __conditioning_keys__[self.model.conditioning_key]\n                c = {ckey: c, 'pos_x': pos_x, 'pos_y': pos_y}\n\n        else:\n            c = None\n            xc = None\n            if self.use_positional_encodings:\n                pos_x, pos_y = self.compute_latent_shifts(batch)\n                c = {'pos_x': pos_x, 'pos_y': pos_y}\n        out = [z, c]\n        if return_first_stage_outputs:\n            xrec = self.decode_first_stage(z)\n            out.extend([x, xrec])\n        if return_original_cond:\n            out.append(xc)\n        return out\n\n    @torch.no_grad()\n    def decode_first_stage(self, z, predict_cids=False, force_not_quantize=False):\n        if predict_cids:\n            if z.dim() == 4:\n                z = torch.argmax(z.exp(), dim=1).long()\n            z = self.first_stage_model.quantize.get_codebook_entry(z, shape=None)\n            z = rearrange(z, 'b h w c -> b c h w').contiguous()\n\n        z = 1. 
/ self.scale_factor * z\n\n        return self.first_stage_model.decode(z)\n\n\n    @torch.no_grad()\n    def encode_first_stage(self, x):\n        return self.first_stage_model.encode(x)\n\n    def shared_step(self, batch, **kwargs):\n        x, c = self.get_input(batch, self.first_stage_key)\n        loss = self(x, c)\n        return loss\n\n    def forward(self, x, c, *args, **kwargs):\n        t = torch.randint(0, self.num_timesteps, (x.shape[0],), device=self.device).long()\n        if self.model.conditioning_key is not None:\n            assert c is not None\n            if self.cond_stage_trainable:\n                c = self.get_learned_conditioning(c)\n            if self.shorten_cond_schedule:  # TODO: drop this option\n                tc = self.cond_ids[t].to(self.device)\n                c = self.q_sample(x_start=c, t=tc, noise=torch.randn_like(c.float()))\n        return self.p_losses(x, c, t, *args, **kwargs)\n\n\n    def apply_model(self, x_noisy, t, cond, return_ids=False):\n\n        if isinstance(cond, dict):\n            # hybrid case, cond is expected to be a dict\n            pass\n        else:\n            if not isinstance(cond, list):\n                cond = [cond]\n            key = 'c_concat' if self.model.conditioning_key == 'concat' else 'c_crossattn'\n            cond = {key: cond}\n\n        x_recon = self.model(x_noisy, t, **cond)\n\n        if isinstance(x_recon, tuple) and not return_ids:\n            return x_recon[0]\n        else:\n            return x_recon\n\n\n    def p_losses(self, x_start, cond, t, noise=None):\n        noise = default(noise, lambda: torch.randn_like(x_start))\n        x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)\n        model_output = self.apply_model(x_noisy, t, cond)\n\n        loss_dict = {}\n        prefix = 'train' if self.training else 'val'\n\n        if self.parameterization == \"x0\":\n            target = x_start\n        elif self.parameterization == \"eps\":\n            target = 
noise\n        else:\n            raise NotImplementedError()\n\n        loss_simple = self.get_loss(model_output, target, mean=False).mean([1, 2, 3])\n        loss_dict.update({f'{prefix}/loss_simple': loss_simple.mean()})\n\n        logvar_t = self.logvar[t].to(self.device)\n        loss = loss_simple / torch.exp(logvar_t) + logvar_t\n        # loss = loss_simple / torch.exp(self.logvar) + self.logvar\n        if self.learn_logvar:\n            loss_dict.update({f'{prefix}/loss_gamma': loss.mean()})\n            loss_dict.update({'logvar': self.logvar.data.mean()})\n\n        loss = self.l_simple_weight * loss.mean()\n\n        loss_vlb = self.get_loss(model_output, target, mean=False).mean(dim=(1, 2, 3))\n        loss_vlb = (self.lvlb_weights[t] * loss_vlb).mean()\n        loss_dict.update({f'{prefix}/loss_vlb': loss_vlb})\n        loss += (self.original_elbo_weight * loss_vlb)\n        loss_dict.update({f'{prefix}/loss': loss})\n\n        return loss, loss_dict\n\n    def p_mean_variance(self, x, c, t, clip_denoised: bool, return_codebook_ids=False, quantize_denoised=False,\n                        return_x0=False, score_corrector=None, corrector_kwargs=None):\n        t_in = t\n        model_out = self.apply_model(x, t_in, c, return_ids=return_codebook_ids)\n\n        if score_corrector is not None:\n            assert self.parameterization == \"eps\"\n            model_out = score_corrector.modify_score(self, model_out, x, t, c, **corrector_kwargs)\n\n        if return_codebook_ids:\n            model_out, logits = model_out\n\n        if self.parameterization == \"eps\":\n            x_recon = self.predict_start_from_noise(x, t=t, noise=model_out)\n        elif self.parameterization == \"x0\":\n            x_recon = model_out\n        else:\n            raise NotImplementedError()\n\n        if clip_denoised:\n            x_recon.clamp_(-1., 1.)\n        if quantize_denoised:\n            x_recon, _, [_, _, indices] = 
self.first_stage_model.quantize(x_recon)\n        model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t)\n        if return_codebook_ids:\n            return model_mean, posterior_variance, posterior_log_variance, logits\n        elif return_x0:\n            return model_mean, posterior_variance, posterior_log_variance, x_recon\n        else:\n            return model_mean, posterior_variance, posterior_log_variance\n\n    @torch.no_grad()\n    def p_sample(self, x, c, t, clip_denoised=False, repeat_noise=False,\n                 return_codebook_ids=False, quantize_denoised=False, return_x0=False,\n                 temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None):\n        b, *_, device = *x.shape, x.device\n        outputs = self.p_mean_variance(x=x, c=c, t=t, clip_denoised=clip_denoised,\n                                       return_codebook_ids=return_codebook_ids,\n                                       quantize_denoised=quantize_denoised,\n                                       return_x0=return_x0,\n                                       score_corrector=score_corrector, corrector_kwargs=corrector_kwargs)\n        if return_codebook_ids:\n            raise DeprecationWarning(\"Support dropped.\")\n            model_mean, _, model_log_variance, logits = outputs\n        elif return_x0:\n            model_mean, _, model_log_variance, x0 = outputs\n        else:\n            model_mean, _, model_log_variance = outputs\n\n        noise = noise_like(x.shape, device, repeat_noise) * temperature\n        if noise_dropout > 0.:\n            noise = torch.nn.functional.dropout(noise, p=noise_dropout)\n        # no noise when t == 0\n        nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))\n\n        # if return_codebook_ids:\n        #     return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise, logits.argmax(dim=1)\n        if return_x0:\n    
        return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise, x0\n        else:\n            return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise\n\n    @torch.no_grad()\n    def progressive_denoising(self, cond, shape, verbose=True, callback=None, quantize_denoised=False,\n                              img_callback=None, mask=None, x0=None, temperature=1., noise_dropout=0.,\n                              score_corrector=None, corrector_kwargs=None, batch_size=None, x_T=None, start_T=None,\n                              log_every_t=None):\n        if not log_every_t:\n            log_every_t = self.log_every_t\n        timesteps = self.num_timesteps\n        if batch_size is not None:\n            b = batch_size if batch_size is not None else shape[0]\n            shape = [batch_size] + list(shape)\n        else:\n            b = batch_size = shape[0]\n        if x_T is None:\n            img = torch.randn(shape, device=self.device)\n        else:\n            img = x_T\n        intermediates = []\n        if cond is not None:\n            if isinstance(cond, dict):\n                cond = {key: cond[key][:batch_size] if not isinstance(cond[key], list) else\n                list(map(lambda x: x[:batch_size], cond[key])) for key in cond}\n            else:\n                cond = [c[:batch_size] for c in cond] if isinstance(cond, list) else cond[:batch_size]\n\n        if start_T is not None:\n            timesteps = min(timesteps, start_T)\n        iterator = tqdm(reversed(range(0, timesteps)), desc='Progressive Generation',\n                        total=timesteps) if verbose else reversed(\n            range(0, timesteps))\n        if type(temperature) == float:\n            temperature = [temperature] * timesteps\n\n        for i in iterator:\n            ts = torch.full((b,), i, device=self.device, dtype=torch.long)\n            if self.shorten_cond_schedule:\n                assert self.model.conditioning_key != 
'hybrid'\n                tc = self.cond_ids[ts].to(cond.device)\n                cond = self.q_sample(x_start=cond, t=tc, noise=torch.randn_like(cond))\n\n            img, x0_partial = self.p_sample(img, cond, ts,\n                                            clip_denoised=self.clip_denoised,\n                                            quantize_denoised=quantize_denoised, return_x0=True,\n                                            temperature=temperature[i], noise_dropout=noise_dropout,\n                                            score_corrector=score_corrector, corrector_kwargs=corrector_kwargs)\n            if mask is not None:\n                assert x0 is not None\n                img_orig = self.q_sample(x0, ts)\n                img = img_orig * mask + (1. - mask) * img\n\n            if i % log_every_t == 0 or i == timesteps - 1:\n                intermediates.append(x0_partial)\n            if callback: callback(i)\n            if img_callback: img_callback(img, i)\n        return img, intermediates\n\n    @torch.no_grad()\n    def p_sample_loop(self, cond, shape, return_intermediates=False,\n                      x_T=None, verbose=True, callback=None, timesteps=None, quantize_denoised=False,\n                      mask=None, x0=None, img_callback=None, start_T=None,\n                      log_every_t=None):\n\n        if not log_every_t:\n            log_every_t = self.log_every_t\n        device = self.betas.device\n        b = shape[0]\n        if x_T is None:\n            img = torch.randn(shape, device=device)\n        else:\n            img = x_T\n\n        intermediates = [img]\n        if timesteps is None:\n            timesteps = self.num_timesteps\n\n        if start_T is not None:\n            timesteps = min(timesteps, start_T)\n        iterator = tqdm(reversed(range(0, timesteps)), desc='Sampling t', total=timesteps) if verbose else reversed(\n            range(0, timesteps))\n\n        if mask is not None:\n            assert x0 is not 
None\n            assert x0.shape[2:] == mask.shape[2:]  # spatial size has to match\n\n        for i in iterator:\n            ts = torch.full((b,), i, device=device, dtype=torch.long)\n            if self.shorten_cond_schedule:\n                assert self.model.conditioning_key != 'hybrid'\n                tc = self.cond_ids[ts].to(cond.device)\n                cond = self.q_sample(x_start=cond, t=tc, noise=torch.randn_like(cond))\n\n            img = self.p_sample(img, cond, ts,\n                                clip_denoised=self.clip_denoised,\n                                quantize_denoised=quantize_denoised)\n            if mask is not None:\n                img_orig = self.q_sample(x0, ts)\n                img = img_orig * mask + (1. - mask) * img\n\n            if i % log_every_t == 0 or i == timesteps - 1:\n                intermediates.append(img)\n            if callback: callback(i)\n            if img_callback: img_callback(img, i)\n\n        if return_intermediates:\n            return img, intermediates\n        return img\n\n    @torch.no_grad()\n    def sample(self, cond, batch_size=16, return_intermediates=False, x_T=None,\n               verbose=True, timesteps=None, quantize_denoised=False,\n               mask=None, x0=None, shape=None,**kwargs):\n        if shape is None:\n            shape = (batch_size, self.channels, self.image_size, self.image_size)\n        if cond is not None:\n            if isinstance(cond, dict):\n                cond = {key: cond[key][:batch_size] if not isinstance(cond[key], list) else\n                list(map(lambda x: x[:batch_size], cond[key])) for key in cond}\n            else:\n                cond = [c[:batch_size] for c in cond] if isinstance(cond, list) else cond[:batch_size]\n        return self.p_sample_loop(cond,\n                                  shape,\n                                  return_intermediates=return_intermediates, x_T=x_T,\n                                  verbose=verbose, 
timesteps=timesteps, quantize_denoised=quantize_denoised,\n                                  mask=mask, x0=x0)\n\n    @torch.no_grad()\n    def sample_log(self,cond,batch_size,ddim, ddim_steps,**kwargs):\n\n        if ddim:\n            ddim_sampler = DDIMSampler(self)\n            shape = (self.channels, self.image_size, self.image_size)\n            samples, intermediates =ddim_sampler.sample(ddim_steps,batch_size,\n                                                        shape,cond,verbose=False,**kwargs)\n\n        else:\n            samples, intermediates = self.sample(cond=cond, batch_size=batch_size,\n                                                 return_intermediates=True,**kwargs)\n\n        return samples, intermediates\n\n\n    @torch.no_grad()\n    def log_images(self, batch, N=8, n_row=4, sample=True, ddim_steps=200, ddim_eta=1., return_keys=None,\n                   quantize_denoised=True, inpaint=True, plot_denoise_rows=False, plot_progressive_rows=True,\n                   plot_diffusion_rows=True, **kwargs):\n\n        use_ddim = ddim_steps is not None\n\n        log = dict()\n        z, c, x, xrec, xc = self.get_input(batch, self.first_stage_key,\n                                           return_first_stage_outputs=True,\n                                           force_c_encode=True,\n                                           return_original_cond=True,\n                                           bs=N)\n        N = min(x.shape[0], N)\n        n_row = min(x.shape[0], n_row)\n        log[\"inputs\"] = x\n        log[\"reconstruction\"] = xrec\n        if self.model.conditioning_key is not None:\n            if hasattr(self.cond_stage_model, \"decode\"):\n                xc = self.cond_stage_model.decode(c)\n                log[\"conditioning\"] = xc\n            elif self.cond_stage_key in [\"caption\"]:\n                xc = log_txt_as_img((x.shape[2], x.shape[3]), batch[\"caption\"])\n                log[\"conditioning\"] = xc\n            elif 
self.cond_stage_key == 'class_label':\n                xc = log_txt_as_img((x.shape[2], x.shape[3]), batch[\"human_label\"])\n                log['conditioning'] = xc\n            elif isimage(xc):\n                log[\"conditioning\"] = xc\n            if ismap(xc):\n                log[\"original_conditioning\"] = self.to_rgb(xc)\n\n        if plot_diffusion_rows:\n            # get diffusion row\n            diffusion_row = list()\n            z_start = z[:n_row]\n            for t in range(self.num_timesteps):\n                if t % self.log_every_t == 0 or t == self.num_timesteps - 1:\n                    t = repeat(torch.tensor([t]), '1 -> b', b=n_row)\n                    t = t.to(self.device).long()\n                    noise = torch.randn_like(z_start)\n                    z_noisy = self.q_sample(x_start=z_start, t=t, noise=noise)\n                    diffusion_row.append(self.decode_first_stage(z_noisy))\n\n            diffusion_row = torch.stack(diffusion_row)  # n_log_step, n_row, C, H, W\n            diffusion_grid = rearrange(diffusion_row, 'n b c h w -> b n c h w')\n            diffusion_grid = rearrange(diffusion_grid, 'b n c h w -> (b n) c h w')\n            diffusion_grid = make_grid(diffusion_grid, nrow=diffusion_row.shape[0])\n            log[\"diffusion_row\"] = diffusion_grid\n\n        if sample:\n            # get denoise row\n            with self.ema_scope(\"Plotting\"):\n                samples, z_denoise_row = self.sample_log(cond=c,batch_size=N,ddim=use_ddim,\n                                                         ddim_steps=ddim_steps,eta=ddim_eta)\n                # samples, z_denoise_row = self.sample(cond=c, batch_size=N, return_intermediates=True)\n            x_samples = self.decode_first_stage(samples)\n            log[\"samples\"] = x_samples\n            if plot_denoise_rows:\n                denoise_grid = self._get_denoise_row_from_list(z_denoise_row)\n                log[\"denoise_row\"] = denoise_grid\n\n            if 
quantize_denoised:\n                # also display when quantizing x0 while sampling\n                with self.ema_scope(\"Plotting Quantized Denoised\"):\n                    samples, z_denoise_row = self.sample_log(cond=c,batch_size=N,ddim=use_ddim,\n                                                             ddim_steps=ddim_steps,eta=ddim_eta,\n                                                             quantize_denoised=True)\n                    # samples, z_denoise_row = self.sample(cond=c, batch_size=N, return_intermediates=True,\n                    #                                      quantize_denoised=True)\n                x_samples = self.decode_first_stage(samples.to(self.device))\n                log[\"samples_x0_quantized\"] = x_samples\n\n            if inpaint:\n                # make a simple center square\n                b, h, w = z.shape[0], z.shape[2], z.shape[3]\n                mask = torch.ones(N, h, w).to(self.device)\n                # zeros will be filled in\n                mask[:, h // 4:3 * h // 4, w // 4:3 * w // 4] = 0.\n                mask = mask[:, None, ...]\n                with self.ema_scope(\"Plotting Inpaint\"):\n\n                    samples, _ = self.sample_log(cond=c,batch_size=N,ddim=use_ddim, eta=ddim_eta,\n                                                ddim_steps=ddim_steps, x0=z[:N], mask=mask)\n                x_samples = self.decode_first_stage(samples.to(self.device))\n                log[\"samples_inpainting\"] = x_samples\n                log[\"mask\"] = mask\n\n                # outpaint\n                with self.ema_scope(\"Plotting Outpaint\"):\n                    samples, _ = self.sample_log(cond=c, batch_size=N, ddim=use_ddim,eta=ddim_eta,\n                                                ddim_steps=ddim_steps, x0=z[:N], mask=mask)\n                x_samples = self.decode_first_stage(samples.to(self.device))\n                log[\"samples_outpainting\"] = x_samples\n\n        if 
plot_progressive_rows:\n            with self.ema_scope(\"Plotting Progressives\"):\n                img, progressives = self.progressive_denoising(c,\n                                                               shape=(self.channels, self.image_size, self.image_size),\n                                                               batch_size=N)\n            prog_row = self._get_denoise_row_from_list(progressives, desc=\"Progressive Generation\")\n            log[\"progressive_row\"] = prog_row\n\n        if return_keys:\n            if np.intersect1d(list(log.keys()), return_keys).shape[0] == 0:\n                return log\n            else:\n                return {key: log[key] for key in return_keys}\n        return log\n\n    def configure_optimizers(self):\n        lr = self.learning_rate\n        params = list(self.model.parameters())\n        if self.cond_stage_trainable:\n            print(f\"{self.__class__.__name__}: Also optimizing conditioner params!\")\n            params = params + list(self.cond_stage_model.parameters())\n        if self.learn_logvar:\n            print('Diffusion model optimizing logvar')\n            params.append(self.logvar)\n        opt = torch.optim.AdamW(params, lr=lr)\n        if self.use_scheduler:\n            assert 'target' in self.scheduler_config\n            scheduler = instantiate_from_config(self.scheduler_config)\n\n            print(\"Setting up LambdaLR scheduler...\")\n            scheduler = [\n                {\n                    'scheduler': LambdaLR(opt, lr_lambda=scheduler.schedule),\n                    'interval': 'step',\n                    'frequency': 1\n                }]\n            return [opt], scheduler\n        return opt\n\n    @torch.no_grad()\n    def to_rgb(self, x):\n        x = x.float()\n        if not hasattr(self, \"colorize\"):\n            self.colorize = torch.randn(3, x.shape[1], 1, 1).to(x)\n        x = nn.functional.conv2d(x, weight=self.colorize)\n        x = 2. 
* (x - x.min()) / (x.max() - x.min()) - 1.\n        return x\n\n\nclass DiffusionWrapper(pl.LightningModule):\n    def __init__(self, diff_model_config, conditioning_key):\n        super().__init__()\n        self.diffusion_model = instantiate_from_config(diff_model_config)\n        self.conditioning_key = conditioning_key\n        assert self.conditioning_key in [None, 'concat', 'crossattn', 'hybrid', 'adm', 'mcvd']\n\n    def forward(self, x, t, c_concat: list = None, c_crossattn: list = None):\n        if self.conditioning_key is None:\n            out = self.diffusion_model(x, t)\n        elif self.conditioning_key == 'concat':\n            xc = torch.cat([x] + c_concat, dim=1)\n            out = self.diffusion_model(xc, t)\n        elif self.conditioning_key == 'crossattn':\n            cc = torch.cat(c_crossattn, 1)\n            out = self.diffusion_model(x, t, context=cc)\n        elif self.conditioning_key == 'hybrid':\n            xc = torch.cat([x] + c_concat, dim=1)\n            cc = torch.cat(c_crossattn, 1)\n            out = self.diffusion_model(xc, t, context=cc)\n        elif self.conditioning_key == 'adm':\n            cc = c_crossattn[0]\n            out = self.diffusion_model(x, t, y=cc)\n        elif self.conditioning_key == 'mcvd':\n            out = self.diffusion_model(x, t, cond=c_concat[0])\n        else:\n            raise NotImplementedError()\n\n        return out\n\n\n\n#########################################################\n#########################################################\n#########################################################\n#########################################################\n#########################################################\n#########################################################\n\n\nclass LatentDiffusionVFI(DDPM):\n    \"\"\"main class\"\"\"\n    def __init__(self,\n                 first_stage_config,\n                 cond_stage_config,\n                 num_timesteps_cond=None,\n                 
cond_stage_key=\"image\",\n                 cond_stage_trainable=False,\n                 concat_mode=True,\n                 cond_stage_forward=None,\n                 conditioning_key=None,\n                 scale_factor=1.0,\n                 *args, **kwargs):\n        self.num_timesteps_cond = default(num_timesteps_cond, 1)\n        assert self.num_timesteps_cond <= kwargs['timesteps']\n        # for backwards compatibility after implementation of DiffusionWrapper\n        if conditioning_key is None:\n            conditioning_key = 'concat' if concat_mode else 'crossattn'\n        if cond_stage_config == '__is_unconditional__':\n            conditioning_key = None\n        ckpt_path = kwargs.pop(\"ckpt_path\", None)\n        ignore_keys = kwargs.pop(\"ignore_keys\", [])\n        super().__init__(conditioning_key=conditioning_key, *args, **kwargs)\n        self.concat_mode = concat_mode\n        self.cond_stage_trainable = cond_stage_trainable\n        self.cond_stage_key = cond_stage_key\n        try:\n            self.num_downs = len(first_stage_config.params.ddconfig.ch_mult) - 1\n        except:\n            self.num_downs = 0\n        self.scale_factor = scale_factor\n\n        self.instantiate_first_stage(first_stage_config)\n        self.instantiate_cond_stage(cond_stage_config)\n        self.cond_stage_forward = cond_stage_forward\n        self.clip_denoised = False\n\n        self.restarted_from_ckpt = False\n        if ckpt_path is not None:\n            self.init_from_ckpt(ckpt_path, ignore_keys)\n            self.restarted_from_ckpt = True\n\n    def make_cond_schedule(self, ):\n        self.cond_ids = torch.full(size=(self.num_timesteps,), fill_value=self.num_timesteps - 1, dtype=torch.long)\n        ids = torch.round(torch.linspace(0, self.num_timesteps - 1, self.num_timesteps_cond)).long()\n        self.cond_ids[:self.num_timesteps_cond] = ids\n\n\n    def register_schedule(self,\n                          given_betas=None, 
beta_schedule=\"linear\", timesteps=1000,\n                          linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3):\n        super().register_schedule(given_betas, beta_schedule, timesteps, linear_start, linear_end, cosine_s)\n\n        self.shorten_cond_schedule = self.num_timesteps_cond > 1\n        if self.shorten_cond_schedule:\n            self.make_cond_schedule()\n\n    def instantiate_first_stage(self, config):\n        model = instantiate_from_config(config)\n        self.first_stage_model = model.eval()\n        self.first_stage_model.train = disabled_train\n        for param in self.first_stage_model.parameters():\n            param.requires_grad = False\n\n    def instantiate_cond_stage(self, config):\n        if not self.cond_stage_trainable:\n            if config == \"__is_first_stage__\":\n                print(\"Using first stage also as cond stage.\")\n                self.cond_stage_model = self.first_stage_model\n            elif config == \"__is_unconditional__\":\n                print(f\"Training {self.__class__.__name__} as an unconditional model.\")\n                self.cond_stage_model = None\n                # self.be_unconditional = True\n            else:\n                model = instantiate_from_config(config)\n                self.cond_stage_model = model.eval()\n                self.cond_stage_model.train = disabled_train\n                for param in self.cond_stage_model.parameters():\n                    param.requires_grad = False\n        else:\n            assert config != '__is_first_stage__'\n            assert config != '__is_unconditional__'\n            model = instantiate_from_config(config)\n            self.cond_stage_model = model\n\n    def _get_denoise_row_from_list(self, samples, xc=None, phi_prev_list=None, phi_next_list=None, desc='', force_no_decoder_quantization=False):\n        denoise_row = []\n        for zd in tqdm(samples, desc=desc):\n            
denoise_row.append(self.decode_first_stage(zd.to(self.device), xc, phi_prev_list, phi_next_list,\n                                        force_not_quantize=force_no_decoder_quantization))\n \n        n_imgs_per_row = len(denoise_row)\n        denoise_row = torch.stack(denoise_row)  # n_log_step, n_row, C, H, W\n        denoise_grid = rearrange(denoise_row, 'n b c h w -> b n c h w')\n        denoise_grid = rearrange(denoise_grid, 'b n c h w -> (b n) c h w')\n        denoise_grid = make_grid(denoise_grid, nrow=n_imgs_per_row)\n        return denoise_grid\n\n    def get_first_stage_encoding(self, encoder_posterior):\n        if isinstance(encoder_posterior, torch.Tensor):\n            z = encoder_posterior\n        else:\n            raise NotImplementedError(f\"encoder_posterior of type '{type(encoder_posterior)}' not yet implemented\")\n        return self.scale_factor * z\n\n    def get_learned_conditioning(self, c):\n        phi_prev_list, phi_next_list = None, None\n        if isinstance(c, dict) and 'prev_frame' in c.keys():\n            c_prev, phi_prev_list = self.cond_stage_model.encode(c['prev_frame'], ret_feature=True)\n            c_next, phi_next_list = self.cond_stage_model.encode(c['next_frame'], ret_feature=True)\n            c = torch.cat([c_prev, c_next], dim=1)\n        else:\n            c = self.cond_stage_model.encode(c)\n        return c, phi_prev_list, phi_next_list\n\n\n    @torch.no_grad()\n    def get_input(self, batch, k, return_first_stage_outputs=False, force_c_encode=False,\n                  cond_key=None, return_original_cond=False, return_phi=False, bs=None):\n        x = super().get_input(batch, k)\n        if bs is not None:\n            x = x[:bs]\n        x = x.to(self.device)\n\n        encoder_posterior = self.encode_first_stage(x)\n        z = self.get_first_stage_encoding(encoder_posterior).detach()\n        \n        phi_prev_list, phi_next_list = None, None\n        assert self.model.conditioning_key is not None\n        if 
cond_key is None:\n            cond_key = self.cond_stage_key\n        assert cond_key == 'past_future_frames'\n        xc = {'prev_frame': super().get_input(batch, 'prev_frame'),\n              'next_frame': super().get_input(batch, 'next_frame')}\n\n        if not self.cond_stage_trainable or force_c_encode:\n            if isinstance(xc, dict) or isinstance(xc, list):\n                c, phi_prev_list, phi_next_list = self.get_learned_conditioning(xc)\n            else:\n                c, phi_prev_list, phi_next_list = self.get_learned_conditioning(xc.to(self.device))\n        else:\n            c = xc\n        if bs is not None:\n            c = c[:bs]\n            if isinstance(xc, dict):\n                xc['prev_frame'] = xc['prev_frame'][:bs]\n                xc['next_frame'] = xc['next_frame'][:bs]\n            if phi_prev_list and phi_next_list:\n                phi_prev_list = [phi_prev[:bs] for phi_prev in phi_prev_list]\n                phi_next_list = [phi_next[:bs] for phi_next in phi_next_list]\n\n\n        out = [z, c]\n        if return_first_stage_outputs:\n            xrec = self.decode_first_stage(z, xc, phi_prev_list, phi_next_list)\n            out.extend([x, xrec])\n        if return_original_cond:\n            out.append(xc)\n        if return_phi:\n            out.append(phi_prev_list)\n            out.append(phi_next_list)\n        return out\n\n    @torch.no_grad()\n    def decode_first_stage(self, z, xc=None, phi_prev_list=None, phi_next_list=None, predict_cids=False, force_not_quantize=False):\n        if predict_cids:\n            if z.dim() == 4:\n                z = torch.argmax(z.exp(), dim=1).long()\n            z = self.first_stage_model.quantize.get_codebook_entry(z, shape=None)\n            z = rearrange(z, 'b h w c -> b c h w').contiguous()\n\n        z = 1. 
/ self.scale_factor * z\n\n        return self.first_stage_model.decode(z, xc['prev_frame'], xc['next_frame'], phi_prev_list, phi_next_list, force_not_quantize=predict_cids or force_not_quantize)\n\n\n    @torch.no_grad()\n    def encode_first_stage(self, x):\n        return self.first_stage_model.encode(x)\n\n    def shared_step(self, batch, **kwargs):\n        x, c = self.get_input(batch, self.first_stage_key)\n        loss = self(x, c)\n        return loss\n\n    def forward(self, x, c, *args, **kwargs):\n        t = torch.randint(0, self.num_timesteps, (x.shape[0],), device=self.device).long()\n        if self.model.conditioning_key is not None:\n            assert c is not None\n            if self.cond_stage_trainable:\n                c, _, _ = self.get_learned_conditioning(c)\n            if self.shorten_cond_schedule:  # TODO: drop this option\n                tc = self.cond_ids[t].to(self.device)\n                c = self.q_sample(x_start=c, t=tc, noise=torch.randn_like(c.float()))\n        return self.p_losses(x, c, t, *args, **kwargs)\n\n\n    def apply_model(self, x_noisy, t, cond, return_ids=False):\n\n        if isinstance(cond, dict):\n            # hybrid case, cond is expected to be a dict\n            pass\n        else:\n            if not isinstance(cond, list):\n                cond = [cond]\n            key = 'c_concat' if self.model.conditioning_key == 'concat' else 'c_crossattn'\n            cond = {key: cond}\n        \n        x_recon = self.model(x_noisy, t, **cond)\n\n        if isinstance(x_recon, tuple) and not return_ids:\n            return x_recon[0]\n        else:\n            return x_recon\n\n\n    def p_losses(self, x_start, cond, t, noise=None):\n        noise = default(noise, lambda: torch.randn_like(x_start))\n        x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)\n        model_output = self.apply_model(x_noisy, t, cond)\n\n        loss_dict = {}\n        prefix = 'train' if self.training else 'val'\n\n        
if self.parameterization == \"x0\":\n            target = x_start\n        elif self.parameterization == \"eps\":\n            target = noise\n        else:\n            raise NotImplementedError()\n\n        loss_simple = self.get_loss(model_output, target, mean=False).mean([1, 2, 3])\n        loss_dict.update({f'{prefix}/loss_simple': loss_simple.mean()})\n\n        logvar_t = self.logvar[t].to(self.device)\n        loss = loss_simple / torch.exp(logvar_t) + logvar_t\n        # loss = loss_simple / torch.exp(self.logvar) + self.logvar\n        if self.learn_logvar:\n            loss_dict.update({f'{prefix}/loss_gamma': loss.mean()})\n            loss_dict.update({'logvar': self.logvar.data.mean()})\n\n        loss = self.l_simple_weight * loss.mean()\n\n        loss_vlb = self.get_loss(model_output, target, mean=False).mean(dim=(1, 2, 3))\n        loss_vlb = (self.lvlb_weights[t] * loss_vlb).mean()\n        loss_dict.update({f'{prefix}/loss_vlb': loss_vlb})\n        loss += (self.original_elbo_weight * loss_vlb)\n        loss_dict.update({f'{prefix}/loss': loss})\n\n        return loss, loss_dict\n\n    def p_mean_variance(self, x, c, t, clip_denoised: bool, return_codebook_ids=False, quantize_denoised=False,\n                        return_x0=False, score_corrector=None, corrector_kwargs=None):\n        t_in = t\n        model_out = self.apply_model(x, t_in, c, return_ids=return_codebook_ids)\n\n        if score_corrector is not None:\n            assert self.parameterization == \"eps\"\n            model_out = score_corrector.modify_score(self, model_out, x, t, c, **corrector_kwargs)\n\n        if return_codebook_ids:\n            model_out, logits = model_out\n\n        if self.parameterization == \"eps\":\n            x_recon = self.predict_start_from_noise(x, t=t, noise=model_out)\n        elif self.parameterization == \"x0\":\n            x_recon = model_out\n        else:\n            raise NotImplementedError()\n\n        if clip_denoised:\n            
x_recon.clamp_(-1., 1.)\n        if quantize_denoised:\n            x_recon, _, [_, _, indices] = self.first_stage_model.quantize(x_recon)\n        model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start=x_recon, x_t=x, t=t)\n        if return_codebook_ids:\n            return model_mean, posterior_variance, posterior_log_variance, logits\n        elif return_x0:\n            return model_mean, posterior_variance, posterior_log_variance, x_recon\n        else:\n            return model_mean, posterior_variance, posterior_log_variance\n\n    @torch.no_grad()\n    def p_sample(self, x, c, t, clip_denoised=False, repeat_noise=False,\n                 return_codebook_ids=False, quantize_denoised=False, return_x0=False,\n                 temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None):\n        b, *_, device = *x.shape, x.device\n        outputs = self.p_mean_variance(x=x, c=c, t=t, clip_denoised=clip_denoised,\n                                       return_codebook_ids=return_codebook_ids,\n                                       quantize_denoised=quantize_denoised,\n                                       return_x0=return_x0,\n                                       score_corrector=score_corrector, corrector_kwargs=corrector_kwargs)\n        if return_codebook_ids:\n            raise DeprecationWarning(\"Support dropped.\")\n            model_mean, _, model_log_variance, logits = outputs\n        elif return_x0:\n            model_mean, _, model_log_variance, x0 = outputs\n        else:\n            model_mean, _, model_log_variance = outputs\n\n        noise = noise_like(x.shape, device, repeat_noise) * temperature\n        if noise_dropout > 0.:\n            noise = torch.nn.functional.dropout(noise, p=noise_dropout)\n        # no noise when t == 0\n        nonzero_mask = (1 - (t == 0).float()).reshape(b, *((1,) * (len(x.shape) - 1)))\n\n        # if return_codebook_ids:\n        #     return model_mean + 
nonzero_mask * (0.5 * model_log_variance).exp() * noise, logits.argmax(dim=1)\n        if return_x0:\n            return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise, x0\n        else:\n            return model_mean + nonzero_mask * (0.5 * model_log_variance).exp() * noise\n\n    @torch.no_grad()\n    def progressive_denoising(self, cond, shape, verbose=True, callback=None, quantize_denoised=False,\n                              img_callback=None, mask=None, x0=None, temperature=1., noise_dropout=0.,\n                              score_corrector=None, corrector_kwargs=None, batch_size=None, x_T=None, start_T=None,\n                              log_every_t=None):\n        if not log_every_t:\n            log_every_t = self.log_every_t\n        timesteps = self.num_timesteps\n        if batch_size is not None:\n            b = batch_size if batch_size is not None else shape[0]\n            shape = [batch_size] + list(shape)\n        else:\n            b = batch_size = shape[0]\n        if x_T is None:\n            img = torch.randn(shape, device=self.device)\n        else:\n            img = x_T\n        intermediates = []\n        if cond is not None:\n            if isinstance(cond, dict):\n                cond = {key: cond[key][:batch_size] if not isinstance(cond[key], list) else\n                list(map(lambda x: x[:batch_size], cond[key])) for key in cond}\n            else:\n                cond = [c[:batch_size] for c in cond] if isinstance(cond, list) else cond[:batch_size]\n\n        if start_T is not None:\n            timesteps = min(timesteps, start_T)\n        iterator = tqdm(reversed(range(0, timesteps)), desc='Progressive Generation',\n                        total=timesteps) if verbose else reversed(\n            range(0, timesteps))\n        if type(temperature) == float:\n            temperature = [temperature] * timesteps\n\n        for i in iterator:\n            ts = torch.full((b,), i, device=self.device, 
dtype=torch.long)\n            if self.shorten_cond_schedule:\n                assert self.model.conditioning_key != 'hybrid'\n                tc = self.cond_ids[ts].to(cond.device)\n                cond = self.q_sample(x_start=cond, t=tc, noise=torch.randn_like(cond))\n\n            img, x0_partial = self.p_sample(img, cond, ts,\n                                            clip_denoised=self.clip_denoised,\n                                            quantize_denoised=quantize_denoised, return_x0=True,\n                                            temperature=temperature[i], noise_dropout=noise_dropout,\n                                            score_corrector=score_corrector, corrector_kwargs=corrector_kwargs)\n            if mask is not None:\n                assert x0 is not None\n                img_orig = self.q_sample(x0, ts)\n                img = img_orig * mask + (1. - mask) * img\n\n            if i % log_every_t == 0 or i == timesteps - 1:\n                intermediates.append(x0_partial)\n            if callback: callback(i)\n            if img_callback: img_callback(img, i)\n        return img, intermediates\n\n    @torch.no_grad()\n    def p_sample_loop(self, cond, shape, return_intermediates=False,\n                      x_T=None, verbose=True, callback=None, timesteps=None, quantize_denoised=False,\n                      mask=None, x0=None, img_callback=None, start_T=None,\n                      log_every_t=None):\n\n        if not log_every_t:\n            log_every_t = self.log_every_t\n        device = self.betas.device\n        b = shape[0]\n        if x_T is None:\n            img = torch.randn(shape, device=device)\n        else:\n            img = x_T\n\n        intermediates = [img]\n        if timesteps is None:\n            timesteps = self.num_timesteps\n\n        if start_T is not None:\n            timesteps = min(timesteps, start_T)\n        iterator = tqdm(reversed(range(0, timesteps)), desc='Sampling t', total=timesteps) if 
verbose else reversed(\n            range(0, timesteps))\n\n        if mask is not None:\n            assert x0 is not None\n            assert x0.shape[2:4] == mask.shape[2:4]  # spatial size (H, W) has to match\n\n        for i in iterator:\n            ts = torch.full((b,), i, device=device, dtype=torch.long)\n            if self.shorten_cond_schedule:\n                assert self.model.conditioning_key != 'hybrid'\n                tc = self.cond_ids[ts].to(cond.device)\n                cond = self.q_sample(x_start=cond, t=tc, noise=torch.randn_like(cond))\n\n            img = self.p_sample(img, cond, ts,\n                                clip_denoised=self.clip_denoised,\n                                quantize_denoised=quantize_denoised)\n            if mask is not None:\n                img_orig = self.q_sample(x0, ts)\n                img = img_orig * mask + (1. - mask) * img\n\n            if i % log_every_t == 0 or i == timesteps - 1:\n                intermediates.append(img)\n            if callback: callback(i)\n            if img_callback: img_callback(img, i)\n\n        if return_intermediates:\n            return img, intermediates\n        return img\n\n    @torch.no_grad()\n    def sample(self, cond, batch_size=16, return_intermediates=False, x_T=None,\n               verbose=True, timesteps=None, quantize_denoised=False,\n               mask=None, x0=None, shape=None,**kwargs):\n        if shape is None:\n            shape = (batch_size, self.channels, self.image_size, self.image_size)\n        if cond is not None:\n            if isinstance(cond, dict):\n                cond = {key: cond[key][:batch_size] if not isinstance(cond[key], list) else\n                list(map(lambda x: x[:batch_size], cond[key])) for key in cond}\n            else:\n                cond = [c[:batch_size] for c in cond] if isinstance(cond, list) else cond[:batch_size]\n        return self.p_sample_loop(cond,\n                                  shape,\n                           
       return_intermediates=return_intermediates, x_T=x_T,\n                                  verbose=verbose, timesteps=timesteps, quantize_denoised=quantize_denoised,\n                                  mask=mask, x0=x0)\n    @torch.no_grad()\n    def sample_ddpm(self, conditioning, batch_size=16, return_intermediates=False, x_T=None,\n                    verbose=True, timesteps=None, quantize_denoised=False,\n                    mask=None, x0=None, shape=None,**kwargs):\n        if shape is None:\n            shape = (batch_size, self.channels, self.image_size, self.image_size)\n        elif len(shape) == 3:\n            shape = (batch_size,) + shape\n        if conditioning is not None:\n            if isinstance(conditioning, dict):\n                conditioning = {key: conditioning[key][:batch_size] if not isinstance(conditioning[key], list) else\n                list(map(lambda x: x[:batch_size], conditioning[key])) for key in conditioning}\n            else:\n                conditioning = [c[:batch_size] for c in conditioning] if isinstance(conditioning, list) else conditioning[:batch_size]\n        return self.p_sample_loop(conditioning,\n                                  shape,\n                                  return_intermediates=return_intermediates, x_T=x_T,\n                                  verbose=verbose, timesteps=timesteps, quantize_denoised=quantize_denoised,\n                                  mask=mask, x0=x0)\n\n    @torch.no_grad()\n    def sample_log(self,cond,batch_size,ddim, ddim_steps,**kwargs):\n\n        if ddim:\n            ddim_sampler = DDIMSampler(self)\n            shape = (self.channels, self.image_size, self.image_size)\n            samples, intermediates =ddim_sampler.sample(ddim_steps,batch_size,\n                                                        shape,cond,verbose=False,**kwargs)\n\n        else:\n            samples, intermediates = self.sample(cond=cond, batch_size=batch_size,\n                                       
          return_intermediates=True,**kwargs)\n\n        return samples, intermediates\n\n\n    @torch.no_grad()\n    def log_images(self, batch, N=8, n_row=4, sample=True, ddim_steps=200, ddim_eta=1., return_keys=None,\n                   quantize_denoised=True, plot_denoise_rows=False, plot_progressive_rows=True,\n                   plot_diffusion_rows=True, **kwargs):\n\n        use_ddim = ddim_steps is not None\n\n        log = dict()\n        z, c, x, xrec, xc, phi_prev_list, phi_next_list = self.get_input(batch, \n                                           self.first_stage_key,\n                                           return_first_stage_outputs=True,\n                                           force_c_encode=True,\n                                           return_original_cond=True,\n                                           return_phi=True,\n                                           bs=N)\n        N = min(x.shape[0], N)\n        n_row = min(x.shape[0], n_row)\n        log[\"inputs\"] = x\n        log[\"reconstruction\"] = xrec\n\n        if plot_diffusion_rows:\n            # get diffusion row\n            diffusion_row = list()\n            z_start = z[:n_row]\n            for t in range(self.num_timesteps):\n                if t % self.log_every_t == 0 or t == self.num_timesteps - 1:\n                    t = repeat(torch.tensor([t]), '1 -> b', b=n_row)\n                    t = t.to(self.device).long()\n                    noise = torch.randn_like(z_start)\n                    z_noisy = self.q_sample(x_start=z_start, t=t, noise=noise)\n                    diffusion_row.append(self.decode_first_stage(z_noisy, xc, phi_prev_list, phi_next_list))\n\n            diffusion_row = torch.stack(diffusion_row)  # n_log_step, n_row, C, H, W\n            diffusion_grid = rearrange(diffusion_row, 'n b c h w -> b n c h w')\n            diffusion_grid = rearrange(diffusion_grid, 'b n c h w -> (b n) c h w')\n            diffusion_grid = make_grid(diffusion_grid, 
nrow=diffusion_row.shape[0])\n            log[\"diffusion_row\"] = diffusion_grid\n\n        if sample:\n            # get denoise row\n            with self.ema_scope(\"Plotting\"):\n                samples, z_denoise_row = self.sample_log(cond=c,batch_size=N,ddim=use_ddim,\n                                                         ddim_steps=ddim_steps,eta=ddim_eta, x_T=None)\n                # samples, z_denoise_row = self.sample(cond=c, batch_size=N, return_intermediates=True)\n            x_samples = self.decode_first_stage(samples, xc, phi_prev_list, phi_next_list)\n            log[\"samples\"] = x_samples\n            if plot_denoise_rows:\n                denoise_grid = self._get_denoise_row_from_list(z_denoise_row, \n                                                               xc,\n                                                               phi_prev_list,\n                                                               phi_next_list)\n                log[\"denoise_row\"] = denoise_grid\n\n            if quantize_denoised:\n                # also display when quantizing x0 while sampling\n                with self.ema_scope(\"Plotting Quantized Denoised\"):\n                    samples, z_denoise_row = self.sample_log(cond=c,batch_size=N,ddim=use_ddim,\n                                                             ddim_steps=ddim_steps,eta=ddim_eta,\n                                                             quantize_denoised=True, x_T=None)\n                    # samples, z_denoise_row = self.sample(cond=c, batch_size=N, return_intermediates=True,\n                    #                                      quantize_denoised=True)\n                x_samples = self.decode_first_stage(samples.to(self.device), xc, phi_prev_list, phi_next_list)\n                log[\"samples_x0_quantized\"] = x_samples\n\n        if plot_progressive_rows:\n            with self.ema_scope(\"Plotting Progressives\"):\n                img, progressives = 
self.progressive_denoising(c,\n                                                               shape=(self.channels, self.image_size, self.image_size),\n                                                               batch_size=N, x_T=None)\n            prog_row = self._get_denoise_row_from_list(progressives, \n                                                       xc,\n                                                       phi_prev_list,\n                                                       phi_next_list,\n                                                       desc=\"Progressive Generation\")\n            log[\"progressive_row\"] = prog_row\n\n        if return_keys:\n            if np.intersect1d(list(log.keys()), return_keys).shape[0] == 0:\n                return log\n            else:\n                return {key: log[key] for key in return_keys}\n        return log\n\n    def configure_optimizers(self):\n        lr = self.learning_rate\n        params = list(self.model.parameters())\n        if self.cond_stage_trainable:\n            print(f\"{self.__class__.__name__}: Also optimizing conditioner params!\")\n            params = params + list(self.cond_stage_model.parameters())\n        if self.learn_logvar:\n            print('Diffusion model optimizing logvar')\n            params.append(self.logvar)\n        opt = torch.optim.AdamW(params, lr=lr)\n        if self.use_scheduler:\n            assert 'target' in self.scheduler_config\n            scheduler = instantiate_from_config(self.scheduler_config)\n\n            print(\"Setting up LambdaLR scheduler...\")\n            scheduler = [\n                {\n                    'scheduler': LambdaLR(opt, lr_lambda=scheduler.schedule),\n                    'interval': 'step',\n                    'frequency': 1\n                }]\n            return [opt], scheduler\n        return opt\n\n    @torch.no_grad()\n    def to_rgb(self, x):\n        x = x.float()\n        if not hasattr(self, \"colorize\"):\n       
     self.colorize = torch.randn(3, x.shape[1], 1, 1).to(x)\n        x = nn.functional.conv2d(x, weight=self.colorize)\n        x = 2. * (x - x.min()) / (x.max() - x.min()) - 1.\n        return x\n"
  },
  {
    "path": "ldm/modules/attention.py",
    "content": "from inspect import isfunction\nimport math\nimport torch\nimport torch.nn.functional as F\nfrom torch import nn, einsum\nfrom einops import rearrange, repeat\n\nfrom ldm.modules.diffusionmodules.util import checkpoint\n\n\ndef exists(val):\n    return val is not None\n\n\ndef uniq(arr):\n    return{el: True for el in arr}.keys()\n\n\ndef default(val, d):\n    if exists(val):\n        return val\n    return d() if isfunction(d) else d\n\n\ndef max_neg_value(t):\n    return -torch.finfo(t.dtype).max\n\n\ndef init_(tensor):\n    dim = tensor.shape[-1]\n    std = 1 / math.sqrt(dim)\n    tensor.uniform_(-std, std)\n    return tensor\n\n\n# feedforward\nclass GEGLU(nn.Module):\n    def __init__(self, dim_in, dim_out):\n        super().__init__()\n        self.proj = nn.Linear(dim_in, dim_out * 2)\n\n    def forward(self, x):\n        x, gate = self.proj(x).chunk(2, dim=-1)\n        return x * F.gelu(gate)\n\n\nclass FeedForward(nn.Module):\n    def __init__(self, dim, dim_out=None, mult=4, glu=False, dropout=0.):\n        super().__init__()\n        inner_dim = int(dim * mult)\n        dim_out = default(dim_out, dim)\n        project_in = nn.Sequential(\n            nn.Linear(dim, inner_dim),\n            nn.GELU()\n        ) if not glu else GEGLU(dim, inner_dim)\n\n        self.net = nn.Sequential(\n            project_in,\n            nn.Dropout(dropout),\n            nn.Linear(inner_dim, dim_out)\n        )\n\n    def forward(self, x):\n        return self.net(x)\n\n\ndef zero_module(module):\n    \"\"\"\n    Zero out the parameters of a module and return it.\n    \"\"\"\n    for p in module.parameters():\n        p.detach().zero_()\n    return module\n\n\ndef Normalize(in_channels):\n    return torch.nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True)\n\n\nclass LinearAttention(nn.Module):\n    def __init__(self, dim, heads=4, dim_head=32):\n        super().__init__()\n        self.heads = heads\n        hidden_dim = dim_head 
* heads\n        self.to_qkv = nn.Conv2d(dim, hidden_dim * 3, 1, bias = False)\n        self.to_out = nn.Conv2d(hidden_dim, dim, 1)\n\n    def forward(self, x):\n        b, c, h, w = x.shape\n        qkv = self.to_qkv(x)\n        q, k, v = rearrange(qkv, 'b (qkv heads c) h w -> qkv b heads c (h w)', heads = self.heads, qkv=3)\n        k = k.softmax(dim=-1)  \n        context = torch.einsum('bhdn,bhen->bhde', k, v)\n        out = torch.einsum('bhde,bhdn->bhen', context, q)\n        out = rearrange(out, 'b heads c (h w) -> b (heads c) h w', heads=self.heads, h=h, w=w)\n        return self.to_out(out)\n\n\nclass SpatialSelfAttention(nn.Module):\n    def __init__(self, in_channels):\n        super().__init__()\n        self.in_channels = in_channels\n\n        self.norm = Normalize(in_channels)\n        self.q = torch.nn.Conv2d(in_channels,\n                                 in_channels,\n                                 kernel_size=1,\n                                 stride=1,\n                                 padding=0)\n        self.k = torch.nn.Conv2d(in_channels,\n                                 in_channels,\n                                 kernel_size=1,\n                                 stride=1,\n                                 padding=0)\n        self.v = torch.nn.Conv2d(in_channels,\n                                 in_channels,\n                                 kernel_size=1,\n                                 stride=1,\n                                 padding=0)\n        self.proj_out = torch.nn.Conv2d(in_channels,\n                                        in_channels,\n                                        kernel_size=1,\n                                        stride=1,\n                                        padding=0)\n\n    def forward(self, x):\n        h_ = x\n        h_ = self.norm(h_)\n        q = self.q(h_)\n        k = self.k(h_)\n        v = self.v(h_)\n\n        # compute attention\n        b,c,h,w = q.shape\n        q = rearrange(q, 'b c 
h w -> b (h w) c')\n        k = rearrange(k, 'b c h w -> b c (h w)')\n        w_ = torch.einsum('bij,bjk->bik', q, k)\n\n        w_ = w_ * (int(c)**(-0.5))\n        w_ = torch.nn.functional.softmax(w_, dim=2)\n\n        # attend to values\n        v = rearrange(v, 'b c h w -> b c (h w)')\n        w_ = rearrange(w_, 'b i j -> b j i')\n        h_ = torch.einsum('bij,bjk->bik', v, w_)\n        h_ = rearrange(h_, 'b c (h w) -> b c h w', h=h)\n        h_ = self.proj_out(h_)\n\n        return x+h_\n\n\nclass CrossAttention(nn.Module):\n    '''\n    Perform self-attention if context is None, else cross-attention.\n    The dims of the input and output of the block are the same (arg query_dim).\n    '''\n    def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0.):\n        super().__init__()\n        inner_dim = dim_head * heads\n        context_dim = default(context_dim, query_dim)\n\n        self.scale = dim_head ** -0.5\n        self.heads = heads\n\n        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)\n        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)\n        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)\n\n        self.to_out = nn.Sequential(\n            nn.Linear(inner_dim, query_dim),\n            nn.Dropout(dropout)\n        )\n\n    def forward(self, x, context=None, mask=None):\n        h = self.heads\n\n        q = self.to_q(x)\n        context = default(context, x)\n        k = self.to_k(context)\n        v = self.to_v(context)\n\n        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=h), (q, k, v))\n\n        sim = einsum('b i d, b j d -> b i j', q, k) * self.scale\n\n        if exists(mask):\n            mask = rearrange(mask, 'b ... 
-> b (...)')\n            max_neg_value = -torch.finfo(sim.dtype).max\n            mask = repeat(mask, 'b j -> (b h) () j', h=h)\n            sim.masked_fill_(~mask, max_neg_value)\n\n        # attention, what we cannot get enough of\n        attn = sim.softmax(dim=-1)\n\n        out = einsum('b i j, b j d -> b i d', attn, v)\n        out = rearrange(out, '(b h) n d -> b n (h d)', h=h)\n        return self.to_out(out)\n\n\nclass SpatialCrossAttention(nn.Module):\n    '''\n    Cross-attention block for image-like data.\n    The image is first reshaped to (b, t, d).\n    Perform self-attention if context is None, else cross-attention.\n    The dims of the input and output of the block are the same (arg query_dim).\n    '''\n    def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0.):\n        super().__init__()\n        inner_dim = dim_head * heads\n        context_dim = default(context_dim, query_dim)\n\n        self.scale = dim_head ** -0.5\n        self.heads = heads\n\n        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)\n        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)\n        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)\n\n        self.to_out = nn.Sequential(\n            nn.Linear(inner_dim, query_dim),\n            nn.Dropout(dropout)\n        )\n\n        self.norm = nn.LayerNorm(query_dim)\n\n    def forward(self, x, context=None):\n        # re-arrange image data to b, t, d.\n        b, c, h, w = x.shape\n        x_in = x\n        x = rearrange(x, 'b c h w -> b (h w) c')\n        # context may be None (self-attention), so guard before inspecting its shape\n        if context is not None and len(context.shape) == 4:\n            context = rearrange(context, 'b c h w -> b (h w) c')\n\n        heads = self.heads\n\n        x = self.norm(x)\n        q = self.to_q(x)\n        context = default(context, x)\n        k = self.to_k(context)\n        v = self.to_v(context)\n\n        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=heads), (q, k, v))\n\n        sim = einsum('b i d, b j d -> b i j', 
q, k) * self.scale\n\n        # attention, what we cannot get enough of\n        attn = sim.softmax(dim=-1)\n\n        out = einsum('b i j, b j d -> b i d', attn, v)\n        out = rearrange(out, '(b h) n d -> b n (h d)', h=heads)\n        out = self.to_out(out)\n\n        # restore image shape\n        out = rearrange(out, 'b (h w) c -> b c h w', h=h, w=w)\n\n        return x_in + out\n\n\ndef posemb_sincos_2d(patches, temperature = 10000, dtype = torch.float32):\n    '''\n    Borrowed from https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/simple_vit.py\n    '''\n    _, dim, h, w, device, dtype = *patches.shape, patches.device, patches.dtype\n\n    y, x = torch.meshgrid(torch.arange(h, device = device), torch.arange(w, device = device), indexing = 'ij')\n    assert (dim % 4) == 0, 'feature dimension must be multiple of 4 for sincos emb'\n    omega = torch.arange(dim // 4, device = device) / (dim // 4 - 1)\n    omega = 1. / (temperature ** omega)\n\n    y = y.flatten()[:, None] * omega[None, :]\n    x = x.flatten()[:, None] * omega[None, :] \n    pe = torch.cat((x.sin(), x.cos(), y.sin(), y.cos()), dim = 1)\n    return pe.type(dtype) # (n, hd)\n\n\nclass SpatialCrossAttentionWithPosEmb(nn.Module):\n    '''\n    Cross-attention block for image-like data.\n    First image reshape to b, t, d.\n    Perform self-attention if context is None, else cross-attention.\n    The dims of the input and output of the block are the same (arg query_dim).\n    '''\n    def __init__(self, in_channels=None, heads=8, dim_head=64, dropout=0.):\n        super().__init__()\n        inner_dim = dim_head * heads\n\n        self.scale = dim_head ** -0.5\n        self.heads = heads\n\n        self.proj_in = nn.Conv2d(in_channels,\n                                 inner_dim,\n                                 kernel_size=1,\n                                 stride=1,\n                                 padding=0)\n        self.to_q = nn.Linear(inner_dim, inner_dim, bias=False)\n     
   self.to_k = nn.Linear(inner_dim, inner_dim, bias=False)\n        self.to_v = nn.Linear(inner_dim, inner_dim, bias=False)\n\n        self.to_out = nn.Sequential(\n            nn.Linear(inner_dim, inner_dim),\n            nn.Dropout(dropout)\n        )\n\n        self.proj_out = zero_module(nn.Conv2d(inner_dim,\n                                              in_channels,\n                                              kernel_size=1,\n                                              stride=1,\n                                              padding=0))\n\n        self.norm = nn.LayerNorm(inner_dim)\n\n    def forward(self, x, context=None):\n        b, c, h, w = x.shape\n        x_in = x\n        context = default(context, x)\n        x = self.proj_in(x) # (b,d,h,w)\n        context = self.proj_in(context) # (b,d,h,w)\n        \n        # positional embedding\n        pe = posemb_sincos_2d(x) # (n,d)\n\n        # re-arrange image data to b, n, d.\n        x = rearrange(x, 'b c h w -> b (h w) c')\n        if (len(context.shape) == 4):\n            context = rearrange(context, 'b c h w -> b (h w) c')\n\n        # add pos emb\n        x += pe\n        if context.shape[1] != x.shape[1]:\n            context[:,:h*w] += pe\n            context[:,h*w:] += pe\n        else:\n            context += pe\n\n        heads = self.heads\n\n        x = self.norm(x)\n        context = self.norm(context)\n\n        q = self.to_q(x)\n        k = self.to_k(context)\n        v = self.to_v(context)\n\n        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=heads), (q, k, v))\n\n        sim = einsum('b i d, b j d -> b i j', q, k) * self.scale\n\n        # attention, what we cannot get enough of\n        attn = sim.softmax(dim=-1)\n\n        out = einsum('b i j, b j d -> b i d', attn, v)\n        out = rearrange(out, '(b h) n d -> b n (h d)', h=heads)\n        out = self.to_out(out)\n\n        # restore image shape\n        out = rearrange(out, 'b (h w) c -> b c h w', h=h, 
w=w)\n\n        return x_in + out\n\n\nclass BasicTransformerBlock(nn.Module):\n    '''\n    Two CrossAttention modules followed by a fully connected layer.\n    The first CrossAttention is applied to x in a self-attention manner.\n    The second CrossAttention is applied to x and context as cross-attention.\n    The fully connected layer has a 4x internal dimension.\n    The dims of the input and output of the block are the same (arg dim).\n    '''\n    def __init__(self, dim, n_heads, d_head, dropout=0., context_dim=None, gated_ff=True, checkpoint=True):\n        super().__init__()\n        self.attn1 = CrossAttention(query_dim=dim, heads=n_heads, dim_head=d_head, dropout=dropout)  # is a self-attention\n        self.ff = FeedForward(dim, dropout=dropout, glu=gated_ff)\n        self.attn2 = CrossAttention(query_dim=dim, context_dim=context_dim,\n                                    heads=n_heads, dim_head=d_head, dropout=dropout)  # is self-attn if context is none\n        self.norm1 = nn.LayerNorm(dim)\n        self.norm2 = nn.LayerNorm(dim)\n        self.norm3 = nn.LayerNorm(dim)\n        self.checkpoint = checkpoint\n\n    def forward(self, x, context=None):\n        return checkpoint(self._forward, (x, context), self.parameters(), self.checkpoint)\n\n    def _forward(self, x, context=None):\n        x = self.attn1(self.norm1(x)) + x # (8,4096,256)\n        x = self.attn2(self.norm2(x), context=context) + x\n        x = self.ff(self.norm3(x)) + x\n        return x\n\n\nclass SpatialTransformer(nn.Module):\n    \"\"\"\n    Transformer block for image-like data.\n    First, project the input (aka embedding) to inner_dim (d) using conv1x1\n    Then reshape to b, t, d.\n    Then apply standard transformer action (BasicTransformerBlock).\n    Finally, reshape to image and pass to output conv1x1 layer, to restore the channel size of input.\n    The dims of the input and output of the block are the same (arg in_channels).\n    \"\"\"\n    def __init__(self, in_channels, 
n_heads, d_head,\n                 depth=1, dropout=0., context_dim=None):\n        super().__init__()\n        self.in_channels = in_channels\n        inner_dim = n_heads * d_head\n        self.norm = Normalize(in_channels)\n\n        self.proj_in = nn.Conv2d(in_channels,\n                                 inner_dim,\n                                 kernel_size=1,\n                                 stride=1,\n                                 padding=0)\n\n        self.transformer_blocks = nn.ModuleList(\n            [BasicTransformerBlock(inner_dim, n_heads, d_head, dropout=dropout, context_dim=context_dim)\n                for d in range(depth)]\n        )\n\n        self.proj_out = zero_module(nn.Conv2d(inner_dim,\n                                              in_channels,\n                                              kernel_size=1,\n                                              stride=1,\n                                              padding=0))\n\n    def forward(self, x, context=None):\n        # note: if no context is given, cross-attention defaults to self-attention\n        b, c, h, w = x.shape\n        # here x and context might have different resolutions\n        # because x is being downsampled while c is not\n        x_in = x\n        x = self.norm(x)\n        x = self.proj_in(x)\n        x = rearrange(x, 'b c h w -> b (h w) c')\n        # context may be None (see note above), so guard before inspecting its shape\n        if context is not None and len(context.shape) == 4:\n            context = rearrange(context, 'b c h w -> b (h w) c')\n        for block in self.transformer_blocks:\n            x = block(x, context=context)\n        x = rearrange(x, 'b (h w) c -> b c h w', h=h, w=w)\n        x = self.proj_out(x)\n        return x + x_in"
  },
  {
    "path": "ldm/modules/diffusionmodules/__init__.py",
    "content": ""
  },
  {
    "path": "ldm/modules/diffusionmodules/model.py",
    "content": "# pytorch_diffusion + derived encoder decoder\nimport math\nimport torch\nimport torch.nn as nn\nimport numpy as np\n\nfrom ldm.modules.attention import LinearAttention, SpatialCrossAttentionWithPosEmb\nfrom ldm.modules.maxvit import SpatialCrossAttentionWithMax, MaxAttentionBlock\n\nfrom cupy_module import dsepconv\n\n\ndef get_timestep_embedding(timesteps, embedding_dim):\n    \"\"\"\n    This matches the implementation in Denoising Diffusion Probabilistic Models:\n    From Fairseq.\n    Build sinusoidal embeddings.\n    This matches the implementation in tensor2tensor, but differs slightly\n    from the description in Section 3.5 of \"Attention Is All You Need\".\n    \"\"\"\n    assert len(timesteps.shape) == 1\n\n    half_dim = embedding_dim // 2\n    emb = math.log(10000) / (half_dim - 1)\n    emb = torch.exp(torch.arange(half_dim, dtype=torch.float32) * -emb)\n    emb = emb.to(device=timesteps.device)\n    emb = timesteps.float()[:, None] * emb[None, :]\n    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1)\n    if embedding_dim % 2 == 1:  # zero pad\n        emb = torch.nn.functional.pad(emb, (0,1,0,0))\n    return emb\n\n\ndef nonlinearity(x):\n    # swish\n    return x*torch.sigmoid(x)\n\n\ndef Normalize(in_channels, num_groups=32):\n    return torch.nn.GroupNorm(num_groups=num_groups, num_channels=in_channels, eps=1e-6, affine=True)\n\n\nclass IdentityWrapper(nn.Module):\n    \"\"\"\n    A wrapper for nn.Identity that allows additional input.\n    \"\"\"\n    def __init__(self) -> None:\n        super().__init__()\n        self.layer = nn.Identity()\n\n    def forward(self, x, context=None):\n        return self.layer(x)\n\n\n\nclass Upsample(nn.Module):\n    def __init__(self, in_channels, with_conv):\n        super().__init__()\n        self.with_conv = with_conv\n        if self.with_conv:\n            self.conv = torch.nn.Conv2d(in_channels,\n                                        in_channels,\n                                 
       kernel_size=3,\n                                        stride=1,\n                                        padding=1)\n\n    def forward(self, x):\n        x = torch.nn.functional.interpolate(x, scale_factor=2.0, mode=\"nearest\")\n        if self.with_conv:\n            x = self.conv(x)\n        return x\n\n\nclass Downsample(nn.Module):\n    def __init__(self, in_channels, with_conv):\n        super().__init__()\n        self.with_conv = with_conv\n        if self.with_conv:\n            # no asymmetric padding in torch conv, must do it ourselves\n            self.conv = torch.nn.Conv2d(in_channels,\n                                        in_channels,\n                                        kernel_size=3,\n                                        stride=2,\n                                        padding=0)\n\n    def forward(self, x):\n        if self.with_conv:\n            pad = (0,1,0,1)\n            x = torch.nn.functional.pad(x, pad, mode=\"constant\", value=0)\n            x = self.conv(x)\n        else:\n            x = torch.nn.functional.avg_pool2d(x, kernel_size=2, stride=2)\n        return x\n\n\nclass ResnetBlock(nn.Module):\n    def __init__(self, *, in_channels, out_channels=None, conv_shortcut=False,\n                 dropout, temb_channels=512):\n        super().__init__()\n        self.in_channels = in_channels\n        out_channels = in_channels if out_channels is None else out_channels\n        self.out_channels = out_channels\n        self.use_conv_shortcut = conv_shortcut\n\n        self.norm1 = Normalize(in_channels)\n        self.conv1 = torch.nn.Conv2d(in_channels,\n                                     out_channels,\n                                     kernel_size=3,\n                                     stride=1,\n                                     padding=1)\n        if temb_channels > 0:\n            self.temb_proj = torch.nn.Linear(temb_channels,\n                                             out_channels)\n        
self.norm2 = Normalize(out_channels)\n        self.dropout = torch.nn.Dropout(dropout)\n        self.conv2 = torch.nn.Conv2d(out_channels,\n                                     out_channels,\n                                     kernel_size=3,\n                                     stride=1,\n                                     padding=1)\n        if self.in_channels != self.out_channels:\n            if self.use_conv_shortcut:\n                self.conv_shortcut = torch.nn.Conv2d(in_channels,\n                                                     out_channels,\n                                                     kernel_size=3,\n                                                     stride=1,\n                                                     padding=1)\n            else:\n                self.nin_shortcut = torch.nn.Conv2d(in_channels,\n                                                    out_channels,\n                                                    kernel_size=1,\n                                                    stride=1,\n                                                    padding=0)\n\n    def forward(self, x, temb):\n        h = x\n        h = self.norm1(h)\n        h = nonlinearity(h)\n        h = self.conv1(h)\n\n        if temb is not None:\n            h = h + self.temb_proj(nonlinearity(temb))[:,:,None,None]\n\n        h = self.norm2(h)\n        h = nonlinearity(h)\n        h = self.dropout(h)\n        h = self.conv2(h)\n\n        if self.in_channels != self.out_channels:\n            if self.use_conv_shortcut:\n                x = self.conv_shortcut(x)\n            else:\n                x = self.nin_shortcut(x)\n\n        return x+h\n\n\nclass LinAttnBlock(LinearAttention):\n    \"\"\"to match AttnBlock usage\"\"\"\n    def __init__(self, in_channels):\n        super().__init__(dim=in_channels, heads=1, dim_head=in_channels)\n\n\nclass AttnBlock(nn.Module):\n    def __init__(self, in_channels):\n        super().__init__()\n        
self.in_channels = in_channels\n\n        self.norm = Normalize(in_channels)\n        self.q = torch.nn.Conv2d(in_channels,\n                                 in_channels,\n                                 kernel_size=1,\n                                 stride=1,\n                                 padding=0)\n        self.k = torch.nn.Conv2d(in_channels,\n                                 in_channels,\n                                 kernel_size=1,\n                                 stride=1,\n                                 padding=0)\n        self.v = torch.nn.Conv2d(in_channels,\n                                 in_channels,\n                                 kernel_size=1,\n                                 stride=1,\n                                 padding=0)\n        self.proj_out = torch.nn.Conv2d(in_channels,\n                                        in_channels,\n                                        kernel_size=1,\n                                        stride=1,\n                                        padding=0)\n\n\n    def forward(self, x):\n        h_ = x\n        h_ = self.norm(h_)\n        q = self.q(h_)\n        k = self.k(h_)\n        v = self.v(h_)\n\n        # compute attention\n        b,c,h,w = q.shape\n        q = q.reshape(b,c,h*w)\n        q = q.permute(0,2,1)   # b,hw,c\n        k = k.reshape(b,c,h*w) # b,c,hw\n        w_ = torch.bmm(q,k)     # b,hw,hw    w[b,i,j]=sum_c q[b,i,c]k[b,c,j]\n        w_ = w_ * (int(c)**(-0.5))\n        w_ = torch.nn.functional.softmax(w_, dim=2)\n\n        # attend to values\n        v = v.reshape(b,c,h*w)\n        w_ = w_.permute(0,2,1)   # b,hw,hw (first hw of k, second of q)\n        h_ = torch.bmm(v,w_)     # b, c,hw (hw of q) h_[b,c,j] = sum_i v[b,c,i] w_[b,i,j]\n        h_ = h_.reshape(b,c,h,w)\n\n        h_ = self.proj_out(h_)\n\n        return x+h_\n\n\ndef make_attn(in_channels, attn_type=\"vanilla\"):\n    assert attn_type in [\"vanilla\", \"linear\", \"none\", 'max'], f'attn_type {attn_type} 
unknown'\n    print(f\"making attention of type '{attn_type}' with {in_channels} in_channels\")\n    if attn_type == \"vanilla\":\n        return AttnBlock(in_channels)\n    elif attn_type == \"none\":\n        return nn.Identity(in_channels)\n    elif attn_type == 'max':\n        return MaxAttentionBlock(in_channels, heads=1, dim_head=in_channels)\n    else:\n        return LinAttnBlock(in_channels)\n\n\n\n\nclass FIEncoder(nn.Module):\n    def __init__(self, *, ch, out_ch, ch_mult=(1,2,4,8), num_res_blocks,\n                 attn_resolutions, dropout=0.0, resamp_with_conv=True, in_channels,\n                 resolution, z_channels, double_z=True, use_linear_attn=False, attn_type=\"vanilla\",\n                 **ignore_kwargs):\n        super().__init__()\n        if use_linear_attn: attn_type = \"linear\"\n        self.ch = ch # 128\n        self.temb_ch = 0\n        self.num_resolutions = len(ch_mult) # 3\n        self.num_res_blocks = num_res_blocks # 2\n        self.resolution = resolution # 256\n        self.in_channels = in_channels # 3\n\n        # downsampling\n        self.conv_in = torch.nn.Conv2d(in_channels,\n                                       self.ch,\n                                       kernel_size=3,\n                                       stride=1,\n                                       padding=1)\n\n        curr_res = resolution # 256\n        in_ch_mult = (1,)+tuple(ch_mult) # (1,1,2,4)\n        self.in_ch_mult = in_ch_mult # (1,1,2,4)\n        self.down = nn.ModuleList()\n        for i_level in range(self.num_resolutions):\n            block = nn.ModuleList()\n            attn = nn.ModuleList()\n            block_in = int(ch*in_ch_mult[i_level])\n            block_out = int(ch*ch_mult[i_level])\n            for i_block in range(self.num_res_blocks):\n                block.append(ResnetBlock(in_channels=block_in,\n                                         out_channels=block_out,\n                                         
temb_channels=self.temb_ch,\n                                         dropout=dropout))\n                block_in = block_out\n                if curr_res in attn_resolutions:\n                    attn.append(make_attn(block_in, attn_type=attn_type))\n            down = nn.Module()\n            down.block = block\n            down.attn = attn\n            # if i_level != self.num_resolutions-1:\n            down.downsample = Downsample(block_in, resamp_with_conv)\n            curr_res = curr_res // 2\n            self.down.append(down)\n\n        # middle\n        self.mid = nn.Module()\n        self.mid.block_1 = ResnetBlock(in_channels=block_in,\n                                       out_channels=block_in,\n                                       temb_channels=self.temb_ch,\n                                       dropout=dropout)\n        self.mid.attn_1 = make_attn(block_in, attn_type=attn_type)\n        self.mid.block_2 = ResnetBlock(in_channels=block_in,\n                                       out_channels=block_in,\n                                       temb_channels=self.temb_ch,\n                                       dropout=dropout)\n\n        # end\n        self.norm_out = Normalize(block_in)\n        self.conv_out = torch.nn.Conv2d(block_in,\n                                        2*z_channels if double_z else z_channels, # 3\n                                        kernel_size=3,\n                                        stride=1,\n                                        padding=1)\n\n    def forward(self, x, ret_feature=False):\n        # timestep embedding\n        temb = None\n\n        # downsampling\n        hs = [self.conv_in(x)]\n        phi_list = []\n        for i_level in range(self.num_resolutions):\n            for i_block in range(self.num_res_blocks):\n                h = self.down[i_level].block[i_block](hs[-1], temb)\n                if len(self.down[i_level].attn) > 0:\n                    h = self.down[i_level].attn[i_block](h)\n     
           hs.append(h)\n            # if i_level != self.num_resolutions-1:\n            hs.append(self.down[i_level].downsample(hs[-1]))\n            phi_list.append(hs[-1])\n\n        # middle\n        h = hs[-1]\n        h = self.mid.block_1(h, temb)\n        h = self.mid.attn_1(h)\n        h = self.mid.block_2(h, temb)\n\n        # end\n        h = self.norm_out(h)\n        h = nonlinearity(h)\n        h = self.conv_out(h)\n\n        if ret_feature:\n            return h, phi_list\n        return h\n\n\nclass FlowEncoder(FIEncoder):\n    def __init__(self, *, ch, out_ch, ch_mult=(1, 2, 4, 8), num_res_blocks, attn_resolutions, dropout=0, resamp_with_conv=True, in_channels, resolution, z_channels, double_z=True, use_linear_attn=False, attn_type=\"vanilla\", **ignore_kwargs):\n        super().__init__(\n            ch=ch, \n            out_ch=out_ch,\n            ch_mult=ch_mult, \n            num_res_blocks=num_res_blocks, \n            attn_resolutions=attn_resolutions, \n            dropout=dropout, \n            resamp_with_conv=resamp_with_conv, \n            in_channels=in_channels, \n            resolution=resolution, \n            z_channels=z_channels, \n            double_z=double_z, \n            use_linear_attn=use_linear_attn, \n            attn_type=attn_type, \n            **ignore_kwargs\n        )\n\n\n\nclass FlowDecoderWithResidual(nn.Module):\n    def __init__(self, *, ch, out_ch, ch_mult=(1,2,4,8), num_res_blocks,\n                 attn_resolutions, dropout=0.0, resamp_with_conv=True, in_channels,\n                 resolution, z_channels, give_pre_end=False, tanh_out=False, use_linear_attn=False,\n                 attn_type=\"vanilla\", num_head_channels=32, num_heads=1, cond_type=None,\n                 **ignorekwargs):\n        super().__init__()\n\n        def KernelHead(c_in):\n            return torch.nn.Sequential(\n                    torch.nn.Conv2d(in_channels=c_in, out_channels=64, kernel_size=3, stride=1, padding=1),\n              
      torch.nn.ReLU(inplace=False),\n                    torch.nn.Conv2d(in_channels=64, out_channels=32, kernel_size=3, stride=1, padding=1),\n                    torch.nn.ReLU(inplace=False),\n                    torch.nn.Conv2d(in_channels=32, out_channels=5, kernel_size=3, stride=1, padding=1),\n                    torch.nn.ReLU(inplace=False),\n                    # torch.nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True),\n                    torch.nn.Conv2d(in_channels=5, out_channels=5, kernel_size=3, stride=1, padding=1)\n                )\n        # end\n\n        def OffsetHead(c_in):\n            return torch.nn.Sequential(\n                    torch.nn.Conv2d(in_channels=c_in, out_channels=64, kernel_size=3, stride=1, padding=1),\n                    torch.nn.ReLU(inplace=False),\n                    torch.nn.Conv2d(in_channels=64, out_channels=32, kernel_size=3, stride=1, padding=1),\n                    torch.nn.ReLU(inplace=False),\n                    torch.nn.Conv2d(in_channels=32, out_channels=5 ** 2, kernel_size=3, stride=1, padding=1),\n                    torch.nn.ReLU(inplace=False),\n                    # torch.nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True),\n                    torch.nn.Conv2d(in_channels=5 ** 2, out_channels=5 ** 2, kernel_size=3, stride=1, padding=1)\n                )\n\n\n        def MaskHead(c_in):\n            return torch.nn.Sequential(\n                    torch.nn.Conv2d(in_channels=c_in, out_channels=64, kernel_size=3, stride=1, padding=1),\n                    torch.nn.ReLU(inplace=False),\n                    torch.nn.Conv2d(in_channels=64, out_channels=32, kernel_size=3, stride=1, padding=1),\n                    torch.nn.ReLU(inplace=False),\n                    torch.nn.Conv2d(in_channels=32, out_channels=5 ** 2, kernel_size=3, stride=1, padding=1),\n                    torch.nn.ReLU(inplace=False),\n                    # torch.nn.Upsample(scale_factor=2, mode='bilinear', 
align_corners=True),\n                    torch.nn.Conv2d(in_channels=5 ** 2, out_channels=5 ** 2, kernel_size=3,\n                             stride=1, padding=1),\n                    torch.nn.Sigmoid()\n                )\n\n        def ResidualHead(c_in):\n            return torch.nn.Sequential(\n                    torch.nn.Conv2d(in_channels=c_in, out_channels=64, kernel_size=3, stride=1, padding=1),\n                    torch.nn.ReLU(inplace=False),\n                    torch.nn.Conv2d(in_channels=64, out_channels=32, kernel_size=3, stride=1, padding=1),\n                    torch.nn.ReLU(inplace=False),\n                    torch.nn.Conv2d(in_channels=32, out_channels=3, kernel_size=3, stride=1, padding=1),\n                    torch.nn.ReLU(inplace=False),\n                    # torch.nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True),\n                    torch.nn.Conv2d(in_channels=3, out_channels=3, kernel_size=3, stride=1, padding=1)\n                )\n        \n\n        self.ch = ch # 128\n        self.temb_ch = 0\n        self.num_resolutions = len(ch_mult) # 3\n        self.num_res_blocks = num_res_blocks # 2\n        self.resolution = resolution # 256\n        self.in_channels = in_channels # 3\n        self.give_pre_end = give_pre_end # False\n        self.tanh_out = tanh_out # False\n\n        # compute in_ch_mult, block_in and curr_res at lowest res\n        in_ch_mult = (1,)+tuple(ch_mult) # (1,1,2,4)\n        block_in = int(ch*ch_mult[self.num_resolutions-1]) # 512\n        curr_res = resolution // 2**(self.num_resolutions-1) # 64\n        self.z_shape = (1,z_channels,curr_res,curr_res) # (1,3,64,64)\n        print(\"Working with z of shape {} = {} dimensions.\".format(\n            self.z_shape, np.prod(self.z_shape)))\n\n        # z to block_in\n        self.conv_in = torch.nn.Conv2d(z_channels,\n                                       block_in,\n                                       kernel_size=3,\n                           
            stride=1,\n                                       padding=1)\n\n        # middle\n        self.mid = nn.Module()\n        self.mid.block_1 = ResnetBlock(in_channels=block_in,\n                                       out_channels=block_in,\n                                       temb_channels=self.temb_ch,\n                                       dropout=dropout)\n        self.mid.attn_1 = make_attn(block_in, attn_type=attn_type)\n        self.mid.block_2 = ResnetBlock(in_channels=block_in,\n                                       out_channels=block_in,\n                                       temb_channels=self.temb_ch,\n                                       dropout=dropout)\n\n        # upsampling\n        self.up = nn.ModuleList()\n        for i_level in reversed(range(self.num_resolutions)): # 2,1,0\n            block = nn.ModuleList()\n            attn = nn.ModuleList()\n            block_out = int(ch*ch_mult[i_level])\n            # ResBlocks\n            for i_block in range(self.num_res_blocks):\n                block.append(ResnetBlock(in_channels=block_in,\n                                         out_channels=block_out,\n                                         temb_channels=self.temb_ch,\n                                         dropout=dropout))\n                block_in = block_out\n                if curr_res in attn_resolutions:\n                    attn.append(make_attn(block_in, attn_type=attn_type))\n\n            # CrossAttention\n            if num_head_channels == -1:\n                dim_head = block_in // num_heads\n            else:\n                num_heads = block_in // num_head_channels\n                dim_head = num_head_channels # 32\n            if cond_type == 'cross_attn':\n                cross_attn = SpatialCrossAttentionWithPosEmb(in_channels=block_in, \n                                                             heads=num_heads,\n                                                             dim_head=dim_head)\n         
   elif cond_type == 'max_cross_attn':\n                cross_attn = SpatialCrossAttentionWithMax(in_channels=block_in,\n                                                          heads=num_heads,\n                                                          dim_head=dim_head)\n            elif cond_type == 'max_cross_attn_frame':\n                cross_attn = SpatialCrossAttentionWithMax(in_channels=block_in,\n                                                          heads=num_heads,\n                                                          dim_head=dim_head,\n                                                          ctx_dim=6)\n            else:\n                cross_attn = IdentityWrapper()\n\n            up = nn.Module()\n            up.block = block\n            up.attn = attn\n            up.cross_attn = cross_attn\n\n            # Upsample\n            # if i_level != self.num_resolutions-1: ## THIS IS ORIGINAL CODE\n            # if i_level != 0:\n            up.upsample = Upsample(block_in, resamp_with_conv)\n            curr_res = curr_res * 2\n            self.up.insert(0, up) # prepend to get consistent order\n\n        # end\n        self.norm_out = Normalize(block_in)\n        self.conv_out = torch.nn.Conv2d(block_in,\n                                        block_in,\n                                        kernel_size=3,\n                                        stride=1,\n                                        padding=1)\n        self.moduleAlpha1 = OffsetHead(c_in=block_in)\n        self.moduleAlpha2 = OffsetHead(c_in=block_in)\n        self.moduleBeta1 = OffsetHead(c_in=block_in)\n        self.moduleBeta2 = OffsetHead(c_in=block_in)\n        self.moduleKernelHorizontal1 = KernelHead(c_in=block_in)\n        self.moduleKernelHorizontal2 = KernelHead(c_in=block_in)\n        self.moduleKernelVertical1 = KernelHead(c_in=block_in)\n        self.moduleKernelVertical2 = KernelHead(c_in=block_in)\n        self.moduleMask = MaskHead(c_in=block_in)\n        
self.moduleResidual = ResidualHead(c_in=block_in)\n        self.modulePad = torch.nn.ReplicationPad2d([2, 2, 2, 2])\n\n    def forward(self, z, cond_dict):\n        phi_prev_list = cond_dict['phi_prev_list']\n        phi_next_list = cond_dict['phi_next_list']\n        frame_prev = cond_dict['frame_prev']\n        frame_next = cond_dict['frame_next']\n\n        #assert z.shape[1:] == self.z_shape[1:]\n        self.last_z_shape = z.shape\n\n        # timestep embedding\n        temb = None\n\n        # z to block_in\n        h = self.conv_in(z)\n\n        # middle\n        h = self.mid.block_1(h, temb)\n        h = self.mid.attn_1(h)\n        h = self.mid.block_2(h, temb)\n\n        # upsampling\n        for i_level in reversed(range(self.num_resolutions)): # [2,1,0]\n            for i_block in range(self.num_res_blocks):\n                h = self.up[i_level].block[i_block](h, temb)\n                if len(self.up[i_level].attn) > 0:\n                    h = self.up[i_level].attn[i_block](h)\n            ctx = None\n            if phi_prev_list[i_level] is not None:\n                ctx = torch.cat([phi_prev_list[i_level], phi_next_list[i_level]], dim=1)\n            h = self.up[i_level].cross_attn(h, ctx)\n            # if i_level != self.num_resolutions-1:\n            # if i_level != 0:\n            h = self.up[i_level].upsample(h)\n\n        # end\n        if self.give_pre_end:\n            return h\n\n        h = self.norm_out(h)\n        h = nonlinearity(h)\n        h = self.conv_out(h)\n        alpha1 = self.moduleAlpha1(h)\n        alpha2 = self.moduleAlpha2(h)\n        beta1 = self.moduleBeta1(h)\n        beta2 = self.moduleBeta2(h)\n        v1 = self.moduleKernelVertical1(h)\n        v2 = self.moduleKernelVertical2(h)\n        h1 = self.moduleKernelHorizontal1(h)\n        h2 = self.moduleKernelHorizontal2(h)\n        mask1 = self.moduleMask(h)\n        mask2 = 1.0 - mask1\n        warped1 = dsepconv.FunctionDSepconv(self.modulePad(frame_prev), v1, h1, 
alpha1, beta1, mask1)\n        warped2 = dsepconv.FunctionDSepconv(self.modulePad(frame_next), v2, h2, alpha2, beta2, mask2)\n        warped = warped1 + warped2\n        out = warped + self.moduleResidual(h)\n        return out"
  },
  {
    "path": "ldm/modules/diffusionmodules/openaimodel.py",
    "content": "from abc import abstractmethod\nfrom functools import partial\nimport math\nfrom typing import Iterable\n\nimport numpy as np\nimport torch as th\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom ldm.modules.maxvit import SpatialTransformerWithMax, MaxAttentionBlock\n\nfrom ldm.modules.diffusionmodules.util import (\n    checkpoint,\n    conv_nd,\n    linear,\n    avg_pool_nd,\n    zero_module,\n    normalization,\n    timestep_embedding,\n)\nfrom ldm.modules.attention import SpatialTransformer\n\n\n# dummy replace\ndef convert_module_to_f16(x):\n    pass\n\ndef convert_module_to_f32(x):\n    pass\n\n\n## go\nclass AttentionPool2d(nn.Module):\n    \"\"\"\n    Adapted from CLIP: https://github.com/openai/CLIP/blob/main/clip/model.py\n    \"\"\"\n\n    def __init__(\n        self,\n        spacial_dim: int,\n        embed_dim: int,\n        num_heads_channels: int,\n        output_dim: int = None,\n    ):\n        super().__init__()\n        self.positional_embedding = nn.Parameter(th.randn(embed_dim, spacial_dim ** 2 + 1) / embed_dim ** 0.5)\n        self.qkv_proj = conv_nd(1, embed_dim, 3 * embed_dim, 1)\n        self.c_proj = conv_nd(1, embed_dim, output_dim or embed_dim, 1)\n        self.num_heads = embed_dim // num_heads_channels\n        self.attention = QKVAttention(self.num_heads)\n\n    def forward(self, x):\n        b, c, *_spatial = x.shape\n        x = x.reshape(b, c, -1)  # NC(HW)\n        x = th.cat([x.mean(dim=-1, keepdim=True), x], dim=-1)  # NC(HW+1)\n        x = x + self.positional_embedding[None, :, :].to(x.dtype)  # NC(HW+1)\n        x = self.qkv_proj(x)\n        x = self.attention(x)\n        x = self.c_proj(x)\n        return x[:, :, 0]\n\n\nclass TimestepBlock(nn.Module):\n    \"\"\"\n    Any module where forward() takes timestep embeddings as a second argument.\n    \"\"\"\n\n    @abstractmethod\n    def forward(self, x, emb):\n        \"\"\"\n        Apply the module to `x` given `emb` timestep embeddings.\n      
  \"\"\"\n\n\nclass TimestepEmbedSequential(nn.Sequential, TimestepBlock):\n    \"\"\"\n    A sequential module that passes timestep embeddings to the children that\n    support it as an extra input.\n    \"\"\"\n\n    def forward(self, x, emb, context=None):\n        for layer in self:\n            if isinstance(layer, TimestepBlock):\n                x = layer(x, emb)\n            elif isinstance(layer, SpatialTransformer) or isinstance(layer, SpatialTransformerWithMax):\n                x = layer(x, context)\n            else:\n                x = layer(x)\n        return x\n\n\nclass Upsample(nn.Module):\n    \"\"\"\n    An upsampling layer with an optional convolution.\n    :param channels: channels in the inputs and outputs.\n    :param use_conv: a bool determining if a convolution is applied.\n    :param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then\n                 upsampling occurs in the inner-two dimensions.\n    \"\"\"\n\n    def __init__(self, channels, use_conv, dims=2, out_channels=None, padding=1):\n        super().__init__()\n        self.channels = channels\n        self.out_channels = out_channels or channels\n        self.use_conv = use_conv\n        self.dims = dims\n        if use_conv:\n            self.conv = conv_nd(dims, self.channels, self.out_channels, 3, padding=padding)\n\n    def forward(self, x):\n        assert x.shape[1] == self.channels\n        if self.dims == 3:\n            x = F.interpolate(\n                x, (x.shape[2], x.shape[3] * 2, x.shape[4] * 2), mode=\"nearest\"\n            )\n        else:\n            x = F.interpolate(x, scale_factor=2, mode=\"nearest\")\n        if self.use_conv:\n            x = self.conv(x)\n        return x\n\nclass TransposedUpsample(nn.Module):\n    'Learned 2x upsampling without padding'\n    def __init__(self, channels, out_channels=None, ks=5):\n        super().__init__()\n        self.channels = channels\n        self.out_channels = out_channels or channels\n\n        
self.up = nn.ConvTranspose2d(self.channels,self.out_channels,kernel_size=ks,stride=2)\n\n    def forward(self,x):\n        return self.up(x)\n\n\nclass Downsample(nn.Module):\n    \"\"\"\n    A downsampling layer with an optional convolution.\n    :param channels: channels in the inputs and outputs.\n    :param use_conv: a bool determining if a convolution is applied.\n    :param dims: determines if the signal is 1D, 2D, or 3D. If 3D, then\n                 downsampling occurs in the inner-two dimensions.\n    \"\"\"\n\n    def __init__(self, channels, use_conv, dims=2, out_channels=None,padding=1):\n        super().__init__()\n        self.channels = channels\n        self.out_channels = out_channels or channels\n        self.use_conv = use_conv\n        self.dims = dims\n        stride = 2 if dims != 3 else (1, 2, 2)\n        if use_conv:\n            self.op = conv_nd(\n                dims, self.channels, self.out_channels, 3, stride=stride, padding=padding\n            )\n        else:\n            assert self.channels == self.out_channels\n            self.op = avg_pool_nd(dims, kernel_size=stride, stride=stride)\n\n    def forward(self, x):\n        assert x.shape[1] == self.channels\n        return self.op(x)\n\n\nclass ResBlock(TimestepBlock):\n    \"\"\"\n    A residual block that can optionally change the number of channels.\n    :param channels: the number of input channels.\n    :param emb_channels: the number of timestep embedding channels.\n    :param dropout: the rate of dropout.\n    :param out_channels: if specified, the number of out channels.\n    :param use_conv: if True and out_channels is specified, use a spatial\n        convolution instead of a smaller 1x1 convolution to change the\n        channels in the skip connection.\n    :param dims: determines if the signal is 1D, 2D, or 3D.\n    :param use_checkpoint: if True, use gradient checkpointing on this module.\n    :param up: if True, use this block for upsampling.\n    :param down: if 
True, use this block for downsampling.\n    \"\"\"\n\n    def __init__(\n        self,\n        channels,\n        emb_channels,\n        dropout,\n        out_channels=None,\n        use_conv=False,\n        use_scale_shift_norm=False,\n        dims=2,\n        use_checkpoint=False,\n        up=False,\n        down=False,\n    ):\n        super().__init__()\n        self.channels = channels # 256\n        self.emb_channels = emb_channels # 1024\n        self.dropout = dropout # 0\n        self.out_channels = out_channels or channels # 256\n        self.use_conv = use_conv # False\n        self.use_checkpoint = use_checkpoint # False\n        self.use_scale_shift_norm = use_scale_shift_norm # False\n\n        self.in_layers = nn.Sequential(\n            normalization(channels),\n            nn.SiLU(),\n            conv_nd(dims, channels, self.out_channels, 3, padding=1),\n        )\n\n        self.updown = up or down\n\n        if up:\n            self.h_upd = Upsample(channels, False, dims)\n            self.x_upd = Upsample(channels, False, dims)\n        elif down:\n            self.h_upd = Downsample(channels, False, dims)\n            self.x_upd = Downsample(channels, False, dims)\n        else:\n            self.h_upd = self.x_upd = nn.Identity()\n\n        self.emb_layers = nn.Sequential(\n            nn.SiLU(),\n            linear(\n                emb_channels,\n                2 * self.out_channels if use_scale_shift_norm else self.out_channels,\n            ),\n        )\n        self.out_layers = nn.Sequential(\n            normalization(self.out_channels),\n            nn.SiLU(),\n            nn.Dropout(p=dropout),\n            zero_module(\n                conv_nd(dims, self.out_channels, self.out_channels, 3, padding=1)\n            ),\n        )\n\n        if self.out_channels == channels:\n            self.skip_connection = nn.Identity()\n        elif use_conv:\n            self.skip_connection = conv_nd(\n                dims, channels, 
self.out_channels, 3, padding=1\n            )\n        else:\n            self.skip_connection = conv_nd(dims, channels, self.out_channels, 1)\n\n    def forward(self, x, emb):\n        \"\"\"\n        Apply the block to a Tensor, conditioned on a timestep embedding.\n        :param x: an [N x C x ...] Tensor of features.\n        :param emb: an [N x emb_channels] Tensor of timestep embeddings.\n        :return: an [N x C x ...] Tensor of outputs.\n        \"\"\"\n        return checkpoint(\n            self._forward, (x, emb), self.parameters(), self.use_checkpoint\n        )\n\n\n    def _forward(self, x, emb):\n        if self.updown:\n            in_rest, in_conv = self.in_layers[:-1], self.in_layers[-1]\n            h = in_rest(x)\n            h = self.h_upd(h)\n            x = self.x_upd(x)\n            h = in_conv(h)\n        else:\n            h = self.in_layers(x)\n        emb_out = self.emb_layers(emb).type(h.dtype)\n        while len(emb_out.shape) < len(h.shape):\n            emb_out = emb_out[..., None]\n        if self.use_scale_shift_norm:\n            out_norm, out_rest = self.out_layers[0], self.out_layers[1:]\n            scale, shift = th.chunk(emb_out, 2, dim=1)\n            h = out_norm(h) * (1 + scale) + shift\n            h = out_rest(h)\n        else:\n            h = h + emb_out\n            h = self.out_layers(h)\n        return self.skip_connection(x) + h\n\n\nclass AttentionBlock(nn.Module):\n    \"\"\"\n    An attention block that allows spatial positions to attend to each other.\n    Originally ported from here, but adapted to the N-d case.\n    https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/models/unet.py#L66.\n    \"\"\"\n\n    def __init__(\n        self,\n        channels,\n        num_heads=1,\n        num_head_channels=-1,\n        use_checkpoint=False,\n        use_new_attention_order=False,\n    ):\n        super().__init__()\n        self.channels = channels\n        if 
num_head_channels == -1:\n            self.num_heads = num_heads\n        else:\n            assert (\n                channels % num_head_channels == 0\n            ), f\"q,k,v channels {channels} is not divisible by num_head_channels {num_head_channels}\"\n            self.num_heads = channels // num_head_channels\n        self.use_checkpoint = use_checkpoint\n        self.norm = normalization(channels)\n        self.qkv = conv_nd(1, channels, channels * 3, 1)\n        if use_new_attention_order:\n            # split qkv before split heads\n            self.attention = QKVAttention(self.num_heads)\n        else:\n            # split heads before split qkv\n            self.attention = QKVAttentionLegacy(self.num_heads)\n\n        self.proj_out = zero_module(conv_nd(1, channels, channels, 1))\n\n    def forward(self, x):\n        return checkpoint(self._forward, (x,), self.parameters(), True)   # TODO: check checkpoint usage, is True # TODO: fix the .half call!!!\n        #return pt_checkpoint(self._forward, x)  # pytorch\n\n    def _forward(self, x):\n        b, c, *spatial = x.shape\n        x = x.reshape(b, c, -1)\n        qkv = self.qkv(self.norm(x))\n        h = self.attention(qkv)\n        h = self.proj_out(h)\n        return (x + h).reshape(b, c, *spatial)\n\n\ndef count_flops_attn(model, _x, y):\n    \"\"\"\n    A counter for the `thop` package to count the operations in an\n    attention operation.\n    Meant to be used like:\n        macs, params = thop.profile(\n            model,\n            inputs=(inputs, timestamps),\n            custom_ops={QKVAttention: QKVAttention.count_flops},\n        )\n    \"\"\"\n    b, c, *spatial = y[0].shape\n    num_spatial = int(np.prod(spatial))\n    # We perform two matmuls with the same number of ops.\n    # The first computes the weight matrix, the second computes\n    # the combination of the value vectors.\n    matmul_ops = 2 * b * (num_spatial ** 2) * c\n    model.total_ops += 
th.DoubleTensor([matmul_ops])\n\n\nclass QKVAttentionLegacy(nn.Module):\n    \"\"\"\n    A module which performs QKV attention. Matches legacy QKVAttention + input/output heads shaping\n    \"\"\"\n\n    def __init__(self, n_heads):\n        super().__init__()\n        self.n_heads = n_heads\n\n    def forward(self, qkv):\n        \"\"\"\n        Apply QKV attention.\n        :param qkv: an [N x (H * 3 * C) x T] tensor of Qs, Ks, and Vs.\n        :return: an [N x (H * C) x T] tensor after attention.\n        \"\"\"\n        bs, width, length = qkv.shape\n        assert width % (3 * self.n_heads) == 0\n        ch = width // (3 * self.n_heads)\n        q, k, v = qkv.reshape(bs * self.n_heads, ch * 3, length).split(ch, dim=1)\n        scale = 1 / math.sqrt(math.sqrt(ch))\n        weight = th.einsum(\n            \"bct,bcs->bts\", q * scale, k * scale\n        )  # More stable with f16 than dividing afterwards\n        weight = th.softmax(weight.float(), dim=-1).type(weight.dtype)\n        a = th.einsum(\"bts,bcs->bct\", weight, v)\n        return a.reshape(bs, -1, length)\n\n    @staticmethod\n    def count_flops(model, _x, y):\n        return count_flops_attn(model, _x, y)\n\n\nclass QKVAttention(nn.Module):\n    \"\"\"\n    A module which performs QKV attention and splits in a different order.\n    \"\"\"\n\n    def __init__(self, n_heads):\n        super().__init__()\n        self.n_heads = n_heads\n\n    def forward(self, qkv):\n        \"\"\"\n        Apply QKV attention.\n        :param qkv: an [N x (3 * H * C) x T] tensor of Qs, Ks, and Vs.\n        :return: an [N x (H * C) x T] tensor after attention.\n        \"\"\"\n        bs, width, length = qkv.shape\n        assert width % (3 * self.n_heads) == 0\n        ch = width // (3 * self.n_heads)\n        q, k, v = qkv.chunk(3, dim=1)\n        scale = 1 / math.sqrt(math.sqrt(ch))\n        weight = th.einsum(\n            \"bct,bcs->bts\",\n            (q * scale).view(bs * self.n_heads, ch, length),\n            
(k * scale).view(bs * self.n_heads, ch, length),\n        )  # More stable with f16 than dividing afterwards\n        weight = th.softmax(weight.float(), dim=-1).type(weight.dtype)\n        a = th.einsum(\"bts,bcs->bct\", weight, v.reshape(bs * self.n_heads, ch, length))\n        return a.reshape(bs, -1, length)\n\n    @staticmethod\n    def count_flops(model, _x, y):\n        return count_flops_attn(model, _x, y)\n\n\nclass UNetModel(nn.Module):\n    \"\"\"\n    The full UNet model with attention and timestep embedding.\n    :param in_channels: channels in the input Tensor.\n    :param model_channels: base channel count for the model.\n    :param out_channels: channels in the output Tensor.\n    :param num_res_blocks: number of residual blocks per downsample.\n    :param attention_resolutions: a collection of downsample rates at which\n        attention will take place. May be a set, list, or tuple.\n        For example, if this contains 4, then at 4x downsampling, attention\n        will be used.\n    :param dropout: the dropout probability.\n    :param channel_mult: channel multiplier for each level of the UNet.\n    :param conv_resample: if True, use learned convolutions for upsampling and\n        downsampling.\n    :param dims: determines if the signal is 1D, 2D, or 3D.\n    :param num_classes: if specified (as an int), then this model will be\n        class-conditional with `num_classes` classes.\n    :param use_checkpoint: use gradient checkpointing to reduce memory usage.\n    :param num_heads: the number of attention heads in each attention layer.\n    :param num_head_channels: if specified, ignore num_heads and instead use\n                               a fixed channel width per attention head.\n    :param num_heads_upsample: works with num_heads to set a different number\n                               of heads for upsampling. 
Deprecated.\n    :param use_scale_shift_norm: use a FiLM-like conditioning mechanism.\n    :param resblock_updown: use residual blocks for up/downsampling.\n    :param use_new_attention_order: use a different attention pattern for potentially\n                                    increased efficiency.\n    \"\"\"\n\n    def __init__(\n        self,\n        image_size,\n        in_channels,\n        model_channels,\n        out_channels,\n        num_res_blocks,\n        attention_resolutions,\n        dropout=0,\n        channel_mult=(1, 2, 4, 8),\n        conv_resample=True,\n        dims=2,\n        num_classes=None,\n        use_checkpoint=False,\n        use_fp16=False,\n        num_heads=-1,\n        num_head_channels=-1,\n        num_heads_upsample=-1,\n        use_scale_shift_norm=False,\n        resblock_updown=False,\n        use_new_attention_order=False,\n        use_max_self_attn=False,\n        use_spatial_transformer=False,    # custom transformer support\n        use_max_spatial_transfomer=False,\n        transformer_depth=1,              # custom transformer support\n        context_dim=None,                 # custom transformer support\n        n_embed=None,                     # custom support for prediction of discrete ids into codebook of first stage vq model\n        legacy=True,\n    ):\n        super().__init__()\n        if use_spatial_transformer:\n            assert context_dim is not None, 'Fool!! You forgot to include the dimension of your cross-attention conditioning...'\n\n        if context_dim is not None:\n            assert use_spatial_transformer, 'Fool!! 
You forgot to use the spatial transformer for your cross-attention conditioning...'\n            from omegaconf.listconfig import ListConfig\n            if type(context_dim) == ListConfig:\n                context_dim = list(context_dim)\n\n        if num_heads_upsample == -1:\n            num_heads_upsample = num_heads\n\n        if num_heads == -1:\n            assert num_head_channels != -1, 'Either num_heads or num_head_channels has to be set'\n\n        if num_head_channels == -1:\n            assert num_heads != -1, 'Either num_heads or num_head_channels has to be set'\n\n        self.image_size = image_size # 32\n        self.in_channels = in_channels # 3\n        self.model_channels = model_channels # 256\n        self.out_channels = out_channels # 3\n        self.num_res_blocks = num_res_blocks # 2\n        self.attention_resolutions = attention_resolutions # [4,2,1]\n        self.dropout = dropout # 0\n        self.channel_mult = channel_mult # [1,2,4]\n        self.conv_resample = conv_resample # True\n        self.num_classes = num_classes # None\n        self.use_checkpoint = use_checkpoint # False\n        self.dtype = th.float16 if use_fp16 else th.float32 # float32\n        self.num_heads = num_heads # -1\n        self.num_head_channels = num_head_channels # 32\n        self.num_heads_upsample = num_heads_upsample # -1\n        self.predict_codebook_ids = n_embed is not None # False\n\n        time_embed_dim = model_channels * 4 # 256*4 = 1024\n        self.time_embed = nn.Sequential(\n            linear(model_channels, time_embed_dim),\n            nn.SiLU(),\n            linear(time_embed_dim, time_embed_dim),\n        )\n\n        if self.num_classes is not None:\n            self.label_emb = nn.Embedding(num_classes, time_embed_dim)\n\n        self.input_blocks = nn.ModuleList(\n            [\n                TimestepEmbedSequential(\n                    conv_nd(dims, in_channels, model_channels, 3, padding=1)\n                )\n            
]\n        )\n        self._feature_size = model_channels # 256\n        input_block_chans = [model_channels] # [256,]\n        ch = model_channels # 256\n        ds = 1\n        max_self_attn_ws = min(self.image_size // 4, 8)\n        for level, mult in enumerate(channel_mult): # [1,2,4]\n            for _ in range(num_res_blocks):\n                layers = [\n                    ResBlock(\n                        ch,\n                        time_embed_dim,\n                        dropout,\n                        out_channels=mult * model_channels,\n                        dims=dims,\n                        use_checkpoint=use_checkpoint,\n                        use_scale_shift_norm=use_scale_shift_norm,\n                    )\n                ]\n                ch = mult * model_channels\n                if ds in attention_resolutions:\n                    if num_head_channels == -1:\n                        dim_head = ch // num_heads\n                    else:\n                        num_heads = ch // num_head_channels\n                        dim_head = num_head_channels # 32\n                    if legacy:\n                        #num_heads = 1\n                        dim_head = ch // num_heads if use_spatial_transformer else num_head_channels\n                    layers.append(\n                        AttentionBlock(\n                            ch,\n                            use_checkpoint=use_checkpoint,\n                            num_heads=num_heads,\n                            num_head_channels=dim_head,\n                            use_new_attention_order=use_new_attention_order,\n                        ) if not use_max_self_attn and not use_spatial_transformer else MaxAttentionBlock(\n                            ch, num_heads, dim_head, window_size=max_self_attn_ws\n                        ) if not use_spatial_transformer else SpatialTransformer(\n                            ch, num_heads, dim_head, depth=transformer_depth, 
context_dim=context_dim\n                        ) if not use_max_spatial_transfomer else SpatialTransformerWithMax(\n                            ch, num_heads, dim_head, context_dim=context_dim\n                        )\n                    )\n                self.input_blocks.append(TimestepEmbedSequential(*layers))\n                self._feature_size += ch\n                input_block_chans.append(ch)\n            if level != len(channel_mult) - 1:\n                out_ch = ch\n                self.input_blocks.append(\n                    TimestepEmbedSequential(\n                        ResBlock(\n                            ch,\n                            time_embed_dim,\n                            dropout,\n                            out_channels=out_ch,\n                            dims=dims,\n                            use_checkpoint=use_checkpoint,\n                            use_scale_shift_norm=use_scale_shift_norm,\n                            down=True,\n                        )\n                        if resblock_updown\n                        else Downsample(\n                            ch, conv_resample, dims=dims, out_channels=out_ch\n                        )\n                    )\n                )\n                ch = out_ch\n                input_block_chans.append(ch)\n                ds *= 2\n                self._feature_size += ch\n\n        if num_head_channels == -1:\n            dim_head = ch // num_heads\n        else:\n            num_heads = ch // num_head_channels\n            dim_head = num_head_channels\n        if legacy:\n            #num_heads = 1\n            dim_head = ch // num_heads if use_spatial_transformer else num_head_channels\n        self.middle_block = TimestepEmbedSequential(\n            ResBlock(\n                ch,\n                time_embed_dim,\n                dropout,\n                dims=dims,\n                use_checkpoint=use_checkpoint,\n                
use_scale_shift_norm=use_scale_shift_norm,\n            ),\n            AttentionBlock(\n                ch,\n                use_checkpoint=use_checkpoint,\n                num_heads=num_heads,\n                num_head_channels=dim_head,\n                use_new_attention_order=use_new_attention_order,\n            ) if not use_max_self_attn and not use_spatial_transformer else MaxAttentionBlock(\n                ch, num_heads, dim_head, window_size=max_self_attn_ws\n            ) if not use_spatial_transformer else SpatialTransformer(\n                ch, num_heads, dim_head, depth=transformer_depth, context_dim=context_dim\n            ) if not use_max_spatial_transfomer else SpatialTransformerWithMax(\n                ch, num_heads, dim_head, context_dim=context_dim\n            ),\n            ResBlock(\n                ch,\n                time_embed_dim,\n                dropout,\n                dims=dims,\n                use_checkpoint=use_checkpoint,\n                use_scale_shift_norm=use_scale_shift_norm,\n            ),\n        )\n        self._feature_size += ch\n\n        self.output_blocks = nn.ModuleList([])\n        for level, mult in list(enumerate(channel_mult))[::-1]: # [4,2,1]\n            for i in range(num_res_blocks + 1):\n                ich = input_block_chans.pop()\n                layers = [\n                    ResBlock(\n                        ch + ich,\n                        time_embed_dim,\n                        dropout,\n                        out_channels=model_channels * mult,\n                        dims=dims,\n                        use_checkpoint=use_checkpoint,\n                        use_scale_shift_norm=use_scale_shift_norm,\n                    )\n                ]\n                ch = model_channels * mult\n                if ds in attention_resolutions:\n                    if num_head_channels == -1:\n                        dim_head = ch // num_heads\n                    else:\n                        
num_heads = ch // num_head_channels\n                        dim_head = num_head_channels\n                    if legacy:\n                        #num_heads = 1\n                        dim_head = ch // num_heads if use_spatial_transformer else num_head_channels\n                    layers.append(\n                        AttentionBlock(\n                            ch,\n                            use_checkpoint=use_checkpoint,\n                            num_heads=num_heads_upsample,\n                            num_head_channels=dim_head,\n                            use_new_attention_order=use_new_attention_order,\n                        ) if not use_max_self_attn and not use_spatial_transformer else MaxAttentionBlock(\n                            ch, num_heads, dim_head, window_size=max_self_attn_ws\n                        ) if not use_spatial_transformer else SpatialTransformer(\n                            ch, num_heads, dim_head, depth=transformer_depth, context_dim=context_dim\n                        ) if not use_max_spatial_transfomer else SpatialTransformerWithMax(\n                            ch, num_heads, dim_head, context_dim=context_dim\n                        )\n                    )\n                if level and i == num_res_blocks:\n                    out_ch = ch\n                    layers.append(\n                        ResBlock(\n                            ch,\n                            time_embed_dim,\n                            dropout,\n                            out_channels=out_ch,\n                            dims=dims,\n                            use_checkpoint=use_checkpoint,\n                            use_scale_shift_norm=use_scale_shift_norm,\n                            up=True,\n                        )\n                        if resblock_updown\n                        else Upsample(ch, conv_resample, dims=dims, out_channels=out_ch)\n                    )\n                    ds //= 2\n                
self.output_blocks.append(TimestepEmbedSequential(*layers))\n                self._feature_size += ch\n\n        self.out = nn.Sequential(\n            normalization(ch),\n            nn.SiLU(),\n            zero_module(conv_nd(dims, model_channels, out_channels, 3, padding=1)),\n        )\n        if self.predict_codebook_ids:\n            self.id_predictor = nn.Sequential(\n            normalization(ch),\n            conv_nd(dims, model_channels, n_embed, 1),\n            #nn.LogSoftmax(dim=1)  # change to cross_entropy and produce non-normalized logits\n        )\n\n    def convert_to_fp16(self):\n        \"\"\"\n        Convert the torso of the model to float16.\n        \"\"\"\n        self.input_blocks.apply(convert_module_to_f16)\n        self.middle_block.apply(convert_module_to_f16)\n        self.output_blocks.apply(convert_module_to_f16)\n\n    def convert_to_fp32(self):\n        \"\"\"\n        Convert the torso of the model to float32.\n        \"\"\"\n        self.input_blocks.apply(convert_module_to_f32)\n        self.middle_block.apply(convert_module_to_f32)\n        self.output_blocks.apply(convert_module_to_f32)\n\n    def forward(self, x, timesteps=None, context=None, y=None,**kwargs):\n        \"\"\"\n        Apply the model to an input batch.\n        :param x: an [N x C x ...] Tensor of inputs.\n        :param timesteps: a 1-D batch of timesteps.\n        :param context: conditioning plugged in via crossattn\n        :param y: an [N] Tensor of labels, if class-conditional.\n        :return: an [N x C x ...] 
Tensor of outputs.\n        \"\"\"\n        assert (y is not None) == (\n            self.num_classes is not None\n        ), \"must specify y if and only if the model is class-conditional\"\n        hs = []\n        t_emb = timestep_embedding(timesteps, self.model_channels, repeat_only=False)\n        emb = self.time_embed(t_emb) # (8,1024)\n\n        if self.num_classes is not None:\n            assert y.shape == (x.shape[0],)\n            emb = emb + self.label_emb(y)\n\n        h = x.type(self.dtype)\n        for module in self.input_blocks:\n            h = module(h, emb, context)\n            hs.append(h)\n        h = self.middle_block(h, emb, context)\n        for module in self.output_blocks:\n            h = th.cat([h, hs.pop()], dim=1)\n            h = module(h, emb, context)\n        h = h.type(x.dtype)\n        if self.predict_codebook_ids:\n            return self.id_predictor(h)\n        else:\n            return self.out(h)\n\n\nclass EncoderUNetModel(nn.Module):\n    \"\"\"\n    The half UNet model with attention and timestep embedding.\n    For usage, see UNet.\n    \"\"\"\n\n    def __init__(\n        self,\n        image_size,\n        in_channels,\n        model_channels,\n        out_channels,\n        num_res_blocks,\n        attention_resolutions,\n        dropout=0,\n        channel_mult=(1, 2, 4, 8),\n        conv_resample=True,\n        dims=2,\n        use_checkpoint=False,\n        use_fp16=False,\n        num_heads=1,\n        num_head_channels=-1,\n        num_heads_upsample=-1,\n        use_scale_shift_norm=False,\n        resblock_updown=False,\n        use_new_attention_order=False,\n        pool=\"adaptive\",\n        *args,\n        **kwargs\n    ):\n        super().__init__()\n\n        if num_heads_upsample == -1:\n            num_heads_upsample = num_heads\n\n        self.in_channels = in_channels\n        self.model_channels = model_channels\n        self.out_channels = out_channels\n        self.num_res_blocks = 
num_res_blocks\n        self.attention_resolutions = attention_resolutions\n        self.dropout = dropout\n        self.channel_mult = channel_mult\n        self.conv_resample = conv_resample\n        self.use_checkpoint = use_checkpoint\n        self.dtype = th.float16 if use_fp16 else th.float32\n        self.num_heads = num_heads\n        self.num_head_channels = num_head_channels\n        self.num_heads_upsample = num_heads_upsample\n\n        time_embed_dim = model_channels * 4\n        self.time_embed = nn.Sequential(\n            linear(model_channels, time_embed_dim),\n            nn.SiLU(),\n            linear(time_embed_dim, time_embed_dim),\n        )\n\n        self.input_blocks = nn.ModuleList(\n            [\n                TimestepEmbedSequential(\n                    conv_nd(dims, in_channels, model_channels, 3, padding=1)\n                )\n            ]\n        )\n        self._feature_size = model_channels\n        input_block_chans = [model_channels]\n        ch = model_channels\n        ds = 1\n        for level, mult in enumerate(channel_mult):\n            for _ in range(num_res_blocks):\n                layers = [\n                    ResBlock(\n                        ch,\n                        time_embed_dim,\n                        dropout,\n                        out_channels=mult * model_channels,\n                        dims=dims,\n                        use_checkpoint=use_checkpoint,\n                        use_scale_shift_norm=use_scale_shift_norm,\n                    )\n                ]\n                ch = mult * model_channels\n                if ds in attention_resolutions:\n                    layers.append(\n                        AttentionBlock(\n                            ch,\n                            use_checkpoint=use_checkpoint,\n                            num_heads=num_heads,\n                            num_head_channels=num_head_channels,\n                            
use_new_attention_order=use_new_attention_order,\n                        )\n                    )\n                self.input_blocks.append(TimestepEmbedSequential(*layers))\n                self._feature_size += ch\n                input_block_chans.append(ch)\n            if level != len(channel_mult) - 1:\n                out_ch = ch\n                self.input_blocks.append(\n                    TimestepEmbedSequential(\n                        ResBlock(\n                            ch,\n                            time_embed_dim,\n                            dropout,\n                            out_channels=out_ch,\n                            dims=dims,\n                            use_checkpoint=use_checkpoint,\n                            use_scale_shift_norm=use_scale_shift_norm,\n                            down=True,\n                        )\n                        if resblock_updown\n                        else Downsample(\n                            ch, conv_resample, dims=dims, out_channels=out_ch\n                        )\n                    )\n                )\n                ch = out_ch\n                input_block_chans.append(ch)\n                ds *= 2\n                self._feature_size += ch\n\n        self.middle_block = TimestepEmbedSequential(\n            ResBlock(\n                ch,\n                time_embed_dim,\n                dropout,\n                dims=dims,\n                use_checkpoint=use_checkpoint,\n                use_scale_shift_norm=use_scale_shift_norm,\n            ),\n            AttentionBlock(\n                ch,\n                use_checkpoint=use_checkpoint,\n                num_heads=num_heads,\n                num_head_channels=num_head_channels,\n                use_new_attention_order=use_new_attention_order,\n            ),\n            ResBlock(\n                ch,\n                time_embed_dim,\n                dropout,\n                dims=dims,\n                
use_checkpoint=use_checkpoint,\n                use_scale_shift_norm=use_scale_shift_norm,\n            ),\n        )\n        self._feature_size += ch\n        self.pool = pool\n        if pool == \"adaptive\":\n            self.out = nn.Sequential(\n                normalization(ch),\n                nn.SiLU(),\n                nn.AdaptiveAvgPool2d((1, 1)),\n                zero_module(conv_nd(dims, ch, out_channels, 1)),\n                nn.Flatten(),\n            )\n        elif pool == \"attention\":\n            assert num_head_channels != -1\n            self.out = nn.Sequential(\n                normalization(ch),\n                nn.SiLU(),\n                AttentionPool2d(\n                    (image_size // ds), ch, num_head_channels, out_channels\n                ),\n            )\n        elif pool == \"spatial\":\n            self.out = nn.Sequential(\n                nn.Linear(self._feature_size, 2048),\n                nn.ReLU(),\n                nn.Linear(2048, self.out_channels),\n            )\n        elif pool == \"spatial_v2\":\n            self.out = nn.Sequential(\n                nn.Linear(self._feature_size, 2048),\n                normalization(2048),\n                nn.SiLU(),\n                nn.Linear(2048, self.out_channels),\n            )\n        else:\n            raise NotImplementedError(f\"Unexpected {pool} pooling\")\n\n    def convert_to_fp16(self):\n        \"\"\"\n        Convert the torso of the model to float16.\n        \"\"\"\n        self.input_blocks.apply(convert_module_to_f16)\n        self.middle_block.apply(convert_module_to_f16)\n\n    def convert_to_fp32(self):\n        \"\"\"\n        Convert the torso of the model to float32.\n        \"\"\"\n        self.input_blocks.apply(convert_module_to_f32)\n        self.middle_block.apply(convert_module_to_f32)\n\n    def forward(self, x, timesteps):\n        \"\"\"\n        Apply the model to an input batch.\n        :param x: an [N x C x ...] 
Tensor of inputs.\n        :param timesteps: a 1-D batch of timesteps.\n        :return: an [N x K] Tensor of outputs.\n        \"\"\"\n        emb = self.time_embed(timestep_embedding(timesteps, self.model_channels))\n\n        results = []\n        h = x.type(self.dtype)\n        for module in self.input_blocks:\n            h = module(h, emb)\n            if self.pool.startswith(\"spatial\"):\n                results.append(h.type(x.dtype).mean(dim=(2, 3)))\n        h = self.middle_block(h, emb)\n        if self.pool.startswith(\"spatial\"):\n            results.append(h.type(x.dtype).mean(dim=(2, 3)))\n            h = th.cat(results, axis=-1)\n            return self.out(h)\n        else:\n            h = h.type(x.dtype)\n            return self.out(h)\n\n"
  },
  {
    "path": "ldm/modules/diffusionmodules/util.py",
    "content": "# adopted from\n# https://github.com/openai/improved-diffusion/blob/main/improved_diffusion/gaussian_diffusion.py\n# and\n# https://github.com/lucidrains/denoising-diffusion-pytorch/blob/7706bdfc6f527f58d33f84b7b522e61e6e3164b3/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py\n# and\n# https://github.com/openai/guided-diffusion/blob/0ba878e517b276c45d1195eb29f6f5f72659a05b/guided_diffusion/nn.py\n#\n# thanks!\n\n\nimport os\nimport math\nimport torch\nimport torch.nn as nn\nimport numpy as np\nfrom einops import repeat\n\nfrom ldm.util import instantiate_from_config\n\n\ndef make_beta_schedule(schedule, n_timestep, linear_start=1e-4, linear_end=2e-2, cosine_s=8e-3):\n    if schedule == \"linear\":\n        betas = (\n                torch.linspace(linear_start ** 0.5, linear_end ** 0.5, n_timestep, dtype=torch.float64) ** 2\n        )\n\n    elif schedule == \"cosine\":\n        timesteps = (\n                torch.arange(n_timestep + 1, dtype=torch.float64) / n_timestep + cosine_s\n        )\n        alphas = timesteps / (1 + cosine_s) * np.pi / 2\n        alphas = torch.cos(alphas).pow(2)\n        alphas = alphas / alphas[0]\n        betas = 1 - alphas[1:] / alphas[:-1]\n        betas = np.clip(betas, a_min=0, a_max=0.999)\n\n    elif schedule == \"sqrt_linear\":\n        betas = torch.linspace(linear_start, linear_end, n_timestep, dtype=torch.float64)\n    elif schedule == \"sqrt\":\n        betas = torch.linspace(linear_start, linear_end, n_timestep, dtype=torch.float64) ** 0.5\n    else:\n        raise ValueError(f\"schedule '{schedule}' unknown.\")\n    return betas.numpy()\n\n\ndef make_ddim_timesteps(ddim_discr_method, num_ddim_timesteps, num_ddpm_timesteps, verbose=True):\n    if ddim_discr_method == 'uniform':\n        c = num_ddpm_timesteps // num_ddim_timesteps\n        ddim_timesteps = np.asarray(list(range(0, num_ddpm_timesteps, c)))\n    elif ddim_discr_method == 'quad':\n        ddim_timesteps = ((np.linspace(0, 
np.sqrt(num_ddpm_timesteps * .8), num_ddim_timesteps)) ** 2).astype(int)\n    else:\n        raise NotImplementedError(f'There is no ddim discretization method called \"{ddim_discr_method}\"')\n\n    # assert ddim_timesteps.shape[0] == num_ddim_timesteps\n    # add one to get the final alpha values right (the ones from first scale to data during sampling)\n    steps_out = ddim_timesteps + 1\n    if verbose:\n        print(f'Selected timesteps for ddim sampler: {steps_out}')\n    return steps_out\n\n\ndef make_ddim_sampling_parameters(alphacums, ddim_timesteps, eta, verbose=True):\n    # select alphas for computing the variance schedule\n    alphas = alphacums[ddim_timesteps]\n    alphas_prev = np.asarray([alphacums[0]] + alphacums[ddim_timesteps[:-1]].tolist())\n\n    # according to the formula provided in https://arxiv.org/abs/2010.02502\n    sigmas = eta * np.sqrt((1 - alphas_prev) / (1 - alphas) * (1 - alphas / alphas_prev))\n    if verbose:\n        print(f'Selected alphas for ddim sampler: a_t: {alphas}; a_(t-1): {alphas_prev}')\n        print(f'For the chosen value of eta, which is {eta}, '\n              f'this results in the following sigma_t schedule for ddim sampler {sigmas}')\n    return sigmas, alphas, alphas_prev\n\n\ndef betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.999):\n    \"\"\"\n    Create a beta schedule that discretizes the given alpha_t_bar function,\n    which defines the cumulative product of (1-beta) over time from t = [0,1].\n    :param num_diffusion_timesteps: the number of betas to produce.\n    :param alpha_bar: a lambda that takes an argument t from 0 to 1 and\n                      produces the cumulative product of (1-beta) up to that\n                      part of the diffusion process.\n    :param max_beta: the maximum beta to use; use values lower than 1 to\n                     prevent singularities.\n    \"\"\"\n    betas = []\n    for i in range(num_diffusion_timesteps):\n        t1 = i / 
num_diffusion_timesteps\n        t2 = (i + 1) / num_diffusion_timesteps\n        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))\n    return np.array(betas)\n\n\ndef extract_into_tensor(a, t, x_shape):\n    b, *_ = t.shape\n    out = a.gather(-1, t)\n    return out.reshape(b, *((1,) * (len(x_shape) - 1)))\n\n\ndef checkpoint(func, inputs, params, flag):\n    \"\"\"\n    Evaluate a function without caching intermediate activations, allowing for\n    reduced memory at the expense of extra compute in the backward pass.\n    :param func: the function to evaluate.\n    :param inputs: the argument sequence to pass to `func`.\n    :param params: a sequence of parameters `func` depends on but does not\n                   explicitly take as arguments.\n    :param flag: if False, disable gradient checkpointing.\n    \"\"\"\n    if flag:\n        args = tuple(inputs) + tuple(params)\n        return CheckpointFunction.apply(func, len(inputs), *args)\n    else:\n        return func(*inputs)\n\n\nclass CheckpointFunction(torch.autograd.Function):\n    @staticmethod\n    def forward(ctx, run_function, length, *args):\n        ctx.run_function = run_function\n        ctx.input_tensors = list(args[:length])\n        ctx.input_params = list(args[length:])\n\n        with torch.no_grad():\n            output_tensors = ctx.run_function(*ctx.input_tensors)\n        return output_tensors\n\n    @staticmethod\n    def backward(ctx, *output_grads):\n        ctx.input_tensors = [x.detach().requires_grad_(True) for x in ctx.input_tensors]\n        with torch.enable_grad():\n            # Fixes a bug where the first op in run_function modifies the\n            # Tensor storage in place, which is not allowed for detach()'d\n            # Tensors.\n            shallow_copies = [x.view_as(x) for x in ctx.input_tensors]\n            output_tensors = ctx.run_function(*shallow_copies)\n        input_grads = torch.autograd.grad(\n            output_tensors,\n            
ctx.input_tensors + ctx.input_params,\n            output_grads,\n            allow_unused=True,\n        )\n        del ctx.input_tensors\n        del ctx.input_params\n        del output_tensors\n        return (None, None) + input_grads\n\n\ndef timestep_embedding(timesteps, dim, max_period=10000, repeat_only=False):\n    \"\"\"\n    Create sinusoidal timestep embeddings.\n    :param timesteps: a 1-D Tensor of N indices, one per batch element.\n                      These may be fractional.\n    :param dim: the dimension of the output.\n    :param max_period: controls the minimum frequency of the embeddings.\n    :return: an [N x dim] Tensor of positional embeddings.\n    \"\"\"\n    if not repeat_only:\n        half = dim // 2\n        freqs = torch.exp(\n            -math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half\n        ).to(device=timesteps.device)\n        args = timesteps[:, None].float() * freqs[None]\n        embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)\n        if dim % 2:\n            embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)\n    else:\n        embedding = repeat(timesteps, 'b -> b d', d=dim)\n    return embedding\n\n\ndef zero_module(module):\n    \"\"\"\n    Zero out the parameters of a module and return it.\n    \"\"\"\n    for p in module.parameters():\n        p.detach().zero_()\n    return module\n\n\ndef scale_module(module, scale):\n    \"\"\"\n    Scale the parameters of a module and return it.\n    \"\"\"\n    for p in module.parameters():\n        p.detach().mul_(scale)\n    return module\n\n\ndef mean_flat(tensor):\n    \"\"\"\n    Take the mean over all non-batch dimensions.\n    \"\"\"\n    return tensor.mean(dim=list(range(1, len(tensor.shape))))\n\n\ndef normalization(channels):\n    \"\"\"\n    Make a standard normalization layer.\n    :param channels: number of input channels.\n    :return: an nn.Module for normalization.\n    \"\"\"\n    
return GroupNorm32(32, channels)\n\n\n# PyTorch 1.7 has SiLU, but we support PyTorch 1.5.\nclass SiLU(nn.Module):\n    def forward(self, x):\n        return x * torch.sigmoid(x)\n\n\nclass GroupNorm32(nn.GroupNorm):\n    def forward(self, x):\n        return super().forward(x.float()).type(x.dtype)\n\ndef conv_nd(dims, *args, **kwargs):\n    \"\"\"\n    Create a 1D, 2D, or 3D convolution module.\n    \"\"\"\n    if dims == 1:\n        return nn.Conv1d(*args, **kwargs)\n    elif dims == 2:\n        return nn.Conv2d(*args, **kwargs)\n    elif dims == 3:\n        return nn.Conv3d(*args, **kwargs)\n    raise ValueError(f\"unsupported dimensions: {dims}\")\n\n\ndef linear(*args, **kwargs):\n    \"\"\"\n    Create a linear module.\n    \"\"\"\n    return nn.Linear(*args, **kwargs)\n\n\ndef avg_pool_nd(dims, *args, **kwargs):\n    \"\"\"\n    Create a 1D, 2D, or 3D average pooling module.\n    \"\"\"\n    if dims == 1:\n        return nn.AvgPool1d(*args, **kwargs)\n    elif dims == 2:\n        return nn.AvgPool2d(*args, **kwargs)\n    elif dims == 3:\n        return nn.AvgPool3d(*args, **kwargs)\n    raise ValueError(f\"unsupported dimensions: {dims}\")\n\n\nclass HybridConditioner(nn.Module):\n\n    def __init__(self, c_concat_config, c_crossattn_config):\n        super().__init__()\n        self.concat_conditioner = instantiate_from_config(c_concat_config)\n        self.crossattn_conditioner = instantiate_from_config(c_crossattn_config)\n\n    def forward(self, c_concat, c_crossattn):\n        c_concat = self.concat_conditioner(c_concat)\n        c_crossattn = self.crossattn_conditioner(c_crossattn)\n        return {'c_concat': [c_concat], 'c_crossattn': [c_crossattn]}\n\n\ndef noise_like(shape, device, repeat=False):\n    repeat_noise = lambda: torch.randn((1, *shape[1:]), device=device).repeat(shape[0], *((1,) * (len(shape) - 1)))\n    noise = lambda: torch.randn(shape, device=device)\n    return repeat_noise() if repeat else noise()"
  },
  {
    "path": "ldm/modules/ema.py",
    "content": "import torch\nfrom torch import nn\n\n\nclass LitEma(nn.Module):\n    def __init__(self, model, decay=0.9999, use_num_upates=True):\n        super().__init__()\n        if decay < 0.0 or decay > 1.0:\n            raise ValueError('Decay must be between 0 and 1')\n\n        self.m_name2s_name = {}\n        self.register_buffer('decay', torch.tensor(decay, dtype=torch.float32))\n        self.register_buffer('num_updates', torch.tensor(0,dtype=torch.int) if use_num_upates\n                             else torch.tensor(-1,dtype=torch.int))\n\n        for name, p in model.named_parameters():\n            if p.requires_grad:\n                #remove as '.'-character is not allowed in buffers\n                s_name = name.replace('.','')\n                self.m_name2s_name.update({name:s_name})\n                self.register_buffer(s_name,p.clone().detach().data)\n\n        self.collected_params = []\n\n    def forward(self,model):\n        decay = self.decay\n\n        if self.num_updates >= 0:\n            self.num_updates += 1\n            decay = min(self.decay,(1 + self.num_updates) / (10 + self.num_updates))\n\n        one_minus_decay = 1.0 - decay\n\n        with torch.no_grad():\n            m_param = dict(model.named_parameters())\n            shadow_params = dict(self.named_buffers())\n\n            for key in m_param:\n                if m_param[key].requires_grad:\n                    sname = self.m_name2s_name[key]\n                    shadow_params[sname] = shadow_params[sname].type_as(m_param[key])\n                    shadow_params[sname].sub_(one_minus_decay * (shadow_params[sname] - m_param[key]))\n                else:\n                    assert not key in self.m_name2s_name\n\n    def copy_to(self, model):\n        m_param = dict(model.named_parameters())\n        shadow_params = dict(self.named_buffers())\n        for key in m_param:\n            if m_param[key].requires_grad:\n                
m_param[key].data.copy_(shadow_params[self.m_name2s_name[key]].data)\n            else:\n                assert not key in self.m_name2s_name\n\n    def store(self, parameters):\n        \"\"\"\n        Save the current parameters for restoring later.\n        Args:\n          parameters: Iterable of `torch.nn.Parameter`; the parameters to be\n            temporarily stored.\n        \"\"\"\n        self.collected_params = [param.clone() for param in parameters]\n\n    def restore(self, parameters):\n        \"\"\"\n        Restore the parameters stored with the `store` method.\n        Useful to validate the model with EMA parameters without affecting the\n        original optimization process. Store the parameters before the\n        `copy_to` method. After validation (or model saving), use this to\n        restore the former parameters.\n        Args:\n          parameters: Iterable of `torch.nn.Parameter`; the parameters to be\n            updated with the stored parameters.\n        \"\"\"\n        for c_param, param in zip(self.collected_params, parameters):\n            param.data.copy_(c_param.data)\n"
  },
  {
    "path": "ldm/modules/losses/__init__.py",
    "content": ""
  },
  {
    "path": "ldm/modules/losses/vqperceptual.py",
    "content": "import torch\nfrom torch import nn\nimport torch.nn.functional as F\nfrom einops import repeat\n\nfrom taming.modules.discriminator.model import NLayerDiscriminator, weights_init\nfrom taming.modules.losses.lpips import LPIPS\nfrom taming.modules.losses.vqperceptual import hinge_d_loss, vanilla_d_loss\n\n\ndef hinge_d_loss_with_exemplar_weights(logits_real, logits_fake, weights):\n    assert weights.shape[0] == logits_real.shape[0] == logits_fake.shape[0]\n    loss_real = torch.mean(F.relu(1. - logits_real), dim=[1,2,3])\n    loss_fake = torch.mean(F.relu(1. + logits_fake), dim=[1,2,3])\n    loss_real = (weights * loss_real).sum() / weights.sum()\n    loss_fake = (weights * loss_fake).sum() / weights.sum()\n    d_loss = 0.5 * (loss_real + loss_fake)\n    return d_loss\n\ndef adopt_weight(weight, global_step, threshold=0, value=0.):\n    if global_step < threshold:\n        weight = value\n    return weight\n\n\ndef measure_perplexity(predicted_indices, n_embed):\n    # src: https://github.com/karpathy/deep-vector-quantization/blob/main/model.py\n    # eval cluster perplexity. 
when perplexity == num_embeddings then all clusters are used exactly equally\n    encodings = F.one_hot(predicted_indices, n_embed).float().reshape(-1, n_embed)\n    avg_probs = encodings.mean(0)\n    perplexity = (-(avg_probs * torch.log(avg_probs + 1e-10)).sum()).exp()\n    cluster_use = torch.sum(avg_probs > 0)\n    return perplexity, cluster_use\n\ndef l1(x, y):\n    return torch.abs(x-y)\n\n\ndef l2(x, y):\n    return torch.pow((x-y), 2)\n\n\nclass VQLPIPSWithDiscriminator(nn.Module):\n    def __init__(self, disc_start, codebook_weight=1.0, pixelloss_weight=1.0,\n                 disc_num_layers=3, disc_in_channels=3, disc_factor=1.0, disc_weight=1.0,\n                 perceptual_weight=1.0, use_actnorm=False, disc_conditional=False,\n                 disc_ndf=64, disc_loss=\"hinge\", n_classes=None, perceptual_loss=\"lpips\",\n                 pixel_loss=\"l1\"):\n        super().__init__()\n        assert disc_loss in [\"hinge\", \"vanilla\"]\n        assert perceptual_loss in [\"lpips\", \"clips\", \"dists\"]\n        assert pixel_loss in [\"l1\", \"l2\"]\n        self.codebook_weight = codebook_weight\n        self.pixel_weight = pixelloss_weight\n        if perceptual_loss == \"lpips\":\n            print(f\"{self.__class__.__name__}: Running with LPIPS.\")\n            self.perceptual_loss = LPIPS().eval()\n        else:\n            raise ValueError(f\"Unknown perceptual loss: >> {perceptual_loss} <<\")\n        self.perceptual_weight = perceptual_weight\n\n        if pixel_loss == \"l1\":\n            self.pixel_loss = l1\n        else:\n            self.pixel_loss = l2\n\n        self.discriminator = NLayerDiscriminator(input_nc=disc_in_channels,\n                                                 n_layers=disc_num_layers,\n                                                 use_actnorm=use_actnorm,\n                                                 ndf=disc_ndf\n                                                 ).apply(weights_init)\n        
self.discriminator_iter_start = disc_start\n        if disc_loss == \"hinge\":\n            self.disc_loss = hinge_d_loss\n        elif disc_loss == \"vanilla\":\n            self.disc_loss = vanilla_d_loss\n        else:\n            raise ValueError(f\"Unknown GAN loss '{disc_loss}'.\")\n        print(f\"VQLPIPSWithDiscriminator running with {disc_loss} loss.\")\n        self.disc_factor = disc_factor\n        self.discriminator_weight = disc_weight\n        self.disc_conditional = disc_conditional\n        self.n_classes = n_classes\n\n    def calculate_adaptive_weight(self, nll_loss, g_loss, last_layer=None):\n        if last_layer is not None:\n            nll_grads = torch.autograd.grad(nll_loss, last_layer, retain_graph=True)[0]\n            g_grads = torch.autograd.grad(g_loss, last_layer, retain_graph=True)[0]\n        else:\n            nll_grads = torch.autograd.grad(nll_loss, self.last_layer[0], retain_graph=True)[0]\n            g_grads = torch.autograd.grad(g_loss, self.last_layer[0], retain_graph=True)[0]\n\n        d_weight = torch.norm(nll_grads) / (torch.norm(g_grads) + 1e-4)\n        d_weight = torch.clamp(d_weight, 0.0, 1e4).detach()\n        d_weight = d_weight * self.discriminator_weight\n        return d_weight\n\n    def forward(self, codebook_loss, inputs, reconstructions, optimizer_idx,\n                global_step, last_layer=None, cond=None, split=\"train\", predicted_indices=None):\n        #rec_loss = torch.abs(inputs.contiguous() - reconstructions.contiguous())\n        rec_loss = self.pixel_loss(inputs.contiguous(), reconstructions.contiguous())\n        if self.perceptual_weight > 0:\n            p_loss = self.perceptual_loss(inputs.contiguous(), reconstructions.contiguous())\n            rec_loss = rec_loss + self.perceptual_weight * p_loss\n        else:\n            p_loss = torch.tensor([0.0])\n\n        nll_loss = rec_loss\n        #nll_loss = torch.sum(nll_loss) / nll_loss.shape[0]\n        nll_loss = torch.mean(nll_loss)\n\n  
      # now the GAN part\n        if optimizer_idx == 0:\n            # generator update\n            if cond is None:\n                assert not self.disc_conditional\n                logits_fake = self.discriminator(reconstructions.contiguous())\n            else:\n                assert self.disc_conditional\n                logits_fake = self.discriminator(torch.cat((reconstructions.contiguous(), cond), dim=1))\n            g_loss = -torch.mean(logits_fake)\n\n            try:\n                d_weight = self.calculate_adaptive_weight(nll_loss, g_loss, last_layer=last_layer)\n            except RuntimeError:\n                assert not self.training\n                d_weight = torch.tensor(0.0)\n\n            disc_factor = adopt_weight(self.disc_factor, global_step, threshold=self.discriminator_iter_start)\n            loss = nll_loss + d_weight * disc_factor * g_loss + self.codebook_weight * codebook_loss.mean()\n\n            log = {\"{}/total_loss\".format(split): loss.clone().detach().mean(),\n                   \"{}/quant_loss\".format(split): codebook_loss.detach().mean(),\n                   \"{}/nll_loss\".format(split): nll_loss.detach().mean(),\n                   \"{}/rec_loss\".format(split): rec_loss.detach().mean(),\n                   \"{}/p_loss\".format(split): p_loss.detach().mean(),\n                   \"{}/d_weight\".format(split): d_weight.detach(),\n                   \"{}/disc_factor\".format(split): torch.tensor(disc_factor),\n                   \"{}/g_loss\".format(split): g_loss.detach().mean(),\n                   }\n            if predicted_indices is not None:\n                assert self.n_classes is not None\n                with torch.no_grad():\n                    perplexity, cluster_usage = measure_perplexity(predicted_indices, self.n_classes)\n                log[f\"{split}/perplexity\"] = perplexity\n                log[f\"{split}/cluster_usage\"] = cluster_usage\n            return loss, log\n\n        if optimizer_idx == 
1:\n            # second pass for discriminator update\n            if cond is None:\n                logits_real = self.discriminator(inputs.contiguous().detach())\n                logits_fake = self.discriminator(reconstructions.contiguous().detach())\n            else:\n                logits_real = self.discriminator(torch.cat((inputs.contiguous().detach(), cond), dim=1))\n                logits_fake = self.discriminator(torch.cat((reconstructions.contiguous().detach(), cond), dim=1))\n\n            disc_factor = adopt_weight(self.disc_factor, global_step, threshold=self.discriminator_iter_start)\n            d_loss = disc_factor * self.disc_loss(logits_real, logits_fake)\n\n            log = {\"{}/disc_loss\".format(split): d_loss.clone().detach().mean(),\n                   \"{}/logits_real\".format(split): logits_real.detach().mean(),\n                   \"{}/logits_fake\".format(split): logits_fake.detach().mean()\n                   }\n            return d_loss, log\n"
  },
  {
    "path": "ldm/modules/maxvit.py",
    "content": "import torch\nfrom torch import nn, einsum\nimport torch.nn.functional\nfrom einops import rearrange, repeat\nfrom einops.layers.torch import Rearrange, Reduce\n\nfrom inspect import isfunction\n\n\n# Code adapted from https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/max_vit.py\n\ndef exists(val):\n    return val is not None\n\n\ndef default(val, d):\n    if exists(val):\n        return val\n    return d() if isfunction(d) else d\n\n\nclass PreNormResidual(nn.Module):\n    def __init__(self, dim, fn):\n        super().__init__()\n        self.norm = nn.LayerNorm(dim)\n        self.fn = fn\n\n    def forward(self, x, c=None):\n        if exists(c):\n            return self.fn(self.norm(x), self.norm(c)) + x\n        return self.fn(self.norm(x)) + x\n\n\nclass SqueezeExcitation(nn.Module):\n    def __init__(self, dim, shrinkage_rate = 0.25):\n        super().__init__()\n        hidden_dim = int(dim * shrinkage_rate)\n\n        self.gate = nn.Sequential(\n            Reduce('b c h w -> b c', 'mean'),\n            nn.Linear(dim, hidden_dim, bias = False),\n            nn.SiLU(),\n            nn.Linear(hidden_dim, dim, bias = False),\n            nn.Sigmoid(),\n            Rearrange('b c -> b c 1 1')\n        )\n\n    def forward(self, x):\n        return x * self.gate(x)\n\n\nclass FeedForward(nn.Module):\n    def __init__(self, dim, mult = 4, dropout = 0.):\n        super().__init__()\n        inner_dim = int(dim * mult)\n        self.net = nn.Sequential(\n            nn.Linear(dim, inner_dim),\n            nn.GELU(),\n            nn.Dropout(dropout),\n            nn.Linear(inner_dim, dim),\n            nn.Dropout(dropout)\n        )\n    def forward(self, x):\n        return self.net(x)\n\n\nclass Attention(nn.Module):\n    def __init__(\n        self,\n        dim,\n        dim_head = 32,\n        dropout = 0.,\n        window_size = 7\n    ):\n        super().__init__()\n        assert (dim % dim_head) == 0, 'dimension should be 
divisible by dimension per head'\n\n        self.heads = dim // dim_head\n        self.scale = dim_head ** -0.5\n\n        self.to_q = nn.Linear(dim, dim, bias = False)\n        self.to_k = nn.Linear(dim, dim, bias = False)\n        self.to_v = nn.Linear(dim, dim, bias = False)\n\n        self.attend = nn.Sequential(\n            nn.Softmax(dim = -1),\n            nn.Dropout(dropout)\n        )\n\n        self.to_out = nn.Sequential(\n            nn.Linear(dim, dim, bias = False),\n            nn.Dropout(dropout)\n        )\n\n        # relative positional bias\n\n        self.rel_pos_bias = nn.Embedding((2 * window_size - 1) ** 2, self.heads)\n\n        pos = torch.arange(window_size)\n        grid = torch.stack(torch.meshgrid(pos, pos, indexing = 'ij'))\n        grid = rearrange(grid, 'c i j -> (i j) c')\n        rel_pos = rearrange(grid, 'i ... -> i 1 ...') - rearrange(grid, 'j ... -> 1 j ...')\n        rel_pos += window_size - 1\n        rel_pos_indices = (rel_pos * torch.tensor([2 * window_size - 1, 1])).sum(dim = -1)\n\n        self.register_buffer('rel_pos_indices', rel_pos_indices, persistent = False)\n\n    def forward(self, x, c=None):\n        c = default(c, x)\n        batch, height, width, window_height, window_width, _, device, h = *x.shape, x.device, self.heads\n\n        # flatten\n\n        x = rearrange(x, 'b x y w1 w2 d -> (b x y) (w1 w2) d')\n        c = rearrange(c, 'b x y w1 w2 d -> (b x y) (w1 w2) d')\n\n        # project for queries, keys, values\n\n        q = self.to_q(x)\n        k = self.to_k(c)\n        v = self.to_v(c)\n\n        # split heads\n\n        q, k, v = map(lambda t: rearrange(t, 'b n (h d ) -> b h n d', h = h), (q, k, v))\n\n        # scale\n\n        q = q * self.scale\n\n        # sim\n\n        sim = einsum('b h i d, b h j d -> b h i j', q, k)\n\n        # add positional bias\n\n        bias = self.rel_pos_bias(self.rel_pos_indices)\n        sim = sim + rearrange(bias, 'i j h -> h i j')\n\n        # attention\n\n        
attn = self.attend(sim)\n\n        # aggregate\n\n        out = einsum('b h i j, b h j d -> b h i d', attn, v)\n\n        # merge heads\n\n        out = rearrange(out, 'b h (w1 w2) d -> b w1 w2 (h d)', w1 = window_height, w2 = window_width)\n\n        # combine heads out\n\n        out = self.to_out(out)\n        return rearrange(out, '(b x y) ... -> b x y ...', x = height, y = width)\n\n\nclass Dropsample(nn.Module):\n    def __init__(self, prob = 0):\n        super().__init__()\n        self.prob = prob\n  \n    def forward(self, x):\n        device = x.device\n\n        if self.prob == 0. or (not self.training):\n            return x\n\n        # one uniform draw per batch element; torch.FloatTensor(tuple) would treat the tuple as data, not as a shape\n        keep_mask = torch.rand((x.shape[0], 1, 1, 1), device = device) > self.prob\n        return x * keep_mask / (1 - self.prob)\n\n\nclass MBConvResidual(nn.Module):\n    def __init__(self, fn, dropout = 0.):\n        super().__init__()\n        self.fn = fn\n        self.dropsample = Dropsample(dropout)\n\n    def forward(self, x):\n        out = self.fn(x)\n        out = self.dropsample(out)\n        return out + x\n\n\ndef MBConv(\n    dim_in,\n    dim_out,\n    *,\n    downsample,\n    expansion_rate = 4,\n    shrinkage_rate = 0.25,\n    dropout = 0.\n):\n    hidden_dim = int(expansion_rate * dim_out)\n    stride = 2 if downsample else 1\n\n    net = nn.Sequential(\n        nn.Conv2d(dim_in, hidden_dim, 1),\n        nn.BatchNorm2d(hidden_dim),\n        nn.GELU(),\n        nn.Conv2d(hidden_dim, hidden_dim, 3, stride = stride, padding = 1, groups = hidden_dim),\n        nn.BatchNorm2d(hidden_dim),\n        nn.GELU(),\n        SqueezeExcitation(hidden_dim, shrinkage_rate = shrinkage_rate),\n        nn.Conv2d(hidden_dim, dim_out, 1),\n        nn.BatchNorm2d(dim_out)\n    )\n\n    if dim_in == dim_out and not downsample:\n        net = MBConvResidual(net, dropout = dropout)\n\n    return net\n\n\nclass MaxAttentionBlock(nn.Module):\n    def __init__(self, in_channels, heads=8, dim_head=64, dropout=0., 
window_size=8):\n        super().__init__()\n        w = window_size\n        layer_dim = dim_head * heads\n\n        self.rearrange_block_in = Rearrange('b d (x w1) (y w2) -> b x y w1 w2 d', w1 = w, w2 = w)  # block-like attention\n        self.attn_block = PreNormResidual(layer_dim, Attention(dim = layer_dim, dim_head = dim_head, dropout = dropout, window_size = w))\n        self.ff_block = PreNormResidual(layer_dim, FeedForward(dim = layer_dim, dropout = dropout))\n        self.rearrange_block_out = Rearrange('b x y w1 w2 d -> b d (x w1) (y w2)')\n\n        self.rearrange_grid_in = Rearrange('b d (w1 x) (w2 y) -> b x y w1 w2 d', w1 = w, w2 = w)  # grid-like attention\n        self.attn_grid = PreNormResidual(layer_dim, Attention(dim = layer_dim, dim_head = dim_head, dropout = dropout, window_size = w))\n        self.ff_grid = PreNormResidual(layer_dim, FeedForward(dim = layer_dim, dropout = dropout))\n        self.rearrange_grid_out = Rearrange('b x y w1 w2 d -> b d (w1 x) (w2 y)')\n\n\n    def forward(self, x):\n\n        # block attention\n        x = self.rearrange_block_in(x)        \n        x = self.attn_block(x)\n        x = self.ff_block(x)\n        x = self.rearrange_block_out(x)\n\n        # grid attention\n        x = self.rearrange_grid_in(x)\n        x = self.attn_grid(x)\n        x = self.ff_grid(x)\n        x = self.rearrange_grid_out(x) \n        \n        ## output stage\n        return x\n\nclass SpatialCrossAttentionWithMax(nn.Module):\n    def __init__(self, in_channels, heads=8, dim_head=64, ctx_dim=None, dropout=0., window_size=8):\n        super().__init__()\n        w = window_size\n        layer_dim = dim_head * heads\n        if ctx_dim == None:\n            self.proj_in = MBConv(layer_dim*2, layer_dim, downsample=False)\n        else:\n            self.proj_in = MBConv(ctx_dim, layer_dim, downsample=False)\n\n        self.rearrange_block_in = Rearrange('b d (x w1) (y w2) -> b x y w1 w2 d', w1 = w, w2 = w)  # block-like attention\n      
  self.attn_block = PreNormResidual(layer_dim, Attention(dim = layer_dim, dim_head = dim_head, dropout = dropout, window_size = w))\n        self.ff_block = PreNormResidual(layer_dim, FeedForward(dim = layer_dim, dropout = dropout))\n        self.rearrange_block_out = Rearrange('b x y w1 w2 d -> b d (x w1) (y w2)')\n\n        self.rearrange_grid_in = Rearrange('b d (w1 x) (w2 y) -> b x y w1 w2 d', w1 = w, w2 = w)  # grid-like attention\n        self.attn_grid = PreNormResidual(layer_dim, Attention(dim = layer_dim, dim_head = dim_head, dropout = dropout, window_size = w))\n        self.ff_grid = PreNormResidual(layer_dim, FeedForward(dim = layer_dim, dropout = dropout))\n        self.rearrange_grid_out = Rearrange('b x y w1 w2 d -> b d (w1 x) (w2 y)')\n\n        self.out_conv = nn.Sequential(\n            SqueezeExcitation(dim=layer_dim*2),\n            nn.Conv2d(layer_dim*2, layer_dim, kernel_size=3, padding=1)\n        )\n\n    def forward(self, x, context=None):\n        context = default(context, x)\n\n        # MBConv\n        c = self.proj_in(context)\n\n        # block attention\n        x = self.rearrange_block_in(x)   \n        c = self.rearrange_block_in(c)\n        x = self.attn_block(x, c)\n        x = self.ff_block(x)\n        x = self.rearrange_block_out(x)\n        c = self.rearrange_block_out(c)\n\n        # grid attention\n        x = self.rearrange_grid_in(x)\n        c = self.rearrange_grid_in(c)\n        x = self.attn_grid(x, c)\n        x = self.ff_grid(x)\n        x = self.rearrange_grid_out(x)\n        \n        return x\n\n\nclass SpatialTransformerWithMax(nn.Module):\n    \"\"\"\n    Transformer block for image-like data.\n    First, project the input (aka embedding) to inner_dim (d) using conv1x1\n    Then reshape to b, t, d.\n    Then apply standard transformer action (BasicTransformerBlock).\n    Finally, reshape to image and pass to output conv1x1 layer, to restore the channel size of input.\n    The dims of the input and output of the 
block are the same (arg in_channels).\n    \"\"\"\n    def __init__(self, in_channels, n_heads, d_head, dropout=0., context_dim=None, w=2):\n        super().__init__()\n        self.in_channels = in_channels\n        self.context_dim = context_dim\n        inner_dim = n_heads * d_head\n\n        self.proj_in = MBConv(context_dim, inner_dim, downsample=False)\n\n        self.rearrange_block_in = Rearrange('b d (x w1) (y w2) -> b x y w1 w2 d', w1 = w, w2 = w)  # block-like attention\n        self.attn_block = PreNormResidual(inner_dim, Attention(dim = inner_dim, dim_head = d_head, dropout = dropout, window_size = w))\n        self.ff_block = PreNormResidual(inner_dim, FeedForward(dim = inner_dim, dropout = dropout))\n        self.rearrange_block_out = Rearrange('b x y w1 w2 d -> b d (x w1) (y w2)')\n\n        self.rearrange_grid_in = Rearrange('b d (w1 x) (w2 y) -> b x y w1 w2 d', w1 = w, w2 = w)  # grid-like attention\n        self.attn_grid = PreNormResidual(inner_dim, Attention(dim = inner_dim, dim_head = d_head, dropout = dropout, window_size = w))\n        self.ff_grid = PreNormResidual(inner_dim, FeedForward(dim = inner_dim, dropout = dropout))\n        self.rearrange_grid_out = Rearrange('b x y w1 w2 d -> b d (w1 x) (w2 y)')\n\n    def forward(self, x, context=None):\n        context = default(context, x)\n\n        # down sample context if necessary\n        # this is due to the implementation of max crossattn here\n        if context.shape[2] != x.shape[2]:\n            stride = context.shape[2] // x.shape[2]\n            context = torch.nn.functional.avg_pool2d(context, kernel_size=stride, stride=stride)\n\n        # MBConv\n        c = self.proj_in(context)\n\n        # block attention\n        x = self.rearrange_block_in(x)   \n        c = self.rearrange_block_in(c)\n        x = self.attn_block(x, c)\n        x = self.ff_block(x)\n        x = self.rearrange_block_out(x)\n        c = self.rearrange_block_out(c)\n\n        # grid attention\n        x = 
self.rearrange_grid_in(x)\n        c = self.rearrange_grid_in(c)\n        x = self.attn_grid(x, c)\n        x = self.ff_grid(x)\n        x = self.rearrange_grid_out(x)\n        \n        return x"
  },
  {
    "path": "ldm/util.py",
    "content": "import importlib\n\nimport torch\nimport numpy as np\nfrom collections import abc\nfrom einops import rearrange\nfrom functools import partial\n\nimport multiprocessing as mp\nfrom threading import Thread\nfrom queue import Queue\n\nfrom inspect import isfunction\nfrom PIL import Image, ImageDraw, ImageFont\n\n\ndef log_txt_as_img(wh, xc, size=10):\n    # wh a tuple of (width, height)\n    # xc a list of captions to plot\n    b = len(xc)\n    txts = list()\n    for bi in range(b):\n        txt = Image.new(\"RGB\", wh, color=\"white\")\n        draw = ImageDraw.Draw(txt)\n        font = ImageFont.truetype('data/DejaVuSans.ttf', size=size)\n        nc = int(40 * (wh[0] / 256))\n        lines = \"\\n\".join(xc[bi][start:start + nc] for start in range(0, len(xc[bi]), nc))\n\n        try:\n            draw.text((0, 0), lines, fill=\"black\", font=font)\n        except UnicodeEncodeError:\n            print(\"Cant encode string for logging. Skipping.\")\n\n        txt = np.array(txt).transpose(2, 0, 1) / 127.5 - 1.0\n        txts.append(txt)\n    txts = np.stack(txts)\n    txts = torch.tensor(txts)\n    return txts\n\n\ndef ismap(x):\n    if not isinstance(x, torch.Tensor):\n        return False\n    return (len(x.shape) == 4) and (x.shape[1] > 3)\n\n\ndef isimage(x):\n    if not isinstance(x, torch.Tensor):\n        return False\n    return (len(x.shape) == 4) and (x.shape[1] == 3 or x.shape[1] == 1)\n\n\ndef exists(x):\n    return x is not None\n\n\ndef default(val, d):\n    if exists(val):\n        return val\n    return d() if isfunction(d) else d\n\n\ndef mean_flat(tensor):\n    \"\"\"\n    https://github.com/openai/guided-diffusion/blob/27c20a8fab9cb472df5d6bdd6c8d11c8f430b924/guided_diffusion/nn.py#L86\n    Take the mean over all non-batch dimensions.\n    \"\"\"\n    return tensor.mean(dim=list(range(1, len(tensor.shape))))\n\n\ndef count_params(model, verbose=False):\n    total_params = sum(p.numel() for p in model.parameters())\n    if 
verbose:\n        print(f\"{model.__class__.__name__} has {total_params * 1.e-6:.2f} M params.\")\n    return total_params\n\n\ndef instantiate_from_config(config):\n    if not \"target\" in config:\n        if config == '__is_first_stage__':\n            return None\n        elif config == \"__is_unconditional__\":\n            return None\n        raise KeyError(\"Expected key `target` to instantiate.\")\n    return get_obj_from_str(config[\"target\"])(**config.get(\"params\", dict()))\n\n\ndef get_obj_from_str(string, reload=False):\n    module, cls = string.rsplit(\".\", 1)\n    if reload:\n        module_imp = importlib.import_module(module)\n        importlib.reload(module_imp)\n    return getattr(importlib.import_module(module, package=None), cls)\n\n\ndef _do_parallel_data_prefetch(func, Q, data, idx, idx_to_fn=False):\n    # create dummy dataset instance\n\n    # run prefetching\n    if idx_to_fn:\n        res = func(data, worker_id=idx)\n    else:\n        res = func(data)\n    Q.put([idx, res])\n    Q.put(\"Done\")\n\n\ndef parallel_data_prefetch(\n        func: callable, data, n_proc, target_data_type=\"ndarray\", cpu_intensive=True, use_worker_id=False\n):\n    # if target_data_type not in [\"ndarray\", \"list\"]:\n    #     raise ValueError(\n    #         \"Data, which is passed to parallel_data_prefetch has to be either of type list or ndarray.\"\n    #     )\n    if isinstance(data, np.ndarray) and target_data_type == \"list\":\n        raise ValueError(\"list expected but function got ndarray.\")\n    elif isinstance(data, abc.Iterable):\n        if isinstance(data, dict):\n            print(\n                f'WARNING:\"data\" argument passed to parallel_data_prefetch is a dict: Using only its values and disregarding keys.'\n            )\n            data = list(data.values())\n        if target_data_type == \"ndarray\":\n            data = np.asarray(data)\n        else:\n            data = list(data)\n    else:\n        raise TypeError(\n      
      f\"The data to be processed in parallel must be either an np.ndarray or an Iterable, but is actually {type(data)}.\"\n        )\n\n    if cpu_intensive:\n        Q = mp.Queue(1000)\n        proc = mp.Process\n    else:\n        Q = Queue(1000)\n        proc = Thread\n    # spawn processes\n    if target_data_type == \"ndarray\":\n        arguments = [\n            [func, Q, part, i, use_worker_id]\n            for i, part in enumerate(np.array_split(data, n_proc))\n        ]\n    else:\n        step = (\n            int(len(data) / n_proc + 1)\n            if len(data) % n_proc != 0\n            else int(len(data) / n_proc)\n        )\n        arguments = [\n            [func, Q, part, i, use_worker_id]\n            for i, part in enumerate(\n                [data[i: i + step] for i in range(0, len(data), step)]\n            )\n        ]\n    processes = []\n    for i in range(n_proc):\n        p = proc(target=_do_parallel_data_prefetch, args=arguments[i])\n        processes += [p]\n\n    # start processes\n    print(\"Start prefetching...\")\n    import time\n\n    start = time.time()\n    gather_res = [[] for _ in range(n_proc)]\n    try:\n        for p in processes:\n            p.start()\n\n        k = 0\n        while k < n_proc:\n            # get result\n            res = Q.get()\n            if res == \"Done\":\n                k += 1\n            else:\n                gather_res[res[0]] = res[1]\n\n    except Exception as e:\n        print(\"Exception: \", e)\n        for p in processes:\n            p.terminate()\n\n        raise e\n    finally:\n        for p in processes:\n            p.join()\n        print(f\"Prefetching complete. 
[{time.time() - start} sec.]\")\n\n    if target_data_type == 'ndarray':\n        if not isinstance(gather_res[0], np.ndarray):\n            return np.concatenate([np.asarray(r) for r in gather_res], axis=0)\n\n        # order outputs\n        return np.concatenate(gather_res, axis=0)\n    elif target_data_type == 'list':\n        out = []\n        for r in gather_res:\n            out.extend(r)\n        return out\n    else:\n        return gather_res\n"
  },
  {
    "path": "main.py",
    "content": "import argparse, os, sys, datetime, glob\nimport numpy as np\nimport time\nimport torch\nimport torchvision\nimport pytorch_lightning as pl\n\nfrom packaging import version\nfrom omegaconf import OmegaConf\nfrom torch.utils.data import DataLoader, Dataset\nfrom functools import partial\nfrom PIL import Image\n\nfrom pytorch_lightning import seed_everything\nfrom pytorch_lightning.trainer import Trainer\nfrom pytorch_lightning.callbacks import ModelCheckpoint, Callback, LearningRateMonitor\nfrom pytorch_lightning.utilities.distributed import rank_zero_only\nfrom pytorch_lightning.utilities import rank_zero_info\n\nfrom ldm.util import instantiate_from_config\n\n\ndef get_parser(**parser_kwargs):\n    def str2bool(v):\n        if isinstance(v, bool):\n            return v\n        if v.lower() in (\"yes\", \"true\", \"t\", \"y\", \"1\"):\n            return True\n        elif v.lower() in (\"no\", \"false\", \"f\", \"n\", \"0\"):\n            return False\n        else:\n            raise argparse.ArgumentTypeError(\"Boolean value expected.\")\n\n    parser = argparse.ArgumentParser(**parser_kwargs)\n    parser.add_argument(\n        \"-n\",\n        \"--name\",\n        type=str,\n        const=True,\n        default=\"\",\n        nargs=\"?\",\n        help=\"postfix for logdir\",\n    )\n    parser.add_argument(\n        \"-r\",\n        \"--resume\",\n        type=str,\n        const=True,\n        default=\"\",\n        nargs=\"?\",\n        help=\"resume from checkpoint\",\n    )\n    parser.add_argument(\n        \"-b\",\n        \"--base\",\n        nargs=\"*\",\n        metavar=\"base_config.yaml\",\n        help=\"paths to base configs. Loaded from left-to-right. 
\"\n             \"Parameters can be overwritten or added with command-line options of the form `--key value`.\",\n        default=list(),\n    )\n    parser.add_argument(\n        \"-t\",\n        \"--train\",\n        type=str2bool,\n        const=True,\n        default=False,\n        nargs=\"?\",\n        help=\"train\",\n    )\n    parser.add_argument(\n        \"--no-test\",\n        type=str2bool,\n        const=True,\n        default=False,\n        nargs=\"?\",\n        help=\"disable test\",\n    )\n    parser.add_argument(\n        \"-p\",\n        \"--project\",\n        help=\"name of new or path to existing project\"\n    )\n    parser.add_argument(\n        \"-d\",\n        \"--debug\",\n        type=str2bool,\n        nargs=\"?\",\n        const=True,\n        default=False,\n        help=\"enable post-mortem debugging\",\n    )\n    parser.add_argument(\n        \"-s\",\n        \"--seed\",\n        type=int,\n        default=23,\n        help=\"seed for seed_everything\",\n    )\n    parser.add_argument(\n        \"-f\",\n        \"--postfix\",\n        type=str,\n        default=\"\",\n        help=\"post-postfix for default name\",\n    )\n    parser.add_argument(\n        \"-l\",\n        \"--logdir\",\n        type=str,\n        default=\"logs\",\n        help=\"directory for logging dat shit\",\n    )\n    parser.add_argument(\n        \"--scale_lr\",\n        type=str2bool,\n        nargs=\"?\",\n        const=True,\n        default=True,\n        help=\"scale base-lr by ngpu * batch_size * n_accumulate\",\n    )\n    return parser\n\n\ndef nondefault_trainer_args(opt):\n    parser = argparse.ArgumentParser()\n    parser = Trainer.add_argparse_args(parser)\n    args = parser.parse_args([])\n    return sorted(k for k in vars(args) if getattr(opt, k) != getattr(args, k))\n\n\nclass WrappedDataset(Dataset):\n    \"\"\"Wraps an arbitrary object with __len__ and __getitem__ into a pytorch dataset\"\"\"\n\n    def __init__(self, dataset):\n        
self.data = dataset\n\n    def __len__(self):\n        return len(self.data)\n\n    def __getitem__(self, idx):\n        return self.data[idx]\n\n\ndef worker_init_fn(_):\n    worker_info = torch.utils.data.get_worker_info()\n\n    dataset = worker_info.dataset\n    worker_id = worker_info.id\n\n    return np.random.seed(np.random.get_state()[1][0] + worker_id)\n\n\nclass DataModuleFromConfig(pl.LightningDataModule):\n    def __init__(self, batch_size, train=None, validation=None, test=None, predict=None,\n                 wrap=False, num_workers=None, shuffle_test_loader=False, use_worker_init_fn=False,\n                 shuffle_val_dataloader=False):\n        super().__init__()\n        self.batch_size = batch_size\n        self.dataset_configs = dict()\n        self.num_workers = num_workers if num_workers is not None else batch_size * 2\n        self.use_worker_init_fn = use_worker_init_fn\n        if train is not None:\n            self.dataset_configs[\"train\"] = train\n            self.train_dataloader = self._train_dataloader\n        if validation is not None:\n            self.dataset_configs[\"validation\"] = validation\n            self.val_dataloader = partial(self._val_dataloader, shuffle=shuffle_val_dataloader)\n        if test is not None:\n            self.dataset_configs[\"test\"] = test\n            self.test_dataloader = partial(self._test_dataloader, shuffle=shuffle_test_loader)\n        if predict is not None:\n            self.dataset_configs[\"predict\"] = predict\n            self.predict_dataloader = self._predict_dataloader\n        self.wrap = wrap\n\n    def prepare_data(self):\n        for data_cfg in self.dataset_configs.values():\n            instantiate_from_config(data_cfg)\n\n    def setup(self, stage=None):\n        self.datasets = dict(\n            (k, instantiate_from_config(self.dataset_configs[k]))\n            for k in self.dataset_configs)\n        if self.wrap:\n            for k in self.datasets:\n                
self.datasets[k] = WrappedDataset(self.datasets[k])\n\n    def _train_dataloader(self):\n        if self.use_worker_init_fn:\n            init_fn = worker_init_fn\n        else:\n            init_fn = None\n        return DataLoader(self.datasets[\"train\"], batch_size=self.batch_size,\n                          num_workers=self.num_workers, shuffle=True,\n                          worker_init_fn=init_fn)\n\n    def _val_dataloader(self, shuffle=False):\n        if self.use_worker_init_fn:\n            init_fn = worker_init_fn\n        else:\n            init_fn = None\n        return DataLoader(self.datasets[\"validation\"],\n                          batch_size=self.batch_size,\n                          num_workers=self.num_workers,\n                          worker_init_fn=init_fn,\n                          shuffle=shuffle)\n\n    def _test_dataloader(self, shuffle=False):\n        if self.use_worker_init_fn:\n            init_fn = worker_init_fn\n        else:\n            init_fn = None\n\n        return DataLoader(self.datasets[\"test\"], batch_size=self.batch_size,\n                          num_workers=self.num_workers, worker_init_fn=init_fn, shuffle=shuffle)\n\n    def _predict_dataloader(self, shuffle=False):\n        if self.use_worker_init_fn:\n            init_fn = worker_init_fn\n        else:\n            init_fn = None\n        return DataLoader(self.datasets[\"predict\"], batch_size=self.batch_size,\n                          num_workers=self.num_workers, worker_init_fn=init_fn)\n\n\nclass SetupCallback(Callback):\n    def __init__(self, resume, now, logdir, ckptdir, cfgdir, config, lightning_config):\n        super().__init__()\n        self.resume = resume\n        self.now = now\n        self.logdir = logdir\n        self.ckptdir = ckptdir\n        self.cfgdir = cfgdir\n        self.config = config\n        self.lightning_config = lightning_config\n\n    def on_keyboard_interrupt(self, trainer, pl_module):\n        if trainer.global_rank == 
0:\n            print(\"Summoning checkpoint.\")\n            ckpt_path = os.path.join(self.ckptdir, \"last.ckpt\")\n            trainer.save_checkpoint(ckpt_path)\n\n    def on_fit_start(self, trainer, pl_module):\n        if trainer.global_rank == 0:\n            # Create logdirs and save configs\n            os.makedirs(self.logdir, exist_ok=True)\n            os.makedirs(self.ckptdir, exist_ok=True)\n            os.makedirs(self.cfgdir, exist_ok=True)\n\n            if \"callbacks\" in self.lightning_config:\n                if 'metrics_over_trainsteps_checkpoint' in self.lightning_config['callbacks']:\n                    os.makedirs(os.path.join(self.ckptdir, 'trainstep_checkpoints'), exist_ok=True)\n            print(\"Project config\")\n            print(OmegaConf.to_yaml(self.config))\n            OmegaConf.save(self.config,\n                           os.path.join(self.cfgdir, \"{}-project.yaml\".format(self.now)))\n\n            print(\"Lightning config\")\n            print(OmegaConf.to_yaml(self.lightning_config))\n            OmegaConf.save(OmegaConf.create({\"lightning\": self.lightning_config}),\n                           os.path.join(self.cfgdir, \"{}-lightning.yaml\".format(self.now)))\n\n        else:\n            # ModelCheckpoint callback created log directory --- remove it\n            if not self.resume and os.path.exists(self.logdir):\n                dst, name = os.path.split(self.logdir)\n                dst = os.path.join(dst, \"child_runs\", name)\n                os.makedirs(os.path.split(dst)[0], exist_ok=True)\n                try:\n                    os.rename(self.logdir, dst)\n                except FileNotFoundError:\n                    pass\n\n\nclass ImageLogger(Callback):\n    def __init__(self, batch_frequency, val_batch_frequency, max_images, clamp=True, increase_log_steps=True,\n                 rescale=True, disabled=False, log_on_batch_idx=False, log_first_step=False,\n                 log_images_kwargs=None):\n        
super().__init__()\n        self.rescale = rescale\n        self.batch_freq = batch_frequency\n        self.val_batch_frequency = val_batch_frequency\n        self.max_images = max_images\n        self.logger_log_images = {\n            pl.loggers.TensorBoardLogger: self._testtube,\n        }\n        self.log_steps = [2 ** n for n in range(int(np.log2(self.batch_freq)) + 1)]\n        if not increase_log_steps:\n            self.log_steps = [self.batch_freq]\n        self.clamp = clamp\n        self.disabled = disabled\n        self.log_on_batch_idx = log_on_batch_idx\n        self.log_images_kwargs = log_images_kwargs if log_images_kwargs else {}\n        self.log_first_step = log_first_step\n        self.val_psnr_epoch = []\n\n    @rank_zero_only\n    def _testtube(self, pl_module, images, batch_idx, split):\n        for k in images:\n            grid = torchvision.utils.make_grid(images[k])\n            grid = (grid + 1.0) / 2.0  # -1,1 -> 0,1; c,h,w\n\n            tag = f\"{split}/{k}\"\n            pl_module.logger.experiment.add_image(\n                tag, grid,\n                global_step=pl_module.global_step)\n\n    @rank_zero_only\n    def log_local(self, save_dir, split, images,\n                  global_step, current_epoch, batch_idx):\n        root = os.path.join(save_dir, \"images\", split)\n        for k in images:\n            grid = torchvision.utils.make_grid(images[k], nrow=4)\n            if self.rescale:\n                grid = (grid + 1.0) / 2.0  # -1,1 -> 0,1; c,h,w\n            grid = grid.transpose(0, 1).transpose(1, 2).squeeze(-1)\n            grid = grid.numpy()\n            grid = (grid * 255).astype(np.uint8)\n            filename = \"{}_gs-{:06}_e-{:06}_b-{:06}.png\".format(\n                k,\n                global_step,\n                current_epoch,\n                batch_idx)\n            path = os.path.join(root, filename)\n            os.makedirs(os.path.split(path)[0], exist_ok=True)\n            
Image.fromarray(grid).save(path)\n\n    def log_img(self, pl_module, batch, batch_idx, split=\"train\"):\n        log = False\n        check_idx = batch_idx if self.log_on_batch_idx else pl_module.global_step\n        if (split == 'train' and\n            self.check_frequency(check_idx) and  # batch_idx % self.batch_freq == 0\n            hasattr(pl_module, \"log_images\") and\n            callable(pl_module.log_images) and\n            self.max_images > 0):\n            log = True\n\n        elif (split == 'val' and\n              (batch_idx % self.val_batch_frequency) == 0 and\n              hasattr(pl_module, \"log_images\") and\n              callable(pl_module.log_images) and\n              self.max_images > 0):\n            log = True\n\n        if log:\n            logger = type(pl_module.logger)\n\n            is_train = pl_module.training\n            if is_train:\n                pl_module.eval()\n\n            with torch.no_grad():\n                images = pl_module.log_images(batch, split=split, **self.log_images_kwargs)\n\n            for k in images:\n                N = min(images[k].shape[0], self.max_images)\n                images[k] = images[k][:N]\n                if isinstance(images[k], torch.Tensor):\n                    images[k] = images[k].detach().cpu()\n                    if self.clamp:\n                        images[k] = torch.clamp(images[k], -1., 1.)\n            \n            # calculate PSNR using images['samples']\n            if split == 'val':\n                out = images['samples'] if 'samples' in images.keys() else images['reconstructions']\n                samples = ((out+1.0)/2.0).mul(255).round()\n                gts = ((batch['image'][:N]+1.0)/2.0).mul(255).round().permute((0,3,1,2)).cpu()\n                mse = torch.mean((samples - gts)**2, dim=1).mean(1).mean(1)\n                psnr = -10 * torch.log10(mse/255**2 + 1e-8)\n                self.val_psnr_epoch.append(psnr)\n\n            
self.log_local(pl_module.logger.save_dir, split, images,\n                           pl_module.global_step, pl_module.current_epoch, batch_idx)\n\n            logger_log_images = self.logger_log_images.get(logger, lambda *args, **kwargs: None)\n            logger_log_images(pl_module, images, pl_module.global_step, split)\n\n            if is_train:\n                pl_module.train()\n\n    def check_frequency(self, check_idx):\n        if ((check_idx % self.batch_freq) == 0 or (check_idx in self.log_steps)) and (\n                check_idx > 0 or self.log_first_step):\n            try:\n                self.log_steps.pop(0)\n            except IndexError as e:\n                print(e)\n                pass\n            return True\n        return False\n\n    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=0):\n        if not self.disabled and (pl_module.global_step > 0 or self.log_first_step):\n            self.log_img(pl_module, batch, batch_idx, split=\"train\")\n\n    def on_validation_batch_end(self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=0):\n        if not self.disabled and pl_module.global_step > 0:\n            self.log_img(pl_module, batch, batch_idx, split=\"val\")\n        if hasattr(pl_module, 'calibrate_grad_norm'):\n            if (pl_module.calibrate_grad_norm and batch_idx % 25 == 0) and batch_idx > 0:\n                self.log_gradients(trainer, pl_module, batch_idx=batch_idx)\n\n    def on_validation_epoch_end(self, trainer, pl_module):\n        if len(self.val_psnr_epoch) > 0:\n            epoch_psnr = torch.cat(self.val_psnr_epoch).mean().item()\n            pl_module.log_dict({'val/psnr':epoch_psnr}, prog_bar=False, logger=True, on_step=False, on_epoch=True)\n            self.val_psnr_epoch = []\n\nclass CUDACallback(Callback):\n    # see https://github.com/SeanNaren/minGPT/blob/master/mingpt/callback.py\n    def on_train_epoch_start(self, trainer, pl_module):\n        # Reset 
the memory use counter\n        torch.cuda.reset_peak_memory_stats(trainer.strategy.root_device.index)\n        torch.cuda.synchronize(trainer.strategy.root_device.index)\n        self.start_time = time.time()\n\n    def on_train_epoch_end(self, trainer, pl_module, outputs=None):\n        torch.cuda.synchronize(trainer.strategy.root_device.index)\n        max_memory = torch.cuda.max_memory_allocated(trainer.strategy.root_device.index) / 2 ** 20\n        epoch_time = time.time() - self.start_time\n\n        try:\n            max_memory = trainer.strategy.reduce(max_memory)\n            epoch_time = trainer.strategy.reduce(epoch_time)\n\n            rank_zero_info(f\"Average Epoch time: {epoch_time:.2f} seconds\")\n            rank_zero_info(f\"Average Peak memory {max_memory:.2f}MiB\")\n        except AttributeError:\n            pass\n\n\nif __name__ == \"__main__\":\n\n    now = datetime.datetime.now().strftime(\"%Y-%m-%dT%H-%M-%S\")\n\n    # add cwd for convenience and to make classes in this file available when\n    # running as `python main.py`\n    # (in particular `main.DataModuleFromConfig`)\n    sys.path.append(os.getcwd())\n\n    parser = get_parser()\n    parser = Trainer.add_argparse_args(parser)\n\n    opt, unknown = parser.parse_known_args()\n    if opt.name and opt.resume:\n        raise ValueError(\n            \"-n/--name and -r/--resume cannot be specified both.\"\n            \"If you want to resume training in a new log folder, \"\n            \"use -n/--name in combination with --resume_from_checkpoint\"\n        )\n    if opt.resume:\n        if not os.path.exists(opt.resume):\n            raise ValueError(\"Cannot find {}\".format(opt.resume))\n        if os.path.isfile(opt.resume):\n            paths = opt.resume.split(\"/\")\n            # idx = len(paths)-paths[::-1].index(\"logs\")+1\n            # logdir = \"/\".join(paths[:idx])\n            logdir = \"/\".join(paths[:-2])\n            ckpt = opt.resume\n        else:\n            assert 
os.path.isdir(opt.resume), opt.resume\n            logdir = opt.resume.rstrip(\"/\")\n            ckpt = os.path.join(logdir, \"checkpoints\", \"last.ckpt\")\n\n        opt.resume_from_checkpoint = ckpt\n        base_configs = sorted(glob.glob(os.path.join(logdir, \"configs/*.yaml\")))\n        opt.base = base_configs + opt.base\n        _tmp = logdir.split(\"/\")\n        nowname = _tmp[-1]\n    else:\n        if opt.name:\n            name = \"_\" + opt.name\n        elif opt.base:\n            cfg_fname = os.path.split(opt.base[0])[-1]\n            cfg_name = os.path.splitext(cfg_fname)[0]\n            name = \"_\" + cfg_name\n        else:\n            name = \"\"\n        nowname = now + name + opt.postfix\n        logdir = os.path.join(opt.logdir, nowname)\n\n    ckptdir = os.path.join(logdir, \"checkpoints\")\n    cfgdir = os.path.join(logdir, \"configs\")\n    seed_everything(opt.seed)\n\n    try:\n        # init and save configs\n        configs = [OmegaConf.load(cfg) for cfg in opt.base]\n        cli = OmegaConf.from_dotlist(unknown)\n        config = OmegaConf.merge(*configs, cli)\n        lightning_config = config.pop(\"lightning\", OmegaConf.create())\n        # merge trainer cli with config\n        trainer_config = lightning_config.get(\"trainer\", OmegaConf.create())\n        # default to ddp\n        # trainer_config[\"accelerator\"] = \"ddp\"\n        # TODO: confirm why defaulting to ddp doesn't work\n        trainer_config[\"accelerator\"] = \"gpu\"\n        for k in nondefault_trainer_args(opt):\n            trainer_config[k] = getattr(opt, k)\n        if not \"gpus\" in trainer_config:\n            del trainer_config[\"accelerator\"]\n            cpu = True\n        else:\n            gpuinfo = trainer_config[\"gpus\"]\n            print(f\"Running on GPUs {gpuinfo}\")\n            cpu = False\n        trainer_config['devices'] = trainer_config.pop('gpus')\n        trainer_opt = argparse.Namespace(**trainer_config)\n        
lightning_config.trainer = trainer_config\n\n        # model\n        model = instantiate_from_config(config.model)\n\n        # trainer and callbacks\n        trainer_kwargs = dict()\n\n        # default logger configs\n        default_logger_cfgs = {\n            \"wandb\": {\n                \"target\": \"pytorch_lightning.loggers.WandbLogger\",\n                \"params\": {\n                    \"name\": nowname,\n                    \"save_dir\": logdir,\n                    \"offline\": opt.debug,\n                    \"id\": nowname,\n                }\n            },\n            \"testtube\": {\n                \"target\": \"pytorch_lightning.loggers.TensorBoardLogger\",\n                \"params\": {\n                    \"name\": \"testtube\",\n                    \"save_dir\": logdir,\n                }\n            },\n        }\n        default_logger_cfg = default_logger_cfgs[\"testtube\"]\n        if \"logger\" in lightning_config:\n            logger_cfg = lightning_config.logger\n        else:\n            logger_cfg = OmegaConf.create()\n        logger_cfg = OmegaConf.merge(default_logger_cfg, logger_cfg)\n        trainer_kwargs[\"logger\"] = instantiate_from_config(logger_cfg)\n\n        # modelcheckpoint - use TrainResult/EvalResult(checkpoint_on=metric) to\n        # specify which metric is used to determine best models\n        default_modelckpt_cfg = {\n            \"target\": \"pytorch_lightning.callbacks.ModelCheckpoint\",\n            \"params\": {\n                \"dirpath\": ckptdir,\n                \"filename\": \"{epoch:06}\",\n                \"verbose\": True,\n                \"save_last\": True,\n            }\n        }\n        if hasattr(model, \"monitor\"):\n            print(f\"Monitoring {model.monitor} as checkpoint metric.\")\n            default_modelckpt_cfg[\"params\"][\"monitor\"] = model.monitor\n            default_modelckpt_cfg[\"params\"][\"save_top_k\"] = 3\n\n        if \"modelcheckpoint\" in 
lightning_config:\n            modelckpt_cfg = lightning_config.modelcheckpoint\n        else:\n            modelckpt_cfg =  OmegaConf.create()\n        modelckpt_cfg = OmegaConf.merge(default_modelckpt_cfg, modelckpt_cfg)\n        print(f\"Merged modelckpt-cfg: \\n{modelckpt_cfg}\")\n        if version.parse(pl.__version__) < version.parse('1.4.0'):\n            trainer_kwargs[\"checkpoint_callback\"] = instantiate_from_config(modelckpt_cfg)\n\n        # add callback which sets up log directory\n        default_callbacks_cfg = {\n            \"setup_callback\": {\n                \"target\": \"main.SetupCallback\",\n                \"params\": {\n                    \"resume\": opt.resume,\n                    \"now\": now,\n                    \"logdir\": logdir,\n                    \"ckptdir\": ckptdir,\n                    \"cfgdir\": cfgdir,\n                    \"config\": config,\n                    \"lightning_config\": lightning_config,\n                }\n            },\n            \"image_logger\": {\n                \"target\": \"main.ImageLogger\",\n                \"params\": {\n                    \"batch_frequency\": 750,\n                    \"max_images\": 4,\n                    \"clamp\": True\n                }\n            },\n            \"learning_rate_logger\": {\n                \"target\": \"main.LearningRateMonitor\",\n                \"params\": {\n                    \"logging_interval\": \"step\",\n                    # \"log_momentum\": True\n                }\n            },\n            \"cuda_callback\": {\n                \"target\": \"main.CUDACallback\"\n            },\n            \"progress_bar\": {\n                \"target\": 'pytorch_lightning.callbacks.TQDMProgressBar',\n                \"params\": {\n                    'refresh_rate': 100,\n                }\n            },\n        }\n        if version.parse(pl.__version__) >= version.parse('1.4.0'):\n            default_callbacks_cfg.update({'checkpoint_callback': 
modelckpt_cfg})\n\n        if \"callbacks\" in lightning_config:\n            callbacks_cfg = lightning_config.callbacks\n        else:\n            callbacks_cfg = OmegaConf.create()\n\n        if 'metrics_over_trainsteps_checkpoint' in callbacks_cfg:\n            print(\n                'Caution: Saving checkpoints every n train steps without deleting. This might require some free space.')\n            default_metrics_over_trainsteps_ckpt_dict = {\n                'metrics_over_trainsteps_checkpoint':\n                    {\"target\": 'pytorch_lightning.callbacks.ModelCheckpoint',\n                     'params': {\n                         \"dirpath\": os.path.join(ckptdir, 'trainstep_checkpoints'),\n                         \"filename\": \"{epoch:06}-{step:09}\",\n                         \"verbose\": True,\n                         'save_top_k': -1,\n                         'every_n_train_steps': 10000,\n                         'save_weights_only': True\n                     }\n                    }\n            }\n            default_callbacks_cfg.update(default_metrics_over_trainsteps_ckpt_dict)\n\n        callbacks_cfg = OmegaConf.merge(default_callbacks_cfg, callbacks_cfg)\n        if 'ignore_keys_callback' in callbacks_cfg and hasattr(trainer_opt, 'resume_from_checkpoint'):\n            callbacks_cfg.ignore_keys_callback.params['ckpt_path'] = trainer_opt.resume_from_checkpoint\n        elif 'ignore_keys_callback' in callbacks_cfg:\n            del callbacks_cfg['ignore_keys_callback']\n\n        trainer_kwargs[\"callbacks\"] = [instantiate_from_config(callbacks_cfg[k]) for k in callbacks_cfg]\n\n        trainer = Trainer.from_argparse_args(trainer_opt, **trainer_kwargs)\n        trainer.logdir = logdir  ###\n\n        # data\n        data = instantiate_from_config(config.data)\n        # NOTE according to https://pytorch-lightning.readthedocs.io/en/latest/datamodules.html\n        # calling these ourselves should not be necessary but it is.\n        # 
lightning still takes care of proper multiprocessing though\n        data.prepare_data()\n        data.setup()\n        print(\"#### Data #####\")\n        for k in data.datasets:\n            print(f\"{k}, {data.datasets[k].__class__.__name__}, {len(data.datasets[k])}\")\n\n        # configure learning rate\n        bs, base_lr = config.data.params.batch_size, config.model.base_learning_rate\n        if not cpu:\n            ngpu = len(lightning_config.trainer.devices.strip(\",\").split(','))\n        else:\n            ngpu = 1\n        if 'accumulate_grad_batches' in lightning_config.trainer:\n            accumulate_grad_batches = lightning_config.trainer.accumulate_grad_batches\n        else:\n            accumulate_grad_batches = 1\n        print(f\"accumulate_grad_batches = {accumulate_grad_batches}\")\n        lightning_config.trainer.accumulate_grad_batches = accumulate_grad_batches\n        if opt.scale_lr:\n            model.learning_rate = accumulate_grad_batches * ngpu * bs * base_lr\n            print(\n                \"Setting learning rate to {:.2e} = {} (accumulate_grad_batches) * {} (num_gpus) * {} (batchsize) * {:.2e} (base_lr)\".format(\n                    model.learning_rate, accumulate_grad_batches, ngpu, bs, base_lr))\n        else:\n            model.learning_rate = base_lr\n            print(\"++++ NOT USING LR SCALING ++++\")\n            print(f\"Setting learning rate to {model.learning_rate:.2e}\")\n\n\n        # allow checkpointing via USR1\n        def melk(*args, **kwargs):\n            # run all checkpoint hooks\n            if trainer.global_rank == 0:\n                print(\"Summoning checkpoint.\")\n                ckpt_path = os.path.join(ckptdir, \"last.ckpt\")\n                trainer.save_checkpoint(ckpt_path)\n\n\n        def divein(*args, **kwargs):\n            if trainer.global_rank == 0:\n                import pudb;\n                pudb.set_trace()\n\n\n        import signal\n        import platform\n        # see 
https://github.com/rinongal/textual_inversion/issues/44\n        if platform.system() == 'Windows':\n            os.environ[\"PL_TORCH_DISTRIBUTED_BACKEND\"] = \"gloo\"\n            # Windows has no SIGUSR1/SIGUSR2; registering both handlers on SIGTERM\n            # would let the second silently replace the first, so use SIGBREAK for the debugger\n            signal.signal(signal.SIGTERM, melk)\n            signal.signal(signal.SIGBREAK, divein)\n        else:\n            signal.signal(signal.SIGUSR1, melk)\n            signal.signal(signal.SIGUSR2, divein)\n\n        # run\n        if opt.train:\n            try:\n                trainer.fit(model, data)\n            except Exception:\n                melk()\n                raise\n        if not opt.no_test and not trainer.interrupted:\n            trainer.test(model, data)\n    except Exception:\n        if opt.debug and trainer.global_rank == 0:\n            try:\n                import pudb as debugger\n            except ImportError:\n                import pdb as debugger\n            debugger.post_mortem()\n        raise\n    finally:\n        # move newly created debug project to debug_runs\n        if opt.debug and not opt.resume and trainer.global_rank == 0:\n            dst, name = os.path.split(logdir)\n            dst = os.path.join(dst, \"debug_runs\", name)\n            os.makedirs(os.path.split(dst)[0], exist_ok=True)\n            os.rename(logdir, dst)\n        if trainer.global_rank == 0:\n            print(trainer.profiler.summary())\n"
  },
  {
    "path": "metrics/flolpips/.gitignore",
    "content": "*__pycache__*\n*.ipynb\n*delete*"
  },
  {
    "path": "metrics/flolpips/LICENSE",
    "content": "MIT License\n\nCopyright (c) 2022 danielism97\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "metrics/flolpips/README.md",
    "content": "# FloLPIPS: A bespoke video quality metric for frame interpolation\n\n### Duolikun Danier, Fan Zhang, David Bull\n\n\n[Project](https://danielism97.github.io/FloLPIPS) | [arXiv](https://arxiv.org/abs/2207.08119)\n\n\n## Dependencies\nThe following packages were used to evaluate the model.\n\n- python==3.8.8\n- pytorch==1.7.1\n- torchvision==0.8.2\n- cudatoolkit==10.1.243\n- opencv-python==4.5.1.48\n- numpy==1.19.2\n- pillow==8.1.2\n- cupy==9.0.0\n\n\n## Usage\n```python\nfrom flolpips import calc_flolpips\nref_video = '<path to the reference>.mp4'\ndis_video = '<path to the distorted>.mp4'\nres = calc_flolpips(dis_video, ref_video)\n```\n\n\n## Citation\n```\n@article{danier2022flolpips,\n  title={FloLPIPS: A Bespoke Video Quality Metric for Frame Interpolation},\n  author={Danier, Duolikun and Zhang, Fan and Bull, David},\n  journal={arXiv preprint arXiv:2207.08119},\n  year={2022}\n}\n```\n\n## Acknowledgement\nMuch of the code in this repository is adapted or taken from the following repositories:\n\n- [LPIPS](https://github.com/richzhang/PerceptualSimilarity)\n- [pytorch-pwc](https://github.com/sniklaus/pytorch-pwc)\n\nWe would like to thank the authors for sharing their code."
  },
  {
    "path": "metrics/flolpips/__init__.py",
    "content": "from .flolpips import *\n"
  },
  {
    "path": "metrics/flolpips/correlation/correlation.py",
    "content": "#!/usr/bin/env python\n\nimport torch\n\nimport cupy\nimport re\n\nkernel_Correlation_rearrange = '''\n\textern \"C\" __global__ void kernel_Correlation_rearrange(\n\t\tconst int n,\n\t\tconst float* input,\n\t\tfloat* output\n\t) {\n\t  int intIndex = (blockIdx.x * blockDim.x) + threadIdx.x;\n\n\t  if (intIndex >= n) {\n\t    return;\n\t  }\n\n\t  int intSample = blockIdx.z;\n\t  int intChannel = blockIdx.y;\n\n\t  float fltValue = input[(((intSample * SIZE_1(input)) + intChannel) * SIZE_2(input) * SIZE_3(input)) + intIndex];\n\n\t  __syncthreads();\n\n\t  int intPaddedY = (intIndex / SIZE_3(input)) + 4;\n\t  int intPaddedX = (intIndex % SIZE_3(input)) + 4;\n\t  int intRearrange = ((SIZE_3(input) + 8) * intPaddedY) + intPaddedX;\n\n\t  output[(((intSample * SIZE_1(output) * SIZE_2(output)) + intRearrange) * SIZE_1(input)) + intChannel] = fltValue;\n\t}\n'''\n\nkernel_Correlation_updateOutput = '''\n\textern \"C\" __global__ void kernel_Correlation_updateOutput(\n\t  const int n,\n\t  const float* rbot0,\n\t  const float* rbot1,\n\t  float* top\n\t) {\n\t  extern __shared__ char patch_data_char[];\n\t  \n\t  float *patch_data = (float *)patch_data_char;\n\t  \n\t  // First (upper left) position of kernel upper-left corner in current center position of neighborhood in image 1\n\t  int x1 = blockIdx.x + 4;\n\t  int y1 = blockIdx.y + 4;\n\t  int item = blockIdx.z;\n\t  int ch_off = threadIdx.x;\n\t  \n\t  // Load 3D patch into shared shared memory\n\t  for (int j = 0; j < 1; j++) { // HEIGHT\n\t    for (int i = 0; i < 1; i++) { // WIDTH\n\t      int ji_off = (j + i) * SIZE_3(rbot0);\n\t      for (int ch = ch_off; ch < SIZE_3(rbot0); ch += 32) { // CHANNELS\n\t        int idx1 = ((item * SIZE_1(rbot0) + y1+j) * SIZE_2(rbot0) + x1+i) * SIZE_3(rbot0) + ch;\n\t        int idxPatchData = ji_off + ch;\n\t        patch_data[idxPatchData] = rbot0[idx1];\n\t      }\n\t    }\n\t  }\n\t  \n\t  __syncthreads();\n\t  \n\t  __shared__ float sum[32];\n\t  \n\t  // 
Compute correlation\n\t  for (int top_channel = 0; top_channel < SIZE_1(top); top_channel++) {\n\t    sum[ch_off] = 0;\n\t  \n\t    int s2o = top_channel % 9 - 4;\n\t    int s2p = top_channel / 9 - 4;\n\t    \n\t    for (int j = 0; j < 1; j++) { // HEIGHT\n\t      for (int i = 0; i < 1; i++) { // WIDTH\n\t        int ji_off = (j + i) * SIZE_3(rbot0);\n\t        for (int ch = ch_off; ch < SIZE_3(rbot0); ch += 32) { // CHANNELS\n\t          int x2 = x1 + s2o;\n\t          int y2 = y1 + s2p;\n\t          \n\t          int idxPatchData = ji_off + ch;\n\t          int idx2 = ((item * SIZE_1(rbot0) + y2+j) * SIZE_2(rbot0) + x2+i) * SIZE_3(rbot0) + ch;\n\t          \n\t          sum[ch_off] += patch_data[idxPatchData] * rbot1[idx2];\n\t        }\n\t      }\n\t    }\n\t    \n\t    __syncthreads();\n\t    \n\t    if (ch_off == 0) {\n\t      float total_sum = 0;\n\t      for (int idx = 0; idx < 32; idx++) {\n\t        total_sum += sum[idx];\n\t      }\n\t      const int sumelems = SIZE_3(rbot0);\n\t      const int index = ((top_channel*SIZE_2(top) + blockIdx.y)*SIZE_3(top))+blockIdx.x;\n\t      top[index + item*SIZE_1(top)*SIZE_2(top)*SIZE_3(top)] = total_sum / (float)sumelems;\n\t    }\n\t  }\n\t}\n'''\n\nkernel_Correlation_updateGradFirst = '''\n\t#define ROUND_OFF 50000\n\n\textern \"C\" __global__ void kernel_Correlation_updateGradFirst(\n\t  const int n,\n\t  const int intSample,\n\t  const float* rbot0,\n\t  const float* rbot1,\n\t  const float* gradOutput,\n\t  float* gradFirst,\n\t  float* gradSecond\n\t) { for (int intIndex = (blockIdx.x * blockDim.x) + threadIdx.x; intIndex < n; intIndex += blockDim.x * gridDim.x) {\n\t  int n = intIndex % SIZE_1(gradFirst); // channels\n\t  int l = (intIndex / SIZE_1(gradFirst)) % SIZE_3(gradFirst) + 4; // w-pos\n\t  int m = (intIndex / SIZE_1(gradFirst) / SIZE_3(gradFirst)) % SIZE_2(gradFirst) + 4; // h-pos\n\t  \n\t  // round_off is a trick to enable integer division with ceil, even for negative numbers\n\t  // We use a large 
offset, for the inner part not to become negative.\n\t  const int round_off = ROUND_OFF;\n\t  const int round_off_s1 = round_off;\n\t  \n\t  // We add round_off before the int division and subtract round_off after it, to ensure the formula matches ceil behavior:\n\t  int xmin = (l - 4 + round_off_s1 - 1) + 1 - round_off; // ceil (l - 4)\n\t  int ymin = (m - 4 + round_off_s1 - 1) + 1 - round_off; // ceil (m - 4)\n\t  \n\t  // Same here:\n\t  int xmax = (l - 4 + round_off_s1) - round_off; // floor (l - 4)\n\t  int ymax = (m - 4 + round_off_s1) - round_off; // floor (m - 4)\n\t  \n\t  float sum = 0;\n\t  if (xmax>=0 && ymax>=0 && (xmin<=SIZE_3(gradOutput)-1) && (ymin<=SIZE_2(gradOutput)-1)) {\n\t    xmin = max(0,xmin);\n\t    xmax = min(SIZE_3(gradOutput)-1,xmax);\n\t    \n\t    ymin = max(0,ymin);\n\t    ymax = min(SIZE_2(gradOutput)-1,ymax);\n\t    \n\t    for (int p = -4; p <= 4; p++) {\n\t      for (int o = -4; o <= 4; o++) {\n\t        // Get rbot1 data:\n\t        int s2o = o;\n\t        int s2p = p;\n\t        int idxbot1 = ((intSample * SIZE_1(rbot0) + (m+s2p)) * SIZE_2(rbot0) + (l+s2o)) * SIZE_3(rbot0) + n;\n\t        float bot1tmp = rbot1[idxbot1]; // rbot1[l+s2o,m+s2p,n]\n\t        \n\t        // Index offset for gradOutput in following loops:\n\t        int op = (p+4) * 9 + (o+4); // index[o,p]\n\t        int idxopoffset = (intSample * SIZE_1(gradOutput) + op);\n\t        \n\t        for (int y = ymin; y <= ymax; y++) {\n\t          for (int x = xmin; x <= xmax; x++) {\n\t            int idxgradOutput = (idxopoffset * SIZE_2(gradOutput) + y) * SIZE_3(gradOutput) + x; // gradOutput[x,y,o,p]\n\t            sum += gradOutput[idxgradOutput] * bot1tmp;\n\t          }\n\t        }\n\t      }\n\t    }\n\t  }\n\t  const int sumelems = SIZE_1(gradFirst);\n\t  const int bot0index = ((n * SIZE_2(gradFirst)) + (m-4)) * SIZE_3(gradFirst) + (l-4);\n\t  gradFirst[bot0index + intSample*SIZE_1(gradFirst)*SIZE_2(gradFirst)*SIZE_3(gradFirst)] = sum / (float)sumelems;\n\t} 
}\n'''\n\nkernel_Correlation_updateGradSecond = '''\n\t#define ROUND_OFF 50000\n\n\textern \"C\" __global__ void kernel_Correlation_updateGradSecond(\n\t  const int n,\n\t  const int intSample,\n\t  const float* rbot0,\n\t  const float* rbot1,\n\t  const float* gradOutput,\n\t  float* gradFirst,\n\t  float* gradSecond\n\t) { for (int intIndex = (blockIdx.x * blockDim.x) + threadIdx.x; intIndex < n; intIndex += blockDim.x * gridDim.x) {\n\t  int n = intIndex % SIZE_1(gradSecond); // channels\n\t  int l = (intIndex / SIZE_1(gradSecond)) % SIZE_3(gradSecond) + 4; // w-pos\n\t  int m = (intIndex / SIZE_1(gradSecond) / SIZE_3(gradSecond)) % SIZE_2(gradSecond) + 4; // h-pos\n\t  \n\t  // round_off is a trick to enable integer division with ceil, even for negative numbers\n\t  // We use a large offset, for the inner part not to become negative.\n\t  const int round_off = ROUND_OFF;\n\t  const int round_off_s1 = round_off;\n\t  \n\t  float sum = 0;\n\t  for (int p = -4; p <= 4; p++) {\n\t    for (int o = -4; o <= 4; o++) {\n\t      int s2o = o;\n\t      int s2p = p;\n\t      \n\t      //Get X,Y ranges and clamp\n\t      // We add round_off before the int division and subtract round_off after it, to ensure the formula matches ceil behavior:\n\t      int xmin = (l - 4 - s2o + round_off_s1 - 1) + 1 - round_off; // ceil (l - 4 - s2o)\n\t      int ymin = (m - 4 - s2p + round_off_s1 - 1) + 1 - round_off; // ceil (m - 4 - s2p)\n\t      \n\t      // Same here:\n\t      int xmax = (l - 4 - s2o + round_off_s1) - round_off; // floor (l - 4 - s2o)\n\t      int ymax = (m - 4 - s2p + round_off_s1) - round_off; // floor (m - 4 - s2p)\n          \n\t      if (xmax>=0 && ymax>=0 && (xmin<=SIZE_3(gradOutput)-1) && (ymin<=SIZE_2(gradOutput)-1)) {\n\t        xmin = max(0,xmin);\n\t        xmax = min(SIZE_3(gradOutput)-1,xmax);\n\t        \n\t        ymin = max(0,ymin);\n\t        ymax = min(SIZE_2(gradOutput)-1,ymax);\n\t        \n\t        // Get rbot0 data:\n\t        int idxbot0 = 
((intSample * SIZE_1(rbot0) + (m-s2p)) * SIZE_2(rbot0) + (l-s2o)) * SIZE_3(rbot0) + n;\n\t        float bot0tmp = rbot0[idxbot0]; // rbot0[l-s2o,m-s2p,n]\n\t        \n\t        // Index offset for gradOutput in following loops:\n\t        int op = (p+4) * 9 + (o+4); // index[o,p]\n\t        int idxopoffset = (intSample * SIZE_1(gradOutput) + op);\n\t        \n\t        for (int y = ymin; y <= ymax; y++) {\n\t          for (int x = xmin; x <= xmax; x++) {\n\t            int idxgradOutput = (idxopoffset * SIZE_2(gradOutput) + y) * SIZE_3(gradOutput) + x; // gradOutput[x,y,o,p]\n\t            sum += gradOutput[idxgradOutput] * bot0tmp;\n\t          }\n\t        }\n\t      }\n\t    }\n\t  }\n\t  const int sumelems = SIZE_1(gradSecond);\n\t  const int bot1index = ((n * SIZE_2(gradSecond)) + (m-4)) * SIZE_3(gradSecond) + (l-4);\n\t  gradSecond[bot1index + intSample*SIZE_1(gradSecond)*SIZE_2(gradSecond)*SIZE_3(gradSecond)] = sum / (float)sumelems;\n\t} }\n'''\n\ndef cupy_kernel(strFunction, objVariables):\n\tstrKernel = globals()[strFunction]\n\n\twhile True:\n\t\tobjMatch = re.search(r'(SIZE_)([0-4])(\\()([^\\)]*)(\\))', strKernel)\n\n\t\tif objMatch is None:\n\t\t\tbreak\n\t\t# end\n\n\t\tintArg = int(objMatch.group(2))\n\n\t\tstrTensor = objMatch.group(4)\n\t\tintSizes = objVariables[strTensor].size()\n\n\t\tstrKernel = strKernel.replace(objMatch.group(), str(intSizes[intArg]))\n\t# end\n\n\twhile True:\n\t\tobjMatch = re.search(r'(VALUE_)([0-4])(\\()([^\\)]+)(\\))', strKernel)\n\n\t\tif objMatch is None:\n\t\t\tbreak\n\t\t# end\n\n\t\tintArgs = int(objMatch.group(2))\n\t\tstrArgs = objMatch.group(4).split(',')\n\n\t\tstrTensor = strArgs[0]\n\t\tintStrides = objVariables[strTensor].stride()\n\t\tstrIndex = [ '((' + strArgs[intArg + 1].replace('{', '(').replace('}', ')').strip() + ')*' + str(intStrides[intArg]) + ')' for intArg in range(intArgs) ]\n\n\t\tstrKernel = strKernel.replace(objMatch.group(0), strTensor + '[' + str.join('+', strIndex) + ']')\n\t# end\n\n\treturn 
strKernel\n# end\n\n@cupy.memoize(for_each_device=True)\ndef cupy_launch(strFunction, strKernel):\n\treturn cupy.cuda.compile_with_cache(strKernel).get_function(strFunction)\n# end\n\nclass _FunctionCorrelation(torch.autograd.Function):\n\t@staticmethod\n\tdef forward(self, first, second):\n\t\trbot0 = first.new_zeros([ first.shape[0], first.shape[2] + 8, first.shape[3] + 8, first.shape[1] ])\n\t\trbot1 = first.new_zeros([ first.shape[0], first.shape[2] + 8, first.shape[3] + 8, first.shape[1] ])\n\n\t\tself.save_for_backward(first, second, rbot0, rbot1)\n\n\t\tfirst = first.contiguous();\tassert(first.is_cuda == True)\n\t\tsecond = second.contiguous(); assert(second.is_cuda == True)\n\n\t\toutput = first.new_zeros([ first.shape[0], 81, first.shape[2], first.shape[3] ])\n\n\t\tif first.is_cuda == True:\n\t\t\tn = first.shape[2] * first.shape[3]\n\t\t\tcupy_launch('kernel_Correlation_rearrange', cupy_kernel('kernel_Correlation_rearrange', {\n\t\t\t\t'input': first,\n\t\t\t\t'output': rbot0\n\t\t\t}))(\n\t\t\t\tgrid=tuple([ int((n + 16 - 1) / 16), first.shape[1], first.shape[0] ]),\n\t\t\t\tblock=tuple([ 16, 1, 1 ]),\n\t\t\t\targs=[ n, first.data_ptr(), rbot0.data_ptr() ]\n\t\t\t)\n\n\t\t\tn = second.shape[2] * second.shape[3]\n\t\t\tcupy_launch('kernel_Correlation_rearrange', cupy_kernel('kernel_Correlation_rearrange', {\n\t\t\t\t'input': second,\n\t\t\t\t'output': rbot1\n\t\t\t}))(\n\t\t\t\tgrid=tuple([ int((n + 16 - 1) / 16), second.shape[1], second.shape[0] ]),\n\t\t\t\tblock=tuple([ 16, 1, 1 ]),\n\t\t\t\targs=[ n, second.data_ptr(), rbot1.data_ptr() ]\n\t\t\t)\n\n\t\t\tn = output.shape[1] * output.shape[2] * output.shape[3]\n\t\t\tcupy_launch('kernel_Correlation_updateOutput', cupy_kernel('kernel_Correlation_updateOutput', {\n\t\t\t\t'rbot0': rbot0,\n\t\t\t\t'rbot1': rbot1,\n\t\t\t\t'top': output\n\t\t\t}))(\n\t\t\t\tgrid=tuple([ output.shape[3], output.shape[2], output.shape[0] ]),\n\t\t\t\tblock=tuple([ 32, 1, 1 ]),\n\t\t\t\tshared_mem=first.shape[1] * 
4,\n\t\t\t\targs=[ n, rbot0.data_ptr(), rbot1.data_ptr(), output.data_ptr() ]\n\t\t\t)\n\n\t\telif first.is_cuda == False:\n\t\t\traise NotImplementedError()\n\n\t\t# end\n\n\t\treturn output\n\t# end\n\n\t@staticmethod\n\tdef backward(self, gradOutput):\n\t\tfirst, second, rbot0, rbot1 = self.saved_tensors\n\n\t\tgradOutput = gradOutput.contiguous(); assert(gradOutput.is_cuda == True)\n\n\t\tgradFirst = first.new_zeros([ first.shape[0], first.shape[1], first.shape[2], first.shape[3] ]) if self.needs_input_grad[0] == True else None\n\t\tgradSecond = first.new_zeros([ first.shape[0], first.shape[1], first.shape[2], first.shape[3] ]) if self.needs_input_grad[1] == True else None\n\n\t\tif first.is_cuda == True:\n\t\t\tif gradFirst is not None:\n\t\t\t\tfor intSample in range(first.shape[0]):\n\t\t\t\t\tn = first.shape[1] * first.shape[2] * first.shape[3]\n\t\t\t\t\tcupy_launch('kernel_Correlation_updateGradFirst', cupy_kernel('kernel_Correlation_updateGradFirst', {\n\t\t\t\t\t\t'rbot0': rbot0,\n\t\t\t\t\t\t'rbot1': rbot1,\n\t\t\t\t\t\t'gradOutput': gradOutput,\n\t\t\t\t\t\t'gradFirst': gradFirst,\n\t\t\t\t\t\t'gradSecond': None\n\t\t\t\t\t}))(\n\t\t\t\t\t\tgrid=tuple([ int((n + 512 - 1) / 512), 1, 1 ]),\n\t\t\t\t\t\tblock=tuple([ 512, 1, 1 ]),\n\t\t\t\t\t\targs=[ n, intSample, rbot0.data_ptr(), rbot1.data_ptr(), gradOutput.data_ptr(), gradFirst.data_ptr(), None ]\n\t\t\t\t\t)\n\t\t\t\t# end\n\t\t\t# end\n\n\t\t\tif gradSecond is not None:\n\t\t\t\tfor intSample in range(first.shape[0]):\n\t\t\t\t\tn = first.shape[1] * first.shape[2] * first.shape[3]\n\t\t\t\t\tcupy_launch('kernel_Correlation_updateGradSecond', cupy_kernel('kernel_Correlation_updateGradSecond', {\n\t\t\t\t\t\t'rbot0': rbot0,\n\t\t\t\t\t\t'rbot1': rbot1,\n\t\t\t\t\t\t'gradOutput': gradOutput,\n\t\t\t\t\t\t'gradFirst': None,\n\t\t\t\t\t\t'gradSecond': gradSecond\n\t\t\t\t\t}))(\n\t\t\t\t\t\tgrid=tuple([ int((n + 512 - 1) / 512), 1, 1 ]),\n\t\t\t\t\t\tblock=tuple([ 512, 1, 1 ]),\n\t\t\t\t\t\targs=[ n, 
intSample, rbot0.data_ptr(), rbot1.data_ptr(), gradOutput.data_ptr(), None, gradSecond.data_ptr() ]\n\t\t\t\t\t)\n\t\t\t\t# end\n\t\t\t# end\n\n\t\telif first.is_cuda == False:\n\t\t\traise NotImplementedError()\n\n\t\t# end\n\n\t\treturn gradFirst, gradSecond\n\t# end\n# end\n\ndef FunctionCorrelation(tenFirst, tenSecond):\n\treturn _FunctionCorrelation.apply(tenFirst, tenSecond)\n# end\n\nclass ModuleCorrelation(torch.nn.Module):\n\tdef __init__(self):\n\t\tsuper(ModuleCorrelation, self).__init__()\n\t# end\n\n\tdef forward(self, tenFirst, tenSecond):\n\t\treturn _FunctionCorrelation.apply(tenFirst, tenSecond)\n\t# end\n# end"
  },
  {
    "path": "metrics/flolpips/flolpips.py",
    "content": "\nfrom __future__ import absolute_import\nimport os\nimport numpy as np\nimport torch\nimport torch.nn as nn\nfrom torch.autograd import Variable\nimport metrics.flolpips.pretrained_networks as pn\nimport torch.nn\nimport torch.nn.functional as F\nimport torchvision.transforms.functional as TF\nimport cv2\n\nfrom .pwcnet import Network as PWCNet\nimport metrics.flolpips.utils as utils\n\ndef spatial_average(in_tens, keepdim=True):\n    return in_tens.mean([2,3],keepdim=keepdim)\n\ndef mw_spatial_average(in_tens, flow, keepdim=True):\n    _,_,h,w = in_tens.shape\n    flow = F.interpolate(flow, (h,w), align_corners=False, mode='bilinear')\n    flow_mag = torch.sqrt(flow[:,0:1]**2 + flow[:,1:2]**2)\n    flow_mag = flow_mag / torch.sum(flow_mag, dim=[1,2,3], keepdim=True)\n    return torch.sum(in_tens*flow_mag, dim=[2,3],keepdim=keepdim)\n\n\ndef mtw_spatial_average(in_tens, flow, texture, keepdim=True):\n    _,_,h,w = in_tens.shape\n    flow = F.interpolate(flow, (h,w), align_corners=False, mode='bilinear')\n    texture = F.interpolate(texture, (h,w), align_corners=False, mode='bilinear')\n    flow_mag = torch.sqrt(flow[:,0:1]**2 + flow[:,1:2]**2)\n    flow_mag = (flow_mag - flow_mag.min()) / (flow_mag.max() - flow_mag.min()) + 1e-6\n    texture = (texture - texture.min()) / (texture.max() - texture.min()) + 1e-6\n    weight = flow_mag / texture\n    weight /= torch.sum(weight)\n    return torch.sum(in_tens*weight, dim=[2,3],keepdim=keepdim)\n\n\n\ndef m2w_spatial_average(in_tens, flow, keepdim=True):\n    _,_,h,w = in_tens.shape\n    flow = F.interpolate(flow, (h,w), align_corners=False, mode='bilinear')\n    flow_mag = flow[:,0:1]**2 + flow[:,1:2]**2 # B,1,H,W\n    flow_mag = flow_mag / torch.sum(flow_mag)\n    return torch.sum(in_tens*flow_mag, dim=[2,3],keepdim=keepdim)\n\ndef upsample(in_tens, out_HW=(64,64)): # assumes scale factor is same for H and W\n    in_H, in_W = in_tens.shape[2], in_tens.shape[3]\n    return nn.Upsample(size=out_HW, 
mode='bilinear', align_corners=False)(in_tens)\n\n# Learned perceptual metric\nclass LPIPS(nn.Module):\n    def __init__(self, pretrained=True, net='alex', version='0.1', lpips=True, spatial=False, \n        pnet_rand=False, pnet_tune=False, use_dropout=True, model_path=None, eval_mode=True, verbose=False):\n        # lpips - [True] means with linear calibration on top of base network\n        # pretrained - [True] means load linear weights\n\n        super(LPIPS, self).__init__()\n        if(verbose):\n            print('Setting up [%s] perceptual loss: trunk [%s], v[%s], spatial [%s]'%\n                ('LPIPS' if lpips else 'baseline', net, version, 'on' if spatial else 'off'))\n\n        self.pnet_type = net\n        self.pnet_tune = pnet_tune\n        self.pnet_rand = pnet_rand\n        self.spatial = spatial\n        self.lpips = lpips # false means baseline of just averaging all layers\n        self.version = version\n        self.scaling_layer = ScalingLayer()\n\n        if(self.pnet_type in ['vgg','vgg16']):\n            net_type = pn.vgg16\n            self.chns = [64,128,256,512,512]\n        elif(self.pnet_type=='alex'):\n            net_type = pn.alexnet\n            self.chns = [64,192,384,256,256]\n        elif(self.pnet_type=='squeeze'):\n            net_type = pn.squeezenet\n            self.chns = [64,128,256,384,384,512,512]\n        self.L = len(self.chns)\n\n        self.net = net_type(pretrained=not self.pnet_rand, requires_grad=self.pnet_tune)\n\n        if(lpips):\n            self.lin0 = NetLinLayer(self.chns[0], use_dropout=use_dropout)\n            self.lin1 = NetLinLayer(self.chns[1], use_dropout=use_dropout)\n            self.lin2 = NetLinLayer(self.chns[2], use_dropout=use_dropout)\n            self.lin3 = NetLinLayer(self.chns[3], use_dropout=use_dropout)\n            self.lin4 = NetLinLayer(self.chns[4], use_dropout=use_dropout)\n            self.lins = [self.lin0,self.lin1,self.lin2,self.lin3,self.lin4]\n            
if(self.pnet_type=='squeeze'): # 7 layers for squeezenet\n                self.lin5 = NetLinLayer(self.chns[5], use_dropout=use_dropout)\n                self.lin6 = NetLinLayer(self.chns[6], use_dropout=use_dropout)\n                self.lins+=[self.lin5,self.lin6]\n            self.lins = nn.ModuleList(self.lins)\n\n            if(pretrained):\n                if(model_path is None):\n                    import inspect\n                    import os\n                    model_path = os.path.abspath(os.path.join(inspect.getfile(self.__init__), '..', 'weights/v%s/%s.pth'%(version,net)))\n\n                if(verbose):\n                    print('Loading model from: %s'%model_path)\n                self.load_state_dict(torch.load(model_path, map_location='cpu'), strict=False)          \n\n        if(eval_mode):\n            self.eval()\n\n    def forward(self, in0, in1, retPerLayer=False, normalize=False):\n        if normalize: # turn on this flag if input is [0,1] so it can be adjusted to [-1, +1]\n            in0 = 2 * in0  - 1\n            in1 = 2 * in1  - 1\n\n        # v0.0 - original release had a bug, where input was not scaled\n        in0_input, in1_input = (self.scaling_layer(in0), self.scaling_layer(in1)) if self.version=='0.1' else (in0, in1)\n        outs0, outs1 = self.net.forward(in0_input), self.net.forward(in1_input)\n        feats0, feats1, diffs = {}, {}, {}\n\n        for kk in range(self.L):\n            feats0[kk], feats1[kk] = utils.normalize_tensor(outs0[kk]), utils.normalize_tensor(outs1[kk])\n            diffs[kk] = (feats0[kk]-feats1[kk])**2\n\n        if(self.lpips):\n            if(self.spatial):\n                res = [upsample(self.lins[kk](diffs[kk]), out_HW=in0.shape[2:]) for kk in range(self.L)]\n            else:\n                res = [spatial_average(self.lins[kk](diffs[kk]), keepdim=True) for kk in range(self.L)]\n        else:\n            if(self.spatial):\n                res = [upsample(diffs[kk].sum(dim=1,keepdim=True), 
out_HW=in0.shape[2:]) for kk in range(self.L)]\n            else:\n                res = [spatial_average(diffs[kk].sum(dim=1,keepdim=True), keepdim=True) for kk in range(self.L)]\n\n        # val = res[0]\n        # for l in range(1,self.L):\n        #     val += res[l]\n        #     print(val)\n\n        # a = spatial_average(self.lins[kk](diffs[kk]), keepdim=True)\n        # b = torch.max(self.lins[kk](feats0[kk]**2))\n        # for kk in range(self.L):\n        #     a += spatial_average(self.lins[kk](diffs[kk]), keepdim=True)\n        #     b = torch.max(b,torch.max(self.lins[kk](feats0[kk]**2)))\n        # a = a/self.L\n        # from IPython import embed\n        # embed()\n        # return 10*torch.log10(b/a)\n        \n        # if(retPerLayer):\n        #     return (val, res)\n        # else:\n        return torch.sum(torch.cat(res, 1), dim=(1,2,3), keepdims=False)\n\n\nclass ScalingLayer(nn.Module):\n    def __init__(self):\n        super(ScalingLayer, self).__init__()\n        self.register_buffer('shift', torch.Tensor([-.030,-.088,-.188])[None,:,None,None])\n        self.register_buffer('scale', torch.Tensor([.458,.448,.450])[None,:,None,None])\n\n    def forward(self, inp):\n        return (inp - self.shift) / self.scale\n\n\nclass NetLinLayer(nn.Module):\n    ''' A single linear layer which does a 1x1 conv '''\n    def __init__(self, chn_in, chn_out=1, use_dropout=False):\n        super(NetLinLayer, self).__init__()\n\n        layers = [nn.Dropout(),] if(use_dropout) else []\n        layers += [nn.Conv2d(chn_in, chn_out, 1, stride=1, padding=0, bias=False),]\n        self.model = nn.Sequential(*layers)\n\n    def forward(self, x):\n        return self.model(x)\n\nclass Dist2LogitLayer(nn.Module):\n    ''' takes 2 distances, puts through fc layers, spits out value between [0,1] (if use_sigmoid is True) '''\n    def __init__(self, chn_mid=32, use_sigmoid=True):\n        super(Dist2LogitLayer, self).__init__()\n\n        layers = [nn.Conv2d(5, 
chn_mid, 1, stride=1, padding=0, bias=True),]\n        layers += [nn.LeakyReLU(0.2,True),]\n        layers += [nn.Conv2d(chn_mid, chn_mid, 1, stride=1, padding=0, bias=True),]\n        layers += [nn.LeakyReLU(0.2,True),]\n        layers += [nn.Conv2d(chn_mid, 1, 1, stride=1, padding=0, bias=True),]\n        if(use_sigmoid):\n            layers += [nn.Sigmoid(),]\n        self.model = nn.Sequential(*layers)\n\n    def forward(self,d0,d1,eps=0.1):\n        return self.model.forward(torch.cat((d0,d1,d0-d1,d0/(d1+eps),d1/(d0+eps)),dim=1))\n\nclass BCERankingLoss(nn.Module):\n    def __init__(self, chn_mid=32):\n        super(BCERankingLoss, self).__init__()\n        self.net = Dist2LogitLayer(chn_mid=chn_mid)\n        # self.parameters = list(self.net.parameters())\n        self.loss = torch.nn.BCELoss()\n\n    def forward(self, d0, d1, judge):\n        per = (judge+1.)/2.\n        self.logit = self.net.forward(d0,d1)\n        return self.loss(self.logit, per)\n\n# L2, DSSIM metrics\nclass FakeNet(nn.Module):\n    def __init__(self, use_gpu=True, colorspace='Lab'):\n        super(FakeNet, self).__init__()\n        self.use_gpu = use_gpu\n        self.colorspace = colorspace\n\nclass L2(FakeNet):\n    def forward(self, in0, in1, retPerLayer=None):\n        assert(in0.size()[0]==1) # currently only supports batchSize 1\n\n        if(self.colorspace=='RGB'):\n            (N,C,X,Y) = in0.size()\n            value = torch.mean(torch.mean(torch.mean((in0-in1)**2,dim=1).view(N,1,X,Y),dim=2).view(N,1,1,Y),dim=3).view(N)\n            return value\n        elif(self.colorspace=='Lab'):\n            value = utils.l2(utils.tensor2np(utils.tensor2tensorlab(in0.data,to_norm=False)), \n                utils.tensor2np(utils.tensor2tensorlab(in1.data,to_norm=False)), range=100.).astype('float')\n            ret_var = Variable( torch.Tensor((value,) ) )\n            if(self.use_gpu):\n                ret_var = ret_var.cuda()\n            return ret_var\n\nclass DSSIM(FakeNet):\n\n    
def forward(self, in0, in1, retPerLayer=None):\n        assert(in0.size()[0]==1) # currently only supports batchSize 1\n\n        if(self.colorspace=='RGB'):\n            value = utils.dssim(1.*utils.tensor2im(in0.data), 1.*utils.tensor2im(in1.data), range=255.).astype('float')\n        elif(self.colorspace=='Lab'):\n            value = utils.dssim(utils.tensor2np(utils.tensor2tensorlab(in0.data,to_norm=False)), \n                utils.tensor2np(utils.tensor2tensorlab(in1.data,to_norm=False)), range=100.).astype('float')\n        ret_var = Variable( torch.Tensor((value,) ) )\n        if(self.use_gpu):\n            ret_var = ret_var.cuda()\n        return ret_var\n\ndef print_network(net):\n    num_params = 0\n    for param in net.parameters():\n        num_params += param.numel()\n    print('Network',net)\n    print('Total number of parameters: %d' % num_params)\n\n\nclass FloLPIPS(LPIPS):\n    def __init__(self, pretrained=True, net='alex', version='0.1', lpips=True, spatial=False, pnet_rand=False, pnet_tune=False, use_dropout=True, model_path=None, eval_mode=True, verbose=False):\n        super(FloLPIPS, self).__init__(pretrained, net, version, lpips, spatial, pnet_rand, pnet_tune, use_dropout, model_path, eval_mode, verbose)\n\n    def forward(self, in0, in1, flow, retPerLayer=False, normalize=False):\n        if normalize: # turn on this flag if input is [0,1] so it can be adjusted to [-1, +1]\n            in0 = 2 * in0  - 1\n            in1 = 2 * in1  - 1\n\n        in0_input, in1_input = (self.scaling_layer(in0), self.scaling_layer(in1)) if self.version=='0.1' else (in0, in1)\n        outs0, outs1 = self.net.forward(in0_input), self.net.forward(in1_input)\n        feats0, feats1, diffs = {}, {}, {}\n\n        for kk in range(self.L):\n            feats0[kk], feats1[kk] = utils.normalize_tensor(outs0[kk]), utils.normalize_tensor(outs1[kk])\n            diffs[kk] = (feats0[kk]-feats1[kk])**2\n\n        res = [mw_spatial_average(self.lins[kk](diffs[kk]), flow, 
keepdim=True) for kk in range(self.L)]\n\n\n        return torch.sum(torch.cat(res, 1), dim=(1,2,3), keepdims=False)\n\n\n\ndef calc_flolpips(dis_path, ref_path):\n\n    batch_size = 8\n\n    # convert to yuv first\n    os.system('ffmpeg -hide_banner -loglevel error -i {} flolpips_ref.yuv'.format(ref_path))\n    os.system('ffmpeg -hide_banner -loglevel error -i {} flolpips_dis.yuv'.format(dis_path))\n\n    loss_fn = FloLPIPS(net='alex',version='0.1').cuda()\n    flownet = PWCNet().cuda()\n    # batch_size = 128\n\n    cap_dis = cv2.VideoCapture(dis_path)\n    cap_ref = cv2.VideoCapture(ref_path)\n    assert int(cap_dis.get(cv2.CAP_PROP_FRAME_COUNT)) == int(cap_ref.get(cv2.CAP_PROP_FRAME_COUNT))\n    num_frames = int(cap_ref.get(cv2.CAP_PROP_FRAME_COUNT))\n    width  = int(cap_ref.get(cv2.CAP_PROP_FRAME_WIDTH))\n    height = int(cap_ref.get(cv2.CAP_PROP_FRAME_HEIGHT))\n    cap_dis.release()\n    cap_ref.release()\n    # raw YUV is binary data, so open the streams in binary mode\n    stream_dis = open('flolpips_dis.yuv', 'rb')\n    stream_ref = open('flolpips_ref.yuv', 'rb')\n\n    flolpips_list = []\n    batch_ref_list, batch_dis_list = [], []\n    batch_ref_next_list, batch_dis_next_list = [], []\n    for iFrame in range(num_frames-1):\n        frame_dis = TF.to_tensor(utils.read_frame_yuv2rgb(stream_dis, width, height, iFrame, 8, '420'))\n        frame_dis_next = TF.to_tensor(utils.read_frame_yuv2rgb(stream_dis, width, height, iFrame+1, 8, '420'))\n        frame_ref = TF.to_tensor(utils.read_frame_yuv2rgb(stream_ref, width, height, iFrame, 8, '420'))\n        frame_ref_next = TF.to_tensor(utils.read_frame_yuv2rgb(stream_ref, width, height, iFrame+1, 8, '420'))\n        batch_dis_list.append(frame_dis)\n        batch_dis_next_list.append(frame_dis_next)\n        batch_ref_list.append(frame_ref)\n        batch_ref_next_list.append(frame_ref_next)\n        if len(batch_ref_list) % batch_size == 0:\n            with torch.no_grad():\n                frames_ref = torch.stack(batch_ref_list, dim=0).cuda()\n                frames_dis = torch.stack(batch_dis_list, dim=0).cuda()\n      
          frames_ref_next = torch.stack(batch_ref_next_list, dim=0).cuda()\n                frames_dis_next = torch.stack(batch_dis_next_list, dim=0).cuda()\n                flow_ref = flownet(frames_ref, frames_ref_next)\n                flow_dis = flownet(frames_dis, frames_dis_next)\n                flow_diff = flow_ref - flow_dis\n                flolpips = loss_fn.forward(frames_ref, frames_dis, flow_diff, normalize=True)\n            batch_ref_list, batch_dis_list, batch_ref_next_list, batch_dis_next_list = [], [], [], []\n            flolpips_list = flolpips_list + flolpips.cpu().tolist()\n    if len(batch_ref_list) > 0:\n        with torch.no_grad():\n            frames_ref = torch.stack(batch_ref_list, dim=0).cuda()\n            frames_dis = torch.stack(batch_dis_list, dim=0).cuda()\n            frames_ref_next = torch.stack(batch_ref_next_list, dim=0).cuda()\n            frames_dis_next = torch.stack(batch_dis_next_list, dim=0).cuda()\n            flow_ref = flownet(frames_ref, frames_ref_next)\n            flow_dis = flownet(frames_dis, frames_dis_next)\n            flow_diff = flow_ref - flow_dis\n            flolpips = loss_fn.forward(frames_ref, frames_dis, flow_diff, normalize=True)\n        flolpips_list = flolpips_list + flolpips.cpu().tolist()\n\n    stream_dis.close()\n    stream_ref.close()\n\n    # delete files, modify command accordingly\n    os.remove('flolpips_dis.yuv')\n    os.remove('flolpips_ref.yuv')\n\n    return np.mean(flolpips_list)"
  },
  {
    "path": "metrics/flolpips/pretrained_networks.py",
    "content": "from collections import namedtuple\nimport torch\nfrom torchvision import models as tv\n\nclass squeezenet(torch.nn.Module):\n    def __init__(self, requires_grad=False, pretrained=True):\n        super(squeezenet, self).__init__()\n        pretrained_features = tv.squeezenet1_1(pretrained=pretrained).features\n        self.slice1 = torch.nn.Sequential()\n        self.slice2 = torch.nn.Sequential()\n        self.slice3 = torch.nn.Sequential()\n        self.slice4 = torch.nn.Sequential()\n        self.slice5 = torch.nn.Sequential()\n        self.slice6 = torch.nn.Sequential()\n        self.slice7 = torch.nn.Sequential()\n        self.N_slices = 7\n        for x in range(2):\n            self.slice1.add_module(str(x), pretrained_features[x])\n        for x in range(2,5):\n            self.slice2.add_module(str(x), pretrained_features[x])\n        for x in range(5, 8):\n            self.slice3.add_module(str(x), pretrained_features[x])\n        for x in range(8, 10):\n            self.slice4.add_module(str(x), pretrained_features[x])\n        for x in range(10, 11):\n            self.slice5.add_module(str(x), pretrained_features[x])\n        for x in range(11, 12):\n            self.slice6.add_module(str(x), pretrained_features[x])\n        for x in range(12, 13):\n            self.slice7.add_module(str(x), pretrained_features[x])\n        if not requires_grad:\n            for param in self.parameters():\n                param.requires_grad = False\n\n    def forward(self, X):\n        h = self.slice1(X)\n        h_relu1 = h\n        h = self.slice2(h)\n        h_relu2 = h\n        h = self.slice3(h)\n        h_relu3 = h\n        h = self.slice4(h)\n        h_relu4 = h\n        h = self.slice5(h)\n        h_relu5 = h\n        h = self.slice6(h)\n        h_relu6 = h\n        h = self.slice7(h)\n        h_relu7 = h\n        vgg_outputs = namedtuple(\"SqueezeOutputs\", ['relu1','relu2','relu3','relu4','relu5','relu6','relu7'])\n        out = 
vgg_outputs(h_relu1,h_relu2,h_relu3,h_relu4,h_relu5,h_relu6,h_relu7)\n\n        return out\n\n\nclass alexnet(torch.nn.Module):\n    def __init__(self, requires_grad=False, pretrained=True):\n        super(alexnet, self).__init__()\n        alexnet_pretrained_features = tv.alexnet(pretrained=pretrained).features\n        self.slice1 = torch.nn.Sequential()\n        self.slice2 = torch.nn.Sequential()\n        self.slice3 = torch.nn.Sequential()\n        self.slice4 = torch.nn.Sequential()\n        self.slice5 = torch.nn.Sequential()\n        self.N_slices = 5\n        for x in range(2):\n            self.slice1.add_module(str(x), alexnet_pretrained_features[x])\n        for x in range(2, 5):\n            self.slice2.add_module(str(x), alexnet_pretrained_features[x])\n        for x in range(5, 8):\n            self.slice3.add_module(str(x), alexnet_pretrained_features[x])\n        for x in range(8, 10):\n            self.slice4.add_module(str(x), alexnet_pretrained_features[x])\n        for x in range(10, 12):\n            self.slice5.add_module(str(x), alexnet_pretrained_features[x])\n        if not requires_grad:\n            for param in self.parameters():\n                param.requires_grad = False\n\n    def forward(self, X):\n        h = self.slice1(X)\n        h_relu1 = h\n        h = self.slice2(h)\n        h_relu2 = h\n        h = self.slice3(h)\n        h_relu3 = h\n        h = self.slice4(h)\n        h_relu4 = h\n        h = self.slice5(h)\n        h_relu5 = h\n        alexnet_outputs = namedtuple(\"AlexnetOutputs\", ['relu1', 'relu2', 'relu3', 'relu4', 'relu5'])\n        out = alexnet_outputs(h_relu1, h_relu2, h_relu3, h_relu4, h_relu5)\n\n        return out\n\nclass vgg16(torch.nn.Module):\n    def __init__(self, requires_grad=False, pretrained=True):\n        super(vgg16, self).__init__()\n        vgg_pretrained_features = tv.vgg16(pretrained=pretrained).features\n        self.slice1 = torch.nn.Sequential()\n        self.slice2 = 
torch.nn.Sequential()\n        self.slice3 = torch.nn.Sequential()\n        self.slice4 = torch.nn.Sequential()\n        self.slice5 = torch.nn.Sequential()\n        self.N_slices = 5\n        for x in range(4):\n            self.slice1.add_module(str(x), vgg_pretrained_features[x])\n        for x in range(4, 9):\n            self.slice2.add_module(str(x), vgg_pretrained_features[x])\n        for x in range(9, 16):\n            self.slice3.add_module(str(x), vgg_pretrained_features[x])\n        for x in range(16, 23):\n            self.slice4.add_module(str(x), vgg_pretrained_features[x])\n        for x in range(23, 30):\n            self.slice5.add_module(str(x), vgg_pretrained_features[x])\n        if not requires_grad:\n            for param in self.parameters():\n                param.requires_grad = False\n\n    def forward(self, X):\n        h = self.slice1(X)\n        h_relu1_2 = h\n        h = self.slice2(h)\n        h_relu2_2 = h\n        h = self.slice3(h)\n        h_relu3_3 = h\n        h = self.slice4(h)\n        h_relu4_3 = h\n        h = self.slice5(h)\n        h_relu5_3 = h\n        vgg_outputs = namedtuple(\"VggOutputs\", ['relu1_2', 'relu2_2', 'relu3_3', 'relu4_3', 'relu5_3'])\n        out = vgg_outputs(h_relu1_2, h_relu2_2, h_relu3_3, h_relu4_3, h_relu5_3)\n\n        return out\n\n\n\nclass resnet(torch.nn.Module):\n    def __init__(self, requires_grad=False, pretrained=True, num=18):\n        super(resnet, self).__init__()\n        if(num==18):\n            self.net = tv.resnet18(pretrained=pretrained)\n        elif(num==34):\n            self.net = tv.resnet34(pretrained=pretrained)\n        elif(num==50):\n            self.net = tv.resnet50(pretrained=pretrained)\n        elif(num==101):\n            self.net = tv.resnet101(pretrained=pretrained)\n        elif(num==152):\n            self.net = tv.resnet152(pretrained=pretrained)\n        self.N_slices = 5\n\n        self.conv1 = self.net.conv1\n        self.bn1 = self.net.bn1\n        
self.relu = self.net.relu\n        self.maxpool = self.net.maxpool\n        self.layer1 = self.net.layer1\n        self.layer2 = self.net.layer2\n        self.layer3 = self.net.layer3\n        self.layer4 = self.net.layer4\n\n    def forward(self, X):\n        h = self.conv1(X)\n        h = self.bn1(h)\n        h = self.relu(h)\n        h_relu1 = h\n        h = self.maxpool(h)\n        h = self.layer1(h)\n        h_conv2 = h\n        h = self.layer2(h)\n        h_conv3 = h\n        h = self.layer3(h)\n        h_conv4 = h\n        h = self.layer4(h)\n        h_conv5 = h\n\n        outputs = namedtuple(\"Outputs\", ['relu1','conv2','conv3','conv4','conv5'])\n        out = outputs(h_relu1, h_conv2, h_conv3, h_conv4, h_conv5)\n\n        return out\n"
  },
  {
    "path": "metrics/flolpips/pwcnet.py",
    "content": "#!/usr/bin/env python\n\nimport torch\n\nimport getopt\nimport math\nimport numpy\nimport os\nimport PIL\nimport PIL.Image\nimport sys\n\n# try:\nfrom metrics.flolpips.correlation import correlation # the custom cost volume layer\n# except:\n# \tsys.path.insert(0, './correlation'); import correlation # you should consider upgrading python\n# end\n\n##########################################################\n\n# assert(int(str('').join(torch.__version__.split('.')[0:2])) >= 13) # requires at least pytorch version 1.3.0\n\n# torch.set_grad_enabled(False) # make sure to not compute gradients for computational performance\n\n# torch.backends.cudnn.enabled = True # make sure to use cudnn for computational performance\n\n# ##########################################################\n\n# arguments_strModel = 'default' # 'default', or 'chairs-things'\n# arguments_strFirst = './images/first.png'\n# arguments_strSecond = './images/second.png'\n# arguments_strOut = './out.flo'\n\n# for strOption, strArgument in getopt.getopt(sys.argv[1:], '', [ strParameter[2:] + '=' for strParameter in sys.argv[1::2] ])[0]:\n# \tif strOption == '--model' and strArgument != '': arguments_strModel = strArgument # which model to use\n# \tif strOption == '--first' and strArgument != '': arguments_strFirst = strArgument # path to the first frame\n# \tif strOption == '--second' and strArgument != '': arguments_strSecond = strArgument # path to the second frame\n# \tif strOption == '--out' and strArgument != '': arguments_strOut = strArgument # path to where the output should be stored\n# end\n\n##########################################################\n\n\n\ndef backwarp(tenInput, tenFlow):\n\tbackwarp_tenGrid = {}\n\tbackwarp_tenPartial = {}\n\tif str(tenFlow.shape) not in backwarp_tenGrid:\n\t\ttenHor = torch.linspace(-1.0 + (1.0 / tenFlow.shape[3]), 1.0 - (1.0 / tenFlow.shape[3]), tenFlow.shape[3]).view(1, 1, 1, -1).expand(-1, -1, tenFlow.shape[2], -1)\n\t\ttenVer = 
torch.linspace(-1.0 + (1.0 / tenFlow.shape[2]), 1.0 - (1.0 / tenFlow.shape[2]), tenFlow.shape[2]).view(1, 1, -1, 1).expand(-1, -1, -1, tenFlow.shape[3])\n\n\t\tbackwarp_tenGrid[str(tenFlow.shape)] = torch.cat([ tenHor, tenVer ], 1).cuda()\n\t# end\n\n\tif str(tenFlow.shape) not in backwarp_tenPartial:\n\t\tbackwarp_tenPartial[str(tenFlow.shape)] = tenFlow.new_ones([ tenFlow.shape[0], 1, tenFlow.shape[2], tenFlow.shape[3] ])\n\t# end\n\n\ttenFlow = torch.cat([ tenFlow[:, 0:1, :, :] / ((tenInput.shape[3] - 1.0) / 2.0), tenFlow[:, 1:2, :, :] / ((tenInput.shape[2] - 1.0) / 2.0) ], 1)\n\ttenInput = torch.cat([ tenInput, backwarp_tenPartial[str(tenFlow.shape)] ], 1)\n\n\ttenOutput = torch.nn.functional.grid_sample(input=tenInput, grid=(backwarp_tenGrid[str(tenFlow.shape)] + tenFlow).permute(0, 2, 3, 1), mode='bilinear', padding_mode='zeros', align_corners=False)\n\n\ttenMask = tenOutput[:, -1:, :, :]; tenMask[tenMask > 0.999] = 1.0; tenMask[tenMask < 1.0] = 0.0\n\n\treturn tenOutput[:, :-1, :, :] * tenMask\n# end\n\n##########################################################\n\nclass Network(torch.nn.Module):\n\tdef __init__(self):\n\t\tsuper(Network, self).__init__()\n\n\t\tclass Extractor(torch.nn.Module):\n\t\t\tdef __init__(self):\n\t\t\t\tsuper(Extractor, self).__init__()\n\n\t\t\t\tself.netOne = torch.nn.Sequential(\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=2, padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1),\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=16, out_channels=16, kernel_size=3, stride=1, padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1),\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=16, out_channels=16, kernel_size=3, stride=1, padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1)\n\t\t\t\t)\n\n\t\t\t\tself.netTwo = torch.nn.Sequential(\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=2, 
padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1),\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3, stride=1, padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1),\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=32, out_channels=32, kernel_size=3, stride=1, padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1)\n\t\t\t\t)\n\n\t\t\t\tself.netThr = torch.nn.Sequential(\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, stride=2, padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1),\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1, padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1),\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, stride=1, padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1)\n\t\t\t\t)\n\n\t\t\t\tself.netFou = torch.nn.Sequential(\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=64, out_channels=96, kernel_size=3, stride=2, padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1),\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=96, out_channels=96, kernel_size=3, stride=1, padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1),\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=96, out_channels=96, kernel_size=3, stride=1, padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1)\n\t\t\t\t)\n\n\t\t\t\tself.netFiv = torch.nn.Sequential(\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=96, out_channels=128, kernel_size=3, stride=2, padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1),\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1),\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, 
padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1)\n\t\t\t\t)\n\n\t\t\t\tself.netSix = torch.nn.Sequential(\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=128, out_channels=196, kernel_size=3, stride=2, padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1),\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=196, out_channels=196, kernel_size=3, stride=1, padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1),\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=196, out_channels=196, kernel_size=3, stride=1, padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1)\n\t\t\t\t)\n\t\t\t# end\n\n\t\t\tdef forward(self, tenInput):\n\t\t\t\ttenOne = self.netOne(tenInput)\n\t\t\t\ttenTwo = self.netTwo(tenOne)\n\t\t\t\ttenThr = self.netThr(tenTwo)\n\t\t\t\ttenFou = self.netFou(tenThr)\n\t\t\t\ttenFiv = self.netFiv(tenFou)\n\t\t\t\ttenSix = self.netSix(tenFiv)\n\n\t\t\t\treturn [ tenOne, tenTwo, tenThr, tenFou, tenFiv, tenSix ]\n\t\t\t# end\n\t\t# end\n\n\t\tclass Decoder(torch.nn.Module):\n\t\t\tdef __init__(self, intLevel):\n\t\t\t\tsuper(Decoder, self).__init__()\n\n\t\t\t\tintPrevious = [ None, None, 81 + 32 + 2 + 2, 81 + 64 + 2 + 2, 81 + 96 + 2 + 2, 81 + 128 + 2 + 2, 81, None ][intLevel + 1]\n\t\t\t\tintCurrent = [ None, None, 81 + 32 + 2 + 2, 81 + 64 + 2 + 2, 81 + 96 + 2 + 2, 81 + 128 + 2 + 2, 81, None ][intLevel + 0]\n\n\t\t\t\tif intLevel < 6: self.netUpflow = torch.nn.ConvTranspose2d(in_channels=2, out_channels=2, kernel_size=4, stride=2, padding=1)\n\t\t\t\tif intLevel < 6: self.netUpfeat = torch.nn.ConvTranspose2d(in_channels=intPrevious + 128 + 128 + 96 + 64 + 32, out_channels=2, kernel_size=4, stride=2, padding=1)\n\t\t\t\tif intLevel < 6: self.fltBackwarp = [ None, None, None, 5.0, 2.5, 1.25, 0.625, None ][intLevel + 1]\n\n\t\t\t\tself.netOne = torch.nn.Sequential(\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=intCurrent, out_channels=128, kernel_size=3, stride=1, 
padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1)\n\t\t\t\t)\n\n\t\t\t\tself.netTwo = torch.nn.Sequential(\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=intCurrent + 128, out_channels=128, kernel_size=3, stride=1, padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1)\n\t\t\t\t)\n\n\t\t\t\tself.netThr = torch.nn.Sequential(\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=intCurrent + 128 + 128, out_channels=96, kernel_size=3, stride=1, padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1)\n\t\t\t\t)\n\n\t\t\t\tself.netFou = torch.nn.Sequential(\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=intCurrent + 128 + 128 + 96, out_channels=64, kernel_size=3, stride=1, padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1)\n\t\t\t\t)\n\n\t\t\t\tself.netFiv = torch.nn.Sequential(\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=intCurrent + 128 + 128 + 96 + 64, out_channels=32, kernel_size=3, stride=1, padding=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1)\n\t\t\t\t)\n\n\t\t\t\tself.netSix = torch.nn.Sequential(\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=intCurrent + 128 + 128 + 96 + 64 + 32, out_channels=2, kernel_size=3, stride=1, padding=1)\n\t\t\t\t)\n\t\t\t# end\n\n\t\t\tdef forward(self, tenFirst, tenSecond, objPrevious):\n\t\t\t\ttenFlow = None\n\t\t\t\ttenFeat = None\n\n\t\t\t\tif objPrevious is None:\n\t\t\t\t\ttenFlow = None\n\t\t\t\t\ttenFeat = None\n\n\t\t\t\t\ttenVolume = torch.nn.functional.leaky_relu(input=correlation.FunctionCorrelation(tenFirst=tenFirst, tenSecond=tenSecond), negative_slope=0.1, inplace=False)\n\n\t\t\t\t\ttenFeat = torch.cat([ tenVolume ], 1)\n\n\t\t\t\telif objPrevious is not None:\n\t\t\t\t\ttenFlow = self.netUpflow(objPrevious['tenFlow'])\n\t\t\t\t\ttenFeat = self.netUpfeat(objPrevious['tenFeat'])\n\n\t\t\t\t\ttenVolume = torch.nn.functional.leaky_relu(input=correlation.FunctionCorrelation(tenFirst=tenFirst, tenSecond=backwarp(tenInput=tenSecond, 
tenFlow=tenFlow * self.fltBackwarp)), negative_slope=0.1, inplace=False)\n\n\t\t\t\t\ttenFeat = torch.cat([ tenVolume, tenFirst, tenFlow, tenFeat ], 1)\n\n\t\t\t\t# end\n\n\t\t\t\ttenFeat = torch.cat([ self.netOne(tenFeat), tenFeat ], 1)\n\t\t\t\ttenFeat = torch.cat([ self.netTwo(tenFeat), tenFeat ], 1)\n\t\t\t\ttenFeat = torch.cat([ self.netThr(tenFeat), tenFeat ], 1)\n\t\t\t\ttenFeat = torch.cat([ self.netFou(tenFeat), tenFeat ], 1)\n\t\t\t\ttenFeat = torch.cat([ self.netFiv(tenFeat), tenFeat ], 1)\n\n\t\t\t\ttenFlow = self.netSix(tenFeat)\n\n\t\t\t\treturn {\n\t\t\t\t\t'tenFlow': tenFlow,\n\t\t\t\t\t'tenFeat': tenFeat\n\t\t\t\t}\n\t\t\t# end\n\t\t# end\n\n\t\tclass Refiner(torch.nn.Module):\n\t\t\tdef __init__(self):\n\t\t\t\tsuper(Refiner, self).__init__()\n\n\t\t\t\tself.netMain = torch.nn.Sequential(\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=81 + 32 + 2 + 2 + 128 + 128 + 96 + 64 + 32, out_channels=128, kernel_size=3, stride=1, padding=1, dilation=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1),\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, padding=2, dilation=2),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1),\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=128, out_channels=128, kernel_size=3, stride=1, padding=4, dilation=4),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1),\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=128, out_channels=96, kernel_size=3, stride=1, padding=8, dilation=8),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1),\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=96, out_channels=64, kernel_size=3, stride=1, padding=16, dilation=16),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1),\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=64, out_channels=32, kernel_size=3, stride=1, padding=1, dilation=1),\n\t\t\t\t\ttorch.nn.LeakyReLU(inplace=False, negative_slope=0.1),\n\t\t\t\t\ttorch.nn.Conv2d(in_channels=32, out_channels=2, kernel_size=3, stride=1, 
padding=1, dilation=1)\n\t\t\t\t)\n\t\t\t# end\n\n\t\t\tdef forward(self, tenInput):\n\t\t\t\treturn self.netMain(tenInput)\n\t\t\t# end\n\t\t# end\n\n\t\tself.netExtractor = Extractor()\n\n\t\tself.netTwo = Decoder(2)\n\t\tself.netThr = Decoder(3)\n\t\tself.netFou = Decoder(4)\n\t\tself.netFiv = Decoder(5)\n\t\tself.netSix = Decoder(6)\n\n\t\tself.netRefiner = Refiner()\n\n\t\tself.load_state_dict({ strKey.replace('module', 'net'): tenWeight for strKey, tenWeight in torch.hub.load_state_dict_from_url(url='http://content.sniklaus.com/github/pytorch-pwc/network-' + 'default' + '.pytorch').items() })\n\t# end\n\n\tdef forward(self, tenFirst, tenSecond, *args):\n\t\tintWidth = tenFirst.shape[3]\n\t\tintHeight = tenFirst.shape[2]\n\n\t\tintPreprocessedWidth = int(math.floor(math.ceil(intWidth / 64.0) * 64.0))\n\t\tintPreprocessedHeight = int(math.floor(math.ceil(intHeight / 64.0) * 64.0))\n\n\t\t# optionally pass pre-extracted feature pyramid in as args\n\t\tif len(args) == 0:\t\n\t\t\ttenPreprocessedFirst = torch.nn.functional.interpolate(input=tenFirst, size=(intPreprocessedHeight, intPreprocessedWidth), mode='bilinear', align_corners=False)\n\t\t\ttenPreprocessedSecond = torch.nn.functional.interpolate(input=tenSecond, size=(intPreprocessedHeight, intPreprocessedWidth), mode='bilinear', align_corners=False)\n\t\t\t\n\t\t\ttenFirst = self.netExtractor(tenPreprocessedFirst)\n\t\t\ttenSecond = self.netExtractor(tenPreprocessedSecond)\n\t\telse:\n\t\t\ttenFirst, tenSecond = args\n\n\t\tobjEstimate = self.netSix(tenFirst[-1], tenSecond[-1], None)\n\t\tobjEstimate = self.netFiv(tenFirst[-2], tenSecond[-2], objEstimate)\n\t\tobjEstimate = self.netFou(tenFirst[-3], tenSecond[-3], objEstimate)\n\t\tobjEstimate = self.netThr(tenFirst[-4], tenSecond[-4], objEstimate)\n\t\tobjEstimate = self.netTwo(tenFirst[-5], tenSecond[-5], objEstimate)\n\n\t\ttenFlow = objEstimate['tenFlow'] + self.netRefiner(objEstimate['tenFeat'])\n\t\ttenFlow = 20.0 * 
torch.nn.functional.interpolate(input=tenFlow, size=(intHeight, intWidth), mode='bilinear', align_corners=False)\n\t\ttenFlow[:, 0, :, :] *= float(intWidth) / float(intPreprocessedWidth)\n\t\ttenFlow[:, 1, :, :] *= float(intHeight) / float(intPreprocessedHeight)\n\n\t\treturn tenFlow\n\t# end\n# end\n\n\tdef extract_pyramid_single(self, tenFirst):\n\t\tintWidth = tenFirst.shape[3]\n\t\tintHeight = tenFirst.shape[2]\n\n\t\tintPreprocessedWidth = int(math.floor(math.ceil(intWidth / 64.0) * 64.0))\n\t\tintPreprocessedHeight = int(math.floor(math.ceil(intHeight / 64.0) * 64.0))\n\n\t\ttenPreprocessedFirst = torch.nn.functional.interpolate(input=tenFirst, size=(intPreprocessedHeight, intPreprocessedWidth), mode='bilinear', align_corners=False)\n\t\treturn self.netExtractor(tenPreprocessedFirst)\n\n\nnetNetwork = None\n\n##########################################################\n\ndef estimate(tenFirst, tenSecond):\n\tglobal netNetwork\n\n\tif netNetwork is None:\n\t\tnetNetwork = Network().cuda().eval()\n\t# end\n\n\tassert(tenFirst.shape[1] == tenSecond.shape[1])\n\tassert(tenFirst.shape[2] == tenSecond.shape[2])\n\n\tintWidth = tenFirst.shape[2]\n\tintHeight = tenFirst.shape[1]\n\n\tassert(intWidth == 1024) # remember that there is no guarantee for correctness, comment this line out if you acknowledge this and want to continue\n\tassert(intHeight == 436) # remember that there is no guarantee for correctness, comment this line out if you acknowledge this and want to continue\n\n\ttenPreprocessedFirst = tenFirst.cuda().view(1, 3, intHeight, intWidth)\n\ttenPreprocessedSecond = tenSecond.cuda().view(1, 3, intHeight, intWidth)\n\n\tintPreprocessedWidth = int(math.floor(math.ceil(intWidth / 64.0) * 64.0))\n\tintPreprocessedHeight = int(math.floor(math.ceil(intHeight / 64.0) * 64.0))\n\n\ttenPreprocessedFirst = torch.nn.functional.interpolate(input=tenPreprocessedFirst, size=(intPreprocessedHeight, intPreprocessedWidth), mode='bilinear', 
align_corners=False)\n\ttenPreprocessedSecond = torch.nn.functional.interpolate(input=tenPreprocessedSecond, size=(intPreprocessedHeight, intPreprocessedWidth), mode='bilinear', align_corners=False)\n\n\ttenFlow = 20.0 * torch.nn.functional.interpolate(input=netNetwork(tenPreprocessedFirst, tenPreprocessedSecond), size=(intHeight, intWidth), mode='bilinear', align_corners=False)\n\n\ttenFlow[:, 0, :, :] *= float(intWidth) / float(intPreprocessedWidth)\n\ttenFlow[:, 1, :, :] *= float(intHeight) / float(intPreprocessedHeight)\n\n\treturn tenFlow[0, :, :, :].cpu()\n# end\n\n##########################################################\n\n# if __name__ == '__main__':\n# \ttenFirst = torch.FloatTensor(numpy.ascontiguousarray(numpy.array(PIL.Image.open(arguments_strFirst))[:, :, ::-1].transpose(2, 0, 1).astype(numpy.float32) * (1.0 / 255.0)))\n# \ttenSecond = torch.FloatTensor(numpy.ascontiguousarray(numpy.array(PIL.Image.open(arguments_strSecond))[:, :, ::-1].transpose(2, 0, 1).astype(numpy.float32) * (1.0 / 255.0)))\n\n# \ttenOutput = estimate(tenFirst, tenSecond)\n\n# \tobjOutput = open(arguments_strOut, 'wb')\n\n# \tnumpy.array([ 80, 73, 69, 72 ], numpy.uint8).tofile(objOutput)\n# \tnumpy.array([ tenOutput.shape[2], tenOutput.shape[1] ], numpy.int32).tofile(objOutput)\n# \tnumpy.array(tenOutput.numpy().transpose(1, 2, 0), numpy.float32).tofile(objOutput)\n\n# \tobjOutput.close()\n# end"
  },
  {
    "path": "metrics/flolpips/utils.py",
    "content": "import numpy as np\nimport cv2\nimport torch\n\n\ndef normalize_tensor(in_feat,eps=1e-10):\n    norm_factor = torch.sqrt(torch.sum(in_feat**2,dim=1,keepdim=True))\n    return in_feat/(norm_factor+eps)\n\ndef l2(p0, p1, range=255.):\n    return .5*np.mean((p0 / range - p1 / range)**2)\n\ndef dssim(p0, p1, range=255.):\n    from skimage.measure import compare_ssim\n    return (1 - compare_ssim(p0, p1, data_range=range, multichannel=True)) / 2.\n\ndef tensor2im(image_tensor, imtype=np.uint8, cent=1., factor=255./2.):\n    image_numpy = image_tensor[0].cpu().float().numpy()\n    image_numpy = (np.transpose(image_numpy, (1, 2, 0)) + cent) * factor\n    return image_numpy.astype(imtype)\n\ndef tensor2np(tensor_obj):\n    # change dimension of a tensor object into a numpy array\n    return tensor_obj[0].cpu().float().numpy().transpose((1,2,0))\n\ndef np2tensor(np_obj):\n     # change dimenion of np array into tensor array\n    return torch.Tensor(np_obj[:, :, :, np.newaxis].transpose((3, 2, 0, 1)))\n\ndef tensor2tensorlab(image_tensor,to_norm=True,mc_only=False):\n    # image tensor to lab tensor\n    from skimage import color\n\n    img = tensor2im(image_tensor)\n    img_lab = color.rgb2lab(img)\n    if(mc_only):\n        img_lab[:,:,0] = img_lab[:,:,0]-50\n    if(to_norm and not mc_only):\n        img_lab[:,:,0] = img_lab[:,:,0]-50\n        img_lab = img_lab/100.\n\n    return np2tensor(img_lab)\n\ndef read_frame_yuv2rgb(stream, width, height, iFrame, bit_depth, pix_fmt='420'):\n    if pix_fmt == '420':\n        multiplier = 1\n        uv_factor = 2\n    elif pix_fmt == '444':\n        multiplier = 2\n        uv_factor = 1\n    else:\n        print('Pixel format {} is not supported'.format(pix_fmt))\n        return\n\n    if bit_depth == 8:\n        datatype = np.uint8\n        stream.seek(iFrame*1.5*width*height*multiplier)\n        Y = np.fromfile(stream, dtype=datatype, count=width*height).reshape((height, width))\n        \n        # read chroma 
samples and upsample since original is 4:2:0 sampling\n        U = np.fromfile(stream, dtype=datatype, count=(width//uv_factor)*(height//uv_factor)).\\\n                                reshape((height//uv_factor, width//uv_factor))\n        V = np.fromfile(stream, dtype=datatype, count=(width//uv_factor)*(height//uv_factor)).\\\n                                reshape((height//uv_factor, width//uv_factor))\n\n    else:\n        datatype = np.uint16\n        stream.seek(iFrame*3*width*height*multiplier)\n        Y = np.fromfile(stream, dtype=datatype, count=width*height).reshape((height, width))\n                \n        U = np.fromfile(stream, dtype=datatype, count=(width//uv_factor)*(height//uv_factor)).\\\n                                reshape((height//uv_factor, width//uv_factor))\n        V = np.fromfile(stream, dtype=datatype, count=(width//uv_factor)*(height//uv_factor)).\\\n                                reshape((height//uv_factor, width//uv_factor))\n\n    if pix_fmt == '420':\n        yuv = np.empty((height*3//2, width), dtype=datatype)\n        yuv[0:height,:] = Y\n\n        yuv[height:height+height//4,:] = U.reshape(-1, width)\n        yuv[height+height//4:,:] = V.reshape(-1, width)\n\n        if bit_depth != 8:\n            yuv = (yuv/(2**bit_depth-1)*255).astype(np.uint8)\n\n        #convert to rgb\n        rgb = cv2.cvtColor(yuv, cv2.COLOR_YUV2RGB_I420)\n    \n    else:\n        yvu = np.stack([Y,V,U],axis=2)\n        if bit_depth != 8:\n            yvu = (yvu/(2**bit_depth-1)*255).astype(np.uint8)\n        rgb = cv2.cvtColor(yvu, cv2.COLOR_YCrCb2RGB)\n\n    return rgb\n"
  },
  {
    "path": "metrics/lpips/__init__.py",
    "content": "\nfrom __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport numpy as np\nimport torch\n# from torch.autograd import Variable\n\n# from .trainer import *\nfrom .lpips import *\n\n# class PerceptualLoss(torch.nn.Module):\n#     def __init__(self, model='lpips', net='alex', spatial=False, use_gpu=False, gpu_ids=[0], version='0.1'): # VGG using our perceptually-learned weights (LPIPS metric)\n#     # def __init__(self, model='net', net='vgg', use_gpu=True): # \"default\" way of using VGG as a perceptual loss\n#         super(PerceptualLoss, self).__init__()\n#         print('Setting up Perceptual loss...')\n#         self.use_gpu = use_gpu\n#         self.spatial = spatial\n#         self.gpu_ids = gpu_ids\n#         self.model = dist_model.DistModel()\n#         self.model.initialize(model=model, net=net, use_gpu=use_gpu, spatial=self.spatial, gpu_ids=gpu_ids, version=version)\n#         print('...[%s] initialized'%self.model.name())\n#         print('...Done')\n\n#     def forward(self, pred, target, normalize=False):\n#         \"\"\"\n#         Pred and target are Variables.\n#         If normalize is True, assumes the images are between [0,1] and then scales them between [-1,+1]\n#         If normalize is False, assumes the images are already between [-1,+1]\n\n#         Inputs pred and target are Nx3xHxW\n#         Output pytorch Variable N long\n#         \"\"\"\n\n#         if normalize:\n#             target = 2 * target  - 1\n#             pred = 2 * pred  - 1\n\n#         return self.model.forward(target, pred)\n\ndef normalize_tensor(in_feat,eps=1e-10):\n    norm_factor = torch.sqrt(torch.sum(in_feat**2,dim=1,keepdim=True))\n    return in_feat/(norm_factor+eps)\n\ndef l2(p0, p1, range=255.):\n    return .5*np.mean((p0 / range - p1 / range)**2)\n\ndef psnr(p0, p1, peak=255.):\n    return 10*np.log10(peak**2/np.mean((1.*p0-1.*p1)**2))\n\ndef dssim(p0, p1, range=255.):\n    from 
skimage.measure import compare_ssim\n    return (1 - compare_ssim(p0, p1, data_range=range, multichannel=True)) / 2.\n\ndef rgb2lab(in_img,mean_cent=False):\n    from skimage import color\n    img_lab = color.rgb2lab(in_img)\n    if(mean_cent):\n        img_lab[:,:,0] = img_lab[:,:,0]-50\n    return img_lab\n\ndef tensor2np(tensor_obj):\n    # change dimension of a tensor object into a numpy array\n    return tensor_obj[0].cpu().float().numpy().transpose((1,2,0))\n\ndef np2tensor(np_obj):\n    # change dimension of np array into tensor array\n    return torch.Tensor(np_obj[:, :, :, np.newaxis].transpose((3, 2, 0, 1)))\n\ndef tensor2tensorlab(image_tensor,to_norm=True,mc_only=False):\n    # image tensor to lab tensor\n    from skimage import color\n\n    img = tensor2im(image_tensor)\n    img_lab = color.rgb2lab(img)\n    if(mc_only):\n        img_lab[:,:,0] = img_lab[:,:,0]-50\n    if(to_norm and not mc_only):\n        img_lab[:,:,0] = img_lab[:,:,0]-50\n        img_lab = img_lab/100.\n\n    return np2tensor(img_lab)\n\ndef tensorlab2tensor(lab_tensor,return_inbnd=False):\n    from skimage import color\n    import warnings\n    warnings.filterwarnings(\"ignore\")\n\n    lab = tensor2np(lab_tensor)*100.\n    lab[:,:,0] = lab[:,:,0]+50\n\n    rgb_back = 255.*np.clip(color.lab2rgb(lab.astype('float')),0,1)\n    if(return_inbnd):\n        # convert back to lab, see if we match\n        lab_back = color.rgb2lab(rgb_back.astype('uint8'))\n        mask = 1.*np.isclose(lab_back,lab,atol=2.)\n        mask = np2tensor(np.prod(mask,axis=2)[:,:,np.newaxis])\n        return (im2tensor(rgb_back),mask)\n    else:\n        return im2tensor(rgb_back)\n\ndef load_image(path):\n    if(path[-3:] == 'dng'):\n        import rawpy\n        with rawpy.imread(path) as raw:\n            img = raw.postprocess()\n    elif(path[-3:]=='bmp' or path[-3:]=='jpg' or path[-3:]=='png' or path[-4:]=='jpeg'):\n        import cv2\n        return cv2.imread(path)[:,:,::-1]\n    else:\n        # plt is not imported at module level; import it here like rawpy and cv2 above\n        import matplotlib.pyplot as plt\n        img = 
(255*plt.imread(path)[:,:,:3]).astype('uint8')\n\n    return img\n\ndef rgb2lab(input):\n    from skimage import color\n    return color.rgb2lab(input / 255.)\n\ndef tensor2im(image_tensor, imtype=np.uint8, cent=1., factor=255./2.):\n    image_numpy = image_tensor[0].cpu().float().numpy()\n    image_numpy = (np.transpose(image_numpy, (1, 2, 0)) + cent) * factor\n    return image_numpy.astype(imtype)\n\ndef im2tensor(image, imtype=np.uint8, cent=1., factor=255./2.):\n    return torch.Tensor((image / factor - cent)\n                        [:, :, :, np.newaxis].transpose((3, 2, 0, 1)))\n\ndef tensor2vec(vector_tensor):\n    return vector_tensor.data.cpu().numpy()[:, :, 0, 0]\n\n\ndef tensor2im(image_tensor, imtype=np.uint8, cent=1., factor=255./2.):\n# def tensor2im(image_tensor, imtype=np.uint8, cent=1., factor=1.):\n    image_numpy = image_tensor[0].cpu().float().numpy()\n    image_numpy = (np.transpose(image_numpy, (1, 2, 0)) + cent) * factor\n    return image_numpy.astype(imtype)\n\ndef im2tensor(image, imtype=np.uint8, cent=1., factor=255./2.):\n# def im2tensor(image, imtype=np.uint8, cent=1., factor=1.):\n    return torch.Tensor((image / factor - cent)\n                        [:, :, :, np.newaxis].transpose((3, 2, 0, 1)))\n\n\n\ndef voc_ap(rec, prec, use_07_metric=False):\n    \"\"\" ap = voc_ap(rec, prec, [use_07_metric])\n    Compute VOC AP given precision and recall.\n    If use_07_metric is true, uses the\n    VOC 07 11 point method (default:False).\n    \"\"\"\n    if use_07_metric:\n        # 11 point metric\n        ap = 0.\n        for t in np.arange(0., 1.1, 0.1):\n            if np.sum(rec >= t) == 0:\n                p = 0\n            else:\n                p = np.max(prec[rec >= t])\n            ap = ap + p / 11.\n    else:\n        # correct AP calculation\n        # first append sentinel values at the end\n        mrec = np.concatenate(([0.], rec, [1.]))\n        mpre = np.concatenate(([0.], prec, [0.]))\n\n        # compute the precision 
envelope\n        for i in range(mpre.size - 1, 0, -1):\n            mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])\n\n        # to calculate area under PR curve, look for points\n        # where X axis (recall) changes value\n        i = np.where(mrec[1:] != mrec[:-1])[0]\n\n        # and sum (\\Delta recall) * prec\n        ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])\n    return ap\n\n"
  },
  {
    "path": "metrics/lpips/lpips.py",
    "content": "\nfrom __future__ import absolute_import\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.init as init\nfrom torch.autograd import Variable\nimport numpy as np\nfrom . import pretrained_networks as pn\nimport torch.nn\n\nfrom .. import lpips\n\ndef spatial_average(in_tens, keepdim=True):\n    if torch.__version__ == '0.4.0':\n        if keepdim:\n            return in_tens.mean(2, keepdim=True).mean(3, keepdim=True)\n        else:\n            in_tens.mean(2).mean(2)\n    return in_tens.mean([2,3],keepdim=keepdim)\n\ndef upsample(in_tens, out_HW=(64,64)): # assumes scale factor is same for H and W\n    in_H, in_W = in_tens.shape[2], in_tens.shape[3]\n    return nn.Upsample(size=out_HW, mode='bilinear', align_corners=False)(in_tens)\n\n# Learned perceptual metric\nclass LPIPS(nn.Module):\n    def __init__(self, pretrained=True, net='alex', version='0.1', lpips=True, spatial=False, \n        pnet_rand=False, pnet_tune=False, use_dropout=True, model_path=None, eval_mode=True, verbose=False):\n        # lpips - [True] means with linear calibration on top of base network\n        # pretrained - [True] means load linear weights\n\n        super(LPIPS, self).__init__()\n        if(verbose):\n            print('Setting up [%s] perceptual loss: trunk [%s], v[%s], spatial [%s]'%\n                ('LPIPS' if lpips else 'baseline', net, version, 'on' if spatial else 'off'))\n\n        self.pnet_type = net\n        self.pnet_tune = pnet_tune\n        self.pnet_rand = pnet_rand\n        self.spatial = spatial\n        self.lpips = lpips # false means baseline of just averaging all layers\n        self.version = version\n        self.scaling_layer = ScalingLayer()\n\n        if(self.pnet_type in ['vgg','vgg16']):\n            net_type = pn.vgg16\n            self.chns = [64,128,256,512,512]\n        elif(self.pnet_type=='alex'):\n            net_type = pn.alexnet\n            self.chns = [64,192,384,256,256]\n        elif(self.pnet_type=='squeeze'):\n       
     net_type = pn.squeezenet\n            self.chns = [64,128,256,384,384,512,512]\n        self.L = len(self.chns)\n\n        self.net = net_type(pretrained=not self.pnet_rand, requires_grad=self.pnet_tune)\n\n        if(lpips):\n            self.lin0 = NetLinLayer(self.chns[0], use_dropout=use_dropout)\n            self.lin1 = NetLinLayer(self.chns[1], use_dropout=use_dropout)\n            self.lin2 = NetLinLayer(self.chns[2], use_dropout=use_dropout)\n            self.lin3 = NetLinLayer(self.chns[3], use_dropout=use_dropout)\n            self.lin4 = NetLinLayer(self.chns[4], use_dropout=use_dropout)\n            self.lins = [self.lin0,self.lin1,self.lin2,self.lin3,self.lin4]\n            if(self.pnet_type=='squeeze'): # 7 layers for squeezenet\n                self.lin5 = NetLinLayer(self.chns[5], use_dropout=use_dropout)\n                self.lin6 = NetLinLayer(self.chns[6], use_dropout=use_dropout)\n                self.lins+=[self.lin5,self.lin6]\n            self.lins = nn.ModuleList(self.lins)\n\n            if(pretrained):\n                if(model_path is None):\n                    import inspect\n                    import os\n                    model_path = os.path.abspath(os.path.join(inspect.getfile(self.__init__), '..', 'weights/v%s/%s.pth'%(version,net)))\n\n                if(verbose):\n                    print('Loading model from: %s'%model_path)\n                self.load_state_dict(torch.load(model_path, map_location='cpu'), strict=False)          \n\n        if(eval_mode):\n            self.eval()\n\n    def forward(self, in0, in1, retPerLayer=False, normalize=False):\n        if normalize: # turn on this flag if input is [0,1] so it can be adjusted to [-1, +1]\n            in0 = 2 * in0  - 1\n            in1 = 2 * in1  - 1\n\n        # v0.0 - original release had a bug, where input was not scaled\n        in0_input, in1_input = (self.scaling_layer(in0), self.scaling_layer(in1)) if self.version=='0.1' else (in0, in1)\n        outs0, outs1 = 
self.net.forward(in0_input), self.net.forward(in1_input)\n        feats0, feats1, diffs = {}, {}, {}\n\n        for kk in range(self.L):\n            feats0[kk], feats1[kk] = lpips.normalize_tensor(outs0[kk]), lpips.normalize_tensor(outs1[kk])\n            diffs[kk] = (feats0[kk]-feats1[kk])**2\n\n        if(self.lpips):\n            if(self.spatial):\n                res = [upsample(self.lins[kk](diffs[kk]), out_HW=in0.shape[2:]) for kk in range(self.L)]\n            else:\n                res = [spatial_average(self.lins[kk](diffs[kk]), keepdim=True) for kk in range(self.L)]\n        else:\n            if(self.spatial):\n                res = [upsample(diffs[kk].sum(dim=1,keepdim=True), out_HW=in0.shape[2:]) for kk in range(self.L)]\n            else:\n                res = [spatial_average(diffs[kk].sum(dim=1,keepdim=True), keepdim=True) for kk in range(self.L)]\n\n        # val = res[0]\n        # for l in range(1,self.L):\n        #     val += res[l]\n        #     print(val)\n\n        # a = spatial_average(self.lins[kk](diffs[kk]), keepdim=True)\n        # b = torch.max(self.lins[kk](feats0[kk]**2))\n        # for kk in range(self.L):\n        #     a += spatial_average(self.lins[kk](diffs[kk]), keepdim=True)\n        #     b = torch.max(b,torch.max(self.lins[kk](feats0[kk]**2)))\n        # a = a/self.L\n        # from IPython import embed\n        # embed()\n        # return 10*torch.log10(b/a)\n        \n        # if(retPerLayer):\n        #     return (val, res)\n        # else:\n        # return torch.sum(torch.cat(res, 1), dim=(1,2,3), keepdims=False)\n        return torch.cat(res, 1).sum(dim=1).sum(dim=1).sum(dim=1)\n\n\nclass ScalingLayer(nn.Module):\n    def __init__(self):\n        super(ScalingLayer, self).__init__()\n        self.register_buffer('shift', torch.Tensor([-.030,-.088,-.188])[None,:,None,None])\n        self.register_buffer('scale', torch.Tensor([.458,.448,.450])[None,:,None,None])\n\n    def forward(self, inp):\n        return (inp - 
self.shift) / self.scale\n\n\nclass NetLinLayer(nn.Module):\n    ''' A single linear layer which does a 1x1 conv '''\n    def __init__(self, chn_in, chn_out=1, use_dropout=False):\n        super(NetLinLayer, self).__init__()\n\n        layers = [nn.Dropout(),] if(use_dropout) else []\n        layers += [nn.Conv2d(chn_in, chn_out, 1, stride=1, padding=0, bias=False),]\n        self.model = nn.Sequential(*layers)\n\n    def forward(self, x):\n        return self.model(x)\n\nclass Dist2LogitLayer(nn.Module):\n    ''' takes 2 distances, puts through fc layers, spits out value between [0,1] (if use_sigmoid is True) '''\n    def __init__(self, chn_mid=32, use_sigmoid=True):\n        super(Dist2LogitLayer, self).__init__()\n\n        layers = [nn.Conv2d(5, chn_mid, 1, stride=1, padding=0, bias=True),]\n        layers += [nn.LeakyReLU(0.2,True),]\n        layers += [nn.Conv2d(chn_mid, chn_mid, 1, stride=1, padding=0, bias=True),]\n        layers += [nn.LeakyReLU(0.2,True),]\n        layers += [nn.Conv2d(chn_mid, 1, 1, stride=1, padding=0, bias=True),]\n        if(use_sigmoid):\n            layers += [nn.Sigmoid(),]\n        self.model = nn.Sequential(*layers)\n\n    def forward(self,d0,d1,eps=0.1):\n        return self.model.forward(torch.cat((d0,d1,d0-d1,d0/(d1+eps),d1/(d0+eps)),dim=1))\n\nclass BCERankingLoss(nn.Module):\n    def __init__(self, chn_mid=32):\n        super(BCERankingLoss, self).__init__()\n        self.net = Dist2LogitLayer(chn_mid=chn_mid)\n        # self.parameters = list(self.net.parameters())\n        self.loss = torch.nn.BCELoss()\n\n    def forward(self, d0, d1, judge):\n        per = (judge+1.)/2.\n        self.logit = self.net.forward(d0,d1)\n        return self.loss(self.logit, per)\n\n# L2, DSSIM metrics\nclass FakeNet(nn.Module):\n    def __init__(self, use_gpu=True, colorspace='Lab'):\n        super(FakeNet, self).__init__()\n        self.use_gpu = use_gpu\n        self.colorspace = colorspace\n\nclass L2(FakeNet):\n    def forward(self, in0, 
in1, retPerLayer=None):\n        assert(in0.size()[0]==1) # currently only supports batchSize 1\n\n        if(self.colorspace=='RGB'):\n            (N,C,X,Y) = in0.size()\n            value = torch.mean(torch.mean(torch.mean((in0-in1)**2,dim=1).view(N,1,X,Y),dim=2).view(N,1,1,Y),dim=3).view(N)\n            return value\n        elif(self.colorspace=='Lab'):\n            value = lpips.l2(lpips.tensor2np(lpips.tensor2tensorlab(in0.data,to_norm=False)), \n                lpips.tensor2np(lpips.tensor2tensorlab(in1.data,to_norm=False)), range=100.).astype('float')\n            ret_var = Variable( torch.Tensor((value,) ) )\n            if(self.use_gpu):\n                ret_var = ret_var.cuda()\n            return ret_var\n\nclass DSSIM(FakeNet):\n\n    def forward(self, in0, in1, retPerLayer=None):\n        assert(in0.size()[0]==1) # currently only supports batchSize 1\n\n        if(self.colorspace=='RGB'):\n            value = lpips.dssim(1.*lpips.tensor2im(in0.data), 1.*lpips.tensor2im(in1.data), range=255.).astype('float')\n        elif(self.colorspace=='Lab'):\n            value = lpips.dssim(lpips.tensor2np(lpips.tensor2tensorlab(in0.data,to_norm=False)), \n                lpips.tensor2np(lpips.tensor2tensorlab(in1.data,to_norm=False)), range=100.).astype('float')\n        ret_var = Variable( torch.Tensor((value,) ) )\n        if(self.use_gpu):\n            ret_var = ret_var.cuda()\n        return ret_var\n\ndef print_network(net):\n    num_params = 0\n    for param in net.parameters():\n        num_params += param.numel()\n    print('Network',net)\n    print('Total number of parameters: %d' % num_params)\n"
  },
  {
    "path": "metrics/lpips/pretrained_networks.py",
    "content": "from collections import namedtuple\nimport torch\nfrom torchvision import models as tv\n\nclass squeezenet(torch.nn.Module):\n    def __init__(self, requires_grad=False, pretrained=True):\n        super(squeezenet, self).__init__()\n        pretrained_features = tv.squeezenet1_1(pretrained=pretrained).features\n        self.slice1 = torch.nn.Sequential()\n        self.slice2 = torch.nn.Sequential()\n        self.slice3 = torch.nn.Sequential()\n        self.slice4 = torch.nn.Sequential()\n        self.slice5 = torch.nn.Sequential()\n        self.slice6 = torch.nn.Sequential()\n        self.slice7 = torch.nn.Sequential()\n        self.N_slices = 7\n        for x in range(2):\n            self.slice1.add_module(str(x), pretrained_features[x])\n        for x in range(2,5):\n            self.slice2.add_module(str(x), pretrained_features[x])\n        for x in range(5, 8):\n            self.slice3.add_module(str(x), pretrained_features[x])\n        for x in range(8, 10):\n            self.slice4.add_module(str(x), pretrained_features[x])\n        for x in range(10, 11):\n            self.slice5.add_module(str(x), pretrained_features[x])\n        for x in range(11, 12):\n            self.slice6.add_module(str(x), pretrained_features[x])\n        for x in range(12, 13):\n            self.slice7.add_module(str(x), pretrained_features[x])\n        if not requires_grad:\n            for param in self.parameters():\n                param.requires_grad = False\n\n    def forward(self, X):\n        h = self.slice1(X)\n        h_relu1 = h\n        h = self.slice2(h)\n        h_relu2 = h\n        h = self.slice3(h)\n        h_relu3 = h\n        h = self.slice4(h)\n        h_relu4 = h\n        h = self.slice5(h)\n        h_relu5 = h\n        h = self.slice6(h)\n        h_relu6 = h\n        h = self.slice7(h)\n        h_relu7 = h\n        vgg_outputs = namedtuple(\"SqueezeOutputs\", ['relu1','relu2','relu3','relu4','relu5','relu6','relu7'])\n        out = 
vgg_outputs(h_relu1,h_relu2,h_relu3,h_relu4,h_relu5,h_relu6,h_relu7)\n\n        return out\n\n\nclass alexnet(torch.nn.Module):\n    def __init__(self, requires_grad=False, pretrained=True):\n        super(alexnet, self).__init__()\n        alexnet_pretrained_features = tv.alexnet(pretrained=pretrained).features\n        self.slice1 = torch.nn.Sequential()\n        self.slice2 = torch.nn.Sequential()\n        self.slice3 = torch.nn.Sequential()\n        self.slice4 = torch.nn.Sequential()\n        self.slice5 = torch.nn.Sequential()\n        self.N_slices = 5\n        for x in range(2):\n            self.slice1.add_module(str(x), alexnet_pretrained_features[x])\n        for x in range(2, 5):\n            self.slice2.add_module(str(x), alexnet_pretrained_features[x])\n        for x in range(5, 8):\n            self.slice3.add_module(str(x), alexnet_pretrained_features[x])\n        for x in range(8, 10):\n            self.slice4.add_module(str(x), alexnet_pretrained_features[x])\n        for x in range(10, 12):\n            self.slice5.add_module(str(x), alexnet_pretrained_features[x])\n        if not requires_grad:\n            for param in self.parameters():\n                param.requires_grad = False\n\n    def forward(self, X):\n        h = self.slice1(X)\n        h_relu1 = h\n        h = self.slice2(h)\n        h_relu2 = h\n        h = self.slice3(h)\n        h_relu3 = h\n        h = self.slice4(h)\n        h_relu4 = h\n        h = self.slice5(h)\n        h_relu5 = h\n        alexnet_outputs = namedtuple(\"AlexnetOutputs\", ['relu1', 'relu2', 'relu3', 'relu4', 'relu5'])\n        out = alexnet_outputs(h_relu1, h_relu2, h_relu3, h_relu4, h_relu5)\n\n        return out\n\nclass vgg16(torch.nn.Module):\n    def __init__(self, requires_grad=False, pretrained=True):\n        super(vgg16, self).__init__()\n        vgg_pretrained_features = tv.vgg16(pretrained=pretrained).features\n        self.slice1 = torch.nn.Sequential()\n        self.slice2 = 
torch.nn.Sequential()\n        self.slice3 = torch.nn.Sequential()\n        self.slice4 = torch.nn.Sequential()\n        self.slice5 = torch.nn.Sequential()\n        self.N_slices = 5\n        for x in range(4):\n            self.slice1.add_module(str(x), vgg_pretrained_features[x])\n        for x in range(4, 9):\n            self.slice2.add_module(str(x), vgg_pretrained_features[x])\n        for x in range(9, 16):\n            self.slice3.add_module(str(x), vgg_pretrained_features[x])\n        for x in range(16, 23):\n            self.slice4.add_module(str(x), vgg_pretrained_features[x])\n        for x in range(23, 30):\n            self.slice5.add_module(str(x), vgg_pretrained_features[x])\n        if not requires_grad:\n            for param in self.parameters():\n                param.requires_grad = False\n\n    def forward(self, X):\n        h = self.slice1(X)\n        h_relu1_2 = h\n        h = self.slice2(h)\n        h_relu2_2 = h\n        h = self.slice3(h)\n        h_relu3_3 = h\n        h = self.slice4(h)\n        h_relu4_3 = h\n        h = self.slice5(h)\n        h_relu5_3 = h\n        vgg_outputs = namedtuple(\"VggOutputs\", ['relu1_2', 'relu2_2', 'relu3_3', 'relu4_3', 'relu5_3'])\n        out = vgg_outputs(h_relu1_2, h_relu2_2, h_relu3_3, h_relu4_3, h_relu5_3)\n\n        return out\n\n\n\nclass resnet(torch.nn.Module):\n    def __init__(self, requires_grad=False, pretrained=True, num=18):\n        super(resnet, self).__init__()\n        if(num==18):\n            self.net = tv.resnet18(pretrained=pretrained)\n        elif(num==34):\n            self.net = tv.resnet34(pretrained=pretrained)\n        elif(num==50):\n            self.net = tv.resnet50(pretrained=pretrained)\n        elif(num==101):\n            self.net = tv.resnet101(pretrained=pretrained)\n        elif(num==152):\n            self.net = tv.resnet152(pretrained=pretrained)\n        self.N_slices = 5\n\n        self.conv1 = self.net.conv1\n        self.bn1 = self.net.bn1\n        
self.relu = self.net.relu\n        self.maxpool = self.net.maxpool\n        self.layer1 = self.net.layer1\n        self.layer2 = self.net.layer2\n        self.layer3 = self.net.layer3\n        self.layer4 = self.net.layer4\n\n    def forward(self, X):\n        h = self.conv1(X)\n        h = self.bn1(h)\n        h = self.relu(h)\n        h_relu1 = h\n        h = self.maxpool(h)\n        h = self.layer1(h)\n        h_conv2 = h\n        h = self.layer2(h)\n        h_conv3 = h\n        h = self.layer3(h)\n        h_conv4 = h\n        h = self.layer4(h)\n        h_conv5 = h\n\n        outputs = namedtuple(\"Outputs\", ['relu1','conv2','conv3','conv4','conv5'])\n        out = outputs(h_relu1, h_conv2, h_conv3, h_conv4, h_conv5)\n\n        return out\n"
  },
  {
    "path": "metrics/pytorch_ssim/__init__.py",
    "content": "import torch\nimport torch.nn.functional as F\nfrom torch.autograd import Variable\nimport numpy as np\nfrom math import exp\n\ndef gaussian(window_size, sigma):\n    gauss = torch.Tensor([exp(-(x - window_size//2)**2/float(2*sigma**2)) for x in range(window_size)])\n    return gauss/gauss.sum()\n\ndef create_window(window_size, channel):\n    _1D_window = gaussian(window_size, 1.5).unsqueeze(1)\n    _2D_window = _1D_window.mm(_1D_window.t()).float().unsqueeze(0).unsqueeze(0)\n    window = Variable(_2D_window.expand(channel, 1, window_size, window_size).contiguous())\n    return window\n\n\ndef create_window_3d(window_size, channel=1):\n    _1D_window = gaussian(window_size, 1.5).unsqueeze(1)\n    _2D_window = _1D_window.mm(_1D_window.t())\n    _3D_window = _2D_window.unsqueeze(2) @ (_1D_window.t())\n    window = _3D_window.expand(1, channel, window_size, window_size, window_size).contiguous().cuda()\n    return window\n\n\ndef _ssim(img1, img2, window, window_size, channel, size_average = True):\n    mu1 = F.conv2d(img1, window, padding = window_size//2, groups = channel)\n    mu2 = F.conv2d(img2, window, padding = window_size//2, groups = channel)\n\n    mu1_sq = mu1.pow(2)\n    mu2_sq = mu2.pow(2)\n    mu1_mu2 = mu1*mu2\n\n    sigma1_sq = F.conv2d(img1*img1, window, padding = window_size//2, groups = channel) - mu1_sq\n    sigma2_sq = F.conv2d(img2*img2, window, padding = window_size//2, groups = channel) - mu2_sq\n    sigma12 = F.conv2d(img1*img2, window, padding = window_size//2, groups = channel) - mu1_mu2\n\n    C1 = 0.01**2\n    C2 = 0.03**2\n\n    ssim_map = ((2*mu1_mu2 + C1)*(2*sigma12 + C2))/((mu1_sq + mu2_sq + C1)*(sigma1_sq + sigma2_sq + C2))\n\n    if size_average:\n        return ssim_map.mean()\n    else:\n        return ssim_map.mean(1).mean(1).mean(1)\n\n\ndef ssim_matlab(img1, img2, window_size=11, window=None, size_average=True, full=False, val_range=255):\n    # Value range can be different from 255. 
Other common ranges are 1 (sigmoid) and 2 (tanh).\n    if val_range is None:\n        if torch.max(img1) > 128:\n            max_val = 255\n        else:\n            max_val = 1\n\n        if torch.min(img1) < -0.5:\n            min_val = -1\n        else:\n            min_val = 0\n        L = max_val - min_val\n    else:\n        L = val_range\n\n    padd = 0\n    (_, _, height, width) = img1.size()\n    if window is None:\n        real_size = min(window_size, height, width)\n        window = create_window_3d(real_size, channel=1).to(img1.device)\n        # Channel is set to 1 since we consider color images as volumetric images\n\n    img1 = img1.unsqueeze(1)\n    img2 = img2.unsqueeze(1)\n\n    mu1 = F.conv3d(F.pad(img1, (5, 5, 5, 5, 5, 5), mode='replicate'), window, padding=padd, groups=1)\n    mu2 = F.conv3d(F.pad(img2, (5, 5, 5, 5, 5, 5), mode='replicate'), window, padding=padd, groups=1)\n\n    mu1_sq = mu1.pow(2)\n    mu2_sq = mu2.pow(2)\n    mu1_mu2 = mu1 * mu2\n\n    sigma1_sq = F.conv3d(F.pad(img1 * img1, (5, 5, 5, 5, 5, 5), 'replicate'), window, padding=padd, groups=1) - mu1_sq\n    sigma2_sq = F.conv3d(F.pad(img2 * img2, (5, 5, 5, 5, 5, 5), 'replicate'), window, padding=padd, groups=1) - mu2_sq\n    sigma12 = F.conv3d(F.pad(img1 * img2, (5, 5, 5, 5, 5, 5), 'replicate'), window, padding=padd, groups=1) - mu1_mu2\n\n    C1 = (0.01 * L) ** 2\n    C2 = (0.03 * L) ** 2\n\n    v1 = 2.0 * sigma12 + C2\n    v2 = sigma1_sq + sigma2_sq + C2\n    cs = torch.mean(v1 / v2)  # contrast sensitivity\n\n    ssim_map = ((2 * mu1_mu2 + C1) * v1) / ((mu1_sq + mu2_sq + C1) * v2)\n\n    if size_average:\n        ret = ssim_map.mean()\n    else:\n        ret = ssim_map.mean(1).mean(1).mean(1).mean(1)\n\n    if full:\n        return ret, cs\n    return ret\n\n\nclass SSIM(torch.nn.Module):\n    def __init__(self, window_size = 11, size_average = True):\n        super(SSIM, self).__init__()\n        self.window_size = window_size\n        self.size_average = size_average\n     
   self.channel = 1\n        self.window = create_window(window_size, self.channel)\n\n    def forward(self, img1, img2):\n        (_, channel, _, _) = img1.size()\n\n        if channel == self.channel and self.window.data.type() == img1.data.type():\n            window = self.window\n        else:\n            window = create_window(self.window_size, channel)\n            \n            if img1.is_cuda:\n                window = window.cuda(img1.get_device())\n            window = window.type_as(img1)\n            \n            self.window = window\n            self.channel = channel\n\n\n        return _ssim(img1, img2, window, self.window_size, channel, self.size_average)\n\ndef ssim(img1, img2, window_size = 11, size_average = True):\n    (_, channel, _, _) = img1.size()\n    window = create_window(window_size, channel)\n    \n    if img1.is_cuda:\n        window = window.cuda(img1.get_device())\n    window = window.type_as(img1)\n    \n    return _ssim(img1, img2, window, window_size, channel, size_average)\n"
  },
  {
    "path": "setup.py",
    "content": "from setuptools import setup, find_packages\n\nsetup(\n    name='latent-diffusion',\n    version='0.0.1',\n    description='',\n    packages=find_packages(),\n    install_requires=[\n        'torch',\n        'numpy',\n        'tqdm',\n    ],\n)"
  },
  {
    "path": "utility.py",
    "content": "import torch\nimport torch.nn.functional as F\nimport numpy as np\nimport cv2\nfrom metrics import pytorch_ssim, lpips, flolpips\n\n\n\ndef read_frame_yuv2rgb(stream, width, height, iFrame, bit_depth, pix_fmt='420'):\n    if pix_fmt == '420':\n        multiplier = 1\n        uv_factor = 2\n    elif pix_fmt == '444':\n        multiplier = 2\n        uv_factor = 1\n    else:\n        print('Pixel format {} is not supported'.format(pix_fmt))\n        return\n\n    if bit_depth == 8:\n        datatype = np.uint8\n        # seek() needs an integer byte offset, so cast the float product\n        stream.seek(int(iFrame*1.5*width*height*multiplier))\n        Y = np.fromfile(stream, dtype=datatype, count=width*height).reshape((height, width))\n        \n        # read chroma planes (subsampled by uv_factor: 2 for 4:2:0, 1 for 4:4:4)\n        U = np.fromfile(stream, dtype=datatype, count=(width//uv_factor)*(height//uv_factor)).\\\n                                reshape((height//uv_factor, width//uv_factor))\n        V = np.fromfile(stream, dtype=datatype, count=(width//uv_factor)*(height//uv_factor)).\\\n                                reshape((height//uv_factor, width//uv_factor))\n\n    else:\n        datatype = np.uint16\n        stream.seek(iFrame*3*width*height*multiplier)\n        Y = np.fromfile(stream, dtype=datatype, count=width*height).reshape((height, width))\n                \n        U = np.fromfile(stream, dtype=datatype, count=(width//uv_factor)*(height//uv_factor)).\\\n                                reshape((height//uv_factor, width//uv_factor))\n        V = np.fromfile(stream, dtype=datatype, count=(width//uv_factor)*(height//uv_factor)).\\\n                                reshape((height//uv_factor, width//uv_factor))\n\n    if pix_fmt == '420':\n        yuv = np.empty((height*3//2, width), dtype=datatype)\n        yuv[0:height,:] = Y\n\n        yuv[height:height+height//4,:] = U.reshape(-1, width)\n        yuv[height+height//4:,:] = V.reshape(-1, width)\n\n        if bit_depth != 8:\n            yuv = 
(yuv/(2**bit_depth-1)*255).astype(np.uint8)\n\n        #convert to rgb\n        rgb = cv2.cvtColor(yuv, cv2.COLOR_YUV2RGB_I420)\n    \n    else:\n        yvu = np.stack([Y,V,U],axis=2)\n        if bit_depth != 8:\n            yvu = (yvu/(2**bit_depth-1)*255).astype(np.uint8)\n        rgb = cv2.cvtColor(yvu, cv2.COLOR_YCrCb2RGB)\n\n    return rgb\n\n\n\n\ndef CharbonnierFunc(data, epsilon=0.001):\n    return torch.mean(torch.sqrt(data ** 2 + epsilon ** 2))\n\n\ndef moduleNormalize(frame):\n    return torch.cat([(frame[:, 0:1, :, :] - 0.4631), (frame[:, 1:2, :, :] - 0.4352), (frame[:, 2:3, :, :] - 0.3990)], 1)\n\n\ndef gaussian_kernel(sz, sigma):\n    k = torch.arange(-(sz-1)/2, (sz+1)/2)\n    k = torch.exp(-1.0/(2*sigma**2) * k**2)\n    k = k.reshape(-1, 1) * k.reshape(1, -1)\n    k = k / torch.sum(k)\n    return k\n\n\ndef quantize(imTensor):\n    return ((imTensor.clamp(-1.0, 1.0)+1.)/2.).mul(255).round()\n\n\ndef tensor2rgb(tensor):\n    \"\"\"\n    Convert GPU Tensor to RGB image (numpy array)\n    \"\"\"\n    out = []\n    for b in range(tensor.shape[0]):\n        out.append(np.moveaxis(quantize(tensor[b]).cpu().detach().numpy(), 0, 2).astype(np.uint8))\n    return np.array(out) #(B,H,W,C)\n\n\ndef calc_psnr(gt, out, *args):\n    \"\"\"\n    args:\n    gt, out -- (B,3,H,W) cuda Tensors in [-1,1]\n    \"\"\"\n    mse = torch.mean((quantize(gt) - quantize(out))**2, dim=1).mean(1).mean(1)\n    return -10 * torch.log10(mse/255**2 + 1e-8) # (B,)\n\n\ndef calc_ssim(gt, out, *args):\n    return pytorch_ssim.ssim_matlab(quantize(gt), quantize(out), size_average=False)\n\n\ndef calc_lpips(gt, out, *args):\n    loss_fn = lpips.LPIPS(net='alex',version='0.1').cuda()\n    # return loss_fn.forward(gt, out, normalize=True)\n    return loss_fn.forward(quantize(gt)/255., quantize(out)/255., normalize=True)\n\n\ndef calc_flolpips(gt_list, out_list, inputs_list):\n    '''\n    gt, out - list of (B,3,H,W) cuda Tensors in [-1,1]\n    inputs - list of two (B,3,H,W) cuda Tensors in 
[-1,1]\n    e.g. gt can contain frames 1,3,5... while inputs contains frames 0,2,4,6...\n    '''\n    loss_fn = flolpips.FloLPIPS(net='alex',version='0.1').cuda()\n    flownet = flolpips.PWCNet().cuda()\n    \n    scores = []\n    for i in range(len(gt_list)):\n        frame_ref = (gt_list[i] + 1.) / 2.\n        frame_dis = (out_list[i] + 1.) / 2.\n        frame_prev = (inputs_list[i] + 1.) / 2. if i == 0 else frame_next\n        frame_next = (inputs_list[i+1] + 1.) / 2.\n    \n        with torch.no_grad():\n            feat_ref = flownet.extract_pyramid_single(frame_ref)\n            feat_dis = flownet.extract_pyramid_single(frame_dis)\n            feat_prev = flownet.extract_pyramid_single(frame_prev) if i == 0 else feat_next\n            feat_next = flownet.extract_pyramid_single(frame_next)\n\n            # middle frame paired with the next input frame of the triplet\n            flow_ref = flownet(frame_ref, frame_next, feat_ref, feat_next)\n            flow_dis = flownet(frame_dis, frame_next, feat_dis, feat_next)\n            flow_diff = flow_ref - flow_dis\n            scores.append(loss_fn.forward(frame_ref, frame_dis, flow_diff, normalize=True).item())\n\n            # middle frame paired with the previous input frame of the triplet\n            flow_ref = flownet(frame_ref, frame_prev, feat_ref, feat_prev)\n            flow_dis = flownet(frame_dis, frame_prev, feat_dis, feat_prev)\n            flow_diff = flow_ref - flow_dis\n            scores.append(loss_fn.forward(frame_ref, frame_dis, flow_diff, normalize=True).item())\n\n    return np.mean(scores)"
  }
]