Full Code of TomTomTommi/stereoanyvideo for AI

main 3a8beddc470c cached

83 files

513.1 KB

168.4k tokens

534 symbols

1 requests

Download .txt

Showing preview only (541K chars total). Download the full file or copy to clipboard to get everything.

Repository: TomTomTommi/stereoanyvideo
Branch: main
Commit: 3a8beddc470c
Files: 83
Total size: 513.1 KB

Directory structure:
gitextract_751vu01n/

├── LICENSE
├── README.md
├── assets/
│   └── 1
├── checkpoints/
│   └── checkpoints here.txt
├── data/
│   └── datasets/
│       └── dataset here.txt
├── datasets/
│   ├── augmentor.py
│   ├── frame_utils.py
│   └── video_datasets.py
├── demo.py
├── demo.sh
├── evaluate_stereoanyvideo.sh
├── evaluation/
│   ├── configs/
│   │   ├── eval_dynamic_replica.yaml
│   │   ├── eval_infinigensv.yaml
│   │   ├── eval_kittidepth.yaml
│   │   ├── eval_sintel_clean.yaml
│   │   ├── eval_sintel_final.yaml
│   │   ├── eval_southkensington.yaml
│   │   └── eval_vkitti2.yaml
│   ├── core/
│   │   └── evaluator.py
│   ├── evaluate.py
│   └── utils/
│       ├── eval_utils.py
│       ├── ssim.py
│       └── utils.py
├── models/
│   ├── Video-Depth-Anything/
│   │   ├── app.py
│   │   ├── get_weights.sh
│   │   ├── run.py
│   │   ├── utils/
│   │   │   ├── dc_utils.py
│   │   │   └── util.py
│   │   └── video_depth_anything/
│   │       ├── dinov2.py
│   │       ├── dinov2_layers/
│   │       │   ├── __init__.py
│   │       │   ├── attention.py
│   │       │   ├── block.py
│   │       │   ├── drop_path.py
│   │       │   ├── layer_scale.py
│   │       │   ├── mlp.py
│   │       │   ├── patch_embed.py
│   │       │   └── swiglu_ffn.py
│   │       ├── dpt.py
│   │       ├── dpt_temporal.py
│   │       ├── motion_module/
│   │       │   ├── attention.py
│   │       │   └── motion_module.py
│   │       ├── util/
│   │       │   ├── blocks.py
│   │       │   └── transform.py
│   │       └── video_depth.py
│   ├── core/
│   │   ├── attention.py
│   │   ├── corr.py
│   │   ├── extractor.py
│   │   ├── model_zoo.py
│   │   ├── stereoanyvideo.py
│   │   ├── update.py
│   │   └── utils/
│   │       ├── config.py
│   │       └── utils.py
│   ├── raft_model.py
│   └── stereoanyvideo_model.py
├── requirements.txt
├── third_party/
│   └── RAFT/
│       ├── LICENSE
│       ├── README.md
│       ├── alt_cuda_corr/
│       │   ├── correlation.cpp
│       │   ├── correlation_kernel.cu
│       │   └── setup.py
│       ├── chairs_split.txt
│       ├── core/
│       │   ├── __init__.py
│       │   ├── corr.py
│       │   ├── datasets.py
│       │   ├── extractor.py
│       │   ├── raft.py
│       │   ├── update.py
│       │   └── utils/
│       │       ├── __init__.py
│       │       ├── augmentor.py
│       │       ├── flow_viz.py
│       │       ├── frame_utils.py
│       │       └── utils.py
│       ├── demo.py
│       ├── download_models.sh
│       ├── evaluate.py
│       ├── train.py
│       ├── train_mixed.sh
│       └── train_standard.sh
├── train_stereoanyvideo.py
├── train_stereoanyvideo.sh
└── train_utils/
    ├── logger.py
    ├── losses.py
    └── utils.py

================================================
FILE CONTENTS
================================================

================================================
FILE: LICENSE
================================================
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


================================================
FILE: README.md
================================================
<h1 align='center' style="text-align:center; font-weight:bold; font-size:2.0em;letter-spacing:2.0px;">
Stereo Any Video: <br> Temporally Consistent Stereo Matching<h1>      

<div align="center">
  <a href="https://arxiv.org/abs/2503.05549" target="_blank" rel="external nofollow noopener">
  <img src="https://img.shields.io/badge/Paper-arXiv-deepgreen" alt="Paper arXiv"></a>
  <a href="https://tomtomtommi.github.io/StereoAnyVideo/" target="_blank" rel="external nofollow noopener">
  <img src="https://img.shields.io/badge/Project-Page-9cf" alt="Project Page"></a>
</div>
</p>

![Demo](./assets/stereoanyvideo.gif)

## Installation

Installation with cuda 12.2

<details>
  <summary>Setup the root for all source files</summary>
  <pre><code>
    git clone https://github.com/tomtomtommi/stereoanyvideo
    cd stereoanyvideo
    export PYTHONPATH=`(cd ../ && pwd)`:`pwd`:$PYTHONPATH
  </code></pre>
</details>

<details>
  <summary>Create a conda env</summary>
  <pre><code>
    conda create -n sav python=3.10
    conda activate sav
  </code></pre>
</details>

<details>
  <summary>Install requirements</summary>
  <pre><code>
    conda install pytorch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 pytorch-cuda=12.1 -c pytorch -c nvidia
    pip install pip==24.0
    pip install pytorch_lightning==1.6.0
    pip install iopath
    conda install -c bottler nvidiacub
    pip install scikit-image matplotlib imageio plotly opencv-python
    conda install -c fvcore -c conda-forge fvcore
    pip install black usort flake8 flake8-bugbear flake8-comprehensions
    conda install pytorch3d -c pytorch3d
    pip install -r requirements.txt
    pip install timm
  </code></pre>
</details>

<details>
  <summary>Download VDA checkpoints</summary>
  <pre><code>
    cd models/Video-Depth-Anything
    sh get_weights.sh
  </code></pre>
</details>

## Inference a stereo video

```
sh demo.sh
```
Before running, download the checkpoints on [google drive](https://drive.google.com/drive/folders/1c7L065dcBWhCYYjWYo2edGOG605PnpXv?usp=sharing) . 
Copy the checkpoints to `./checkpoints/`

In default, left and right camera videos are supposed to be structured like this:
```none
./demo_video/
        ├── left
            ├── left000000.png
            ├── left000001.png
            ├── left000002.png
            ...
        ├── right
            ├── right000000.png
            ├── right000001.png
            ├── right000002.png
            ...
```

A simple way to run the demo is using SouthKensingtonSV.

To test on your own data, modify `--path ./demo_video/`. More arguments can be found and modified in ` demo.py`

## Dataset

Download the following datasets and put in `./data/datasets/`:
 - [SceneFlow](https://lmb.informatik.uni-freiburg.de/resources/datasets/SceneFlowDatasets.en.html)
 - [Sintel](http://sintel.is.tue.mpg.de/stereo)
 - [Dynamic_Replica](https://dynamic-stereo.github.io/)
 - [KITTI Depth](https://www.cvlibs.net/datasets/kitti/eval_depth_all.php)
 - [Infinigen SV](https://tomtomtommi.github.io/BiDAVideo/)
 - [Virtual KITTI2](https://europe.naverlabs.com/proxy-virtual-worlds-vkitti-2/)
 - [SouthKensington SV](https://tomtomtommi.github.io/BiDAVideo/)


## Evaluation
```
sh evaluate_stereoanyvideo.sh
```

## Training
```
sh train_stereoanyvideo.sh
```

## Citation 
If you use our method in your research, please consider citing:
```
@inproceedings{jing2025stereo,
  title={Stereo any video: Temporally consistent stereo matching},
  author={Jing, Junpeng and Luo, Weixun and Mao, Ye and Mikolajczyk, Krystian},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={20836--20846},
  year={2025}
}
```


================================================
FILE: assets/1
================================================



================================================
FILE: checkpoints/checkpoints here.txt
================================================


================================================
FILE: data/datasets/dataset here.txt
================================================


================================================
FILE: datasets/augmentor.py
================================================
import numpy as np
import random
from PIL import Image

import cv2

cv2.setNumThreads(0)
cv2.ocl.setUseOpenCL(False)

from torchvision.transforms import ColorJitter, functional, Compose


class AdjustGamma(object):
    def __init__(self, gamma_min, gamma_max, gain_min=1.0, gain_max=1.0):
        self.gamma_min, self.gamma_max, self.gain_min, self.gain_max = (
            gamma_min,
            gamma_max,
            gain_min,
            gain_max,
        )

    def __call__(self, sample):
        gain = random.uniform(self.gain_min, self.gain_max)
        gamma = random.uniform(self.gamma_min, self.gamma_max)
        return functional.adjust_gamma(sample, gamma, gain)

    def __repr__(self):
        return f"Adjust Gamma {self.gamma_min}, ({self.gamma_max}) and Gain ({self.gain_min}, {self.gain_max})"


class SequenceDispFlowAugmentor:
    def __init__(
        self,
        crop_size,
        min_scale=-0.2,
        max_scale=0.5,
        do_flip=True,
        yjitter=False,
        saturation_range=[0.6, 1.4],
        gamma=[1, 1, 1, 1],
    ):
        # spatial augmentation params
        self.crop_size = crop_size
        self.min_scale = min_scale
        self.max_scale = max_scale
        self.spatial_aug_prob = 1.0
        self.stretch_prob = 0.8
        self.max_stretch = 0.2

        # flip augmentation params
        self.yjitter = yjitter
        self.do_flip = do_flip
        self.h_flip_prob = 0.5
        self.v_flip_prob = 0.1

        # photometric augmentation params
        self.photo_aug = Compose(
            [
                ColorJitter(
                    brightness=0.4,
                    contrast=0.4,
                    saturation=saturation_range,
                    hue=0.5 / 3.14,
                ),
                AdjustGamma(*gamma),
            ]
        )
        self.asymmetric_color_aug_prob = 0.2
        self.eraser_aug_prob = 0.5

    def color_transform(self, seq):
        """Photometric augmentation"""

        # asymmetric
        if np.random.rand() < self.asymmetric_color_aug_prob:
            for i in range(len(seq)):
                for cam in (0, 1):
                    seq[i][cam] = np.array(
                        self.photo_aug(Image.fromarray(seq[i][cam])), dtype=np.uint8
                    )
        # symmetric
        else:
            image_stack = np.concatenate(
                [seq[i][cam] for i in range(len(seq)) for cam in (0, 1)], axis=0
            )
            image_stack = np.array(
                self.photo_aug(Image.fromarray(image_stack)), dtype=np.uint8
            )
            split = np.split(image_stack, len(seq) * 2, axis=0)
            for i in range(len(seq)):
                seq[i][0] = split[2 * i]
                seq[i][1] = split[2 * i + 1]
        return seq

    def eraser_transform(self, seq, bounds=[50, 100]):
        """Occlusion augmentation"""
        ht, wd = seq[0][0].shape[:2]
        for i in range(len(seq)):
            for cam in (0, 1):
                if np.random.rand() < self.eraser_aug_prob:
                    mean_color = np.mean(seq[0][0].reshape(-1, 3), axis=0)
                    for _ in range(np.random.randint(1, 3)):
                        x0 = np.random.randint(0, wd)
                        y0 = np.random.randint(0, ht)
                        dx = np.random.randint(bounds[0], bounds[1])
                        dy = np.random.randint(bounds[0], bounds[1])
                        seq[i][cam][y0 : y0 + dy, x0 : x0 + dx, :] = mean_color

        return seq

    def spatial_transform(self, img, disp):
        # randomly sample scale
        ht, wd = img[0][0].shape[:2]
        min_scale = np.maximum(
            (self.crop_size[0] + 8) / float(ht), (self.crop_size[1] + 8) / float(wd)
        )

        scale = 2 ** np.random.uniform(self.min_scale, self.max_scale)
        scale_x = scale
        scale_y = scale
        if np.random.rand() < self.stretch_prob:
            scale_x *= 2 ** np.random.uniform(-self.max_stretch, self.max_stretch)
            scale_y *= 2 ** np.random.uniform(-self.max_stretch, self.max_stretch)

        scale_x = np.clip(scale_x, min_scale, None)
        scale_y = np.clip(scale_y, min_scale, None)

        if np.random.rand() < self.spatial_aug_prob:
            # rescale the images
            for i in range(len(img)):
                for cam in (0, 1):
                    img[i][cam] = cv2.resize(
                        img[i][cam],
                        None,
                        fx=scale_x,
                        fy=scale_y,
                        interpolation=cv2.INTER_LINEAR,
                    )
                    if len(disp[i]) > 0:
                        disp[i][cam] = cv2.resize(
                            disp[i][cam],
                            None,
                            fx=scale_x,
                            fy=scale_y,
                            interpolation=cv2.INTER_LINEAR,
                        )
                        disp[i][cam] = disp[i][cam] * [scale_x, scale_y]

        if self.yjitter:
            y0 = np.random.randint(2, img[0][0].shape[0] - self.crop_size[0] - 2)
            x0 = np.random.randint(2, img[0][0].shape[1] - self.crop_size[1] - 2)

            for i in range(len(img)):
                y1 = y0 + np.random.randint(-2, 2 + 1)
                img[i][0] = img[i][0][
                    y0 : y0 + self.crop_size[0], x0 : x0 + self.crop_size[1]
                ]
                img[i][1] = img[i][1][
                    y1 : y1 + self.crop_size[0], x0 : x0 + self.crop_size[1]
                ]
                if len(disp[i]) > 0:
                    disp[i][0] = disp[i][0][
                        y0 : y0 + self.crop_size[0], x0 : x0 + self.crop_size[1]
                    ]
                    disp[i][1] = disp[i][1][
                        y1 : y1 + self.crop_size[0], x0 : x0 + self.crop_size[1]
                    ]
        else:
            y0 = np.random.randint(0, img[0][0].shape[0] - self.crop_size[0])
            x0 = np.random.randint(0, img[0][0].shape[1] - self.crop_size[1])
            for i in range(len(img)):
                for cam in (0, 1):
                    img[i][cam] = img[i][cam][
                        y0 : y0 + self.crop_size[0], x0 : x0 + self.crop_size[1]
                    ]
                    if len(disp[i]) > 0:
                        disp[i][cam] = disp[i][cam][
                            y0 : y0 + self.crop_size[0], x0 : x0 + self.crop_size[1]
                        ]

        return img, disp

    def __call__(self, img, disp):
        img = self.color_transform(img)
        img = self.eraser_transform(img)
        img, disp = self.spatial_transform(img, disp)

        for i in range(len(img)):
            for cam in (0, 1):
                img[i][cam] = np.ascontiguousarray(img[i][cam])
                if len(disp[i]) > 0:
                    disp[i][cam] = np.ascontiguousarray(disp[i][cam])

        return img, disp


class SequenceDispSparseFlowAugmentor:
    def __init__(
        self,
        crop_size,
        min_scale=-0.2,
        max_scale=0.5,
        do_flip=True,
        yjitter=False,
        saturation_range=[0.6, 1.4],
        gamma=[1, 1, 1, 1],
    ):
        # spatial augmentation params
        self.crop_size = crop_size
        self.min_scale = min_scale
        self.max_scale = max_scale
        self.spatial_aug_prob = 1.0
        self.stretch_prob = 0.8
        self.max_stretch = 0.2

        # flip augmentation params
        self.yjitter = yjitter
        self.do_flip = do_flip
        self.h_flip_prob = 0.5
        self.v_flip_prob = 0.1

        # photometric augmentation params
        self.photo_aug = Compose(
            [
                ColorJitter(
                    brightness=0.4,
                    contrast=0.4,
                    saturation=saturation_range,
                    hue=0.5 / 3.14,
                ),
                AdjustGamma(*gamma),
            ]
        )
        self.asymmetric_color_aug_prob = 0.2
        self.eraser_aug_prob = 0.5

    def color_transform(self, seq):
        """Photometric augmentation"""
        # symmetric
        image_stack = np.concatenate(
            [seq[i][cam] for i in range(len(seq)) for cam in (0, 1)], axis=0
        )
        image_stack = np.array(
            self.photo_aug(Image.fromarray(image_stack)), dtype=np.uint8
        )
        split = np.split(image_stack, len(seq) * 2, axis=0)
        for i in range(len(seq)):
            seq[i][0] = split[2 * i]
            seq[i][1] = split[2 * i + 1]
        return seq

    def eraser_transform(self, seq, bounds=[50, 100]):
        """Occlusion augmentation"""
        ht, wd = seq[0][0].shape[:2]
        for i in range(len(seq)):
            for cam in (0, 1):
                if np.random.rand() < self.eraser_aug_prob:
                    mean_color = np.mean(seq[0][0].reshape(-1, 3), axis=0)
                    for _ in range(np.random.randint(1, 3)):
                        x0 = np.random.randint(0, wd)
                        y0 = np.random.randint(0, ht)
                        dx = np.random.randint(bounds[0], bounds[1])
                        dy = np.random.randint(bounds[0], bounds[1])
                        seq[i][cam][y0 : y0 + dy, x0 : x0 + dx, :] = mean_color

        return seq

    def resize_sparse_flow_map(self, flow, valid, fx=1.0, fy=1.0):
        ht, wd = flow.shape[:2]
        coords = np.meshgrid(np.arange(wd), np.arange(ht))
        coords = np.stack(coords, axis=-1)

        coords = coords.reshape(-1, 2).astype(np.float32)
        flow = flow.reshape(-1, 2).astype(np.float32)
        valid = valid.reshape(-1).astype(np.float32)

        coords0 = coords[valid>=1]
        flow0 = flow[valid>=1]

        ht1 = int(round(ht * fy))
        wd1 = int(round(wd * fx))

        coords1 = coords0 * [fx, fy]
        flow1 = flow0 * [fx, fy]

        xx = np.round(coords1[:,0]).astype(np.int32)
        yy = np.round(coords1[:,1]).astype(np.int32)

        v = (xx > 0) & (xx < wd1) & (yy > 0) & (yy < ht1)
        xx = xx[v]
        yy = yy[v]
        flow1 = flow1[v]

        flow_img = np.zeros([ht1, wd1, 2], dtype=np.float32)
        valid_img = np.zeros([ht1, wd1], dtype=np.int32)

        flow_img[yy, xx] = flow1
        valid_img[yy, xx] = 1

        return flow_img, valid_img

    def spatial_transform(self, img, disp, valid):
        # randomly sample scale
        ht, wd = img[0][0].shape[:2]
        min_scale = np.maximum(
            (self.crop_size[0] + 8) / float(ht), (self.crop_size[1] + 8) / float(wd)
        )

        scale = 2 ** np.random.uniform(self.min_scale, self.max_scale)
        scale_x = scale
        scale_y = scale
        if np.random.rand() < self.stretch_prob:
            scale_x *= 2 ** np.random.uniform(-self.max_stretch, self.max_stretch)
            scale_y *= 2 ** np.random.uniform(-self.max_stretch, self.max_stretch)

        scale_x = np.clip(scale_x, min_scale, None)
        scale_y = np.clip(scale_y, min_scale, None)

        if np.random.rand() < self.spatial_aug_prob:
            # rescale the images
            for i in range(len(img)):
                for cam in (0, 1):
                    img[i][cam] = cv2.resize(
                        img[i][cam],
                        None,
                        fx=scale_x,
                        fy=scale_y,
                        interpolation=cv2.INTER_LINEAR,
                    )
                    if len(disp[i]) > 0:
                        disp[i][cam], valid[i][cam] = self.resize_sparse_flow_map(disp[i][cam], valid[i][cam], fx=scale_x, fy=scale_y)

        margin_y = 20
        margin_x = 50

        y0 = np.random.randint(0, img[0][0].shape[0] - self.crop_size[0])
        x0 = np.random.randint(0, img[0][0].shape[1] - self.crop_size[1])
        for i in range(len(img)):
            for cam in (0, 1):
                img[i][cam] = img[i][cam][
                    y0 : y0 + self.crop_size[0], x0 : x0 + self.crop_size[1]
                ]
                if len(disp[i]) > 0:
                    disp[i][cam] = disp[i][cam][
                        y0 : y0 + self.crop_size[0], x0 : x0 + self.crop_size[1]
                    ]
                    valid[i][cam] = valid[i][cam][
                        y0: y0 + self.crop_size[0], x0: x0 + self.crop_size[1]
                    ]
        return img, disp, valid

    def __call__(self, img, disp, valid):
        img = self.color_transform(img)
        img = self.eraser_transform(img)
        img, disp, valid = self.spatial_transform(img, disp, valid)

        for i in range(len(img)):
            for cam in (0, 1):
                img[i][cam] = np.ascontiguousarray(img[i][cam])
                if len(disp[i]) > 0:
                    disp[i][cam] = np.ascontiguousarray(disp[i][cam])
                    valid[i][cam] = np.ascontiguousarray(valid[i][cam])

        return img, disp, valid


================================================
FILE: datasets/frame_utils.py
================================================
import numpy as np
from PIL import Image
from os.path import *
import re
import imageio
import cv2

cv2.setNumThreads(0)
cv2.ocl.setUseOpenCL(False)

TAG_CHAR = np.array([202021.25], np.float32)


def readFlow(fn):
    """Read .flo file in Middlebury format"""
    # Code adapted from:
    # http://stackoverflow.com/questions/28013200/reading-middlebury-flow-files-with-python-bytes-array-numpy

    # WARNING: this will work on little-endian architectures (eg Intel x86) only!
    # print 'fn = %s'%(fn)
    with open(fn, "rb") as f:
        magic = np.fromfile(f, np.float32, count=1)
        if 202021.25 != magic:
            print("Magic number incorrect. Invalid .flo file")
            return None
        else:
            w = np.fromfile(f, np.int32, count=1)
            h = np.fromfile(f, np.int32, count=1)
            # print 'Reading %d x %d flo file\n' % (w, h)
            data = np.fromfile(f, np.float32, count=2 * int(w) * int(h))
            # Reshape data into 3D array (columns, rows, bands)
            # The reshape here is for visualization, the original code is (w,h,2)
            return np.resize(data, (int(h), int(w), 2))


def readPFM(file):
    file = open(file, "rb")

    color = None
    width = None
    height = None
    scale = None
    endian = None

    header = file.readline().rstrip()
    if header == b"PF":
        color = True
    elif header == b"Pf":
        color = False
    else:
        raise Exception("Not a PFM file.")

    dim_match = re.match(rb"^(\d+)\s(\d+)\s$", file.readline())
    if dim_match:
        width, height = map(int, dim_match.groups())
    else:
        raise Exception("Malformed PFM header.")

    scale = float(file.readline().rstrip())
    if scale < 0:  # little-endian
        endian = "<"
        scale = -scale
    else:
        endian = ">"  # big-endian

    data = np.fromfile(file, endian + "f")
    shape = (height, width, 3) if color else (height, width)

    data = np.reshape(data, shape)
    data = np.flipud(data)
    return data


def readDispSintelStereo(file_name):
    """Return disparity read from filename."""
    f_in = np.array(Image.open(file_name))
    d_r = f_in[:, :, 0].astype("float64")
    d_g = f_in[:, :, 1].astype("float64")
    d_b = f_in[:, :, 2].astype("float64")

    disp = d_r * 4 + d_g / (2 ** 6) + d_b / (2 ** 14)
    mask = np.array(Image.open(file_name.replace("disparities", "occlusions")))
    valid = (mask == 0) & (disp > 0)
    return disp, valid


def readDispMiddlebury(file_name):
    assert basename(file_name) == "disp0GT.pfm"
    disp = readPFM(file_name).astype(np.float32)
    assert len(disp.shape) == 2
    nocc_pix = file_name.replace("disp0GT.pfm", "mask0nocc.png")
    assert exists(nocc_pix)
    nocc_pix = imageio.imread(nocc_pix) == 255
    assert np.any(nocc_pix)
    return disp, nocc_pix


def read_gen(file_name, pil=False):
    ext = splitext(file_name)[-1]
    if ext == ".png" or ext == ".jpeg" or ext == ".ppm" or ext == ".jpg":
        return Image.open(file_name)
    elif ext == ".bin" or ext == ".raw":
        return np.load(file_name)
    elif ext == ".flo":
        return readFlow(file_name).astype(np.float32)
    elif ext == ".pfm":
        flow = readPFM(file_name).astype(np.float32)
        if len(flow.shape) == 2:
            return flow
        else:
            return flow[:, :, :-1]
    return []


================================================
FILE: datasets/video_datasets.py
================================================
import os
import copy
import gzip
import logging
import torch
import numpy as np
import torch.utils.data as data
import torch.nn.functional as F
import os.path as osp
from glob import glob
import cv2
import re
from scipy.spatial.transform import Rotation as R

from collections import defaultdict
from PIL import Image
from dataclasses import dataclass
from typing import List, Optional
from pytorch3d.renderer.cameras import PerspectiveCameras
from pytorch3d.implicitron.dataset.types import (
    FrameAnnotation as ImplicitronFrameAnnotation,
    load_dataclass,
)

from stereoanyvideo.datasets import frame_utils
from stereoanyvideo.evaluation.utils.eval_utils import depth2disparity_scale
from stereoanyvideo.datasets.augmentor import SequenceDispFlowAugmentor, SequenceDispSparseFlowAugmentor


@dataclass
class DynamicReplicaFrameAnnotation(ImplicitronFrameAnnotation):
    """A dataclass used to load annotations from json."""

    camera_name: Optional[str] = None


class StereoSequenceDataset(data.Dataset):
    def __init__(self, aug_params=None, sparse=False, reader=None):
        self.augmentor = None
        self.sparse = sparse
        self.img_pad = (
            aug_params.pop("img_pad", None) if aug_params is not None else None
        )
        if aug_params is not None and "crop_size" in aug_params:
            if sparse:
                self.augmentor = SequenceDispSparseFlowAugmentor(**aug_params)
            else:
                self.augmentor = SequenceDispFlowAugmentor(**aug_params)

        if reader is None:
            self.disparity_reader = frame_utils.read_gen
        else:
            self.disparity_reader = reader
        self.depth_reader = self._load_depth
        self.is_test = False
        self.sample_list = []
        self.extra_info = []
        self.depth_eps = 1e-5

    def _load_depth(self, depth_path):
        if depth_path[-3:] == "npy":
            return self._load_npy_depth(depth_path)
        elif depth_path[-3:] == "png":
            if "kitti_depth" in depth_path:
                return self._load_kitti_depth(depth_path)
            elif "vkitti2" in depth_path:
                return self._load_vkitti2(depth_path)
            else:
                return self._load_16big_png_depth(depth_path)
        else:
            raise ValueError("Other format depth is not implemented")

    def _load_npy_depth(self, depth_npy):
        depth = np.load(depth_npy)
        return depth

    def _load_vkitti2(self, depth_png):
        depth_image = cv2.imread(depth_png, cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH)
        depth_in_meters = depth_image.astype(np.float32) / 100.0
        depth_in_meters[depth_image == 0] = -1.

        return depth_in_meters

    def _load_kitti_depth(self, depth_png):
        # depth_image = cv2.imread(depth_png, cv2.IMREAD_UNCHANGED)
        # depth_in_meters = depth_image.astype(np.float32) / 256.0
        depth_image = np.array(Image.open(depth_png), dtype=int)
        # make sure we have a proper 16bit depth map here.. not 8bit!
        assert (np.max(depth_image) > 255)

        depth_in_meters = depth_image.astype(np.float32) / 256.
        depth_in_meters[depth_image == 0] = -1.

        return depth_in_meters

    def _load_16big_png_depth(self, depth_png):
        with Image.open(depth_png) as depth_pil:
            # the image is stored with 16-bit depth but PIL reads it as I (32 bit).
            # we cast it to uint16, then reinterpret as float16, then cast to float32
            depth = (
                np.frombuffer(np.array(depth_pil, dtype=np.uint16), dtype=np.float16)
                .astype(np.float32)
                .reshape((depth_pil.size[1], depth_pil.size[0]))
            )
        return depth

    def load_tartanair_pose(self, filepath, index=0):
        poses = np.loadtxt(filepath)
        tx, ty, tz, qx, qy, qz, qw = poses[index]

        # Quaternion to rotation matrix
        r = R.from_quat([qx, qy, qz, qw])
        R_mat = r.as_matrix()

        # Assemble 4x4 pose matrix
        T = np.eye(4)
        T[:3, :3] = R_mat
        T[:3, 3] = [tx, ty, tz]

        return T

    def parse_txt_file(self, file_path):
        with open(file_path, 'r') as file:
            data = file.read()

        # Regex patterns
        intrinsic_pattern = re.compile(r"Intrinsic:\s*\[\[([^\]]+)\]\s*\[\s*([^\]]+)\]\s*\[\s*([^\]]+)\]\]")
        frame_pattern = re.compile(r"Frame (\d+): Pose: ([\w\d]+)\n([\s\S]+?)(?=Frame|\Z)")

        # Extract intrinsic matrix (K)
        intrinsic_match = intrinsic_pattern.search(data)
        if intrinsic_match:
            K = np.array([list(map(float, row.split())) for row in intrinsic_match.groups()])
        else:
            raise ValueError("Intrinsic matrix not found in the file")

        # Extract frames and compute R and T
        frames = []
        for frame_match in frame_pattern.finditer(data):
            frame_number = int(frame_match.group(1))
            pose_id = frame_match.group(2)
            pose_matrix = np.array([list(map(float, row.split())) for row in frame_match.group(3).strip().split('\n')])

            # Decompose pose matrix into R and T
            R = pose_matrix[:3, :3]  # The upper-left 3x3 part is the rotation matrix
            T = pose_matrix[:3, 3]  # The first three elements of the fourth column is the translation vector

            frames.append({
                'frame_number': frame_number,
                'pose_id': pose_id,
                'pose_matrix': pose_matrix,
                'R': R,
                'T': T
            })

        return K, frames

    def _get_pytorch3d_camera(
        self, entry_viewpoint, image_size, scale: float
    ) -> PerspectiveCameras:
        assert entry_viewpoint is not None
        # principal point and focal length
        principal_point = torch.tensor(
            entry_viewpoint.principal_point, dtype=torch.float
        )
        focal_length = torch.tensor(entry_viewpoint.focal_length, dtype=torch.float)
        half_image_size_wh_orig = (
            torch.tensor(list(reversed(image_size)), dtype=torch.float) / 2.0
        )

        # first, we convert from the dataset's NDC convention to pixels
        format = entry_viewpoint.intrinsics_format
        if format.lower() == "ndc_norm_image_bounds":
            # this is e.g. currently used in CO3D for storing intrinsics
            rescale = half_image_size_wh_orig
        elif format.lower() == "ndc_isotropic":
            rescale = half_image_size_wh_orig.min()
        else:
            raise ValueError(f"Unknown intrinsics format: {format}")

        # principal point and focal length in pixels
        principal_point_px = half_image_size_wh_orig - principal_point * rescale
        focal_length_px = focal_length * rescale

        # now, convert from pixels to PyTorch3D v0.5+ NDC convention
        # if self.image_height is None or self.image_width is None:
        out_size = list(reversed(image_size))

        half_image_size_output = torch.tensor(out_size, dtype=torch.float) / 2.0
        half_min_image_size_output = half_image_size_output.min()

        # rescaled principal point and focal length in ndc
        principal_point = (
            half_image_size_output - principal_point_px * scale
        ) / half_min_image_size_output
        focal_length = focal_length_px * scale / half_min_image_size_output
        return PerspectiveCameras(
            focal_length=focal_length[None],
            principal_point=principal_point[None],
            R=torch.tensor(entry_viewpoint.R, dtype=torch.float)[None],
            T=torch.tensor(entry_viewpoint.T, dtype=torch.float)[None],
        )

    def _get_pytorch3d_camera_from_blender(self, R, T, K, image_size, scale: float) -> PerspectiveCameras:
        assert R is not None and T is not None and K is not None
        assert R.shape == (3, 3), f"Expected R to be 3x3, but got {R.shape}"
        assert T.shape == (3,), f"Expected T to be a 3-element vector, but got {T.shape}"
        assert K.shape == (3, 3), f"Expected K to be 3x3, but got {K.shape}"

        # Extract principal point and focal length from K
        fx = K[0, 0]
        fy = K[1, 1]
        cx = K[0, 2]
        cy = K[1, 2]

        principal_point = torch.tensor([cx, cy], dtype=torch.float)
        focal_length = torch.tensor([fx, fy], dtype=torch.float)

        half_image_size_wh_orig = (
                torch.tensor(list(reversed(image_size)), dtype=torch.float) / 2.0
        )

        # Adjust principal point and focal length in pixels
        principal_point_px = principal_point * scale
        focal_length_px = focal_length * scale

        # Convert from pixels to PyTorch3D NDC convention
        principal_point = (principal_point_px - half_image_size_wh_orig) / half_image_size_wh_orig
        half_min_image_size_output = half_image_size_wh_orig.min()
        focal_length = focal_length_px / half_min_image_size_output

        R = R.T @ np.array([[-1, 0, 0], [0, -1, 0], [0, 0, 1]], dtype=np.float64)
        T = T @ np.array([[-1, 0, 0], [0, -1, 0], [0, 0, 1]], dtype=np.float64)

        # Convert R and T to PyTorch tensors
        R_tensor = torch.tensor(R, dtype=torch.float).unsqueeze(0)  # Add batch dimension
        T_tensor = torch.tensor(T, dtype=torch.float).view(1, 3)  # Ensure T is a 1x3 tensor

        # Return PerspectiveCameras object
        return PerspectiveCameras(
            focal_length=focal_length.unsqueeze(0),  # Add batch dimension
            principal_point=principal_point.unsqueeze(0),  # Add batch dimension
            R=R_tensor,
            T=T_tensor,
        )

    def _get_output_tensor(self, sample):
        output_tensor = defaultdict(list)
        sample_size = len(sample["image"]["left"])
        output_tensor_keys = ["img", "disp", "valid_disp", "mask"]
        add_keys = ["viewpoint", "metadata"]
        for add_key in add_keys:
            if add_key in sample:
                output_tensor_keys.append(add_key)

        for key in output_tensor_keys:
            output_tensor[key] = [[] for _ in range(sample_size)]

        if "viewpoint" in sample:
            viewpoint_left = self._get_pytorch3d_camera(
                sample["viewpoint"]["left"][0],
                sample["metadata"]["left"][0][1],
                scale=1.0,
            )
            viewpoint_right = self._get_pytorch3d_camera(
                sample["viewpoint"]["right"][0],
                sample["metadata"]["right"][0][1],
                scale=1.0,
            )
            depth2disp_scale = depth2disparity_scale(
                viewpoint_left,
                viewpoint_right,
                torch.Tensor(sample["metadata"]["left"][0][1])[None],
            )
            output_tensor["depth2disp_scale"] = [depth2disp_scale]

        if "camera" in sample:
            output_tensor["viewpoint"] = [[] for _ in range(sample_size)]
            # InfinigenSV
            if sample["camera"]["left"][0][-3:] == "npz":
                # Note that the K, R, T is based on Blender world Matrix
                camera_left = np.load(sample["camera"]["left"][0])
                camera_right = np.load(sample["camera"]["right"][0])
                camera_left_RT = camera_left['T']
                camera_right_RT = camera_right['T']
                camera_left_K = camera_left['K']
                camera_right_K = camera_right['K']
                camera_left_T = camera_left['T'][:3, 3]
                camera_left_R = camera_left['T'][:3, :3]
                fix_baseline = np.linalg.norm(camera_left_RT[:3, 3] - camera_right_RT[:3, 3])
                focal_length_px = camera_left_K[0][0]
                depth2disp_scale = focal_length_px * fix_baseline

            # Sintel
            elif sample["camera"]["left"][0][-3:] == "cam":
                TAG_FLOAT = 202021.25
                f = open(sample["camera"]["left"][0], 'rb')
                check = np.fromfile(f, dtype=np.float32, count=1)[0]
                assert check == TAG_FLOAT, ' cam_read:: Wrong tag in flow file (should be: {0}, is: {1}). Big-endian machine? '.format(
                    TAG_FLOAT, check)
                camera_left_K = np.fromfile(f, dtype='float64', count=9).reshape((3, 3))
                camera_left_RT = np.fromfile(f, dtype='float64', count=12).reshape((3, 4))
                fix_baseline = 0.1 # From the MPI Sintel dataset website, the baseline of the cameras = 10cm = 0.1m
                focal_length_px = camera_left_K[0][0]
                depth2disp_scale = focal_length_px * fix_baseline
                camera_left_T = camera_left_RT[:3, 3]
                camera_left_R = camera_left_RT[:3, :3]

            # Spring
            elif any(filename in path for path in sample["camera"]["left"] for filename in ["focaldistance.txt", "extrinsics.txt", "intrinsics.txt"]):
                for path in sample["camera"]["left"]:
                    if "intrinsics.txt" in path:
                        intrinsics_path = path
                    elif "extrinsics.txt" in path:
                        extrinsics_path = path

                fx, fy, cx, cy = np.loadtxt(intrinsics_path)[0]
                # Build the 3x3 intrinsic matrix
                camera_left_K = np.array([
                    [fx, 0, cx],
                    [0, fy, cy],
                    [0, 0, 1]
                ])
                focal_length_px = camera_left_K[0][0]
                fix_baseline = 0.065  # From the dataset website, the baseline of the cameras = 6.5cm = 0.065m
                depth2disp_scale = focal_length_px * fix_baseline
                camera_left_RT = np.loadtxt(extrinsics_path).reshape(-1, 4, 4)[0]
                camera_left_T = camera_left_RT[:3, 3]
                camera_left_R = camera_left_RT[:3, :3]

            # TartanAir
            elif sample["camera"]["left"][0][-13:] == "pose_left.txt":
                fx, fy, cx, cy = 320.0, 320.0, 320.0, 240.0
                # Build the 3x3 intrinsic matrix
                camera_left_K = np.array([
                    [fx, 0, cx],
                    [0, fy, cy],
                    [0, 0, 1]
                ])
                focal_length_px = camera_left_K[0][0]
                fix_baseline = 0.25
                depth2disp_scale = focal_length_px * fix_baseline
                camera_left_RT = self.load_tartanair_pose(sample["camera"]["left"][0], index=0)
                camera_left_T = camera_left_RT[:3, 3]
                camera_left_R = camera_left_RT[:3, :3]

            # KITTI Depth
            elif sample["camera"]["left"][0][-20:] == "calib_cam_to_cam.txt":
                calib_data = {}
                with open(sample["camera"]["left"][0], 'r') as f:
                    for line in f:
                        key, value = line.split(':', 1)
                        calib_data[key.strip()] = value.strip()

                P_key = 'P_rect_02'
                if P_key in calib_data:
                    P_values = np.array([float(x) for x in calib_data[P_key].split()])
                    P_matrix = P_values.reshape(3, 4)
                else:
                    raise KeyError(f"Projection matrix for camera not found in calibration data")
                focal_length_px = P_matrix[0, 0]

                T_key1 = 'T_02'
                T_key2 = 'T_03'
                if T_key1 in calib_data and T_key2 in calib_data:
                    T1 = np.array([float(x) for x in calib_data[T_key1].split()])
                    T2 = np.array([float(x) for x in calib_data[T_key2].split()])
                    baseline = np.linalg.norm(T1 - T2)
                else:
                    raise KeyError(f"Translation vectors for cameras not found in calibration data")

                R_key1 = 'R_rect_02'
                R_key2 = 'R_rect_03'
                if R_key1 in calib_data and R_key2 in calib_data:
                    R1 = np.array([float(x) for x in calib_data[R_key1].split()]).reshape(3, 3)
                    R2 = np.array([float(x) for x in calib_data[R_key2].split()]).reshape(3, 3)
                else:
                    raise KeyError(f"Rotation vectors for cameras not found in calibration data")

                depth2disp_scale = focal_length_px * baseline
                camera_left_K = P_matrix[:, :3]
                camera_left_T = T1
                camera_left_R = R1

            # VKITTI2
            elif sample["camera"]["left"][0][-13:] == "intrinsic.txt":
                baseline = 0.532725
                with open(sample["camera"]["left"][0], 'r') as f:
                    line = f.readlines()[1]
                    values = line.strip().split()
                    frame = int(values[0])
                    camera_id = int(values[1])
                    fx = float(values[2])
                    fy = float(values[3])
                    cx = float(values[4])
                    cy = float(values[5])
                    # Construct the intrinsic matrix
                    camera_left_K = torch.tensor([[fx, 0, cx],
                                      [0, fy, cy],
                                      [0, 0, 1]], dtype=torch.float32)
                depth2disp_scale = camera_left_K[0, 0] * baseline

                with open(sample["camera"]["left"][0].replace("intrinsic.txt", "extrinsic.txt"), 'r') as f:
                    line = f.readlines()[1]
                    values = line.strip().split()
                    frame = int(values[0])
                    camera_id = int(values[1])
                    # Extract rotation (3x3) and translation (3x1)
                    camera_left_R = np.array([
                        [float(values[2]), float(values[3]), float(values[4])],
                        [float(values[6]), float(values[7]), float(values[8])],
                        [float(values[10]), float(values[11]), float(values[12])]
                    ], dtype=np.float32)

                    camera_left_T = np.array([
                        float(values[5]),
                        float(values[9]),
                        float(values[13])
                    ], dtype=np.float32)

            # SouthKensington
            elif sample["camera"]["left"][0][-3:] == "txt":
                camera_left_K, frames = self.parse_txt_file(sample["camera"]["left"][0])
                fix_baseline = 0.12
                camera_left_R = frames[0]['R']
                camera_left_T = frames[0]['T']
                focal_length_px = camera_left_K[0][0]
                depth2disp_scale = focal_length_px * fix_baseline
            else:
                raise ValueError("Other format camera is not implemented")

            output_tensor["depth2disp_scale"] = [depth2disp_scale]
            output_tensor["RTK"] = [camera_left_R, camera_left_T, camera_left_K]

        for i in range(sample_size):
            for cam in ["left", "right"]:
                if "mask" in sample and cam in sample["mask"]:
                    mask = frame_utils.read_gen(sample["mask"][cam][i])
                    mask = np.array(mask) / 255.0
                    output_tensor["mask"][i].append(mask)

                if "viewpoint" in sample and cam in sample["viewpoint"]:
                    viewpoint = self._get_pytorch3d_camera(
                        sample["viewpoint"][cam][i],
                        sample["metadata"][cam][i][1],
                        scale=1.0,
                    )
                    output_tensor["viewpoint"][i].append(viewpoint)
                if "camera" in sample:
                    # InfinigenSV
                    if sample["camera"]["left"][0][-3:] == "npz" and cam in sample["camera"]:
                        # Note that the K, R, T is based on Blender world Matrix
                        camera = np.load(sample["camera"][cam][i])
                        camera_K = camera['K']
                        camera_T = camera['T'][:3, 3]
                        camera_R = camera['T'][:3, :3]
                        viewpoint = self._get_pytorch3d_camera_from_blender(
                            camera_R, camera_T, camera_K,
                            sample["metadata"][cam][i][1],
                            scale=1.0,
                        )
                        output_tensor["viewpoint"][i].append(viewpoint)

                    # Sintel
                    elif sample["camera"]["left"][0][-3:] == "cam" and cam in sample["camera"]:
                        f = open(sample["camera"]["left"][0], 'rb')
                        check = np.fromfile(f, dtype=np.float32, count=1)[0]
                        assert check == TAG_FLOAT, ' cam_read:: Wrong tag in flow file (should be: {0}, is: {1}). Big-endian machine? '.format(
                            TAG_FLOAT, check)
                        camera_K = np.fromfile(f, dtype='float64', count=9).reshape((3, 3))
                        camera_RT = np.fromfile(f, dtype='float64', count=12).reshape((3, 4))
                        camera_T = camera_RT[:3, 3]
                        camera_R = camera_RT[:3, :3]
                        viewpoint = self._get_pytorch3d_camera_from_blender(
                            camera_R, camera_T, camera_K,
                            sample["metadata"][cam][i][1],
                            scale=1.0,
                        )
                        output_tensor["viewpoint"][i].append(viewpoint)

                    # TartanAir
                    elif sample["camera"]["left"][0][-13:] == "pose_left.txt":
                        fx, fy, cx, cy = 320.0, 320.0, 320.0, 240.0
                        # Build the 3x3 intrinsic matrix
                        camera_left_K = np.array([
                            [fx, 0, cx],
                            [0, fy, cy],
                            [0, 0, 1]
                        ])
                        focal_length_px = camera_left_K[0][0]
                        fix_baseline = 0.25
                        depth2disp_scale = focal_length_px * fix_baseline
                        camera_left_RT = self.load_tartanair_pose(sample["camera"]["left"][0], index=i)
                        camera_left_T = camera_left_RT[:3, 3]
                        camera_left_R = camera_left_RT[:3, :3]

                    # Spring
                    elif any(filename in path for path in sample["camera"]["left"] for filename
                             in ["focaldistance.txt", "extrinsics.txt", "intrinsics.txt"]) and cam in sample["camera"]:

                        for path in sample["camera"]["left"]:
                            if "intrinsics.txt" in path:
                                intrinsics_path = path
                            elif "extrinsics.txt" in path:
                                extrinsics_path = path

                        fx, fy, cx, cy = np.loadtxt(intrinsics_path)[0]
                        # Build the 3x3 intrinsic matrix
                        camera_K = np.array([
                            [fx, 0, cx],
                            [0, fy, cy],
                            [0, 0, 1]
                        ])
                        focal_length_px = camera_left_K[0][0]
                        fix_baseline = 0.065  # From the dataset website, the baseline of the cameras = 6.5cm = 0.065m
                        depth2disp_scale = focal_length_px * fix_baseline
                        camera_RT = np.loadtxt(extrinsics_path).reshape(-1, 4, 4)[i]
                        camera_T = camera_RT[:3, 3]
                        camera_R = camera_RT[:3, :3]
                        viewpoint = self._get_pytorch3d_camera_from_blender(
                            camera_R, camera_T, camera_K,
                            sample["metadata"][cam][i][1],
                            scale=1.0,
                        )
                        output_tensor["viewpoint"][i].append(viewpoint)

                    # KITTI Depth
                    elif sample["camera"]["left"][0][-20:] == "calib_cam_to_cam.txt":
                        calib_data = {}
                        with open(sample["camera"]["left"][0], 'r') as f:
                            for line in f:
                                key, value = line.split(':', 1)
                                calib_data[key.strip()] = value.strip()

                        P_key = 'P_rect_02'
                        if P_key in calib_data:
                            P_values = np.array([float(x) for x in calib_data[P_key].split()])
                            P_matrix = P_values.reshape(3, 4)
                        else:
                            raise KeyError(f"Projection matrix for camera not found in calibration data")
                        focal_length_px = P_matrix[0, 0]

                        T_key1 = 'T_02'
                        T_key2 = 'T_03'
                        if T_key1 in calib_data and T_key2 in calib_data:
                            T1 = np.array([float(x) for x in calib_data[T_key1].split()])
                            T2 = np.array([float(x) for x in calib_data[T_key2].split()])
                            baseline = np.linalg.norm(T1 - T2)
                        else:
                            raise KeyError(f"Translation vectors for cameras not found in calibration data")

                        R_key1 = 'R_rect_02'
                        R_key2 = 'R_rect_03'
                        if R_key1 in calib_data and R_key2 in calib_data:
                            R1 = np.array([float(x) for x in calib_data[R_key1].split()]).reshape(3, 3)
                            R2 = np.array([float(x) for x in calib_data[R_key2].split()]).reshape(3, 3)
                        else:
                            raise KeyError(f"Rotation vectors for cameras not found in calibration data")

                        depth2disp_scale = focal_length_px * baseline
                        camera_K = P_matrix[:, :3]
                        camera_T = T1
                        camera_R = R1
                        viewpoint = self._get_pytorch3d_camera_from_blender(
                            camera_R, camera_T, camera_K,
                            sample["metadata"][cam][i][1],
                            scale=1.0,
                        )
                        output_tensor["viewpoint"][i].append(viewpoint)

                    # VKITTI2
                    elif sample["camera"]["left"][0][-13:] == "intrinsic.txt":
                        with open(sample["camera"]["left"][0], 'r') as f:
                            line = f.readlines()[1+i]
                            values = line.strip().split()
                            frame = int(values[0])
                            camera_id = int(values[1])
                            fx = float(values[2])
                            fy = float(values[3])
                            cx = float(values[4])
                            cy = float(values[5])
                            # Construct the intrinsic matrix
                            camera_K = torch.tensor([[fx, 0, cx],
                                                          [0, fy, cy],
                                                          [0, 0, 1]], dtype=torch.float32)
                        with open(sample["camera"]["left"][0].replace("intrinsic.txt", "extrinsic.txt"), 'r') as f:
                            line = f.readlines()[1+i]
                            values = line.strip().split()
                            frame = int(values[0])
                            camera_id = int(values[1])
                            # Extract rotation (3x3) and translation (3x1)
                            camera_R = np.array([
                                [float(values[2]), float(values[3]), float(values[4])],
                                [float(values[6]), float(values[7]), float(values[8])],
                                [float(values[10]), float(values[11]), float(values[12])]
                            ], dtype=np.float32)

                            camera_T = np.array([
                                float(values[5]),
                                float(values[9]),
                                float(values[13])
                            ], dtype=np.float32)
                        viewpoint = self._get_pytorch3d_camera_from_blender(
                            camera_R, camera_T, camera_K,
                            sample["metadata"][cam][i][1],
                            scale=1.0,
                        )
                        output_tensor["viewpoint"][i].append(viewpoint)

                    # SouthKensington
                    elif sample["camera"]["left"][0][-3:] == "txt" and cam in sample["camera"]:
                        camera_left_K, frames = self.parse_txt_file(sample["camera"]["left"][0])

                        camera_K = camera_left_K
                        camera_R = frames[i]['R']
                        camera_T = frames[i]['T']
                        viewpoint = self._get_pytorch3d_camera_from_blender(
                            camera_R, camera_T, camera_K,
                            sample["metadata"][cam][i][1],
                            scale=1.0,
                        )
                        output_tensor["viewpoint"][i].append(viewpoint)

                if "metadata" in sample and cam in sample["metadata"]:
                    metadata = sample["metadata"][cam][i]
                    output_tensor["metadata"][i].append(metadata)

                if cam in sample["image"]:
                    img = frame_utils.read_gen(sample["image"][cam][i])
                    img = np.array(img).astype(np.uint8)

                    # grayscale images
                    if len(img.shape) == 2:
                        img = np.tile(img[..., None], (1, 1, 3))
                    else:
                        img = img[..., :3]
                    output_tensor["img"][i].append(img)

                if cam in sample["disparity"]:
                    disp = self.disparity_reader(sample["disparity"][cam][i])
                    if isinstance(disp, tuple):
                        disp, valid_disp = disp
                    else:
                        valid_disp = disp < 512
                    disp = np.array(disp).astype(np.float32)
                    disp = np.stack([-disp, np.zeros_like(disp)], axis=-1)
                    # disp = np.stack([disp, np.zeros_like(disp)], axis=-1)

                    output_tensor["disp"][i].append(disp)
                    output_tensor["valid_disp"][i].append(valid_disp)

                elif "depth" in sample and cam in sample["depth"]:
                    depth = self.depth_reader(sample["depth"][cam][i])
                    depth_mask = depth < self.depth_eps
                    depth[depth_mask] = self.depth_eps
                    disp = depth2disp_scale / depth
                    disp[depth_mask] = 0
                    valid_disp = (disp < 512) * (1 - depth_mask)
                    disp = np.array(disp).astype(np.float32)
                    disp = np.stack([-disp, np.zeros_like(disp)], axis=-1)
                    output_tensor["disp"][i].append(disp)
                    output_tensor["valid_disp"][i].append(valid_disp)

        return output_tensor

    def __getitem__(self, index):
        im_tensor = {"img"}
        sample = self.sample_list[index]
        if self.is_test:
            sample_size = len(sample["image"]["left"])
            im_tensor["img"] = [[] for _ in range(sample_size)]
            for i in range(sample_size):
                for cam in ["left", "right"]:
                    img = frame_utils.read_gen(sample["image"][cam][i])
                    img = np.array(img).astype(np.uint8)[..., :3]
                    img = torch.from_numpy(img).permute(2, 0, 1).float()
                    im_tensor["img"][i].append(img)
            im_tensor["img"] = torch.stack(im_tensor["img"])
            return im_tensor, self.extra_info[index]

        index = index % len(self.sample_list)
        try:
            output_tensor = self._get_output_tensor(sample)
        except:
            logging.warning(f"Exception in loading sample {index}!")
            index = np.random.randint(len(self.sample_list))
            logging.info(f"New index is {index}")
            sample = self.sample_list[index]
            output_tensor = self._get_output_tensor(sample)

        sample_size = len(sample["image"]["left"])
        if self.augmentor is not None:
            if self.sparse:
                output_tensor["img"], output_tensor["disp"], output_tensor["valid_disp"] = self.augmentor(
                    output_tensor["img"], output_tensor["disp"], output_tensor["valid_disp"]
                )
            else:
                output_tensor["img"], output_tensor["disp"] = self.augmentor(
                    output_tensor["img"], output_tensor["disp"]
                )
        for i in range(sample_size):
            for cam in (0, 1):
                if cam < len(output_tensor["img"][i]):
                    img = (
                        torch.from_numpy(output_tensor["img"][i][cam])
                        .permute(2, 0, 1)
                        .float()
                    )
                    if self.img_pad is not None:
                        padH, padW = self.img_pad
                        img = F.pad(img, [padW] * 2 + [padH] * 2)
                    output_tensor["img"][i][cam] = img

                if cam < len(output_tensor["disp"][i]):
                    disp = (
                        torch.from_numpy(output_tensor["disp"][i][cam])
                        .permute(2, 0, 1)
                        .float()
                    )

                    if self.sparse:
                        valid_disp = torch.from_numpy(
                            output_tensor["valid_disp"][i][cam]
                        )
                    else:
                        valid_disp = (
                            (disp[0].abs() < 512)
                            & (disp[1].abs() < 512)
                            & (disp[0].abs() != 0)
                        )
                    disp = disp[:1]

                    output_tensor["disp"][i][cam] = disp
                    output_tensor["valid_disp"][i][cam] = valid_disp.float()

                if "mask" in output_tensor and cam < len(output_tensor["mask"][i]):
                    mask = torch.from_numpy(output_tensor["mask"][i][cam]).float()
                    output_tensor["mask"][i][cam] = mask

                if "viewpoint" in output_tensor and cam < len(
                    output_tensor["viewpoint"][i]
                ):

                    viewpoint = output_tensor["viewpoint"][i][cam]
                    output_tensor["viewpoint"][i][cam] = viewpoint

        res = {}
        if "viewpoint" in output_tensor and self.split != "train":
            res["viewpoint"] = output_tensor["viewpoint"]
        if "metadata" in output_tensor and self.split != "train":
            res["metadata"] = output_tensor["metadata"]
        if "depth2disp_scale" in output_tensor and self.split != "train":
            res["depth2disp_scale"] = output_tensor["depth2disp_scale"]
        if "RTK" in output_tensor and self.split != "train":
            res["RTK"] = output_tensor["RTK"]

        for k, v in output_tensor.items():
            if k != "viewpoint" and k != "metadata" and k != "depth2disp_scale" and k != "RTK":
                for i in range(len(v)):
                    if len(v[i]) > 0:
                        v[i] = torch.stack(v[i])
                if len(v) > 0 and (len(v[0]) > 0):
                    res[k] = torch.stack(v)
        return res

    def __mul__(self, v):
        copy_of_self = copy.deepcopy(self)
        copy_of_self.sample_list = v * copy_of_self.sample_list
        copy_of_self.extra_info = v * copy_of_self.extra_info
        return copy_of_self

    def __len__(self):
        return len(self.sample_list)


class DynamicReplicaDataset(StereoSequenceDataset):
    def __init__(
        self,
        aug_params=None,
        root="./data/datasets/dynamic_replica_data",
        split="train",
        sample_len=-1,
        only_first_n_samples=-1,
    ):
        super(DynamicReplicaDataset, self).__init__(aug_params)
        self.root = root
        self.sample_len = sample_len
        self.split = split

        frame_annotations_file = f"frame_annotations_{split}.jgz"

        with gzip.open(
            osp.join(root, split, frame_annotations_file), "rt", encoding="utf8"
        ) as zipfile:
            frame_annots_list = load_dataclass(
                zipfile, List[DynamicReplicaFrameAnnotation]
            )
        seq_annot = defaultdict(lambda: defaultdict(list))
        for frame_annot in frame_annots_list:
            seq_annot[frame_annot.sequence_name][frame_annot.camera_name].append(
                frame_annot
            )
        for seq_name in seq_annot.keys():
            try:
                filenames = defaultdict(lambda: defaultdict(list))
                for cam in ["left", "right"]:
                    for framedata in seq_annot[seq_name][cam]:
                        im_path = osp.join(root, split, framedata.image.path)
                        depth_path = osp.join(root, split, framedata.depth.path)
                        mask_path = osp.join(root, split, framedata.mask.path)

                        assert os.path.isfile(im_path), im_path
                        if self.split == 'train':
                            assert os.path.isfile(depth_path), depth_path
                        assert os.path.isfile(mask_path), mask_path

                        filenames["image"][cam].append(im_path)
                        if os.path.isfile(depth_path):
                            filenames["depth"][cam].append(depth_path)
                        filenames["mask"][cam].append(mask_path)

                        filenames["viewpoint"][cam].append(framedata.viewpoint)
                        filenames["metadata"][cam].append(
                            [framedata.sequence_name, framedata.image.size]
                        )

                        for k in filenames.keys():
                            assert (
                                len(filenames[k][cam])
                                == len(filenames["image"][cam])
                                > 0
                            ), framedata.sequence_name

                seq_len = len(filenames["image"][cam])

                print("seq_len", seq_name, seq_len)
                if split == "train":
                    for ref_idx in range(0, seq_len, 3):
                        step = 1 if self.sample_len == 1 else np.random.randint(1, 6)
                        if ref_idx + step * self.sample_len < seq_len:
                            sample = defaultdict(lambda: defaultdict(list))
                            for cam in ["left", "right"]:
                                for idx in range(
                                    ref_idx, ref_idx + step * self.sample_len, step
                                ):
                                    for k in filenames.keys():
                                        if "mask" not in k:
                                            sample[k][cam].append(
                                                filenames[k][cam][idx]
                                            )

                            self.sample_list.append(sample)
                else:
                    step = self.sample_len if self.sample_len > 0 else seq_len
                    counter = 0
                    for ref_idx in range(0, seq_len, step):
                        sample = defaultdict(lambda: defaultdict(list))
                        for cam in ["left", "right"]:
                            for idx in range(ref_idx, ref_idx + step):
                                for k in filenames.keys():
                                    sample[k][cam].append(filenames[k][cam][idx])

                        self.sample_list.append(sample)
                        counter += 1
                        if only_first_n_samples > 0 and counter >= only_first_n_samples:
                            break
            except Exception as e:
                print(e)
                print("Skipping sequence", seq_name)

        assert len(self.sample_list) > 0, "No samples found"
        print(f"Added {len(self.sample_list)} from Dynamic Replica {split}")
        logging.info(f"Added {len(self.sample_list)} from Dynamic Replica {split}")


class InfinigenStereoVideoDataset(StereoSequenceDataset):
    def __init__(
        self,
        aug_params=None,
        root="./data/datasets/InfinigenStereo",
        split="train",
        sample_len=-1,
        only_first_n_samples=-1,
    ):
        super(InfinigenStereoVideoDataset, self).__init__(aug_params)
        self.root = root
        self.sample_len = sample_len
        self.split = split

        sequence = sorted(
            glob(osp.join(root, self.split, "*"))
        )
        for i in range(len(sequence)):
            sequence_name = os.path.basename(sequence[i])
            try:
                filenames = defaultdict(lambda: defaultdict(list))
                for cam in ["left", "right"]:
                    suffix = "camera_0/" if cam == "left" else "camera_1/"
                    im_path_list = sorted(glob(osp.join(sequence[i], "frames/Image/", suffix, "*.png")))
                    depth_path_list = sorted(glob(osp.join(sequence[i], "frames/Depth/", suffix, "*.npy")))
                    camera_list = sorted(glob(osp.join(sequence[i], "frames/camview/", suffix, "*.npz")))
                    for j in range(len(im_path_list)):
                        im_path = im_path_list[j]
                        depth_path = depth_path_list[j]
                        camera_path = camera_list[j]
                        assert os.path.isfile(im_path), im_path
                        assert os.path.isfile(depth_path), depth_path
                        filenames["image"][cam].append(im_path)
                        filenames["depth"][cam].append(depth_path)
                        filenames["camera"][cam].append(camera_path)
                        filenames["metadata"][cam].append([sequence_name , (720,1280)])

                        for k in filenames.keys():
                            assert (
                                    len(filenames[k][cam])
                                    == len(filenames["image"][cam])
                                    > 0
                            ), sequence_name
                seq_len = len(filenames["image"][cam])

                print("seq_len", sequence_name, seq_len)
                if self.split == "train":
                    for ref_idx in range(0, seq_len, 3):
                        step = 1 if self.sample_len == 1 else np.random.randint(1, 6)
                        if ref_idx + step * self.sample_len < seq_len:
                            sample = defaultdict(lambda: defaultdict(list))
                            for cam in ["left", "right"]:
                                for idx in range(
                                    ref_idx, ref_idx + step * self.sample_len, step
                                ):
                                    for k in filenames.keys():
                                        if "mask" not in k:
                                            sample[k][cam].append(
                                                filenames[k][cam][idx]
                                            )

                            self.sample_list.append(sample)
                else:
                    step = self.sample_len if (self.sample_len > 0) and (self.sample_len < seq_len) else seq_len
                    print("sample_step", step)
                    counter = 0
                    for ref_idx in range(0, seq_len, step):
                        sample = defaultdict(lambda: defaultdict(list))
                        for cam in ["left", "right"]:
                            for idx in range(ref_idx, ref_idx + step):
                                for k in filenames.keys():
                                    sample[k][cam].append(filenames[k][cam][idx])

                        self.sample_list.append(sample)
                        counter += 1
                        if only_first_n_samples > 0 and counter >= only_first_n_samples:
                            break
            except Exception as e:
                print(e)
                print("Skipping sequence", sequence_name)
        assert len(self.sample_list) > 0, "No samples found"
        print(f"Added {len(self.sample_list)} from Infinigen Stereo Video {split}")
        logging.info(f"Added {len(self.sample_list)} from Infinigen Stereo Video {split}")


class SouthKensingtonStereoVideoDataset(StereoSequenceDataset):
    def __init__(
        self,
        aug_params=None,
        root="./data/datasets/SouthKensington/data/",
        split="test",
        subroot="",
        sample_len=-1,
        only_first_n_samples=-1,
    ):
        super(SouthKensingtonStereoVideoDataset, self).__init__(aug_params)
        self.root = root
        self.split = split
        self.sample_len = sample_len

        sequence = sorted(
            glob(osp.join(root, "*"))
        )
        for i in range(len(sequence)):
            sequence_name = os.path.basename(sequence[i])
            try:
                filenames = defaultdict(lambda: defaultdict(list))
                for cam in ["left", "right"]:
                    im_path_list = sorted(glob(osp.join(sequence[i], "images", cam, "*.png")))
                    camera_path = glob(osp.join(sequence[i], "*.txt"))[0]

                    for j in range(len(im_path_list)):
                        im_path = im_path_list[j]
                        assert os.path.isfile(im_path), im_path
                        filenames["image"][cam].append(im_path)
                        filenames["camera"][cam].append(camera_path)
                        filenames["metadata"][cam].append([sequence_name , (720,1280)])

                        for k in filenames.keys():
                            assert (
                                    len(filenames[k][cam])
                                    == len(filenames["image"][cam])
                                    > 0
                            ), sequence_name
                seq_len = len(filenames["image"][cam])
                print("seq_len", sequence_name, seq_len)

                step = self.sample_len if (self.sample_len > 0) and (self.sample_len < seq_len) else seq_len
                print("sample_step", step)
                counter = 0
                for ref_idx in range(0, seq_len, step):
                    sample = defaultdict(lambda: defaultdict(list))
                    for cam in ["left", "right"]:
                        for idx in range(ref_idx, ref_idx + step):
                            for k in filenames.keys():
                                sample[k][cam].append(filenames[k][cam][idx])

                    self.sample_list.append(sample)
                    counter += 1
                    if only_first_n_samples > 0 and counter >= only_first_n_samples:
                        break
            except Exception as e:
                print(e)
                print("Skipping sequence", sequence_name)
        assert len(self.sample_list) > 0, "No samples found"
        print(f"Added {len(self.sample_list)} from SouthKensington Stereo Video")
        logging.info(f"Added {len(self.sample_list)} from SouthKensington Stereo Video")


class KITTIDepthDataset(StereoSequenceDataset):
    def __init__(
        self,
        aug_params=None,
        root="./data/datasets/",
        split="train",
        sample_len=-1,
        only_first_n_samples=-1,
    ):
        super().__init__(aug_params, sparse=True)
        # super(KITTIDepthDataset, self).__init__(aug_params)
        image_root = osp.join(root, "kitti_depth", "input")
        gt_root = osp.join(root, "kitti_depth", "gt_depth")
        self.sample_len = sample_len
        self.split = split
        # Following CODD: https://github.com/facebookresearch/CODD
        val_split = ['2011_10_03_drive_0042_sync']  # 1 scene
        test_split = ['2011_09_26_drive_0002_sync', '2011_09_26_drive_0005_sync',
                      '2011_09_26_drive_0013_sync', '2011_09_26_drive_0020_sync',
                      '2011_09_26_drive_0023_sync', '2011_09_26_drive_0036_sync',
                      '2011_09_26_drive_0079_sync', '2011_09_26_drive_0095_sync',
                      '2011_09_26_drive_0113_sync', '2011_09_28_drive_0037_sync',
                      '2011_09_29_drive_0026_sync', '2011_09_30_drive_0016_sync',
                      '2011_10_03_drive_0047_sync']  # 13 scenes

        sequence_root = sorted(glob(osp.join(gt_root, "*")))
        train_list = []
        val_list = []
        test_list = []
        for i in range(len(sequence_root)):
            sequence_name = os.path.basename(os.path.normpath(sequence_root[i]))
            if sequence_name in test_split:
                test_list.append(sequence_root[i])
            elif sequence_name in val_split:
                val_list.append(sequence_root[i])
            else:
                train_list.append(sequence_root[i])

        if self.split == "train":
            sequence_split = train_list
        elif self.split == "val":
            sequence_split = val_list
        elif self.split == "test":
            sequence_split = test_list
        else:
            raise ValueError("Wrong Split: ", self.split)

        for i in range(len(sequence_split)):
            sequence_name = os.path.basename(os.path.normpath(sequence_split[i]))
            sequence_camera = osp.join(image_root, sequence_name[:10], "calib_cam_to_cam.txt")
            try:
                filenames = defaultdict(lambda: defaultdict(list))
                for cam in ["left", "right"]:
                    suffix = "image_02/" if cam == "left" else "image_03/"
                    depth_path_list = sorted(
                        glob(osp.join(gt_root, sequence_name, "proj_depth", "groundtruth", suffix, "*.png")))
                    for j in range(len(depth_path_list)):
                        depth_path = depth_path_list[j]
                        assert os.path.isfile(depth_path), depth_path
                        filenames["depth"][cam].append(depth_path)

                        # find the corresponding images
                        im_name = os.path.basename(os.path.normpath(depth_path))
                        im_path = osp.join(image_root, sequence_name[:10], sequence_name, suffix, "data", im_name)
                        assert os.path.isfile(im_path), im_path
                        filenames["image"][cam].append(im_path)
                        filenames["camera"][cam].append(sequence_camera)
                        filenames["metadata"][cam].append([sequence_name, (370,1224)])
                        for k in filenames.keys():
                            assert (
                                    len(filenames[k][cam])
                                    == len(filenames["depth"][cam])
                                    > 0
                            ), sequence_name
                seq_len = len(filenames["image"][cam])
                print("seq_len", sequence_name, seq_len)
                if self.split == "train":
                    for ref_idx in range(0, seq_len, 3):
                        step = 1 if self.sample_len == 1 else np.random.randint(1, 6)
                        if ref_idx + step * self.sample_len < seq_len:
                            sample = defaultdict(lambda: defaultdict(list))
                            for cam in ["left", "right"]:
                                for idx in range(
                                        ref_idx, ref_idx + step * self.sample_len, step
                                ):
                                    for k in filenames.keys():
                                        if "mask" not in k:
                                            sample[k][cam].append(
                                                filenames[k][cam][idx]
                                            )
                            self.sample_list.append(sample)
                else:
                    step = self.sample_len if (self.sample_len > 0) and (self.sample_len < seq_len) else seq_len
                    print("sample_step", step)
                    counter = 0
                    for ref_idx in range(0, seq_len, step):
                        sample = defaultdict(lambda: defaultdict(list))
                        for cam in ["left", "right"]:
                            for idx in range(ref_idx, ref_idx + step):
                                for k in filenames.keys():
                                    sample[k][cam].append(filenames[k][cam][idx])

                        self.sample_list.append(sample)
                        counter += 1
                        if only_first_n_samples > 0 and counter >= only_first_n_samples:
                            break
            except Exception as e:
                print(e)
                print("Skipping sequence", sequence_name)
        assert len(self.sample_list) > 0, "No samples found"
        print(f"Added {len(self.sample_list)} from  KITTI Depth {split}")
        logging.info(f"Added {len(self.sample_list)} from KITTI Depth {split}")


def split_train_valid(path_list, valid_keywords):
    path_list_init = path_list
    for kw in valid_keywords:
        path_list = list(filter(lambda s: kw not in s, path_list))
    train_path_list = sorted(path_list)
    valid_path_list = sorted(list(set(path_list_init) - set(train_path_list)))
    return train_path_list, valid_path_list


class TartanAirDataset(StereoSequenceDataset):
    def __init__(
        self,
        aug_params=None,
        root="./data/datasets/TartanAir/",
        split="train",
        sample_len=-1,
        only_first_n_samples=-1,
    ):
        super().__init__(aug_params, sparse=False)
        self.sample_len = sample_len
        self.split = split

        # Each entry is (scene, motion, part)
        test_entries = [
            ("abandonedfactory", "Easy", "P002"),
            ("abandonedfactory", "Hard", "P002"),
            ("amusement", "Easy", "P007"),
            ("amusement", "Hard", "P007"),
            ("carwelding", "Hard", "P003"),
            ("endofworld", "Easy", "P006"),
            ("endofworld", "Hard", "P006"),
            ("gascola", "Easy", "P001"),
            ("gascola", "Hard", "P001"),
            ("hospital", "Hard", "P042"),
            ("office", "Easy", "P006"),
            ("office", "Hard", "P006"),
            ("office2", "Easy", "P004"),
            ("office2", "Hard", "P004"),
            ("oldtown", "Hard", "P006"),
            ("soulcity", "Easy", "P008"),
            ("soulcity", "Hard", "P008"),
        ]

        scene_root = sorted(glob(osp.join(root, "*")))

        sequence_root_list = []
        test_set = []
        train_set = []
        for i in range(len(scene_root)):
            sequence_root_list += sorted(glob(osp.join(scene_root[i], "Easy", "*"))) + sorted(glob(osp.join(scene_root[i], "Hard", "*")))

        for path in sequence_root_list:
            parts = path.split("/")
            if len(parts) < 5:
                continue  # skip malformed paths

            scene, motion, part = parts[-3], parts[-2], parts[-1]
            if (scene, motion, part) in test_entries:
                test_set.append(path)
            else:
                train_set.append(path)

        if self.split == "train":
            sequence_root_list = train_set
        elif self.split == "test":
            sequence_root_list = test_set
        else:
            raise KeyError(f"Wrong Split!")

        for i in range(len(sequence_root_list)):
            filenames = defaultdict(lambda: defaultdict(list))
            sequence_root = sequence_root_list[i]
            parts = os.path.normpath(sequence_root).split(os.sep)
            sequence_name = "_".join(parts[-3:])
            try:
                for cam in ['left', 'right']:
                    depth_path_list = sorted(glob(osp.join(sequence_root, "depth_left/", "*.npy")))
                    im_path_list = sorted(glob(osp.join(sequence_root, f"image_{cam}/", "*.png")))
                    pose_path = os.path.join(sequence_root, f"pose_{cam}.txt")
                    assert len(depth_path_list) == len(im_path_list), [len(depth_path_list), len(im_path_list)]
                    for j in range(len(depth_path_list)):
                        depth_path = depth_path_list[j]
                        assert os.path.isfile(depth_path), depth_path
                        filenames["depth"][cam].append(depth_path)
                        im_path = im_path_list[j]
                        assert os.path.isfile(im_path), im_path
                        filenames["image"][cam].append(im_path)
                        filenames["camera"][cam].append(pose_path)
                        filenames["metadata"][cam].append([sequence_name, (480,640)])
                        for k in filenames.keys():
                            assert (
                                    len(filenames[k][cam])
                                    == len(filenames["depth"][cam])
                                    > 0
                            ), sequence_name
                seq_len = len(filenames["image"][cam])
                print("seq_len", sequence_name, seq_len)

                if self.split == "train":
                    for ref_idx in range(0, seq_len, 3):
                        step = 1 if self.sample_len == 1 else np.random.randint(1, 6)
                        if ref_idx + step * self.sample_len < seq_len:
                            sample = defaultdict(lambda: defaultdict(list))
                            for cam in ["left", "right"]:
                                for idx in range(
                                        ref_idx, ref_idx + step * self.sample_len, step
                                ):
                                    for k in filenames.keys():
                                        if "mask" not in k:
                                            sample[k][cam].append(
                                                filenames[k][cam][idx]
                                            )
                            self.sample_list.append(sample)
                else:
                    step = self.sample_len if (self.sample_len > 0) and (self.sample_len < seq_len) else seq_len
                    print("sample_step", step)
                    counter = 0
                    for ref_idx in range(0, seq_len, step):
                        sample = defaultdict(lambda: defaultdict(list))
                        for cam in ["left", "right"]:
                            for idx in range(ref_idx, ref_idx + step):
                                for k in filenames.keys():
                                    sample[k][cam].append(filenames[k][cam][idx])

                        self.sample_list.append(sample)
                        counter += 1
                        if only_first_n_samples > 0 and counter >= only_first_n_samples:
                            break
            except Exception as e:
                print(e)
                print("Skipping sequence", sequence_name)
        assert len(self.sample_list) > 0, "No samples found"
        print(f"Added {len(self.sample_list)} from  TarTanAir  {split}")
        logging.info(f"Added {len(self.sample_list)} from TarTanAir {split}")


class VKITTI2Dataset(StereoSequenceDataset):
    def __init__(
        self,
        aug_params=None,
        root="./data/datasets/vkitti2/",
        split="train",
        sample_len=-1,
        only_first_n_samples=-1,
    ):
        super().__init__(aug_params, sparse=False)
        self.sample_len = sample_len
        self.split = split
        if self.split == "train":
            sequence_name_list = []
            scenes = ['Scene01', 'Scene02', 'Scene06', 'Scene18', 'Scene20']
            variations = ['15-deg-left', '15-deg-right', '30-deg-left', '30-deg-right',
                          'clone', 'fog', 'morning', 'overcast', 'rain', 'sunset']
            for scene in scenes:
                for variation in variations:
                    sequence_name = f"{scene}_{variation}"
                    sequence_name_list.append(sequence_name)
        elif self.split == "test":
            sequence_name_list = ["Scene01_15-deg-left", "Scene02_30-deg-right", "Scene06_fog", "Scene18_morning", "Scene20_rain"]
        else:
            raise KeyError(f"Wrong Split!")

        for i in range(len(sequence_name_list)):
            filenames = defaultdict(lambda: defaultdict(list))
            sequence_name = sequence_name_list[i]
            scene, variation = sequence_name.split("_")
            try:
                for cam in [('left', 0), ('right', 1)]:
                    depth_path_list = sorted(glob(osp.join(root, f"{scene}/{variation}/frames/depth/Camera_{cam[1]}/", "*.png")))
                    im_path_list = sorted(glob(osp.join(root, f"{scene}/{variation}/frames/rgb/Camera_{cam[1]}/", "*.jpg")))
                    intrinsic_path = os.path.join(root, f"{scene}/{variation}/intrinsic.txt")
                    assert len(depth_path_list) == len(im_path_list), [len(depth_path_list), len(im_path_list)]
                    for j in range(len(depth_path_list)):
                        depth_path = depth_path_list[j]
                        assert os.path.isfile(depth_path), depth_path
                        filenames["depth"][cam[0]].append(depth_path)
                        im_path = im_path_list[j]
                        assert os.path.isfile(im_path), im_path
                        filenames["image"][cam[0]].append(im_path)
                        filenames["camera"][cam[0]].append(intrinsic_path)
                        filenames["metadata"][cam[0]].append([sequence_name, (375,1242)])
                        for k in filenames.keys():
                            assert (
                                    len(filenames[k][cam[0]])
                                    == len(filenames["depth"][cam[0]])
                                    > 0
                            ), sequence_name
                seq_len = len(filenames["image"][cam[0]])
                print("seq_len", sequence_name, seq_len)

                if self.split == "train":
                    for ref_idx in range(0, seq_len, 3):
                        step = 1 if self.sample_len == 1 else np.random.randint(1, 6)
                        if ref_idx + step * self.sample_len < seq_len:
                            sample = defaultdict(lambda: defaultdict(list))
                            for cam in ["left", "right"]:
                                for idx in range(
                                        ref_idx, ref_idx + step * self.sample_len, step
                                ):
                                    for k in filenames.keys():
                                        if "mask" not in k:
                                            sample[k][cam].append(
                                                filenames[k][cam][idx]
                                            )
                            self.sample_list.append(sample)
                else:
                    step = self.sample_len if (self.sample_len > 0) and (self.sample_len < seq_len) else seq_len
                    print("sample_step", step)
                    counter = 0
                    for ref_idx in range(0, seq_len, step):
                        sample = defaultdict(lambda: defaultdict(list))
                        for cam in ["left", "right"]:
                            for idx in range(ref_idx, ref_idx + step):
                                for k in filenames.keys():
                                    sample[k][cam].append(filenames[k][cam][idx])

                        self.sample_list.append(sample)
                        counter += 1
                        if only_first_n_samples > 0 and counter >= only_first_n_samples:
                            break
            except Exception as e:
                print(e)
                print("Skipping sequence", sequence_name)
        assert len(self.sample_list) > 0, "No samples found"
        print(f"Added {len(self.sample_list)} from  VKITTI2  {split}")
        logging.info(f"Added {len(self.sample_list)} from VKITTI2 {split}")


class SequenceSpringDataset(StereoSequenceDataset):
    def __init__(
        self,
        aug_params=None,
        sample_len=-1,
        root="./data/datasets/Spring",
    ):
        super(SequenceSpringDataset, self).__init__(aug_params)
        self.split = "test"
        self.sample_len = sample_len
        original_length = len(self.sample_list)
        image_paths = defaultdict(list)
        disparity_paths = defaultdict(list)
        camera_paths = defaultdict(list)

        for cam in ["left", "right"]:
            image_paths[cam] = sorted(
                glob(osp.join(root, f"train_frame_{cam}/*"))
            )

        cam = "left"
        disparity_paths[cam] = sorted(
                glob(osp.join(root, f"train_disp1_{cam}/*"))
            )

        camera_paths[cam] = sorted(
                glob(osp.join(root, "train_cam_data/*"))
            )

        num_seq = len(image_paths["left"])
        # for each sequence
        for seq_idx in range(num_seq):
            sequence_name = os.path.basename(image_paths[cam][seq_idx])
            sample = defaultdict(lambda: defaultdict(list))
            for cam in ["left", "right"]:
                sample["image"][cam] = sorted(
                    glob(osp.join(image_paths[cam][seq_idx], f"frame_{cam}", "*.png"))
                )[:sample_len]
                # for _ in range(len(sample["image"][cam])):
                for _ in range(sample_len):
                    sample["metadata"][cam].append([sequence_name, (1080, 1920)])

            cam = "left"
            sample["disparity"][cam] = sorted(
                glob(osp.join(disparity_paths[cam][seq_idx], f"disp1_{cam}", "*.dsp5"))
            )[:sample_len]
            sample["camera"][cam] = sorted(
                glob(osp.join(camera_paths[cam][seq_idx], "cam_data", "*.txt"))
            )
            self.sample_list.append(sample)
            seq_len = len(sample["image"][cam])
            print("seq_len", sequence_name, seq_len)
        logging.info(
            f"Added {len(self.sample_list) - original_length} from Spring Dataset"
        )


class SequenceSceneFlowDataset(StereoSequenceDataset):
    def __init__(
        self,
        aug_params=None,
        root="./data/datasets",
        dstype="frames_cleanpass",
        sample_len=1,
        things_test=False,
        add_things=True,
        add_monkaa=True,
        add_driving=True,
    ):
        super(SequenceSceneFlowDataset, self).__init__(aug_params)
        self.root = root
        self.dstype = dstype
        self.sample_len = sample_len
        if things_test:
            self._add_things("TEST")
        else:
            if add_things:
                self._add_things("TRAIN")
            if add_monkaa:
                self._add_monkaa()
            if add_driving:
                self._add_driving()

    def _add_things(self, split="TRAIN"):
        """Add FlyingThings3D data"""

        original_length = len(self.sample_list)
        root = osp.join(self.root, "FlyingThings3D")
        image_paths = defaultdict(list)
        disparity_paths = defaultdict(list)

        for cam in ["left", "right"]:
            image_paths[cam] = sorted(
                glob(osp.join(root, self.dstype, split, f"*/*/{cam}/"))
            )
            disparity_paths[cam] = [
                path.replace(self.dstype, "disparity") for path in image_paths[cam]
            ]
        # Choose a random subset of 400 images for validation
        # state = np.random.get_state()
        # np.random.seed(1000)
        # val_idxs = set(np.random.permutation(len(image_paths["left"]))[:40])
        # np.random.set_state(state)
        # np.random.seed(0)
        num_seq = len(image_paths["left"])
        num = 0
        for seq_idx in range(num_seq):
            # if (split == "TEST" and seq_idx in val_idxs) or (
            #     split == "TRAIN" and not seq_idx in val_idxs
            # ):
            images, disparities = defaultdict(list), defaultdict(list)
            for cam in ["left", "right"]:
                images[cam] = sorted(
                    glob(osp.join(image_paths[cam][seq_idx], "*.png"))
                )
                disparities[cam] = sorted(
                    glob(osp.join(disparity_paths[cam][seq_idx], "*.pfm"))
                )
            num = num + len(images["left"])
            self._append_sample(images, disparities)
        print(num)
        assert len(self.sample_list) > 0, "No samples found"
        print(
            f"Added {len(self.sample_list) - original_length} from FlyingThings {self.dstype}"
        )
        logging.info(
            f"Added {len(self.sample_list) - original_length} from FlyingThings {self.dstype}"
        )

    def _add_monkaa(self):
        """Add FlyingThings3D data"""

        original_length = len(self.sample_list)
        root = osp.join(self.root, "Monkaa")
        image_paths = defaultdict(list)
        disparity_paths = defaultdict(list)

        for cam in ["left", "right"]:
            image_paths[cam] = sorted(glob(osp.join(root, self.dstype, f"*/{cam}/")))
            disparity_paths[cam] = [
                path.replace(self.dstype, "disparity") for path in image_paths[cam]
            ]

        num_seq = len(image_paths["left"])

        for seq_idx in range(num_seq):
            images, disparities = defaultdict(list), defaultdict(list)
            for cam in ["left", "right"]:
                images[cam] = sorted(glob(osp.join(image_paths[cam][seq_idx], "*.png")))
                disparities[cam] = sorted(
                    glob(osp.join(disparity_paths[cam][seq_idx], "*.pfm"))
                )

            self._append_sample(images, disparities)

        assert len(self.sample_list) > 0, "No samples found"
        print(
            f"Added {len(self.sample_list) - original_length} from Monkaa {self.dstype}"
        )
        logging.info(
            f"Added {len(self.sample_list) - original_length} from Monkaa {self.dstype}"
        )

    def _add_driving(self):
        """Add FlyingThings3D data"""

        original_length = len(self.sample_list)
        root = osp.join(self.root, "Driving")
        image_paths = defaultdict(list)
        disparity_paths = defaultdict(list)

        for cam in ["left", "right"]:
            image_paths[cam] = sorted(
                glob(osp.join(root, self.dstype, f"*/*/*/{cam}/"))
            )
            disparity_paths[cam] = [
                path.replace(self.dstype, "disparity") for path in image_paths[cam]
            ]

        num_seq = len(image_paths["left"])
        for seq_idx in range(num_seq):
            images, disparities = defaultdict(list), defaultdict(list)
            for cam in ["left", "right"]:
                images[cam] = sorted(glob(osp.join(image_paths[cam][seq_idx], "*.png")))
                disparities[cam] = sorted(
                    glob(osp.join(disparity_paths[cam][seq_idx], "*.pfm"))
                )

            self._append_sample(images, disparities)

        assert len(self.sample_list) > 0, "No samples found"
        print(
            f"Added {len(self.sample_list) - original_length} from Driving {self.dstype}"
        )
        logging.info(
            f"Added {len(self.sample_list) - original_length} from Driving {self.dstype}"
        )

    def _append_sample(self, images, disparities):
        seq_len = len(images["left"])
        for ref_idx in range(0, seq_len - self.sample_len):
            sample = defaultdict(lambda: defaultdict(list))
            for cam in ["left", "right"]:
                for idx in range(ref_idx, ref_idx + self.sample_len):
                    sample["image"][cam].append(images[cam][idx])
                    sample["disparity"][cam].append(disparities[cam][idx])
            self.sample_list.append(sample)

            sample = defaultdict(lambda: defaultdict(list))
            for cam in ["left", "right"]:
                for idx in range(ref_idx, ref_idx + self.sample_len):
                    sample["image"][cam].append(images[cam][seq_len - idx - 1])
                    sample["disparity"][cam].append(disparities[cam][seq_len - idx - 1])
            self.sample_list.append(sample)


class SequenceSintelStereo(StereoSequenceDataset):
    def __init__(
        self,
        dstype="clean",
        aug_params=None,
        root="./data/datasets",
    ):
        super().__init__(
            aug_params, sparse=True, reader=frame_utils.readDispSintelStereo
        )
        self.dstype = dstype
        self.split = "test"
        original_length = len(self.sample_list)
        image_root = osp.join(root, "sintel_stereo", "training")
        image_paths = defaultdict(list)
        disparity_paths = defaultdict(list)
        camera_paths = defaultdict(list)

        for cam in ["left", "right"]:
            image_paths[cam] = sorted(
                glob(osp.join(image_root, f"{self.dstype}_{cam}/*"))
            )

        cam = "left"
        disparity_paths[cam] = [
            path.replace(f"{self.dstype}_{cam}", "disparities")
            for path in image_paths[cam]
        ]
        camera_paths[cam] = [
            path.replace(f"{self.dstype}_{cam}", "camdata_left")
            for path in image_paths[cam]
        ]

        num_seq = len(image_paths["left"])
        # for each sequence
        for seq_idx in range(num_seq):
            sequence_name = os.path.basename(image_paths[cam][seq_idx])
            sample = defaultdict(lambda: defaultdict(list))
            for cam in ["left", "right"]:
                sample["image"][cam] = sorted(
                    glob(osp.join(image_paths[cam][seq_idx], "*.png"))
                )
                for _ in range(len(sample["image"][cam])):
                    sample["metadata"][cam].append([sequence_name, (436, 1024)])

            cam = "left"
            sample["disparity"][cam] = sorted(
                glob(osp.join(disparity_paths[cam][seq_idx], "*.png"))
            )
            sample["camera"][cam] = sorted(
                glob(osp.join(camera_paths[cam][seq_idx], "*.cam"))
            )

            for im1, disp, camera in zip(sample["image"][cam], sample["disparity"][cam], sample["camera"][cam]):
                assert (
                    im1.split("/")[-1].split(".")[0]
                    == disp.split("/")[-1].split(".")[0]
                    == camera.split("/")[-1].split(".")[0]
                ), (im1.split("/")[-1].split(".")[0], disp.split("/")[-1].split(".")[0], camera.split("/")[-1].split(".")[0])
            self.sample_list.append(sample)

        logging.info(
            f"Added {len(self.sample_list) - original_length} from SintelStereo {self.dstype}"
        )


def fetch_dataloader(args):
    """Create the data loader for the corresponding training set"""

    aug_params = {
        "crop_size": args.image_size,
        "min_scale": args.spatial_scale[0],
        "max_scale": args.spatial_scale[1],
        "do_flip": False,
        "yjitter": not args.noyjitter,
    }
    if hasattr(args, "saturation_range") and args.saturation_range is not None:
        aug_params["saturation_range"] = args.saturation_range
    if hasattr(args, "img_gamma") and args.img_gamma is not None:
        aug_params["gamma"] = args.img_gamma
    if hasattr(args, "do_flip") and args.do_flip is not None:
        aug_params["do_flip"] = args.do_flip

    train_dataset = None

    add_monkaa = "monkaa" in args.train_datasets
    add_driving = "driving" in args.train_datasets
    add_things = "things" in args.train_datasets
    add_dynamic_replica = "dynamic_replica" in args.train_datasets
    add_infinigensv = "infinigen_stereovideo" in args.train_datasets
    add_kittidepth = "kitti_depth" in args.train_datasets
    add_vkitti2 = "vkitti2" in args.train_datasets
    add_tartanair = "tartanair" in args.train_datasets
    new_dataset = None

    if add_monkaa or add_driving or add_things:
        # clean_dataset = SequenceSceneFlowDataset(
        #     aug_params,
        #     dstype="frames_cleanpass",
        #     sample_len=args.sample_len,
        #     add_monkaa=add_monkaa,
        #     add_driving=add_driving,
        #     add_things=add_things,
        # )

        final_dataset = SequenceSceneFlowDataset(
            aug_params,
            dstype="frames_finalpass",
            sample_len=args.sample_len,
            add_monkaa=add_monkaa,
            add_driving=add_driving,
            add_things=add_things,
        )
        # new_dataset = clean_dataset + final_dataset

        new_dataset = final_dataset

    if add_dynamic_replica:
        dr_dataset = DynamicReplicaDataset(
            aug_params, split="train", sample_len=args.sample_len
        )
        if new_dataset is None:
            new_dataset = dr_dataset
        else:
            new_dataset = new_dataset + dr_dataset

    if add_infinigensv:
        infinigensv_dataset = InfinigenStereoVideoDataset(
            aug_params, split="train", sample_len=args.sample_len
        )
        if new_dataset is None:
            new_dataset = infinigensv_dataset
        else:
            new_dataset = new_dataset + infinigensv_dataset + infinigensv_dataset + infinigensv_dataset

    if add_kittidepth:
        kittidepth_dataset = KITTIDepthDataset(
            aug_params, split="train", sample_len=args.sample_len
        )
        if new_dataset is None:
            new_dataset = kittidepth_dataset
        else:
            new_dataset = new_dataset + kittidepth_dataset

    if add_vkitti2:
        vkitti2_dataset = VKITTI2Dataset(
            aug_params, split="train", sample_len=args.sample_len
        )
        if new_dataset is None:
            new_dataset = vkitti2_dataset
        else:
            new_dataset = new_dataset + vkitti2_dataset

    if add_tartanair:
        tartanair_dataset = TartanAirDataset(
            aug_params, split="train", sample_len=args.sample_len
        )
        if new_dataset is None:
            new_dataset = tartanair_dataset
        else:
            new_dataset = new_dataset + tartanair_dataset

    logging.info(f"Adding {len(new_dataset)} samples in total")
    train_dataset = (
        new_dataset if train_dataset is None else train_dataset + new_dataset
    )

    train_loader = data.DataLoader(
        train_dataset,
        batch_size=args.batch_size,
        pin_memory=True,
        shuffle=True,
        num_workers=args.num_workers,
        drop_last=True,
    )

    logging.info("Training with %d image pairs" % len(train_dataset))
    return train_loader


================================================
FILE: demo.py
================================================
import sys

import argparse
import os
import cv2
import glob
import numpy as np
import torch
import torch.nn.functional as F
from collections import defaultdict

from PIL import Image
from matplotlib import pyplot as plt
from pathlib import Path

DEVICE = 'cuda'


def load_image(imfile):
    img = np.array(Image.open(imfile).convert('RGB')).astype(np.uint8)
    img = torch.from_numpy(img).permute(2, 0, 1).float()
    return img.to(DEVICE)


def viz(img, flo):
    img = img[0].permute(1, 2, 0).cpu().numpy()
    flo = flo[0].permute(1, 2, 0).cpu().numpy()

    # map flow to rgb image
    flo = flow_viz.flow_to_image(flo)
    img_flo = np.concatenate([img, flo], axis=0)

    cv2.imshow('image', img_flo[:, :, [2, 1, 0]] / 255.0)
    cv2.waitKey()


def demo(args):
    from stereoanyvideo.models.stereoanyvideo_model import StereoAnyVideoModel
    model = StereoAnyVideoModel()

    if args.ckpt is not None:
        assert args.ckpt.endswith(".pth") or args.ckpt.endswith(
            ".pt"
        )
        strict = True
        state_dict = torch.load(args.ckpt)
        if "model" in state_dict:
            state_dict = state_dict["model"]
        if list(state_dict.keys())[0].startswith("module."):
            state_dict = {
                k.replace("module.", ""): v for k, v in state_dict.items()
            }
        model.model.load_state_dict(state_dict, strict=strict)
        print("Done loading model checkpoint", args.ckpt)

    model.to(DEVICE)
    model.eval()

    output_directory = args.output_path
    parent_directory = os.path.dirname(output_directory)
    if not os.path.exists(parent_directory):
        os.makedirs(parent_directory)
    if not os.path.isdir(output_directory):
        os.mkdir(output_directory)

    with torch.no_grad():
        images_left = sorted(glob.glob(os.path.join(args.path, 'left/*.png')) + glob.glob(os.path.join(args.path, 'left/*.jpg')))
        images_right = sorted(glob.glob(os.path.join(args.path, 'right/*.png')) + glob.glob(os.path.join(args.path, 'right/*.jpg')))
        assert len(images_left) == len(images_right), [len(images_left), len(images_right)]
        assert len(images_left) > 0, args.path
        print(f"Found {len(images_left)} frames. Saving files to {args.output_path}")

        num_frames = len(images_left)
        frame_size = args.frame_size

        disparities_ori_all = []

        for start_idx in range(0, num_frames, frame_size):
            end_idx = min(start_idx + frame_size, num_frames)

            image_left_list = []
            image_right_list = []

            for imfile1, imfile2 in zip(images_left[start_idx:end_idx], images_right[start_idx:end_idx]):
                image_left = load_image(imfile1)
                image_right = load_image(imfile2)
                image_left = F.interpolate(image_left[None], size=args.resize, mode="bilinear", align_corners=True)
                image_right = F.interpolate(image_right[None], size=args.resize, mode="bilinear", align_corners=True)
                image_left_list.append(image_left[0])
                image_right_list.append(image_right[0])

            video_left = torch.stack(image_left_list, dim=0)
            video_right = torch.stack(image_right_list, dim=0)

            batch_dict = defaultdict(list)
            batch_dict["stereo_video"] = torch.stack([video_left, video_right], dim=1)

            predictions = model(batch_dict)

            assert "disparity" in predictions
            disparities = predictions["disparity"][:, :1].clone().data.cpu().abs().numpy()
            disparities_ori = disparities.astype(np.uint8)
            disparities_ori_all.extend(disparities_ori)

        disparities_ori_all = np.array(disparities_ori_all)

        epsilon = 1e-5  # Smallest allowable disparity
        disparities_ori_all[disparities_ori_all < epsilon] = epsilon

        disparities_all = ((disparities_ori_all - disparities_ori_all.min()) / (disparities_ori_all.max() - disparities_ori_all.min()) * 255).astype(np.uint8)

        video_ori_disparity = cv2.VideoWriter(
            os.path.join(args.output_path, "disparity.mp4"),
            cv2.VideoWriter_fourcc(*"mp4v"),
            fps=args.fps,
            frameSize=(disparities_all.shape[3], disparities_all.shape[2]),
            isColor=True,
        )
        video_disparity = cv2.VideoWriter(
            os.path.join(args.output_path, "disparity_norm.mp4"),
            cv2.VideoWriter_fourcc(*"mp4v"),
            fps=args.fps,
            frameSize=(disparities_all.shape[3], disparities_all.shape[2]),
            isColor=True,
        )

        for i in range(num_frames):
            imfile1 = images_left[i]

            disparity_norm = disparities_all[i]
            disparity_norm = disparity_norm.transpose(1, 2, 0)
            disparity_norm_vis = cv2.applyColorMap(disparity_norm, cv2.COLORMAP_INFERNO)
            video_disparity.write(disparity_norm_vis)

            disparity_ori = disparities_ori_all[i]
            disparity_ori = disparity_ori.transpose(1, 2, 0)
            disparity_ori_vis = cv2.applyColorMap(disparity_ori, cv2.COLORMAP_INFERNO)
            video_ori_disparity.write(disparity_ori_vis)

            if args.save_png:
                filename_temp = args.output_path + '/disparity_norm_' + str(i).zfill(3) + '.png'
                cv2.imwrite(filename_temp, disparity_norm_vis)
                filename_temp = args.output_path + '/disparity_ori_' + str(i).zfill(3) + '.png'
                cv2.imwrite(filename_temp, disparity_ori_vis)

        video_ori_disparity.release()
        video_disparity.release()

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name', default="stereoanyvideo", help="name to specify model")
    parser.add_argument('--ckpt', default=None, help="checkpoint of stereo model")
    parser.add_argument('--resize', default=(720, 1280), help="image size input to the model")
    parser.add_argument("--fps", type=int, default=30, help="frame rate for video visualization")
    parser.add_argument('--path', help="dataset for evaluation")
    parser.add_argument("--save_png", action="store_true")
    parser.add_argument("--frame_size", type=int, default=150, help="number of updates in each forward pass.")
    parser.add_argument("--iters",type=int, default=20, help="number of updates in each forward pass.")
    parser.add_argument("--kernel_size", type=int, default=20, help="number of frames in each forward pass.")
    parser.add_argument('--output_path', help="directory to save output", default="demo_output")
    args = parser.parse_args()

    demo(args)


================================================
FILE: demo.sh
================================================
#!/bin/bash

export PYTHONPATH=`(cd ../ && pwd)`:`pwd`:$PYTHONPATH

python demo.py --ckpt ./checkpoints/StereoAnyVideo_MIX.pth --path ./demo_video/ --output_path ./demo_output/ --save_png

================================================
FILE: evaluate_stereoanyvideo.sh
================================================
#!/bin/bash

export PYTHONPATH=`(cd ../ && pwd)`:`pwd`:$PYTHONPATH

# evaluate on [sintel, dynamicreplica， infinigensv, vkitti2] using sceneflow checkpoint

python ./evaluation/evaluate.py --config-name eval_sintel_final \
MODEL.model_name=StereoAnyVideoModel \
MODEL.StereoAnyVideoModel.model_weights=./checkpoints/StereoAnyVideo_SF.pth

python ./evaluation/evaluate.py --config-name eval_dynamic_replica \
MODEL.model_name=StereoAnyVideoModel \
MODEL.StereoAnyVideoModel.model_weights=./checkpoints/StereoAnyVideo_SF.pth

python ./evaluation/evaluate.py --config-name eval_infinigensv \
MODEL.model_name=StereoAnyVideoModel \
MODEL.StereoAnyVideoModel.model_weights=./checkpoints/StereoAnyVideo_SF.pth

python ./evaluation/evaluate.py --config-name eval_vkitti2 \
MODEL.model_name=StereoAnyVideoModel \
MODEL.StereoAnyVideoModel.model_weights=./checkpoints/StereoAnyVideo_SF.pth



# evaluate on [sintel, kittidepth, southkensingtonSV] using mixed checkpoint

python ./evaluation/evaluate.py --config-name eval_sintel_final \
MODEL.model_name=StereoAnyVideoModel \
MODEL.StereoAnyVideoModel.model_weights=./checkpoints/StereoAnyVideo_MIX.pth

python ./evaluation/evaluate.py --config-name eval_kittidepth \
MODEL.model_name=StereoAnyVideoModel \
MODEL.StereoAnyVideoModel.model_weights=./checkpoints/StereoAnyVideo_MIX.pth

python ./evaluation/evaluate.py --config-name eval_southkensington \
MODEL.model_name=StereoAnyVideoModel \
MODEL.StereoAnyVideoModel.model_weights=./checkpoints/StereoAnyVideo_SF.pth

================================================
FILE: evaluation/configs/eval_dynamic_replica.yaml
================================================
defaults:
  - default_config_eval
visualize_interval: -1
exp_dir: ./outputs/stereoanyvideo_DynamicReplica
sample_len: 150
MODEL:
    model_name: StereoAnyVideoModel


================================================
FILE: evaluation/configs/eval_infinigensv.yaml
================================================
defaults:
  - default_config_eval
visualize_interval: -1
render_bin_size: 0
exp_dir: ./outputs/stereoanyvideo_InfinigenSV
sample_len: 150
dataset_name: infinigensv
MODEL:
    model_name: StereoAnyVideoModel


================================================
FILE: evaluation/configs/eval_kittidepth.yaml
================================================
defaults:
  - default_config_eval
visualize_interval: -1
render_bin_size: 0
exp_dir: ./outputs/stereoanyvideo_KITTIDepth
sample_len: 300
dataset_name: kitti_depth
MODEL:
    model_name: StereoAnyVideoModel


================================================
FILE: evaluation/configs/eval_sintel_clean.yaml
================================================
defaults:
  - default_config_eval
visualize_interval: -1
render_bin_size: 0
exp_dir: ./outputs/stereoanyvideo_sintel_clean
dataset_name: sintel
dstype: clean
MODEL:
    model_name: StereoAnyVideoModel


================================================
FILE: evaluation/configs/eval_sintel_final.yaml
================================================
defaults:
  - default_config_eval
visualize_interval: -1
render_bin_size: 0
exp_dir: ./outputs/stereoanyvideo_sintel_final
dataset_name: sintel
dstype: final
MODEL:
    model_name: StereoAnyVideoModel

================================================
FILE: evaluation/configs/eval_southkensington.yaml
================================================
defaults:
  - default_config_eval
visualize_interval: 1
exp_dir: ./outputs/stereoanyvideo_SouthKensingtonIndoor
sample_len: 300
dataset_name: southkensingtonsv
MODEL:
    model_name: StereoAnyVideoModel


================================================
FILE: evaluation/configs/eval_vkitti2.yaml
================================================
defaults:
  - default_config_eval
visualize_interval: -1
render_bin_size: 0
exp_dir: ./outputs/stereoanyvideo_VKITTI2
sample_len: 300
dataset_name: vkitti2
MODEL:
    model_name: StereoAnyVideoModel


================================================
FILE: evaluation/core/evaluator.py
================================================
import os
import numpy as np
import cv2
from collections import defaultdict
import torch.nn.functional as F
import torch
import matplotlib.pyplot as plt
from tqdm import tqdm
from omegaconf import DictConfig
from pytorch3d.implicitron.tools.config import Configurable
from stereoanyvideo.evaluation.utils.eval_utils import depth2disparity_scale, eval_batch
from stereoanyvideo.evaluation.utils.utils import (
    PerceptionPrediction,
    pretty_print_perception_metrics,
    visualize_batch,
)


def depth_to_colormap(depth, colormap='jet', eps=1e-3, scale_vmin=1.0):
    valid = (depth > eps) & (depth < 1e4)
    vmin = depth[valid].min() * scale_vmin
    vmax = depth[valid].max()
    if colormap=='jet':
        cmap = plt.cm.jet
    else:
        cmap = plt.cm.inferno
    norm = plt.Normalize(vmin=vmin, vmax=vmax)
    depth = cmap(norm(depth))
    depth[~valid] = 1
    return np.ascontiguousarray(depth[...,:3] * 255, dtype=np.uint8)


class Evaluator(Configurable):
    """
    A class defining the DynamicStereo evaluator.

    Args:
        eps: Threshold for converting disparity to depth.
    """

    eps = 1e-5

    def setup_visualization(self, cfg: DictConfig) -> None:
        # Visualization
        self.visualize_interval = cfg.visualize_interval
        self.render_bin_size = cfg.render_bin_size
        self.exp_dir = cfg.exp_dir
        if self.visualize_interval > 0:
            self.visualize_dir = os.path.join(cfg.exp_dir, "visualisations")

    @torch.no_grad()
    def evaluate_sequence(
        self,
        model,
        model_stabilizer,
        test_dataloader: torch.utils.data.DataLoader,
        is_real_data: bool = False,
        step=None,
        writer=None,
        train_mode=False,
        interp_shape=None,
        exp_dir=None,
    ):
        model.eval()
        per_batch_eval_results = []

        if self.visualize_interval > 0:
            os.makedirs(self.visualize_dir, exist_ok=True)

        for batch_idx, sequence in enumerate(tqdm(test_dataloader)):
            batch_dict = defaultdict(list)
            batch_dict["stereo_video"] = sequence["img"]
            if not is_real_data:
                batch_dict["disparity"] = sequence["disp"][:, 0].abs()
                batch_dict["disparity_mask"] = sequence["valid_disp"][:, :1]

                if "mask" in sequence:
                    batch_dict["fg_mask"] = sequence["mask"][:, :1]
                else:
                    batch_dict["fg_mask"] = torch.ones_like(
                        batch_dict["disparity_mask"]
                    )

            elif interp_shape is not None:
                left_video = batch_dict["stereo_video"][:, 0]
                left_video = F.interpolate(
                    left_video, tuple(interp_shape), mode="bilinear"
                )
                right_video = batch_dict["stereo_video"][:, 1]
                right_video = F.interpolate(
                    right_video, tuple(interp_shape), mode="bilinear"
                )
                batch_dict["stereo_video"] = torch.stack([left_video, right_video], 1)

            if model_stabilizer is not None:
                predictions = model.forward_stabilizer(batch_dict, model_stabilizer)
            elif train_mode:
                predictions = model.forward_batch_test(batch_dict)
            else:
                predictions = model(batch_dict)

            assert "disparity" in predictions
            predictions["disparity"] = predictions["disparity"][:, :1].clone().cpu()
            if not is_real_data:
                predictions["disparity"] = predictions["disparity"] * (
                    batch_dict["disparity_mask"].round()
                )

                batch_eval_result, seq_length = eval_batch(batch_dict, predictions, sequence["depth2disp_scale"][0])
                per_batch_eval_results.append((batch_eval_result, seq_length))
                pretty_print_perception_metrics(batch_eval_result)

            if (self.visualize_interval > 0) and (
                batch_idx % self.visualize_interval == 0
            ):
                perception_prediction = PerceptionPrediction()

                pred_disp = predictions["disparity"]
                pred_disp[pred_disp < self.eps] = self.eps

                scale = sequence["depth2disp_scale"][0]
                perception_prediction.depth_map = (scale / pred_disp).cuda()

                perspective_cameras = []
                if "viewpoint" in sequence:
                    for cam in sequence["viewpoint"]:
                        perspective_cameras.append(cam[0])
                        perception_prediction.perspective_cameras = perspective_cameras

                if "stereo_original_video" in batch_dict:
                    batch_dict["stereo_video"] = batch_dict[
                        "stereo_original_video"
                    ].clone()

                for k, v in batch_dict.items():
                    if isinstance(v, torch.Tensor):
                        batch_dict[k] = v.cuda()

                visualize_batch(
                    batch_dict,
                    perception_prediction,
                    output_dir=self.visualize_dir,
                    sequence_name=sequence["metadata"][0][0][0],
                    step=step,
                    writer=writer,
                    render_bin_size=self.render_bin_size,
                )
                filename = os.path.join(self.visualize_dir, sequence["metadata"][0][0][0])
                if not os.path.isdir(filename):
                    os.mkdir(filename)
                disparity_list = pred_disp.data.cpu().numpy()
                depth_list = perception_prediction.depth_map.data.cpu().numpy()
                np.save(f"{filename}_depth_list.npy", depth_list)

                video_disparity = cv2.VideoWriter(
                    f"{filename}_disparity.mp4",
                    cv2.VideoWriter_fourcc(*"mp4v"),
                    fps=30,
                    frameSize=(
                        batch_dict["stereo_video"][:, 0][0].shape[2], batch_dict["stereo_video"][:, 0][0].shape[1]),
                    isColor=True,
                )

                disparity_vis = depth_to_colormap(disparity_list[:, 0], eps=self.eps, colormap='inferno')
                for i in range(disparity_list.shape[0]):
                    filename_temp = filename + '/disparity_' + str(i).zfill(3) + '.png'
                    disparity_vis[i] = cv2.cvtColor(disparity_vis[i], cv2.COLOR_RGB2BGR)
                    cv2.imwrite(filename_temp, disparity_vis[i])
                    video_disparity.write(disparity_vis[i])
                video_disparity.release()

        return per_batch_eval_results


================================================
FILE: evaluation/evaluate.py
================================================
import json
import os
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

import hydra
import numpy as np

import torch
from omegaconf import OmegaConf

from stereoanyvideo.evaluation.utils.utils import aggregate_and_print_results

import stereoanyvideo.datasets.video_datasets as datasets

from stereoanyvideo.models.core.model_zoo import (
    get_all_model_default_configs,
    model_zoo,
)
from pytorch3d.implicitron.tools.config import get_default_args_field
from stereoanyvideo.evaluation.core.evaluator import Evaluator


@dataclass(eq=False)
class DefaultConfig:
    exp_dir: str = "./outputs"
    stabilizer_ckpt: Optional[str] = None

    # one of [sintel, dynamicreplica, things, kitti_depth, infinigensv, southkensingtonsv]
    dataset_name: str = "dynamicreplica"

    sample_len: int = -1
    dstype: Optional[str] = None
    # clean, final
    MODEL: Dict[str, Any] = field(
        default_factory=lambda: get_all_model_default_configs()
    )
    EVALUATOR: Dict[str, Any] = get_default_args_field(Evaluator)

    seed: int = 42
    gpu_idx: int = 0

    visualize_interval: int = 1  # Use 0 for no visualization

    render_bin_size: Optional[int] = None

    # Override hydra's working directory to current working dir,
    # also disable storing the .hydra logs:
    hydra: dict = field(
        default_factory=lambda: {
            "run": {"dir": "."},
            "output_subdir": None,
        }
    )


def run_eval(cfg: DefaultConfig):
    """
    Evaluates new view synthesis metrics of a specified model
    on a benchmark dataset.
    """
    # make the experiment directory
    os.makedirs(cfg.exp_dir, exist_ok=True)

    # dump the exp cofig to the exp_dir
    cfg_file = os.path.join(cfg.exp_dir, "expconfig.yaml")
    with open(cfg_file, "w") as f:
        OmegaConf.save(config=cfg, f=f)

    torch.manual_seed(cfg.seed)
    np.random.seed(cfg.seed)
    evaluator = Evaluator(**cfg.EVALUATOR)

    model = model_zoo(**cfg.MODEL)
    model.cuda(0)
    evaluator.setup_visualization(cfg)

    if cfg.dataset_name == "dynamicreplica":
        test_dataloader = datasets.DynamicReplicaDataset(
            split="test", sample_len=cfg.sample_len, only_first_n_samples=1
        )
    elif cfg.dataset_name == "infinigensv":
        test_dataloader = datasets.InfinigenStereoVideoDataset(
            split="test", sample_len=cfg.sample_len, only_first_n_samples=1
        )
    elif cfg.dataset_name == "southkensingtonsv":
        test_dataloader = datasets.SouthKensingtonStereoVideoDataset(
            sample_len=cfg.sample_len, only_first_n_samples=1
        )
        evaluator.evaluate_sequence(
            model,
            None,
            test_dataloader,
            is_real_data=True,
            exp_dir=cfg.exp_dir
        )
        return

    elif cfg.dataset_name == "kitti_depth":
        test_dataloader = datasets.KITTIDepthDataset(
            split="test", sample_len=cfg.sample_len, only_first_n_samples=1
        )
    elif cfg.dataset_name == "vkitti2":
        test_dataloader = datasets.VKITTI2Dataset(
            split="test", sample_len=cfg.sample_len, only_first_n_samples=1
        )
    elif cfg.dataset_name == "sintel":
        test_dataloader = datasets.SequenceSintelStereo(dstype=cfg.dstype)
    elif cfg.dataset_name == "things":
        test_dataloader = datasets.SequenceSceneFlowDatasets(
            {},
            dstype=cfg.dstype,
            sample_len=cfg.sample_len,
            add_monkaa=False,
            add_driving=False,
            things_test=True,
        )

    evaluate_result = evaluator.evaluate_sequence(
        model,
        None,
        test_dataloader,
        is_real_data=False,
        exp_dir=cfg.exp_dir
    )

    aggreegate_result = aggregate_and_print_results(evaluate_result)

    result_file = os.path.join(cfg.exp_dir, f"result_eval.json")

    print(f"Dumping eval results to {result_file}.")
    with open(result_file, "w") as f:
        json.dump(aggreegate_result, f)


cs = hydra.core.config_store.ConfigStore.instance()
cs.store(name="default_config_eval", node=DefaultConfig)


@hydra.main(config_path="./configs/", config_name="default_config_eval")
def evaluate(cfg: DefaultConfig) -> None:
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = str(cfg.gpu_idx)
    run_eval(cfg)


if __name__ == "__main__":
    evaluate()


================================================
FILE: evaluation/utils/eval_utils.py
================================================
from dataclasses import dataclass
from typing import Dict, Optional, Union
from stereoanyvideo.evaluation.utils.ssim import SSIM
import torch
import torch.nn.functional as F
import numpy as np
import math
import cv2
from pytorch3d.utils import opencv_from_cameras_projection
from stereoanyvideo.models.raft_model import RAFTModel


@dataclass(eq=True, frozen=True)
class PerceptionMetric:
    metric: str
    depth_scaling_norm: Optional[str] = None
    suffix: str = ""
    index: str = ""

    def __str__(self):
        return (
            self.metric
            + self.index
            + (
                ("_norm_" + self.depth_scaling_norm)
                if self.depth_scaling_norm is not None
                else ""
            )
            + self.suffix
        )


def compute_flow(seq, is_seq=True):
    raft = RAFTModel().cuda()
    raft.eval()
    if is_seq:
        t, c, h, w = seq.size()
        flows_forward = []
        for i in range(t-1):
            flow_forward = raft.forward_fullres(seq[i][None], seq[i+1][None], iters=20)
            flows_forward.append(flow_forward)
        flows_forward = torch.cat(flows_forward, dim=0)
        return flows_forward

    else:
        img1, img2 = seq
        flow_forward = raft.forward_fullres(img1, img2, iters=20)
        return flow_forward

def flow_warp(x, flow):
    if flow.size(3) != 2:  # [B, H, W, 2]
        flow = flow.permute(0, 2, 3, 1)
    if x.size()[-2:] != flow.size()[1:3]:
        raise ValueError(f'The spatial sizes of input ({x.size()[-2:]}) and '
                         f'flow ({flow.size()[1:3]}) are not the same.')
    _, _, h, w = x.size()
    # create mesh grid
    grid_y, grid_x = torch.meshgrid(torch.arange(0, h), torch.arange(0, w))
    grid = torch.stack((grid_x, grid_y), 2).type_as(x)  # (h, w, 2)
    grid.requires_grad = False

    grid_flow = grid + flow
    # scale grid_flow to [-1,1]
    grid_flow_x = 2.0 * grid_flow[:, :, :, 0] / max(w - 1, 1) - 1.0
    grid_flow_y = 2.0 * grid_flow[:, :, :, 1] / max(h - 1, 1) - 1.0
    grid_flow = torch.stack((grid_flow_x, grid_flow_y), dim=3)
    output = F.grid_sample(
        x,
        grid_flow,
        mode='bilinear',
        padding_mode='zeros',
        align_corners=True)
    return output

def eval_endpoint_error_sequence(
    x: torch.Tensor,
    y: torch.Tensor,
    mask: torch.Tensor,
    crop: int = 0,
    mask_thr: float = 0.5,
    clamp_thr: float = 1e-5,
) -> Dict[str, torch.Tensor]:

    assert len(x.shape) == len(y.shape) == len(mask.shape) == 4, (
        x.shape,
        y.shape,
        mask.shape,
    )
    assert x.shape[0] == y.shape[0] == mask.shape[0], (x.shape, y.shape, mask.shape)

    # chuck out the border
    if crop > 0:
        if crop > min(y.shape[2:]) - crop:
            raise ValueError("Incorrect crop size.")
        y = y[:, :, crop:-crop, crop:-crop]
        x = x[:, :, crop:-crop, crop:-crop]
        mask = mask[:, :, crop:-crop, crop:-crop]

    y = y * (mask > mask_thr).float()
    x = x * (mask > mask_thr).float()
    y[torch.isnan(y)] = 0

    results = {}
    for epe_name in ("epe", "temp_epe"):
        if epe_name == "epe":
            endpoint_error = (mask * (x - y) ** 2).sum(dim=1).sqrt()
        elif epe_name == "temp_epe":
            delta_mask = mask[:-1] * mask[1:]
            # endpoint_error = (
            #     (delta_mask * ((x[:-1] - x[1:]) - (y[:-1] - y[1:])) ** 2)
            #     .sum(dim=1)
            #     .sqrt()
            # )
            endpoint_error = (
                (delta_mask * ((x[:-1] - x[1:]).abs() - (y[:-1] - y[1:]).abs()) ** 2)
                .sum(dim=1)
                .sqrt()
            )

        # epe_nonzero = endpoint_error != 0
        nonzero = torch.count_nonzero(endpoint_error)
        epe_mean = endpoint_error.sum() / torch.clamp(
            nonzero, clamp_thr
        )  # average error for all the sequence pixels
        epe_inv_accuracy_05px = (endpoint_error > 0.5).sum() / torch.clamp(
            nonzero, clamp_thr
        )
        epe_inv_accuracy_1px = (endpoint_error > 1).sum() / torch.clamp(
            nonzero, clamp_thr
        )
        epe_inv_accuracy_2px = (endpoint_error > 2).sum() / torch.clamp(
            nonzero, clamp_thr
        )
        epe_inv_accuracy_3px = (endpoint_error > 3).sum() / torch.clamp(
            nonzero, clamp_thr
        )

        results[f"{epe_name}_mean"] = epe_mean[None]
        results[f"{epe_name}_bad_0.5px"] = epe_inv_accuracy_05px[None] * 100
        results[f"{epe_name}_bad_1px"] = epe_inv_accuracy_1px[None] * 100
        results[f"{epe_name}_bad_2px"] = epe_inv_accuracy_2px[None] * 100
        results[f"{epe_name}_bad_3px"] = epe_inv_accuracy_3px[None] * 100
    return results


def eval_TCC_sequence(
    x: torch.Tensor,
    y: torch.Tensor,
    mask: torch.Tensor,
    crop: int = 0,
    mask_thr: float = 0.5,
) -> Dict[str, torch.Tensor]:

    assert len(x.shape) == len(y.shape) == len(mask.shape) == 4, (
        x.shape,
        y.shape,
        mask.shape,
    )
    assert x.shape[0] == y.shape[0] == mask.shape[0], (x.shape, y.shape, mask.shape)
    t, c, h, w = x.shape
    # chuck out the border
    if crop > 0:
        if crop > min(y.shape[2:]) - crop:
            raise ValueError("Incorrect crop size.")
        y = y[:, :, crop:-crop, crop:-crop]
        x = x[:, :, crop:-crop, crop:-crop]
        mask = mask[:, :, crop:-crop, crop:-crop]

    y = y * (mask > mask_thr).float()
    x = x * (mask > mask_thr).float()
    x[torch.isnan(x)] = 0
    y[torch.isnan(y)] = 0

    ssim_loss = SSIM(1.0, nonnegative_ssim=True)
    delta_mask = mask[:-1] * mask[1:]

    tcc = 0
    for i in range(t-1):
        tcc += ssim_loss((torch.abs(x[i][None] - x[i+1][None]) * delta_mask[i]).expand(-1, 3, -1, -1),
                          (torch.abs(y[i][None] - y[i+1][None]) * delta_mask[i]).expand(-1, 3, -1, -1))
    tcc = tcc / (t-1)

    return tcc

def eval_TCM_sequence(
    x: torch.Tensor,
    y: torch.Tensor,
    mask: torch.Tensor,
    crop: int = 0,
    mask_thr: float = 0.5,
) -> Dict[str, torch.Tensor]:

    assert len(x.shape) == len(y.shape) == len(mask.shape) == 4, (
        x.shape,
        y.shape,
        mask.shape,
    )
    assert x.shape[0] == y.shape[0] == mask.shape[0], (x.shape, y.shape, mask.shape)

    t, c, h, w = x.shape
    # chuck out the border
    if crop > 0:
        if crop > min(y.shape[2:]) - crop:
            raise ValueError("Incorrect crop size.")
        y = y[:, :, crop:-crop, crop:-crop]
        x = x[:, :, crop:-crop, crop:-crop]
        mask = mask[:, :, crop:-crop, crop:-crop]

    y = y * (mask > mask_thr).float()
    x = x * (mask > mask_thr).float()
    y[torch.isnan(y)] = 0

    ssim_loss = SSIM(1.0, nonnegative_ssim=True, size_average=False)
    delta_mask = mask[:-1] * mask[1:]

    tcm = 0
    for i in range(t-1):
        dmax = torch.max(y[i][None].view(1, -1), -1)[0].view(1, 1, 1, 1).expand(-1, 3, -1, -1)
        dmin = torch.min(y[i][None].view(1, -1), -1)[0].view(1, 1, 1, 1).expand(-1, 3, -1, -1)

        x_norm = (x[i][None].expand(-1, 3, -1, -1) - dmin) / (dmax - dmin) * 255.
        x2_norm = (x[i+1][None].expand(-1, 3, -1, -1) - dmin) / (dmax - dmin) * 255.
        x_flow = compute_flow([x_norm.cuda(), x2_norm.cuda()], is_seq=False).cpu()

        y_norm = (y[i][None].expand(-1, 3, -1, -1) - dmin) / (dmax - dmin) * 255.
        y2_norm = (y[i+1][None].expand(-1, 3, -1, -1) - dmin) / (dmax - dmin) * 255.
        y_flow = compute_flow([y_norm.cuda(), y2_norm.cuda()], is_seq=False).cpu()

        flow_mask = torch.sum(y_flow > 250, 1, keepdim=True) == 0

        mask = delta_mask[i][None] * flow_mask
        mask = mask.expand(-1, 3, -1, -1)
        if torch.sum(mask) > 0:
            tcm += torch.mean(ssim_loss(
                torch.cat((x_flow, torch.ones_like(x_flow[:, 0, None, ...])), 1) * mask,
                torch.cat((y_flow, torch.ones_like(x_flow[:, 0, None, ...])), 1) * mask)[:, :2])
    tcm = tcm / (t-1)

    return tcm


def eval_OPW_sequence(
    img: torch.Tensor,
    x: torch.Tensor,
    y: torch.Tensor,
    mask: torch.Tensor,
    crop: int = 0,
    mask_thr: float = 0.5,
    clamp_thr: float = 1e-5,
) -> Dict[str, torch.Tensor]:

    assert len(x.shape) == len(y.shape) == len(mask.shape) == 4, (
        x.shape,
        y.shape,
        mask.shape,
    ) # T, 1, H, W
    assert x.shape[0] == y.shape[0] == mask.shape[0], (x.shape, y.shape, mask.shape)

    t, c, h, w = img[:, 0].shape
    # chuck out the border
    if crop > 0:
        if crop > min(y.shape[2:]) - crop:
            raise ValueError("Incorrect crop size.")
        y = y[:, :, crop:-crop, crop:-crop]
        x = x[:, :, crop:-crop, crop:-crop]
        mask = mask[:, :, crop:-crop, crop:-crop]

    y = y * (mask > mask_thr).float()
    x = x * (mask > mask_thr).float()
    y[torch.isnan(y)] = 0
    delta_mask = mask[:-1] * mask[1:]
    depth_mask_30 = torch.sum(y > 30, 1, keepdim=True) == 0
    depth_mask_30 = depth_mask_30[:-1] * depth_mask_30[1:]
    depth_mask_50 = torch.sum(y > 50, 1, keepdim=True) == 0
    depth_mask_50 = depth_mask_50[:-1] * depth_mask_50[1:]
    depth_mask_100 = torch.sum(y > 100, 1, keepdim=True) == 0
    depth_mask_100 = depth_mask_100[:-1] * depth_mask_100[1:]

    flow = compute_flow(img[:, 0].cuda()).cpu()
    warped_disp = flow_warp(x[1:], flow)
    warped_img = flow_warp(img[:, 0][1:].float(), flow)

    flow_mask = torch.sum(flow > 250, 1, keepdim=True) == 0

    delta_mask = delta_mask * torch.exp(-50. * torch.sqrt(
        ((warped_img / 255. - img[:, 0][:-1].float() / 255.) ** 2).sum(dim=1, keepdim=True))) * flow_mask * (
                          warped_disp > 0) > 1e-2
    opw_err = torch.abs(warped_disp - x[:-1]) * delta_mask
    opw_err_30 = torch.abs(warped_disp - x[:-1]) * delta_mask * depth_mask_30
    opw_err_50 = torch.abs(warped_disp - x[:-1]) * delta_mask * depth_mask_50
    opw_err_100 = torch.abs(warped_disp - x[:-1]) * delta_mask * depth_mask_100

    opw = 0
    opw_30 = 0
    opw_50 = 0
    opw_100 = 0
    for i in range(t-1):
        if torch.sum(delta_mask[i]) > 0:
            opw += torch.sum(opw_err[i]) / torch.sum(delta_mask[i])
        if torch.sum(delta_mask[i] * depth_mask_30[i]) > 0:
            opw_30 += torch.sum(opw_err_30[i]) / torch.sum(delta_mask[i] * depth_mask_30[i])
        if torch.sum(delta_mask[i] * depth_mask_50[i]) > 0:
            opw_50 += torch.sum(opw_err_50[i]) / torch.sum(delta_mask[i] * depth_mask_50[i])
        if torch.sum(delta_mask[i] * depth_mask_100[i]) > 0:
            opw_100 += torch.sum(opw_err_100[i]) / torch.sum(delta_mask[i] * depth_mask_100[i])
    opw = opw / (t - 1)
    opw_30 = opw_30 / (t - 1)
    opw_50 = opw_50 / (t - 1)
    opw_100 = opw_100 / (t - 1)
    return opw, opw_30, opw_50, opw_100


def eval_RTC_sequence(
    img: torch.Tensor,
    x: torch.Tensor,
    y: torch.Tensor,
    mask: torch.Tensor,
    crop: int = 0,
    mask_thr: float = 0.5,
    clamp_thr: float = 1e-5,
) -> Dict[str, torch.Tensor]:

    assert len(x.shape) == len(y.shape) == len(mask.shape) == 4, (
        x.shape,
        y.shape,
        mask.shape,
    ) # T, 1, H, W
    assert x.shape[0] == y.shape[0] == mask.shape[0], (x.shape, y.shape, mask.shape)

    t, c, h, w = img[:, 0].shape
    # chuck out the border
    if crop > 0:
        if crop > min(y.shape[2:]) - crop:
            raise ValueError("Incorrect crop size.")
        y = y[:, :, crop:-crop, crop:-crop]
        x = x[:, :, crop:-crop, crop:-crop]
        mask = mask[:, :, crop:-crop, crop:-crop]

    y = y * (mask > mask_thr).float()
    x = x * (mask > mask_thr).float()
    y[torch.isnan(y)] = 0

    flow = compute_flow(img[:, 0].cuda()).cpu()
    delta_mask = mask[:-1] * mask[1:]

    warped_disp = flow_warp(x[1:], flow)
    warped_img = flow_warp(img[:, 0][1:], flow)

    flow_mask = torch.sum(flow > 250, 1, keepdim=True) == 0
    depth_mask = torch.sum(y > 30, 1, keepdim=True) == 0
    depth_mask = depth_mask[:-1] * depth_mask[1:]

    delta_mask = delta_mask * torch.exp(-50. * torch.sqrt(
        ((warped_img / 255. - img[:, 0][:-1] / 255.) ** 2).sum(dim=1, keepdim=True))) * flow_mask * (
                         warped_disp > 0) > 1e-2
    tau = 1.01

    x1 = x[:-1]  / warped_disp
    x2 = warped_disp / x[:-1]

    x1[torch.isinf(x1)] = -1e10
    x2[torch.isinf(x2)] = -1e10
    x = torch.max(torch.cat((x1, x2), 1), 1)[0] < tau

    rtc_err = x[:, None] * delta_mask
    rtc_err_30 = x[:, None] * delta_mask * depth_mask
    rtc = 0
    rtc_30 = 0
    for i in range(t-1):
        if torch.sum(delta_mask[i]) > 0:
            rtc += torch.sum(rtc_err[i]) / torch.sum(delta_mask[i])
        if torch.sum(delta_mask[i] * depth_mask[i]) > 0:
            rtc_30 += torch.sum(rtc_err_30[i]) / torch.sum(delta_mask[i] * depth_mask[i])
    rtc = rtc / (t-1)
    rtc_30 = rtc_30 / (t - 1)
    return rtc, rtc_30


def depth2disparity_scale(left_camera, right_camera, image_size_tensor):
    # # opencv camera matrices
    (_, T1, K1), (_, T2, _) = [
        opencv_from_cameras_projection(
            f,
            image_size_tensor,
        )
        for f in (left_camera, right_camera)
    ]
    fix_baseline = T1[0][0] - T2[0][0]
    focal_length_px = K1[0][0][0]
    # following this https://github.com/princeton-vl/RAFT-Stereo#converting-disparity-to-depth
    return focal_length_px * fix_baseline


def depth_to_pcd(
    depth_map,
    img,
    focal_length,
    cx,
    cy,
    step: int = None,
    inv_extrinsic=None,
    mask=None,
    filter=False,
):
    __, w, __ = img.shape
    if step is None:
        step = int(w / 100)
    Z = depth_map[::step, ::step]
    colors = img[::step, ::step, :]

    Pixels_Y = torch.arange(Z.shape[0]).to(Z.device) * step
    Pixels_X = torch.arange(Z.shape[1]).to(Z.device) * step

    X = (Pixels_X[None, :] - cx) * Z / focal_length
    Y = (Pixels_Y[:, None] - cy) * Z / focal_length

    inds = Z > 0

    if mask is not None:
        inds = inds * (mask[::step, ::step] > 0)

    X = X[inds].reshape(-1)
    Y = Y[inds].reshape(-1)
    Z = Z[inds].reshape(-1)
    colors = colors[inds]
    pcd = torch.stack([X, Y, Z]).T

    if inv_extrinsic is not None:
        pcd_ext = torch.vstack([pcd.T, torch.ones((1, pcd.shape[0])).to(Z.device)])
        pcd = (inv_extrinsic @ pcd_ext)[:3, :].T

    if filter:
        pcd, filt_inds = filter_outliers(pcd)
        colors = colors[filt_inds]
    return pcd, colors


def filter_outliers(pcd, sigma=3):
    mean = pcd.mean(0)
    std = pcd.std(0)
    inds = ((pcd - mean).abs() < sigma * std)[:, 2]
    pcd = pcd[inds]
    return pcd, inds


def eval_batch(batch_dict, predictions, scale) -> Dict[str, Union[float, torch.Tensor]]:
    """
    Produce performance metrics for a single batch of perception
    predictions.
    Args:
        frame_data: A PixarFrameData object containing the input to the new view
            synthesis method.
        preds: A PerceptionPrediction object with the predicted data.
    Returns:
        results: A dictionary holding evaluation metrics.
    """
    results = {}

    assert "disparity" in predictions
    mask_now = torch.ones_like(batch_dict["fg_mask"])

    mask_now = mask_now * batch_dict["disparity_mask"]

    eval_flow_traj_output = eval_endpoint_error_sequence(
        predictions["disparity"], batch_dict["disparity"], mask_now
    )
    for epe_name in ("epe", "temp_epe"):
        results[PerceptionMetric(f"disp_{epe_name}_mean")] = eval_flow_traj_output[
            f"{epe_name}_mean"
        ]

        results[PerceptionMetric(f"disp_{epe_name}_bad_3px")] = eval_flow_traj_output[
            f"{epe_name}_bad_3px"
        ]

        results[PerceptionMetric(f"disp_{epe_name}_bad_2px")] = eval_flow_traj_output[
            f"{epe_name}_bad_2px"
        ]

        results[PerceptionMetric(f"disp_{epe_name}_bad_1px")] = eval_flow_traj_output[
            f"{epe_name}_bad_1px"
        ]

        results[PerceptionMetric(f"disp_{epe_name}_bad_0.5px")] = eval_flow_traj_output[
            f"{epe_name}_bad_0.5px"
        ]
    if "endpoint_error_per_pixel" in eval_flow_traj_output:
        results["disp_endpoint_error_per_pixel"] = eval_flow_traj_output[
            "endpoint_error_per_pixel"
        ]

    # disparity to depth
    depth = scale / predictions["disparity"].clamp(min=1e-10)

    eval_TCC_output = eval_TCC_sequence(
        depth, scale / batch_dict["disparity"].clamp(min=1e-10), mask_now
    )
    results[PerceptionMetric("disp_TCC")] = eval_TCC_output[None]

    eval_TCM_output = eval_TCM_sequence(
        depth, scale / batch_dict["disparity"].clamp(min=1e-10), mask_now
    )
    results[PerceptionMetric("disp_TCM")] = eval_TCM_output[None]

    eval_OPW_output, eval_OPW_30_output, eval_OPW_50_output, eval_OPW_100_output = eval_OPW_sequence(
        batch_dict["stereo_video"], depth, scale / batch_dict["disparity"].clamp(min=1e-10), mask_now
    )
    results[PerceptionMetric("disp_OPW")] = eval_OPW_output[None]
    results[PerceptionMetric("disp_OPW_100")] = eval_OPW_100_output[None]
    results[PerceptionMetric("disp_OPW_50")] = eval_OPW_50_output[None]
    if eval_OPW_30_output > 0:
        results[PerceptionMetric("disp_OPW_30")] = eval_OPW_30_output[None]
    else:
        results[PerceptionMetric("disp_OPW_30")] = torch.tensor([0.0])

    eval_RTC_output, eval_RTC_30_output = eval_RTC_sequence(
        batch_dict["stereo_video"], depth, scale / batch_dict["disparity"].clamp(min=1e-10), mask_now
    )
    results[PerceptionMetric("disp_RTC")] = eval_RTC_output[None]
    if eval_RTC_30_output > 0:
        results[PerceptionMetric("disp_RTC_30")] = eval_RTC_30_output[None]
    else:
        results[PerceptionMetric("disp_RTC_30")] = torch.tensor([0.0])

    return (results, len(predictions["disparity"]))


================================================
FILE: evaluation/utils/ssim.py
================================================
# Copyright 2020 by Gongfan Fang, Zhejiang University.
# All rights reserved.
import warnings

import torch
import torch.nn.functional as F


def _fspecial_gauss_1d(size, sigma):
    r"""Create 1-D gauss kernel
    Args:
        size (int): the size of gauss kernel
        sigma (float): sigma of normal distribution
    Returns:
        torch.Tensor: 1D kernel (1 x 1 x size)
    """
    coords = torch.arange(size, dtype=torch.float)
    coords -= size // 2

    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g /= g.sum()

    return g.unsqueeze(0).unsqueeze(0)


def gaussian_filter(input, win):
    r""" Blur input with 1-D kernel
    Args:
        input (torch.Tensor): a batch of tensors to be blurred
        window (torch.Tensor): 1-D gauss kernel
    Returns:
        torch.Tensor: blurred tensors
    """
    assert all([ws == 1 for ws in win.shape[1:-1]]), win.shape
    if len(input.shape) == 4:
        conv = F.conv2d
    elif len(input.shape) == 5:
        conv = F.conv3d
    else:
        raise NotImplementedError(input.shape)

    C = input.shape[1]
    out = input
    for i, s in enumerate(input.shape[2:]):
        if s >= win.shape[-1]:
            out = conv(out, weight=win.transpose(2 + i, -1), stride=1, padding=0, groups=C)
        else:
            warnings.warn(
                f"Skipping Gaussian Smoothing at dimension 2+{i} for input: {input.shape} and win size: {win.shape[-1]}"
            )
    return out


def _ssim(X, Y, data_range, win, size_average=True, K=(0.01, 0.03)):

    r""" Calculate ssim index for X and Y
    Args:
        X (torch.Tensor): images
        Y (torch.Tensor): images
        win (torch.Tensor): 1-D gauss kernel
        data_range (float or int, optional): value range of input images. (usually 1.0 or 255)
        size_average (bool, optional): if size_average=True, ssim of all images will be averaged as a scalar
    Returns:
        torch.Tensor: ssim results.
    """
    K1, K2 = K
    # batch, channel, [depth,] height, width = X.shape
    compensation = 1.0

    C1 = (K1 * data_range) ** 2
    C2 = (K2 * data_range) ** 2

    win = win.to(X.device, dtype=X.dtype)

    mu1 = gaussian_filter(X, win)
    mu2 = gaussian_filter(Y, win)

    mu1_sq = mu1.pow(2)
    mu2_sq = mu2.pow(2)
    mu1_mu2 = mu1 * mu2

    sigma1_sq = compensation * (gaussian_filter(X * X, win) - mu1_sq)
    sigma2_sq = compensation * (gaussian_filter(Y * Y, win) - mu2_sq)
    sigma12 = compensation * (gaussian_filter(X * Y, win) - mu1_mu2)

    cs_map = (2 * sigma12 + C2) / (sigma1_sq + sigma2_sq + C2)  # set alpha=beta=gamma=1
    ssim_map = ((2 * mu1_mu2 + C1) / (mu1_sq + mu2_sq + C1)) * cs_map

    ssim_per_channel = torch.flatten(ssim_map, 2).mean(-1)
    cs = torch.flatten(cs_map, 2).mean(-1)
    return ssim_per_channel, cs


def ssim(
    X,
    Y,
    data_range=255,
    size_average=True,
    win_size=11,
    win_sigma=1.5,
    win=None,
    K=(0.01, 0.03),
    nonnegative_ssim=False,
):
    r""" interface of ssim
    Args:
        X (torch.Tensor): a batch of images, (N,C,H,W)
        Y (torch.Tensor): a batch of images, (N,C,H,W)
        data_range (float or int, optional): value range of input images. (usually 1.0 or 255)
        size_average (bool, optional): if size_average=True, ssim of all images will be averaged as a scalar
        win_size: (int, optional): the size of gauss kernel
        win_sigma: (float, optional): sigma of normal distribution
        win (torch.Tensor, optional): 1-D gauss kernel. if None, a new kernel will be created according to win_size and win_sigma
        K (list or tuple, optional): scalar constants (K1, K2). Try a larger K2 constant (e.g. 0.4) if you get a negative or NaN results.
        nonnegative_ssim (bool, optional): force the ssim response to be nonnegative with relu
    Returns:
        torch.Tensor: ssim results
    """
    if not X.shape == Y.shape:
        raise ValueError(f"Input images should have the same dimensions, but got {X.shape} and {Y.shape}.")

    for d in range(len(X.shape) - 1, 1, -1):
        X = X.squeeze(dim=d)
        Y = Y.squeeze(dim=d)

    if len(X.shape) not in (4, 5):
        raise ValueError(f"Input images should be 4-d or 5-d tensors, but got {X.shape}")

    if not X.type() == Y.type():
        raise ValueError(f"Input images should have the same dtype, but got {X.type()} and {Y.type()}.")

    if win is not None:  # set win_size
        win_size = win.shape[-1]

    if not (win_size % 2 == 1):
        raise ValueError("Window size should be odd.")

    if win is None:
        win = _fspecial_gauss_1d(win_size, win_sigma)
        win = win.repeat([X.shape[1]] + [1] * (len(X.shape) - 1))

    ssim_per_channel, cs = _ssim(X, Y, data_range=data_range, win=win, size_average=False, K=K)
    if nonnegative_ssim:
        ssim_per_channel = torch.relu(ssim_per_channel)

    if size_average:
        return ssim_per_channel.mean()
    else:
        return ssim_per_channel #.mean(1)


def ms_ssim(
    X, Y, data_range=255, size_average=True, win_size=11, win_sigma=1.5, win=None, weights=None, K=(0.01, 0.03)
):

    r""" interface of ms-ssim
    Args:
        X (torch.Tensor): a batch of images, (N,C,[T,]H,W)
        Y (torch.Tensor): a batch of images, (N,C,[T,]H,W)
        data_range (float or int, optional): value range of input images. (usually 1.0 or 255)
        size_average (bool, optional): if size_average=True, ssim of all images will be averaged as a scalar
        win_size: (int, optional): the size of gauss kernel
        win_sigma: (float, optional): sigma of normal distribution
        win (torch.Tensor, optional): 1-D gauss kernel. if None, a new kernel will be created according to win_size and win_sigma
        weights (list, optional): weights for different levels
        K (list or tuple, optional): scalar constants (K1, K2). Try a larger K2 constant (e.g. 0.4) if you get a negative or NaN results.
    Returns:
        torch.Tensor: ms-ssim results
    """
    if not X.shape == Y.shape:
        raise ValueError(f"Input images should have the same dimensions, but got {X.shape} and {Y.shape}.")

    for d in range(len(X.shape) - 1, 1, -1):
        X = X.squeeze(dim=d)
        Y = Y.squeeze(dim=d)

    if not X.type() == Y.type():
        raise ValueError(f"Input images should have the same dtype, but got {X.type()} and {Y.type()}.")

    if len(X.shape) == 4:
        avg_pool = F.avg_pool2d
    elif len(X.shape) == 5:
        avg_pool = F.avg_pool3d
    else:
        raise ValueError(f"Input images should be 4-d or 5-d tensors, but got {X.shape}")

    if win is not None:  # set win_size
        win_size = win.shape[-1]

    if not (win_size % 2 == 1):
        raise ValueError("Window size should be odd.")

    smaller_side = min(X.shape[-2:])
    assert smaller_side > (win_size - 1) * (
        2 ** 4
    ), "Image size should be larger than %d due to the 4 downsamplings in ms-ssim" % ((win_size - 1) * (2 ** 4))

    if weights is None:
        weights = [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]
    weights = X.new_tensor(weights)

    if win is None:
        win = _fspecial_gauss_1d(win_size, win_sigma)
        win = win.repeat([X.shape[1]] + [1] * (len(X.shape) - 1))

    levels = weights.shape[0]
    mcs = []
    for i in range(levels):
        ssim_per_channel, cs = _ssim(X, Y, win=win, data_range=data_range, size_average=False, K=K)

        if i < levels - 1:
            mcs.append(torch.relu(cs))
            padding = [s % 2 for s in X.shape[2:]]
            X = avg_pool(X, kernel_size=2, padding=padding)
            Y = avg_pool(Y, kernel_size=2, padding=padding)

    ssim_per_channel = torch.relu(ssim_per_channel)  # (batch, channel)
    mcs_and_ssim = torch.stack(mcs + [ssim_per_channel], dim=0)  # (level, batch, channel)
    ms_ssim_val = torch.prod(mcs_and_ssim ** weights.view(-1, 1, 1), dim=0)

    if size_average:
        return ms_ssim_val.mean()
    else:
        return ms_ssim_val.mean(1)


class SSIM(torch.nn.Module):
    def __init__(
        self,
        data_range=255,
        size_average=True,
        win_size=11,
        win_sigma=1.5,
        channel=3,
        spatial_dims=2,
        K=(0.01, 0.03),
        nonnegative_ssim=False,
    ):
        r""" class for ssim
        Args:
            data_range (float or int, optional): value range of input images. (usually 1.0 or 255)
            size_average (bool, optional): if size_average=True, ssim of all images will be averaged as a scalar
            win_size: (int, optional): the size of gauss kernel
            win_sigma: (float, optional): sigma of normal distribution
            channel (int, optional): input channels (default: 3)
            K (list or tuple, optional): scalar constants (K1, K2). Try a larger K2 constant (e.g. 0.4) if you get a negative or NaN results.
            nonnegative_ssim (bool, optional): force the ssim response to be nonnegative with relu.
        """

        super(SSIM, self).__init__()
        self.win_size = win_size
        self.win = _fspecial_gauss_1d(win_size, win_sigma).repeat([channel, 1] + [1] * spatial_dims)
        self.size_average = size_average
        self.data_range = data_range
        self.K = K
        self.nonnegative_ssim = nonnegative_ssim

    def forward(self, X, Y):
        return ssim(
            X,
            Y,
            data_range=self.data_range,
            size_average=self.size_average,
            win=self.win,
            K=self.K,
            nonnegative_ssim=self.nonnegative_ssim,
        )


class MS_SSIM(torch.nn.Module):
    def __init__(
        self,
        data_range=255,
        size_average=True,
        win_size=11,
        win_sigma=1.5,
        channel=3,
        spatial_dims=2,
        weights=None,
        K=(0.01, 0.03),
    ):
        r""" class for ms-ssim
        Args:
            data_range (float or int, optional): value range of input images. (usually 1.0 or 255)
            size_average (bool, optional): if size_average=True, ssim of all images will be averaged as a scalar
            win_size: (int, optional): the size of gauss kernel
            win_sigma: (float, optional): sigma of normal distribution
            channel (int, optional): input channels (default: 3)
            weights (list, optional): weights for different levels
            K (list or tuple, optional): scalar constants (K1, K2). Try a larger K2 constant (e.g. 0.4) if you get a negative or NaN results.
        """

        super(MS_SSIM, self).__init__()
        self.win_size = win_size
        self.win = _fspecial_gauss_1d(win_size, win_sigma).repeat([channel, 1] + [1] * spatial_dims)
        self.size_average = size_average
        self.data_range = data_range
        self.weights = weights
        self.K = K

    def forward(self, X, Y):
        return ms_ssim(
            X,
            Y,
            data_range=self.data_range,
            size_average=self.size_average,
            win=self.win,
            weights=self.weights,
            K=self.K,
        )


================================================
FILE: evaluation/utils/utils.py
================================================
from collections import defaultdict
import configparser
import os
import math
from typing import Optional, List
import torch
import cv2
import numpy as np
from dataclasses import dataclass
from tabulate import tabulate
import logging

from pytorch3d.structures import Pointclouds
from pytorch3d.transforms import RotateAxisAngle
from pytorch3d.utils import (
    opencv_from_cameras_projection,
)
from pytorch3d.renderer import (
    AlphaCompositor,
    PointsRasterizationSettings,
    PointsRasterizer,
    PointsRenderer,
)
from stereoanyvideo.evaluation.utils.eval_utils import depth_to_pcd


@dataclass
class PerceptionPrediction:
    """
    Holds the tensors that describe a result of any perception module.
    """

    depth_map: Optional[torch.Tensor] = None
    disparity: Optional[torch.Tensor] = None
    image_rgb: Optional[torch.Tensor] = None
    fg_probability: Optional[torch.Tensor] = None


def aggregate_eval_results(per_batch_eval_results, reduction="mean"):

    total_length = 0
    aggregate_results = defaultdict(list)
    for result in per_batch_eval_results:
        if isinstance(result, tuple):
            reduction = "sum"
            length = result[1]
            total_length += length
            result = result[0]
        for metric, val in result.items():
            if reduction == "sum":
                aggregate_results[metric].append(val * length)

    if reduction == "mean":
        return {k: torch.cat(v).mean().item() for k, v in aggregate_results.items()}
    elif reduction == "sum":
        return {
            k: torch.cat(v).sum().item() / float(total_length)
            for k, v in aggregate_results.items()
        }


def aggregate_and_print_results(
    per_batch_eval_results: List[dict],
):
    print("")
    result = aggregate_eval_results(
        per_batch_eval_results,
    )
    pretty_print_perception_metrics(result)
    result = {str(k): v for k, v in result.items()}

    print("")
    return result


def pretty_print_perception_metrics(results):

    metrics = sorted(list(results.keys()), key=lambda x: x.metric)

    print("===== Perception results =====")
    print(
        tabulate(
            [[metric, results[metric]] for metric in metrics],
        )
    )
    logging.info("===== Perception results =====")
    logging.info(tabulate(
            [[metric, results[metric]] for metric in metrics],
        ))


def read_calibration(calibration_file, resolution_string):
    # ported from https://github.com/stereolabs/zed-open-capture/
    # blob/dfa0aee51ccd2297782230a05ca59e697df496b2/examples/include/calibration.hpp#L4172

    zed_resolutions = {
        "2K": (1242, 2208),
        "FHD": (1080, 1920),
        "HD": (720, 1280),
        # "qHD": (540, 960),
        "VGA": (376, 672),
    }
    assert resolution_string in zed_resolutions.keys()
    image_height, image_width = zed_resolutions[resolution_string]

    # Open camera configuration file
    assert os.path.isfile(calibration_file)
    calib = configparser.ConfigParser()
    calib.read(calibration_file)

    # Get translations
    T = np.zeros((3, 1))
    T[0] = float(calib["STEREO"]["baseline"])
    T[1] = float(calib["STEREO"]["ty"])
    T[2] = float(calib["STEREO"]["tz"])

    baseline = T[0]

    # Get left parameters
    left_cam_cx = float(calib[f"LEFT_CAM_{resolution_string}"]["cx"])
    left_cam_cy = float(calib[f"LEFT_CAM_{resolution_string}"]["cy"])
    left_cam_fx = float(calib[f"LEFT_CAM_{resolution_string}"]["fx"])
    left_cam_fy = float(calib[f"LEFT_CAM_{resolution_string}"]["fy"])
    left_cam_k1 = float(calib[f"LEFT_CAM_{resolution_string}"]["k1"])
    left_cam_k2 = float(calib[f"LEFT_CAM_{resolution_string}"]["k2"])
    left_cam_p1 = float(calib[f"LEFT_CAM_{resolution_string}"]["p1"])
    left_cam_p2 = float(calib[f"LEFT_CAM_{resolution_string}"]["p2"])
    left_cam_k3 = float(calib[f"LEFT_CAM_{resolution_string}"]["k3"])

    # Get right parameters
    right_cam_cx = float(calib[f"RIGHT_CAM_{resolution_string}"]["cx"])
    right_cam_cy = float(calib[f"RIGHT_CAM_{resolution_string}"]["cy"])
    right_cam_fx = float(calib[f"RIGHT_CAM_{resolution_string}"]["fx"])
    right_cam_fy = float(calib[f"RIGHT_CAM_{resolution_string}"]["fy"])
    right_cam_k1 = float(calib[f"RIGHT_CAM_{resolution_string}"]["k1"])
    right_cam_k2 = float(calib[f"RIGHT_CAM_{resolution_string}"]["k2"])
    right_cam_p1 = float(calib[f"RIGHT_CAM_{resolution_string}"]["p1"])
    right_cam_p2 = float(calib[f"RIGHT_CAM_{resolution_string}"]["p2"])
    right_cam_k3 = float(calib[f"RIGHT_CAM_{resolution_string}"]["k3"])

    # Get rotations
    R_zed = np.zeros(3)
    R_zed[0] = float(calib["STEREO"][f"rx_{resolution_string.lower()}"])
    R_zed[1] = float(calib["STEREO"][f"cv_{resolution_string.lower()}"])
    R_zed[2] = float(calib["STEREO"][f"rz_{resolution_string.lower()}"])

    R = cv2.Rodrigues(R_zed)[0]

    # Left
    cameraMatrix_left = np.array(
        [[left_cam_fx, 0, left_cam_cx], [0, left_cam_fy, left_cam_cy], [0, 0, 1]]
    )
    distCoeffs_left = np.array(
        [left_cam_k1, left_cam_k2, left_cam_p1, left_cam_p2, left_cam_k3]
    )

    # Right
    cameraMatrix_right = np.array(
        [
            [right_cam_fx, 0, right_cam_cx],
            [0, right_cam_fy, right_cam_cy],
            [0, 0, 1],
        ]
    )
    distCoeffs_right = np.array(
        [right_cam_k1, right_cam_k2, right_cam_p1, right_cam_p2, right_cam_k3]
    )

    # Stereo
    R1, R2, P1, P2, Q = cv2.stereoRectify(
        cameraMatrix1=cameraMatrix_left,
        distCoeffs1=distCoeffs_left,
        cameraMatrix2=cameraMatrix_right,
        distCoeffs2=distCoeffs_right,
        imageSize=(image_width, image_height),
        R=R,
        T=T,
        flags=cv2.CALIB_ZERO_DISPARITY,
        newImageSize=(image_width, image_height),
        alpha=0,
    )[:5]

    # Precompute maps for cv::remap()
    map_left_x, map_left_y = cv2.initUndistortRectifyMap(
        cameraMatrix_left,
        distCoeffs_left,
        R1,
        P1,
        (image_width, image_height),
        cv2.CV_32FC1,
    )
    map_right_x, map_right_y = cv2.initUndistortRectifyMap(
        cameraMatrix_right,
        distCoeffs_right,
        R2,
        P2,
        (image_width, image_height),
        cv2.CV_32FC1,
    )

    zed_calib = {
        "map_left_x": map_left_x,
        "map_left_y": map_left_y,
        "map_right_x": map_right_x,
        "map_right_y": map_right_y,
        "pose_left": P1,
        "pose_right": P2,
        "baseline": baseline,
        "image_width": image_width,
        "image_height": image_height,
    }

    return zed_calib


def filter_depth_discontinuities(pcd, depth_map, threshold=5):
    """
    Removes points that belong to high-depth discontinuity regions.

    Args:
        pcd (torch.Tensor): Nx3 point cloud tensor.
        depth_map (torch.Tensor): HxW depth map.
        threshold (float): Depth change threshold.

    Returns:
        torch.Tensor: Filtered point cloud.
    """
    # Compute depth differences in x and y directions
    depth_diff_x = torch.abs(depth_map[:, 1:] - depth_map[:, :-1])  # Shape (H, W-1)
    depth_diff_y = torch.abs(depth_map[1:, :] - depth_map[:-1, :])  # Shape (H-1, W)

    # Initialize mask with all True (valid points)
    mask = torch.ones_like(depth_map, dtype=torch.bool)  # Shape (H, W)

    # Apply filtering: set False where depth difference is too large
    mask[:, :-1] &= depth_diff_x <= threshold  # X-direction filtering
    mask[:-1, :] &= depth_diff_y <= threshold  # Y-direction filtering

    # Flatten mask to match point cloud size
    mask_flat = mask.flatten()[: pcd.shape[0]]

    return pcd[mask_flat]  # Return only valid points



def visualize_batch(
    batch_dict: dict,
    preds: PerceptionPrediction,
    output_dir: str,
    ref_frame: int = 0,
    only_foreground=False,
    step=0,
    sequence_name=None,
    writer=None,
    render_bin_size=None
):
    os.makedirs(output_dir, exist_ok=True)

    outputs = {}

    if preds.depth_map is not None:
        device = preds.depth_map.device

        pcd_global_seq = []
        H, W = batch_dict["stereo_video"].shape[3:]

        for i in range(len(batch_dict["stereo_video"])):
            if hasattr(preds, 'perspective_cameras'):
                R, T, K = opencv_from_cameras_projection(
                    preds.perspective_cameras[i],
                    torch.tensor([H, W])[None].to(device),
                )  # 1x3x3, 1x3, 1x3x3
            else:
                raise KeyError(f"R T K not found!")
            extrinsic_3x4_0 = torch.cat([R[0], T[0, :, None]], dim=1)
            extr_matrix = torch.cat(
                [
                    extrinsic_3x4_0,
                    torch.Tensor([[0, 0, 0, 1]]).to(extrinsic_3x4_0.device),
                ],
                dim=0,
            )
            inv_extr_matrix = extr_matrix.inverse().to(device)
            pcd, colors = depth_to_pcd(
                preds.depth_map[i, 0],
                batch_dict["stereo_video"][i][0].permute(1, 2, 0),
                K[0][0][0],
                K[0][0][2],
                K[0][1][2],
                step=1,
                inv_extrinsic=inv_extr_matrix,
                mask=batch_dict["fg_mask"][i, 0] if only_foreground else None,
                filter=False,
            )
            R, T = inv_extr_matrix[None, :3, :3], inv_extr_matrix[None, :3, 3]
            pcd_global_seq.append((pcd, colors, (R, T, preds.perspective_cameras[i])))

        raster_settings = PointsRasterizationSettings(
            image_size=[H, W],
            radius=0.003,
            points_per_pixel=10,
        )
        R, T, cam_ = pcd_global_seq[ref_frame][2]
        median_depth = preds.depth_map.median()
        cam_.cuda()

        for mode in ["angle_15", "angle_-15", "angle_0", "changing_angle"]:
            res = []
            for t, (pcd, color, __) in enumerate(pcd_global_seq):

                if mode == "changing_angle":
                    angle = math.cos((math.pi) * (t / 60)) * 15
                elif mode == "angle_15":
                    angle = 15
                elif mode == "angle_-15":
                    angle = -15
                elif mode == "angle_0":
                    angle = 0

                delta_x = median_depth * math.sin(math.radians(angle))
                delta_z = median_depth * (1 - math.cos(math.radians(angle)))

                cam = cam_.clone()
                cam.R = torch.bmm(
                    cam.R,
                    RotateAxisAngle(angle=angle, axis="Y", device=device).get_matrix()[
                        :, :3, :3
                    ],
                )
                cam.T[0, 0] = cam.T[0, 0] - delta_x
                cam.T[0, 2] = cam.T[0, 2] - delta_z + median_depth / 2.0

                rasterizer = PointsRasterizer(
                    cameras=cam, raster_settings=raster_settings
                )
                renderer = PointsRenderer(
                    rasterizer=rasterizer,
                    compositor=AlphaCompositor(background_color=(1, 1, 1)),
                )
                pcd_copy = pcd.clone()
                point_cloud = Pointclouds(points=[pcd_copy], features=[color / 255.0])

                images = renderer(point_cloud)
                res.append(images[0, ..., :3].cpu())
            res = torch.stack(res)
            video = (res * 255).numpy().astype(np.uint8)
            save_name = f"{sequence_name}_reconstruction_{step}_mode_{mode}_"

            if writer is None:
                outputs[mode] = video
            if only_foreground:
                save_name += "fg_only"
            else:
                save_name += "full_scene"
            video_out = cv2.VideoWriter(
                os.path.join(
                    output_dir,
                    f"{save_name}.mp4",
                ),
                cv2.VideoWriter_fourcc(*"mp4v"),
                fps=30,
                frameSize=(res.shape[2], res.shape[1]),
                isColor=True,
            )
            filename = os.path.join(output_dir, sequence_name + '_img_')
            if not os.path.isdir(filename + str(mode)):
                os.mkdir(filename + str(mode))
            for i in range(len(video)):
                filename_temp = filename + str(mode) + '/' + str(i).zfill(3) + '.png'
                cv2.imwrite(filename_temp, cv2.cvtColor(video[i], cv2.COLOR_BGR2RGB))
                video_out.write(cv2.cvtColor(video[i], cv2.COLOR_BGR2RGB))
            video_out.release()

            if writer is not None:
                writer.add_video(
                    f"{sequence_name}_reconstruction_mode_{mode}",
                    (res * 255).permute(0, 3, 1, 2).to(torch.uint8)[None],
                    global_step=step,
                    fps=30,
                )

    return outputs


================================================
FILE: models/Video-Depth-Anything/app.py
================================================
# Copyright (2025) Bytedance Ltd. and/or its affiliates 

# Licensed under the Apache License, Version 2.0 (the "License"); 
# you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at 

#     http://www.apache.org/licenses/LICENSE-2.0 

# Unless required by applicable law or agreed to in writing, software 
# distributed under the License is distributed on an "AS IS" BASIS, 
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
# See the License for the specific language governing permissions and 
# limitations under the License. 
import gradio as gr


import numpy as np
import os
import torch

from video_depth_anything.video_depth import VideoDepthAnything
from utils.dc_utils import read_video_frames, vis_sequence_depth, save_video

examples = [
    ['assets/example_videos/davis_rollercoaster.mp4'],
]

model_configs = {
    'vits': {'encoder': 'vits', 'features': 64, 'out_channels': [48, 96, 192, 384]},
    'vitl': {'encoder': 'vitl', 'features': 256, 'out_channels': [256, 512, 1024, 1024]},
}

encoder='vitl'

video_depth_anything = VideoDepthAnything(**model_configs[encoder])
video_depth_anything.load_state_dict(torch.load(f'./checkpoints/video_depth_anything_{encoder}.pth', map_location='cpu'), strict=True)
video_depth_anything = video_depth_anything.to('cuda').eval()


def infer_video_depth(
    input_video: str,
    max_len: int = -1,
    target_fps: int = -1,
    max_res: int = 1280,
    output_dir: str = './outputs',
    input_size: int = 518,
):
    frames, target_fps = read_video_frames(input_video, max_len, target_fps, max_res)
    depth_list, fps = video_depth_anything.infer_video_depth(frames, target_fps, input_size=input_size, device='cuda')
    depth_list = np.stack(depth_list, axis=0)
    vis = vis_sequence_depth(depth_list)
    video_name = os.path.basename(input_video)
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    processed_video_path = os.path.join(output_dir, os.path.splitext(video_name)[0]+'_src.mp4')
    depth_vis_path = os.path.join(output_dir, os.path.splitext(video_name)[0]+'_vis.mp4')
    save_video(frames, processed_video_path, fps=fps)
    save_video(vis, depth_vis_path, fps=fps)

    return [processed_video_path, depth_vis_path]


def construct_demo():
    with gr.Blocks(analytics_enabled=False) as demo:
        gr.Markdown(
            f"""
            blablabla
            """
        )

        with gr.Row(equal_height=True):
            with gr.Column(scale=1):
                input_video = gr.Video(label="Input Video")

            # with gr.Tab(label="Output"):
            with gr.Column(scale=2):
                with gr.Row(equal_height=True):
                    processed_video = gr.Video(
                        label="Preprocessed video",
                        interactive=False,
                        autoplay=True,
                        loop=True,
                        show_share_button=True,
                        scale=5,
                    )
                    depth_vis_video = gr.Video(
                        label="Generated Depth Video",
                        interactive=False,
                        autoplay=True,
                        loop=True,
                        show_share_button=True,
                        scale=5,
                    )

        with gr.Row(equal_height=True):
            with gr.Column(scale=1):
                with gr.Row(equal_height=False):
                    with gr.Accordion("Advanced Settings", open=False):
                        max_len = gr.Slider(
                            label="max process length",
                            minimum=-1,
                            maximum=1000,
                            value=-1,
                            step=1,
                        )
                        target_fps = gr.Slider(
                            label="target FPS",
                            minimum=-1,
                            maximum=30,
                            value=15,
                            step=1,
                        )
                        max_res = gr.Slider(
                            label="max side resolution",
                            minimum=480,
                            maximum=1920,
                            value=1280,
                            step=1,
                        )
                    generate_btn = gr.Button("Generate")
            with gr.Column(scale=2):
                pass

        gr.Examples(
            examples=examples,
            inputs=[
                input_video,
                max_len,
                target_fps,
                max_res
            ],
            outputs=[processed_video, depth_vis_video],
            fn=infer_video_depth,
            cache_examples="lazy",
        )

        generate_btn.click(
            fn=infer_video_depth,
            inputs=[
                input_video,
                max_len,
                target_fps,
                max_res
            ],
            outputs=[processed_video, depth_vis_video],
        )

    return demo

if __name__ == "__main__":
    demo = construct_demo()
    demo.queue()
    demo.launch(server_name="0.0.0.0")

================================================
FILE: models/Video-Depth-Anything/get_weights.sh
================================================
#!/bin/bash

mkdir checkpoints
cd checkpoints
wget https://huggingface.co/depth-anything/Video-Depth-Anything-Small/resolve/main/video_depth_anything_vits.pth
wget https://huggingface.co/depth-anything/Video-Depth-Anything-Large/resolve/main/video_depth_anything_vitl.pth

================================================
FILE: models/Video-Depth-Anything/run.py
================================================
# Copyright (2025) Bytedance Ltd. and/or its affiliates 

# Licensed under the Apache License, Version 2.0 (the "License"); 
# you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at 

#     http://www.apache.org/licenses/LICENSE-2.0 

# Unless required by applicable law or agreed to in writing, software 
# distributed under the License is distributed on an "AS IS" BASIS, 
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
# See the License for the specific language governing permissions and 
# limitations under the License. 
import argparse
import numpy as np
import os
import torch

from video_depth_anything.video_depth import VideoDepthAnything
from utils.dc_utils import read_video_frames, vis_sequence_depth, save_video

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Video Depth Anything')
    parser.add_argument('--input_video', type=str, default='./assets/example_videos/davis_rollercoaster.mp4')
    parser.add_argument('--output_dir', type=str, default='./outputs')
    parser.add_argument('--input_size', type=int, default=518)
    parser.add_argument('--max_res', type=int, default=1280)
    parser.add_argument('--encoder', type=str, default='vitl', choices=['vits', 'vitl'])
    parser.add_argument('--max_len', type=int, default=-1, help='maximum length of the input video, -1 means no limit')
    parser.add_argument('--target_fps', type=int, default=-1, help='target fps of the input video, -1 means the original fps')

    args = parser.parse_args()

    DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

    model_configs = {
        'vits': {'encoder': 'vits', 'features': 64, 'out_channels': [48, 96, 192, 384]},
        'vitl': {'encoder': 'vitl', 'features': 256, 'out_channels': [256, 512, 1024, 1024]},
    }

    video_depth_anything = VideoDepthAnything(**model_configs[args.encoder])
    video_depth_anything.load_state_dict(torch.load(f'./checkpoints/video_depth_anything_{args.encoder}.pth', map_location='cpu'), strict=True)
    video_depth_anything = video_depth_anything.to(DEVICE).eval()

    frames, target_fps = read_video_frames(args.input_video, args.max_len, args.target_fps, args.max_res)
    depth_list, fps = video_depth_anything.infer_video_depth(frames, target_fps, input_size=args.input_size, device=DEVICE)
    depth_list = np.stack(depth_list, axis=0)
    vis = vis_sequence_depth(depth_list)
    video_name = os.path.basename(args.input_video)
    if not os.path.exists(args.output_dir):
        os.makedirs(args.output_dir)

    processed_video_path = os.path.join(args.output_dir, os.path.splitext(video_name)[0]+'_src.mp4')
    depth_vis_path = os.path.join(args.output_dir, os.path.splitext(video_name)[0]+'_vis.mp4')
    save_video(frames, processed_video_path, fps=fps)
    save_video(vis, depth_vis_path, fps=fps)

    




================================================
FILE: models/Video-Depth-Anything/utils/dc_utils.py
================================================
# This file is originally from DepthCrafter/depthcrafter/utils.py at main · Tencent/DepthCrafter
# SPDX-License-Identifier: MIT License license
#
# This file may have been modified by ByteDance Ltd. and/or its affiliates on [date of modification]
# Original file is released under [ MIT License license], with the full license text available at [https://github.com/Tencent/DepthCrafter?tab=License-1-ov-file].
from typing import Union, List
import tempfile
import numpy as np
import PIL.Image
import matplotlib.cm as cm
import mediapy
import torch
try:
    from decord import VideoReader, cpu
    DECORD_AVAILABLE = True
except:
    import cv2
    DECORD_AVAILABLE = False


def read_video_frames(video_path, process_length, target_fps=-1, max_res=-1, dataset="open"):
    if DECORD_AVAILABLE:
        vid = VideoReader(video_path, ctx=cpu(0))
        original_height, original_width = vid.get_batch([0]).shape[1:3]
        height = original_height
        width = original_width
        if max_res > 0 and max(height, width) > max_res:
            scale = max_res / max(original_height, original_width)
            height = round(original_height * scale)
            width = round(original_width * scale)

        vid = VideoReader(video_path, ctx=cpu(0), width=width, height=height)

        fps = vid.get_avg_fps() if target_fps == -1 else target_fps
        stride = round(vid.get_avg_fps() / fps)
        stride = max(stride, 1)
        frames_idx = list(range(0, len(vid), stride))
        if process_length != -1 and process_length < len(frames_idx):
            frames_idx = frames_idx[:process_length]
        frames = vid.get_batch(frames_idx).asnumpy()
    else:
        cap = cv2.VideoCapture(video_path)
        original_fps = cap.get(cv2.CAP_PROP_FPS)
        original_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        original_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))

        if max_res > 0 and max(original_height, original_width) > max_res:
            scale = max_res / max(original_height, original_width)
            height = round(original_height * scale)
            width = round(original_width * scale)

        fps = original_fps if target_fps < 0 else target_fps

        stride = max(round(original_fps / fps), 1)

        frames = []
        frame_count = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret or (process_length > 0 and frame_count >= process_length):
                break
            if frame_count % stride == 0:
                frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # Convert BGR to RGB
                if max_res > 0 and max(original_height, original_width) > max_res:
                    frame = cv2.resize(frame, (width, height))  # Resize frame
                frames.append(frame)
            frame_count += 1
        cap.release()
        frames = np.stack(frames, axis=0)

    return frames, fps


def save_video(
    video_frames: Union[List[np.ndarray], List[PIL.Image.Image]],
    output_video_path: str = None,
    fps: int = 10,
    crf: int = 18,
) -> str:
    if output_video_path is None:
        output_video_path = tempfile.NamedTemporaryFile(suffix=".mp4").name

    if isinstance(video_frames[0], np.ndarray):
        video_frames = [frame.astype(np.uint8) for frame in video_frames]

    elif isinstance(video_frames[0], PIL.Image.Image):
        video_frames = [np.array(frame) for frame in video_frames]
    mediapy.write_video(output_video_path, video_frames, fps=fps, crf=crf)
    return output_video_path


class ColorMapper:
    # a color mapper to map depth values to a certain colormap
    def __init__(self, colormap: str = "inferno"):
        self.colormap = torch.tensor(cm.get_cmap(colormap).colors)

    def apply(self, image: torch.Tensor, v_min=None, v_max=None):
        # assert len(image.shape) == 2
        if v_min is None:
            v_min = image.min()
        if v_max is None:
            v_max = image.max()
        image = (image - v_min) / (v_max - v_min)
        image = (image * 255).long()
        image = self.colormap[image] * 255
        return image


def vis_sequence_depth(depths: np.ndarray, v_min=None, v_max=None):
    visualizer = ColorMapper()
    if v_min is None:
        v_min = depths.min()
    if v_max is None:
        v_max = depths.max()
    res = visualizer.apply(torch.tensor(depths), v_min=v_min, v_max=v_max).numpy()
    return res


================================================
FILE: models/Video-Depth-Anything/utils/util.py
================================================
# Copyright (2025) Bytedance Ltd. and/or its affiliates 

# Licensed under the Apache License, Version 2.0 (the "License"); 
# you may not use this file except in compliance with the License. 
# You may obtain a copy of the License at 

#     http://www.apache.org/licenses/LICENSE-2.0 

# Unless required by applicable law or agreed to in writing, software 
# distributed under the License is distributed on an "AS IS" BASIS, 
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
# See the License for the specific language governing permissions and 
# limitations under the License. 
import numpy as np

def compute_scale_and_shift(prediction, target, mask, scale_only=False):
    if scale_only:
        return compute_scale(prediction, target, mask), 0
    else:
        return compute_scale_and_shift_full(prediction, target, mask)


def compute_scale(prediction, target, mask):
    # system matrix: A = [[a_00, a_01], [a_10, a_11]]
    prediction = prediction.astype(np.float32)
    target = target.astype(np.float32)
    mask = mask.astype(np.float32)

    a_00 = np.sum(mask * prediction * prediction)
    a_01 = np.sum(mask * prediction)
    a_11 = np.sum(mask)

    # right hand side: b = [b_0, b_1]
    b_0 = np.sum(mask * prediction * target)

    x_0 = b_0 / (a_00 + 1e-6)

    return x_0

def compute_scale_and_shift_full(prediction, target, mask):
    # system matrix: A = [[a_00, a_01], [a_10, a_11]]
    prediction = prediction.astype(np.float32)
    target = target.astype(np.float32)
    mask = mask.astype(np.float32)

    a_00 = np.sum(mask * prediction * prediction)
    a_01 = np.sum(mask * prediction)
    a_11 = np.sum(mask)

    b_0 = np.sum(mask * prediction * target)
    b_1 = np.sum(mask * target)

    x_0 = 1
    x_1 = 0

    det = a_00 * a_11 - a_01 * a_01

    if det != 0:
        x_0 = (a_11 * b_0 - a_01 * b_1) / det
        x_1 = (-a_01 * b_0 + a_00 * b_1) / det

    return x_0, x_1


def get_interpolate_frames(frame_list_pre, frame_list_post):
    assert len(frame_list_pre) == len(frame_list_post)
    min_w = 0.0
    max_w = 1.0
    step = (max_w - min_w) / (len(frame_list_pre)-1)
    post_w_list = [min_w] + [i * step for i in range(1,len(frame_list_pre)-1)] + [max_w]
    interpolated_frames = []
    for i in range(len(frame_list_pre)):
        interpolated_frames.append(frame_list_pre[i] * (1-post_w_list[i]) + frame_list_post[i] * post_w_list[i])
    return interpolated_frames

================================================
FILE: models/Video-Depth-Anything/video_depth_anything/dinov2.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the Apache License, Version 2.0
# found in the LICENSE file in the root directory of this source tree.

# References:
#   https://github.com/facebookresearch/dino/blob/main/vision_transformer.py
#   https://github.com/rwightman/pytorch-image-models/tree/master/timm/models/vision_transformer.py

from functools import partial
import math
import logging
from typing import Sequence, Tuple, Union, Callable

import torch
import torch.nn as nn
import torch.utils.checkpoint
from torch.nn.init import trunc_normal_

from .dinov2_layers import Mlp, PatchEmbed, SwiGLUFFNFused, MemEffAttention, NestedTensorBlock as Block


logger = logging.getLogger("dinov2")


def named_apply(fn: Callable, module: nn.Module, name="", depth_first=True, include_root=False) -> nn.Module:
    if not depth_first and include_root:
        fn(module=module, name=name)
    for child_name, child_module in module.named_children():
        child_name = ".".join((name, child_name)) if name else child_name
        named_apply(fn=fn, module=child_module, name=child_name, depth_first=depth_first, include_root=True)
    if depth_first and include_root:
        fn(module=module, name=name)
    return module


class BlockChunk(nn.ModuleList):
    def forward(self, x):
        for b in self:
            x = b(x)
        return x


class DinoVisionTransformer(nn.Module):
    def __init__(
        self,
        img_size=224,
        patch_size=16,
        in_chans=3,
        embed_dim=768,
        depth=12,
        num_heads=12,
        mlp_ratio=4.0,
        qkv_bias=True,
        ffn_bias=True,
        proj_bias=True,
        drop_path_rate=0.0,
        drop_path_uniform=False,
        init_values=None,  # for layerscale: None or 0 => no layerscale
        embed_layer=PatchEmbed,
        act_layer=nn.GELU,
        block_fn=Block,
        ffn_layer="mlp",
        block_chunks=1,
        num_register_tokens=0,
        interpolate_antialias=False,
        interpolate_offset=0.1,
    ):
        """
        Args:
            img_size (int, tuple): input image size
            patch_size (int, tuple): patch size
            in_chans (int): number of input channels
            embed_dim (int): embedding dimension
            depth (int): depth of transformer
            num_heads (int): number of attention heads
            mlp_ratio (int): ratio of mlp hidden dim to embedding dim
            qkv_bias (bool): enable bias for qkv if True
            proj_bias (bool): enable bias for proj in attn if True
            ffn_bias (bool): enable bias for ffn if True
            drop_path_rate (float): stochastic depth rate
            drop_path_uniform (bool): apply uniform drop rate across blocks
            weight_init (str): weight init scheme
            init_values (float): layer-scale init values
            embed_layer (nn.Module): patch embedding layer
            act_layer (nn.Module): MLP activation layer
            block_fn (nn.Module): transformer block class
            ffn_layer (str): "mlp", "swiglu", "swiglufused" or "identity"
            block_chunks: (int) split block sequence into block_chunks units for FSDP wrap
            num_register_tokens: (int) number of extra cls tokens (so-called "registers")
            interpolate_antialias: (str) flag to apply anti-aliasing when interpolating positional embeddings
            interpolate_offset: (float) work-around offset to apply when interpolating positional embeddings
        """
        super().__init__()
        norm_layer = partial(nn.LayerNorm, eps=1e-6)

        self.num_features = self.embed_dim = embed_dim  # num_features for consistency with other models
        self.num_tokens = 1
        self.n_blocks = depth
        self.num_heads = num_heads
        self.patch_size = patch_size
        self.num_register_tokens = num_register_tokens
        self.interpolate_antialias = interpolate_antialias
        self.interpolate_offset = interpolate_offset

        self.patch_embed = embed_layer(img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim)
        num_patches = self.patch_embed.num_patches

        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + self.num_tokens, embed_dim))
        assert num_register_tokens >= 0
        self.register_tokens = (
            nn.Parameter(torch.zeros(1, num_register_tokens, embed_dim)) if num_register_tokens else None
        )

        if drop_path_uniform is True:
            dpr = [drop_path_rate] * depth
        else:
            dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]  # stochastic depth decay rule

        if ffn_layer == "mlp":
            logger.info("using MLP layer as FFN")
            ffn_layer = Mlp
        elif ffn_layer == "swiglufused" or ffn_layer == "swiglu":
            logger.info("using SwiGLU layer as FFN")
            ffn_layer = SwiGLUFFNFused
        elif ffn_layer == "identity":
            logger.info("using Identity layer as FFN")

            def f(*args, **kwargs):
                return nn.Identity()

            ffn_layer = f
        else:
            raise NotImplementedError

        blocks_list = [
            block_fn(
                dim=embed_dim,
                num_heads=num_heads,
                mlp_ratio=mlp_ratio,
                qkv_bias=qkv_bias,
                proj_bias=proj_bias,
                ffn_bias=ffn_bias,
                drop_path=dpr[i],
                norm_layer=norm_layer,
                act_layer=act_layer,
                ffn_layer=ffn_layer,
                init_values=init_values,
            )
            for i in range(depth)
        ]
        if block_chunks > 0:
            self.chunked_blocks = True
            chunked_blocks = []
            chunksize = depth // block_chunks
            for i in range(0, depth, chunksize):
                # this is to keep the block index consistent if we chunk the block list
                chunked_blocks.append([nn.Identity()] * i + blocks_list[i : i + chunksize])
            self.blocks = nn.ModuleList([BlockChunk(p) for p in chunked_blocks])
        else:
            self.chunked_blocks = False
            self.blocks = nn.ModuleList(blocks_list)

        self.norm = norm_layer(embed_dim)
        self.head = nn.Identity()

        self.mask_token = nn.Parameter(torch.zeros(1, embed_dim))

        self.init_weights()

    def init_weights(self):
        trunc_normal_(self.pos_embed, std=0.02)
        nn.init.normal_(self.cls_token, std=1e-6)
        if self.register_tokens is not None:
            nn.init.normal_(self.register_tokens, std=1e-6)
        named_apply(init_weights_vit_timm, self)

    def interpolate_pos_encoding(self, x, w, h):
        previous_dtype = x.dtype
        npatch = x.shape[1] - 1
        N = self.pos_embed.shape[1] - 1
        if npatch == N and w == h:
            return self.pos_embed
        pos_embed = self.pos_embed.float()
        class_pos_embed = pos_embed[:, 0]
        patch_pos_embed = pos_embed[:, 1:]
        dim = x.shape[-1]
        w0 = w // self.patch_size
        h0 = h // self.patch_size
        # we add a small number to avoid floating point error in the interpolation
        # see discussion at https://github.com/facebookresearch/dino/issues/8
        # DINOv2 with register modify the interpolate_offset from 0.1 to 0.0
        w0, h0 = w0 + self.interpolate_offset, h0 + self.interpolate_offset
        # w0, h0 = w0 + 0.1, h0 + 0.1
        
        s

Download .txt

gitextract_751vu01n/

├── LICENSE
├── README.md
├── assets/
│   └── 1
├── checkpoints/
│   └── checkpoints here.txt
├── data/
│   └── datasets/
│       └── dataset here.txt
├── datasets/
│   ├── augmentor.py
│   ├── frame_utils.py
│   └── video_datasets.py
├── demo.py
├── demo.sh
├── evaluate_stereoanyvideo.sh
├── evaluation/
│   ├── configs/
│   │   ├── eval_dynamic_replica.yaml
│   │   ├── eval_infinigensv.yaml
│   │   ├── eval_kittidepth.yaml
│   │   ├── eval_sintel_clean.yaml
│   │   ├── eval_sintel_final.yaml
│   │   ├── eval_southkensington.yaml
│   │   └── eval_vkitti2.yaml
│   ├── core/
│   │   └── evaluator.py
│   ├── evaluate.py
│   └── utils/
│       ├── eval_utils.py
│       ├── ssim.py
│       └── utils.py
├── models/
│   ├── Video-Depth-Anything/
│   │   ├── app.py
│   │   ├── get_weights.sh
│   │   ├── run.py
│   │   ├── utils/
│   │   │   ├── dc_utils.py
│   │   │   └── util.py
│   │   └── video_depth_anything/
│   │       ├── dinov2.py
│   │       ├── dinov2_layers/
│   │       │   ├── __init__.py
│   │       │   ├── attention.py
│   │       │   ├── block.py
│   │       │   ├── drop_path.py
│   │       │   ├── layer_scale.py
│   │       │   ├── mlp.py
│   │       │   ├── patch_embed.py
│   │       │   └── swiglu_ffn.py
│   │       ├── dpt.py
│   │       ├── dpt_temporal.py
│   │       ├── motion_module/
│   │       │   ├── attention.py
│   │       │   └── motion_module.py
│   │       ├── util/
│   │       │   ├── blocks.py
│   │       │   └── transform.py
│   │       └── video_depth.py
│   ├── core/
│   │   ├── attention.py
│   │   ├── corr.py
│   │   ├── extractor.py
│   │   ├── model_zoo.py
│   │   ├── stereoanyvideo.py
│   │   ├── update.py
│   │   └── utils/
│   │       ├── config.py
│   │       └── utils.py
│   ├── raft_model.py
│   └── stereoanyvideo_model.py
├── requirements.txt
├── third_party/
│   └── RAFT/
│       ├── LICENSE
│       ├── README.md
│       ├── alt_cuda_corr/
│       │   ├── correlation.cpp
│       │   ├── correlation_kernel.cu
│       │   └── setup.py
│       ├── chairs_split.txt
│       ├── core/
│       │   ├── __init__.py
│       │   ├── corr.py
│       │   ├── datasets.py
│       │   ├── extractor.py
│       │   ├── raft.py
│       │   ├── update.py
│       │   └── utils/
│       │       ├── __init__.py
│       │       ├── augmentor.py
│       │       ├── flow_viz.py
│       │       ├── frame_utils.py
│       │       └── utils.py
│       ├── demo.py
│       ├── download_models.sh
│       ├── evaluate.py
│       ├── train.py
│       ├── train_mixed.sh
│       └── train_standard.sh
├── train_stereoanyvideo.py
├── train_stereoanyvideo.sh
└── train_utils/
    ├── logger.py
    ├── losses.py
    └── utils.py

Download .txt

SYMBOL INDEX (534 symbols across 54 files)

FILE: datasets/augmentor.py
  class AdjustGamma (line 13) | class AdjustGamma(object):
    method __init__ (line 14) | def __init__(self, gamma_min, gamma_max, gain_min=1.0, gain_max=1.0):
    method __call__ (line 22) | def __call__(self, sample):
    method __repr__ (line 27) | def __repr__(self):
  class SequenceDispFlowAugmentor (line 31) | class SequenceDispFlowAugmentor:
    method __init__ (line 32) | def __init__(
    method color_transform (line 71) | def color_transform(self, seq):
    method eraser_transform (line 95) | def eraser_transform(self, seq, bounds=[50, 100]):
    method spatial_transform (line 111) | def spatial_transform(self, img, disp):
    method __call__ (line 183) | def __call__(self, img, disp):
  class SequenceDispSparseFlowAugmentor (line 197) | class SequenceDispSparseFlowAugmentor:
    method __init__ (line 198) | def __init__(
    method color_transform (line 237) | def color_transform(self, seq):
    method eraser_transform (line 252) | def eraser_transform(self, seq, bounds=[50, 100]):
    method resize_sparse_flow_map (line 268) | def resize_sparse_flow_map(self, flow, valid, fx=1.0, fy=1.0):
    method spatial_transform (line 302) | def spatial_transform(self, img, disp, valid):
    method __call__ (line 352) | def __call__(self, img, disp, valid):

FILE: datasets/frame_utils.py
  function readFlow (line 14) | def readFlow(fn):
  function readPFM (line 36) | def readPFM(file):
  function readDispSintelStereo (line 74) | def readDispSintelStereo(file_name):
  function readDispMiddlebury (line 87) | def readDispMiddlebury(file_name):
  function read_gen (line 98) | def read_gen(file_name, pil=False):

FILE: datasets/video_datasets.py
  class DynamicReplicaFrameAnnotation (line 31) | class DynamicReplicaFrameAnnotation(ImplicitronFrameAnnotation):
  class StereoSequenceDataset (line 37) | class StereoSequenceDataset(data.Dataset):
    method __init__ (line 38) | def __init__(self, aug_params=None, sparse=False, reader=None):
    method _load_depth (line 60) | def _load_depth(self, depth_path):
    method _load_npy_depth (line 73) | def _load_npy_depth(self, depth_npy):
    method _load_vkitti2 (line 77) | def _load_vkitti2(self, depth_png):
    method _load_kitti_depth (line 84) | def _load_kitti_depth(self, depth_png):
    method _load_16big_png_depth (line 96) | def _load_16big_png_depth(self, depth_png):
    method load_tartanair_pose (line 107) | def load_tartanair_pose(self, filepath, index=0):
    method parse_txt_file (line 122) | def parse_txt_file(self, file_path):
    method _get_pytorch3d_camera (line 158) | def _get_pytorch3d_camera(
    method _get_pytorch3d_camera_from_blender (line 204) | def _get_pytorch3d_camera_from_blender(self, R, T, K, image_size, scal...
    method _get_output_tensor (line 247) | def _get_output_tensor(self, sample):
    method __getitem__ (line 666) | def __getitem__(self, index):
    method __mul__ (line 766) | def __mul__(self, v):
    method __len__ (line 772) | def __len__(self):
  class DynamicReplicaDataset (line 776) | class DynamicReplicaDataset(StereoSequenceDataset):
    method __init__ (line 777) | def __init__(
  class InfinigenStereoVideoDataset (line 876) | class InfinigenStereoVideoDataset(StereoSequenceDataset):
    method __init__ (line 877) | def __init__(
  class SouthKensingtonStereoVideoDataset (line 961) | class SouthKensingtonStereoVideoDataset(StereoSequenceDataset):
    method __init__ (line 962) | def __init__(
  class KITTIDepthDataset (line 1025) | class KITTIDepthDataset(StereoSequenceDataset):
    method __init__ (line 1026) | def __init__(
  function split_train_valid (line 1139) | def split_train_valid(path_list, valid_keywords):
  class TartanAirDataset (line 1148) | class TartanAirDataset(StereoSequenceDataset):
    method __init__ (line 1149) | def __init__(
  class VKITTI2Dataset (line 1275) | class VKITTI2Dataset(StereoSequenceDataset):
    method __init__ (line 1276) | def __init__(
  class SequenceSpringDataset (line 1367) | class SequenceSpringDataset(StereoSequenceDataset):
    method __init__ (line 1368) | def __init__(
  class SequenceSceneFlowDataset (line 1424) | class SequenceSceneFlowDataset(StereoSequenceDataset):
    method __init__ (line 1425) | def __init__(
    method _add_things (line 1450) | def _add_things(self, split="TRAIN"):
    method _add_monkaa (line 1496) | def _add_monkaa(self):
    method _add_driving (line 1530) | def _add_driving(self):
    method _append_sample (line 1565) | def _append_sample(self, images, disparities):
  class SequenceSintelStereo (line 1583) | class SequenceSintelStereo(StereoSequenceDataset):
    method __init__ (line 1584) | def __init__(
  function fetch_dataloader (line 1649) | def fetch_dataloader(args):

FILE: demo.py
  function load_image (line 19) | def load_image(imfile):
  function viz (line 25) | def viz(img, flo):
  function demo (line 37) | def demo(args):

FILE: evaluation/core/evaluator.py
  function depth_to_colormap (line 19) | def depth_to_colormap(depth, colormap='jet', eps=1e-3, scale_vmin=1.0):
  class Evaluator (line 33) | class Evaluator(Configurable):
    method setup_visualization (line 43) | def setup_visualization(self, cfg: DictConfig) -> None:
    method evaluate_sequence (line 52) | def evaluate_sequence(

FILE: evaluation/evaluate.py
  class DefaultConfig (line 25) | class DefaultConfig:
  function run_eval (line 57) | def run_eval(cfg: DefaultConfig):
  function evaluate (line 141) | def evaluate(cfg: DefaultConfig) -> None:

FILE: evaluation/utils/eval_utils.py
  class PerceptionMetric (line 14) | class PerceptionMetric:
    method __str__ (line 20) | def __str__(self):
  function compute_flow (line 33) | def compute_flow(seq, is_seq=True):
  function flow_warp (line 50) | def flow_warp(x, flow):
  function eval_endpoint_error_sequence (line 75) | def eval_endpoint_error_sequence(
  function eval_TCC_sequence (line 146) | def eval_TCC_sequence(
  function eval_TCM_sequence (line 185) | def eval_TCM_sequence(
  function eval_OPW_sequence (line 242) | def eval_OPW_sequence(
  function eval_RTC_sequence (line 313) | def eval_RTC_sequence(
  function depth2disparity_scale (line 379) | def depth2disparity_scale(left_camera, right_camera, image_size_tensor):
  function depth_to_pcd (line 394) | def depth_to_pcd(
  function filter_outliers (line 438) | def filter_outliers(pcd, sigma=3):
  function eval_batch (line 446) | def eval_batch(batch_dict, predictions, scale) -> Dict[str, Union[float,...

FILE: evaluation/utils/ssim.py
  function _fspecial_gauss_1d (line 9) | def _fspecial_gauss_1d(size, sigma):
  function gaussian_filter (line 26) | def gaussian_filter(input, win):
  function _ssim (line 54) | def _ssim(X, Y, data_range, win, size_average=True, K=(0.01, 0.03)):
  function ssim (line 94) | def ssim(
  function ms_ssim (line 152) | def ms_ssim(
  class SSIM (line 227) | class SSIM(torch.nn.Module):
    method __init__ (line 228) | def __init__(
    method forward (line 258) | def forward(self, X, Y):
  class MS_SSIM (line 270) | class MS_SSIM(torch.nn.Module):
    method __init__ (line 271) | def __init__(
    method forward (line 301) | def forward(self, X, Y):

FILE: evaluation/utils/utils.py
  class PerceptionPrediction (line 28) | class PerceptionPrediction:
  function aggregate_eval_results (line 39) | def aggregate_eval_results(per_batch_eval_results, reduction="mean"):
  function aggregate_and_print_results (line 62) | def aggregate_and_print_results(
  function pretty_print_perception_metrics (line 76) | def pretty_print_perception_metrics(results):
  function read_calibration (line 92) | def read_calibration(calibration_file, resolution_string):
  function filter_depth_discontinuities (line 216) | def filter_depth_discontinuities(pcd, depth_map, threshold=5):
  function visualize_batch (line 246) | def visualize_batch(

FILE: models/Video-Depth-Anything/app.py
  function infer_video_depth (line 40) | def infer_video_depth(
  function construct_demo (line 64) | def construct_demo():

FILE: models/Video-Depth-Anything/utils/dc_utils.py
  function read_video_frames (line 21) | def read_video_frames(video_path, process_length, target_fps=-1, max_res...
  function save_video (line 74) | def save_video(
  class ColorMapper (line 92) | class ColorMapper:
    method __init__ (line 94) | def __init__(self, colormap: str = "inferno"):
    method apply (line 97) | def apply(self, image: torch.Tensor, v_min=None, v_max=None):
  function vis_sequence_depth (line 109) | def vis_sequence_depth(depths: np.ndarray, v_min=None, v_max=None):

FILE: models/Video-Depth-Anything/utils/util.py
  function compute_scale_and_shift (line 16) | def compute_scale_and_shift(prediction, target, mask, scale_only=False):
  function compute_scale (line 23) | def compute_scale(prediction, target, mask):
  function compute_scale_and_shift_full (line 40) | def compute_scale_and_shift_full(prediction, target, mask):
  function get_interpolate_frames (line 65) | def get_interpolate_frames(frame_list_pre, frame_list_post):

FILE: models/Video-Depth-Anything/video_depth_anything/dinov2.py
  function named_apply (line 26) | def named_apply(fn: Callable, module: nn.Module, name="", depth_first=Tr...
  class BlockChunk (line 37) | class BlockChunk(nn.ModuleList):
    method forward (line 38) | def forward(self, x):
  class DinoVisionTransformer (line 44) | class DinoVisionTransformer(nn.Module):
    method __init__ (line 45) | def __init__(
    method init_weights (line 172) | def init_weights(self):
    method interpolate_pos_encoding (line 179) | def interpolate_pos_encoding(self, x, w, h):
    method prepare_tokens_with_masks (line 212) | def prepare_tokens_with_masks(self, x, masks=None):
    method forward_features_list (line 233) | def forward_features_list(self, x_list, masks_list):
    method forward_features (line 253) | def forward_features(self, x, masks=None):
    method _get_intermediate_layers_not_chunked (line 271) | def _get_intermediate_layers_not_chunked(self, x, n=1):
    method _get_intermediate_layers_chunked (line 283) | def _get_intermediate_layers_chunked(self, x, n=1):
    method get_intermediate_layers (line 297) | def get_intermediate_layers(
    method forward (line 323) | def forward(self, *args, is_training=False, **kwargs):
  function init_weights_vit_timm (line 331) | def init_weights_vit_timm(module: nn.Module, name: str = ""):
  function vit_small (line 339) | def vit_small(patch_size=16, num_register_tokens=0, **kwargs):
  function vit_base (line 353) | def vit_base(patch_size=16, num_register_tokens=0, **kwargs):
  function vit_large (line 367) | def vit_large(patch_size=16, num_register_tokens=0, **kwargs):
  function vit_giant2 (line 381) | def vit_giant2(patch_size=16, num_register_tokens=0, **kwargs):
  function DINOv2 (line 398) | def DINOv2(model_name):

FILE: models/Video-Depth-Anything/video_depth_anything/dinov2_layers/attention.py
  class Attention (line 29) | class Attention(nn.Module):
    method __init__ (line 30) | def __init__(
    method forward (line 49) | def forward(self, x: Tensor) -> Tensor:
  class MemEffAttention (line 65) | class MemEffAttention(Attention):
    method forward (line 66) | def forward(self, x: Tensor, attn_bias=None) -> Tensor:

FILE: models/Video-Depth-Anything/video_depth_anything/dinov2_layers/block.py
  class Block (line 36) | class Block(nn.Module):
    method __init__ (line 37) | def __init__(
    method forward (line 82) | def forward(self, x: Tensor) -> Tensor:
  function drop_add_residual_stochastic_depth (line 110) | def drop_add_residual_stochastic_depth(
  function get_branges_scales (line 134) | def get_branges_scales(x, sample_drop_ratio=0.0):
  function add_residual (line 142) | def add_residual(x, brange, residual, residual_scale_factor, scaling_vec...
  function get_attn_bias_and_cat (line 157) | def get_attn_bias_and_cat(x_list, branges=None):
  function drop_add_residual_stochastic_depth_list (line 181) | def drop_add_residual_stochastic_depth_list(
  class NestedTensorBlock (line 204) | class NestedTensorBlock(Block):
    method forward_nested (line 205) | def forward_nested(self, x_list: List[Tensor]) -> List[Tensor]:
    method forward (line 245) | def forward(self, x_or_x_list):

FILE: models/Video-Depth-Anything/video_depth_anything/dinov2_layers/drop_path.py
  function drop_path (line 15) | def drop_path(x, drop_prob: float = 0.0, training: bool = False):
  class DropPath (line 27) | class DropPath(nn.Module):
    method __init__ (line 30) | def __init__(self, drop_prob=None):
    method forward (line 34) | def forward(self, x):

FILE: models/Video-Depth-Anything/video_depth_anything/dinov2_layers/layer_scale.py
  class LayerScale (line 16) | class LayerScale(nn.Module):
    method __init__ (line 17) | def __init__(
    method forward (line 27) | def forward(self, x: Tensor) -> Tensor:

FILE: models/Video-Depth-Anything/video_depth_anything/dinov2_layers/mlp.py
  class Mlp (line 17) | class Mlp(nn.Module):
    method __init__ (line 18) | def __init__(
    method forward (line 35) | def forward(self, x: Tensor) -> Tensor:

FILE: models/Video-Depth-Anything/video_depth_anything/dinov2_layers/patch_embed.py
  function make_2tuple (line 17) | def make_2tuple(x):
  class PatchEmbed (line 26) | class PatchEmbed(nn.Module):
    method __init__ (line 38) | def __init__(
    method forward (line 69) | def forward(self, x: Tensor) -> Tensor:
    method flops (line 84) | def flops(self) -> float:

FILE: models/Video-Depth-Anything/video_depth_anything/dinov2_layers/swiglu_ffn.py
  class SwiGLUFFN (line 13) | class SwiGLUFFN(nn.Module):
    method __init__ (line 14) | def __init__(
    method forward (line 29) | def forward(self, x: Tensor) -> Tensor:
  class SwiGLUFFNFused (line 45) | class SwiGLUFFNFused(SwiGLU):
    method __init__ (line 46) | def __init__(

FILE: models/Video-Depth-Anything/video_depth_anything/dpt.py
  function _make_fusion_block (line 21) | def _make_fusion_block(features, use_bn, size=None):
  class ConvBlock (line 33) | class ConvBlock(nn.Module):
    method __init__ (line 34) | def __init__(self, in_feature, out_feature):
    method forward (line 43) | def forward(self, x):
  class DPTHead (line 47) | class DPTHead(nn.Module):
    method __init__ (line 48) | def __init__(
    method forward (line 126) | def forward(self, out_features, patch_h, patch_w):

FILE: models/Video-Depth-Anything/video_depth_anything/dpt_temporal.py
  class DPTHeadTemporal (line 22) | class DPTHeadTemporal(DPTHead):
    method __init__ (line 23) | def __init__(self,
    method forward (line 53) | def forward(self, out_features, patch_h, patch_w, frame_length):

FILE: models/Video-Depth-Anything/video_depth_anything/motion_module/attention.py
  class CrossAttention (line 30) | class CrossAttention(nn.Module):
    method __init__ (line 45) | def __init__(
    method reshape_heads_to_batch_dim (line 93) | def reshape_heads_to_batch_dim(self, tensor):
    method reshape_heads_to_4d (line 100) | def reshape_heads_to_4d(self, tensor):
    method reshape_batch_dim_to_heads (line 106) | def reshape_batch_dim_to_heads(self, tensor):
    method reshape_4d_to_heads (line 113) | def reshape_4d_to_heads(self, tensor):
    method set_attention_slice (line 119) | def set_attention_slice(self, slice_size):
    method forward (line 125) | def forward(self, hidden_states, encoder_hidden_states=None, attention...
    method _attention (line 182) | def _attention(self, query, key, value, attention_mask=None):
    method _sliced_attention (line 213) | def _sliced_attention(self, query, key, value, sequence_length, dim, a...
    method _memory_efficient_attention_xformers (line 256) | def _memory_efficient_attention_xformers(self, query, key, value, atte...
    method _memory_efficient_attention_split (line 275) | def _memory_efficient_attention_split(self, query, key, value, attenti...
  class FeedForward (line 296) | class FeedForward(nn.Module):
    method __init__ (line 308) | def __init__(
    method forward (line 335) | def forward(self, hidden_states):
  class GELU (line 341) | class GELU(nn.Module):
    method __init__ (line 346) | def __init__(self, dim_in: int, dim_out: int):
    method gelu (line 350) | def gelu(self, gate):
    method forward (line 356) | def forward(self, hidden_states):
  class GEGLU (line 363) | class GEGLU(nn.Module):
    method __init__ (line 372) | def __init__(self, dim_in: int, dim_out: int):
    method gelu (line 376) | def gelu(self, gate):
    method forward (line 382) | def forward(self, hidden_states):
  class ApproximateGELU (line 387) | class ApproximateGELU(nn.Module):
    method __init__ (line 394) | def __init__(self, dim_in: int, dim_out: int):
    method forward (line 398) | def forward(self, x):
  function precompute_freqs_cis (line 403) | def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
  function reshape_for_broadcast (line 411) | def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):
  function apply_rotary_emb (line 419) | def apply_rotary_emb(

FILE: models/Video-Depth-Anything/video_depth_anything/motion_module/motion_module.py
  function zero_module (line 25) | def zero_module(module):
  class TemporalModule (line 32) | class TemporalModule(nn.Module):
    method __init__ (line 33) | def __init__(
    method forward (line 60) | def forward(self, input_tensor, encoder_hidden_states, attention_mask=...
  class TemporalTransformer3DModel (line 68) | class TemporalTransformer3DModel(nn.Module):
    method __init__ (line 69) | def __init__(
    method forward (line 102) | def forward(self, hidden_states, encoder_hidden_states=None, attention...
  class TemporalTransformerBlock (line 129) | class TemporalTransformerBlock(nn.Module):
    method __init__ (line 130) | def __init__(
    method forward (line 164) | def forward(self, hidden_states, encoder_hidden_states=None, attention...
  class PositionalEncoding (line 180) | class PositionalEncoding(nn.Module):
    method __init__ (line 181) | def __init__(
    method forward (line 196) | def forward(self, x):
  class TemporalAttention (line 200) | class TemporalAttention(CrossAttention):
    method __init__ (line 201) | def __init__(
    method forward (line 230) | def forward(self, hidden_states, encoder_hidden_states=None, attention...

FILE: models/Video-Depth-Anything/video_depth_anything/util/blocks.py
  function _make_scratch (line 4) | def _make_scratch(in_shape, out_shape, groups=1, expand=False):
  class ResidualConvUnit (line 37) | class ResidualConvUnit(nn.Module):
    method __init__ (line 40) | def __init__(self, features, activation, bn):
    method forward (line 68) | def forward(self, x):
  class FeatureFusionBlock (line 94) | class FeatureFusionBlock(nn.Module):
    method __init__ (line 97) | def __init__(
    method forward (line 135) | def forward(self, *xs, size=None):

FILE: models/Video-Depth-Anything/video_depth_anything/util/transform.py
  class Resize (line 5) | class Resize(object):
    method __init__ (line 9) | def __init__(
    method constrain_to_multiple_of (line 51) | def constrain_to_multiple_of(self, x, min_val=0, max_val=None):
    method get_size (line 62) | def get_size(self, width, height):
    method __call__ (line 109) | def __call__(self, sample):
  class NormalizeImage (line 125) | class NormalizeImage(object):
    method __init__ (line 129) | def __init__(self, mean, std):
    method __call__ (line 133) | def __call__(self, sample):
  class PrepareForNet (line 139) | class PrepareForNet(object):
    method __init__ (line 143) | def __init__(self):
    method __call__ (line 146) | def __call__(self, sample):

FILE: models/Video-Depth-Anything/video_depth_anything/video_depth.py
  class VideoDepthAnything (line 35) | class VideoDepthAnything(nn.Module):
    method __init__ (line 36) | def __init__(
    method forward (line 57) | def forward(self, x):
    method infer_video_depth (line 66) | def infer_video_depth(self, frames, target_fps, input_size=518, device...

FILE: models/core/attention.py
  function elu_feature_map (line 13) | def elu_feature_map(x):
  class PositionEncodingSine (line 17) | class PositionEncodingSine(nn.Module):
    method __init__ (line 22) | def __init__(self, d_model, max_shape=(256, 256), temp_bug_fix=True):
    method forward (line 53) | def forward(self, x):
  class LinearAttention (line 61) | class LinearAttention(Module):
    method __init__ (line 62) | def __init__(self, eps=1e-6):
    method forward (line 67) | def forward(self, queries, keys, values, q_mask=None, kv_mask=None):
  class FullAttention (line 97) | class FullAttention(Module):
    method __init__ (line 98) | def __init__(self, use_dropout=False, attention_dropout=0.1):
    method forward (line 103) | def forward(self, queries, keys, values, q_mask=None, kv_mask=None):
  class LoFTREncoderLayer (line 134) | class LoFTREncoderLayer(nn.Module):
    method __init__ (line 135) | def __init__(self, d_model, nhead, attention="linear"):
    method forward (line 159) | def forward(self, x, source, x_mask=None, source_mask=None):
  class LocalFeatureTransformer (line 187) | class LocalFeatureTransformer(nn.Module):
    method __init__ (line 190) | def __init__(self, d_model, nhead, layer_names, attention):
    method _reset_parameters (line 202) | def _reset_parameters(self):
    method forward (line 207) | def forward(self, feat0, feat1, mask0=None, mask1=None):

FILE: models/core/corr.py
  function bilinear_sampler (line 6) | def bilinear_sampler(img, coords, mode="bilinear", mask=False):
  function coords_grid (line 24) | def coords_grid(batch, ht, wd, device):
  class AAPC (line 32) | class AAPC:
    method __init__ (line 36) | def __init__(self, fmap1, fmap2, att=None):
    method __call__ (line 43) | def __call__(self, flow, extra_offset, small_patch=False):
    method correlation (line 49) | def correlation(self, left_feature, right_feature, flow, small_patch):
    method get_correlation (line 73) | def get_correlation(self, left_feature, right_feature, psize=(3, 3), d...

FILE: models/core/extractor.py
  class ResidualBlock (line 12) | class ResidualBlock(nn.Module):
    method __init__ (line 13) | def __init__(self, in_planes, planes, norm_fn="group", stride=1):
    method forward (line 48) | def forward(self, x):
  class BasicEncoder (line 58) | class BasicEncoder(nn.Module):
    method __init__ (line 59) | def __init__(self, input_dim=3, output_dim=128, norm_fn="batch", dropo...
    method _make_layer (line 99) | def _make_layer(self, dim, stride=1):
    method forward (line 107) | def forward(self, x):
  class MultiBasicEncoder (line 134) | class MultiBasicEncoder(nn.Module):
    method __init__ (line 135) | def __init__(self, output_dim=[128], norm_fn='batch', dropout=0.0, dow...
    method _make_layer (line 201) | def _make_layer(self, dim, stride=1):
    method forward (line 209) | def forward(self, x, dual_inp=False, num_layers=3):
  class DepthExtractor (line 238) | class DepthExtractor(nn.Module):
    method __init__ (line 239) | def __init__(self):
    method forward (line 258) | def forward(self, x):

FILE: models/core/model_zoo.py
  function model_zoo (line 18) | def model_zoo(model_name: str, **kwargs):
  function get_all_model_default_configs (line 37) | def get_all_model_default_configs():

FILE: models/core/stereoanyvideo.py
  function _ntuple (line 19) | def _ntuple(n):
  function exists (line 28) | def exists(val):
  function default (line 32) | def default(val, d):
  class Mlp (line 38) | class Mlp(nn.Module):
    method __init__ (line 39) | def __init__(
    method forward (line 66) | def forward(self, x):
  class StereoAnyVideo (line 75) | class StereoAnyVideo(nn.Module):
    method __init__ (line 76) | def __init__(self, mixed_precision=False):
    method no_weight_decay (line 93) | def no_weight_decay(self):
    method freeze_bn (line 96) | def freeze_bn(self):
    method convex_upsample (line 101) | def convex_upsample(self, flow, mask, rate=4):
    method convex_upsample_3D (line 114) | def convex_upsample_3D(self, flow, mask, b, T, rate=4):
    method zero_init (line 144) | def zero_init(self, fmap):
    method forward_batch_test (line 151) | def forward_batch_test(
    method forward (line 205) | def forward(self, image1, image2, flow_init=None, iters=12, test_mode=...

FILE: models/core/update.py
  function pool2x (line 9) | def pool2x(x):
  function pool4x (line 12) | def pool4x(x):
  function interp (line 15) | def interp(x, dest):
  class FlowHead (line 20) | class FlowHead(nn.Module):
    method __init__ (line 21) | def __init__(self, input_dim=128, hidden_dim=256, output_dim=2):
    method forward (line 27) | def forward(self, x):
  class FlowHead3D (line 31) | class FlowHead3D(nn.Module):
    method __init__ (line 32) | def __init__(self, input_dim=128, hidden_dim=256):
    method forward (line 38) | def forward(self, x):
  class ConvGRU (line 42) | class ConvGRU(nn.Module):
    method __init__ (line 43) | def __init__(self, hidden_dim, input_dim, kernel_size=3):
    method forward (line 49) | def forward(self, h, cz, cr, cq, *x_list):
  class SepConvGRU (line 61) | class SepConvGRU(nn.Module):
    method __init__ (line 62) | def __init__(self, hidden_dim=128, input_dim=192+128):
    method forward (line 73) | def forward(self, h, *x):
  class BasicMotionEncoder (line 92) | class BasicMotionEncoder(nn.Module):
    method __init__ (line 93) | def __init__(self, cor_planes):
    method forward (line 102) | def forward(self, flow, corr):
  class BasicMotionEncoder3D (line 113) | class BasicMotionEncoder3D(nn.Module):
    method __init__ (line 114) | def __init__(self, cor_planes):
    method forward (line 123) | def forward(self, flow, corr):
  class SepConvGRU3D (line 134) | class SepConvGRU3D(nn.Module):
    method __init__ (line 135) | def __init__(self, hidden_dim=128, input_dim=192 + 128):
    method forward (line 167) | def forward(self, h, x):
  class SKSepConvGRU3D (line 191) | class SKSepConvGRU3D(nn.Module):
    method __init__ (line 192) | def __init__(self, hidden_dim=128, input_dim=192 + 128):
    method forward (line 228) | def forward(self, h, x):
  class BasicUpdateBlock (line 252) | class BasicUpdateBlock(nn.Module):
    method __init__ (line 253) | def __init__(self, hidden_dim, cor_planes, mask_size=8, attention_type...
    method forward (line 273) | def forward(self, net, inp, corr, flow, upsample=True, t=1):
  class Attention (line 289) | class Attention(nn.Module):
    method __init__ (line 290) | def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None):
    method forward (line 298) | def forward(self, x):
  class TimeAttnBlock (line 312) | class TimeAttnBlock(nn.Module):
    method __init__ (line 313) | def __init__(self, dim=256, num_heads=8):
    method forward (line 322) | def forward(self, x, T=1):
  class SpaceAttnBlock (line 340) | class SpaceAttnBlock(nn.Module):
    method __init__ (line 341) | def __init__(self, dim=256, num_heads=8):
    method forward (line 345) | def forward(self, x, T=1):
  class SequenceUpdateBlock3D (line 353) | class SequenceUpdateBlock3D(nn.Module):
    method __init__ (line 354) | def __init__(self, hidden_dim, cor_planes, mask_size=8):
    method forward (line 368) | def forward(self, net, inp, corrs, flows, t):

FILE: models/core/utils/config.py
  class ReplaceableBase (line 174) | class ReplaceableBase:
    method __new__ (line 180) | def __new__(cls, *args, **kwargs):
  class Configurable (line 192) | class Configurable:
    method __new__ (line 200) | def __new__(cls, *args, **kwargs):
  class _Registry (line 215) | class _Registry:
    method __init__ (line 222) | def __init__(self) -> None:
    method register (line 227) | def register(self, some_class: Type[_X]) -> Type[_X]:
    method _register (line 235) | def _register(
    method get (line 260) | def get(
    method get_all (line 292) | def get_all(
    method _is_base_class (line 320) | def _is_base_class(some_class: Type[ReplaceableBase]) -> bool:
    method _base_class_from_class (line 328) | def _base_class_from_class(
  class _ProcessType (line 344) | class _ProcessType(Enum):
  function _default_create (line 355) | def _default_create(
  function run_auto_creation (line 413) | def run_auto_creation(self: Any) -> None:
  function _is_configurable_class (line 421) | def _is_configurable_class(C) -> bool:
  function get_default_args (line 425) | def get_default_args(C, *, _do_not_process: Tuple[type, ...] = ()) -> Di...
  function _dataclass_name_for_function (line 487) | def _dataclass_name_for_function(C: Any) -> str:
  function enable_get_default_args (line 496) | def enable_get_default_args(C: Any, *, overwrite: bool = True) -> None:
  function _params_iter (line 553) | def _params_iter(C):
  function _is_immutable_type (line 563) | def _is_immutable_type(type_: Type, val: Any) -> bool:
  function _resolve_optional (line 575) | def _resolve_optional(type_: Any) -> Tuple[bool, Any]:
  function _is_actually_dataclass (line 587) | def _is_actually_dataclass(some_class) -> bool:
  function expand_args_fields (line 597) | def expand_args_fields(
  function get_default_args_field (line 776) | def get_default_args_field(C, *, _do_not_process: Tuple[type, ...] = ()):
  function _get_type_to_process (line 793) | def _get_type_to_process(type_) -> Optional[Tuple[Type, _ProcessType]]:
  function _process_member (line 825) | def _process_member(
  function remove_unused_components (line 917) | def remove_unused_components(dict_: DictConfig) -> None:

FILE: models/core/utils/utils.py
  function interp (line 7) | def interp(tensor, size):
  class InputPadder (line 16) | class InputPadder:
    method __init__ (line 19) | def __init__(self, dims, mode="sintel", divis_by=8):
    method pad (line 33) | def pad(self, *inputs):
    method unpad (line 37) | def unpad(self, x):
  function coords_grid (line 44) | def coords_grid(batch, ht, wd):
  function upflow8 (line 50) | def upflow8(flow, mode='bilinear'):

FILE: models/raft_model.py
  class RAFTModel (line 13) | class RAFTModel(Configurable, torch.nn.Module):
    method __post_init__ (line 16) | def __post_init__(self):
    method forward (line 45) | def forward(self, image1, image2, iters=10):
    method forward_fullres (line 59) | def forward_fullres(self, image1, image2, iters=20):

FILE: models/stereoanyvideo_model.py
  class StereoAnyVideoModel (line 9) | class StereoAnyVideoModel(Configurable, torch.nn.Module):
    method __post_init__ (line 14) | def __post_init__(self):
    method forward (line 32) | def forward(self, batch_dict, iters=20):

FILE: third_party/RAFT/alt_cuda_corr/correlation.cpp
  function corr_forward (line 23) | std::vector<torch::Tensor> corr_forward(
  function corr_backward (line 36) | std::vector<torch::Tensor> corr_backward(
  function PYBIND11_MODULE (line 51) | PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {

FILE: third_party/RAFT/core/corr.py
  class CorrBlock (line 12) | class CorrBlock:
    method __init__ (line 13) | def __init__(self, fmap1, fmap2, num_levels=4, radius=4):
    method __call__ (line 29) | def __call__(self, coords):
    method corr (line 53) | def corr(fmap1, fmap2):
  class AlternateCorrBlock (line 63) | class AlternateCorrBlock:
    method __init__ (line 64) | def __init__(self, fmap1, fmap2, num_levels=4, radius=4):
    method __call__ (line 74) | def __call__(self, coords):

FILE: third_party/RAFT/core/datasets.py
  class FlowDataset (line 18) | class FlowDataset(data.Dataset):
    method __init__ (line 19) | def __init__(self, aug_params=None, sparse=False):
    method __getitem__ (line 34) | def __getitem__(self, index):
    method __rmul__ (line 93) | def __rmul__(self, v):
    method __len__ (line 98) | def __len__(self):
  class MpiSintel (line 102) | class MpiSintel(FlowDataset):
    method __init__ (line 103) | def __init__(self, aug_params=None, split='training', root='datasets/S...
  class FlyingChairs (line 121) | class FlyingChairs(FlowDataset):
    method __init__ (line 122) | def __init__(self, aug_params=None, split='train', root='datasets/Flyi...
  class FlyingThings3D (line 137) | class FlyingThings3D(FlowDataset):
    method __init__ (line 138) | def __init__(self, aug_params=None, root='datasets/FlyingThings3D', ds...
  class KITTI (line 161) | class KITTI(FlowDataset):
    method __init__ (line 162) | def __init__(self, aug_params=None, split='training', root='datasets/K...
  class HD1K (line 180) | class HD1K(FlowDataset):
    method __init__ (line 181) | def __init__(self, aug_params=None, root='datasets/HD1k'):
  function fetch_dataloader (line 199) | def fetch_dataloader(args, TRAIN_DS='C+T+K+S+H'):

FILE: third_party/RAFT/core/extractor.py
  class ResidualBlock (line 6) | class ResidualBlock(nn.Module):
    method __init__ (line 7) | def __init__(self, in_planes, planes, norm_fn='group', stride=1):
    method forward (line 48) | def forward(self, x):
  class BottleneckBlock (line 60) | class BottleneckBlock(nn.Module):
    method __init__ (line 61) | def __init__(self, in_planes, planes, norm_fn='group', stride=1):
    method forward (line 107) | def forward(self, x):
  class BasicEncoder (line 118) | class BasicEncoder(nn.Module):
    method __init__ (line 119) | def __init__(self, output_dim=128, norm_fn='batch', dropout=0.0):
    method _make_layer (line 159) | def _make_layer(self, dim, stride=1):
    method forward (line 168) | def forward(self, x):
  class SmallEncoder (line 195) | class SmallEncoder(nn.Module):
    method __init__ (line 196) | def __init__(self, output_dim=128, norm_fn='batch', dropout=0.0):
    method _make_layer (line 235) | def _make_layer(self, dim, stride=1):
    method forward (line 244) | def forward(self, x):

FILE: third_party/RAFT/core/raft.py
  class autocast (line 15) | class autocast:
    method __init__ (line 16) | def __init__(self, enabled):
    method __enter__ (line 18) | def __enter__(self):
    method __exit__ (line 20) | def __exit__(self, *args):
  class RAFT (line 24) | class RAFT(nn.Module):
    method __init__ (line 25) | def __init__(self, args):
    method freeze_bn (line 58) | def freeze_bn(self):
    method initialize_flow (line 63) | def initialize_flow(self, img):
    method upsample_flow (line 72) | def upsample_flow(self, flow, mask):
    method forward (line 86) | def forward(self, image1, image2, iters=12, flow_init=None, upsample=T...

FILE: third_party/RAFT/core/update.py
  class FlowHead (line 6) | class FlowHead(nn.Module):
    method __init__ (line 7) | def __init__(self, input_dim=128, hidden_dim=256):
    method forward (line 13) | def forward(self, x):
  class ConvGRU (line 16) | class ConvGRU(nn.Module):
    method __init__ (line 17) | def __init__(self, hidden_dim=128, input_dim=192+128):
    method forward (line 23) | def forward(self, h, x):
  class SepConvGRU (line 33) | class SepConvGRU(nn.Module):
    method __init__ (line 34) | def __init__(self, hidden_dim=128, input_dim=192+128):
    method forward (line 45) | def forward(self, h, x):
  class SmallMotionEncoder (line 62) | class SmallMotionEncoder(nn.Module):
    method __init__ (line 63) | def __init__(self, args):
    method forward (line 71) | def forward(self, flow, corr):
  class BasicMotionEncoder (line 79) | class BasicMotionEncoder(nn.Module):
    method __init__ (line 80) | def __init__(self, args):
    method forward (line 89) | def forward(self, flow, corr):
  class SmallUpdateBlock (line 99) | class SmallUpdateBlock(nn.Module):
    method __init__ (line 100) | def __init__(self, args, hidden_dim=96):
    method forward (line 106) | def forward(self, net, inp, corr, flow):
  class BasicUpdateBlock (line 114) | class BasicUpdateBlock(nn.Module):
    method __init__ (line 115) | def __init__(self, args, hidden_dim=128, input_dim=128):
    method forward (line 127) | def forward(self, net, inp, corr, flow, upsample=True):

FILE: third_party/RAFT/core/utils/augmentor.py
  class FlowAugmentor (line 15) | class FlowAugmentor:
    method __init__ (line 16) | def __init__(self, crop_size, min_scale=-0.2, max_scale=0.5, do_flip=T...
    method color_transform (line 36) | def color_transform(self, img1, img2):
    method eraser_transform (line 52) | def eraser_transform(self, img1, img2, bounds=[50, 100]):
    method spatial_transform (line 67) | def spatial_transform(self, img1, img2, flow):
    method __call__ (line 111) | def __call__(self, img1, img2, flow):
  class SparseFlowAugmentor (line 122) | class SparseFlowAugmentor:
    method __init__ (line 123) | def __init__(self, crop_size, min_scale=-0.2, max_scale=0.5, do_flip=F...
    method color_transform (line 142) | def color_transform(self, img1, img2):
    method eraser_transform (line 148) | def eraser_transform(self, img1, img2):
    method resize_sparse_flow_map (line 161) | def resize_sparse_flow_map(self, flow, valid, fx=1.0, fy=1.0):
    method spatial_transform (line 195) | def spatial_transform(self, img1, img2, flow, valid):
    method __call__ (line 236) | def __call__(self, img1, img2, flow, valid):

FILE: third_party/RAFT/core/utils/flow_viz.py
  function make_colorwheel (line 20) | def make_colorwheel():
  function flow_uv_to_colors (line 70) | def flow_uv_to_colors(u, v, convert_to_bgr=False):
  function flow_to_image (line 109) | def flow_to_image(flow_uv, clip_flow=None, convert_to_bgr=False):

FILE: third_party/RAFT/core/utils/frame_utils.py
  function readFlow (line 12) | def readFlow(fn):
  function readPFM (line 33) | def readPFM(file):
  function writeFlow (line 70) | def writeFlow(filename,uv,v=None):
  function readFlowKITTI (line 102) | def readFlowKITTI(filename):
  function readDispKITTI (line 109) | def readDispKITTI(filename):
  function writeFlowKITTI (line 116) | def writeFlowKITTI(filename, uv):
  function read_gen (line 123) | def read_gen(file_name, pil=False):

FILE: third_party/RAFT/core/utils/utils.py
  class InputPadder (line 7) | class InputPadder:
    method __init__ (line 9) | def __init__(self, dims, mode='sintel'):
    method pad (line 18) | def pad(self, *inputs):
    method unpad (line 21) | def unpad(self,x):
  function forward_interpolate (line 26) | def forward_interpolate(flow):
  function bilinear_sampler (line 57) | def bilinear_sampler(img, coords, mode='bilinear', mask=False):
  function coords_grid (line 74) | def coords_grid(batch, ht, wd, device):
  function upflow8 (line 80) | def upflow8(flow, mode='bilinear'):

FILE: third_party/RAFT/demo.py
  function load_image (line 20) | def load_image(imfile):
  function viz (line 26) | def viz(img, flo):
  function demo (line 42) | def demo(args):

FILE: third_party/RAFT/evaluate.py
  function create_sintel_submission (line 22) | def create_sintel_submission(model, iters=32, warm_start=False, output_p...
  function create_kitti_submission (line 54) | def create_kitti_submission(model, iters=24, output_path='kitti_submissi...
  function validate_chairs (line 75) | def validate_chairs(model, iters=24):
  function validate_sintel (line 96) | def validate_sintel(model, iters=32):
  function validate_kitti (line 131) | def validate_kitti(model, iters=24):

FILE: third_party/RAFT/train.py
  class GradScaler (line 28) | class GradScaler:
    method __init__ (line 29) | def __init__(self):
    method scale (line 31) | def scale(self, loss):
    method unscale_ (line 33) | def unscale_(self, optimizer):
    method step (line 35) | def step(self, optimizer):
    method update (line 37) | def update(self):
  function sequence_loss (line 47) | def sequence_loss(flow_preds, flow_gt, valid, gamma=0.8, max_flow=MAX_FL...
  function count_parameters (line 75) | def count_parameters(model):
  function fetch_optimizer (line 79) | def fetch_optimizer(args, model):
  class Logger (line 89) | class Logger:
    method __init__ (line 90) | def __init__(self, model, scheduler):
    method _print_training_status (line 97) | def _print_training_status(self):
    method push (line 112) | def push(self, metrics):
    method write_dict (line 125) | def write_dict(self, results):
    method close (line 132) | def close(self):
  function train (line 136) | def train(args):

FILE: train_stereoanyvideo.py
  function fetch_optimizer (line 29) | def fetch_optimizer(args, model):
  function forward_batch (line 49) | def forward_batch(batch, model, args):
  class Lite (line 69) | class Lite(LightningLite):
    method run (line 70) | def run(self, args):

FILE: train_utils/logger.py
  class Logger (line 7) | class Logger:
    method __init__ (line 11) | def __init__(self, model, scheduler, ckpt_path):
    method _print_training_status (line 22) | def _print_training_status(self):
    method push (line 43) | def push(self, metrics, task):
    method update (line 50) | def update(self):
    method write_dict (line 57) | def write_dict(self, results):
    method close (line 64) | def close(self):

FILE: train_utils/losses.py
  function sequence_loss (line 6) | def sequence_loss(flow_preds, flow_gt, valid, loss_gamma=0.9, max_flow=7...
  function temporal_loss (line 65) | def temporal_loss(flow_preds, flow_preds2, flow_gt, flow_gt2, valid, los...
  function compute_flow (line 126) | def compute_flow(Flow_Model, seq):
  function flow_warp (line 143) | def flow_warp(x, flow):
  function bidirectional_alignment (line 169) | def bidirectional_alignment(seq, flows_backward, flows_forward):
  function consistency_loss (line 202) | def consistency_loss(seq, disparities, Flow_Model, alpha=50):

FILE: train_utils/utils.py
  function count_parameters (line 13) | def count_parameters(model):
  function run_test_eval (line 17) | def run_test_eval(ckpt_path, eval_type, evaluator, model, dataloaders, w...
  function fig2data (line 66) | def fig2data(fig):
  function save_ims_to_tb (line 89) | def save_ims_to_tb(writer, batch, output, total_steps):

Download .json

Condensed preview — 83 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (571K chars).

[
  {
    "path": "LICENSE",
    "chars": 11357,
    "preview": "                                 Apache License\n                           Version 2.0, January 2004\n                   "
  },
  {
    "path": "README.md",
    "chars": 3671,
    "preview": "<h1 align='center' style=\"text-align:center; font-weight:bold; font-size:2.0em;letter-spacing:2.0px;\">\nStereo Any Video:"
  },
  {
    "path": "assets/1",
    "chars": 1,
    "preview": "\n"
  },
  {
    "path": "checkpoints/checkpoints here.txt",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "data/datasets/dataset here.txt",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "datasets/augmentor.py",
    "chars": 12953,
    "preview": "import numpy as np\nimport random\nfrom PIL import Image\n\nimport cv2\n\ncv2.setNumThreads(0)\ncv2.ocl.setUseOpenCL(False)\n\nfr"
  },
  {
    "path": "datasets/frame_utils.py",
    "chars": 3366,
    "preview": "import numpy as np\nfrom PIL import Image\nfrom os.path import *\nimport re\nimport imageio\nimport cv2\n\ncv2.setNumThreads(0)"
  },
  {
    "path": "datasets/video_datasets.py",
    "chars": 79294,
    "preview": "import os\nimport copy\nimport gzip\nimport logging\nimport torch\nimport numpy as np\nimport torch.utils.data as data\nimport "
  },
  {
    "path": "demo.py",
    "chars": 6638,
    "preview": "import sys\n\nimport argparse\nimport os\nimport cv2\nimport glob\nimport numpy as np\nimport torch\nimport torch.nn.functional "
  },
  {
    "path": "demo.sh",
    "chars": 187,
    "preview": "#!/bin/bash\n\nexport PYTHONPATH=`(cd ../ && pwd)`:`pwd`:$PYTHONPATH\n\npython demo.py --ckpt ./checkpoints/StereoAnyVideo_M"
  },
  {
    "path": "evaluate_stereoanyvideo.sh",
    "chars": 1509,
    "preview": "#!/bin/bash\n\nexport PYTHONPATH=`(cd ../ && pwd)`:`pwd`:$PYTHONPATH\n\n# evaluate on [sintel, dynamicreplica， infinigensv, "
  },
  {
    "path": "evaluation/configs/eval_dynamic_replica.yaml",
    "chars": 165,
    "preview": "defaults:\n  - default_config_eval\nvisualize_interval: -1\nexp_dir: ./outputs/stereoanyvideo_DynamicReplica\nsample_len: 15"
  },
  {
    "path": "evaluation/configs/eval_infinigensv.yaml",
    "chars": 207,
    "preview": "defaults:\n  - default_config_eval\nvisualize_interval: -1\nrender_bin_size: 0\nexp_dir: ./outputs/stereoanyvideo_InfinigenS"
  },
  {
    "path": "evaluation/configs/eval_kittidepth.yaml",
    "chars": 206,
    "preview": "defaults:\n  - default_config_eval\nvisualize_interval: -1\nrender_bin_size: 0\nexp_dir: ./outputs/stereoanyvideo_KITTIDepth"
  },
  {
    "path": "evaluation/configs/eval_sintel_clean.yaml",
    "chars": 201,
    "preview": "defaults:\n  - default_config_eval\nvisualize_interval: -1\nrender_bin_size: 0\nexp_dir: ./outputs/stereoanyvideo_sintel_cle"
  },
  {
    "path": "evaluation/configs/eval_sintel_final.yaml",
    "chars": 200,
    "preview": "defaults:\n  - default_config_eval\nvisualize_interval: -1\nrender_bin_size: 0\nexp_dir: ./outputs/stereoanyvideo_sintel_fin"
  },
  {
    "path": "evaluation/configs/eval_southkensington.yaml",
    "chars": 203,
    "preview": "defaults:\n  - default_config_eval\nvisualize_interval: 1\nexp_dir: ./outputs/stereoanyvideo_SouthKensingtonIndoor\nsample_l"
  },
  {
    "path": "evaluation/configs/eval_vkitti2.yaml",
    "chars": 199,
    "preview": "defaults:\n  - default_config_eval\nvisualize_interval: -1\nrender_bin_size: 0\nexp_dir: ./outputs/stereoanyvideo_VKITTI2\nsa"
  },
  {
    "path": "evaluation/core/evaluator.py",
    "chars": 6702,
    "preview": "import os\nimport numpy as np\nimport cv2\nfrom collections import defaultdict\nimport torch.nn.functional as F\nimport torch"
  },
  {
    "path": "evaluation/evaluate.py",
    "chars": 4416,
    "preview": "import json\nimport os\nfrom dataclasses import dataclass, field\nfrom typing import Any, Dict, Optional\n\nimport hydra\nimpo"
  },
  {
    "path": "evaluation/utils/eval_utils.py",
    "chars": 17789,
    "preview": "from dataclasses import dataclass\nfrom typing import Dict, Optional, Union\nfrom stereoanyvideo.evaluation.utils.ssim imp"
  },
  {
    "path": "evaluation/utils/ssim.py",
    "chars": 11023,
    "preview": "# Copyright 2020 by Gongfan Fang, Zhejiang University.\n# All rights reserved.\nimport warnings\n\nimport torch\nimport torch"
  },
  {
    "path": "evaluation/utils/utils.py",
    "chars": 12844,
    "preview": "from collections import defaultdict\nimport configparser\nimport os\nimport math\nfrom typing import Optional, List\nimport t"
  },
  {
    "path": "models/Video-Depth-Anything/app.py",
    "chars": 5259,
    "preview": "# Copyright (2025) Bytedance Ltd. and/or its affiliates \n\n# Licensed under the Apache License, Version 2.0 (the \"License"
  },
  {
    "path": "models/Video-Depth-Anything/get_weights.sh",
    "chars": 271,
    "preview": "#!/bin/bash\n\nmkdir checkpoints\ncd checkpoints\nwget https://huggingface.co/depth-anything/Video-Depth-Anything-Small/reso"
  },
  {
    "path": "models/Video-Depth-Anything/run.py",
    "chars": 2899,
    "preview": "# Copyright (2025) Bytedance Ltd. and/or its affiliates \n\n# Licensed under the Apache License, Version 2.0 (the \"License"
  },
  {
    "path": "models/Video-Depth-Anything/utils/dc_utils.py",
    "chars": 4416,
    "preview": "# This file is originally from DepthCrafter/depthcrafter/utils.py at main · Tencent/DepthCrafter\n# SPDX-License-Identifi"
  },
  {
    "path": "models/Video-Depth-Anything/utils/util.py",
    "chars": 2449,
    "preview": "# Copyright (2025) Bytedance Ltd. and/or its affiliates \n\n# Licensed under the Apache License, Version 2.0 (the \"License"
  },
  {
    "path": "models/Video-Depth-Anything/video_depth_anything/dinov2.py",
    "chars": 15178,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n#\n# This source code is licensed under the Apache License, Version "
  },
  {
    "path": "models/Video-Depth-Anything/video_depth_anything/dinov2_layers/__init__.py",
    "chars": 382,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the l"
  },
  {
    "path": "models/Video-Depth-Anything/video_depth_anything/dinov2_layers/attention.py",
    "chars": 2343,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the l"
  },
  {
    "path": "models/Video-Depth-Anything/video_depth_anything/dinov2_layers/block.py",
    "chars": 9332,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the l"
  },
  {
    "path": "models/Video-Depth-Anything/video_depth_anything/dinov2_layers/drop_path.py",
    "chars": 1160,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the l"
  },
  {
    "path": "models/Video-Depth-Anything/video_depth_anything/dinov2_layers/layer_scale.py",
    "chars": 823,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the l"
  },
  {
    "path": "models/Video-Depth-Anything/video_depth_anything/dinov2_layers/mlp.py",
    "chars": 1272,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the l"
  },
  {
    "path": "models/Video-Depth-Anything/video_depth_anything/dinov2_layers/patch_embed.py",
    "chars": 2832,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the l"
  },
  {
    "path": "models/Video-Depth-Anything/video_depth_anything/dinov2_layers/swiglu_ffn.py",
    "chars": 1859,
    "preview": "# Copyright (c) Meta Platforms, Inc. and affiliates.\n# All rights reserved.\n#\n# This source code is licensed under the l"
  },
  {
    "path": "models/Video-Depth-Anything/video_depth_anything/dpt.py",
    "chars": 5467,
    "preview": "# Copyright (2025) Bytedance Ltd. and/or its affiliates \n\n# Licensed under the Apache License, Version 2.0 (the \"License"
  },
  {
    "path": "models/Video-Depth-Anything/video_depth_anything/dpt_temporal.py",
    "chars": 4231,
    "preview": "# Copyright (2025) Bytedance Ltd. and/or its affiliates \n\n# Licensed under the Apache License, Version 2.0 (the \"License"
  },
  {
    "path": "models/Video-Depth-Anything/video_depth_anything/motion_module/attention.py",
    "chars": 16719,
    "preview": "# Copyright 2022 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lic"
  },
  {
    "path": "models/Video-Depth-Anything/video_depth_anything/motion_module/motion_module.py",
    "chars": 11162,
    "preview": "# This file is originally from AnimateDiff/animatediff/models/motion_module.py at main · guoyww/AnimateDiff\n# SPDX-Licen"
  },
  {
    "path": "models/Video-Depth-Anything/video_depth_anything/util/blocks.py",
    "chars": 4064,
    "preview": "import torch.nn as nn\n\n\ndef _make_scratch(in_shape, out_shape, groups=1, expand=False):\n    scratch = nn.Module()\n\n    o"
  },
  {
    "path": "models/Video-Depth-Anything/video_depth_anything/util/transform.py",
    "chars": 6075,
    "preview": "import numpy as np\nimport cv2\n\n\nclass Resize(object):\n    \"\"\"Resize sample to given size (width, height).\n    \"\"\"\n\n    d"
  },
  {
    "path": "models/Video-Depth-Anything/video_depth_anything/video_depth.py",
    "chars": 6363,
    "preview": "# Copyright (2025) Bytedance Ltd. and/or its affiliates \n\n# Licensed under the Apache License, Version 2.0 (the \"License"
  },
  {
    "path": "models/core/attention.py",
    "chars": 8421,
    "preview": "import math\nimport copy\nimport torch\nimport torch.nn as nn\nfrom torch.nn import Module, Dropout\n\n\"\"\"\nLinear Transformer "
  },
  {
    "path": "models/core/corr.py",
    "chars": 3401,
    "preview": "import torch\nimport torch.nn.functional as F\nfrom einops import rearrange\n\n\ndef bilinear_sampler(img, coords, mode=\"bili"
  },
  {
    "path": "models/core/extractor.py",
    "chars": 9725,
    "preview": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport numpy as np\nimport os\nimport sys\nimport import"
  },
  {
    "path": "models/core/model_zoo.py",
    "chars": 1184,
    "preview": "import copy\nfrom pytorch3d.implicitron.tools.config import get_default_args\nfrom stereoanyvideo.models.stereoanyvideo_mo"
  },
  {
    "path": "models/core/stereoanyvideo.py",
    "chars": 12842,
    "preview": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom typing import Dict, List\nfrom einops import rea"
  },
  {
    "path": "models/core/update.py",
    "chars": 14208,
    "preview": "from einops import rearrange\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom opt_einsum import c"
  },
  {
    "path": "models/core/utils/config.py",
    "chars": 33854,
    "preview": "import dataclasses\nimport inspect\nimport itertools\nimport sys\nimport warnings\nfrom collections import Counter, defaultdi"
  },
  {
    "path": "models/core/utils/utils.py",
    "chars": 1597,
    "preview": "import torch\nimport torch.nn.functional as F\nimport numpy as np\nfrom scipy import interpolate\n\n\ndef interp(tensor, size)"
  },
  {
    "path": "models/raft_model.py",
    "chars": 2592,
    "preview": "from types import SimpleNamespace\nfrom typing import ClassVar\nimport torch.nn.functional as F\n\nfrom pytorch3d.implicitro"
  },
  {
    "path": "models/stereoanyvideo_model.py",
    "chars": 1124,
    "preview": "from typing import ClassVar\n\nimport torch\nimport torch.nn.functional as F\nfrom pytorch3d.implicitron.tools.config import"
  },
  {
    "path": "requirements.txt",
    "chars": 123,
    "preview": "hydra-core==1.1\nnumpy==1.23.5\nmunch==2.5.0\nomegaconf==2.1.0\nflow_vis==0.1\neinops==0.4.1\nopt_einsum==3.3.0\nrequests\nmovie"
  },
  {
    "path": "third_party/RAFT/LICENSE",
    "chars": 1512,
    "preview": "BSD 3-Clause License\n\nCopyright (c) 2020, princeton-vl\nAll rights reserved.\n\nRedistribution and use in source and binary"
  },
  {
    "path": "third_party/RAFT/README.md",
    "chars": 2725,
    "preview": "# RAFT\nThis repository contains the source code for our paper:\n\n[RAFT: Recurrent All Pairs Field Transforms for Optical "
  },
  {
    "path": "third_party/RAFT/alt_cuda_corr/correlation.cpp",
    "chars": 1368,
    "preview": "#include <torch/extension.h>\n#include <vector>\n\n// CUDA forward declarations\nstd::vector<torch::Tensor> corr_cuda_forwar"
  },
  {
    "path": "third_party/RAFT/alt_cuda_corr/correlation_kernel.cu",
    "chars": 10249,
    "preview": "#include <torch/extension.h>\n#include <cuda.h>\n#include <cuda_runtime.h>\n#include <vector>\n\n\n#define BLOCK_H 4\n#define B"
  },
  {
    "path": "third_party/RAFT/alt_cuda_corr/setup.py",
    "chars": 381,
    "preview": "from setuptools import setup\nfrom torch.utils.cpp_extension import BuildExtension, CUDAExtension\n\n\nsetup(\n    name='corr"
  },
  {
    "path": "third_party/RAFT/chairs_split.txt",
    "chars": 45743,
    "preview": "1\n1\n1\n1\n1\n2\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n2\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n2\n1\n1\n2\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n1\n2\n1\n"
  },
  {
    "path": "third_party/RAFT/core/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "third_party/RAFT/core/corr.py",
    "chars": 3086,
    "preview": "import torch\nimport torch.nn.functional as F\nfrom .utils.utils import bilinear_sampler, coords_grid\n\ntry:\n    import alt"
  },
  {
    "path": "third_party/RAFT/core/datasets.py",
    "chars": 9245,
    "preview": "# Data loading based on https://github.com/NVIDIA/flownet2-pytorch\n\nimport numpy as np\nimport torch\nimport torch.utils.d"
  },
  {
    "path": "third_party/RAFT/core/extractor.py",
    "chars": 8847,
    "preview": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n\nclass ResidualBlock(nn.Module):\n    def __init__(se"
  },
  {
    "path": "third_party/RAFT/core/raft.py",
    "chars": 4924,
    "preview": "import numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom .update import BasicUpdateBl"
  },
  {
    "path": "third_party/RAFT/core/update.py",
    "chars": 5227,
    "preview": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n\nclass FlowHead(nn.Module):\n    def __init__(self, i"
  },
  {
    "path": "third_party/RAFT/core/utils/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "third_party/RAFT/core/utils/augmentor.py",
    "chars": 9108,
    "preview": "import numpy as np\nimport random\nimport math\nfrom PIL import Image\n\nimport cv2\ncv2.setNumThreads(0)\ncv2.ocl.setUseOpenCL"
  },
  {
    "path": "third_party/RAFT/core/utils/flow_viz.py",
    "chars": 4318,
    "preview": "# Flow visualization code used from https://github.com/tomrunia/OpticalFlow_Visualization\n\n\n# MIT License\n#\n# Copyright "
  },
  {
    "path": "third_party/RAFT/core/utils/frame_utils.py",
    "chars": 4024,
    "preview": "import numpy as np\nfrom PIL import Image\nfrom os.path import *\nimport re\n\nimport cv2\ncv2.setNumThreads(0)\ncv2.ocl.setUse"
  },
  {
    "path": "third_party/RAFT/core/utils/utils.py",
    "chars": 2489,
    "preview": "import torch\nimport torch.nn.functional as F\nimport numpy as np\nfrom scipy import interpolate\n\n\nclass InputPadder:\n    \""
  },
  {
    "path": "third_party/RAFT/demo.py",
    "chars": 2073,
    "preview": "import sys\nsys.path.append('core')\n\nimport argparse\nimport os\nimport cv2\nimport glob\nimport numpy as np\nimport torch\nfro"
  },
  {
    "path": "third_party/RAFT/download_models.sh",
    "chars": 97,
    "preview": "#!/bin/bash\nwget https://dl.dropboxusercontent.com/s/4j4z58wuv8o0mfz/models.zip\nunzip models.zip\n"
  },
  {
    "path": "third_party/RAFT/evaluate.py",
    "chars": 6618,
    "preview": "import sys\nsys.path.append('core')\n\nfrom PIL import Image\nimport argparse\nimport os\nimport time\nimport numpy as np\nimpor"
  },
  {
    "path": "third_party/RAFT/train.py",
    "chars": 7987,
    "preview": "from __future__ import print_function, division\nimport sys\nsys.path.append('core')\n\nimport argparse\nimport os\nimport cv2"
  },
  {
    "path": "third_party/RAFT/train_mixed.sh",
    "chars": 921,
    "preview": "#!/bin/bash\nmkdir -p checkpoints\npython -u train.py --name raft-chairs --stage chairs --validation chairs --gpus 0 --num"
  },
  {
    "path": "third_party/RAFT/train_standard.sh",
    "chars": 860,
    "preview": "#!/bin/bash\nmkdir -p checkpoints\npython -u train.py --name raft-chairs --stage chairs --validation chairs --gpus 0 1 --n"
  },
  {
    "path": "train_stereoanyvideo.py",
    "chars": 11369,
    "preview": "import argparse\nimport logging\nfrom pathlib import Path\nfrom tqdm import tqdm\nimport os\nimport cv2\nimport numpy as np\nim"
  },
  {
    "path": "train_stereoanyvideo.sh",
    "chars": 371,
    "preview": "#!/bin/bash\n\nexport PYTHONPATH=`(cd ../ && pwd)`:`pwd`:$PYTHONPATH\n\npython train_stereoanyvideo.py --batch_size 1 \\\n --s"
  },
  {
    "path": "train_utils/logger.py",
    "chars": 2133,
    "preview": "import logging\nimport os\n\nfrom torch.utils.tensorboard import SummaryWriter\n\n\nclass Logger:\n\n    SUM_FREQ = 100\n\n    def"
  },
  {
    "path": "train_utils/losses.py",
    "chars": 8395,
    "preview": "import torch\nfrom einops import rearrange\nimport torch.nn.functional as F\n\n\ndef sequence_loss(flow_preds, flow_gt, valid"
  },
  {
    "path": "train_utils/utils.py",
    "chars": 4586,
    "preview": "import numpy as np\nimport os\nimport torch\n\nimport json\nimport flow_vis\nimport matplotlib.pyplot as plt\n\nimport stereoany"
  }
]

About this extraction

This page contains the full source code of the TomTomTommi/stereoanyvideo GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 83 files (513.1 KB), approximately 168.4k tokens, and a symbol index with 534 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.

Extract another repo