Full Code of dyson-ai/hdp for AI

main 383d5323b881 cached

116 files

3.4 MB

905.6k tokens

602 symbols

1 requests

Download .txt

Showing preview only (3,621K chars total). Download the full file or copy to clipboard to get everything.

Repository: dyson-ai/hdp
Branch: main
Commit: 383d5323b881
Files: 116
Total size: 3.4 MB

Directory structure:
gitextract_oibojri5/

├── .gitignore
├── LICENSE
├── README.md
├── cfgs/
│   ├── diffuser_config.yaml
│   ├── env/
│   │   ├── rlbench/
│   │   │   ├── open_box.yaml
│   │   │   ├── open_door.yaml
│   │   │   ├── open_drawer.yaml
│   │   │   ├── open_grill.yaml
│   │   │   ├── open_microwave.yaml
│   │   │   ├── open_oven.yaml
│   │   │   ├── pick_and_lift.yaml
│   │   │   ├── pick_up_cup.yaml
│   │   │   ├── put_knife_on_chopping_board.yaml
│   │   │   ├── reach_target.yaml
│   │   │   ├── take_lid_off_saucepan.yaml
│   │   │   └── toilet_seat_up.yaml
│   │   ├── rlbench.yaml
│   │   └── sim.yaml
│   ├── eval.yaml
│   └── method/
│       └── rk_diffuser.yaml
├── eval.py
├── extra_scripts/
│   ├── dataset_generator.py
│   └── install_coppeliasim.sh
├── helpers/
│   ├── __init__.py
│   ├── clip/
│   │   ├── __init__.py
│   │   └── core/
│   │       ├── __init__.py
│   │       ├── attention.py
│   │       ├── attention_image_goal.py
│   │       ├── clip.py
│   │       ├── fusion.py
│   │       ├── resnet.py
│   │       ├── simple_tokenizer.py
│   │       ├── transport.py
│   │       ├── transport_image_goal.py
│   │       └── unet.py
│   ├── custom_rlbench_env.py
│   ├── demo_loading_utils.py
│   ├── network_utils.py
│   ├── optim/
│   │   ├── __init__.py
│   │   └── lamb.py
│   ├── preprocess_agent.py
│   └── utils.py
├── panda_urdf/
│   ├── meshes/
│   │   ├── Pandagripper_coll_1.dae
│   │   ├── Pandagrippervisual_vis_1.dae
│   │   ├── Pandagrippervisual_vis_2.dae
│   │   ├── Pandagrippervisual_vis_3.dae
│   │   ├── Pandagrippervisual_vis_4.dae
│   │   ├── Pandagrippervisual_vis_5.dae
│   │   ├── Pandaleftfingervisible_vis_1.dae
│   │   ├── Pandaleftfingervisible_vis_2.dae
│   │   ├── Pandalink0visual_vis_1.dae
│   │   ├── Pandalink1respondable_coll_1.dae
│   │   ├── Pandalink1visual_vis_1.dae
│   │   ├── Pandalink2respondable_coll_1.dae
│   │   ├── Pandalink2visual_vis_1.dae
│   │   ├── Pandalink3respondable_coll_1.dae
│   │   ├── Pandalink3visual_vis_1.dae
│   │   ├── Pandalink3visual_vis_2.dae
│   │   ├── Pandalink4respondable_coll_1.dae
│   │   ├── Pandalink4visual_vis_1.dae
│   │   ├── Pandalink4visual_vis_2.dae
│   │   ├── Pandalink5respondable_coll_1.dae
│   │   ├── Pandalink5respondable_coll_2.dae
│   │   ├── Pandalink5respondable_coll_3.dae
│   │   ├── Pandalink5respondable_coll_4.dae
│   │   ├── Pandalink5respondable_coll_5.dae
│   │   ├── Pandalink5respondable_coll_6.dae
│   │   ├── Pandalink5respondable_coll_7.dae
│   │   ├── Pandalink5visual_vis_1.dae
│   │   ├── Pandalink5visual_vis_2.dae
│   │   ├── Pandalink6respondable_coll_1.dae
│   │   ├── Pandalink6visual_vis_1.dae
│   │   ├── Pandalink6visual_vis_2.dae
│   │   ├── Pandalink6visual_vis_3.dae
│   │   ├── Pandalink6visual_vis_4.dae
│   │   ├── Pandalink6visual_vis_5.dae
│   │   ├── Pandalink6visual_vis_6.dae
│   │   ├── Pandalink7respondable_coll_1.dae
│   │   ├── Pandalink7visual_vis_1.dae
│   │   ├── Pandalink7visual_vis_2.dae
│   │   ├── Pandalink7visual_vis_3.dae
│   │   ├── Pandalink7visual_vis_4.dae
│   │   ├── Pandalink7visual_vis_5.dae
│   │   ├── Pandarightfingervisual_vis_1.dae
│   │   ├── Pandarightfingervisual_vis_2.dae
│   │   └── robot_base_coll_1.dae
│   └── panda.urdf
├── peract/
│   ├── ARM_LICENSE
│   ├── LICENSE
│   ├── agents/
│   │   ├── __init__.py
│   │   └── peract_diffusion/
│   │       ├── __init__.py
│   │       ├── launch_utils.py
│   │       ├── perceiver_lang_io.py
│   │       ├── qattention_peract_bc_agent.py
│   │       └── qattention_stack_agent.py
│   └── voxel/
│       ├── __init__.py
│       ├── augmentation.py
│       └── voxel_grid.py
├── requirements.txt
├── rk_diffuser/
│   ├── __init__.py
│   ├── dataset/
│   │   ├── __init__.py
│   │   ├── realworld_dataset.py
│   │   ├── rl_bench_dataset.py
│   │   └── rl_bench_env.py
│   ├── models/
│   │   ├── __init__.py
│   │   ├── diffusion.py
│   │   ├── helpers.py
│   │   ├── multi_level_diffusion.py
│   │   ├── pointnet.py
│   │   ├── resnet.py
│   │   └── temporal.py
│   ├── robot.py
│   ├── trainer.py
│   └── utils.py
├── setup.py
└── train_low_level.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
__pycache__
wandb/
outputs/
*.pt
logs
saved_models/
*.json
exp_local/
eval_video/
*.html
*egg-info/

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2024 Dyson AI

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
# Hierarchical Diffusion Policy

This repo contains the PyTorch implementation of the CVPR 2024 paper

Hierarchical Diffusion Policy for Multi-Task Robotic Manipulation

[Xiao Ma](https://yusufma03.github.io/), [Sumit Patidar](https://rocketsumit.github.io/), [Iain Haughton](https://www.linkedin.com/in/iain-haughton-194321135/?originalSubdomain=uk), [Stephen James](https://stepjam.github.io/)

CVPR 2024

Dyson Robot Learning Lab

![teaser](images/sim.gif)

HDP factorises a manipulation policy into a hierarchical structure: a high-level task-planning agent which predicts a distant next-best end-effector pose (NBP), and a low-level goal-conditioned diffusion policy which generates optimal motion trajectories. The factorised policy representation allows HDP to tackle both long-horizon task planning while generating fine-grained low-level actions. To generate context-aware motion trajectories while satisfying robot kinematics constraints, we present a novel kinematics-aware goal-conditioned control agent, Robot Kinematics Diffuser (RK-Diffuser). Specifically, RK-Diffuser learns to generate both the end-effector pose and joint position trajectories, and distill the accurate but kinematics-unaware end-effector pose diffuser to the kinematics-aware but less accurate joint position diffuser via differentiable kinematics.

In this repository, we provide the code for training the low-level RK-Diffuser. We use PerAct as our high-level agent and we refer to its [official implementation](https://github.com/peract/peract) for detailed training configurations. We also include the evaluation code for the HDP architecture.

For more details, see our [project page](https://yusufma03.github.io/projects/hdp/).

## Installation

```bash
conda create -n hdp python=3.10
conda activate hdp
bash ./extra_scripts/install_coppeliasim.sh
conda install pytorch torchvision pytorch-cuda=11.8 -c pytorch -c nvidia
pip install cffi==1.15
pip install -r requirements.txt
python setup.py develop
```

## Running in headless mode
Please refer to the official guide of [RLBench](https://github.com/stepjam/RLBench?tab=readme-ov-file#running-headless).

## Generate the dataset
First, we need to generate the training dataset.
```bash
python extra_scripts/dataset_generator.py --save_path=<your dataset path> --tasks=<your task> --variations=1 --processes=1 --episodes_per_task=100
```
For example, to generate a dataset for the `reach_target` task,
```bash
python extra_scripts/dataset_generator.py --save_path=/data/${USER}/rlbench --tasks=reach_target --variations=1 --processes=1 --episodes_per_task=100
```
The script will generate both `train` and `eval` datasets at the same time.

## Training the low-level RK-Diffuser
To train the low-level sim agent, simply do
```bash
python3 train_low_level.py env=sim env.data_path=<your dataset path> env.tasks="[<task1>, <task2>, ...]"
```

For example, to train a model for the `reach_target` and the `take_lid_off_saucepan` tasks, run
```bash
python3 train_low_level.py env=sim env.data_path=/data/${USER}/rlbench env.tasks="[reach_target, take_lid_off_saucepan]"
```

You can enable online logging or set wandb run name by adding the following args
```bash
log=True run_name=<your run name>
```

## Evaluate the models
We also provide the full evaluation pipeline to HDP. To run the evaluation, simply do
```bash
python eval.py rlbench.tasks="[<your task>]" rlbench.headless=False method.model_path=<path to rk-diffuser ckpt> framework.logdir=<path to peract ckpt dir>
```

## Citation

```bibtex
@article{ma2024hierarchical,
  author    = {Ma, Xiao and Patidar, Sumit and Haughton, Iain and James, Stephen},
  title     = {Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation},
  journal   = {CVPR},
  year      = {2024},
}
```

## Credits

This repository is adapted from [PerAct](https://github.com/peract/peract) and [decision diffusers](https://github.com/anuragajay/decision-diffuser/tree/main/code).

================================================
FILE: cfgs/diffuser_config.yaml
================================================
defaults:
  - _self_
  - method: rk_diffuser
  - env: sim

frame_skips: 1
condition_dropout: 0.25
condition_guidance_w: 1.2

clip_denoised: true
discount: 0.99
hidden_dim: 256

## training
n_epochs: 100000
loss_type: l2
n_train_steps: 1e6
batch_size: 32
learning_rate: 5e-5
gradient_accumulate_every: 2
ema_decay: 0.995
log: false
log_freq: 1000
save_freq: 10000
log_dir: 'logs/'
project_name: HDP
run_name: ''

load_model_path: ''
online_eval: false
headless: true 

diffusion_var: multi
action_mode: diff
online_eval_start: 20000

diff_optim: true
diff_optim_steps: 200
diff_lr: 10
eval_only: false

pose_augment: false
use_cached: true
ds_img_size: 128


================================================
FILE: cfgs/env/rlbench/open_box.yaml
================================================
# @package _global_

defaults:
  - rlbench/easy
  - rlbench
  - _self_

env:
  task_name: open_box
  episode_length: 3


================================================
FILE: cfgs/env/rlbench/open_door.yaml
================================================
# @package _global_

defaults:
  - rlbench/easy
  - rlbench
  - _self_

env:
  task_name: open_door
  episode_length: 5


================================================
FILE: cfgs/env/rlbench/open_drawer.yaml
================================================
# @package _global_

defaults:
  - rlbench/easy
  - rlbench
  - _self_

env:
  task_name: open_drawer
  episode_length: 5


================================================
FILE: cfgs/env/rlbench/open_grill.yaml
================================================
# @package _global_

defaults:
  - rlbench/easy
  - rlbench
  - _self_

env:
  task_name: open_grill
  episode_length: 3


================================================
FILE: cfgs/env/rlbench/open_microwave.yaml
================================================
# @package _global_

defaults:
  - rlbench/easy
  - rlbench
  - _self_

env:
  task_name: open_microwave
  episode_length: 5


================================================
FILE: cfgs/env/rlbench/open_oven.yaml
================================================
# @package _global_

defaults:
  - rlbench/easy
  - rlbench
  - _self_

env:
  task_name: open_oven
  episode_length: 5


================================================
FILE: cfgs/env/rlbench/pick_and_lift.yaml
================================================
# @package _global_

defaults:
  - rlbench/easy
  - rlbench
  - _self_

env:
  task_name: pick_and_lift
  episode_length: 5


================================================
FILE: cfgs/env/rlbench/pick_up_cup.yaml
================================================
# @package _global_

defaults:
  - rlbench/easy
  - rlbench
  - _self_

env:
  task_name: pick_up_cup
  episode_length: 5


================================================
FILE: cfgs/env/rlbench/put_knife_on_chopping_board.yaml
================================================
# @package _global_

defaults:
  - rlbench/easy
  - rlbench
  - _self_

env:
  task_name: put_knife_on_chopping_board
  episode_length: 3


================================================
FILE: cfgs/env/rlbench/reach_target.yaml
================================================
# @package _global_

defaults:
  - rlbench/easy
  - rlbench
  - _self_

env:
  task_name: reach_target
  episode_length: 2


================================================
FILE: cfgs/env/rlbench/take_lid_off_saucepan.yaml
================================================
# @package _global_

defaults:
  - rlbench/easy
  - rlbench
  - _self_

env:
  task_name: take_lid_off_saucepan
  episode_length: 3


================================================
FILE: cfgs/env/rlbench/toilet_seat_up.yaml
================================================
# @package _global_

defaults:
  - rlbench/easy
  - rlbench
  - _self_

env:
  task_name: toilet_seat_up
  episode_length: 5


================================================
FILE: cfgs/env/rlbench.yaml
================================================
# @package _global_

env:
  env_name: rlbench
  episode_length: 100
  dataset_root: ''
  demos: 10
  visual_observation_shape: [128, 128]
  action_mode: JOINT_POSITION_COLLISION
  cameras: ['wrist', "front", "left_shoulder", "right_shoulder"]
  use_standardization: false
  renderer: opengl


================================================
FILE: cfgs/env/sim.yaml
================================================
name: sim
tasks: [reach_target, take_lid_off_saucepan, pick_up_cup, toilet_seat_up, open_box, open_door, open_drawer, open_grill, open_microwave, put_knife_on_chopping_board, open_oven]
tasks_ratio: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
scene_bounds: [[-0.5, -0.6, 0.75], [0.5, 0.55, 1.48]]
robot_offset: [0.30895137786865234, 0.0, -0.8200176358222961]
robot_rot: [[0.0, -0.15618760883808136, 0.9877274036407471], [0.0, -0.9877274036407471, -0.15618760883808136], [1.0, 0.0, 0.0]]
cameras: [front]
num_episodes: 100
num_episodes_eval: 1
data_path: ""
demo_aug_ratio: 0
demo_aug_min_len: 5


================================================
FILE: cfgs/eval.yaml
================================================
defaults:
  - _self_
  - env: sim

method:
    name: HDP
    model_path: ''

rlbench:
    task_name: "multi_18T"
    tasks: [open_drawer]
    demo_path: /home/maxiao/repos/dataset/peract/eval
    episode_length: 4
    cameras: [front,left_shoulder,right_shoulder,wrist]
    camera_resolution: [128, 128]
    scene_bounds: [-0.3, -0.5, 0.6, 0.7, 0.5, 1.6]
    include_lang_goal_in_obs: True
    time_in_state: True
    headless: True

framework:
    tensorboard_logging: True
    csv_logging: True
    gpu: 0
    logdir: /home/maxiao/repos/peract/logs/open_drawer/
    start_seed: 0
    record_every_n: 5

    eval_envs: 1
    eval_from_eps_number: 0
    eval_episodes: 10
    eval_type: 'last' # or 'best'
    eval_save_metrics: True

cinematic_recorder:
    enabled: True
    camera_resolution: [1280, 720]
    fps: 30
    rotate_speed: 0.005
    save_path: '/tmp/videos/'


================================================
FILE: cfgs/method/rk_diffuser.yaml
================================================
_target_: rk_diffuser.models.diffusion.GaussianDynDiffusion

horizon: 64
observation_dim: 7
dim_mults: [1, 2, 4]
action_dim: 7
scene_bounds: ${env.scene_bounds}
joint_limits: [[-2.89, -1.76, -2.89, -3.07, -2.89, -0.017, -2.89], [2.89, 1.76, 2.89, -0.069, 2.89, 3.75, 2.89]]
n_timesteps: 100
loss_type: l2
clip_denoised: true
predict_epsilon: false
hidden_dim: 256
loss_discount: 1
condition_guidance_w: 1.2
reverse_train: true
conditions: [start, end, proprios, rank, pcds] # pcds
hard_conditions: [proprios, pcds, start, end]
noise_init_method: normal
loss_fn: state_l2
coverage_weight: 1.0
detach_reverse: false
joint_weight: 1.0
robot_offset: ${env.robot_offset}
trans_loss_scale: 100.0
rot_loss_scale: 1.0
joint_pred_pose_loss: true
joint_loss_scale: 100.0
rank_bins: 10

backbone: unet
num_encoder_layers: 4
num_decoder_layers: 4
n_head: 8
causal_attn: true

depth_proc: pointnet
rgb_encoder: resnet50


================================================
FILE: eval.py
================================================
# code adapted from PerAct: https://github.com/peract/peract

import gc
import logging
import os
import sys

import yaml


import hydra
import numpy as np
import pandas as pd
import torch
import torch.multiprocessing
from omegaconf import DictConfig, ListConfig, OmegaConf
from rlbench.action_modes.action_mode import MoveArmThenGripper
from rlbench.action_modes.gripper_action_modes import Discrete
from rlbench.backend import task as rlbench_task
from rlbench.backend.utils import task_file_to_task_class
from torch.multiprocessing import Manager
from yarr.runners.independent_env_runner import IndependentEnvRunner
from yarr.utils.rollout_generator import RolloutGenerator
from yarr.utils.stat_accumulator import SimpleAccumulator

from peract.agents import peract_diffusion
from helpers import utils


from pyrep.robots.configuration_paths.arm_configuration_path import ArmConfigurationPath
from rlbench.action_modes.arm_action_modes import ArmActionMode, assert_action_shape
from rlbench.backend.exceptions import InvalidActionError
from rlbench.backend.scene import Scene

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
torch.multiprocessing.set_sharing_strategy("file_system")

JOINT_THRESH = 0.05


class JointsTrajectoryActionMode(ArmActionMode):
    """A sequence of joint configurations representing a trajectory."""

    def __init__(self, points: int):
        self._points = points

    def action(self, scene: Scene, action: np.ndarray, ignore_collisions: bool = True):
        assert_action_shape(action, (7 * self._points,))
        if np.all(action == 0):
            raise InvalidActionError("No valid trajectory given.")
        action = self._pre_proc_traj(action)
        path = ArmConfigurationPath(scene.robot.arm, action)
        done = False
        path.visualize()
        while not done:
            done = path.step()
            scene.step()
            if self._callable_each_step is not None:
                # Record observations
                self._callable_each_step(scene.get_observation())
            success, terminate = scene.task.success()
            # If the task succeeds while traversing path, then break early
            if success:
                break

        path.clear_visualization()

    def _pre_proc_traj(self, action):
        action = action.reshape(-1, 7)
        new_actions = [action[0]]

        for idx in range(1, len(action) - 2):
            diff = new_actions[-1] - action[idx]
            if np.abs(diff).max() > JOINT_THRESH:
                new_actions.append(action[idx])

        new_actions.append(action[-1])
        return np.stack(new_actions, axis=0).reshape(-1)

    def action_shape(self, scene: Scene) -> tuple:
        return (7 * self._points,)

    def set_callable_each_step(self, callable_each_step):
        self._callable_each_step = callable_each_step

    def record_end(self, scene, steps=60, step_scene=True):
        if self._callable_each_step is not None:
            for _ in range(steps):
                scene.step()
                self._callable_each_step(scene.get_observation())


def eval_seed(
    train_cfg, eval_cfg, logdir, cams, env_device, multi_task, seed, env_config
) -> None:
    tasks = eval_cfg.rlbench.tasks
    rg = RolloutGenerator()

    train_cfg.method.update(eval_cfg.method)

    if train_cfg.method.name == "HDP":
        method_path = "cfgs/method/rk_diffuser.yaml"
        method_path = os.path.join(os.getcwd().split("output")[0], method_path)
        with open(method_path, "r") as f:
            method_cfg = yaml.safe_load(f)

        method_cfg.update(
            {
                "scene_bounds": eval_cfg.env.scene_bounds,
                "robot_offset": eval_cfg.env.robot_offset,
            }
        )
        train_cfg.method.update(method_cfg)
        agent = peract_diffusion.launch_utils.create_agent(train_cfg)

    else:
        raise ValueError("Method %s does not exists." % train_cfg.method.name)

    stat_accum = SimpleAccumulator(eval_video_fps=30)

    weightsdir = os.path.join(logdir, "weights")

    env_runner = IndependentEnvRunner(
        train_env=None,
        agent=agent,
        train_replay_buffer=None,
        num_train_envs=0,
        num_eval_envs=eval_cfg.framework.eval_envs,
        rollout_episodes=99999,
        eval_episodes=eval_cfg.framework.eval_episodes,
        training_iterations=train_cfg.framework.training_iterations,
        eval_from_eps_number=eval_cfg.framework.eval_from_eps_number,
        episode_length=eval_cfg.rlbench.episode_length,
        stat_accumulator=stat_accum,
        weightsdir=weightsdir,
        logdir=logdir,
        env_device=env_device,
        rollout_generator=rg,
        num_eval_runs=len(tasks),
        multi_task=multi_task,
    )

    manager = Manager()
    save_load_lock = manager.Lock()
    writer_lock = manager.Lock()

    # evaluate all checkpoints (0, 1000, ...) which don't have results, i.e. validation phase
    if eval_cfg.framework.eval_type == "missing":
        weight_folders = os.listdir(weightsdir)
        weight_folders = sorted(map(int, weight_folders))

        env_data_csv_file = os.path.join(logdir, "eval_data.csv")
        if os.path.exists(env_data_csv_file):
            env_dict = pd.read_csv(env_data_csv_file).to_dict()
            evaluated_weights = sorted(map(int, list(env_dict["step"].values())))
            weight_folders = [w for w in weight_folders if w not in evaluated_weights]

        print("Missing weights: ", weight_folders)

    # pick the best checkpoint from validation and evaluate, i.e. test phase
    elif eval_cfg.framework.eval_type == "best":
        env_data_csv_file = os.path.join(logdir, "eval_data.csv")
        if os.path.exists(env_data_csv_file):
            env_dict = pd.read_csv(env_data_csv_file).to_dict()
            existing_weights = list(
                map(int, sorted(os.listdir(os.path.join(logdir, "weights"))))
            )
            task_weights = {}
            for task in tasks:
                weights = list(env_dict["step"].values())

                if len(tasks) > 1:
                    task_score = list(env_dict["eval_envs/return/%s" % task].values())
                else:
                    task_score = list(env_dict["eval_envs/return"].values())

                avail_weights, avail_task_scores = [], []
                for step_idx, step in enumerate(weights):
                    if step in existing_weights:
                        avail_weights.append(step)
                        avail_task_scores.append(task_score[step_idx])

                assert len(avail_weights) == len(avail_task_scores)
                best_weight = avail_weights[
                    np.argwhere(avail_task_scores == np.amax(avail_task_scores))
                    .flatten()
                    .tolist()[-1]
                ]
                task_weights[task] = best_weight

            weight_folders = [task_weights]
            print("Best weights:", weight_folders)
        else:
            raise Exception("No existing eval_data.csv file found in %s" % logdir)

    # evaluate only the last checkpoint
    elif eval_cfg.framework.eval_type == "last":
        weight_folders = os.listdir(weightsdir)
        weight_folders = sorted(map(int, weight_folders))
        weight_folders = [weight_folders[-1]]
        print("Last weight:", weight_folders)

    # evaluate a specific checkpoint
    elif type(eval_cfg.framework.eval_type) == int:
        weight_folders = [int(eval_cfg.framework.eval_type)]
        print("Weight:", weight_folders)

    else:
        raise Exception("Unknown eval type")

    num_weights_to_eval = np.arange(len(weight_folders))
    if len(num_weights_to_eval) == 0:
        logging.info(
            "No weights to evaluate. Results are already available in eval_data.csv"
        )
        sys.exit(0)

    # evaluate several checkpoints in parallel
    # NOTE: in multi-task settings, each task is evaluated serially, which makes everything slow!
    split_n = utils.split_list(num_weights_to_eval, eval_cfg.framework.eval_envs)
    for split in split_n:
        processes = []
        for e_idx, weight_idx in enumerate(split):
            weight = weight_folders[weight_idx]
            env_runner.start(
                weight,
                save_load_lock,
                writer_lock,
                env_config,
                e_idx % torch.cuda.device_count(),
                eval_cfg.framework.eval_save_metrics,
                eval_cfg.cinematic_recorder,
            )

    del env_runner
    del agent
    gc.collect()
    torch.cuda.empty_cache()


@hydra.main(config_name="eval", config_path="cfgs")
def main(eval_cfg: DictConfig) -> None:
    logging.info("\n" + OmegaConf.to_yaml(eval_cfg))

    start_seed = eval_cfg.framework.start_seed
    logdir = eval_cfg.framework.logdir

    train_config_path = os.path.join(logdir, "config.yaml")
    if os.path.exists(train_config_path):
        with open(train_config_path, "r") as f:
            train_cfg = OmegaConf.load(f)
    else:
        raise Exception("Missing seed%d/config.yaml" % start_seed)

    env_device = utils.get_device(eval_cfg.framework.gpu)
    logging.info("Using env device %s." % str(env_device))

    gripper_mode = Discrete()
    arm_action_mode = JointsTrajectoryActionMode(64)
    action_mode = MoveArmThenGripper(arm_action_mode, gripper_mode)

    task_files = [
        t.replace(".py", "")
        for t in os.listdir(rlbench_task.TASKS_PATH)
        if t != "__init__.py" and t.endswith(".py")
    ]
    eval_cfg.rlbench.cameras = (
        eval_cfg.rlbench.cameras
        if isinstance(eval_cfg.rlbench.cameras, ListConfig)
        else [eval_cfg.rlbench.cameras]
    )
    obs_config = utils.create_obs_config(
        eval_cfg.rlbench.cameras,
        eval_cfg.rlbench.camera_resolution,
        train_cfg.method.name,
    )

    if eval_cfg.cinematic_recorder.enabled:
        obs_config.record_gripper_closing = True

    # single-task or multi-task
    if len(eval_cfg.rlbench.tasks) > 1:
        tasks = eval_cfg.rlbench.tasks
        multi_task = True

        task_classes = []
        for task in tasks:
            if task not in task_files:
                raise ValueError("Task %s not recognised!." % task)
            task_classes.append(task_file_to_task_class(task))

        env_config = (
            task_classes,
            obs_config,
            action_mode,
            eval_cfg.rlbench.demo_path,
            eval_cfg.rlbench.episode_length,
            eval_cfg.rlbench.headless,
            eval_cfg.framework.eval_episodes,
            train_cfg.rlbench.include_lang_goal_in_obs,
            eval_cfg.rlbench.time_in_state,
            eval_cfg.framework.record_every_n,
        )
    else:
        task = eval_cfg.rlbench.tasks[0]
        multi_task = False

        if task not in task_files:
            raise ValueError("Task %s not recognised!." % task)
        task_class = task_file_to_task_class(task)

        env_config = (
            task_class,
            obs_config,
            action_mode,
            eval_cfg.rlbench.demo_path,
            eval_cfg.rlbench.episode_length,
            eval_cfg.rlbench.headless,
            train_cfg.rlbench.include_lang_goal_in_obs,
            eval_cfg.rlbench.time_in_state,
            eval_cfg.framework.record_every_n,
        )

    logging.info("Evaluating seed %d." % start_seed)
    eval_seed(
        train_cfg,
        eval_cfg,
        logdir,
        eval_cfg.rlbench.cameras,
        env_device,
        multi_task,
        start_seed,
        env_config,
    )


if __name__ == "__main__":
    main()


================================================
FILE: extra_scripts/dataset_generator.py
================================================
from multiprocessing import Process, Manager

from pyrep.const import RenderMode

from rlbench import ObservationConfig
from rlbench.action_modes.action_mode import MoveArmThenGripper
from rlbench.action_modes.arm_action_modes import JointVelocity, JointPosition
from rlbench.action_modes.gripper_action_modes import Discrete
from rlbench.backend.utils import task_file_to_task_class
from rlbench.environment import Environment
import rlbench.backend.task as task

import os
import pickle
from PIL import Image
from rlbench.backend import utils
from rlbench.backend.const import *
import numpy as np

from absl import app
from absl import flags

FLAGS = flags.FLAGS

flags.DEFINE_string("save_path", "/tmp/rlbench_data/", "Where to save the demos.")
flags.DEFINE_list(
    "tasks", [], "The tasks to collect. If empty, all tasks are collected."
)
flags.DEFINE_list("image_size", [128, 128], "The size of the images tp save.")
flags.DEFINE_enum(
    "renderer",
    "opengl3",
    ["opengl", "opengl3"],
    "The renderer to use. opengl does not include shadows, " "but is faster.",
)
flags.DEFINE_integer(
    "processes", 1, "The number of parallel processes during collection."
)
flags.DEFINE_integer(
    "episodes_per_task", 10, "The number of episodes to collect per task."
)
flags.DEFINE_integer(
    "variations", -1, "Number of variations to collect per task. -1 for all."
)
flags.DEFINE_string(
    "joint_action_mode",
    "vel",
    'The joint action mode to use. Can be "vel" or "abs" or "delta".',
)


def check_and_make(dir):
    if not os.path.exists(dir):
        os.makedirs(dir)


def save_demo(demo, example_path):

    # Save image data first, and then None the image data, and pickle
    left_shoulder_rgb_path = os.path.join(example_path, LEFT_SHOULDER_RGB_FOLDER)
    left_shoulder_depth_path = os.path.join(example_path, LEFT_SHOULDER_DEPTH_FOLDER)
    left_shoulder_mask_path = os.path.join(example_path, LEFT_SHOULDER_MASK_FOLDER)
    right_shoulder_rgb_path = os.path.join(example_path, RIGHT_SHOULDER_RGB_FOLDER)
    right_shoulder_depth_path = os.path.join(example_path, RIGHT_SHOULDER_DEPTH_FOLDER)
    right_shoulder_mask_path = os.path.join(example_path, RIGHT_SHOULDER_MASK_FOLDER)
    overhead_rgb_path = os.path.join(example_path, OVERHEAD_RGB_FOLDER)
    overhead_depth_path = os.path.join(example_path, OVERHEAD_DEPTH_FOLDER)
    overhead_mask_path = os.path.join(example_path, OVERHEAD_MASK_FOLDER)
    wrist_rgb_path = os.path.join(example_path, WRIST_RGB_FOLDER)
    wrist_depth_path = os.path.join(example_path, WRIST_DEPTH_FOLDER)
    wrist_mask_path = os.path.join(example_path, WRIST_MASK_FOLDER)
    front_rgb_path = os.path.join(example_path, FRONT_RGB_FOLDER)
    front_depth_path = os.path.join(example_path, FRONT_DEPTH_FOLDER)
    front_mask_path = os.path.join(example_path, FRONT_MASK_FOLDER)

    check_and_make(left_shoulder_rgb_path)
    check_and_make(left_shoulder_depth_path)
    check_and_make(left_shoulder_mask_path)
    check_and_make(right_shoulder_rgb_path)
    check_and_make(right_shoulder_depth_path)
    check_and_make(right_shoulder_mask_path)
    check_and_make(overhead_rgb_path)
    check_and_make(overhead_depth_path)
    check_and_make(overhead_mask_path)
    check_and_make(wrist_rgb_path)
    check_and_make(wrist_depth_path)
    check_and_make(wrist_mask_path)
    check_and_make(front_rgb_path)
    check_and_make(front_depth_path)
    check_and_make(front_mask_path)

    for i, obs in enumerate(demo):
        left_shoulder_rgb = Image.fromarray(obs.left_shoulder_rgb)
        left_shoulder_depth = utils.float_array_to_rgb_image(
            obs.left_shoulder_depth, scale_factor=DEPTH_SCALE
        )
        left_shoulder_mask = Image.fromarray(
            (obs.left_shoulder_mask * 255).astype(np.uint8)
        )
        right_shoulder_rgb = Image.fromarray(obs.right_shoulder_rgb)
        right_shoulder_depth = utils.float_array_to_rgb_image(
            obs.right_shoulder_depth, scale_factor=DEPTH_SCALE
        )
        right_shoulder_mask = Image.fromarray(
            (obs.right_shoulder_mask * 255).astype(np.uint8)
        )
        overhead_rgb = Image.fromarray(obs.overhead_rgb)
        overhead_depth = utils.float_array_to_rgb_image(
            obs.overhead_depth, scale_factor=DEPTH_SCALE
        )
        overhead_mask = Image.fromarray((obs.overhead_mask * 255).astype(np.uint8))
        wrist_rgb = Image.fromarray(obs.wrist_rgb)
        wrist_depth = utils.float_array_to_rgb_image(
            obs.wrist_depth, scale_factor=DEPTH_SCALE
        )
        wrist_mask = Image.fromarray((obs.wrist_mask * 255).astype(np.uint8))
        front_rgb = Image.fromarray(obs.front_rgb)
        front_depth = utils.float_array_to_rgb_image(
            obs.front_depth, scale_factor=DEPTH_SCALE
        )
        front_mask = Image.fromarray((obs.front_mask * 255).astype(np.uint8))

        left_shoulder_rgb.save(os.path.join(left_shoulder_rgb_path, IMAGE_FORMAT % i))
        left_shoulder_depth.save(
            os.path.join(left_shoulder_depth_path, IMAGE_FORMAT % i)
        )
        left_shoulder_mask.save(os.path.join(left_shoulder_mask_path, IMAGE_FORMAT % i))
        right_shoulder_rgb.save(os.path.join(right_shoulder_rgb_path, IMAGE_FORMAT % i))
        right_shoulder_depth.save(
            os.path.join(right_shoulder_depth_path, IMAGE_FORMAT % i)
        )
        right_shoulder_mask.save(
            os.path.join(right_shoulder_mask_path, IMAGE_FORMAT % i)
        )
        overhead_rgb.save(os.path.join(overhead_rgb_path, IMAGE_FORMAT % i))
        overhead_depth.save(os.path.join(overhead_depth_path, IMAGE_FORMAT % i))
        overhead_mask.save(os.path.join(overhead_mask_path, IMAGE_FORMAT % i))
        wrist_rgb.save(os.path.join(wrist_rgb_path, IMAGE_FORMAT % i))
        wrist_depth.save(os.path.join(wrist_depth_path, IMAGE_FORMAT % i))
        wrist_mask.save(os.path.join(wrist_mask_path, IMAGE_FORMAT % i))
        front_rgb.save(os.path.join(front_rgb_path, IMAGE_FORMAT % i))
        front_depth.save(os.path.join(front_depth_path, IMAGE_FORMAT % i))
        front_mask.save(os.path.join(front_mask_path, IMAGE_FORMAT % i))

        # We save the images separately, so set these to None for pickling.
        obs.left_shoulder_rgb = None
        obs.left_shoulder_depth = None
        obs.left_shoulder_point_cloud = None
        obs.left_shoulder_mask = None
        obs.right_shoulder_rgb = None
        obs.right_shoulder_depth = None
        obs.right_shoulder_point_cloud = None
        obs.right_shoulder_mask = None
        obs.overhead_rgb = None
        obs.overhead_depth = None
        obs.overhead_point_cloud = None
        obs.overhead_mask = None
        obs.wrist_rgb = None
        obs.wrist_depth = None
        obs.wrist_point_cloud = None
        obs.wrist_mask = None
        obs.front_rgb = None
        obs.front_depth = None
        obs.front_point_cloud = None
        obs.front_mask = None

    # Save the low-dimension data
    with open(os.path.join(example_path, LOW_DIM_PICKLE), "wb") as f:
        pickle.dump(demo, f)


class CustomMoveArmThenGripper(MoveArmThenGripper):
    def action_bounds(self):
        return (
            np.array(7 * [-0.2] + [0.0], dtype=np.float32),
            np.array(7 * [0.2] + [1.0], dtype=np.float32),
        )


class AbsCustomMoveArmThenGripper(MoveArmThenGripper):
    ACT_MIN = [
        -2.8973000049591064,
        -1.7627999782562256,
        -2.8973000049591064,
        -3.0717999935150146,
        -2.8973000049591064,
        -0.017500000074505806,
        -2.8973000049591064,
        0.0,
    ]
    ACT_RANGE = [
        5.794600009918213,
        3.525599956512451,
        5.794600009918213,
        3.002000093460083,
        5.794600009918213,
        3.7699999809265137,
        5.794600009918213,
        1.0,
    ]

    def action_bounds(self):
        return (
            np.array(ACT_MIN, dtype=np.float32),
            np.array(ACT_MIN + ACT_RANGE, dtype=np.float32),
        )


def run(
    i,
    lock,
    task_index,
    variation_count,
    results,
    file_lock,
    tasks,
    joint_action_mode,
    tag,
):
    """Each thread will choose one task and variation, and then gather
    all the episodes_per_task for that variation."""

    # Initialise each thread with random seed
    np.random.seed(None)
    num_tasks = len(tasks)

    img_size = list(map(int, FLAGS.image_size))

    obs_config = ObservationConfig()
    obs_config.set_all(True)
    obs_config.right_shoulder_camera.image_size = img_size
    obs_config.left_shoulder_camera.image_size = img_size
    obs_config.overhead_camera.image_size = img_size
    obs_config.wrist_camera.image_size = img_size
    obs_config.front_camera.image_size = img_size

    # Store depth as 0 - 1
    obs_config.right_shoulder_camera.depth_in_meters = False
    obs_config.left_shoulder_camera.depth_in_meters = False
    obs_config.overhead_camera.depth_in_meters = False
    obs_config.wrist_camera.depth_in_meters = False
    obs_config.front_camera.depth_in_meters = False

    # We want to save the masks as rgb encodings.
    obs_config.left_shoulder_camera.masks_as_one_channel = False
    obs_config.right_shoulder_camera.masks_as_one_channel = False
    obs_config.overhead_camera.masks_as_one_channel = False
    obs_config.wrist_camera.masks_as_one_channel = False
    obs_config.front_camera.masks_as_one_channel = False

    if FLAGS.renderer == "opengl":
        obs_config.right_shoulder_camera.render_mode = RenderMode.OPENGL
        obs_config.left_shoulder_camera.render_mode = RenderMode.OPENGL
        obs_config.overhead_camera.render_mode = RenderMode.OPENGL
        obs_config.wrist_camera.render_mode = RenderMode.OPENGL
        obs_config.front_camera.render_mode = RenderMode.OPENGL

    if joint_action_mode == "delta":

        class CustomMoveArmThenGripper(MoveArmThenGripper):
            def action_bounds(self):
                return (
                    np.array(7 * [-0.2] + [0.0], dtype=np.float32),
                    np.array(7 * [0.2] + [1.0], dtype=np.float32),
                )

        action_mode = CustomMoveArmThenGripper(JointPosition(False), Discrete())

    elif joint_action_mode == "abs":

        class CustomMoveArmThenGripper(MoveArmThenGripper):
            def action_bounds(self):
                ACT_MIN = [
                    -2.8973000049591064,
                    -1.7627999782562256,
                    -2.8973000049591064,
                    -3.0717999935150146,
                    -2.8973000049591064,
                    -0.017500000074505806,
                    -2.8973000049591064,
                    0.0,
                ]
                ACT_RANGE = [
                    5.794600009918213,
                    3.525599956512451,
                    5.794600009918213,
                    3.002000093460083,
                    5.794600009918213,
                    3.7699999809265137,
                    5.794600009918213,
                    1.0,
                ]

                return (
                    np.array(ACT_MIN, dtype=np.float32),
                    np.array(ACT_MIN, dtype=np.float32)
                    + np.array(ACT_RANGE, dtype=np.float32),
                )

        action_mode = CustomMoveArmThenGripper(JointPosition(True), Discrete())

    else:
        action_mode = MoveArmThenGripper(JointVelocity(), Discrete())

    rlbench_env = Environment(
        action_mode=action_mode, obs_config=obs_config, headless=True
    )
    rlbench_env.launch()

    task_env = None

    tasks_with_problems = results[i] = ""

    while True:
        # Figure out what task/variation this thread is going to do
        with lock:

            if task_index.value >= num_tasks * 2:
                print("Process", i, "finished")
                break

            my_variation_count = variation_count.value
            t = tasks[task_index.value]
            task_env = rlbench_env.get_task(t)
            var_target = task_env.variation_count()
            if FLAGS.variations >= 0:
                var_target = np.minimum(FLAGS.variations, var_target)
            if my_variation_count >= var_target:
                # If we have reached the required number of variations for this
                # task, then move on to the next task.
                variation_count.value = my_variation_count = 0
                task_index.value += 1

            variation_count.value += 1
            if task_index.value >= num_tasks * 2:
                print("Process", i, "finished")
                break
            t = tasks[task_index.value // 2]

        task_env = rlbench_env.get_task(t)
        task_env.set_variation(my_variation_count)
        descriptions, _ = task_env.reset()

        variation_path = os.path.join(
            FLAGS.save_path,
            tag,
            task_env.get_name(),
            VARIATIONS_FOLDER % my_variation_count,
        )

        check_and_make(variation_path)

        with open(os.path.join(variation_path, VARIATION_DESCRIPTIONS), "wb") as f:
            pickle.dump(descriptions, f)

        episodes_path = os.path.join(variation_path, EPISODES_FOLDER)
        check_and_make(episodes_path)

        abort_variation = False
        num_existing_episodes = len(
            [e for e in os.listdir(episodes_path) if "episode" in e]
        )
        print(
            f"Found {num_existing_episodes} existing episodes for task {task_env.get_name()} | variation {my_variation_count}"
        )
        for ex_idx in range(num_existing_episodes, FLAGS.episodes_per_task):
            print(
                "Process",
                i,
                "// Task:",
                task_env.get_name(),
                "// Variation:",
                my_variation_count,
                "// Demo:",
                ex_idx,
            )
            attempts = 10
            while attempts > 0:
                try:
                    # TODO: for now we do the explicit looping.
                    (demo,) = task_env.get_demos(amount=1, live_demos=True)
                except Exception as e:
                    attempts -= 1
                    if attempts > 0:
                        continue
                    problem = (
                        "Process %d failed collecting task %s (variation: %d, "
                        "example: %d). Skipping this task/variation.\n%s\n"
                        % (i, task_env.get_name(), my_variation_count, ex_idx, str(e))
                    )
                    print(problem)
                    tasks_with_problems += problem
                    abort_variation = True
                    break
                episode_path = os.path.join(episodes_path, EPISODE_FOLDER % ex_idx)
                with file_lock:
                    save_demo(demo, episode_path)
                    print(
                        "\tTask: %s // Demo Length: %d"
                        % (task_env.get_name(), len(demo))
                    )
                break
            if abort_variation:
                break

    results[i] = tasks_with_problems
    rlbench_env.shutdown()


def main(argv):

    task_files = [
        t.replace(".py", "")
        for t in os.listdir(task.TASKS_PATH)
        if t != "__init__.py" and t.endswith(".py")
    ]

    if len(FLAGS.tasks) > 0:
        for t in FLAGS.tasks:
            if t not in task_files:
                raise ValueError("Task %s not recognised!." % t)
        task_files = FLAGS.tasks

    tasks = [task_file_to_task_class(t) for t in task_files]
    joint_action_mode = FLAGS.joint_action_mode
    print(f"Joint action mode: {joint_action_mode}")

    manager = Manager()

    result_dict = manager.dict()
    file_lock = manager.Lock()

    task_index = manager.Value("i", 0)
    variation_count = manager.Value("i", 0)
    lock = manager.Lock()

    check_and_make(FLAGS.save_path)

    processes = [
        Process(
            target=run,
            args=(
                i,
                lock,
                task_index,
                variation_count,
                result_dict,
                file_lock,
                tasks,
                joint_action_mode,
                tag,
            ),
        )
        for i in range(FLAGS.processes)
        for tag in ["train", "eval"]
    ]
    [t.start() for t in processes]
    [t.join() for t in processes]

    print("Data collection done!")
    for i in range(FLAGS.processes):
        print(result_dict[i])


if __name__ == "__main__":
    app.run(main)


================================================
FILE: extra_scripts/install_coppeliasim.sh
================================================
#!/bin/bash
# Install CoppeliaSim 4.1.0 for Ubuntu 20.04
# Refer to PyRep README for other versions
export COPPELIASIM_ROOT=${HOME}/.local/bin/CoppeliaSim
wget https://www.coppeliarobotics.com/files/V4_1_0/CoppeliaSim_Edu_V4_1_0_Ubuntu20_04.tar.xz
mkdir -p $COPPELIASIM_ROOT && tar -xf CoppeliaSim_Edu_V4_1_0_Ubuntu20_04.tar.xz -C $COPPELIASIM_ROOT --strip-components 1
rm CoppeliaSim_Edu_V4_1_0_Ubuntu20_04.tar.xz
## Add environment variables into bashrc (or zshrc)
echo "export COPPELIASIM_ROOT=$COPPELIASIM_ROOT
export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:\$COPPELIASIM_ROOT
export QT_QPA_PLATFORM_PLUGIN_PATH=\$COPPELIASIM_ROOT" >> ~/.bashrc
source ~/.bashrc


================================================
FILE: helpers/__init__.py
================================================


================================================
FILE: helpers/clip/__init__.py
================================================


================================================
FILE: helpers/clip/core/__init__.py
================================================


================================================
FILE: helpers/clip/core/attention.py
================================================
"""Attention module."""

import cliport.models as models
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from cliport.utils import utils


class Attention(nn.Module):
    """Attention (a.k.a Pick) module."""

    def __init__(self, stream_fcn, in_shape, n_rotations, preprocess, cfg, device):
        super().__init__()
        self.stream_fcn = stream_fcn
        self.n_rotations = n_rotations
        self.preprocess = preprocess
        self.cfg = cfg
        self.device = device
        self.batchnorm = self.cfg["train"]["batchnorm"]

        self.padding = np.zeros((3, 2), dtype=int)
        max_dim = np.max(in_shape[:2])
        pad = (max_dim - np.array(in_shape[:2])) / 2
        self.padding[:2] = pad.reshape(2, 1)

        in_shape = np.array(in_shape)
        in_shape += np.sum(self.padding, axis=1)
        in_shape = tuple(in_shape)
        self.in_shape = in_shape

        self.rotator = utils.ImageRotator(self.n_rotations)

        self._build_nets()

    def _build_nets(self):
        stream_one_fcn, _ = self.stream_fcn
        self.attn_stream = models.names[stream_one_fcn](
            self.in_shape, 1, self.cfg, self.device
        )
        print(f"Attn FCN: {stream_one_fcn}")

    def attend(self, x):
        return self.attn_stream(x)

    def forward(self, inp_img, softmax=True):
        """Forward pass."""
        in_data = np.pad(inp_img, self.padding, mode="constant")
        in_shape = (1,) + in_data.shape
        in_data = in_data.reshape(in_shape)
        in_tens = torch.from_numpy(in_data).to(
            dtype=torch.float, device=self.device
        )  # [B W H 6]

        # Rotation pivot.
        pv = np.array(in_data.shape[1:3]) // 2

        # Rotate input.
        in_tens = in_tens.permute(0, 3, 1, 2)  # [B 6 W H]
        in_tens = in_tens.repeat(self.n_rotations, 1, 1, 1)
        in_tens = self.rotator(in_tens, pivot=pv)

        # Forward pass.
        logits = []
        for x in in_tens:
            lgts = self.attend(x)
            logits.append(lgts)
        logits = torch.cat(logits, dim=0)

        # Rotate back output.
        logits = self.rotator(logits, reverse=True, pivot=pv)
        logits = torch.cat(logits, dim=0)
        c0 = self.padding[:2, 0]
        c1 = c0 + inp_img.shape[:2]
        logits = logits[:, :, c0[0] : c1[0], c0[1] : c1[1]]

        logits = logits.permute(1, 2, 3, 0)  # [B W H 1]
        output = logits.reshape(1, np.prod(logits.shape))
        if softmax:
            output = F.softmax(output, dim=-1)
            output = output.reshape(logits.shape[1:])
        return output


================================================
FILE: helpers/clip/core/attention_image_goal.py
================================================
"""Attention module."""

import numpy as np
import torch
import torch.nn.functional as F
from cliport.models.core.attention import Attention


class AttentionImageGoal(Attention):
    """Attention (a.k.a Pick) with image-goals module."""

    def __init__(self, stream_fcn, in_shape, n_rotations, preprocess, cfg, device):
        super().__init__(stream_fcn, in_shape, n_rotations, preprocess, cfg, device)

    def forward(self, inp_img, goal_img, softmax=True):
        """Forward pass."""
        # Input image.
        in_data = np.pad(inp_img, self.padding, mode="constant")
        in_shape = (1,) + in_data.shape
        in_data = in_data.reshape(in_shape)
        in_tens = torch.from_numpy(in_data).to(dtype=torch.float, device=self.device)

        goal_tensor = np.pad(goal_img, self.padding, mode="constant")
        goal_shape = (1,) + goal_tensor.shape
        goal_tensor = goal_tensor.reshape(goal_shape)
        goal_tensor = torch.from_numpy(goal_tensor.copy()).to(
            dtype=torch.float, device=self.device
        )
        in_tens = in_tens * goal_tensor

        # Rotation pivot.
        pv = np.array(in_data.shape[1:3]) // 2

        # Rotate input.
        in_tens = in_tens.permute(0, 3, 1, 2)
        in_tens = in_tens.repeat(self.n_rotations, 1, 1, 1)
        in_tens = self.rotator(in_tens, pivot=pv)

        # Forward pass.
        logits = []
        for x in in_tens:
            logits.append(self.attend(x))
        logits = torch.cat(logits, dim=0)

        # Rotate back output.
        logits = self.rotator(logits, reverse=True, pivot=pv)
        logits = torch.cat(logits, dim=0)
        c0 = self.padding[:2, 0]
        c1 = c0 + inp_img.shape[:2]
        logits = logits[:, :, c0[0] : c1[0], c0[1] : c1[1]]

        logits = logits.permute(1, 2, 3, 0)  # D H W C
        output = logits.reshape(1, np.prod(logits.shape))
        if softmax:
            output = F.softmax(output, dim=-1)
            output = output.reshape(logits.shape[1:])
        return output


================================================
FILE: helpers/clip/core/clip.py
================================================
###########################################
#### Authors: OpenAI
#### Credit: https://github.com/openai/CLIP
#### https://github.com/openai/CLIP/blob/main/LICENSE

import hashlib
import os
import ssl
import urllib
import warnings
from collections import OrderedDict
from typing import List, Tuple, Union

import torch
import torch.nn.functional as F
from PIL import Image
from torch import nn
from torchvision.transforms import CenterCrop, Compose, Normalize, Resize, ToTensor
from tqdm import tqdm

from .simple_tokenizer import SimpleTokenizer as _Tokenizer

ssl._create_default_https_context = ssl._create_unverified_context


__all__ = ["available_models", "load", "tokenize"]
_tokenizer = _Tokenizer()

_MODELS = {
    "RN50": "https://openaipublic.azureedge.net/clip/models/afeb0e10f9e5a86da6080e35cf09123aca3b358a0c3e3b6c78a7b63bc04b6762/RN50.pt",
    "RN101": "https://openaipublic.azureedge.net/clip/models/8fa8567bab74a42d41c5915025a8e4538c3bdbe8804a470a72f30b0d94fab599/RN101.pt",
    "RN50x4": "https://openaipublic.azureedge.net/clip/models/7e526bd135e493cef0776de27d5f42653e6b4c8bf9e0f653bb11773263205fdd/RN50x4.pt",
    "RN50x16": "https://openaipublic.azureedge.net/clip/models/52378b407f34354e150460fe41077663dd5b39c54cd0bfd2b27167a4a06ec9aa/RN50x16.pt",
    "RN50x64": "https://openaipublic.azureedge.net/clip/models/be1cfb55d75a9666199fb2206c106743da0f6468c9d327f3e0d0a543a9919d9c/RN50x64.pt",
    "ViT-B/32": "https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt",
    "ViT-B/16": "https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt",
    "ViT-L/14": "https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt",
}


class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1):
        super().__init__()

        # all conv layers have stride 1. an avgpool is performed after the second convolution when stride > 1
        self.conv1 = nn.Conv2d(inplanes, planes, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)

        self.conv2 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)

        self.avgpool = nn.AvgPool2d(stride) if stride > 1 else nn.Identity()

        self.conv3 = nn.Conv2d(planes, planes * self.expansion, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * self.expansion)

        self.relu = nn.ReLU(inplace=True)
        self.downsample = None
        self.stride = stride

        if stride > 1 or inplanes != planes * Bottleneck.expansion:
            # downsampling layer is prepended with an avgpool, and the subsequent convolution has stride 1
            self.downsample = nn.Sequential(
                OrderedDict(
                    [
                        ("-1", nn.AvgPool2d(stride)),
                        (
                            "0",
                            nn.Conv2d(
                                inplanes,
                                planes * self.expansion,
                                1,
                                stride=1,
                                bias=False,
                            ),
                        ),
                        ("1", nn.BatchNorm2d(planes * self.expansion)),
                    ]
                )
            )

    def forward(self, x: torch.Tensor):
        identity = x

        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.avgpool(out)
        out = self.bn3(self.conv3(out))

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)
        return out


class AttentionPool2d(nn.Module):
    def __init__(
        self, spacial_dim: int, embed_dim: int, num_heads: int, output_dim: int = None
    ):
        super().__init__()
        self.positional_embedding = nn.Parameter(
            torch.randn(spacial_dim**2 + 1, embed_dim) / embed_dim**0.5
        )
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.c_proj = nn.Linear(embed_dim, output_dim or embed_dim)
        self.num_heads = num_heads

    def forward(self, x):
        x = x.reshape(x.shape[0], x.shape[1], x.shape[2] * x.shape[3]).permute(
            2, 0, 1
        )  # NCHW -> (HW)NC
        x = torch.cat([x.mean(dim=0, keepdim=True), x], dim=0)  # (HW+1)NC
        x = x + self.positional_embedding[:, None, :].to(x.dtype)  # (HW+1)NC
        x, _ = F.multi_head_attention_forward(
            query=x,
            key=x,
            value=x,
            embed_dim_to_check=x.shape[-1],
            num_heads=self.num_heads,
            q_proj_weight=self.q_proj.weight,
            k_proj_weight=self.k_proj.weight,
            v_proj_weight=self.v_proj.weight,
            in_proj_weight=None,
            in_proj_bias=torch.cat(
                [self.q_proj.bias, self.k_proj.bias, self.v_proj.bias]
            ),
            bias_k=None,
            bias_v=None,
            add_zero_attn=False,
            dropout_p=0,
            out_proj_weight=self.c_proj.weight,
            out_proj_bias=self.c_proj.bias,
            use_separate_proj_weight=True,
            training=self.training,
            need_weights=False,
        )

        return x[0]


class ModifiedResNet(nn.Module):
    """
    A ResNet class that is similar to torchvision's but contains the following changes:
    - There are now 3 "stem" convolutions as opposed to 1, with an average pool instead of a max pool.
    - Performs anti-aliasing strided convolutions, where an avgpool is prepended to convolutions with stride > 1
    - The final pooling layer is a QKV attention instead of an average pool
    """

    def __init__(self, layers, output_dim, heads, input_resolution=224, width=64):
        super().__init__()
        self.output_dim = output_dim
        self.input_resolution = input_resolution

        # the 3-layer stem
        self.conv1 = nn.Conv2d(
            3, width // 2, kernel_size=3, stride=2, padding=1, bias=False
        )
        self.bn1 = nn.BatchNorm2d(width // 2)
        self.conv2 = nn.Conv2d(
            width // 2, width // 2, kernel_size=3, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(width // 2)
        self.conv3 = nn.Conv2d(width // 2, width, kernel_size=3, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(width)
        self.avgpool = nn.AvgPool2d(2)
        self.relu = nn.ReLU(inplace=True)

        # residual layers
        self._inplanes = width  # this is a *mutable* variable used during construction
        self.layer1 = self._make_layer(width, layers[0])
        self.layer2 = self._make_layer(width * 2, layers[1], stride=2)
        self.layer3 = self._make_layer(width * 4, layers[2], stride=2)
        self.layer4 = self._make_layer(width * 8, layers[3], stride=2)

        embed_dim = width * 32  # the ResNet feature dimension
        self.attnpool = AttentionPool2d(
            input_resolution // 32, embed_dim, heads, output_dim
        )

    def _make_layer(self, planes, blocks, stride=1):
        layers = [Bottleneck(self._inplanes, planes, stride)]

        self._inplanes = planes * Bottleneck.expansion
        for _ in range(1, blocks):
            layers.append(Bottleneck(self._inplanes, planes))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.prepool(x)
        x = self.attnpool(x)
        return x

    def prepool(self, x):
        def stem(x):
            for conv, bn in [
                (self.conv1, self.bn1),
                (self.conv2, self.bn2),
                (self.conv3, self.bn3),
            ]:
                x = self.relu(bn(conv(x)))
            x = self.avgpool(x)
            return x

        x = x.type(self.conv1.weight.dtype)
        x = stem(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        return x

    def prepool_im(self, x):
        """Run until prepool and save intermediate features"""
        im = []

        def stem(x):
            for conv, bn in [
                (self.conv1, self.bn1),
                (self.conv2, self.bn2),
                (self.conv3, self.bn3),
            ]:
                x = self.relu(bn(conv(x)))
                im.append(x)
            x = self.avgpool(x)
            im.append(x)
            return x

        x = x.type(self.conv1.weight.dtype)
        x = stem(x)

        for layer in [self.layer1, self.layer2, self.layer3, self.layer4]:
            x = layer(x)
            im.append(x)

        return x, im


class LayerNorm(nn.LayerNorm):
    """Subclass torch's LayerNorm to handle fp16."""

    def forward(self, x: torch.Tensor):
        orig_type = x.dtype
        ret = super().forward(x.type(torch.float32))
        return ret.type(orig_type)


class QuickGELU(nn.Module):
    def forward(self, x: torch.Tensor):
        return x * torch.sigmoid(1.702 * x)


class ResidualAttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_head: int, attn_mask: torch.Tensor = None):
        super().__init__()

        self.attn = nn.MultiheadAttention(d_model, n_head)
        self.ln_1 = LayerNorm(d_model)
        self.mlp = nn.Sequential(
            OrderedDict(
                [
                    ("c_fc", nn.Linear(d_model, d_model * 4)),
                    ("gelu", QuickGELU()),
                    ("c_proj", nn.Linear(d_model * 4, d_model)),
                ]
            )
        )
        self.ln_2 = LayerNorm(d_model)
        self.attn_mask = attn_mask

    def attention(self, x: torch.Tensor):
        self.attn_mask = (
            self.attn_mask.to(dtype=x.dtype, device=x.device)
            if self.attn_mask is not None
            else None
        )
        return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]

    def forward(self, x: torch.Tensor):
        x = x + self.attention(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x


class Transformer(nn.Module):
    def __init__(
        self, width: int, layers: int, heads: int, attn_mask: torch.Tensor = None
    ):
        super().__init__()
        self.width = width
        self.layers = layers
        self.resblocks = nn.Sequential(
            *[ResidualAttentionBlock(width, heads, attn_mask) for _ in range(layers)]
        )

    def forward(self, x: torch.Tensor):
        return self.resblocks(x)


class VisualTransformer(nn.Module):
    def __init__(
        self,
        input_resolution: int,
        patch_size: int,
        width: int,
        layers: int,
        heads: int,
        output_dim: int,
    ):
        super().__init__()
        self.input_resolution = input_resolution
        self.output_dim = output_dim
        self.conv1 = nn.Conv2d(
            in_channels=3,
            out_channels=width,
            kernel_size=patch_size,
            stride=patch_size,
            bias=False,
        )

        scale = width**-0.5
        self.class_embedding = nn.Parameter(scale * torch.randn(width))
        self.positional_embedding = nn.Parameter(
            scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width)
        )
        self.ln_pre = LayerNorm(width)

        self.transformer = Transformer(width, layers, heads)

        self.ln_post = LayerNorm(width)
        self.proj = nn.Parameter(scale * torch.randn(width, output_dim))

    def forward(self, x: torch.Tensor):
        x = self.conv1(x)  # shape = [*, width, grid, grid]
        x = x.reshape(x.shape[0], x.shape[1], -1)  # shape = [*, width, grid ** 2]
        x = x.permute(0, 2, 1)  # shape = [*, grid ** 2, width]
        x = torch.cat(
            [
                self.class_embedding.to(x.dtype)
                + torch.zeros(
                    x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device
                ),
                x,
            ],
            dim=1,
        )  # shape = [*, grid ** 2 + 1, width]
        x = x + self.positional_embedding.to(x.dtype)
        x = self.ln_pre(x)

        x = x.permute(1, 0, 2)  # NLD -> LND
        x = self.transformer(x)
        x = x.permute(1, 0, 2)  # LND -> NLD

        x = self.ln_post(x[:, 0, :])

        if self.proj is not None:
            x = x @ self.proj

        return x

    def forward_spatial(self, x: torch.Tensor):
        x = self.conv1(x)  # shape = [*, width, grid, grid]
        x = x.reshape(x.shape[0], x.shape[1], -1)  # shape = [*, width, grid ** 2]
        x = x.permute(0, 2, 1)  # shape = [*, grid ** 2, width]
        x = torch.cat(
            [
                self.class_embedding.to(x.dtype)
                + torch.zeros(
                    x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device
                ),
                x,
            ],
            dim=1,
        )  # shape = [*, grid ** 2 + 1, width]
        x = x + self.positional_embedding.to(x.dtype)
        x = self.ln_pre(x)

        x = x.permute(1, 0, 2)  # NLD -> LND
        x = self.transformer(x)
        x = x.permute(1, 0, 2)  # LND -> NLD

        x = self.ln_post(x)[:, 1:]
        return x


class CLIP(nn.Module):
    def __init__(
        self,
        embed_dim: int,
        # vision
        image_resolution: int,
        vision_layers: Union[Tuple[int, int, int, int], int],
        vision_width: int,
        vision_patch_size: int,
        # text
        context_length: int,
        vocab_size: int,
        transformer_width: int,
        transformer_heads: int,
        transformer_layers: int,
    ):
        super().__init__()

        self.context_length = context_length

        if isinstance(vision_layers, (tuple, list)):
            vision_heads = vision_width * 32 // 64
            self.visual = ModifiedResNet(
                layers=vision_layers,
                output_dim=embed_dim,
                heads=vision_heads,
                input_resolution=image_resolution,
                width=vision_width,
            )
        else:
            vision_heads = vision_width // 64
            self.visual = VisualTransformer(
                input_resolution=image_resolution,
                patch_size=vision_patch_size,
                width=vision_width,
                layers=vision_layers,
                heads=vision_heads,
                output_dim=embed_dim,
            )

        self.transformer = Transformer(
            width=transformer_width,
            layers=transformer_layers,
            heads=transformer_heads,
            attn_mask=self.build_attention_mask(),
        )

        self.vocab_size = vocab_size
        self.token_embedding = nn.Embedding(vocab_size, transformer_width)
        self.positional_embedding = nn.Parameter(
            torch.empty(self.context_length, transformer_width)
        )
        self.ln_final = LayerNorm(transformer_width)

        self.text_projection = nn.Parameter(torch.empty(transformer_width, embed_dim))
        self.logit_scale = nn.Parameter(torch.ones([]))

        self.initialize_parameters()

    def initialize_parameters(self):
        nn.init.normal_(self.token_embedding.weight, std=0.02)
        nn.init.normal_(self.positional_embedding, std=0.01)

        if isinstance(self.visual, ModifiedResNet):
            if self.visual.attnpool is not None:
                std = self.visual.attnpool.c_proj.in_features**-0.5
                nn.init.normal_(self.visual.attnpool.q_proj.weight, std=std)
                nn.init.normal_(self.visual.attnpool.k_proj.weight, std=std)
                nn.init.normal_(self.visual.attnpool.v_proj.weight, std=std)
                nn.init.normal_(self.visual.attnpool.c_proj.weight, std=std)

            for resnet_block in [
                self.visual.layer1,
                self.visual.layer2,
                self.visual.layer3,
                self.visual.layer4,
            ]:
                for name, param in resnet_block.named_parameters():
                    if name.endswith("bn3.weight"):
                        nn.init.zeros_(param)

        proj_std = (self.transformer.width**-0.5) * (
            (2 * self.transformer.layers) ** -0.5
        )
        attn_std = self.transformer.width**-0.5
        fc_std = (2 * self.transformer.width) ** -0.5
        for block in self.transformer.resblocks:
            nn.init.normal_(block.attn.in_proj_weight, std=attn_std)
            nn.init.normal_(block.attn.out_proj.weight, std=proj_std)
            nn.init.normal_(block.mlp.c_fc.weight, std=fc_std)
            nn.init.normal_(block.mlp.c_proj.weight, std=proj_std)

        if self.text_projection is not None:
            nn.init.normal_(self.text_projection, std=self.transformer.width**-0.5)

    def build_attention_mask(self):
        # lazily create causal attention mask, with full attention between the vision tokens
        # pytorch uses additive attention mask; fill with -inf
        mask = torch.empty(self.context_length, self.context_length)
        mask.fill_(float("-inf"))
        mask.triu_(1)  # zero out the lower diagonal
        return mask

    @property
    def dtype(self):
        return self.visual.conv1.weight.dtype

    def encode_image(self, image):
        return self.visual(image.type(self.dtype))

    def encode_text(self, text):
        x = self.token_embedding(text).type(self.dtype)  # [batch_size, n_ctx, d_model]

        x = x + self.positional_embedding.type(self.dtype)
        x = x.permute(1, 0, 2)  # NLD -> LND
        x = self.transformer(x)
        x = x.permute(1, 0, 2)  # LND -> NLD
        x = self.ln_final(x).type(self.dtype)

        # x.shape = [batch_size, n_ctx, transformer.width]
        # take features from the eot embedding (eot_token is the highest number in each sequence)
        x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection

        return x

    def encode_text_with_embeddings(self, text):
        x = self.token_embedding(text).type(self.dtype)  # [batch_size, n_ctx, d_model]

        x = x + self.positional_embedding.type(self.dtype)
        x = x.permute(1, 0, 2)  # NLD -> LND
        x = self.transformer(x)
        x = x.permute(1, 0, 2)  # LND -> NLD
        x = self.ln_final(x).type(self.dtype)

        emb = x.clone()
        # x.shape = [batch_size, n_ctx, transformer.width]
        # take features from the eot embedding (eot_token is the highest number in each sequence)
        x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection

        return x, emb

    def forward(self, image, text):
        image_features = self.encode_image(image)
        text_features = self.encode_text(text)

        # normalized features
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)

        # cosine similarity as logits
        logit_scale = self.logit_scale.exp()
        logits_per_image = logit_scale * image_features @ text_features.t()
        logits_per_text = logit_scale * text_features @ image_features.t()

        # shape = [global_batch_size, global_batch_size]
        return logits_per_image, logits_per_text


def convert_weights(model: nn.Module):
    """Convert applicable model parameters to fp16"""

    def _convert_weights_to_fp16(l):
        if isinstance(l, (nn.Conv1d, nn.Conv2d, nn.Linear)):
            l.weight.data = l.weight.data.half()
            if l.bias is not None:
                l.bias.data = l.bias.data.half()

        if isinstance(l, nn.MultiheadAttention):
            for attr in [
                *[f"{s}_proj_weight" for s in ["in", "q", "k", "v"]],
                "in_proj_bias",
                "bias_k",
                "bias_v",
            ]:
                tensor = getattr(l, attr)
                if tensor is not None:
                    tensor.data = tensor.data.half()

        for name in ["text_projection", "proj"]:
            if hasattr(l, name):
                attr = getattr(l, name)
                if attr is not None:
                    attr.data = attr.data.half()

    model.apply(_convert_weights_to_fp16)


def build_model(state_dict: dict):
    vit = "visual.proj" in state_dict

    if vit:
        vision_width = state_dict["visual.conv1.weight"].shape[0]
        vision_layers = len(
            [
                k
                for k in state_dict.keys()
                if k.startswith("visual.") and k.endswith(".attn.in_proj_weight")
            ]
        )
        vision_patch_size = state_dict["visual.conv1.weight"].shape[-1]
        grid_size = round(
            (state_dict["visual.positional_embedding"].shape[0] - 1) ** 0.5
        )
        image_resolution = vision_patch_size * grid_size
    else:
        counts: list = [
            len(
                set(
                    k.split(".")[2]
                    for k in state_dict
                    if k.startswith(f"visual.layer{b}")
                )
            )
            for b in [1, 2, 3, 4]
        ]
        vision_layers = tuple(counts)
        vision_width = state_dict["visual.layer1.0.conv1.weight"].shape[0]
        output_width = round(
            (state_dict["visual.attnpool.positional_embedding"].shape[0] - 1) ** 0.5
        )
        vision_patch_size = None
        assert (
            output_width**2 + 1
            == state_dict["visual.attnpool.positional_embedding"].shape[0]
        )
        image_resolution = output_width * 32

    embed_dim = state_dict["text_projection"].shape[1]
    context_length = state_dict["positional_embedding"].shape[0]
    vocab_size = state_dict["token_embedding.weight"].shape[0]
    transformer_width = state_dict["ln_final.weight"].shape[0]
    transformer_heads = transformer_width // 64
    transformer_layers = len(
        set(
            k.split(".")[2]
            for k in state_dict
            if k.startswith(f"transformer.resblocks")
        )
    )

    model = CLIP(
        embed_dim,
        image_resolution,
        vision_layers,
        vision_width,
        vision_patch_size,
        context_length,
        vocab_size,
        transformer_width,
        transformer_heads,
        transformer_layers,
    )

    for key in ["input_resolution", "context_length", "vocab_size"]:
        if key in state_dict:
            del state_dict[key]

    convert_weights(model)
    model.load_state_dict(state_dict)
    return model.eval()


def _download(url: str, root: str = os.path.expanduser("~/.cache/clip")):
    os.makedirs(root, exist_ok=True)
    filename = os.path.basename(url)

    expected_sha256 = url.split("/")[-2]
    download_target = os.path.join(root, filename)

    if os.path.exists(download_target) and not os.path.isfile(download_target):
        raise RuntimeError(f"{download_target} exists and is not a regular file")

    if os.path.isfile(download_target):
        if (
            hashlib.sha256(open(download_target, "rb").read()).hexdigest()
            == expected_sha256
        ):
            return download_target
        else:
            warnings.warn(
                f"{download_target} exists, but the SHA256 checksum does not match; re-downloading the file"
            )

    with urllib.request.urlopen(url) as source, open(download_target, "wb") as output:
        with tqdm(total=int(source.info().get("Content-Length")), ncols=80) as loop:
            while True:
                buffer = source.read(8192)
                if not buffer:
                    break

                output.write(buffer)
                loop.update(len(buffer))

    if (
        hashlib.sha256(open(download_target, "rb").read()).hexdigest()
        != expected_sha256
    ):
        raise RuntimeError(
            f"Model has been downloaded but the SHA256 checksum does not not match"
        )

    return download_target


def available_models():
    return list(_MODELS.keys())


def load_clip(
    name: str,
    device: Union[str, torch.device] = "cuda" if torch.cuda.is_available() else "cpu",
    jit=True,
):
    if name not in _MODELS:
        raise RuntimeError(
            f"Model {name} not found; available models = {available_models()}"
        )

    model_path = _download(_MODELS[name])
    model = torch.jit.load(model_path, map_location=device if jit else "cpu").eval()
    n_px = model.input_resolution.item()

    transform = Compose(
        [
            Resize(n_px, interpolation=Image.BICUBIC),
            CenterCrop(n_px),
            # lambda image: image.convert("RGB"),
            # ToTensor(),
            Normalize(
                (0.48145466, 0.4578275, 0.40821073),
                (0.26862954, 0.26130258, 0.27577711),
            ),
        ]
    )

    if not jit:
        model = build_model(model.state_dict()).to(device)
        if str(device) == "cpu":
            model.float()
        return model, transform

    # patch the device names
    device_holder = torch.jit.trace(
        lambda: torch.ones([]).to(torch.device(device)), example_inputs=[]
    )
    device_node = [
        n
        for n in device_holder.graph.findAllNodes("prim::Constant")
        if "Device" in repr(n)
    ][-1]

    def patch_device(module):
        graphs = [module.graph] if hasattr(module, "graph") else []
        if hasattr(module, "forward1"):
            graphs.append(module.forward1.graph)

        for graph in graphs:
            for node in graph.findAllNodes("prim::Constant"):
                if "value" in node.attributeNames() and str(node["value"]).startswith(
                    "cuda"
                ):
                    node.copyAttributes(device_node)

    model.apply(patch_device)
    patch_device(model.encode_image)
    patch_device(model.encode_text)

    # patch dtype to float32 on CPU
    if str(device) == "cpu":
        float_holder = torch.jit.trace(
            lambda: torch.ones([]).float(), example_inputs=[]
        )
        float_input = list(float_holder.graph.findNode("aten::to").inputs())[1]
        float_node = float_input.node()

        def patch_float(module):
            graphs = [module.graph] if hasattr(module, "graph") else []
            if hasattr(module, "forward1"):
                graphs.append(module.forward1.graph)

            for graph in graphs:
                for node in graph.findAllNodes("aten::to"):
                    inputs = list(node.inputs())
                    for i in [
                        1,
                        2,
                    ]:  # dtype can be the second or third argument to aten::to()
                        if inputs[i].node()["value"] == 5:
                            inputs[i].node().copyAttributes(float_node)

        model.apply(patch_float)
        patch_float(model.encode_image)
        patch_float(model.encode_text)

        model.float()

    return model, transform


def tokenize(texts: Union[str, List[str]], context_length: int = 77):
    if isinstance(texts, str):
        texts = [texts]

    sot_token = _tokenizer.encoder["<|startoftext|>"]
    eot_token = _tokenizer.encoder["<|endoftext|>"]
    all_tokens = [[sot_token] + _tokenizer.encode(text) + [eot_token] for text in texts]
    result = torch.zeros(len(all_tokens), context_length, dtype=torch.long)

    for i, tokens in enumerate(all_tokens):
        if len(tokens) > context_length:
            raise RuntimeError(
                f"Input {texts[i]} is too long for context length {context_length}"
            )
        result[i, : len(tokens)] = torch.tensor(tokens)

    return result


================================================
FILE: helpers/clip/core/fusion.py
================================================
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


class DotAttn(nn.Module):
    """Dot-Attention"""

    def forward(self, inp, h):
        score = self.softmax(inp, h)
        return score.expand_as(inp).mul(inp).sum(1), score

    def softmax(self, inp, h):
        raw_score = inp.bmm(h.unsqueeze(2))
        score = F.softmax(raw_score, dim=1)
        return score


class ScaledDotAttn(nn.Module):
    """Scaled Dot-Attention"""

    def forward(self, inp, h):
        score = self.softmax(inp, h)
        return score.expand_as(inp).mul(inp).sum(1), score

    def softmax(self, inp, h):
        raw_score = inp.bmm(h.unsqueeze(2)) / np.sqrt(h.shape[-1])
        score = F.softmax(raw_score, dim=1)
        return score


class Fusion(nn.Module):
    """Base Fusion Class"""

    def __init__(self, input_dim=3):
        super().__init__()
        self.input_dim = input_dim

    def tile_x2(self, x1, x2, x2_proj=None):
        if x2_proj:
            x2 = x2_proj(x2)

        x2 = x2.unsqueeze(-1).unsqueeze(-1)
        x2 = x2.repeat(1, 1, x1.shape[-2], x1.shape[-1])
        return x2

    def forward(self, x1, x2, x2_mask=None, x2_proj=None):
        raise NotImplementedError()


class FusionAdd(Fusion):
    """x1 + x2"""

    def __init__(self, input_dim=3):
        super(FusionAdd, self).__init__(input_dim=input_dim)

    def forward(self, x1, x2, x2_mask=None, x2_proj=None):
        if x1.shape != x2.shape and len(x1.shape) != len(x2.shape):
            x2 = self.tile_x2(x1, x2, x2_proj)
        return x1 + x2


class FusionMult(Fusion):
    """x1 * x2"""

    def __init__(self, input_dim=3):
        super(FusionMult, self).__init__(input_dim=input_dim)

    def forward(self, x1, x2, x2_mask=None, x2_proj=None):
        if x1.shape != x2.shape and len(x1.shape) != len(x2.shape):
            x2 = self.tile_x2(x1, x2, x2_proj)
        return x1 * x2


class FusionMax(Fusion):
    """max(x1, x2)"""

    def __init__(self, input_dim=3):
        super(FusionMax, self).__init__(input_dim=input_dim)

    def forward(self, x1, x2, x2_mask=None, x2_proj=None):
        if x1.shape != x2.shape and len(x1.shape) != len(x2.shape):
            x2 = self.tile_x2(x1, x2, x2_proj)
        return torch.max(x1, x2)


class FusionConcat(Fusion):
    """[x1; x2]"""

    def __init__(self, input_dim=3):
        super(FusionConcat, self).__init__(input_dim=input_dim)

    def forward(self, x1, x2, x2_mask=None, x2_proj=None):
        if x1.shape != x2.shape and len(x1.shape) != len(x2.shape):
            x2 = self.tile_x2(x1, x2, x2_proj)
        return torch.cat([x1, x2], dim=1)


class FusionConv(Fusion):
    """1x1 convs after [x1; x2]"""

    def __init__(self, input_dim=3):
        super(FusionConv, self).__init__(input_dim=input_dim)
        self.conv = nn.Sequential(
            nn.ReLU(True),
            nn.Conv2d(input_dim * 2, input_dim, kernel_size=1, bias=False),
        )

    def forward(self, x1, x2, x2_mask=None, x2_proj=None):
        if x1.shape != x2.shape and len(x1.shape) != len(x2.shape):
            x2 = self.tile_x2(x1, x2, x2_proj)
        x = torch.cat([x1, x2], dim=1)  # [B, 2C, H, W]
        x = self.conv(x)  # [B, C, H, W]
        return x


class FusionConvLat(Fusion):
    """1x1 convs after [x1; x2] for lateral fusion"""

    def __init__(self, input_dim=3, output_dim=3):
        super(FusionConvLat, self).__init__(input_dim=input_dim)
        self.conv = nn.Sequential(
            nn.ReLU(True), nn.Conv2d(input_dim, output_dim, kernel_size=1, bias=False)
        )

    def forward(self, x1, x2, x2_mask=None, x2_proj=None):
        if x1.shape != x2.shape and len(x1.shape) != len(x2.shape):
            x2 = self.tile_x2(x1, x2, x2_proj)
        x = torch.cat([x1, x2], dim=1)  # [B, input_dim, H, W]
        x = self.conv(x)  # [B, output_dim, H, W]
        return x


## ------------- NOTE ----------------
## The following are various fusion types I experimented with.
## Most of them didn't work well ¯\_(ツ)_/¯
## But it doesn't mean there isn't a better way of
## doing lateral and multi-modal (language+vision) fusion.


class FusionFiLM(Fusion):
    """FiLM (Perez et. al, https://arxiv.org/abs/1709.07871).
    Note: This is not used inside a Residual block before ReLU.
    I had a version this in UpBlock with FiLM, which didn't seem to work at all.
    """

    def __init__(self, input_dim=3, output_dim=3):
        super(FusionFiLM, self).__init__(input_dim=input_dim)

    def forward(self, x1, x2, gamma, beta):
        g = self.tile_x2(x1, x2, gamma)
        b = self.tile_x2(x1, x2, beta)
        return x1 * g + b


class FusionDeepConv(Fusion):
    """Multi-Layer 1x1 convs after [x1; x2]"""

    def __init__(self, input_dim=3):
        super(FusionDeepConv, self).__init__(input_dim=input_dim)
        self.conv = nn.Sequential(
            nn.ReLU(True),
            nn.Conv2d(input_dim * 2, input_dim, kernel_size=1, bias=False),
            nn.ReLU(True),
            nn.Conv2d(input_dim, input_dim, kernel_size=1, bias=False),
            nn.ReLU(True),
            nn.Conv2d(input_dim, input_dim, kernel_size=1, bias=False),
        )

    def forward(self, x1, x2, x2_mask=None, x2_proj=None):
        if x1.shape != x2.shape and len(x1.shape) != len(x2.shape):
            x2 = self.tile_x2(x1, x2, x2_proj)
        x = torch.cat([x1, x2], dim=1)  # [B, 2C, H, W]
        x = self.conv(x)  # [B, C, H, W]
        return x


class FusionMultWord(nn.Module):
    """Product with weighted-sum of words"""

    def __init__(self, input_dim=3):
        super().__init__()
        self.input_dim = input_dim

    def forward(self, x1, x2, x2_mask=None, x2_proj=None):
        B, D, H, W = x1.shape
        x2_len = int(x2_mask.count_nonzero())

        weighted_x1 = torch.zeros_like(x1)
        for t in range(x2_len):
            x2_t = x2_proj(x2[:, t]) if x2_proj else x2[:, t]
            x2_t = x2_t.unsqueeze(-1).unsqueeze(-1).repeat(B, 1, H, W)
            weighted_x1 += x1 * x2_t
        weighted_x1 /= x2_len
        return weighted_x1


class FusionWordAttention(nn.Module):
    """Word Attention"""

    def __init__(self, input_dim=3):
        super().__init__()
        self.input_dim = input_dim
        self.dot_attn = DotAttn()

    def forward(self, x1, x2, x2_mask=None, x2_proj=None):
        B, D, H, W = x1.shape
        x1_flat = x1.reshape(B, D, H * W)
        x2_len = int(x2_mask.count_nonzero())

        # TODO: batch this unrolling?
        weight_sum_x1_flat = torch.zeros_like(x1_flat)
        for t in range(x2_len):
            x2_t = x2_proj(x2[:, t]) if x2_proj else x2[:, t]
            x2_t = x2_t.repeat(B, 1)

            _, attn_x1 = self.dot_attn(x1_flat.transpose(1, 2), x2_t)
            weight_sum_x1_flat += x1_flat * attn_x1.transpose(1, 2)

        weight_sum_x1_flat /= x2_len
        x2 = weight_sum_x1_flat.reshape(B, D, H, W)
        return x2


class FusionSentenceAttention(nn.Module):
    """Sentence Attention"""

    def __init__(self, input_dim=3):
        super().__init__()
        self.input_dim = input_dim
        self.dot_attn = ScaledDotAttn()

    def forward(self, x1, x2, x2_mask=None, x2_proj=None):
        B, D, H, W = x1.shape
        x1_flat = x1.reshape(B, D, H * W)

        x2_t = x2_proj(x2) if x2_proj else x2
        x2_t = x2_t.repeat(B, 1)

        _, attn_x1 = self.dot_attn(x1_flat.transpose(1, 2), x2_t)
        weight_sum_x1_flat = x1_flat * attn_x1.transpose(1, 2)

        x2 = weight_sum_x1_flat.reshape(B, D, H, W)
        return x2


class CrossModalAttention2d(nn.Module):
    """Cross-Modal Attention. Adapted from: https://github.com/openai/CLIP/blob/main/clip/model.py#L56"""

    def __init__(
        self,
        spacial_dim=7,
        embed_dim=1024,
        num_heads=32,
        output_dim=1024,
        lang_dim=512,
        lang_max_tokens=77,
    ):
        super().__init__()
        self.embed_dim = embed_dim
        self.lang_dim = lang_dim
        self.lang_max_tokens = lang_max_tokens
        self.num_heads = num_heads
        self.lang_proj = nn.Linear(self.lang_dim, embed_dim)
        self.vision_positional_embedding = nn.Parameter(
            torch.randn(spacial_dim**2, embed_dim) / embed_dim**0.5
        )
        self.lang_positional_embedding = nn.Parameter(
            torch.randn(lang_max_tokens, embed_dim) / embed_dim**0.5
        )
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.c_proj = nn.Linear(embed_dim, output_dim or embed_dim)

    def forward(self, x, l, l_mask):
        # reshape vision features
        x_shape = x.shape
        x = x.reshape(x.shape[0], x.shape[1], x.shape[2] * x.shape[3]).permute(
            2, 0, 1
        )  # NCHW -> (HW)NC
        x = x + self.vision_positional_embedding[: x.shape[0], None, :].to(
            x.dtype
        )  # (HW)NC

        # project language
        l = l.permute(1, 0, 2)
        l_shape = l.shape
        l = l.reshape(-1, self.lang_dim)
        l = self.lang_proj(l)
        l = l.reshape(l_shape[0], l_shape[1], self.embed_dim)
        l = l + self.lang_positional_embedding[:, None, :].to(l.dtype)

        # hard language mask
        l_len = int(l_mask.count_nonzero())
        l = l[:l_len]
        l = l.repeat(1, x.shape[1], 1)

        x, _ = F.multi_head_attention_forward(
            query=x,
            key=l,
            value=l,
            embed_dim_to_check=x.shape[-1],
            num_heads=self.num_heads,
            q_proj_weight=self.q_proj.weight,
            k_proj_weight=self.k_proj.weight,
            v_proj_weight=self.v_proj.weight,
            in_proj_weight=None,
            in_proj_bias=torch.cat(
                [self.q_proj.bias, self.k_proj.bias, self.v_proj.bias]
            ),
            bias_k=None,
            bias_v=None,
            add_zero_attn=False,
            dropout_p=0,
            out_proj_weight=self.c_proj.weight,
            out_proj_bias=self.c_proj.bias,
            use_separate_proj_weight=True,
            training=self.training,
            need_weights=False,
        )

        x = x.permute(1, 2, 0)
        x = x.reshape(x_shape)
        return x


class FusionMultiHeadedWordAttention(nn.Module):
    """Multi-Headed Word Attention that uses Cross Modal Attention at different scales"""

    def __init__(self, input_dim=3):
        super().__init__()
        self.input_dim = input_dim
        self.attn1 = CrossModalAttention2d(
            spacial_dim=7, embed_dim=1024, output_dim=1024
        )
        self.attn2 = CrossModalAttention2d(
            spacial_dim=14, embed_dim=512, output_dim=512
        )
        self.attn3 = CrossModalAttention2d(
            spacial_dim=28, embed_dim=256, output_dim=256
        )

        self.multi_headed_attns = {
            1024: self.attn1,
            512: self.attn2,
            256: self.attn3,
        }

    def forward(self, x1, x2, x2_mask=None, x2_proj=None):
        emb_dim = x1.shape[1]
        x = self.multi_headed_attns[emb_dim](x1, x2, x2_mask)
        return x


names = {
    "add": FusionAdd,
    "mult": FusionMult,
    "mult_word": FusionMultWord,
    "film": FusionFiLM,
    "max": FusionMax,
    "concat": FusionConcat,
    "conv": FusionConv,
    "deep_conv": FusionDeepConv,
    "word_attn": FusionWordAttention,
    "sent_attn": FusionSentenceAttention,
    "multi_headed_word_attn": FusionMultiHeadedWordAttention,
}


================================================
FILE: helpers/clip/core/resnet.py
================================================
import torch
import torch.nn as nn
import torch.nn.functional as F


class IdentityBlock(nn.Module):
    def __init__(
        self, in_planes, filters, kernel_size, stride=1, final_relu=True, batchnorm=True
    ):
        super(IdentityBlock, self).__init__()
        self.final_relu = final_relu
        self.batchnorm = batchnorm

        filters1, filters2, filters3 = filters
        self.conv1 = nn.Conv2d(in_planes, filters1, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(filters1) if self.batchnorm else nn.Identity()
        self.conv2 = nn.Conv2d(
            filters1,
            filters2,
            kernel_size=kernel_size,
            dilation=1,
            stride=stride,
            padding=1,
            bias=False,
        )
        self.bn2 = nn.BatchNorm2d(filters2) if self.batchnorm else nn.Identity()
        self.conv3 = nn.Conv2d(filters2, filters3, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(filters3) if self.batchnorm else nn.Identity()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out += x
        if self.final_relu:
            out = F.relu(out)
        return out


class ConvBlock(nn.Module):
    def __init__(
        self, in_planes, filters, kernel_size, stride=1, final_relu=True, batchnorm=True
    ):
        super(ConvBlock, self).__init__()
        self.final_relu = final_relu
        self.batchnorm = batchnorm

        filters1, filters2, filters3 = filters
        self.conv1 = nn.Conv2d(in_planes, filters1, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(filters1) if self.batchnorm else nn.Identity()
        self.conv2 = nn.Conv2d(
            filters1,
            filters2,
            kernel_size=kernel_size,
            dilation=1,
            stride=stride,
            padding=1,
            bias=False,
        )
        self.bn2 = nn.BatchNorm2d(filters2) if self.batchnorm else nn.Identity()
        self.conv3 = nn.Conv2d(filters2, filters3, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(filters3) if self.batchnorm else nn.Identity()

        self.shortcut = nn.Sequential(
            nn.Conv2d(in_planes, filters3, kernel_size=1, stride=stride, bias=False),
            nn.BatchNorm2d(filters3) if self.batchnorm else nn.Identity(),
        )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out += self.shortcut(x)
        if self.final_relu:
            out = F.relu(out)
        return out


class ResNet43_8s(nn.Module):
    def __init__(self, input_shape, output_dim, cfg, device, preprocess):
        super(ResNet43_8s, self).__init__()
        self.input_shape = input_shape
        self.input_dim = input_shape[-1]
        self.output_dim = output_dim
        self.cfg = cfg
        self.device = device
        self.batchnorm = self.cfg["train"]["batchnorm"]
        self.preprocess = preprocess

        self.layers = self._make_layers()

    def _make_layers(self):
        layers = nn.Sequential(
            # conv1
            nn.Conv2d(self.input_dim, 64, stride=1, kernel_size=3, padding=1),
            nn.BatchNorm2d(64) if self.batchnorm else nn.Identity(),
            nn.ReLU(True),
            # fcn
            ConvBlock(
                64, [64, 64, 64], kernel_size=3, stride=1, batchnorm=self.batchnorm
            ),
            IdentityBlock(
                64, [64, 64, 64], kernel_size=3, stride=1, batchnorm=self.batchnorm
            ),
            ConvBlock(
                64, [128, 128, 128], kernel_size=3, stride=2, batchnorm=self.batchnorm
            ),
            IdentityBlock(
                128, [128, 128, 128], kernel_size=3, stride=1, batchnorm=self.batchnorm
            ),
            ConvBlock(
                128, [256, 256, 256], kernel_size=3, stride=2, batchnorm=self.batchnorm
            ),
            IdentityBlock(
                256, [256, 256, 256], kernel_size=3, stride=1, batchnorm=self.batchnorm
            ),
            ConvBlock(
                256, [512, 512, 512], kernel_size=3, stride=2, batchnorm=self.batchnorm
            ),
            IdentityBlock(
                512, [512, 512, 512], kernel_size=3, stride=1, batchnorm=self.batchnorm
            ),
            # head
            ConvBlock(
                512, [256, 256, 256], kernel_size=3, stride=1, batchnorm=self.batchnorm
            ),
            IdentityBlock(
                256, [256, 256, 256], kernel_size=3, stride=1, batchnorm=self.batchnorm
            ),
            nn.UpsamplingBilinear2d(scale_factor=2),
            ConvBlock(
                256, [128, 128, 128], kernel_size=3, stride=1, batchnorm=self.batchnorm
            ),
            IdentityBlock(
                128, [128, 128, 128], kernel_size=3, stride=1, batchnorm=self.batchnorm
            ),
            nn.UpsamplingBilinear2d(scale_factor=2),
            ConvBlock(
                128, [64, 64, 64], kernel_size=3, stride=1, batchnorm=self.batchnorm
            ),
            IdentityBlock(
                64, [64, 64, 64], kernel_size=3, stride=1, batchnorm=self.batchnorm
            ),
            nn.UpsamplingBilinear2d(scale_factor=2),
            # conv2
            ConvBlock(
                64,
                [16, 16, self.output_dim],
                kernel_size=3,
                stride=1,
                final_relu=False,
                batchnorm=self.batchnorm,
            ),
            IdentityBlock(
                self.output_dim,
                [16, 16, self.output_dim],
                kernel_size=3,
                stride=1,
                final_relu=False,
                batchnorm=self.batchnorm,
            ),
        )
        return layers

    def forward(self, x):
        x = self.preprocess(x, dist="transporter")

        out = self.layers(x)
        return out


================================================
FILE: helpers/clip/core/simple_tokenizer.py
================================================
import gzip
import html
import os
from functools import lru_cache

import ftfy
import regex as re


@lru_cache()
def default_bpe():
    return os.path.join(
        os.path.dirname(os.path.abspath(__file__)), "bpe_simple_vocab_16e6.txt.gz"
    )


@lru_cache()
def bytes_to_unicode():
    """
    Returns list of utf-8 byte and a corresponding list of unicode strings.
    The reversible bpe codes work on unicode strings.
    This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
    When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
    This is a signficant percentage of your normal, say, 32K bpe vocab.
    To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
    And avoids mapping to whitespace/control characters the bpe code barfs on.
    """
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    cs = [chr(n) for n in cs]
    return dict(zip(bs, cs))


def get_pairs(word):
    """Return set of symbol pairs in a word.
    Word is represented as tuple of symbols (symbols being variable-length strings).
    """
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs


def basic_clean(text):
    text = ftfy.fix_text(text)
    text = html.unescape(html.unescape(text))
    return text.strip()


def whitespace_clean(text):
    text = re.sub(r"\s+", " ", text)
    text = text.strip()
    return text


class SimpleTokenizer(object):
    def __init__(self, bpe_path: str = default_bpe()):
        self.byte_encoder = bytes_to_unicode()
        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
        merges = gzip.open(bpe_path).read().decode("utf-8").split("\n")
        merges = merges[1 : 49152 - 256 - 2 + 1]
        merges = [tuple(merge.split()) for merge in merges]
        vocab = list(bytes_to_unicode().values())
        vocab = vocab + [v + "</w>" for v in vocab]
        for merge in merges:
            vocab.append("".join(merge))
        vocab.extend(["<|startoftext|>", "<|endoftext|>"])
        self.encoder = dict(zip(vocab, range(len(vocab))))
        self.decoder = {v: k for k, v in self.encoder.items()}
        self.bpe_ranks = dict(zip(merges, range(len(merges))))
        self.cache = {
            "<|startoftext|>": "<|startoftext|>",
            "<|endoftext|>": "<|endoftext|>",
        }
        self.pat = re.compile(
            r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""",
            re.IGNORECASE,
        )

    def bpe(self, token):
        if token in self.cache:
            return self.cache[token]
        word = tuple(token[:-1]) + (token[-1] + "</w>",)
        pairs = get_pairs(word)

        if not pairs:
            return token + "</w>"

        while True:
            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
            if bigram not in self.bpe_ranks:
                break
            first, second = bigram
            new_word = []
            i = 0
            while i < len(word):
                try:
                    j = word.index(first, i)
                    new_word.extend(word[i:j])
                    i = j
                except:
                    new_word.extend(word[i:])
                    break

                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
                    new_word.append(first + second)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word = tuple(new_word)
            word = new_word
            if len(word) == 1:
                break
            else:
                pairs = get_pairs(word)
        word = " ".join(word)
        self.cache[token] = word
        return word

    def encode(self, text):
        bpe_tokens = []
        text = whitespace_clean(basic_clean(text)).lower()
        for token in re.findall(self.pat, text):
            token = "".join(self.byte_encoder[b] for b in token.encode("utf-8"))
            bpe_tokens.extend(
                self.encoder[bpe_token] for bpe_token in self.bpe(token).split(" ")
            )
        return bpe_tokens

    def decode(self, tokens):
        text = "".join([self.decoder[token] for token in tokens])
        text = (
            bytearray([self.byte_decoder[c] for c in text])
            .decode("utf-8", errors="replace")
            .replace("</w>", " ")
        )
        return text


================================================
FILE: helpers/clip/core/transport.py
================================================
import cliport.models as models
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from cliport.utils import utils


class Transport(nn.Module):
    def __init__(
        self, stream_fcn, in_shape, n_rotations, crop_size, preprocess, cfg, device
    ):
        """Transport (a.k.a Place) module."""
        super().__init__()

        self.iters = 0
        self.stream_fcn = stream_fcn
        self.n_rotations = n_rotations
        self.crop_size = crop_size  # crop size must be N*16 (e.g. 96)
        self.preprocess = preprocess
        self.cfg = cfg
        self.device = device
        self.batchnorm = self.cfg["train"]["batchnorm"]

        self.pad_size = int(self.crop_size / 2)
        self.padding = np.zeros((3, 2), dtype=int)
        self.padding[:2, :] = self.pad_size

        in_shape = np.array(in_shape)
        in_shape = tuple(in_shape)
        self.in_shape = in_shape

        # Crop before network (default from Transporters CoRL 2020).
        self.kernel_shape = (self.crop_size, self.crop_size, self.in_shape[2])

        if not hasattr(self, "output_dim"):
            self.output_dim = 3
        if not hasattr(self, "kernel_dim"):
            self.kernel_dim = 3

        self.rotator = utils.ImageRotator(self.n_rotations)

        self._build_nets()

    def _build_nets(self):
        stream_one_fcn, _ = self.stream_fcn
        model = models.names[stream_one_fcn]
        self.key_resnet = model(self.in_shape, self.output_dim, self.cfg, self.device)
        self.query_resnet = model(
            self.kernel_shape, self.kernel_dim, self.cfg, self.device
        )
        print(f"Transport FCN: {stream_one_fcn}")

    def correlate(self, in0, in1, softmax):
        """Correlate two input tensors."""
        output = F.conv2d(in0, in1, padding=(self.pad_size, self.pad_size))
        output = F.interpolate(
            output, size=(in0.shape[-2], in0.shape[-1]), mode="bilinear"
        )
        output = output[
            :, :, self.pad_size : -self.pad_size, self.pad_size : -self.pad_size
        ]
        if softmax:
            output_shape = output.shape
            output = output.reshape((1, np.prod(output.shape)))
            output = F.softmax(output, dim=-1)
            output = output.reshape(output_shape[1:])
        return output

    def transport(self, in_tensor, crop):
        logits = self.key_resnet(in_tensor)
        kernel = self.query_resnet(crop)
        return logits, kernel

    def forward(self, inp_img, p, softmax=True):
        """Forward pass."""
        img_unprocessed = np.pad(inp_img, self.padding, mode="constant")
        input_data = img_unprocessed
        in_shape = (1,) + input_data.shape
        input_data = input_data.reshape(in_shape)  # [B W H D]
        in_tensor = torch.from_numpy(input_data).to(
            dtype=torch.float, device=self.device
        )

        # Rotation pivot.
        pv = np.array([p[0], p[1]]) + self.pad_size

        # Crop before network (default from Transporters CoRL 2020).
        hcrop = self.pad_size
        in_tensor = in_tensor.permute(0, 3, 1, 2)  # [B D W H]

        crop = in_tensor.repeat(self.n_rotations, 1, 1, 1)
        crop = self.rotator(crop, pivot=pv)
        crop = torch.cat(crop, dim=0)
        crop = crop[:, :, pv[0] - hcrop : pv[0] + hcrop, pv[1] - hcrop : pv[1] + hcrop]

        logits, kernel = self.transport(in_tensor, crop)

        # TODO(Mohit): Crop after network. Broken for now.
        # in_tensor = in_tensor.permute(0, 3, 1, 2)
        # logits, crop = self.transport(in_tensor)
        # crop = crop.repeat(self.n_rotations, 1, 1, 1)
        # crop = self.rotator(crop, pivot=pv)
        # crop = torch.cat(crop, dim=0)

        # kernel = crop[:, :, pv[0]-hcrop:pv[0]+hcrop, pv[1]-hcrop:pv[1]+hcrop]
        # kernel = crop[:, :, p[0]:(p[0] + self.crop_size), p[1]:(p[1] + self.crop_size)]

        return self.correlate(logits, kernel, softmax)


================================================
FILE: helpers/clip/core/transport_image_goal.py
================================================
import cliport.models as models
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from cliport.utils import utils


class TransportImageGoal(nn.Module):
    """Transport module."""

    def __init__(
        self, stream_fcn, in_shape, n_rotations, crop_size, preprocess, cfg, device
    ):
        """Transport module for placing.
        Args:
          in_shape: shape of input image.
          n_rotations: number of rotations of convolving kernel.
          crop_size: crop size around pick argmax used as convolving kernel.
          preprocess: function to preprocess input images.
        """
        super().__init__()

        self.iters = 0
        self.stream_fcn = stream_fcn
        self.n_rotations = n_rotations
        self.crop_size = crop_size  # crop size must be N*16 (e.g. 96)
        self.preprocess = preprocess
        self.cfg = cfg
        self.device = device
        self.batchnorm = self.cfg["train"]["batchnorm"]

        self.pad_size = int(self.crop_size / 2)
        self.padding = np.zeros((3, 2), dtype=int)
        self.padding[:2, :] = self.pad_size

        in_shape = np.array(in_shape)
        in_shape = tuple(in_shape)
        self.in_shape = in_shape

        # Crop before network (default for Transporters CoRL 2020).
        self.kernel_shape = (self.crop_size, self.crop_size, self.in_shape[2])

        if not hasattr(self, "output_dim"):
            self.output_dim = 3
        if not hasattr(self, "kernel_dim"):
            self.kernel_dim = 3

        self.rotator = utils.ImageRotator(self.n_rotations)

        self._build_nets()

    def _build_nets(self):
        stream_one_fcn, _ = self.stream_fcn
        model = models.names[stream_one_fcn]
        self.key_resnet = model(self.in_shape, self.output_dim, self.cfg, self.device)
        self.query_resnet = model(self.in_shape, self.kernel_dim, self.cfg, self.device)
        self.goal_resnet = model(self.in_shape, self.output_dim, self.cfg, self.device)
        print(f"Transport FCN: {stream_one_fcn}")

    def correlate(self, in0, in1, softmax):
        """Correlate two input tensors."""
        output = F.conv2d(in0, in1, padding=(self.pad_size, self.pad_size))
        output = F.interpolate(
            output, size=(in0.shape[-2], in0.shape[-1]), mode="bilinear"
        )
        output = output[
            :, :, self.pad_size : -self.pad_size, self.pad_size : -self.pad_size
        ]
        if softmax:
            output_shape = output.shape
            output = output.reshape((1, np.prod(output.shape)))
            output = F.softmax(output, dim=-1)
            output = output.reshape(output_shape[1:])
        return output

    def forward(self, inp_img, goal_img, p, softmax=True):
        """Forward pass."""

        # Input image.
        img_unprocessed = np.pad(inp_img, self.padding, mode="constant")
        input_data = img_unprocessed
        in_shape = (1,) + input_data.shape
        input_data = input_data.reshape(in_shape)
        in_tensor = torch.from_numpy(input_data.copy()).to(
            dtype=torch.float, device=self.device
        )
        in_tensor = in_tensor.permute(0, 3, 1, 2)

        # Goal image.
        goal_tensor = np.pad(goal_img, self.padding, mode="constant")
        goal_shape = (1,) + goal_tensor.shape
        goal_tensor = goal_tensor.reshape(goal_shape)
        goal_tensor = torch.from_numpy(goal_tensor.copy()).to(
            dtype=torch.float, device=self.device
        )
        goal_tensor = goal_tensor.permute(0, 3, 1, 2)

        # Rotation pivot.
        pv = np.array([p[0], p[1]]) + self.pad_size
        hcrop = self.pad_size

        # Cropped input features.
        in_crop = in_tensor.repeat(self.n_rotations, 1, 1, 1)
        in_crop = self.rotator(in_crop, pivot=pv)
        in_crop = torch.cat(in_crop, dim=0)
        in_crop = in_crop[
            :, :, pv[0] - hcrop : pv[0] + hcrop, pv[1] - hcrop : pv[1] + hcrop
        ]

        # Cropped goal features.
        goal_crop = goal_tensor.repeat(self.n_rotations, 1, 1, 1)
        goal_crop = self.rotator(goal_crop, pivot=pv)
        goal_crop = torch.cat(goal_crop, dim=0)
        goal_crop = goal_crop[
            :, :, pv[0] - hcrop : pv[0] + hcrop, pv[1] - hcrop : pv[1] + hcrop
        ]

        in_logits = self.key_resnet(in_tensor)
        goal_logits = self.goal_resnet(goal_tensor)
        kernel_crop = self.query_resnet(in_crop)
        goal_crop = self.goal_resnet(goal_crop)

        # Fuse Goal and Transport features
        goal_x_in_logits = (
            goal_logits + in_logits
        )  # Mohit: why doesn't multiply work? :(
        goal_x_kernel = goal_crop + kernel_crop

        # TODO(Mohit): Crop after network. Broken for now
        # in_logits = self.key_resnet(in_tensor)
        # kernel_nocrop_logits = self.query_resnet(in_tensor)
        # goal_logits = self.goal_resnet(goal_tensor)

        # goal_x_in_logits = in_logits
        # goal_x_kernel_logits = goal_logits * kernel_nocrop_logits

        # goal_crop = goal_x_kernel_logits.repeat(self.n_rotations, 1, 1, 1)
        # goal_crop = self.rotator(goal_crop, pivot=pv)
        # goal_crop = torch.cat(goal_crop, dim=0)
        # goal_crop = goal_crop[:, :, pv[0]-hcrop:pv[0]+hcrop, pv[1]-hcrop:pv[1]+hcrop]

        return self.correlate(goal_x_in_logits, goal_x_kernel, softmax)


================================================
FILE: helpers/clip/core/unet.py
================================================
# Credit: https://github.com/milesial/Pytorch-UNet/

import torch
import torch.nn as nn
import torch.nn.functional as F


class DoubleConv(nn.Module):
    """(convolution => [BN] => ReLU) * 2"""

    def __init__(self, in_channels, out_channels, mid_channels=None):
        super().__init__()
        if not mid_channels:
            mid_channels = out_channels
        self.double_conv = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1),
            # nn.BatchNorm2d(mid_channels),                                     # (Mohit): argh... forgot to remove this batchnorm
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, kernel_size=3, padding=1),
            # nn.BatchNorm2d(out_channels),                                     # (Mohit): argh... forgot to remove this batchnorm
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.double_conv(x)


class Down(nn.Module):
    """Downscaling with maxpool then double conv"""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.maxpool_conv = nn.Sequential(
            nn.MaxPool2d(2), DoubleConv(in_channels, out_channels)
        )

    def forward(self, x):
        return self.maxpool_conv(x)


class Up(nn.Module):
    """Upscaling then double conv"""

    def __init__(self, in_channels, out_channels, bilinear=True):
        super().__init__()

        # if bilinear, use the normal convolutions to reduce the number of channels
        if bilinear:
            self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
            self.conv = DoubleConv(in_channels, out_channels, in_channels // 2)
        else:
            self.up = nn.ConvTranspose2d(
                in_channels, in_channels // 2, kernel_size=2, stride=2
            )
            self.conv = DoubleConv(in_channels, out_channels)

    def forward(self, x1, x2):
        x1 = self.up(x1)
        # input is CHW
        diffY = x2.size()[2] - x1.size()[2]
        diffX = x2.size()[3] - x1.size()[3]

        x1 = F.pad(x1, [diffX // 2, diffX - diffX // 2, diffY // 2, diffY - diffY // 2])
        # if you have padding issues, see
        # https://github.com/HaiyongJiang/U-Net-Pytorch-Unstructured-Buggy/commit/0e854509c2cea854e247a9c615f175f76fbb2e3a
        # https://github.com/xiaopeng-liao/Pytorch-UNet/commit/8ebac70e633bac59fc22bb5195e513d5832fb3bd
        x = torch.cat([x2, x1], dim=1)
        return self.conv(x)


class OutConv(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(OutConv, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.conv(x)


================================================
FILE: helpers/custom_rlbench_env.py
================================================
from typing import List, Type

import numpy as np
from pyrep.const import RenderMode
from pyrep.errors import ConfigurationPathError, IKError
from pyrep.objects import Dummy, VisionSensor
from rlbench import ActionMode, ObservationConfig
from rlbench.backend.exceptions import InvalidActionError
from rlbench.backend.observation import Observation
from rlbench.backend.task import Task
from yarr.agents.agent import ActResult, TextSummary, VideoSummary
from yarr.envs.rlbench_env import MultiTaskRLBenchEnv, RLBenchEnv
from yarr.utils.observation_type import ObservationElement
from yarr.utils.process_str import change_case
from yarr.utils.transition import Transition


class CustomRLBenchEnv(RLBenchEnv):
    def __init__(
        self,
        task_class: Type[Task],
        observation_config: ObservationConfig,
        action_mode: ActionMode,
        episode_length: int,
        dataset_root: str = "",
        channels_last: bool = False,
        reward_scale=100.0,
        headless: bool = True,
        time_in_state: bool = False,
        include_lang_goal_in_obs: bool = False,
        record_every_n: int = 20,
    ):
        super(CustomRLBenchEnv, self).__init__(
            task_class,
            observation_config,
            action_mode,
            dataset_root,
            channels_last,
            headless=headless,
            include_lang_goal_in_obs=include_lang_goal_in_obs,
        )
        self._reward_scale = reward_scale
        self._episode_index = 0
        self._record_current_episode = False
        self._record_cam = None
        self._previous_obs, self._previous_obs_dict = None, None
        self._recorded_images = []
        self._episode_length = episode_length
        self._time_in_state = time_in_state
        self._record_every_n = record_every_n
        self._i = 0
        self._error_type_counts = {
            "IKError": 0,
            "ConfigurationPathError": 0,
            "InvalidActionError": 0,
        }
        self._last_exception = None

    @property
    def observation_elements(self) -> List[ObservationElement]:
        obs_elems = super(CustomRLBenchEnv, self).observation_elements
        for oe in obs_elems:
            if oe.name == "low_dim_state":
                oe.shape = (
                    oe.shape[0] - 7 * 3 + int(self._time_in_state),
                )  # remove pose and joint velocities as they will not be included
                self.low_dim_state_len = oe.shape[0]
        return obs_elems

    def extract_obs(self, obs: Observation, t=None, prev_action=None):
        joint_vel = obs.joint_velocities
        obs.joint_velocities = None
        grip_mat = obs.gripper_matrix
        grip_pose = obs.gripper_pose
        joint_pos = obs.joint_positions
        obs.gripper_pose = None
        # obs.gripper_pose = None
        obs.gripper_matrix = None
        obs.wrist_camera_matrix = None
        obs.joint_positions = None
        if obs.gripper_joint_positions is not None:
            obs.gripper_joint_positions = np.clip(
                obs.gripper_joint_positions, 0.0, 0.04
            )

        obs_dict = super(CustomRLBenchEnv, self).extract_obs(obs)

        if self._time_in_state:
            time = (
                1.0 - ((self._i if t is None else t) / float(self._episode_length - 1))
            ) * 2.0 - 1.0
            obs_dict["low_dim_state"] = np.concatenate(
                [obs_dict["low_dim_state"], [time]]
            ).astype(np.float32)

        obs.gripper_matrix = grip_mat
        obs.joint_positions = joint_pos
        obs.gripper_pose = grip_pose
        obs.joint_velocities = joint_vel

        obs_dict["gripper_pose"] = grip_pose
        obs_dict["joint_position"] = joint_pos
        obs_dict["proprios"] = obs.get_low_dim_data()
        # obs_dict['gripper_pose'] = grip_pose
        return obs_dict

    def launch(self):
        super(CustomRLBenchEnv, self).launch()
        self._task._scene.register_step_callback(self._my_callback)
        if self.eval:
            cam_placeholder = Dummy("cam_cinematic_placeholder")
            cam_base = Dummy("cam_cinematic_base")
            cam_base.rotate([0, 0, np.pi * 0.75])
            self._record_cam = VisionSensor.create([320, 180])
            self._record_cam.set_explicit_handling(True)
            self._record_cam.set_pose(cam_placeholder.get_pose())
            self._record_cam.set_render_mode(RenderMode.OPENGL)

    def reset(self) -> dict:
        self._i = 0
        var_count = self._task.variation_count()
        rand_var = np.random.randint(0, var_count)
        self._task.set_variation(rand_var)
        self._previous_obs_dict = super(CustomRLBenchEnv, self).reset()
        self._record_current_episode = (
            self.eval and self._episode_index % self._record_every_n == 0
        )
        self._episode_index += 1
        self._recorded_images.clear()
        return self._previous_obs_dict

    def register_callback(self, func):
        self._task._scene.register_step_callback(func)

    def _my_callback(self):
        if self._record_current_episode:
            self._record_cam.handle_explicitly()
            cap = (self._record_cam.capture_rgb() * 255).astype(np.uint8)
            self._recorded_images.append(cap)

    def _append_final_frame(self, success: bool):
        self._record_cam.handle_explicitly()
        img = (self._record_cam.capture_rgb() * 255).astype(np.uint8)
        self._recorded_images.append(img)
        final_frames = np.zeros((10,) + img.shape[:2] + (3,), dtype=np.uint8)
        # Green/red for success/failure
        final_frames[:, :, :, 1 if success else 0] = 255
        self._recorded_images.extend(list(final_frames))

    def step(self, act_result: ActResult) -> Transition:
        action = act_result.action
        success = False
        obs = self._previous_obs_dict  # in case action fails.

        try:
            obs, reward, terminal = self._task.step(action)
            if reward >= 1:
                success = True
                reward *= self._reward_scale
            else:
                reward = 0.0
            obs = self.extract_obs(obs)
            self._previous_obs_dict = obs
        except (IKError, ConfigurationPathError, InvalidActionError) as e:
            terminal = True
            reward = 0.0

            if isinstance(e, IKError):
                self._error_type_counts["IKError"] += 1
            elif isinstance(e, ConfigurationPathError):
                self._error_type_counts["ConfigurationPathError"] += 1
            elif isinstance(e, InvalidActionError):
                self._error_type_counts["InvalidActionError"] += 1

            self._last_exception = e

        summaries = []
        self._i += 1
        if (
            terminal or self._i == self._episode_length
        ) and self._record_current_episode:
            self._append_final_frame(success)
            vid = np.array(self._recorded_images).transpose((0, 3, 1, 2))
            summaries.append(
                VideoSummary(
                    "episode_rollout_" + ("success" if success else "fail"), vid, fps=30
                )
            )

            # error summary
            error_str = (
                f"Errors - IK : {self._error_type_counts['IKError']}, "
                f"ConfigPath : {self._error_type_counts['ConfigurationPathError']}, "
                f"InvalidAction : {self._error_type_counts['InvalidActionError']}"
            )
            if not success and self._last_exception is not None:
                error_str += f"\n Last Exception: {self._last_exception}"
                self._last_exception = None

            summaries.append(
                TextSummary("errors", f"Success: {success} | " + error_str)
            )
        return Transition(obs, reward, terminal, summaries=summaries)

    def _extract_traj_info(self, demo):
        joint_vels = []
        gripper_opens = []
        gripper_poses = []
        joints = []

        for obs in demo._observations:
            joint_vels.append(obs.joint_velocities)
            gripper_opens.append(obs.gripper_open)
            gripper_poses.append(obs.gripper_pose)
            joints.append(obs.joint_positions)

        datas = (
            np.stack(joint_vels),
            np.stack(joints),
            np.stack(gripper_opens),
            np.stack(gripper_poses),
        )
        keys = (
            "traj_joint_vels",
            "traj_joint_positions",
            "traj_gripper_opens",
            "traj_gripper_poses",
        )
        return {k: v for k, v in zip(keys, datas)}

    def reset_to_demo(self, i):
        self._i = 0
        # super(CustomRLBenchEnv, self).reset()

        self._task.set_variation(-1)
        (d,) = self._task.get_demos(
            1, live_demos=False, random_selection=False, from_episode_number=i
        )

        self._task.set_variation(d.variation_number)
        _, obs = self._task.reset_to_demo(d)
        self._lang_goal = self._task.get_task_descriptions()[0]

        self._previous_obs_dict = self.extract_obs(obs)
        self._previous_obs_dict.update(self._extract_traj_info(d))
        self._record_current_episode = (
            self.eval and self._episode_index % self._record_every_n == 0
        )
        self._episode_index += 1
        self._recorded_images.clear()

        return self._previous_obs_dict


class CustomMultiTaskRLBenchEnv(MultiTaskRLBenchEnv):
    def __init__(
        self,
        task_classes: List[Type[Task]],
        observation_config: ObservationConfig,
        action_mode: ActionMode,
        episode_length: int,
        dataset_root: str = "",
        channels_last: bool = False,
        reward_scale=100.0,
        headless: bool = True,
        swap_task_every: int = 1,
        time_in_state: bool = False,
        include_lang_goal_in_obs: bool = False,
        record_every_n: int = 20,
    ):
        super(CustomMultiTaskRLBenchEnv, self).__init__(
            task_classes,
            observation_config,
            action_mode,
            dataset_root,
            channels_last,
            headless=headless,
            swap_task_every=swap_task_every,
            include_lang_goal_in_obs=include_lang_goal_in_obs,
        )
        self._reward_scale = reward_scale
        self._episode_index = 0
        self._record_current_episode = False
        self._record_cam = None
        self._previous_obs, self._previous_obs_dict = None, None
        self._recorded_images = []
        self._episode_length = episode_length
        self._time_in_state = time_in_state
        self._record_every_n = record_every_n
        self._i = 0
        self._error_type_counts = {
            "IKError": 0,
            "ConfigurationPathError": 0,
            "InvalidActionError": 0,
        }
        self._last_exception = None

    @property
    def observation_elements(self) -> List[ObservationElement]:
        obs_elems = super(CustomMultiTaskRLBenchEnv, self).observation_elements
        for oe in obs_elems:
            if oe.name == "low_dim_state":
                oe.shape = (
                    oe.shape[0] - 7 * 3 + int(self._time_in_state),
                )  # remove pose and joint velocities as they will not be included
                self.low_dim_state_len = oe.shape[0]
        return obs_elems

    def extract_obs(self, obs: Observation, t=None, prev_action=None):
        obs.joint_velocities = None
        grip_mat = obs.gripper_matrix
        grip_pose = obs.gripper_pose
        joint_pos = obs.joint_positions
        obs.gripper_pose = None
        # obs.gripper_pose = None
        obs.gripper_matrix = None
        obs.wrist_camera_matrix = None
        obs.joint_positions = None
        if obs.gripper_joint_positions is not None:
            obs.gripper_joint_positions = np.clip(
                obs.gripper_joint_positions, 0.0, 0.04
            )

        obs_dict = super(CustomMultiTaskRLBenchEnv, self).extract_obs(obs)

        if self._time_in_state:
            time = (
                1.0 - ((self._i if t is None else t) / float(self._episode_length - 1))
            ) * 2.0 - 1.0
            obs_dict["low_dim_state"] = np.concatenate(
                [obs_dict["low_dim_state"], [time]]
            ).astype(np.float32)

        obs.gripper_matrix = grip_mat
        # obs.gripper_pose = grip_pose
        obs.joint_positions = joint_pos
        obs.gripper_pose = grip_pose
        # obs_dict['gripper_pose'] = grip_pose
        return obs_dict

    def launch(self):
        super(CustomMultiTaskRLBenchEnv, self).launch()
        self._task._scene.register_step_callback(self._my_callback)
        if self.eval:
            cam_placeholder = Dummy("cam_cinematic_placeholder")
            cam_base = Dummy("cam_cinematic_base")
            cam_base.rotate([0, 0, np.pi * 0.75])
            self._record_cam = VisionSensor.create([320, 180])
            self._record_cam.set_explicit_handling(True)
            self._record_cam.set_pose(cam_placeholder.get_pose())
            self._record_cam.set_render_mode(RenderMode.OPENGL)

    def reset(self) -> dict:
        self._i = 0
        self._previous_obs_dict = super(CustomMultiTaskRLBenchEnv, self).reset()
        self._record_current_episode = (
            self.eval and self._episode_index % self._record_every_n == 0
        )
        self._episode_index += 1
        self._recorded_images.clear()
        return self._previous_obs_dict

    def register_callback(self, func):
        self._task._scene.register_step_callback(func)

    def _my_callback(self):
        if self._record_current_episode:
            self._record_cam.handle_explicitly()
            cap = (self._record_cam.capture_rgb() * 255).astype(np.uint8)
            self._recorded_images.append(cap)

    def _append_final_frame(self, success: bool):
        self._record_cam.handle_explicitly()
        img = (self._record_cam.capture_rgb() * 255).astype(np.uint8)
        self._recorded_images.append(img)
        final_frames = np.zeros((10,) + img.shape[:2] + (3,), dtype=np.uint8)
        # Green/red for success/failure
        final_frames[:, :, :, 1 if success else 0] = 255
        self._recorded_images.extend(list(final_frames))

    def step(self, act_result: ActResult) -> Transition:
        action = act_result.action
        success = False
        obs = self._previous_obs_dict  # in case action fails.

        try:
            obs, reward, terminal = self._task.step(action)
            if reward >= 1:
                success = True
                reward *= self._reward_scale
            else:
                reward = 0.0
            obs = self.extract_obs(obs)
            self._previous_obs_dict = obs
        except (IKError, ConfigurationPathError, InvalidActionError) as e:
            terminal = True
            reward = 0.0

            if isinstance(e, IKError):
                self._error_type_counts["IKError"] += 1
            elif isinstance(e, ConfigurationPathError):
                self._error_type_counts["ConfigurationPathError"] += 1
            elif isinstance(e, InvalidActionError):
                self._error_type_counts["InvalidActionError"] += 1

            self._last_exception = e

        summaries = []
        self._i += 1
        if (
            terminal or self._i == self._episode_length
        ) and self._record_current_episode:
            self._append_final_frame(success)
            vid = np.array(self._recorded_images).transpose((0, 3, 1, 2))
            task_name = change_case(self._task._task.__class__.__name__)
            summaries.append(
                VideoSummary(
                    "episode_rollout_"
                    + ("success" if success else "fail")
                    + f"/{task_name}",
                    vid,
                    fps=30,
                )
            )

            # error summary
            error_str = (
                f"Errors - IK : {self._error_type_counts['IKError']}, "
                f"ConfigPath : {self._error_type_counts['ConfigurationPathError']}, "
                f"InvalidAction : {self._error_type_counts['InvalidActionError']}"
            )
            if not success and self._last_exception is not None:
                error_str += f"\n Last Exception: {self._last_exception}"
                self._last_exception = None

            summaries.append(
                TextSummary("errors", f"Success: {success} | " + error_str)
            )
        return Transition(obs, reward, terminal, summaries=summaries)

    def reset_to_demo(self, i, variation_number=-1):
        if self._episodes_this_task == self._swap_task_every:
            self._set_new_task()
            self._episodes_this_task = 0
        self._episodes_this_task += 1

        self._i = 0
        # super(CustomMultiTaskRLBenchEnv, self).reset()

        # if variation_number == -1:
        #     self._task.sample_variation()
        # else:
        #     self._task.set_variation(variation_number)

        self._task.set_variation(-1)
        d = self._task.get_demos(
            1, live_demos=False, random_selection=False, from_episode_number=i
        )[0]

        self._task.set_variation(d.variation_number)
        _, obs = self._task.reset_to_demo(d)
        self._lang_goal = self._task.get_task_descriptions()[0]

        self._previous_obs_dict = self.extract_obs(obs)
        self._record_current_episode = (
            self.eval and self._episode_index % self._record_every_n == 0
        )
        self._episode_index += 1
        self._recorded_images.clear()

        return self._previous_obs_dict


================================================
FILE: helpers/demo_loading_utils.py
================================================
import logging
from typing import List

import numpy as np
from rlbench.demo import Demo


def _is_stopped(demo, i, obs, stopped_buffer, delta=0.1):
    next_is_not_final = i == (len(demo) - 2)
    gripper_state_no_change = i < (len(demo) - 2) and (
        obs.gripper_open == demo[i + 1].gripper_open
        and obs.gripper_open == demo[i - 1].gripper_open
        and demo[i - 2].gripper_open == demo[i - 1].gripper_open
    )
    small_delta = np.allclose(obs.joint_velocities, 0, atol=delta)
    stopped = (
        stopped_buffer <= 0
        and small_delta
        and (not next_is_not_final)
        and gripper_state_no_change
    )
    return stopped


def keypoint_discovery(demo: Demo, stopping_delta=0.1) -> List[int]:
    episode_keypoints = []
    prev_gripper_open = demo[0].gripper_open
    stopped_buffer = 0
    for i, obs in enumerate(demo):
        stopped = _is_stopped(demo, i, obs, stopped_buffer, stopping_delta)
        stopped_buffer = 4 if stopped else stopped_buffer - 1
        # If change in gripper, or end of episode.
        last = i == (len(demo) - 1)
        if i != 0 and (obs.gripper_open != prev_gripper_open or last or stopped):
            episode_keypoints.append(i)
        prev_gripper_open = obs.gripper_open
    if (
        len(episode_keypoints) > 1
        and (episode_keypoints[-1] - 1) == episode_keypoints[-2]
    ):
        episode_keypoints.pop(-2)
    logging.debug("Found %d keypoints." % len(episode_keypoints), episode_keypoints)
    return episode_keypoints


# def keypoint_discovery(demo: Demo,
#                        stopping_delta=0.1,
#                        method='heuristic') -> List[int]:
#     episode_keypoints = []
#     if method == 'heuristic':
#         prev_gripper_open = demo[0].gripper_open
#         stopped_buffer = 0
#         for i, obs in enumerate(demo):
#             stopped = _is_stopped(demo, i, obs, stopped_buffer, stopping_delta)
#             stopped_buffer = 4 if stopped else stopped_buffer - 1
#             # If change in gripper, or end of episode.
#             last = i == (len(demo) - 1)
#             if i != 0 and (obs.gripper_open != prev_gripper_open or
#                            last or stopped):
#                 episode_keypoints.append(i)
#             prev_gripper_open = obs.gripper_open
#         if len(episode_keypoints) > 1 and (episode_keypoints[-1] - 1) == \
#                 episode_keypoints[-2]:
#             episode_keypoints.pop(-2)
#         logging.debug('Found %d keypoints.' % len(episode_keypoints),
#                       episode_keypoints)
#         return episode_keypoints

#     elif method == 'random':
#         # Randomly select keypoints.
#         episode_keypoints = np.random.choice(
#             range(len(demo)),
#             size=20,
#             replace=False)
#         episode_keypoints.sort()
#         return episode_keypoints

#     elif method == 'fixed_interval':
#         # Fixed interval.
#         episode_keypoints = []
#         segment_length = len(demo) // 20
#         for i in range(0, len(demo), segment_length):
#             episode_keypoints.append(i)
#         return episode_keypoints

#     else:
#         raise NotImplementedError


# find minimum difference between any two elements in list
def find_minimum_difference(lst):
    minimum = lst[-1]
    for i in range(1, len(lst)):
        if lst[i] - lst[i - 1] < minimum:
            minimum = lst[i] - lst[i - 1]
    return minimum


================================================
FILE: helpers/network_utils.py
================================================
import copy
from typing import List, Union

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange, repeat
from einops.layers.torch import Rearrange

LRELU_SLOPE = 0.02


def act_layer(act):
    if act == "relu":
        return nn.ReLU()
    elif act == "lrelu":
        return nn.LeakyReLU(LRELU_SLOPE)
    elif act == "elu":
        return nn.ELU()
    elif act == "tanh":
        return nn.Tanh()
    elif act == "prelu":
        return nn.PReLU()
    else:
        raise ValueError("%s not recognized." % act)


def norm_layer2d(norm, channels):
    if norm == "batch":
        return nn.BatchNorm2d(channels)
    elif norm == "instance":
        return nn.InstanceNorm2d(channels, affine=True)
    elif norm == "layer":
        return nn.GroupNorm(1, channels, affine=True)
    elif norm == "group":
        return nn.GroupNorm(4, channels, affine=True)
    else:
        raise ValueError("%s not recognized." % norm)


def norm_layer1d(norm, num_channels):
    if norm == "batch":
        return nn.BatchNorm1d(num_channels)
    elif norm == "instance":
        return nn.InstanceNorm1d(num_channels, affine=True)
    elif norm == "layer":
        return nn.LayerNorm(num_channels)
    else:
        raise ValueError("%s not recognized." % norm)


class FiLMBlock(nn.Module):
    def __init__(self):
        super(FiLMBlock, self).__init__()

    def forward(self, x, gamma, beta):
        beta = beta.view(x.size(0), x.size(1), 1, 1)
        gamma = gamma.view(x.size(0), x.size(1), 1, 1)

        x = gamma * x + beta

        return x


class Conv2DBlock(nn.Module):
    def __init__(
        self,
        in_channels,
        out_channels,
        kernel_sizes,
        strides,
        norm=None,
        activation=None,
        padding_mode="replicate",
    ):
        super(Conv2DBlock, self).__init__()
        padding = (
            kernel_sizes // 2
            if isinstance(kernel_sizes, int)
            else (kernel_sizes[0] // 2, kernel_sizes[1] // 2)
        )
        self.conv2d = nn.Conv2d(
            in_channels,
            out_channels,
            kernel_sizes,
            strides,
            padding=padding,
            padding_mode=padding_mode,
        )

        if activation is None:
            nn.init.xavier_uniform_(
                self.conv2d.weight, gain=nn.init.calculate_gain("linear")
            )
            nn.init.zeros_(self.conv2d.bias)
        elif activation == "tanh":
            nn.init.xavier_uniform_(
                self.conv2d.weight, gain=nn.init.calculate_gain("tanh")
            )
            nn.init.zeros_(self.conv2d.bias)
        elif activation == "lrelu":
            nn.init.kaiming_uniform_(
                self.conv2d.weight, a=LRELU_SLOPE, nonlinearity="leaky_relu"
            )
            nn.init.zeros_(self.conv2d.bias)
        elif activation == "relu":
            nn.init.kaiming_uniform_(self.conv2d.weight, nonlinearity="relu")
            nn.init.zeros_(self.conv2d.bias)
        else:
            raise ValueError()

        self.activation = None
        self.norm = None
        if norm is not None:
            self.norm = norm_layer2d(norm, out_channels)
        if activation is not None:
            self.activation = act_layer(activation)

    def forward(self, x):
        x = self.conv2d(x)
        x = self.norm(x) if self.norm is not None else x
        x = self.activation(x) if self.activation is not None else x
        return x


class Conv2DFiLMBlock(Conv2DBlock):
    def __init__(
        self,
        in_channels,
        out_channels,
        kernel_sizes,
        strides,
        norm=None,
        activation=None,
        padding_mode="replicate",
    ):
        super(Conv2DFiLMBlock, self).__init__(
            in_channels,
            out_channels,
            kernel_sizes,
            strides,
            norm,
            activation,
            padding_mode,
        )

        self.film = FiLMBlock()

    def forward(self, x, gamma, beta):
        x = self.conv2d(x)
        x = self.norm(x) if self.norm is not None else x
        x = self.film(x, gamma, beta)
        x = self.activation(x) if self.activation is not None else x
        return x


class Conv3DBlock(nn.Module):
    def __init__(
        self,
        in_channels,
        out_channels,
        kernel_sizes: Union[int, list] = 3,
        strides=1,
        norm=None,
        activation=None,
        padding_mode="replicate",
        padding=None,
    ):
        super(Conv3DBlock, self).__init__()
        padding = kernel_sizes // 2 if padding is None else padding
        self.conv3d = nn.Conv3d(
            in_channels,
            out_channels,
            kernel_sizes,
            strides,
            padding=padding,
            padding_mode=padding_mode,
        )

        if activation is None:
            nn.init.xavier_uniform_(
                self.conv3d.weight, gain=nn.init.calculate_gain("linear")
            )
            nn.init.zeros_(self.conv3d.bias)
        elif activation == "tanh":
            nn.init.xavier_uniform_(
                self.conv3d.weight, gain=nn.init.calculate_gain("tanh")
            )
            nn.init.zeros_(self.conv3d.bias)
        elif activation == "lrelu":
            nn.init.kaiming_uniform_(
                self.conv3d.weight, a=LRELU_SLOPE, nonlinearity="leaky_relu"
            )
            nn.init.zeros_(self.conv3d.bias)
        elif activation == "relu":
            nn.init.kaiming_uniform_(self.conv3d.weight, nonlinearity="relu")
            nn.init.zeros_(self.conv3d.bias)
        else:
            raise ValueError()

        self.activation = None
        self.norm = None
        if norm is not None:
            raise NotImplementedError("Norm not implemented.")
        if activation is not None:
            self.activation = act_layer(activation)
        self.out_channels = out_channels

    def forward(self, x):
        x = self.conv3d(x)
        x = self.norm(x) if self.norm is not None else x
        x = self.activation(x) if self.activation is not None else x
        return x


class ConvTranspose3DBlock(nn.Module):
    def __init__(
        self,
        in_channels,
        out_channels,
        kernel_sizes: Union[int, list],
        strides,
        norm=None,
        activation=None,
        padding_mode="zeros",
        padding=None,
    ):
        super(ConvTranspose3DBlock, self).__init__()
        padding = kernel_sizes // 2 if padding is None else padding
        self.conv3d = nn.ConvTranspose3d(
            in_channels,
            out_channels,
            kernel_sizes,
            strides,
            padding=padding,
            padding_mode=padding_mode,
        )

        if activation is None:
            nn.init.xavier_uniform_(
                self.conv3d.weight, gain=nn.init.calculate_gain("linear")
            )
            nn.init.zeros_(self.conv3d.bias)
        elif activation == "tanh":
            nn.init.xavier_uniform_(
                self.conv3d.weight, gain=nn.init.calculate_gain("tanh")
            )
            nn.init.zeros_(self.conv3d.bias)
        elif activation == "lrelu":
            nn.init.kaiming_uniform_(
                self.conv3d.weight, a=LRELU_SLOPE, nonlinearity="leaky_relu"
            )
            nn.init.zeros_(self.conv3d.bias)
        elif activation == "relu":
            nn.init.kaiming_uniform_(self.conv3d.weight, nonlinearity="relu")
            nn.init.zeros_(self.conv3d.bias)
        else:
            raise ValueError()

        self.activation = None
        self.norm = None
        if norm is not None:
            self.norm = norm_layer3d(norm, out_channels)
        if activation is not None:
            self.activation = act_layer(activation)

    def forward(self, x):
        x = self.conv3d(x)
        x = self.norm(x) if self.norm is not None else x
        x = self.activation(x) if self.activation is not None else x
        return x


class Conv2DUpsampleBlock(nn.Module):
    def __init__(
        self,
        in_channels,
        out_channels,
        kernel_sizes,
        strides,
        norm=None,
        activation=None,
    ):
        super(Conv2DUpsampleBlock, self).__init__()
        layer = [
            Conv2DBlock(in_channels, out_channels, kernel_sizes, 1, norm, activation)
        ]
        if strides > 1:
            layer.append(
                nn.Upsample(scale_factor=strides, mode="bilinear", align_corners=False)
            )
        convt_block = Conv2DBlock(
            out_channels, out_channels, kernel_sizes, 1, norm, activation
        )
        layer.append(convt_block)
        self.conv_up = nn.Sequential(*layer)

    def forward(self, x):
        return self.conv_up(x)


class Conv3DUpsampleBlock(nn.Module):
    def __init__(
        self,
        in_channels,
        out_channels,
        strides,
        kernel_sizes=3,
        norm=None,
        activation=None,
    ):
        super(Conv3DUpsampleBlock, self).__init__()
        layer = [
            Conv3DBlock(in_channels, out_channels, kernel_sizes, 1, norm, activation)
        ]
        if strides > 1:
            layer.append(
                nn.Upsample(scale_factor=strides, mode="trilinear", align_corners=False)
            )
        convt_block = Conv3DBlock(
            out_channels, out_channels, kernel_sizes, 1, norm, activation
        )
        layer.append(convt_block)
        self.conv_up = nn.Sequential(*layer)

    def forward(self, x):
        return self.conv_up(x)


class DenseBlock(nn.Module):
    def __init__(self, in_features, out_features, norm=None, activation=None):
        super(DenseBlock, self).__init__()
        self.linear = nn.Linear(in_features, out_features)

        if activation is None:
            nn.init.xavier_uniform_(
                self.linear.weight, gain=nn.init.calculate_gain("linear")
            )
            nn.init.zeros_(self.linear.bias)
        elif activation == "tanh":
            nn.init.xavier_uniform_(
                self.linear.weight, gain=nn.init.calculate_gain("tanh")
            )
            nn.init.zeros_(self.linear.bias)
        elif activation == "lrelu":
            nn.init.kaiming_uniform_(
                self.linear.weight, a=LRELU_SLOPE, nonlinearity="leaky_relu"
            )
            nn.init.zeros_(self.linear.bias)
        elif activation == "relu":
            nn.init.kaiming_uniform_(self.linear.weight, nonlinearity="relu")
            nn.init.zeros_(self.linear.bias)
        else:
            raise ValueError()

        self.activation = None
        self.norm = None
        if norm is not None:
            self.norm = norm_layer1d(norm, out_features)
        if activation is not None:
            self.activation = act_layer(activation)

    def forward(self, x):
        x = self.linear(x)
        x = self.norm(x) if self.norm is not None else x
        x = self.activation(x) if self.activation is not None else x
        return x


class SiameseNet(nn.Module):
    def __init__(
        self,
        input_channels: List[int],
        filters: List[int],
        kernel_sizes: List[int],
        strides: List[int],
        norm: str = None,
        activation: str = "relu",
    ):
        super(SiameseNet, self).__init__()
        self._input_channels = input_channels
        self._filters = filters
        self._kernel_sizes = kernel_sizes
        self._strides = strides
        self._norm = norm
        self._activation = activation
        self.output_channels = filters[-1]  # * len(input_channels)

    def build(self):
        self._siamese_blocks = nn.ModuleList()
        for i, ch in enumerate(self._input_channels):
            blocks = []
            for i, (filt, ksize, stride) in enumerate(
                zip(self._filters, self._kernel_sizes, self._strides)
            ):
                conv_block = Conv2DBlock(
                    ch, filt, ksize, stride, self._norm, self._activation
                )
                blocks.append(conv_block)
            self._siamese_blocks.append(nn.Sequential(*blocks))
        self._fuse = Conv2DBlock(
            self._filters[-1] * len(self._siamese_blocks),
            self._filters[-1],
            1,
            1,
            self._norm,
            self._activation,
        )

    def forward(self, x):
        if len(x) != len(self._siamese_blocks):
            raise ValueError(
                "Expected a list of tensors of size %d." % len(self._siamese_blocks)
            )
        self.streams = [stream(y) for y, stream in zip(x, self._siamese_blocks)]
        y = self._fuse(torch.cat(self.streams, 1))
        return y


class CNNAndFcsNet(nn.Module):
    def __init__(
        self,
        siamese_net: SiameseNet,
        low_dim_state_len: int,
        input_resolution: List[int],
        filters: List[int],
        kernel_sizes: List[int],
        strides: List[int],
        norm: str = None,
        fc_layers: List[int] = None,
        activation: str = "relu",
    ):
        super(CNNAndFcsNet, self).__init__()
        self._siamese_net = copy.deepcopy(siamese_net)
        self._input_channels = self._siamese_net.output_channels + low_dim_state_len
        self._filters = filters
        self._kernel_sizes = kernel_sizes
        self._strides = strides
        self._norm = norm
        self._activation = activation
        self._fc_layers = [] if fc_layers is None else fc_layers
        self._input_resolution = input_resolution

    def build(self):
        self._siamese_net.build()
        layers = []
        channels = self._input_channels
        for i, (filt, ksize, stride) in enumerate(
            list(zip(self._filters, self._kernel_sizes, self._strides))[:-1]
        ):
            layers.append(
                Conv2DBlock(channels, filt, ksize, stride, self._norm, self._activation)
            )
            channels = filt
        layers.append(
            Conv2DBlock(
                channels, self._filters[-1], self._kernel_sizes[-1], self._strides[-1]
            )
        )
        self._cnn = nn.Sequential(*layers)
        self._maxp = nn.AdaptiveMaxPool2d(1)

        channels = self._filters[-1]
        dense_layers = []
        for n in self._fc_layers[:-1]:
            dense_layers.append(DenseBlock(channels, n, activation=self._activation))
            channels = n
        dense_layers.append(DenseBlock(channels, self._fc_layers[-1]))
        self._fcs = nn.Sequential(*dense_layers)

    def forward(self, observations, low_dim_ins):
        x = self._siamese_net(observations)
        _, _, h, w = x.shape
        low_dim_latents = low_dim_ins.unsqueeze(-1).unsqueeze(-1).repeat(1, 1, h, w)
        combined = torch.cat([x, low_dim_latents], dim=1)
        x = self._cnn(combined)
        x = self._maxp(x).squeeze(-1).squeeze(-1)
        return self._fcs(x)


class CNNLangAndFcsNet(nn.Module):
    def __init__(
        self,
        siamese_net: SiameseNet,
        low_dim_state_len: int,
        input_resolution: List[int],
        filters: List[int],
        kernel_sizes: List[int],
        strides: List[int],
        norm: str = None,
        fc_layers: List[int] = None,
        activation: str = "relu",
    ):
        super(CNNLangAndFcsNet, self).__init__()
        self._siamese_net = copy.deepcopy(siamese_net)
        self._input_channels = self._siamese_net.output_channels + low_dim_state_len
        self._filters = filters
        self._kernel_sizes = kernel_sizes
        self._strides = strides
        self._norm = norm
        self._activation = activation
        self._fc_layers = [] if fc_layers is None else fc_layers
        self._input_resolution = input_resolution

        self._lang_feat_dim = 1024

    def build(self):
        self._siamese_net.build()
        layers = []
        channels = self._input_channels

        self.conv1 = Conv2DFiLMBlock(
            channels, self._filters[0], self._kernel_sizes[0], self._strides[0]
        )
        self.gamma1 = nn.Linear(self._lang_feat_dim, self._filters[0])
        self.beta1 = nn.Linear(self._lang_feat_dim, self._filters[0])

        self.conv2 = Conv2DFiLMBlock(
            self._filters[0], self._filters[1], self._kernel_sizes[1], self._strides[1]
        )
        self.gamma2 = nn.Linear(self._lang_feat_dim, self._filters[1])
        self.beta2 = nn.Linear(self._lang_feat_dim, self._filters[1])

        self.conv3 = Conv2DFiLMBlock(
            self._filters[1], self._filters[2], self._kernel_sizes[2], self._strides[2]
        )
        self.gamma3 = nn.Linear(self._lang_feat_dim, self._filters[2])
        self.beta3 = nn.Linear(self._lang_feat_dim, self._filters[2])

        self._maxp = nn.AdaptiveMaxPool2d(1)

        channels = self._filters[-1]
        dense_layers = []
        for n in self._fc_layers[:-1]:
            dense_layers.append(DenseBlock(channels, n, activation=self._activation))
            channels = n
        dense_layers.append(DenseBlock(channels, self._fc_layers[-1]))
        self._fcs = nn.Sequential(*dense_layers)

    def forward(self, observations, low_dim_ins, lang_goal_emb):
        x = self._siamese_net(observations)
        _, _, h, w = x.shape
        low_dim_latents = low_dim_ins.unsqueeze(-1).unsqueeze(-1).repeat(1, 1, h, w)
        combined = torch.cat([x, low_dim_latents], dim=1)

        g1 = self.gamma1(lang_goal_emb)
        b1 = self.beta1(lang_goal_emb)
        x = self.conv1(combined, g1, b1)

        g2 = self.gamma2(lang_goal_emb)
        b2 = self.beta2(lang_goal_emb)
        x = self.conv2(x, g2, b2)

        g3 = self.gamma3(lang_goal_emb)
        b3 = self.beta3(lang_goal_emb)
        x = self.conv3(x, g3, b3)

        x = self._maxp(x).squeeze(-1).squeeze(-1)
        return self._fcs(x)


# helpers


def pair(t):
    return t if isinstance(t, tuple) else (t, t)


# classes


class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn

    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)


class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


class Attention(nn.Module):
    def __init__(self, dim, heads=8, dim_head=64, dropout=0.0):
        super().__init__()
        inner_dim = dim_head * heads
        project_out = not (heads == 1 and dim_head == dim)

        self.heads = heads
        self.scale = dim_head**-0.5

        self.attend = nn.Softmax(dim=-1)
        self.dropout = nn.Dropout(dropout)

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)

        self.to_out = (
            nn.Sequential(nn.Linear(inner_dim, dim), nn.Dropout(dropout))
            if project_out
            else nn.Identity()
        )

    def forward(self, x):
        qkv = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = map(lambda t: rearrange(t, "b n (h d) -> b h n d", h=self.heads), qkv)

        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale

        attn = self.attend(dots)
        attn = self.dropout(attn)

        out = torch.matmul(attn, v)
        out = rearrange(out, "b h n d -> b n (h d)")
        return self.to_out(out)


class Transformer(nn.Module):
    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout=0.0):
        super().__init__()
        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(
                nn.ModuleList(
                    [
                        PreNorm(
                            dim,
                            Attention(
                                dim, heads=heads, dim_head=dim_head, dropout=dropout
                            ),
                        ),
                        PreNorm(dim, FeedForward(dim, mlp_dim, dropout=dropout)),
                    ]
                )
            )

    def forward(self, x):
        for attn, ff in self.layers:
            x = attn(x) + x
            x = ff(x) + x
        return x


# ViT IO implementation adpated for baseline
# Source: https://github.com/lucidrains/vit-pytorch
# License: https://github.com/lucidrains/vit-pytorch/blob/main/LICENSE


class ViT(nn.Module):
    def __init__(
        self,
        *,
        image_size,
        patch_size,
        num_classes,
        dim,
        depth,
        heads,
        mlp_dim,
        pool="cls",
        channels=3,
        dim_head=64,
        dropout=0.0,
        emb_dropout=0.0
    ):
        super().__init__()
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)

        assert (
            image_height % patch_height == 0 and image_width % patch_width == 0
        ), "Image dimensions must be divisible by the patch size."

        self.num_patches_x = image_height // patch_height
        self.num_patches_y = image_width // patch_width
        self.num_patches = self.num_patches_x * self.num_patches_y
        patch_dim = channels * patch_height * patch_width
        assert pool in {
            "cls",
            "mean",
        }, "pool type must be either cls (cls token) or mean (mean pooling)"

        self.to_patch_embedding = nn.Sequential(
            Rearrange(
                "b c (h p1) (w p2) -> b (h w) (p1 p2 c)",
                p1=patch_height,
                p2=patch_width,
            ),
            nn.Linear(patch_dim, dim),
        )

        self.pos_embedding = nn.Parameter(torch.randn(1, self.num_patches + 1, dim))
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.dropout = nn.Dropout(emb_dropout)

        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)

    def forward(self, img):
        x = self.to_patch_embedding(img)
        b, n, _ = x.shape

        cls_tokens = repeat(self.cls_token, "1 1 d -> b 1 d", b=b)
        x = torch.cat((cls_tokens, x), dim=1)
        x += self.pos_embedding[:, : (n + 1)]
        x = self.dropout(x)

        x = self.transformer(x)
        x = x[:, 1:].reshape(b, -1, self.num_patches_x, self.num_patches_y)

        return x


class ViTLangAndFcsNet(nn.Module):
    def __init__(
        self,
        vit: ViT,
        low_dim_state_len: int,
        input_resolution: List[int],
        filters: List[int],
        kernel_sizes: List[int],
        strides: List[int],
        norm: str = None,
        fc_layers: List[int] = None,
        activation: str = "relu",
    ):
        super(ViTLangAndFcsNet, self).__init__()
        self._vit = copy.deepcopy(vit)
        self._input_channels = 64 + low_dim_state_len
        self._filters = filters
        self._kernel_sizes = kernel_sizes
        self._strides = strides
        self._norm = norm
        self._activation = activation
        self._fc_layers = [] if fc_layers is None else fc_layers
        self._input_resolution = input_resolution

        self._lang_feat_dim = 1024

    def build(self):
        layers = []
        channels = self._input_channels

        self.conv1 = Conv2DFiLMBlock(
            channels, self._filters[0], self._kernel_sizes[0], self._strides[0]
        )
        self.gamma1 = nn.Linear(self._lang_feat_dim, self._filters[0])
        self.beta1 = nn.Linear(self._lang_feat_dim, self._filters[0])

        self.conv2 = Conv2DFiLMBlock(
            self._filters[0], self._filters[1], self._kernel_sizes[1], self._strides[1]
        )
        self.gamma2 = nn.Linear(self._lang_feat_dim, self._filters[1])
        self.beta2 = nn.Linear(self._lang_feat_dim, self._filters[1])

        self.conv3 = Conv2DFiLMBlock(
            self._filters[1], self._filters[2], self._kernel_sizes[2], self._strides[2]
        )
        self.gamma3 = nn.Linear(self._lang_feat_dim, self._filters[2])
        self.beta3 = nn.Linear(self._lang_feat_dim, self._filters[2])

        self._maxp = nn.AdaptiveMaxPool2d(1)

        channels = self._filters[-1]
        dense_layers = []
        for n in self._fc_layers[:-1]:
            dense_layers.append(DenseBlock(channels, n, activation=self._activation))
            channels = n
        dense_layers.append(DenseBlock(channels, self._fc_layers[-1]))
        self._fcs = nn.Sequential(*dense_layers)

    def forward(self, observations, low_dim_ins, lang_goal_emb):
        rgb_depth = torch.cat([*observations], dim=1)
        x = self._vit(rgb_depth)
        _, _, h, w = x.shape
        low_dim_latents = low_dim_ins.unsqueeze(-1).unsqueeze(-1).repeat(1, 1, h, w)
        combined = torch.cat([x, low_dim_latents], dim=1)

        g1 = self.gamma1(lang_goal_emb)
        b1 = self.beta1(lang_goal_emb)
        x = self.conv1(combined, g1, b1)

        g2 = self.gamma2(lang_goal_emb)
        b2 = self.beta2(lang_goal_emb)
        x = self.conv2(x, g2, b2)

        g3 = self.gamma3(lang_goal_emb)
        b3 = self.beta3(lang_goal_emb)
        x = self.conv3(x, g3, b3)

        x = self._maxp(x).squeeze(-1).squeeze(-1)
        return self._fcs(x)


class Conv3DInceptionBlockUpsampleBlock(nn.Module):
    def __init__(
        self,
        in_channels,
        out_channels,
        scale_factor,
        norm=None,
        activation=None,
        residual=False,
    ):
        super(Conv3DInceptionBlockUpsampleBlock, self).__init__()
        layer = []

        convt_block = Conv3DInceptionBlock(in_channels, out_channels, norm, activation)
        layer.append(convt_block)

        if scale_factor > 1:
            layer.append(
                nn.Upsample(
                    scale_factor=scale_factor, mode="trilinear", align_corners=False
                )
            )

        convt_block = Conv3DInceptionBlock(out_channels, out_channels, norm, activation)
        layer.append(convt_block)

        self.conv_up = nn.Sequential(*layer)

    def forward(self, x):
        return self.conv_up(x)


class Conv3DInceptionBlock(nn.Module):
    def __init__(
        self, in_channels, out_channels, norm=None, activation=None, residual=False
    ):
        super(Conv3DInceptionBlock, self).__init__()
        self._residual = residual
        cs = out_channels // 4
        assert out_channels % 4 == 0
        latent = 32
        self._1x1conv = Conv3DBlock(
            in_channels,
            cs * 2,
            kernel_sizes=1,
            strides=1,
            norm=norm,
            activation=activation,
        )

        self._1x1conv_a = Conv3DBlock(
            in_channels,
            latent,
            kernel_sizes=1,
            strides=1,
            norm=norm,
            activation=activation,
        )
        self._3x3conv = Conv3DBlock(
            latent, cs, kernel_sizes=3, strides=1, norm=norm, activation=activation
        )

        self._1x1conv_b = Conv3DBlock(
            in_channels,
            latent,
            kernel_sizes=1,
            strides=1,
            norm=norm,
            activation=activation,
        )
        self._5x5_via_3x3conv_a = Conv3DBlock(
            latent, latent, kernel_sizes=3, strides=1, norm=norm, activation=activation
        )
        self._5x5_via_3x3conv_b = Conv3DBlock(
            latent, cs, kernel_sizes=3, strides=1, norm=norm, activation=activation
        )
        self.out_channels = out_channels + (in_channels if residual else 0)

    def forward(self, x):
        yy = []
        if self._residual:
            yy = [x]
        return torch.cat(
            yy
            + [
                self._1x1conv(x),
                self._3x3conv(self._1x1conv_a(x)),
                self._5x5_via_3x3conv_b(self._5x5_via_3x3conv_a(self._1x1conv_b(x))),
            ],
            1,
        )


class ConvTransposeUp3DBlock(nn.Module):
    def __init__(
        self,
        in_channels,
        out_channels,
        strides=2,
        padding=0,
        norm=None,
        activation=None,
        residual=False,
    ):
        super(ConvTransposeUp3DBlock, self).__init__()
        self._residual = residual

        self._1x1conv = Conv3DBlock(
            in_channels,
            out_channels,
            kernel_sizes=1,
            strides=1,
            norm=norm,
            activation=activation,
        )
        self._3x3conv = ConvTranspose3DBlock(
            out_channels,
            out_channels,
            kernel_sizes=2,
            strides=strides,
            norm=norm,
            activation=activation,
            padding=padding,
        )
        self._1x1conv_a = Conv3DBlock(
            out_channels,
            out_channels,
            kernel_sizes=1,
            strides=1,
            norm=norm,
        )
        self.out_channels = out_channels

    def forward(self, x):
        x = self._1x1conv(x)
        x = self._3x3conv(x)
        x = self._1x1conv_a(x)
        return x


class SpatialSoftmax3D(torch.nn.Module):
    def __init__(self, depth, height, width, channel):
        super(SpatialSoftmax3D, self).__init__()
        self.depth = depth
        self.height = height
        self.width = width
        self.channel = channel
        self.temperature = 0.01
        pos_x, pos_y, pos_z = np.meshgrid(
            np.linspace(-1.0, 1.0, self.depth),
            np.linspace(-1.0, 1.0, self.height),
            np.linspace(-1.0, 1.0, self.width),
        )
        pos_x = torch.from_numpy(
            pos_x.reshape(self.depth * self.height * self.width)
        ).float()
        pos_y = torch.from_numpy(
            pos_y.reshape(self.depth * self.height * self.width)
        ).float()
        pos_z = torch.from_numpy(
            pos_z.reshape(self.depth * self.height * self.width)
        ).float()
        self.register_buffer("pos_x", pos_x)
        self.register_buffer("pos_y", pos_y)
        self.register_buffer("pos_z", pos_z)

    def forward(self, feature):
        feature = feature.view(
            -1, self.height * self.width * self.depth
        )  # (B, c*d*h*w)
        softmax_attention = F.softmax(feature / self.temperature, dim=-1)
        expected_x = torch.sum(self.pos_x * softmax_attention, dim=1, keepdim=True)
        expected_y = torch.sum(self.pos_y * softmax_attention, dim=1, keepdim=True)
        expected_z = torch.sum(self.pos_z * softmax_attention, dim=1, keepdim=True)
        expected_xy = torch.cat([expected_x, expected_y, expected_z], 1)
        feature_keypoints = expected_xy.view(-1, self.channel * 3)
        return feature_keypoints


================================================
FILE: helpers/optim/__init__.py
================================================


================================================
FILE: helpers/optim/lamb.py
================================================
"""Lamb optimizer."""

# LAMB optimizer used as is.
# Source: https://github.com/cybertronai/pytorch-lamb
# License: https://github.com/cybertronai/pytorch-lamb/blob/master/LICENSE

import collections
import math

import torch
from torch.optim import Optimizer

# def log_lamb_rs(optimizer: Optimizer, event_writer: SummaryWriter, token_count: int):
#     """Log a histogram of trust ratio scalars in across layers."""
#     results = collections.defaultdict(list)
#     for group in optimizer.param_groups:
#         for p in group['params']:
#             state = optimizer.state[p]
#             for i in ('weight_norm', 'adam_norm', 'trust_ratio'):
#                 if i in state:
#                     results[i].append(state[i])
#
#     for k, v in results.items():
#         event_writer.add_histogram(f'lamb/{k}', torch.tensor(v), token_count)


class Lamb(Optimizer):
    r"""Implements Lamb algorithm.
    It has been proposed in `Large Batch Optimization for Deep Learning: Training BERT in 76 minutes`_.
    Arguments:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float, optional): learning rate (default: 1e-3)
        betas (Tuple[float, float], optional): coefficients used for computing
            running averages of gradient and its square (default: (0.9, 0.999))
        eps (float, optional): term added to the denominator to improve
            numerical stability (default: 1e-8)
        weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
        adam (bool, optional): always use trust ratio = 1, which turns this into
            Adam. Useful for comparison purposes.
    .. _Large Batch Optimization for Deep Learning: Training BERT in 76 minutes:
        https://arxiv.org/abs/1904.00962
    """

    def __init__(
        self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-6, weight_decay=0, adam=False
    ):
        if not 0.0 <= lr:
            raise ValueError("Invalid learning rate: {}".format(lr))
        if not 0.0 <= eps:
            raise ValueError("Invalid epsilon value: {}".format(eps))
        if not 0.0 <= betas[0] < 1.0:
            raise ValueError("Invalid beta parameter at index 0: {}".format(betas[0]))
        if not 0.0 <= betas[1] < 1.0:
            raise ValueError("Invalid beta parameter at index 1: {}".format(betas[1]))
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        self.adam = adam
        super(Lamb, self).__init__(params, defaults)

    def step(self, closure=None):
        """Performs a single optimization step.
        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad.data
                if grad.is_sparse:
                    raise RuntimeError(
                        "Lamb does not support sparse gradients, consider SparseAdam instad."
                    )

                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state["step"] = 0
                    # Exponential moving average of gradient values
                    state["exp_avg"] = torch.zeros_like(p.data)
                    # Exponential moving average of squared gradient values
                    state["exp_avg_sq"] = torch.zeros_like(p.data)

                exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]
                beta1, beta2 = group["betas"]

                state["step"] += 1

                # Decay the first and second moment running average coefficient
                # m_t
                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                # v_t
                exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                # Paper v3 does not use debiasing.
                # bias_correction1 = 1 - beta1 ** state['step']
                # bias_correction2 = 1 - beta2 ** state['step']
                # Apply bias to lr to avoid broadcast.
                step_size = group[
                    "lr"
                ]  # * math.sqrt(bias_correction2) / bias_correction1

                weight_norm = p.data.pow(2).sum().sqrt().clamp(0, 10)

                adam_step = exp_avg / exp_avg_sq.sqrt().add(group["eps"])
                if group["weight_decay"] != 0:
                    adam_step.add_(p.data, alpha=group["weight_decay"])

                adam_norm = adam_step.pow(2).sum().sqrt()
                if weight_norm == 0 or adam_norm == 0:
                    trust_ratio = 1
                else:
                    trust_ratio = weight_norm / adam_norm
                state["weight_norm"] = weight_norm
                state["adam_norm"] = adam_norm
                state["trust_ratio"] = trust_ratio
                if self.adam:
                    trust_ratio = 1

                p.data.add_(adam_step, alpha=-step_size * trust_ratio)

        return loss


================================================
FILE: helpers/preprocess_agent.py
================================================
from typing import List

import torch
from yarr.agents.agent import (
    ActResult,
    Agent,
    HistogramSummary,
    ImageSummary,
    ScalarSummary,
    Summary,
)


class PreprocessAgent(Agent):
    def __init__(self, pose_agent: Agent, norm_rgb: bool = True):
        self._pose_agent = pose_agent
        self._norm_rgb = norm_rgb

    def build(self, training: bool, device: torch.device = None):
        self._pose_agent.build(training, device)

    def _norm_rgb_(self, x):
        return (x.float() / 255.0) * 2.0 - 1.0

    def update(self, step: int, replay_sample: dict) -> dict:
        # Samples are (B, N, ...) where N is number of buffers/tasks. This is a single task setup, so 0 index.
        replay_sample = {
            k: v[:, 0] if len(v.shape) > 2 else v for k, v in replay_sample.items()
        }
        for k, v in replay_sample.items():
            if self._norm_rgb and "rgb" in k:
                replay_sample[k] = self._norm_rgb_(v)
            else:
                replay_sample[k] = v.float()
        self._replay_sample = replay_sample
        return self._pose_agent.update(step, replay_sample)

    def act(self, step: int, observation: dict, deterministic=False) -> ActResult:
        # observation = {k: torch.tensor(v) for k, v in observation.items()}
        for k, v in observation.items():
            if self._norm_rgb and "rgb" in k:
                observation[k] = self._norm_rgb_(v)
            else:
                observation[k] = v.float()
        act_res = self._pose_agent.act(step, observation, deterministic)
        act_res.replay_elements.update({"demo": False})
        return act_res

    def update_summaries(self) -> List[Summary]:
        prefix = "inputs"
        demo_f = self._replay_sample["demo"].float()
        demo_proportion = demo_f.mean()
        tile = lambda x: torch.squeeze(torch.cat(x.split(1, dim=1), dim=-1), dim=1)
        sums = [
            ScalarSummary("%s/demo_proportion" % prefix, demo_proportion),
            HistogramSummary(
                "%s/low_dim_state" % prefix, self._replay_sample["low_dim_state"]
            ),
            HistogramSummary(
                "%s/low_dim_state_tp1" % prefix,
                self._replay_sample["low_dim_state_tp1"],
            ),
            ScalarSummary(
                "%s/low_dim_state_mean" % prefix,
                self._replay_sample["low_dim_state"].mean(),
            ),
            ScalarSummary(
                "%s/low_dim_state_min" % prefix,
                self._replay_sample["low_dim_state"].min(),
            ),
            ScalarSummary(
                "%s/low_dim_state_max" % prefix,
                self._replay_sample["low_dim_state"].max(),
            ),
            ScalarSummary(
                "%s/timeouts" % prefix, self._replay_sample["timeout"].float().mean()
            ),
        ]

        for k, v in self._replay_sample.items():
            if "rgb" in k or "point_cloud" in k:
                if "rgb" in k:
                    # Convert back to 0 - 1
                    v = (v + 1.0) / 2.0
                sums.append(ImageSummary("%s/%s" % (prefix, k), tile(v)))

        if "sampling_probabilities" in self._replay_sample:
            sums.extend(
                [
                    HistogramSummary(
                        "replay/priority", self._replay_sample["sampling_probabilities"]
                    ),
                ]
            )
        sums.extend(self._pose_agent.update_summaries())
        return sums

    def act_summaries(self) -> List[Summary]:
        return self._pose_agent.act_summaries()

    def load_weights(self, savedir: str):
        self._pose_agent.load_weights(savedir)

    def save_weights(self, savedir: str):
        self._pose_agent.save_weights(savedir)

    def reset(self) -> None:
        self._pose_agent.reset()


================================================
FILE: helpers/utils.py
================================================
from typing import List

import numpy as np
import pyrender
import torch
import trimesh
from pyrender.trackball import Trackball
from pyrep.const import RenderMode
from rlbench import CameraConfig, ObservationConfig
from rlbench.backend.const import DEPTH_SCALE
from rlbench.backend.observation import Observation
from scipy.spatial.transform import Rotation

REMOVE_KEYS = [
    "joint_velocities",
    "joint_positions",
    "joint_forces",
    "gripper_open",
    "gripper_pose",
    "gripper_joint_positions",
    "gripper_touch_forces",
    "task_low_dim_state",
    "misc",
]

SCALE_FACTOR = DEPTH_SCALE
DEFAULT_SCENE_SCALE = 2.0


def loss_weights(replay_sample, beta=1.0):
    loss_weights = 1.0
    if "sampling_probabilities" in replay_sample:
        probs = replay_sample["sampling_probabilities"]
        loss_weights = 1.0 / torch.sqrt(probs + 1e-10)
        loss_weights = (loss_weights / torch.max(loss_weights)) ** beta
    return loss_weights


def soft_updates(net, target_net, tau):
    for param, target_param in zip(net.parameters(), target_net.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)


def stack_on_channel(x):
    # expect (B, T, C, ...)
    return torch.cat(torch.split(x, 1, dim=1), dim=2).squeeze(1)


def normalize_quaternion(quat):
    return np.array(quat) / np.linalg.norm(quat, axis=-1, keepdims=True)


def correct_rotation_instability(disc, resolution):
    # q1 = discrete_euler_to_quaternion(disc, resolution)
    # q2 = discrete_euler_to_quaternion(quaternion_to_discrete_euler(q1, resolution), resolution)
    #
    # d2 = quaternion_to_discrete_euler(q2, resolution)
    #
    # # choose the smallest change
    # if np.any(disc != d2):
    #     if np.sum(disc) < np.sum(d2):
    #         return disc
    #     else:
    #         return d2
    return disc


def check_gimbal_lock(pred_rot_and_grip, gt_rot_and_grip, resolution):
    pred_rot_and_grip_np = pred_rot_and_grip.detach().cpu().numpy()
    gt_rot_and_grip_np = gt_rot_and_grip.detach().cpu().numpy()

    pred_rot = discrete_euler_to_quaternion(pred_rot_and_grip_np[:, :3], resolution)
    gt_rot = discrete_euler_to_quaternion(gt_rot_and_grip_np[:, :3], resolution)
    gimbal_lock_matches = [
        np.all(np.abs(pred_rot[i] - gt_rot[i]) < 1e-10)
        and np.any(pred_rot_and_grip_np[i, :3] != gt_rot_and_grip_np[i, :3])
        for i in range(pred_rot.shape[0])
    ]
    return 0


def quaternion_to_discrete_euler(quaternion, resolution):
    euler = Rotation.from_quat(quaternion).as_euler("xyz", degrees=True) + 180
    assert np.min(euler) >= 0 and np.max(euler) <= 360
    disc = np.around((euler / resolution)).astype(int)
    disc[disc == int(360 / resolution)] = 0
    return disc


def discrete_euler_to_quaternion(discrete_euler, resolution):
    euluer = (discrete_euler * resolution) - 180
    return Rotation.from_euler("xyz", euluer, degrees=True).as_quat()


def point_to_voxel_index(
    point: np.ndarray, voxel_size: np.ndarray, coord_bounds: np.ndarray
):
    bb_mins = np.array(coord_bounds[0:3])
    bb_maxs = np.array(coord_bounds[3:])
    dims_m_one = np.array([voxel_size] * 3) - 1
    bb_ranges = bb_maxs - bb_mins
    res = bb_ranges / (np.array([voxel_size] * 3) + 1e-12)
    voxel_indicy = np.minimum(
        np.floor((point - bb_mins) / (res + 1e-12)).astype(np.int32), dims_m_one
    )
    return voxel_indicy


def voxel_index_to_point(
    voxel_index: torch.Tensor, voxel_size: int, coord_bounds: np.ndarray
):
    res = (coord_bounds[:, 3:] - coord_bounds[:, :3]) / voxel_size
    points = (voxel_index * res) + coord_bounds[:, :3]
    return points


def point_to_pixel_index(
    point: np.ndarray, extrinsics: np.ndarray, intrinsics: np.ndarray
):
    point = np.array([point[0], point[1], point[2], 1])
    world_to_cam = np.linalg.inv(extrinsics)
    point_in_cam_frame = world_to_cam.dot(point)
    px, py, pz = point_in_cam_frame[:3]
    px = 2 * intrinsics[0, 2] - int(-intrinsics[0, 0] * (px / pz) + intrinsics[0, 2])
    py = 2 * intrinsics[1, 2] - int(-intrinsics[1, 1] * (py / pz) + intrinsics[1, 2])
    return px, py


def _compute_initial_camera_pose(scene):
    # Adapted from:
    # https://github.com/mmatl/pyrender/blob/master/pyrender/viewer.py#L1032
    centroid = scene.centroid
    scale = scene.scale
    if scale == 0.0:
        scale = DEFAULT_SCENE_SCALE
    s2 = 1.0 / np.sqrt(2.0)
    cp = np.eye(4)
    cp[:3, :3] = np.array([[0.0, -s2, s2], [1.0, 0.0, 0.0], [0.0, s2, s2]])
    hfov = np.pi / 6.0
    dist = scale / (2.0 * np.tan(hfov))
    cp[:3, 3] = dist * np.array([1.0, 0.0, 1.0]) + centroid
    return cp


def _from_trimesh_scene(trimesh_scene, bg_color=None, ambient_light=None):
    # convert trimesh geometries to pyrender geometries
    geometries = {
        name: pyrender.Mesh.from_trimesh(geom, smooth=False)
        for name, geom in trimesh_scene.geometry.items()
    }
    # create the pyrender scene object
    scene_pr = pyrender.Scene(bg_color=bg_color, ambient_light=ambient_light)
    # add every node with geometry to the pyrender scene
    for node in trimesh_scene.graph.nodes_geometry:
        pose, geom_name = trimesh_scene.graph[node]
        scene_pr.add(geometries[geom_name], pose=pose)
    return scene_pr


def _create_bounding_box(scene, voxel_size, res):
    l = voxel_size * res
    T = np.eye(4)
    w = 0.01
    for trans in [[0, 0, l / 2], [0, l, l / 2], [l, l, l / 2], [l, 0, l / 2]]:
        T[:3, 3] = np.array(trans) - voxel_size / 2
        scene.add_geometry(
            trimesh.creation.box([w, w, l], T, face_colors=[0, 0, 0, 255])
        )
    for trans in [[l / 2, 0, 0], [l / 2, 0, l], [l / 2, l, 0], [l / 2, l, l]]:
        T[:3, 3] = np.array(trans) - voxel_size / 2
        scene.add_geometry(
            trimesh.creation.box([l, w, w], T, face_colors=[0, 0, 0, 255])
        )
    for trans in [[0, l / 2, 0], [0, l / 2, l], [l, l / 2, 0], [l, l / 2, l]]:
        T[:3, 3] = np.array(trans) - voxel_size / 2
        scene.add_geometry(
            trimesh.creation.box([w, l, w], T, face_colors=[0, 0, 0, 255])
        )


def create_voxel_scene(
    voxel_grid: np.ndarray,
    q_attention: np.ndarray = None,
    highlight_coordinate: np.ndarray = None,
    highlight_gt_coordinate: np.ndarray = None,
    highlight_alpha: float = 1.0,
    voxel_size: float = 0.1,
    show_bb: bool = False,
    alpha: float = 0.5,
):
    _, d, h, w = voxel_grid.shape
    v = voxel_grid.transpose((1, 2, 3, 0))
    occupancy = v[:, :, :, -1] != 0
    alpha = np.expand_dims(np.full_like(occupancy, alpha, dtype=np.float32), -1)
    rgb = np.concatenate([(v[:, :, :, 3:6] + 1) / 2.0, alpha], axis=-1)

    if q_attention is not None:
        q = np.max(q_attention, 0)
        q = q / np.max(q)
        show_q = q > 0.75
        occupancy = (show_q + occupancy).astype(bool)
        q = np.expand_dims(q - 0.5, -1)  # Max q can be is 0.9
        q_rgb = np.concatenate(
            [q, np.zeros_like(q), np.zeros_like(q), np.clip(q, 0, 1)], axis=-1
        )
        rgb = np.where(np.expand_dims(show_q, -1), q_rgb, rgb)

    if highlight_coordinate is not None:
        x, y, z = highlight_coordinate
        occupancy[x, y, z] = True
        rgb[x, y, z] = [1.0, 0.0, 0.0, highlight_alpha]

    if highlight_gt_coordinate is not None:
        x, y, z = highlight_gt_coordinate
        occupancy[x, y, z] = True
        rgb[x, y, z] = [0.0, 0.0, 1.0, highlight_alpha]

    transform = trimesh.transformations.scale_and_translate(
        scale=voxel_size, translate=(0.0, 0.0, 0.0)
    )
    trimesh_voxel_grid = trimesh.voxel.VoxelGrid(
        encoding=occupancy, transform=transform
    )
    geometry = trimesh_voxel_grid.as_boxes(colors=rgb)
    scene = trimesh.Scene()
    scene.add_geometry(geometry)
    if show_bb:
        assert d == h == w
        _create_bounding_box(scene, voxel_size, d)
    return scene


def visualise_voxel(
    voxel_grid: np.ndarray,
    q_attention: np.ndarray = None,
    highlight_coordinate: np.ndarray = None,
    highlight_gt_coordinate: np.ndarray = None,
    highlight_alpha: float = 1.0,
    rotation_amount: float = 0.0,
    show: bool = False,
    voxel_size: float = 0.1,
    offscreen_renderer: pyrender.OffscreenRenderer = None,
    show_bb: bool = False,
    alpha: float = 0.5,
):
    scene = create_voxel_scene(
        voxel_grid,
        q_attention,
        highlight_coordinate,
        highlight_gt_coordinate,
        highlight_alpha,
        voxel_size,
        show_bb,
        alpha,
    )
    if show:
        scene.show()
    else:
        r = offscreen_renderer or pyrender.OffscreenRenderer(
            viewport_width=640, viewport_height=480, point_size=1.0
        )
        s = _from_trimesh_scene(
            scene, ambient_light=[0.8, 0.8, 0.8], bg_color=[1.0, 1.0, 1.0]
        )
        cam = pyrender.PerspectiveCamera(
            yfov=np.pi / 4.0, aspectRatio=r.viewport_width / r.viewport_height
        )
        p = _compute_initial_camera_pose(s)
        t = Trackball(p, (r.viewport_width, r.viewport_height), s.scale, s.centroid)
        t.rotate(rotation_amount, np.array([0.0, 0.0, 1.0]))
        s.add(cam, pose=t.pose)
        color, depth = r.render(s)
        return color.copy()


def preprocess(img, dist="transporter"):
    """Pre-process input (subtract mean, divide by std)."""

    transporter_color_mean = [0.18877631, 0.18877631, 0.18877631]
    transporter_color_std = [0.07276466, 0.07276466, 0.07276466]
    transporter_depth_mean = 0.00509261
    transporter_depth_std = 0.00903967

    franka_color_mean = [0.622291933, 0.628313992, 0.623031488]
    franka_color_std = [0.168154213, 0.17626014, 0.184527364]
    franka_depth_mean = 0.872146842
    franka_depth_std = 0.195743116

    clip_color_mean = [0.48145466, 0.4578275, 0.40821073]
    clip_color_std = [0.26862954, 0.26130258, 0.27577711]

    # choose distribution
    if dist == "clip":
        color_mean = clip_color_mean
        color_std = clip_color_std
    elif dist == "franka":
        color_mean = franka_color_mean
        color_std = franka_color_std
    else:
        color_mean = transporter_color_mean
        color_std = transporter_color_std

    if dist == "franka":
        depth_mean = franka_depth_mean
        depth_std = franka_depth_std
    else:
        depth_mean = transporter_depth_mean
        depth_std = transporter_depth_std

    # convert to pytorch tensor (if required)
    if type(img) == torch.Tensor:

        def cast_shape(stat, img):
            tensor = torch.from_numpy(np.array(stat)).to(
                device=img.device, dtype=img.dtype
            )
            tensor = tensor.unsqueeze(0).unsqueeze(-1).unsqueeze(-1)
            tensor = tensor.repeat(img.shape[0], 1, img.shape[-2], img.shape[-1])
            return tensor

        color_mean = cast_shape(color_mean, img)
        color_std = cast_shape(color_std, img)
        depth_mean = cast_shape(depth_mean, img)
        depth_std = cast_shape(depth_std, img)

        # normalize
        img = img.clone()
        img[:, :3, :, :] = (img[:, :3, :, :] / 255 - color_mean) / color_std
        img[:, 3:, :, :] = (img[:, 3:, :, :] - depth_mean) / depth_std
    else:
        # normalize
        img[:, :, :3] = (img[:, :, :3] / 255 - color_mean) / color_std
        img[:, :, 3:] = (img[:, :, 3:] - depth_mean) / depth_std
    return img


def rand_dist(size, min=-1.0, max=1.0):
    return (max - min) * torch.rand(size) + min


def rand_discrete(size, min=0, max=1):
    if min == max:
        return torch.zeros(size)
    return torch.randint(min, max + 1, size)


def split_list(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i : i + n]


def extract_obs(
    obs: Observation,
    cameras,
    t: int = 0,
    prev_action=None,
    channels_last: bool = False,
    episode_length: int = 10,
):
    obs.joint_velocities = None
    grip_mat = obs.gripper_matrix
    grip_pose = obs.gripper_pose
    joint_pos = obs.joint_positions
    obs.gripper_pose = None
    obs.gripper_matrix = None
    obs.wrist_camera_matrix = None
    obs.joint_positions = None
    if obs.gripper_joint_positions is not None:
        obs.gripper_joint_positions = np.clip(obs.gripper_joint_positions, 0.0, 0.04)

    obs_dict = vars(obs)
    obs_dict = {k: v for k, v in obs_dict.items() if v is not None}
    robot_state = np.array([obs.gripper_open, *obs.gripper_joint_positions])
    # remove low-level proprioception variables that are not needed
    obs_dict = {k: v for k, v in obs_dict.items() if k not in REMOVE_KEYS}

    if not channels_last:
        # swap channels from last dim to 1st dim
        obs_dict = {
            k: np.transpose(v, [2, 0, 1]) if v.ndim == 3 else np.expand_dims(v, 0)
            for k, v in obs_dict.items()
            if type(v) == np.ndarray or type(v) == list
        }
    else:
        # add extra dim to depth data
        obs_dict = {
            k: v if v.ndim == 3 else np.expand_dims(v, -1) for k, v in obs_dict.items()
        }
    obs_dict["low_dim_state"] = np.array(robot_state, dtype=np.float32)

    # binary variable indicating if collisions are allowed or not while planning paths to reach poses
    # obs_dict['ignore_collisions'] = np.array([obs.ignore_collisions], dtype=np.float32)
    obs_dict["ignore_collisions"] = np.array([0], dtype=np.float32)
    for k, v in [(k, v) for k, v in obs_dict.items() if "point_cloud" in k]:
        obs_dict[k] = v.astype(np.float32)

    for camera_name in cameras:
        obs_dict["%s_camera_extrinsics" % camera_name] = obs.misc[
            "%s_camera_extrinsics" % camera_name
        ]
        obs_dict["%s_camera_intrinsics" % camera_name] = obs.misc[
            "%s_camera_intrinsics" % camera_name
        ]

    # add timestep to low_dim_state
    time = (1.0 - (t / float(episode_length - 1))) * 2.0 - 1.0
    obs_dict["low_dim_state"] = np.concatenate(
        [obs_dict["low_dim_state"], [time]]
    ).astype(np.float32)

    obs.gripper_matrix = grip_mat
    obs.joint_positions = joint_pos
    obs.gripper_pose = grip_pose

    return obs_dict


def create_obs_config(
    camera_names: List[str], camera_resolution: List[int], method_name: str
):
    unused_cams = CameraConfig()
    unused_cams.set_all(False)
    used_cams = CameraConfig(
        rgb=True,
        point_cloud=True,
        mask=False,
        depth=False,
        image_size=camera_resolution,
        render_mode=RenderMode.OPENGL,
    )

    cam_obs = []
    kwargs = {}
    for n in camera_names:
        kwargs[n] = used_cams
        cam_obs.append("%s_rgb" % n)
        cam_obs.append("%s_pointcloud" % n)

    # Some of these obs are only used for keypoint detection.
    obs_config = ObservationConfig(
        front_camera=kwargs.get("front", unused_cams),
        left_shoulder_camera=kwargs.get("left_shoulder", unused_cams),
        right_shoulder_camera=kwargs.get("right_shoulder", unused_cams),
        wrist_camera=kwargs.get("wrist", unused_cams),
        overhead_camera=kwargs.get("overhead", unused_cams),
        joint_forces=False,
        joint_positions=True,
        joint_velocities=True,
        task_low_dim_state=False,
        gripper_touch_forces=False,
        gripper_pose=True,
        gripper_open=True,
        gripper_matrix=True,
        gripper_joint_positions=True,
    )
    return obs_config


def get_device(gpu):
    if gpu is not None and gpu >= 0 and torch.cuda.is_available():
        device = torch.device("cuda:%d" % gpu)
        torch.backends.cudnn.enabled = torch.backends.cudnn.benchmark = True
    else:
        device = torch.device("cpu")
    return device


================================================
FILE: panda_urdf/meshes/Pandagripper_coll_1.dae
================================================
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<COLLADA xmlns="http://www.collada.org/2005/11/COLLADASchema" version="1.4.1">
  <asset>
    <contributor>
      <author>Assimp</author>
      <authoring_tool>Assimp Exporter</authoring_tool>
    </contributor>
    <created>2023-09-26T11:24:21</created>
    <modified>2023-09-26T11:24:21</modified>
    <unit name="meter" meter="1" />
    <up_axis>Y_UP</up_axis>
  </asset>
  <library_effects>
    <effect id="m0mat-fx" name="m0mat">
      <profile_COMMON>
        <technique sid="standard">
          <phong>
            <emission>
              <color sid="emission">0   0   0   4.5895327e-41</color>
            </emission>
            <ambient>
              <color sid="ambient">0.73463768   0.65503019   0.69176537   4.5893926e-41</color>
            </ambient>
            <diffuse>
              <color sid="diffuse">0.73463768   0.65503019   0.69176537   4.5893926e-41</color>
            </diffuse>
            <specular>
              <color sid="specular">0.25   0.25   0.25   -3.0261091e+37</color>
            </specular>
          </phong>
        </technique>
      </profile_COMMON>
    </effect>
  </library_effects>
  <library_materials>
    <material id="m0mat" name="m0mat">
      <instance_effect url="#m0mat-fx"/>
    </material>
  </library_materials>
  <library_geometries>
    <geometry id="meshId0" name="meshId0_name" >
      <mesh>
        <source id="meshId0-positions" name="meshId0-positions">
          <float_array id="meshId0-positions-array" count="1800"> 0.016744969 0.091910884 0.044413116 0.017144307 -0.087489262 0.043595828 0.031599864 0.0012233835 -0.013804421 -0.017043602 -0.093742862 0.00210264 -0.017500317 -0.09612567 0.031647604 -0.031587429 -0.00020460619 -0.018865729 -0.02597091 0.077551603 -0.025100973 -0.031587429 -0.00020460619 -0.018865729 -0.031625845 0.0012031699 -0.014045588 0.017043227 -0.095020331 0.010303484 0.031625856 -7.8961224e-05 -0.01883894 0.017595161 -0.095625103 0.031966094 0.031625856 -7.8961224e-05 -0.01883894 0.031599864 0.0012233835 -0.013804421 0.017595161 -0.095625103 0.031966094 0.018565923 0.096383899 0.036598537 0.016744969 0.091910884 0.044413116 0.031599864 0.0012233835 -0.013804421 -0.016794099 -0.09012422 -0.013723081 -0.031587429 -0.00020460619 -0.018865729 -0.016802166 -0.085751005 -0.021225022 -0.031625845 0.0012031699 -0.014045588 -0.031587429 -0.00020460619 -0.018865729 -0.017500317 -0.09612567 0.031647604 -0.017043602 -0.093742862 0.00210264 -0.031587429 -0.00020460619 -0.018865729 -0.016794099 -0.09012422 -0.013723081 0.016240167 -0.097755574 0.03156554 -3.1015581e-05 -0.10220782 -0.014327673 -7.2011758e-06 -0.10196081 -0.029217854 -0.003606698 -0.099375933 -0.032628879 -0.003419484 -0.10081596 -0.028982848 -0.016794099 -0.09012422 -0.013723081 -0.025260285 0.083847016 -0.014821704 -0.02597091 0.077551603 -0.025100973 -0.031625845 0.0012031699 -0.014045588 -0.028000791 0.04356619 -0.026415108 -0.031587429 -0.00020460619 -0.018865729 -0.02597091 0.077551603 -0.025100973 -0.00010250774 0.094590455 0.045943402 -0.01255665 0.095494457 0.045149717 -0.014411896 0.093058422 0.045670982 -0.017244603 -0.088918351 0.043063443 -0.015461653 -0.088924363 0.045171533 -0.016757177 0.092034169 0.044313449 0.016965598 -0.089821011 -0.014455725 0.031625856 -7.8961224e-05 -0.01883894 0.017043227 -0.095020331 0.010303484 0.017144307 -0.087489262 0.043595828 0.01761722 -0.092065617 0.038429931 0.031599864 0.0012233835 -0.013804421 0.01761722 -0.092065617 0.038429931 0.017595161 -0.095625103 0.031966094 0.031599864 0.0012233835 -0.013804421 0.012515741 -0.098405562 0.036875598 -3.1015581e-05 -0.10220782 -0.014327673 0.016240167 -0.097755574 0.03156554 0.015560988 0.10198174 0.034653842 0.019278891 0.10027551 0.024922986 0.01622845 0.10206716 0.025851013 0.001927514 -0.043738071 -0.045678042 0.028025156 0.043824237 -0.026368296 0.0038204142 -0.081184313 -0.044467907 0.020680942 0.095473781 0.023289545 0.031599864 0.0012233835 -0.013804421 0.025458017 0.082962401 -0.016518129 -0.031587429 -0.00020460619 -0.018865729 -0.0036267391 -0.081804037 -0.044624828 -0.016522622 -0.080454387 -0.025136685 -0.017244603 -0.088918351 0.043063443 -0.031625845 0.0012031699 -0.014045588 -0.017500317 -0.09612567 0.031647604 -0.014007059 -0.098309167 0.036008392 -0.016003072 -0.097846687 0.029282873 -3.1015581e-05 -0.10220782 -0.014327673 -0.003606698 -0.099375933 -0.032628879 -0.016802166 -0.085751005 -0.021225022 -0.0032050153 -0.085984111 -0.04390296 -0.00010250774 0.094590455 0.045943402 -0.014411896 0.093058422 0.045670982 -0.012977874 -0.090713792 0.045693383 -0.012977874 -0.090713792 0.045693383 -0.00049558369 -0.091163352 0.045934543 -0.00010250774 0.094590455 0.045943402 0.013549642 0.094454102 0.045669131 -0.00010250774 0.094590455 0.045943402 -0.00049558369 -0.091163352 0.045934543 -0.00049558369 -0.091163352 0.045934543 0.013434866 -0.090145044 0.04571275 0.013549642 0.094454102 0.045669131 -0.025260285 0.083847016 -0.014821704 -0.031625845 0.0012031699 -0.014045588 -0.019219158 0.09904252 0.030569669 0.017043227 -0.095020331 0.010303484 0.017595161 -0.095625103 0.031966094 0.016240167 -0.097755574 0.03156554 0.0038204142 -0.081184313 -0.044467907 0.031625856 -7.8961224e-05 -0.01883894 0.01656829 -0.083350286 -0.023891302 0.0035048416 -0.10093375 -0.028047891 0.016965598 -0.089821011 -0.014455725 0.017043227 -0.095020331 0.010303484 0.031625856 -7.8961224e-05 -0.01883894 0.0038204142 -0.081184313 -0.044467907 0.028025156 0.043824237 -0.026368296 0.018565923 0.096383899 0.036598537 0.020680942 0.095473781 0.023289545 0.019278891 0.10027551 0.024922986 0.020680942 0.095473781 0.023289545 0.018565923 0.096383899 0.036598537 0.031599864 0.0012233835 -0.013804421 0.020680942 0.095473781 0.023289545 0.025458017 0.082962401 -0.016518129 0.023879232 0.092183307 -0.015650505 0.020680942 0.095473781 0.023289545 0.022717874 0.095517464 -0.005678724 0.019278891 0.10027551 0.024922986 0.022717874 0.095517464 -0.005678724 0.020680942 0.095473781 0.023289545 0.023879232 0.092183307 -0.015650505 0.031625856 -7.8961224e-05 -0.01883894 0.028025156 0.043824237 -0.026368296 0.026044382 0.077079467 -0.025049575 0.031625856 -7.8961224e-05 -0.01883894 0.026044382 0.077079467 -0.025049575 0.025458017 0.082962401 -0.016518129 0.031625856 -7.8961224e-05 -0.01883894 0.025458017 0.082962401 -0.016518129 0.031599864 0.0012233835 -0.013804421 0.024826827 0.076159433 -0.027174426 0.028025156 0.043824237 -0.026368296 0.001927514 -0.043738071 -0.045678042 -0.0014900707 -0.04408798 -0.045943413 -0.00082369358 0.08204186 -0.027340455 0.022969175 0.079683267 -0.02741777 -0.00082369358 0.08204186 -0.027340455 -0.0014900707 -0.04408798 -0.045943413 -0.022751052 0.081956252 -0.026928281 0.022969175 0.079683267 -0.02741777 0.024826827 0.076159433 -0.027174426 0.001927514 -0.043738071 -0.045678042 -0.0036267391 -0.081804037 -0.044624828 -0.031587429 -0.00020460619 -0.018865729 -0.028000791 0.04356619 -0.026415108 4.1883774e-05 -0.097669549 -0.037179761 -0.003606698 -0.099375933 -0.032628879 -0.0032050153 -0.085984111 -0.04390296 -0.016522622 -0.080454387 -0.025136685 -0.0036267391 -0.081804037 -0.044624828 -0.0032050153 -0.085984111 -0.04390296 -0.016802166 -0.085751005 -0.021225022 -0.016522622 -0.080454387 -0.025136685 -0.0032050153 -0.085984111 -0.04390296 -0.016802166 -0.085751005 -0.021225022 -0.031587429 -0.00020460619 -0.018865729 -0.016522622 -0.080454387 -0.025136685 -0.016622087 -0.09677203 0.034715377 -0.017244603 -0.088918351 0.043063443 -0.017500317 -0.09612567 0.031647604 4.1883774e-05 -0.097669549 -0.037179761 0.0036121989 -0.099412799 -0.032646187 -7.2011758e-06 -0.10196081 -0.029217854 -0.016003072 -0.097846687 0.029282873 -0.016622087 -0.09677203 0.034715377 -0.017500317 -0.09612567 0.031647604 -3.1015581e-05 -0.10220782 -0.014327673 -0.016003072 -0.097846687 0.029282873 -7.2011758e-06 -0.10196081 -0.029217854 -0.0016066673 0.099253021 0.042288579 -0.00010250774 0.094590455 0.045943402 0.013549642 0.094454102 0.045669131 -0.00010250774 0.094590455 0.045943402 -0.0016066673 0.099253021 0.042288579 -0.01255665 0.095494457 0.045149717 0.016744969 0.091910884 0.044413116 0.013549642 0.094454102 0.045669131 0.015728435 -0.089997612 0.044893976 0.013434866 -0.090145044 0.04571275 0.015728435 -0.089997612 0.044893976 0.013549642 0.094454102 0.045669131 -0.017244603 -0.088918351 0.043063443 -0.018438619 0.093797483 0.038374417 -0.031625845 0.0012031699 -0.014045588 -0.018438619 0.093797483 0.038374417 -0.017244603 -0.088918351 0.043063443 -0.016757177 0.092034169 0.044313449 -0.018438619 0.093797483 0.038374417 -0.019219158 0.09904252 0.030569669 -0.031625845 0.0012031699 -0.014045588 -0.00061691785 -0.086192869 -0.044980168 4.1883774e-05 -0.097669549 -0.037179761 -0.0032050153 -0.085984111 -0.04390296 0.0030860777 -0.086191021 -0.043997291 0.0038204142 -0.081184313 -0.044467907 0.01656829 -0.083350286 -0.023891302 0.015728435 -0.089997612 0.044893976 0.017144307 -0.087489262 0.043595828 0.016744969 0.091910884 0.044413116 0.01622845 0.10206716 0.025851013 0.019278891 0.10027551 0.024922986 0.019812118 0.098792978 0.00089522492 0.0099516101 0.10181669 0.038322039 -0.0016066673 0.099253021 0.042288579 0.013549642 0.094454102 0.045669131 0.019278891 0.10027551 0.024922986 0.017139517 0.10084917 0.034286406 0.018565923 0.096383899 0.036598537 -0.00030405712 0.096397251 -0.012444094 -5.8770747e-06 0.10027799 0.0070630368 0.019812118 0.098792978 0.00089522492 0.026044382 0.077079467 -0.025049575 0.028025156 0.043824237 -0.026368296 0.024826827 0.076159433 -0.027174426 -0.00032724603 0.088343509 -0.024182025 0.022324363 0.09003187 -0.02195194 0.022387056 0.085137084 -0.025544848 -0.021314884 0.092708796 -0.019038951 -0.022722701 0.08721637 -0.02413908 -0.024485176 0.087674245 -0.021851478 0.0018528663 -0.082220443 -0.045543469 -0.0014900707 -0.04408798 -0.045943413 0.001927514 -0.043738071 -0.045678042 -0.0014900707 -0.04408798 -0.045943413 -0.024640171 0.07844989 -0.026950866 -0.022751052 0.081956252 -0.026928281 -0.017043602 -0.093742862 0.00210264 -0.016003072 -0.097846687 0.029282873 -0.017500317 -0.09612567 0.031647604 0.0035048416 -0.10093375 -0.028047891 0.016240167 -0.097755574 0.03156554 -7.2011758e-06 -0.10196081 -0.029217854 4.1883774e-05 -0.097669549 -0.037179761 -7.2011758e-06 -0.10196081 -0.029217854 -0.003606698 -0.099375933 -0.032628879 -0.02597091 0.077551603 -0.025100973 -0.025260285 0.083847016 -0.014821704 -0.024485176 0.087674245 -0.021851478 -0.014411896 0.093058422 0.045670982 -0.015461653 -0.088924363 0.045171533 -0.012977874 -0.090713792 0.045693383 -0.016757177 0.092034169 0.044313449 -0.015461653 -0.088924363 0.045171533 -0.014411896 0.093058422 0.045670982 -0.00049558369 -0.091163352 0.045934543 0.01314879 -0.092280924 0.044664629 0.013434866 -0.090145044 0.04571275 -0.014007059 -0.098309167 0.036008392 -3.1015581e-05 -0.10220782 -0.014327673 -0.0093138115 -0.098200686 0.038470645 -0.019056723 0.099479057 0.004608721 -5.8770747e-06 0.10027799 0.0070630368 -0.00030405712 0.096397251 -0.012444094 -0.0016066673 0.099253021 0.042288579 -0.0044843247 0.10220783 0.038100332 -0.01255665 0.095494457 0.045149717 0.01622845 0.10206716 0.025851013 -0.014613034 0.10195017 0.035716157 -0.0044843247 0.10220783 0.038100332 0.01656829 -0.083350286 -0.023891302 0.016965598 -0.089821011 -0.014455725 0.0036121989 -0.099412799 -0.032646187 0.0036121989 -0.099412799 -0.032646187 0.016965598 -0.089821011 -0.014455725 0.0035048416 -0.10093375 -0.028047891 0.0036121989 -0.099412799 -0.032646187 0.0035048416 -0.10093375 -0.028047891 -7.2011758e-06 -0.10196081 -0.029217854 0.015728435 -0.089997612 0.044893976 0.015814964 -0.097634315 0.03495172 0.01761722 -0.092065617 0.038429931 0.015814964 -0.097634315 0.03495172 0.012515741 -0.098405562 0.036875598 0.016240167 -0.097755574 0.03156554 0.01314879 -0.092280924 0.044664629 0.015814964 -0.097634315 0.03495172 0.01

Download .txt

gitextract_oibojri5/

├── .gitignore
├── LICENSE
├── README.md
├── cfgs/
│   ├── diffuser_config.yaml
│   ├── env/
│   │   ├── rlbench/
│   │   │   ├── open_box.yaml
│   │   │   ├── open_door.yaml
│   │   │   ├── open_drawer.yaml
│   │   │   ├── open_grill.yaml
│   │   │   ├── open_microwave.yaml
│   │   │   ├── open_oven.yaml
│   │   │   ├── pick_and_lift.yaml
│   │   │   ├── pick_up_cup.yaml
│   │   │   ├── put_knife_on_chopping_board.yaml
│   │   │   ├── reach_target.yaml
│   │   │   ├── take_lid_off_saucepan.yaml
│   │   │   └── toilet_seat_up.yaml
│   │   ├── rlbench.yaml
│   │   └── sim.yaml
│   ├── eval.yaml
│   └── method/
│       └── rk_diffuser.yaml
├── eval.py
├── extra_scripts/
│   ├── dataset_generator.py
│   └── install_coppeliasim.sh
├── helpers/
│   ├── __init__.py
│   ├── clip/
│   │   ├── __init__.py
│   │   └── core/
│   │       ├── __init__.py
│   │       ├── attention.py
│   │       ├── attention_image_goal.py
│   │       ├── clip.py
│   │       ├── fusion.py
│   │       ├── resnet.py
│   │       ├── simple_tokenizer.py
│   │       ├── transport.py
│   │       ├── transport_image_goal.py
│   │       └── unet.py
│   ├── custom_rlbench_env.py
│   ├── demo_loading_utils.py
│   ├── network_utils.py
│   ├── optim/
│   │   ├── __init__.py
│   │   └── lamb.py
│   ├── preprocess_agent.py
│   └── utils.py
├── panda_urdf/
│   ├── meshes/
│   │   ├── Pandagripper_coll_1.dae
│   │   ├── Pandagrippervisual_vis_1.dae
│   │   ├── Pandagrippervisual_vis_2.dae
│   │   ├── Pandagrippervisual_vis_3.dae
│   │   ├── Pandagrippervisual_vis_4.dae
│   │   ├── Pandagrippervisual_vis_5.dae
│   │   ├── Pandaleftfingervisible_vis_1.dae
│   │   ├── Pandaleftfingervisible_vis_2.dae
│   │   ├── Pandalink0visual_vis_1.dae
│   │   ├── Pandalink1respondable_coll_1.dae
│   │   ├── Pandalink1visual_vis_1.dae
│   │   ├── Pandalink2respondable_coll_1.dae
│   │   ├── Pandalink2visual_vis_1.dae
│   │   ├── Pandalink3respondable_coll_1.dae
│   │   ├── Pandalink3visual_vis_1.dae
│   │   ├── Pandalink3visual_vis_2.dae
│   │   ├── Pandalink4respondable_coll_1.dae
│   │   ├── Pandalink4visual_vis_1.dae
│   │   ├── Pandalink4visual_vis_2.dae
│   │   ├── Pandalink5respondable_coll_1.dae
│   │   ├── Pandalink5respondable_coll_2.dae
│   │   ├── Pandalink5respondable_coll_3.dae
│   │   ├── Pandalink5respondable_coll_4.dae
│   │   ├── Pandalink5respondable_coll_5.dae
│   │   ├── Pandalink5respondable_coll_6.dae
│   │   ├── Pandalink5respondable_coll_7.dae
│   │   ├── Pandalink5visual_vis_1.dae
│   │   ├── Pandalink5visual_vis_2.dae
│   │   ├── Pandalink6respondable_coll_1.dae
│   │   ├── Pandalink6visual_vis_1.dae
│   │   ├── Pandalink6visual_vis_2.dae
│   │   ├── Pandalink6visual_vis_3.dae
│   │   ├── Pandalink6visual_vis_4.dae
│   │   ├── Pandalink6visual_vis_5.dae
│   │   ├── Pandalink6visual_vis_6.dae
│   │   ├── Pandalink7respondable_coll_1.dae
│   │   ├── Pandalink7visual_vis_1.dae
│   │   ├── Pandalink7visual_vis_2.dae
│   │   ├── Pandalink7visual_vis_3.dae
│   │   ├── Pandalink7visual_vis_4.dae
│   │   ├── Pandalink7visual_vis_5.dae
│   │   ├── Pandarightfingervisual_vis_1.dae
│   │   ├── Pandarightfingervisual_vis_2.dae
│   │   └── robot_base_coll_1.dae
│   └── panda.urdf
├── peract/
│   ├── ARM_LICENSE
│   ├── LICENSE
│   ├── agents/
│   │   ├── __init__.py
│   │   └── peract_diffusion/
│   │       ├── __init__.py
│   │       ├── launch_utils.py
│   │       ├── perceiver_lang_io.py
│   │       ├── qattention_peract_bc_agent.py
│   │       └── qattention_stack_agent.py
│   └── voxel/
│       ├── __init__.py
│       ├── augmentation.py
│       └── voxel_grid.py
├── requirements.txt
├── rk_diffuser/
│   ├── __init__.py
│   ├── dataset/
│   │   ├── __init__.py
│   │   ├── realworld_dataset.py
│   │   ├── rl_bench_dataset.py
│   │   └── rl_bench_env.py
│   ├── models/
│   │   ├── __init__.py
│   │   ├── diffusion.py
│   │   ├── helpers.py
│   │   ├── multi_level_diffusion.py
│   │   ├── pointnet.py
│   │   ├── resnet.py
│   │   └── temporal.py
│   ├── robot.py
│   ├── trainer.py
│   └── utils.py
├── setup.py
└── train_low_level.py

Download .txt

SYMBOL INDEX (602 symbols across 36 files)

FILE: eval.py
  class JointsTrajectoryActionMode (line 41) | class JointsTrajectoryActionMode(ArmActionMode):
    method __init__ (line 44) | def __init__(self, points: int):
    method action (line 47) | def action(self, scene: Scene, action: np.ndarray, ignore_collisions: ...
    method _pre_proc_traj (line 68) | def _pre_proc_traj(self, action):
    method action_shape (line 80) | def action_shape(self, scene: Scene) -> tuple:
    method set_callable_each_step (line 83) | def set_callable_each_step(self, callable_each_step):
    method record_end (line 86) | def record_end(self, scene, steps=60, step_scene=True):
  function eval_seed (line 93) | def eval_seed(
  function main (line 242) | def main(eval_cfg: DictConfig) -> None:

FILE: extra_scripts/dataset_generator.py
  function check_and_make (line 52) | def check_and_make(dir):
  function save_demo (line 57) | def save_demo(demo, example_path):
  class CustomMoveArmThenGripper (line 172) | class CustomMoveArmThenGripper(MoveArmThenGripper):
    method action_bounds (line 173) | def action_bounds(self):
  class AbsCustomMoveArmThenGripper (line 180) | class AbsCustomMoveArmThenGripper(MoveArmThenGripper):
    method action_bounds (line 202) | def action_bounds(self):
  function run (line 209) | def run(
  function main (line 410) | def main(argv):

FILE: helpers/clip/core/attention.py
  class Attention (line 11) | class Attention(nn.Module):
    method __init__ (line 14) | def __init__(self, stream_fcn, in_shape, n_rotations, preprocess, cfg,...
    method _build_nets (line 37) | def _build_nets(self):
    method attend (line 44) | def attend(self, x):
    method forward (line 47) | def forward(self, inp_img, softmax=True):

FILE: helpers/clip/core/attention_image_goal.py
  class AttentionImageGoal (line 9) | class AttentionImageGoal(Attention):
    method __init__ (line 12) | def __init__(self, stream_fcn, in_shape, n_rotations, preprocess, cfg,...
    method forward (line 15) | def forward(self, inp_img, goal_img, softmax=True):

FILE: helpers/clip/core/clip.py
  class Bottleneck (line 41) | class Bottleneck(nn.Module):
    method __init__ (line 44) | def __init__(self, inplanes, planes, stride=1):
    method forward (line 84) | def forward(self, x: torch.Tensor):
  class AttentionPool2d (line 100) | class AttentionPool2d(nn.Module):
    method __init__ (line 101) | def __init__(
    method forward (line 114) | def forward(self, x):
  class ModifiedResNet (line 147) | class ModifiedResNet(nn.Module):
    method __init__ (line 155) | def __init__(self, layers, output_dim, heads, input_resolution=224, wi...
    method _make_layer (line 186) | def _make_layer(self, planes, blocks, stride=1):
    method forward (line 195) | def forward(self, x):
    method prepool (line 200) | def prepool(self, x):
    method prepool_im (line 219) | def prepool_im(self, x):
  class LayerNorm (line 245) | class LayerNorm(nn.LayerNorm):
    method forward (line 248) | def forward(self, x: torch.Tensor):
  class QuickGELU (line 254) | class QuickGELU(nn.Module):
    method forward (line 255) | def forward(self, x: torch.Tensor):
  class ResidualAttentionBlock (line 259) | class ResidualAttentionBlock(nn.Module):
    method __init__ (line 260) | def __init__(self, d_model: int, n_head: int, attn_mask: torch.Tensor ...
    method attention (line 277) | def attention(self, x: torch.Tensor):
    method forward (line 285) | def forward(self, x: torch.Tensor):
  class Transformer (line 291) | class Transformer(nn.Module):
    method __init__ (line 292) | def __init__(
    method forward (line 302) | def forward(self, x: torch.Tensor):
  class VisualTransformer (line 306) | class VisualTransformer(nn.Module):
    method __init__ (line 307) | def __init__(
    method forward (line 339) | def forward(self, x: torch.Tensor):
    method forward_spatial (line 367) | def forward_spatial(self, x: torch.Tensor):
  class CLIP (line 392) | class CLIP(nn.Module):
    method __init__ (line 393) | def __init__(
    method initialize_parameters (line 451) | def initialize_parameters(self):
    method build_attention_mask (line 487) | def build_attention_mask(self):
    method dtype (line 496) | def dtype(self):
    method encode_image (line 499) | def encode_image(self, image):
    method encode_text (line 502) | def encode_text(self, text):
    method encode_text_with_embeddings (line 517) | def encode_text_with_embeddings(self, text):
    method forward (line 533) | def forward(self, image, text):
  function convert_weights (line 550) | def convert_weights(model: nn.Module):
  function build_model (line 579) | def build_model(state_dict: dict):
  function _download (line 654) | def _download(url: str, root: str = os.path.expanduser("~/.cache/clip")):
  function available_models (line 696) | def available_models():
  function load_clip (line 700) | def load_clip(
  function tokenize (line 791) | def tokenize(texts: Union[str, List[str]], context_length: int = 77):

FILE: helpers/clip/core/fusion.py
  class DotAttn (line 7) | class DotAttn(nn.Module):
    method forward (line 10) | def forward(self, inp, h):
    method softmax (line 14) | def softmax(self, inp, h):
  class ScaledDotAttn (line 20) | class ScaledDotAttn(nn.Module):
    method forward (line 23) | def forward(self, inp, h):
    method softmax (line 27) | def softmax(self, inp, h):
  class Fusion (line 33) | class Fusion(nn.Module):
    method __init__ (line 36) | def __init__(self, input_dim=3):
    method tile_x2 (line 40) | def tile_x2(self, x1, x2, x2_proj=None):
    method forward (line 48) | def forward(self, x1, x2, x2_mask=None, x2_proj=None):
  class FusionAdd (line 52) | class FusionAdd(Fusion):
    method __init__ (line 55) | def __init__(self, input_dim=3):
    method forward (line 58) | def forward(self, x1, x2, x2_mask=None, x2_proj=None):
  class FusionMult (line 64) | class FusionMult(Fusion):
    method __init__ (line 67) | def __init__(self, input_dim=3):
    method forward (line 70) | def forward(self, x1, x2, x2_mask=None, x2_proj=None):
  class FusionMax (line 76) | class FusionMax(Fusion):
    method __init__ (line 79) | def __init__(self, input_dim=3):
    method forward (line 82) | def forward(self, x1, x2, x2_mask=None, x2_proj=None):
  class FusionConcat (line 88) | class FusionConcat(Fusion):
    method __init__ (line 91) | def __init__(self, input_dim=3):
    method forward (line 94) | def forward(self, x1, x2, x2_mask=None, x2_proj=None):
  class FusionConv (line 100) | class FusionConv(Fusion):
    method __init__ (line 103) | def __init__(self, input_dim=3):
    method forward (line 110) | def forward(self, x1, x2, x2_mask=None, x2_proj=None):
  class FusionConvLat (line 118) | class FusionConvLat(Fusion):
    method __init__ (line 121) | def __init__(self, input_dim=3, output_dim=3):
    method forward (line 127) | def forward(self, x1, x2, x2_mask=None, x2_proj=None):
  class FusionFiLM (line 142) | class FusionFiLM(Fusion):
    method __init__ (line 148) | def __init__(self, input_dim=3, output_dim=3):
    method forward (line 151) | def forward(self, x1, x2, gamma, beta):
  class FusionDeepConv (line 157) | class FusionDeepConv(Fusion):
    method __init__ (line 160) | def __init__(self, input_dim=3):
    method forward (line 171) | def forward(self, x1, x2, x2_mask=None, x2_proj=None):
  class FusionMultWord (line 179) | class FusionMultWord(nn.Module):
    method __init__ (line 182) | def __init__(self, input_dim=3):
    method forward (line 186) | def forward(self, x1, x2, x2_mask=None, x2_proj=None):
  class FusionWordAttention (line 199) | class FusionWordAttention(nn.Module):
    method __init__ (line 202) | def __init__(self, input_dim=3):
    method forward (line 207) | def forward(self, x1, x2, x2_mask=None, x2_proj=None):
  class FusionSentenceAttention (line 226) | class FusionSentenceAttention(nn.Module):
    method __init__ (line 229) | def __init__(self, input_dim=3):
    method forward (line 234) | def forward(self, x1, x2, x2_mask=None, x2_proj=None):
  class CrossModalAttention2d (line 248) | class CrossModalAttention2d(nn.Module):
    method __init__ (line 251) | def __init__(
    method forward (line 277) | def forward(self, x, l, l_mask):
  class FusionMultiHeadedWordAttention (line 329) | class FusionMultiHeadedWordAttention(nn.Module):
    method __init__ (line 332) | def __init__(self, input_dim=3):
    method forward (line 351) | def forward(self, x1, x2, x2_mask=None, x2_proj=None):

FILE: helpers/clip/core/resnet.py
  class IdentityBlock (line 6) | class IdentityBlock(nn.Module):
    method __init__ (line 7) | def __init__(
    method forward (line 30) | def forward(self, x):
  class ConvBlock (line 40) | class ConvBlock(nn.Module):
    method __init__ (line 41) | def __init__(
    method forward (line 69) | def forward(self, x):
  class ResNet43_8s (line 79) | class ResNet43_8s(nn.Module):
    method __init__ (line 80) | def __init__(self, input_shape, output_dim, cfg, device, preprocess):
    method _make_layers (line 92) | def _make_layers(self):
    method forward (line 165) | def forward(self, x):

FILE: helpers/clip/core/simple_tokenizer.py
  function default_bpe (line 11) | def default_bpe():
  function bytes_to_unicode (line 18) | def bytes_to_unicode():
  function get_pairs (line 44) | def get_pairs(word):
  function basic_clean (line 56) | def basic_clean(text):
  function whitespace_clean (line 62) | def whitespace_clean(text):
  class SimpleTokenizer (line 68) | class SimpleTokenizer(object):
    method __init__ (line 69) | def __init__(self, bpe_path: str = default_bpe()):
    method bpe (line 92) | def bpe(self, token):
    method encode (line 133) | def encode(self, text):
    method decode (line 143) | def decode(self, tokens):

FILE: helpers/clip/core/transport.py
  class Transport (line 9) | class Transport(nn.Module):
    method __init__ (line 10) | def __init__(
    method _build_nets (line 45) | def _build_nets(self):
    method correlate (line 54) | def correlate(self, in0, in1, softmax):
    method transport (line 70) | def transport(self, in_tensor, crop):
    method forward (line 75) | def forward(self, inp_img, p, softmax=True):

FILE: helpers/clip/core/transport_image_goal.py
  class TransportImageGoal (line 9) | class TransportImageGoal(nn.Module):
    method __init__ (line 12) | def __init__(
    method _build_nets (line 53) | def _build_nets(self):
    method correlate (line 61) | def correlate(self, in0, in1, softmax):
    method forward (line 77) | def forward(self, inp_img, goal_img, p, softmax=True):

FILE: helpers/clip/core/unet.py
  class DoubleConv (line 8) | class DoubleConv(nn.Module):
    method __init__ (line 11) | def __init__(self, in_channels, out_channels, mid_channels=None):
    method forward (line 24) | def forward(self, x):
  class Down (line 28) | class Down(nn.Module):
    method __init__ (line 31) | def __init__(self, in_channels, out_channels):
    method forward (line 37) | def forward(self, x):
  class Up (line 41) | class Up(nn.Module):
    method __init__ (line 44) | def __init__(self, in_channels, out_channels, bilinear=True):
    method forward (line 57) | def forward(self, x1, x2):
  class OutConv (line 71) | class OutConv(nn.Module):
    method __init__ (line 72) | def __init__(self, in_channels, out_channels):
    method forward (line 76) | def forward(self, x):

FILE: helpers/custom_rlbench_env.py
  class CustomRLBenchEnv (line 18) | class CustomRLBenchEnv(RLBenchEnv):
    method __init__ (line 19) | def __init__(
    method observation_elements (line 60) | def observation_elements(self) -> List[ObservationElement]:
    method extract_obs (line 70) | def extract_obs(self, obs: Observation, t=None, prev_action=None):
    method launch (line 107) | def launch(self):
    method reset (line 119) | def reset(self) -> dict:
    method register_callback (line 132) | def register_callback(self, func):
    method _my_callback (line 135) | def _my_callback(self):
    method _append_final_frame (line 141) | def _append_final_frame(self, success: bool):
    method step (line 150) | def step(self, act_result: ActResult) -> Transition:
    method _extract_traj_info (line 205) | def _extract_traj_info(self, demo):
    method reset_to_demo (line 231) | def reset_to_demo(self, i):
  class CustomMultiTaskRLBenchEnv (line 255) | class CustomMultiTaskRLBenchEnv(MultiTaskRLBenchEnv):
    method __init__ (line 256) | def __init__(
    method observation_elements (line 299) | def observation_elements(self) -> List[ObservationElement]:
    method extract_obs (line 309) | def extract_obs(self, obs: Observation, t=None, prev_action=None):
    method launch (line 341) | def launch(self):
    method reset (line 353) | def reset(self) -> dict:
    method register_callback (line 363) | def register_callback(self, func):
    method _my_callback (line 366) | def _my_callback(self):
    method _append_final_frame (line 372) | def _append_final_frame(self, success: bool):
    method step (line 381) | def step(self, act_result: ActResult) -> Transition:
    method reset_to_demo (line 441) | def reset_to_demo(self, i, variation_number=-1):

FILE: helpers/demo_loading_utils.py
  function _is_stopped (line 8) | def _is_stopped(demo, i, obs, stopped_buffer, delta=0.1):
  function keypoint_discovery (line 25) | def keypoint_discovery(demo: Demo, stopping_delta=0.1) -> List[int]:
  function find_minimum_difference (line 91) | def find_minimum_difference(lst):

FILE: helpers/network_utils.py
  function act_layer (line 14) | def act_layer(act):
  function norm_layer2d (line 29) | def norm_layer2d(norm, channels):
  function norm_layer1d (line 42) | def norm_layer1d(norm, num_channels):
  class FiLMBlock (line 53) | class FiLMBlock(nn.Module):
    method __init__ (line 54) | def __init__(self):
    method forward (line 57) | def forward(self, x, gamma, beta):
  class Conv2DBlock (line 66) | class Conv2DBlock(nn.Module):
    method __init__ (line 67) | def __init__(
    method forward (line 120) | def forward(self, x):
  class Conv2DFiLMBlock (line 127) | class Conv2DFiLMBlock(Conv2DBlock):
    method __init__ (line 128) | def __init__(
    method forward (line 150) | def forward(self, x, gamma, beta):
  class Conv3DBlock (line 158) | class Conv3DBlock(nn.Module):
    method __init__ (line 159) | def __init__(
    method forward (line 210) | def forward(self, x):
  class ConvTranspose3DBlock (line 217) | class ConvTranspose3DBlock(nn.Module):
    method __init__ (line 218) | def __init__(
    method forward (line 268) | def forward(self, x):
  class Conv2DUpsampleBlock (line 275) | class Conv2DUpsampleBlock(nn.Module):
    method __init__ (line 276) | def __init__(
    method forward (line 299) | def forward(self, x):
  class Conv3DUpsampleBlock (line 303) | class Conv3DUpsampleBlock(nn.Module):
    method __init__ (line 304) | def __init__(
    method forward (line 327) | def forward(self, x):
  class DenseBlock (line 331) | class DenseBlock(nn.Module):
    method __init__ (line 332) | def __init__(self, in_features, out_features, norm=None, activation=No...
    method forward (line 364) | def forward(self, x):
  class SiameseNet (line 371) | class SiameseNet(nn.Module):
    method __init__ (line 372) | def __init__(
    method build (line 390) | def build(self):
    method forward (line 411) | def forward(self, x):
  class CNNAndFcsNet (line 421) | class CNNAndFcsNet(nn.Module):
    method __init__ (line 422) | def __init__(
    method build (line 445) | def build(self):
    method forward (line 472) | def forward(self, observations, low_dim_ins):
  class CNNLangAndFcsNet (line 482) | class CNNLangAndFcsNet(nn.Module):
    method __init__ (line 483) | def __init__(
    method build (line 508) | def build(self):
    method forward (line 541) | def forward(self, observations, low_dim_ins, lang_goal_emb):
  function pair (line 566) | def pair(t):
  class PreNorm (line 573) | class PreNorm(nn.Module):
    method __init__ (line 574) | def __init__(self, dim, fn):
    method forward (line 579) | def forward(self, x, **kwargs):
  class FeedForward (line 583) | class FeedForward(nn.Module):
    method __init__ (line 584) | def __init__(self, dim, hidden_dim, dropout=0.0):
    method forward (line 594) | def forward(self, x):
  class Attention (line 598) | class Attention(nn.Module):
    method __init__ (line 599) | def __init__(self, dim, heads=8, dim_head=64, dropout=0.0):
    method forward (line 618) | def forward(self, x):
  class Transformer (line 632) | class Transformer(nn.Module):
    method __init__ (line 633) | def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout=0.0):
    method forward (line 651) | def forward(self, x):
  class ViT (line 663) | class ViT(nn.Module):
    method __init__ (line 664) | def __init__(
    method forward (line 712) | def forward(self, img):
  class ViTLangAndFcsNet (line 727) | class ViTLangAndFcsNet(nn.Module):
    method __init__ (line 728) | def __init__(
    method build (line 753) | def build(self):
    method forward (line 785) | def forward(self, observations, low_dim_ins, lang_goal_emb):
  class Conv3DInceptionBlockUpsampleBlock (line 808) | class Conv3DInceptionBlockUpsampleBlock(nn.Module):
    method __init__ (line 809) | def __init__(
    method forward (line 836) | def forward(self, x):
  class Conv3DInceptionBlock (line 840) | class Conv3DInceptionBlock(nn.Module):
    method __init__ (line 841) | def __init__(
    method forward (line 886) | def forward(self, x):
  class ConvTransposeUp3DBlock (line 901) | class ConvTransposeUp3DBlock(nn.Module):
    method __init__ (line 902) | def __init__(
    method forward (line 941) | def forward(self, x):
  class SpatialSoftmax3D (line 948) | class SpatialSoftmax3D(torch.nn.Module):
    method __init__ (line 949) | def __init__(self, depth, height, width, channel):
    method forward (line 974) | def forward(self, feature):

FILE: helpers/optim/lamb.py
  class Lamb (line 27) | class Lamb(Optimizer):
    method __init__ (line 45) | def __init__(
    method step (line 60) | def step(self, closure=None):

FILE: helpers/preprocess_agent.py
  class PreprocessAgent (line 14) | class PreprocessAgent(Agent):
    method __init__ (line 15) | def __init__(self, pose_agent: Agent, norm_rgb: bool = True):
    method build (line 19) | def build(self, training: bool, device: torch.device = None):
    method _norm_rgb_ (line 22) | def _norm_rgb_(self, x):
    method update (line 25) | def update(self, step: int, replay_sample: dict) -> dict:
    method act (line 38) | def act(self, step: int, observation: dict, deterministic=False) -> Ac...
    method update_summaries (line 49) | def update_summaries(self) -> List[Summary]:
    method act_summaries (line 98) | def act_summaries(self) -> List[Summary]:
    method load_weights (line 101) | def load_weights(self, savedir: str):
    method save_weights (line 104) | def save_weights(self, savedir: str):
    method reset (line 107) | def reset(self) -> None:

FILE: helpers/utils.py
  function loss_weights (line 30) | def loss_weights(replay_sample, beta=1.0):
  function soft_updates (line 39) | def soft_updates(net, target_net, tau):
  function stack_on_channel (line 44) | def stack_on_channel(x):
  function normalize_quaternion (line 49) | def normalize_quaternion(quat):
  function correct_rotation_instability (line 53) | def correct_rotation_instability(disc, resolution):
  function check_gimbal_lock (line 68) | def check_gimbal_lock(pred_rot_and_grip, gt_rot_and_grip, resolution):
  function quaternion_to_discrete_euler (line 82) | def quaternion_to_discrete_euler(quaternion, resolution):
  function discrete_euler_to_quaternion (line 90) | def discrete_euler_to_quaternion(discrete_euler, resolution):
  function point_to_voxel_index (line 95) | def point_to_voxel_index(
  function voxel_index_to_point (line 109) | def voxel_index_to_point(
  function point_to_pixel_index (line 117) | def point_to_pixel_index(
  function _compute_initial_camera_pose (line 129) | def _compute_initial_camera_pose(scene):
  function _from_trimesh_scene (line 145) | def _from_trimesh_scene(trimesh_scene, bg_color=None, ambient_light=None):
  function _create_bounding_box (line 160) | def _create_bounding_box(scene, voxel_size, res):
  function create_voxel_scene (line 181) | def create_voxel_scene(
  function visualise_voxel (line 233) | def visualise_voxel(
  function preprocess (line 276) | def preprocess(img, dist="transporter"):
  function rand_dist (line 337) | def rand_dist(size, min=-1.0, max=1.0):
  function rand_discrete (line 341) | def rand_discrete(size, min=0, max=1):
  function split_list (line 347) | def split_list(lst, n):
  function extract_obs (line 352) | def extract_obs(
  function create_obs_config (line 418) | def create_obs_config(
  function get_device (line 459) | def get_device(gpu):

FILE: peract/agents/peract_diffusion/launch_utils.py
  function create_replay (line 40) | def create_replay(
  function _get_action (line 146) | def _get_action(
  function _add_keypoints_to_replay (line 196) | def _add_keypoints_to_replay(
  function fill_replay (line 284) | def fill_replay(
  function fill_multi_task_replay (line 370) | def fill_multi_task_replay(
  function create_diffusion (line 441) | def create_diffusion(cfgs: DictConfig):
  function create_agent (line 450) | def create_agent(cfg: DictConfig):

FILE: peract/agents/peract_diffusion/perceiver_lang_io.py
  function exists (line 24) | def exists(val):
  function default (line 28) | def default(val, d):
  function cache_fn (line 32) | def cache_fn(f):
  function fourier_encode (line 48) | def fourier_encode(x, max_freq, num_bands=4):
  class PreNorm (line 64) | class PreNorm(nn.Module):
    method __init__ (line 65) | def __init__(self, dim, fn, context_dim=None):
    method forward (line 71) | def forward(self, x, **kwargs):
  class GEGLU (line 82) | class GEGLU(nn.Module):
    method forward (line 83) | def forward(self, x):
  class FeedForward (line 88) | class FeedForward(nn.Module):
    method __init__ (line 89) | def __init__(self, dim, mult=4):
    method forward (line 95) | def forward(self, x):
  class Attention (line 99) | class Attention(nn.Module):
    method __init__ (line 100) | def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, ...
    method forward (line 113) | def forward(self, x, context=None, mask=None):
  class PerceiverVoxelLangEncoder (line 142) | class PerceiverVoxelLangEncoder(nn.Module):
    method __init__ (line 143) | def __init__(
    method encode_text (line 394) | def encode_text(self, x):
    method forward (line 403) | def forward(

FILE: peract/agents/peract_diffusion/qattention_peract_bc_agent.py
  class QFunction (line 30) | class QFunction(nn.Module):
    method __init__ (line 31) | def __init__(
    method _argmax_3d (line 50) | def _argmax_3d(self, tensor_orig):
    method choose_highest_action (line 56) | def choose_highest_action(self, q_trans, q_rot_grip, q_collision):
    method forward (line 79) | def forward(
  class QAttentionPerActBCAgent (line 127) | class QAttentionPerActBCAgent(Agent):
    method __init__ (line 128) | def __init__(
    method build (line 196) | def build(self, training: bool, device: torch.device = None):
    method _extract_crop (line 320) | def _extract_crop(self, pixel_action, observation):
    method _preprocess_inputs (line 335) | def _preprocess_inputs(self, replay_sample):
    method _act_preprocess_inputs (line 347) | def _act_preprocess_inputs(self, observation):
    method _get_value_from_voxel_index (line 357) | def _get_value_from_voxel_index(self, q, voxel_idx):
    method _get_value_from_rot_and_grip (line 369) | def _get_value_from_rot_and_grip(self, rot_grip_q, rot_and_grip_idx):
    method _celoss (line 388) | def _celoss(self, pred, labels):
    method _softmax_q_trans (line 391) | def _softmax_q_trans(self, q):
    method _softmax_q_rot_grip (line 395) | def _softmax_q_rot_grip(self, q_rot_grip):
    method _softmax_ignore_collision (line 422) | def _softmax_ignore_collision(self, q_collision):
    method update (line 426) | def update(self, step: int, replay_sample: dict) -> dict:
    method act (line 601) | def act(self, step: int, observation: dict, deterministic=False) -> Ac...
    method update_summaries (line 715) | def update_summaries(self) -> List[Summary]:
    method act_summaries (line 748) | def act_summaries(self) -> List[Summary]:
    method load_weights (line 762) | def load_weights(self, savedir: str):
    method save_weights (line 786) | def save_weights(self, savedir: str):

FILE: peract/agents/peract_diffusion/qattention_stack_agent.py
  class QAttentionStackAgent (line 23) | class QAttentionStackAgent(Agent):
    method __init__ (line 24) | def __init__(
    method build (line 49) | def build(self, training: bool, device=None) -> None:
    method update (line 56) | def update(self, step: int, replay_sample: dict) -> dict:
    method act (line 67) | def act(self, step: int, observation: dict, deterministic=False) -> Ac...
    method _plot (line 152) | def _plot(self, predicted_trajs, pcds, rgbs, gt_trajs=None):
    method _prepare_diffusion_data_dict (line 224) | def _prepare_diffusion_data_dict(
    method update_summaries (line 257) | def update_summaries(self) -> List[Summary]:
    method act_summaries (line 263) | def act_summaries(self) -> List[Summary]:
    method load_weights (line 269) | def load_weights(self, savedir: str, backbone: str = "unet"):
    method save_weights (line 286) | def save_weights(self, savedir: str):

FILE: peract/voxel/augmentation.py
  function _axis_angle_rotation (line 8) | def _axis_angle_rotation(axis: str, angle: torch.Tensor) -> torch.Tensor:
  function standardize_quaternion (line 38) | def standardize_quaternion(quaternions: torch.Tensor) -> torch.Tensor:
  function _sqrt_positive_part (line 53) | def _sqrt_positive_part(x: torch.Tensor) -> torch.Tensor:
  function quaternion_to_matrix (line 64) | def quaternion_to_matrix(quaternions: torch.Tensor) -> torch.Tensor:
  function matrix_to_quaternion (line 96) | def matrix_to_quaternion(matrix: torch.Tensor) -> torch.Tensor:
  function euler_angles_to_matrix (line 158) | def euler_angles_to_matrix(euler_angles: torch.Tensor, convention: str) ...
  function perturb_se3 (line 187) | def perturb_se3(pcd, trans_shift_4x4, rot_shift_4x4, action_gripper_4x4,...
  function apply_se3_augmentation (line 255) | def apply_se3_augmentation(

FILE: peract/voxel/voxel_grid.py
  class VoxelGrid (line 15) | class VoxelGrid(nn.Module):
    method __init__ (line 16) | def __init__(
    method _broadcast (line 132) | def _broadcast(self, src: torch.Tensor, other: torch.Tensor, dim: int):
    method _scatter_mean (line 143) | def _scatter_mean(
    method _scatter_nd (line 165) | def _scatter_nd(self, indices, updates):
    method coords_to_bounding_voxel_grid (line 191) | def coords_to_bounding_voxel_grid(

FILE: rk_diffuser/dataset/realworld_dataset.py
  class RealWorldDataset (line 20) | class RealWorldDataset:
    method __init__ (line 23) | def __init__(
    method _load_dataset (line 152) | def _load_dataset(self, task, data_raw_path, demo_dict):
    method _downsample_images (line 187) | def _downsample_images(self, demo):
    method _extract_obs_from_demo (line 210) | def _extract_obs_from_demo(self, demo: dict):
    method _keypoint_discovery (line 304) | def _keypoint_discovery(self, obs: dict, stopping_delta=0.1) -> List[i...
    method _is_stopped (line 355) | def _is_stopped(self, joint_poses, i, stopped_buffer, delta=0.1):
    method _is_gripper_open (line 386) | def _is_gripper_open(self, pose):
    method _matrix_to_pose (line 398) | def _matrix_to_pose(self, matrix, offset=np.eye(4)):
    method _pose_to_matrix (line 416) | def _pose_to_matrix(self, pose):
    method _get_low_dim_data (line 431) | def _get_low_dim_data(self, obs):
    method __len__ (line 447) | def __len__(self):
    method min_traj_len (line 450) | def min_traj_len(self):
    method get_conditions (line 454) | def get_conditions(self, observations):
    method _calc_rank (line 464) | def _calc_rank(self, poses):
    method _filter_traj (line 475) | def _filter_traj(self, joints_positions: np.ndarray) -> list:
    method __getitem__ (line 496) | def __getitem__(self, index) -> Any:
    method _prepare_input (line 565) | def _prepare_input(self, task: str, use_demos: bool = False):
    method _create_uniform_pixel_coords_image (line 623) | def _create_uniform_pixel_coords_image(self, resolution: np.ndarray):
    method _transform (line 647) | def _transform(self, coords, trans):
    method _pixel_to_world_coords (line 668) | def _pixel_to_world_coords(self, pixel_coords, cam_proj_mat_inv):
    method _pointcloud_from_depth_and_camera_params (line 685) | def _pointcloud_from_depth_and_camera_params(

FILE: rk_diffuser/dataset/rl_bench_dataset.py
  class NormEEPosePlanning (line 35) | class NormEEPosePlanning(EndEffectorPoseViaPlanning):
    method action (line 36) | def action(self, scene: Scene, action: np.ndarray):
  class TrajectoryActionMode (line 46) | class TrajectoryActionMode(EndEffectorPoseViaIK):
    method __init__ (line 49) | def __init__(
    method action (line 59) | def action(self, scene: Scene, action: np.ndarray):
    method action_shape (line 71) | def action_shape(self, scene: Scene) -> tuple:
  class JointsTrajectoryActionMode (line 75) | class JointsTrajectoryActionMode(ArmActionMode):
    method __init__ (line 78) | def __init__(self, points: int):
    method action (line 81) | def action(self, scene: Scene, action: np.ndarray):
    method _pre_proc_traj (line 97) | def _pre_proc_traj(self, action):
    method action_shape (line 109) | def action_shape(self, scene: Scene) -> tuple:
  function _create_obs_config (line 113) | def _create_obs_config(camera_names: List[str], camera_resolution: List[...
  function _get_action_mode (line 152) | def _get_action_mode(num_points, use_traj=False):
  class RLBenchDataset (line 161) | class RLBenchDataset(Dataset):
    method __init__ (line 162) | def __init__(
    method _get_env (line 263) | def _get_env(self, task, headless=True, use_traj=False):
    method _get_demos (line 291) | def _get_demos(self, task, ret_dict):
    method _load_demos_to_dataset (line 317) | def _load_demos_to_dataset(self, demos, num_of_episodes, ds_list):
    method __len__ (line 399) | def __len__(self):
    method min_traj_len (line 402) | def min_traj_len(self):
    method get_conditions (line 406) | def get_conditions(self, observations):
    method _calc_rank (line 416) | def _calc_rank(self, poses):
    method __getitem__ (line 428) | def __getitem__(self, index) -> Any:

FILE: rk_diffuser/dataset/rl_bench_env.py
  class CustomRLBenchEnv (line 20) | class CustomRLBenchEnv(RLBenchEnv):
    method __init__ (line 21) | def __init__(
    method observation_elements (line 52) | def observation_elements(self) -> List[ObservationElement]:
    method extract_obs (line 63) | def extract_obs(self, obs: Observation, t=None, prev_action=None):
    method launch (line 94) | def launch(self):
    method reset (line 105) | def reset(self) -> dict:
    method register_callback (line 113) | def register_callback(self, func):
    method _my_callback (line 116) | def _my_callback(self):
    method _append_final_frame (line 121) | def _append_final_frame(self, success: bool):
    method step (line 130) | def step(self, action: np.ndarray, record: bool = False) -> Transition:
    method reset_to_demo (line 166) | def reset_to_demo(self, demo):
    method eval_planner (line 172) | def eval_planner(self, amount, callable_each_step):
    method _get_live_demos (line 180) | def _get_live_demos(

FILE: rk_diffuser/models/diffusion.py
  function _create_model_fn (line 20) | def _create_model_fn(
  class GaussianDynDiffusion (line 62) | class GaussianDynDiffusion(nn.Module):
    method __init__ (line 63) | def __init__(
    method gripper_pose_loss (line 252) | def gripper_pose_loss(self, p1, p2):
    method get_loss_weights (line 271) | def get_loss_weights(self, discount: float) -> torch.Tensor:
    method predict_start_from_noise (line 296) | def predict_start_from_noise(
    method q_posterior (line 318) | def q_posterior(
    method p_mean_variance (line 343) | def p_mean_variance(
    method p_sample (line 414) | def p_sample(
    method p_sample_loop (line 441) | def p_sample_loop(
    method conditional_sample (line 512) | def conditional_sample(
    method q_sample (line 538) | def q_sample(
    method init_noise (line 562) | def init_noise(
    method p_losses (line 607) | def p_losses(
    method proc_cond (line 718) | def proc_cond(self, cond: dict) -> dict:
    method loss (line 735) | def loss(self, x: torch.tensor, cond: dict, **kwargs) -> tuple:
    method forward (line 753) | def forward(self, cond: dict, *args, **kwargs) -> dict:

FILE: rk_diffuser/models/helpers.py
  class SinusoidalPosEmb (line 18) | class SinusoidalPosEmb(nn.Module):
    method __init__ (line 19) | def __init__(self, dim):
    method forward (line 23) | def forward(self, x):
  class Downsample1d (line 33) | class Downsample1d(nn.Module):
    method __init__ (line 34) | def __init__(self, dim):
    method forward (line 38) | def forward(self, x):
  class Upsample1d (line 42) | class Upsample1d(nn.Module):
    method __init__ (line 43) | def __init__(self, dim):
    method forward (line 47) | def forward(self, x):
  class Conv1dBlock (line 51) | class Conv1dBlock(nn.Module):
    method __init__ (line 56) | def __init__(self, inp_channels, out_channels, kernel_size, mish=True,...
    method forward (line 74) | def forward(self, x):
  function extract (line 83) | def extract(a, t, x_shape):
  function cosine_beta_schedule (line 89) | def cosine_beta_schedule(timesteps, s=0.008, dtype=torch.float32):
  function apply_conditioning (line 103) | def apply_conditioning(x, conditions, action_dim):
  class WeightedLoss (line 114) | class WeightedLoss(nn.Module):
    method __init__ (line 115) | def __init__(self, weights, action_dim):
    method forward (line 120) | def forward(self, pred, targ):
  class WeightedStateLoss (line 133) | class WeightedStateLoss(nn.Module):
    method __init__ (line 134) | def __init__(self, weights):
    method forward (line 138) | def forward(self, pred, targ):
  class ValueLoss (line 148) | class ValueLoss(nn.Module):
    method __init__ (line 149) | def __init__(self, *args):
    method forward (line 153) | def forward(self, pred, targ):
  class WeightedL1 (line 176) | class WeightedL1(WeightedLoss):
    method _loss (line 177) | def _loss(self, pred, targ):
  class WeightedL2 (line 181) | class WeightedL2(WeightedLoss):
    method _loss (line 182) | def _loss(self, pred, targ):
  class WeightedStateL2 (line 186) | class WeightedStateL2(WeightedStateLoss):
    method _loss (line 187) | def _loss(self, pred, targ):
  class WeightedChamfer (line 191) | class WeightedChamfer(WeightedStateLoss):
    method __init__ (line 192) | def __init__(self, weights, converage_weight=1.0, norm="l1"):
    method _loss (line 197) | def _loss(self, pred, targ):
  class ValueL1 (line 212) | class ValueL1(ValueLoss):
    method _loss (line 213) | def _loss(self, pred, targ):
  class ValueL2 (line 217) | class ValueL2(ValueLoss):
    method _loss (line 218) | def _loss(self, pred, targ):

FILE: rk_diffuser/models/multi_level_diffusion.py
  class MultiLevelDiffusion (line 10) | class MultiLevelDiffusion(nn.Module):
    method __init__ (line 18) | def __init__(
    method robot_offset (line 50) | def robot_offset(self):
    method robot_rot (line 55) | def robot_rot(self):
    method conditional_sample (line 61) | def conditional_sample(
    method loss (line 120) | def loss(self, x: torch.tensor, cond: dict, **kwargs) -> tuple:
    method _form_diffusion_batch (line 144) | def _form_diffusion_batch(

FILE: rk_diffuser/models/pointnet.py
  class STN3d (line 12) | class STN3d(nn.Module):
    method __init__ (line 13) | def __init__(self):
    method forward (line 32) | def forward(self, x):
  class STNkd (line 60) | class STNkd(nn.Module):
    method __init__ (line 61) | def __init__(self, k=64):
    method forward (line 79) | def forward(self, x):
  class PointNetfeat (line 103) | class PointNetfeat(nn.Module):
    method __init__ (line 104) | def __init__(self, feature_transform=False):
    method forward (line 117) | def forward(self, x):
  class PointNetCls (line 143) | class PointNetCls(nn.Module):
    method __init__ (line 144) | def __init__(self, k=2, feature_transform=False):
    method forward (line 156) | def forward(self, x):
  class PointNetDenseCls (line 164) | class PointNetDenseCls(nn.Module):
    method __init__ (line 165) | def __init__(self, k=2, feature_transform=False):
    method forward (line 178) | def forward(self, x):

FILE: rk_diffuser/models/resnet.py
  function preproc_fn (line 13) | def preproc_fn(h=224, w=224, rgb=True):
  function infer_model_out_shape (line 38) | def infer_model_out_shape(model, input_shape: Tuple) -> tuple:
  class ResnetEncoder (line 51) | class ResnetEncoder(nn.Module):
    method __init__ (line 54) | def __init__(
    method forward (line 90) | def forward(self, img: torch.Tensor) -> torch.Tensor:

FILE: rk_diffuser/models/temporal.py
  class Residual (line 20) | class Residual(nn.Module):
    method __init__ (line 21) | def __init__(self, fn):
    method forward (line 25) | def forward(self, x, *args, **kwargs):
  class PreNorm (line 29) | class PreNorm(nn.Module):
    method __init__ (line 30) | def __init__(self, dim, fn):
    method forward (line 35) | def forward(self, x):
  class LinearAttention (line 40) | class LinearAttention(nn.Module):
    method __init__ (line 41) | def __init__(self, dim, heads=4, dim_head=128):
    method forward (line 48) | def forward(self, x):
  class GlobalMixing (line 63) | class GlobalMixing(nn.Module):
    method __init__ (line 64) | def __init__(self, dim, heads=4, dim_head=128):
    method forward (line 71) | def forward(self, x):
  class ResidualTemporalBlock (line 86) | class ResidualTemporalBlock(nn.Module):
    method __init__ (line 87) | def __init__(
    method forward (line 116) | def forward(self, x, t):
  class Temporal (line 129) | class Temporal(nn.Module):
    method __init__ (line 130) | def __init__(
    method forward (line 279) | def forward(
  class UnetBackbone (line 326) | class UnetBackbone(nn.Module):
    method __init__ (line 329) | def __init__(
    method forward (line 451) | def forward(self, x: torch.tensor, t: torch.tensor) -> torch.tensor:
  class TransformerBackbone (line 480) | class TransformerBackbone(nn.Module):
    method __init__ (line 481) | def __init__(
    method forward (line 567) | def forward(self, x: torch.tensor, t: torch.tensor) -> torch.tensor:

FILE: rk_diffuser/robot.py
  class DiffRobot (line 11) | class DiffRobot(nn.Module):
    method __init__ (line 14) | def __init__(
    method to (line 59) | def to(self, device: torch.device) -> None:
    method forward_kinematics_batch (line 72) | def forward_kinematics_batch(self, joints: torch.tensor) -> torch.tensor:
    method inverse_kinematics_autodiff_single_step_batch_pt (line 90) | def inverse_kinematics_autodiff_single_step_batch_pt(

FILE: rk_diffuser/trainer.py
  function cycle (line 36) | def cycle(dl):
  class EMA (line 42) | class EMA:
    method __init__ (line 47) | def __init__(self, beta):
    method update_model_average (line 51) | def update_model_average(self, ma_model, current_model):
    method _update_average (line 58) | def _update_average(self, old, new):
  class Trainer (line 64) | class Trainer(object):
    method __init__ (line 65) | def __init__(
    method create_envs (line 181) | def create_envs(self) -> None:
    method _set_logger (line 255) | def _set_logger(self) -> None:
    method _reset_parameters (line 285) | def _reset_parameters(self) -> None:
    method _step_ema (line 291) | def _step_ema(self):
    method generate_traj (line 310) | def generate_traj(self, **kwargs) -> dict:
    method get_eval_log (line 333) | def get_eval_log(
    method train (line 434) | def train(self, n_train_steps: int, train: bool = True) -> None:
    method _prepare_demo_data (line 511) | def _prepare_demo_data(self, task: str) -> tuple:
    method _dispatch_eval_trajs (line 608) | def _dispatch_eval_trajs(self, trajs: list, task_names: list) -> dict:
    method online_eval (line 626) | def online_eval(self) -> dict:
    method __exit__ (line 714) | def __exit__():

FILE: rk_diffuser/utils.py
  class Timer (line 18) | class Timer:
    method __init__ (line 19) | def __init__(self):
    method __call__ (line 22) | def __call__(self, reset=True):
  class Progress (line 30) | class Progress:
    method __init__ (line 31) | def __init__(
    method update (line 62) | def update(self, description, n=1):
    method resume (line 69) | def resume(self):
    method pause (line 75) | def pause(self):
    method set_description (line 79) | def set_description(self, params=[]):
    method append_description (line 112) | def append_description(self, descr):
    method _clear (line 115) | def _clear(self):
    method _format_percent (line 122) | def _format_percent(self, n, total):
    method _format_speed (line 140) | def _format_speed(self, n):
    method _chunk (line 149) | def _chunk(self, l, n):
    method _format (line 152) | def _format(self, chunks):
    method _format_chunk (line 159) | def _format_chunk(self, chunk):
    method _format_param (line 163) | def _format_param(self, param):
    method stamp (line 167) | def stamp(self):
    method close (line 180) | def close(self):
  class Silent (line 184) | class Silent:
    method __init__ (line 185) | def __init__(self, *args, **kwargs):
    method __getattr__ (line 188) | def __getattr__(self, attr):
  function to_np (line 192) | def to_np(x):
  function to_torch (line 198) | def to_torch(x, dtype=None, device=None):
  function to_device (line 209) | def to_device(x, device=DEVICE):
  function batchify (line 232) | def batchify(batch, device):
  function apply_dict (line 248) | def apply_dict(fn, d, *args, **kwargs):
  function normalize (line 252) | def normalize(x):
  function to_img (line 261) | def to_img(x):
  function set_device (line 268) | def set_device(device):
  function batch_to_device (line 274) | def batch_to_device(batch, device="cuda:0"):
  function _to_str (line 278) | def _to_str(num):
  function param_to_module (line 290) | def param_to_module(param):
  function report_parameters (line 295) | def report_parameters(model, topk=10):
  function _is_stopped (line 316) | def _is_stopped(demo, i, obs, stopped_buffer, delta=0.1):
  function keypoint_discovery (line 333) | def keypoint_discovery(demo: Demo, stopping_delta=0.1) -> List[int]:
  function cumsum_traj (line 354) | def cumsum_traj(x, cond):
  function concatenate_tensors_dict_list (line 362) | def concatenate_tensors_dict_list(dict_list, dim=0):
  function geodesic_distance_between_quaternions (line 382) | def geodesic_distance_between_quaternions(
  function diff_kinematics_pose_to_coppelia (line 418) | def diff_kinematics_pose_to_coppelia(diff_k_pose, offset):
  function proc_quaternion (line 435) | def proc_quaternion(poses):
  function _sqrt_positive_part (line 446) | def _sqrt_positive_part(x: torch.Tensor) -> torch.Tensor:
  function matrix_to_quaternion (line 457) | def matrix_to_quaternion(matrix: torch.Tensor) -> torch.Tensor:
  function rlb_pose_to_matrix (line 519) | def rlb_pose_to_matrix(pose):
  function matrix_to_rlb_pose (line 531) | def matrix_to_rlb_pose(matrix):
  function load_low_level_ckpt (line 540) | def load_low_level_ckpt(path):
  function load_checkpoint (line 549) | def load_checkpoint(diffusion, load_model_path, backbone):

FILE: train_low_level.py
  function _create_agent_fn (line 17) | def _create_agent_fn(
  function main (line 73) | def main(cfgs):

Download .json

Condensed preview — 116 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (3,634K chars).

[
  {
    "path": ".gitignore",
    "chars": 99,
    "preview": "__pycache__\nwandb/\noutputs/\n*.pt\nlogs\nsaved_models/\n*.json\nexp_local/\neval_video/\n*.html\n*egg-info/"
  },
  {
    "path": "LICENSE",
    "chars": 1065,
    "preview": "MIT License\n\nCopyright (c) 2024 Dyson AI\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\no"
  },
  {
    "path": "README.md",
    "chars": 3981,
    "preview": "# Hierarchical Diffusion Policy\n\nThis repo contains the PyTorch implementation of the CVPR 2024 paper\n\nHierarchical Diff"
  },
  {
    "path": "cfgs/diffuser_config.yaml",
    "chars": 656,
    "preview": "defaults:\n  - _self_\n  - method: rk_diffuser\n  - env: sim\n\nframe_skips: 1\ncondition_dropout: 0.25\ncondition_guidance_w: "
  },
  {
    "path": "cfgs/env/rlbench/open_box.yaml",
    "chars": 119,
    "preview": "# @package _global_\n\ndefaults:\n  - rlbench/easy\n  - rlbench\n  - _self_\n\nenv:\n  task_name: open_box\n  episode_length: 3\n"
  },
  {
    "path": "cfgs/env/rlbench/open_door.yaml",
    "chars": 120,
    "preview": "# @package _global_\n\ndefaults:\n  - rlbench/easy\n  - rlbench\n  - _self_\n\nenv:\n  task_name: open_door\n  episode_length: 5\n"
  },
  {
    "path": "cfgs/env/rlbench/open_drawer.yaml",
    "chars": 122,
    "preview": "# @package _global_\n\ndefaults:\n  - rlbench/easy\n  - rlbench\n  - _self_\n\nenv:\n  task_name: open_drawer\n  episode_length: "
  },
  {
    "path": "cfgs/env/rlbench/open_grill.yaml",
    "chars": 121,
    "preview": "# @package _global_\n\ndefaults:\n  - rlbench/easy\n  - rlbench\n  - _self_\n\nenv:\n  task_name: open_grill\n  episode_length: 3"
  },
  {
    "path": "cfgs/env/rlbench/open_microwave.yaml",
    "chars": 125,
    "preview": "# @package _global_\n\ndefaults:\n  - rlbench/easy\n  - rlbench\n  - _self_\n\nenv:\n  task_name: open_microwave\n  episode_lengt"
  },
  {
    "path": "cfgs/env/rlbench/open_oven.yaml",
    "chars": 120,
    "preview": "# @package _global_\n\ndefaults:\n  - rlbench/easy\n  - rlbench\n  - _self_\n\nenv:\n  task_name: open_oven\n  episode_length: 5\n"
  },
  {
    "path": "cfgs/env/rlbench/pick_and_lift.yaml",
    "chars": 124,
    "preview": "# @package _global_\n\ndefaults:\n  - rlbench/easy\n  - rlbench\n  - _self_\n\nenv:\n  task_name: pick_and_lift\n  episode_length"
  },
  {
    "path": "cfgs/env/rlbench/pick_up_cup.yaml",
    "chars": 122,
    "preview": "# @package _global_\n\ndefaults:\n  - rlbench/easy\n  - rlbench\n  - _self_\n\nenv:\n  task_name: pick_up_cup\n  episode_length: "
  },
  {
    "path": "cfgs/env/rlbench/put_knife_on_chopping_board.yaml",
    "chars": 138,
    "preview": "# @package _global_\n\ndefaults:\n  - rlbench/easy\n  - rlbench\n  - _self_\n\nenv:\n  task_name: put_knife_on_chopping_board\n  "
  },
  {
    "path": "cfgs/env/rlbench/reach_target.yaml",
    "chars": 123,
    "preview": "# @package _global_\n\ndefaults:\n  - rlbench/easy\n  - rlbench\n  - _self_\n\nenv:\n  task_name: reach_target\n  episode_length:"
  },
  {
    "path": "cfgs/env/rlbench/take_lid_off_saucepan.yaml",
    "chars": 132,
    "preview": "# @package _global_\n\ndefaults:\n  - rlbench/easy\n  - rlbench\n  - _self_\n\nenv:\n  task_name: take_lid_off_saucepan\n  episod"
  },
  {
    "path": "cfgs/env/rlbench/toilet_seat_up.yaml",
    "chars": 125,
    "preview": "# @package _global_\n\ndefaults:\n  - rlbench/easy\n  - rlbench\n  - _self_\n\nenv:\n  task_name: toilet_seat_up\n  episode_lengt"
  },
  {
    "path": "cfgs/env/rlbench.yaml",
    "chars": 291,
    "preview": "# @package _global_\n\nenv:\n  env_name: rlbench\n  episode_length: 100\n  dataset_root: ''\n  demos: 10\n  visual_observation_"
  },
  {
    "path": "cfgs/env/sim.yaml",
    "chars": 588,
    "preview": "name: sim\ntasks: [reach_target, take_lid_off_saucepan, pick_up_cup, toilet_seat_up, open_box, open_door, open_drawer, op"
  },
  {
    "path": "cfgs/eval.yaml",
    "chars": 874,
    "preview": "defaults:\n  - _self_\n  - env: sim\n\nmethod:\n    name: HDP\n    model_path: ''\n\nrlbench:\n    task_name: \"multi_18T\"\n    tas"
  },
  {
    "path": "cfgs/method/rk_diffuser.yaml",
    "chars": 907,
    "preview": "_target_: rk_diffuser.models.diffusion.GaussianDynDiffusion\n\nhorizon: 64\nobservation_dim: 7\ndim_mults: [1, 2, 4]\naction_"
  },
  {
    "path": "eval.py",
    "chars": 11645,
    "preview": "# code adapted from PerAct: https://github.com/peract/peract\n\nimport gc\nimport logging\nimport os\nimport sys\n\nimport yaml"
  },
  {
    "path": "extra_scripts/dataset_generator.py",
    "chars": 16560,
    "preview": "from multiprocessing import Process, Manager\n\nfrom pyrep.const import RenderMode\n\nfrom rlbench import ObservationConfig\n"
  },
  {
    "path": "extra_scripts/install_coppeliasim.sh",
    "chars": 660,
    "preview": "#!/bin/bash\n# Install CoppeliaSim 4.1.0 for Ubuntu 20.04\n# Refer to PyRep README for other versions\nexport COPPELIASIM_R"
  },
  {
    "path": "helpers/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "helpers/clip/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "helpers/clip/core/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "helpers/clip/core/attention.py",
    "chars": 2625,
    "preview": "\"\"\"Attention module.\"\"\"\n\nimport cliport.models as models\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport to"
  },
  {
    "path": "helpers/clip/core/attention_image_goal.py",
    "chars": 2016,
    "preview": "\"\"\"Attention module.\"\"\"\n\nimport numpy as np\nimport torch\nimport torch.nn.functional as F\nfrom cliport.models.core.attent"
  },
  {
    "path": "helpers/clip/core/clip.py",
    "chars": 27809,
    "preview": "###########################################\n#### Authors: OpenAI\n#### Credit: https://github.com/openai/CLIP\n#### https:"
  },
  {
    "path": "helpers/clip/core/fusion.py",
    "chars": 11523,
    "preview": "import numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n\nclass DotAttn(nn.Module):\n    \"\""
  },
  {
    "path": "helpers/clip/core/resnet.py",
    "chars": 6012,
    "preview": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n\nclass IdentityBlock(nn.Module):\n    def __init__(\n "
  },
  {
    "path": "helpers/clip/core/simple_tokenizer.py",
    "chars": 4851,
    "preview": "import gzip\nimport html\nimport os\nfrom functools import lru_cache\n\nimport ftfy\nimport regex as re\n\n\n@lru_cache()\ndef def"
  },
  {
    "path": "helpers/clip/core/transport.py",
    "chars": 3958,
    "preview": "import cliport.models as models\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfr"
  },
  {
    "path": "helpers/clip/core/transport_image_goal.py",
    "chars": 5368,
    "preview": "import cliport.models as models\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfr"
  },
  {
    "path": "helpers/clip/core/unet.py",
    "chars": 2770,
    "preview": "# Credit: https://github.com/milesial/Pytorch-UNet/\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n"
  },
  {
    "path": "helpers/custom_rlbench_env.py",
    "chars": 17636,
    "preview": "from typing import List, Type\n\nimport numpy as np\nfrom pyrep.const import RenderMode\nfrom pyrep.errors import Configurat"
  },
  {
    "path": "helpers/demo_loading_utils.py",
    "chars": 3468,
    "preview": "import logging\nfrom typing import List\n\nimport numpy as np\nfrom rlbench.demo import Demo\n\n\ndef _is_stopped(demo, i, obs,"
  },
  {
    "path": "helpers/network_utils.py",
    "chars": 30675,
    "preview": "import copy\nfrom typing import List, Union\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functi"
  },
  {
    "path": "helpers/optim/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "helpers/optim/lamb.py",
    "chars": 5236,
    "preview": "\"\"\"Lamb optimizer.\"\"\"\n\n# LAMB optimizer used as is.\n# Source: https://github.com/cybertronai/pytorch-lamb\n# License: htt"
  },
  {
    "path": "helpers/preprocess_agent.py",
    "chars": 3856,
    "preview": "from typing import List\n\nimport torch\nfrom yarr.agents.agent import (\n    ActResult,\n    Agent,\n    HistogramSummary,\n  "
  },
  {
    "path": "helpers/utils.py",
    "chars": 15643,
    "preview": "from typing import List\n\nimport numpy as np\nimport pyrender\nimport torch\nimport trimesh\nfrom pyrender.trackball import T"
  },
  {
    "path": "panda_urdf/meshes/Pandagripper_coll_1.dae",
    "chars": 28196,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandagrippervisual_vis_1.dae",
    "chars": 8258,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandagrippervisual_vis_2.dae",
    "chars": 33238,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandagrippervisual_vis_3.dae",
    "chars": 9173,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandagrippervisual_vis_4.dae",
    "chars": 38444,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandagrippervisual_vis_5.dae",
    "chars": 188964,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandaleftfingervisible_vis_1.dae",
    "chars": 50503,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandaleftfingervisible_vis_2.dae",
    "chars": 37443,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink0visual_vis_1.dae",
    "chars": 269927,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink1respondable_coll_1.dae",
    "chars": 40290,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink1visual_vis_1.dae",
    "chars": 162894,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink2respondable_coll_1.dae",
    "chars": 39724,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink2visual_vis_1.dae",
    "chars": 165477,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink3respondable_coll_1.dae",
    "chars": 40624,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink3visual_vis_1.dae",
    "chars": 162522,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink3visual_vis_2.dae",
    "chars": 48269,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink4respondable_coll_1.dae",
    "chars": 40522,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink4visual_vis_1.dae",
    "chars": 46712,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink4visual_vis_2.dae",
    "chars": 167052,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink5respondable_coll_1.dae",
    "chars": 13665,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink5respondable_coll_2.dae",
    "chars": 7683,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink5respondable_coll_3.dae",
    "chars": 15789,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink5respondable_coll_4.dae",
    "chars": 6501,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink5respondable_coll_5.dae",
    "chars": 4895,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink5respondable_coll_6.dae",
    "chars": 9994,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink5respondable_coll_7.dae",
    "chars": 16145,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink5visual_vis_1.dae",
    "chars": 214892,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink5visual_vis_2.dae",
    "chars": 191632,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink6respondable_coll_1.dae",
    "chars": 28208,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink6visual_vis_1.dae",
    "chars": 48604,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink6visual_vis_2.dae",
    "chars": 377795,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink6visual_vis_3.dae",
    "chars": 7950,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink6visual_vis_4.dae",
    "chars": 70002,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink6visual_vis_5.dae",
    "chars": 87929,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink6visual_vis_6.dae",
    "chars": 21683,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink7respondable_coll_1.dae",
    "chars": 28303,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink7visual_vis_1.dae",
    "chars": 9119,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink7visual_vis_2.dae",
    "chars": 72999,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink7visual_vis_3.dae",
    "chars": 130719,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink7visual_vis_4.dae",
    "chars": 7998,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandalink7visual_vis_5.dae",
    "chars": 14537,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandarightfingervisual_vis_1.dae",
    "chars": 49948,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/Pandarightfingervisual_vis_2.dae",
    "chars": 38234,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/meshes/robot_base_coll_1.dae",
    "chars": 28080,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\" ?>\n<COLLADA xmlns=\"http://www.collada.org/2005/11/COLLADASchema\" ve"
  },
  {
    "path": "panda_urdf/panda.urdf",
    "chars": 26990,
    "preview": "<?xml version=\"1.0\"?>\n<robot name=\"Panda\">\n    <link name=\"robot_base\">\n        <collision name=\"Panda\">\n            <or"
  },
  {
    "path": "peract/ARM_LICENSE",
    "chars": 11222,
    "preview": "Q-attention: Enabling Efficient Learning for Vision-based Robotic Manipulation\n\nLICENCE AGREEMENT\n\nWE (Imperial College "
  },
  {
    "path": "peract/LICENSE",
    "chars": 22714,
    "preview": "                                 Apache License\n                           Version 2.0, January 2004\n                   "
  },
  {
    "path": "peract/agents/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "peract/agents/peract_diffusion/__init__.py",
    "chars": 51,
    "preview": "import peract.agents.peract_diffusion.launch_utils\n"
  },
  {
    "path": "peract/agents/peract_diffusion/launch_utils.py",
    "chars": 18519,
    "preview": "# Adapted from ARM\n# Source: https://github.com/stepjam/ARM\n# License: https://github.com/stepjam/ARM/LICENSE\n\nimport ge"
  },
  {
    "path": "peract/agents/peract_diffusion/perceiver_lang_io.py",
    "chars": 18219,
    "preview": "# Perceiver IO implementation adpated for manipulation\n# Source: https://github.com/lucidrains/perceiver-pytorch\n# Licen"
  },
  {
    "path": "peract/agents/peract_diffusion/qattention_peract_bc_agent.py",
    "chars": 29010,
    "preview": "import logging\nimport os\nfrom typing import List\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn."
  },
  {
    "path": "peract/agents/peract_diffusion/qattention_stack_agent.py",
    "chars": 10209,
    "preview": "import getpass\nimport logging\nimport os\nfrom typing import List, Union\n\nimport numpy as np\nimport plotly.graph_objects a"
  },
  {
    "path": "peract/voxel/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "peract/voxel/augmentation.py",
    "chars": 14784,
    "preview": "import numpy as np\nimport torch\nimport torch.nn.functional as F\n\nfrom helpers import utils\n\n\ndef _axis_angle_rotation(ax"
  },
  {
    "path": "peract/voxel/voxel_grid.py",
    "chars": 8801,
    "preview": "# Voxelizer modified from ARM for DDP training\n# Source: https://github.com/stepjam/ARM\n# License: https://github.com/st"
  },
  {
    "path": "requirements.txt",
    "chars": 292,
    "preview": "absl-py\nhydra-core\nopencv-python-headless\nplotly\neinops\npytorch-kinematics[mujoco]\npyrender\npandas\nftfy\nregex\ntransforme"
  },
  {
    "path": "rk_diffuser/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "rk_diffuser/dataset/__init__.py",
    "chars": 29,
    "preview": "\"\"\"Dataset and its utils.\"\"\"\n"
  },
  {
    "path": "rk_diffuser/dataset/realworld_dataset.py",
    "chars": 27528,
    "preview": "import logging\nimport os\nfrom multiprocessing import Manager, Process\nfrom pathlib import Path\nfrom typing import Any, L"
  },
  {
    "path": "rk_diffuser/dataset/rl_bench_dataset.py",
    "chars": 16354,
    "preview": "import os\nimport pickle\nfrom collections import namedtuple\nfrom multiprocessing import Manager, Process\nfrom typing impo"
  },
  {
    "path": "rk_diffuser/dataset/rl_bench_env.py",
    "chars": 7098,
    "preview": "import logging\nfrom typing import List, Type\n\nimport numpy as np\nfrom pyrep.const import RenderMode\nfrom pyrep.errors im"
  },
  {
    "path": "rk_diffuser/models/__init__.py",
    "chars": 45,
    "preview": "\"\"\"Diffusion-based trajectory prediction.\"\"\"\n"
  },
  {
    "path": "rk_diffuser/models/diffusion.py",
    "chars": 28024,
    "preview": "\"\"\"DDPM diffusion model.\"\"\"\n\nimport torch\nfrom torch import nn\nfrom torch.nn import functional as F\n\nimport rk_diffuser."
  },
  {
    "path": "rk_diffuser/models/helpers.py",
    "chars": 6391,
    "preview": "\"\"\"Helper functions.\"\"\"\n\nimport math\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional a"
  },
  {
    "path": "rk_diffuser/models/multi_level_diffusion.py",
    "chars": 7315,
    "preview": "from typing import Dict\n\nimport torch\nimport torch.nn as nn\n\nfrom rk_diffuser import utils\nfrom rk_diffuser.models.diffu"
  },
  {
    "path": "rk_diffuser/models/pointnet.py",
    "chars": 6031,
    "preview": "\"\"\"PointNet to handle the 3D point cloud.\"\"\"\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.func"
  },
  {
    "path": "rk_diffuser/models/resnet.py",
    "chars": 3251,
    "preview": "import ssl\nfrom functools import reduce\nfrom typing import Tuple\n\nimport torch\nimport torch.nn as nn\nimport torchvision "
  },
  {
    "path": "rk_diffuser/models/temporal.py",
    "chars": 19421,
    "preview": "\"\"\"Temporal UNet to process trajectories.\"\"\"\n\nimport einops\nimport torch\nimport torch.nn as nn\nfrom einops import rearra"
  },
  {
    "path": "rk_diffuser/robot.py",
    "chars": 4365,
    "preview": "from typing import Optional\n\nimport pytorch_kinematics as pk\nimport torch\nimport torch.nn as nn\nimport torch.nn.function"
  },
  {
    "path": "rk_diffuser/trainer.py",
    "chars": 24714,
    "preview": "\"\"\"Trainer for diffusion models.\"\"\"\n\nimport copy\nimport datetime\nimport os\nfrom multiprocessing import Manager, Process\n"
  },
  {
    "path": "rk_diffuser/utils.py",
    "chars": 17146,
    "preview": "\"\"\"Util functions for diffusion models.\"\"\"\n\nimport logging\nimport math\nimport time\nfrom typing import List, Optional\n\nim"
  },
  {
    "path": "setup.py",
    "chars": 512,
    "preview": "from setuptools import find_packages, setup\n\nsetup(\n    name=\"hdp\",\n    version=\"0.1.0\",\n    packages=find_packages(),\n "
  },
  {
    "path": "train_low_level.py",
    "chars": 6221,
    "preview": "\"\"\"Main script.\"\"\"\n\nimport os\n\nimport hydra\nimport torch\n\nfrom rk_diffuser.dataset.rl_bench_dataset import RLBenchDatase"
  }
]

About this extraction

This page contains the full source code of the dyson-ai/hdp GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 116 files (3.4 MB), approximately 905.6k tokens, and a symbol index with 602 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.

Extract another repo