Full Code of LYX0501/InstructNav for AI

Repository: LYX0501/InstructNav
Branch: main
Commit: e58a5aac51cd
Files: 72
Total size: 565.1 KB

Directory structure:
InstructNav/

├── .gitignore
├── README.md
├── config_utils.py
├── constants.py
├── cv_utils/
│   ├── glee_detector.py
│   ├── image_percevior.py
│   └── object_list.py
├── llm_utils/
│   ├── gpt_request.py
│   └── nav_prompt.py
├── mapper.py
├── mapping_utils/
│   ├── geometry.py
│   ├── path_planning.py
│   ├── preprocess.py
│   ├── projection.py
│   └── transform.py
├── objnav_agent.py
├── objnav_benchmark.py
├── requirements.txt
└── thirdparty/
    └── GLEE/
        ├── configs/
        │   ├── R50.yaml
        │   └── SwinL.yaml
        └── glee/
            ├── __init__.py
            ├── backbone/
            │   ├── __init__.py
            │   ├── backbone.py
            │   ├── build.py
            │   ├── davit.py
            │   ├── eva01.py
            │   ├── eva02-dino.py
            │   ├── eva02.py
            │   ├── eva_01_utils.py
            │   ├── eva_02_utils.py
            │   ├── internimage.py
            │   ├── registry.py
            │   ├── resnet.py
            │   ├── swin.py
            │   ├── vit.py
            │   └── vit_utils.py
            ├── config.py
            ├── config_deeplab.py
            ├── models/
            │   ├── glee_model.py
            │   ├── pixel_decoder/
            │   │   ├── __init__.py
            │   │   ├── early_fusion.py
            │   │   ├── maskdino_encoder.py
            │   │   ├── ops/
            │   │   │   ├── functions/
            │   │   │   │   ├── __init__.py
            │   │   │   │   └── ms_deform_attn_func.py
            │   │   │   ├── make.sh
            │   │   │   ├── modules/
            │   │   │   │   ├── __init__.py
            │   │   │   │   └── ms_deform_attn.py
            │   │   │   ├── setup.py
            │   │   │   ├── src/
            │   │   │   │   ├── cpu/
            │   │   │   │   │   ├── ms_deform_attn_cpu.cpp
            │   │   │   │   │   └── ms_deform_attn_cpu.h
            │   │   │   │   ├── cuda/
            │   │   │   │   │   ├── ms_deform_attn_cuda.cu
            │   │   │   │   │   ├── ms_deform_attn_cuda.h
            │   │   │   │   │   └── ms_deform_im2col_cuda.cuh
            │   │   │   │   ├── ms_deform_attn.h
            │   │   │   │   └── vision.cpp
            │   │   │   └── test.py
            │   │   └── position_encoding.py
            │   ├── transformer_decoder/
            │   │   ├── __init__.py
            │   │   ├── dino_decoder.py
            │   │   └── maskdino_decoder.py
            │   └── vos_utils.py
            ├── modules/
            │   ├── __init__.py
            │   ├── attention.py
            │   ├── point_features.py
            │   ├── position_encoding.py
            │   └── postprocessing.py
            └── utils/
                ├── __init__.py
                ├── box_ops.py
                ├── config.py
                ├── it_contrastive.py
                ├── misc.py
                └── utils.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
./tmp
*.pth
*.pyc
**/__pycache__


================================================
FILE: README.md
================================================
# InstructNav

Enabling robots to navigate following diverse language instructions in unexplored environments is an attractive goal for human-robot interaction. In this work, we propose InstructNav, a generic instruction navigation system. InstructNav makes the first endeavor to handle various instruction navigation tasks without any navigation training or pre-built maps. To reach this goal, we introduce **Dynamic Chain-of-Navigation (DCoN)** to unify the planning process for different types of navigation instructions. Furthermore, we propose **Multi-sourced Value Maps** to model key elements in instruction navigation so that linguistic DCoN planning can be converted into robot actionable trajectories. 

![InstructNav](https://github.com/LYX0501/InstructNav/blob/main/InstructNav.png)

With InstructNav, we complete the R2R-CE task in a zero-shot way for the first time and outperform many task-training methods. Besides, InstructNav also surpasses the previous SOTA method by 10.48% on zero-shot Habitat ObjNav and by 86.34% on demand-driven navigation (DDN). Real-robot experiments on diverse indoor scenes further demonstrate our method's robustness in coping with environment and instruction variations. Please refer to our paper for more details:
[InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment](https://arxiv.org/abs/2406.04882).
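
The value-map fusion behind this pipeline is implemented in `mapper.py`; below is a simplified sketch of the equal-weighted combination (the function name and array arguments are illustrative):
```python
import numpy as np

def fuse_value_maps(semantic, action, gpt4v, history, obstacle):
    # Equal-weighted fusion of the four per-point value maps, mirroring
    # Instruct_Mapper.get_objnav_affordance_map in mapper.py.
    affordance = 0.25 * semantic + 0.25 * action + 0.25 * gpt4v + 0.25 * history
    affordance = np.clip(affordance, 0.1, 1.0)
    # Points flagged as too close to obstacles are masked out entirely.
    affordance[obstacle == 0] = 0
    return affordance
```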
## 🔥 News
- 2024.9.11: The HM3D objnav benchmark code is released.
- 2024.9.5: Our paper is accepted by CoRL 2024. Code will be released soon.

### Dependency ###
Our project is based on [habitat-sim](https://github.com/facebookresearch/habitat-sim?tab=readme-ov-file), [habitat-lab](https://github.com/facebookresearch/habitat-lab), and [Detectron2](https://github.com/facebookresearch/detectron2). Please follow their guides to install them in your Python environment. You can directly install the latest versions of habitat-lab and habitat-sim. Also make sure you have properly downloaded the navigation scenes [(HM3D, MP3D)](https://github.com/facebookresearch/habitat-lab/blob/main/DATASETS.md) and the episode datasets for both visual-language navigation (VLN-CE) and object navigation.
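
A quick sanity check that the core dependencies are importable (a minimal sketch, not part of the benchmark code):
```python
# Minimal import check for the simulator and detection dependencies used by this repo.
import habitat        # habitat-lab
import habitat_sim    # habitat-sim
import detectron2

print("habitat-lab, habitat-sim and detectron2 imported successfully")
```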

### Installation ###
Firstly, clone our repo as:
```
git clone https://github.com/LYX0501/InstructNav.git
cd InstructNav
pip install -r requirements.txt
```
Our method depends on the open-vocabulary detection and segmentation model [GLEE](https://github.com/FoundationVision/GLEE). Please check the original repo or use the copy located in the ./thirdparty/ directory.
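A minimal usage sketch of the GLEE wrapper in `cv_utils/` (assuming the `GLEE_SwinL_Scaleup10m.pth` checkpoint is placed under `./thirdparty/GLEE/` as expected by `constants.py`; the test image path is illustrative):
```python
import cv2
from cv_utils.glee_detector import initialize_glee, glee_segmentation, visualize_detection

model = initialize_glee()  # loads ./thirdparty/GLEE/configs/SwinL.yaml and the SwinL checkpoint
image = cv2.cvtColor(cv2.imread("example_rgb.png"), cv2.COLOR_BGR2RGB)  # illustrative RGB test image
bboxes, masks, classes, scores = glee_segmentation(image, model, threshold_select=0.25)
print(classes)  # detected category names from cv_utils/object_list.py
vis = visualize_detection(image, classes, bboxes)
cv2.imwrite("glee_detection.png", cv2.cvtColor(vis, cv2.COLOR_RGB2BGR))
```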

### Prepare your GPT4 and GPT4V API Keys ###
Please prepare API keys for calling the large language model and the large vision-language model.
We use GPT4 and GPT4V for inference, and our code follows the AzureOpenAI calling convention.
Before running the benchmark, set your own API keys, endpoints, and API versions as the environment variables below; see ./llm_utils/gpt_request.py for usage details, and the minimal sketch after the variable list.
```
export GPT4_API_BASE=<YOUR_GPT4_ENDPOINT>
export GPT4_API_KEY=<YOUR_GPT4_KEY>
export GPT4_API_DEPLOY=<GPT4_MODEL_NAME>
export GPT4_API_VERSION=<GPT4_MODEL_VERSION>
export GPT4V_API_BASE=<YOUR_GPT4V_ENDPOINT>
export GPT4V_API_KEY=<YOUR_GPT4V_KEY>
export GPT4V_API_DEPLOY=<GPT4V_MODEL_NAME>
export GPT4V_API_VERSION=<GPT4V_MODEL_VERSION>
```
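With these variables set, the helpers in ./llm_utils/gpt_request.py can be called directly; a minimal sketch (the prompts and image below are illustrative):
```python
import numpy as np
from llm_utils.gpt_request import gpt_response, gptv_response

# Text-only request through the GPT4 deployment.
print(gpt_response("List three common indoor landmarks.", system_prompt="You are a navigation assistant."))

# Vision request through the GPT4V deployment; a numpy RGB image is accepted directly.
dummy_image = np.zeros((480, 640, 3), dtype=np.uint8)
print(gptv_response("Which direction looks navigable?", dummy_image))
```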

### Running our Benchmark Code ###
If everything goes well, you can directly run the evaluation code for different navigation tasks.
For example, 
```
python objnav_benchmark.py
```
All episode results and intermediate results, such as GPT4 input/output and value maps, will be saved in the /tmp/ directory. The real-time agent first-person-view RGB, depth, and segmentation observations will be saved in the project root directory. Examples are shown below:
![test](https://github.com/user-attachments/assets/51a65b07-70e2-49f3-a850-815b0ec151d0)

https://github.com/user-attachments/assets/04e37b91-c524-4c51-86d1-8fb72325f612






## BibTex
Please cite our paper if you find it helpful :)
```
@misc{InstructNav,
      title={InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment}, 
      author={Yuxing Long and Wenzhe Cai and Hongcheng Wang and Guanqi Zhan and Hao Dong},
      year={2024},
      eprint={2406.04882},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
}
```


================================================
FILE: config_utils.py
================================================
import habitat
from habitat.config.read_write import read_write
from habitat.config.default_structured_configs import (
    CollisionsMeasurementConfig,
    FogOfWarConfig,
    TopDownMapMeasurementConfig,
)
HM3D_CONFIG_PATH = "<YOUR SAVE PATH>/habitat-lab/habitat-lab/habitat/config/benchmark/nav/objectnav/objectnav_hm3d.yaml"
MP3D_CONFIG_PATH = "<YOUR SAVE PATH>/habitat-lab/habitat-lab/habitat/config/benchmark/nav/objectnav/objectnav_mp3d.yaml"
R2R_CONFIG_PATH = "<YOUR SAVE PATH>/habitat-lab/habitat-lab/habitat/config/benchmark/nav/vln_r2r.yaml"

def hm3d_config(path:str=HM3D_CONFIG_PATH,stage:str='val',episodes=200):
    habitat_config = habitat.get_config(path)
    with read_write(habitat_config):
        habitat_config.habitat.dataset.split = stage
        habitat_config.habitat.dataset.scenes_dir = "/home/PJLAB/caiwenzhe/Desktop/dataset/scenes"
        habitat_config.habitat.dataset.data_path = "/home/PJLAB/caiwenzhe/Desktop/dataset/habitat_task/objectnav/hm3d/v2/{split}/{split}.json.gz"
        habitat_config.habitat.simulator.scene_dataset = "/home/PJLAB/caiwenzhe/Desktop/dataset/scenes/hm3d_v0.2/hm3d_annotated_basis.scene_dataset_config.json"
        habitat_config.habitat.environment.iterator_options.num_episode_sample = episodes
        habitat_config.habitat.task.measurements.update(
        {
            "top_down_map": TopDownMapMeasurementConfig(
                map_padding=3,
                map_resolution=1024,
                draw_source=True,
                draw_border=True,
                draw_shortest_path=False,
                draw_view_points=True,
                draw_goal_positions=True,
                draw_goal_aabbs=True,
                fog_of_war=FogOfWarConfig(
                    draw=True,
                    visibility_dist=5.0,
                    fov=90,
                ),
            ),
            "collisions": CollisionsMeasurementConfig(),
        })
        habitat_config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.max_depth=5.0
        habitat_config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.normalize_depth=False
        habitat_config.habitat.task.measurements.success.success_distance = 0.25
    return habitat_config
    
def mp3d_config(path:str=MP3D_CONFIG_PATH,stage:str='val',episodes=200):
    habitat_config = habitat.get_config(path)
    with read_write(habitat_config):
        habitat_config.habitat.dataset.split = stage
        habitat_config.habitat.dataset.scenes_dir = "/home/PJLAB/caiwenzhe/Desktop/dataset/scenes"
        habitat_config.habitat.dataset.data_path = "/home/PJLAB/caiwenzhe/Desktop/dataset/habitat_task/objectnav/mp3d/v1/{split}/{split}.json.gz"
        habitat_config.habitat.simulator.scene_dataset = "/home/PJLAB/caiwenzhe/Desktop/dataset/scenes/mp3d/mp3d.scene_dataset_config.json"
        habitat_config.habitat.environment.iterator_options.num_episode_sample = episodes
        habitat_config.habitat.task.measurements.update(
        {
            "top_down_map": TopDownMapMeasurementConfig(
                map_padding=3,
                map_resolution=1024,
                draw_source=True,
                draw_border=True,
                draw_shortest_path=False,
                draw_view_points=True,
                draw_goal_positions=True,
                draw_goal_aabbs=True,
                fog_of_war=FogOfWarConfig(
                    draw=True,
                    visibility_dist=5.0,
                    fov=79,
                ),
            ),
            "collisions": CollisionsMeasurementConfig(),
        })
        habitat_config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.max_depth=5.0
        habitat_config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.normalize_depth=False
        habitat_config.habitat.task.measurements.success.success_distance = 0.25
    return habitat_config

def r2r_config(path:str=R2R_CONFIG_PATH,stage:str='val_seen',episodes=200):
    habitat_config = habitat.get_config(path)
    with read_write(habitat_config):
        habitat_config.habitat.dataset.split = stage
        habitat_config.habitat.dataset.scenes_dir = "/home/PJLAB/caiwenzhe/Desktop/dataset/scenes"
        habitat_config.habitat.dataset.data_path = "/home/PJLAB/caiwenzhe/Desktop/dataset/habitat_task/vln/r2r/{split}/{split}.json.gz"
        habitat_config.habitat.simulator.scene_dataset = "/home/PJLAB/caiwenzhe/Desktop/dataset/scenes/mp3d/mp3d.scene_dataset_config.json"
        habitat_config.habitat.environment.iterator_options.num_episode_sample = episodes
        habitat_config.habitat.task.measurements.update(
        {
            "top_down_map": TopDownMapMeasurementConfig(
                map_padding=3,
                map_resolution=1024,
                draw_source=True,
                draw_border=True,
                draw_shortest_path=False,
                draw_view_points=True,
                draw_goal_positions=True,
                draw_goal_aabbs=True,
                fog_of_war=FogOfWarConfig(
                    draw=True,
                    visibility_dist=5.0,
                    fov=79,
                ),
            ),
            "collisions": CollisionsMeasurementConfig(),
        })  
        habitat_config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.max_depth=5.0
        habitat_config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.normalize_depth=False
        habitat_config.habitat.task.measurements.success.success_distance = 0.25
    return habitat_config


================================================
FILE: constants.py
================================================
from cv_utils.object_list import categories
GLEE_CONFIG_PATH = "./thirdparty/GLEE/configs/SwinL.yaml"
GLEE_CHECKPOINT_PATH = "./thirdparty/GLEE/GLEE_SwinL_Scaleup10m.pth"
DETECT_OBJECTS = [[cat['name'].lower()] for cat in categories]
INTEREST_OBJECTS = ['bed','chair','toilet','potted_plant','television_set','sofa']




================================================
FILE: cv_utils/glee_detector.py
================================================
from thirdparty.GLEE.glee.models.glee_model import GLEE_Model
from thirdparty.GLEE.glee.config_deeplab import add_deeplab_config
from thirdparty.GLEE.glee.config import add_glee_config
from habitat_sim.utils.common import d3_40_colors_rgb
from constants import *
from detectron2.config import get_cfg
from .object_list import categories as CATEGORIES
import torch
import torch.nn.functional as F
import torchvision
import cv2
import numpy as np
CATEGORIES = [cat['name'].lower() for cat in CATEGORIES]

def initialize_glee(glee_config=GLEE_CONFIG_PATH,
                    glee_checkpoint=GLEE_CHECKPOINT_PATH,
                    device="cuda:0"):
    cfg_swin = get_cfg()
    add_deeplab_config(cfg_swin)
    add_glee_config(cfg_swin)
    conf_files_swin = glee_config
    checkpoints_swin = torch.load(glee_checkpoint) 
    cfg_swin.merge_from_file(conf_files_swin)
    GLEEmodel_swin = GLEE_Model(cfg_swin, None, device, None, True).to(device)
    GLEEmodel_swin.load_state_dict(checkpoints_swin, strict=False)
    GLEEmodel_swin.eval()
    return GLEEmodel_swin

# prompt_mode="categories", 
# results_select=["box", "mask", "name", "score"],

def glee_segmentation(img, 
                      GLEEmodel, 
                      custom_category=CATEGORIES,
                      num_inst_select=15,
                      threshold_select=0.2,
                      device="cuda:0"):        
    pixel_mean = torch.Tensor([123.675, 116.28, 103.53]).to(device).view(3, 1, 1)
    pixel_std = torch.Tensor([58.395, 57.12, 57.375]).to(device).view(3, 1, 1)
    normalizer = lambda x: (x - pixel_mean) / pixel_std
    ori_image = torch.as_tensor(np.ascontiguousarray(img.transpose(2, 0, 1)))
    ori_image = normalizer(ori_image.to(device))[None,]
    _,_, ori_height, ori_width = ori_image.shape
    resizer = torchvision.transforms.Resize(800)
    resize_image = resizer(ori_image)
    image_size = torch.as_tensor((resize_image.shape[-2],resize_image.shape[-1]))
    re_size = resize_image.shape[-2:]
    stride = 32
    # the last two dims are H,W, both subject to divisibility requirement
    padding_size = ((image_size + (stride - 1)).div(stride, rounding_mode="floor") * stride).tolist()
    infer_image = torch.zeros(1,3,padding_size[0],padding_size[1]).to(resize_image)
    infer_image[0,:,:image_size[0],:image_size[1]] = resize_image
    batch_category_name = custom_category
    prompt_list = []
    with torch.no_grad():
        (outputs,_) = GLEEmodel(infer_image, prompt_list, task="coco", batch_name_list=batch_category_name, is_train=False)
    topK_instance = max(num_inst_select,1)
    bbox_pred = outputs['pred_boxes'][0]
    bbox_pred[:,0],bbox_pred[:,2] = bbox_pred[:,0] * img.shape[1] - bbox_pred[:,2] * img.shape[1] * 0.5, bbox_pred[:,0] * img.shape[1] + bbox_pred[:,2] * img.shape[1] * 0.5
    bbox_pred[:,1],bbox_pred[:,3] = bbox_pred[:,1] * img.shape[0] - bbox_pred[:,3] * img.shape[0] * 0.5, bbox_pred[:,1] * img.shape[0] + bbox_pred[:,3] * img.shape[0] * 0.5
    mask_pred = outputs['pred_masks'][0]
    mask_cls = outputs['pred_logits'][0]
    scores = mask_cls.sigmoid().max(-1)[0]
    scores_per_image, topk_indices = scores.topk(topK_instance, sorted=True)
    valid = scores_per_image>threshold_select
    topk_indices = topk_indices[valid]
    scores_per_image = scores_per_image[valid]
    pred_class = mask_cls[topk_indices].max(-1)[1].tolist()
    if len(pred_class) == 0: 
        return [], [], [], []
    mask_pred = mask_pred[topk_indices]
    bbox_pred = bbox_pred[topk_indices].cpu().numpy()
    pred_masks = F.interpolate( mask_pred[None,], size=(padding_size[0], padding_size[1]), mode="bilinear", align_corners=False)
    pred_masks = pred_masks[:,:,:re_size[0],:re_size[1]]
    pred_masks = F.interpolate( pred_masks, size=(ori_height,ori_width), mode="bilinear", align_corners=False  )
    pred_masks = (pred_masks>0).detach().cpu().numpy()[0]
    return bbox_pred, pred_masks, np.array(batch_category_name)[pred_class], scores_per_image

def visualize_segmentation(image,classes,masks):
    copy_image = image.copy()
    label_classes = np.unique(classes)
    for cls,mask in zip(classes,masks):
        if len(np.unique(mask)) != 2: continue
        copy_image[np.where(mask == 1)] = d3_40_colors_rgb[label_classes.tolist().index(cls)]
        x, y = int(np.mean(np.where(mask)[1])), int(np.mean(np.where(mask)[0])) 
        cv2.putText(copy_image, str(cls), (x, y), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255,255,255), 2)
    ret_image = cv2.addWeighted(image,0.2,copy_image,0.8,0)
    return ret_image

def visualize_detection(image,classes,bboxes):
    copy_image = image.copy()
    label_classes = np.unique(classes)
    for cls,bbox in zip(classes,bboxes):
        color = d3_40_colors_rgb[label_classes.tolist().index(cls)%40]
        copy_image = cv2.rectangle(copy_image,(int(bbox[0]),int(bbox[1])),(int(bbox[2]),int(bbox[3])),color.tolist(),2)
        x, y = int(bbox[0]*0.5+bbox[2]*0.5), int(bbox[1]*0.5+bbox[3]*0.5)
        cv2.putText(copy_image, str(cls), (x, y), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255,255,255), 2)
    return copy_image



    

================================================
FILE: cv_utils/image_percevior.py
================================================
from constants import *
from .glee_detector import *
class GLEE_Percevior:
    def __init__(self,
                 glee_config=GLEE_CONFIG_PATH,
                 glee_checkpoint=GLEE_CHECKPOINT_PATH,
                 device = "cuda:0"):
        self.device = device
        self.glee_model = initialize_glee(glee_config,glee_checkpoint,device)
    def perceive(self,image,confidence_threshold=0.25,area_threshold=2500):
        pred_bboxes, pred_masks, pred_class, pred_confidence = glee_segmentation(image,self.glee_model,threshold_select=confidence_threshold,device=self.device)
        try:
            mask_area = np.array([mask.sum() for mask in pred_masks])
            bbox_trust = np.array([(bbox[0] > 20) & (bbox[2] < image.shape[1] - 20) for bbox in pred_bboxes])
            visualization = visualize_segmentation(image,pred_class[(mask_area>area_threshold) & bbox_trust],pred_masks[(mask_area>area_threshold) & bbox_trust])
            return pred_class[(mask_area>area_threshold) & bbox_trust],pred_masks[(mask_area>area_threshold) & bbox_trust],pred_confidence[(mask_area>area_threshold) & bbox_trust],[visualization]
        except:
            return [],[],[],[image]


================================================
FILE: cv_utils/object_list.py
================================================
categories = [
  {"id": 1, "name": "bed"},
  {"id": 2, "name": "sofa"},
  {"id": 3, "name": "chair"},
  {"id": 4, "name": "television_set"},
  {"id": 5, "name": "potted_plant"},
  {"id": 6, "name": "toilet"},
  {"id": 7, "name": "lamp"},
  {"id": 8, "name": "desk"},
  {"id": 9, "name": "bookshelf"},
  {"id": 10, "name": "cupboard"},
  {"id": 11, "name": "drawer"},
  {"id": 12, "name": "refrigerator"},
  {"id": 13, "name": "oven"},
  {"id": 14, "name": "microwave"},
  {"id": 15, "name": "toaster"},
  {"id": 16, "name": "sink"},
  {"id": 17, "name": "dishwasher"},
  {"id": 18, "name": "coffee_machine"},
  {"id": 19, "name": "kettle"},
  {"id": 20, "name": "stove"},
  {"id": 21, "name": "washing_machine"},
  {"id": 22, "name": "dryer"},
  {"id": 23, "name": "mirror"},
  {"id": 24, "name": "clock"},
  {"id": 25, "name": "curtains"},
  {"id": 26, "name": "blinds"},
  {"id": 27, "name": "bathtub"},
  {"id": 28, "name": "shower"},
  {"id": 29, "name": "table"},
  {"id": 30, "name": "towel"},
  {"id": 31, "name": "soap_dispenser"},
  {"id": 32, "name": "toothbrush"},
  {"id": 33, "name": "toothpaste"},
  {"id": 34, "name": "shampoo"},
  {"id": 35, "name": "conditioner"},
  {"id": 36, "name": "hair_dryer"},
  {"id": 37, "name": "razor"},
  {"id": 38, "name": "makeup"},
  {"id": 39, "name": "tissue_box"},
  {"id": 40, "name": "trash_can"},
  {"id": 41, "name": "vacuum_cleaner"},
  {"id": 42, "name": "mop"},
  {"id": 43, "name": "broom"},
  {"id": 44, "name": "bucket"},
  {"id": 45, "name": "sponge"},
  {"id": 46, "name": "detergent"},
  {"id": 47, "name": "iron"},
  {"id": 48, "name": "ironing_board"},
  {"id": 49, "name": "laundry_basket"},
  {"id": 50, "name": "clothes_hanger"},
  {"id": 51, "name": "coat_rack"},
  {"id": 52, "name": "shoe_rack"},
  {"id": 53, "name": "umbrella"},
  {"id": 54, "name": "fire_extinguisher"},
  {"id": 55, "name": "first_aid_kit"},
  {"id": 56, "name": "thermometer"},
  {"id": 57, "name": "scale"},
  {"id": 58, "name": "fan"},
  {"id": 59, "name": "heater"},
  {"id": 60, "name": "air_conditioner"},
  {"id": 61, "name": "humidifier"},
  {"id": 62, "name": "dehumidifier"},
  {"id": 63, "name": "light_switch"},
  {"id": 64, "name": "electrical_outlet"},
  {"id": 65, "name": "extension_cord"},
  {"id": 66, "name": "remote_control"},
  {"id": 67, "name": "game_console"},
  {"id": 68, "name": "router"},
  {"id": 69, "name": "modem"},
  {"id": 70, "name": "computer"},
  {"id": 71, "name": "laptop"},
  {"id": 72, "name": "printer"},
  {"id": 73, "name": "scanner"},
  {"id": 74, "name": "fax_machine"},
  {"id": 75, "name": "telephone"},
  {"id": 76, "name": "smartphone"},
  {"id": 77, "name": "tablet"},
  {"id": 78, "name": "keyboard"},
  {"id": 79, "name": "mouse"},
  {"id": 80, "name": "monitor"},
  {"id": 81, "name": "notebook"},
  {"id": 82, "name": "pen"},
  {"id": 83, "name": "pencil"},
  {"id": 84, "name": "eraser"},
  {"id": 85, "name": "stapler"},
  {"id": 86, "name": "scissors"},
  {"id": 87, "name": "tape_dispenser"},
  {"id": 88, "name": "paper_clip"},
  {"id": 89, "name": "envelope"},
  {"id": 90, "name": "letter_opener"},
  {"id": 91, "name": "cabinet"},
  {"id": 92, "name": "whiteboard"},
  {"id": 93, "name": "calendar"},
  {"id": 94, "name": "photo_frame"},
  {"id": 95, "name": "vase"},
  {"id": 96, "name": "candle"},
  {"id": 97, "name": "incense"},
  {"id": 98, "name": "book"},
  {"id": 99, "name": "magazine"},
  {"id": 100, "name": "newspaper"},
  {"id": 101, "name": "album"},
  {"id": 102, "name": "record_player"},
  {"id": 103, "name": "cd_player"},
  {"id": 104, "name": "dvd_player"},
  {"id": 105, "name": "blu_ray_player"},
  {"id": 106, "name": "speaker"},
  {"id": 107, "name": "headphones"},
  {"id": 108, "name": "microphone"},
  {"id": 109, "name": "camera"},
  {"id": 110, "name": "camcorder"},
  {"id": 111, "name": "tripod"},
  {"id": 112, "name": "flashlight"},
  {"id": 113, "name": "batteries"},
  {"id": 114, "name": "charger"},
  {"id": 115, "name": "cable"},
  {"id": 116, "name": "usb_drive"},
  {"id": 117, "name": "hard_drive"},
  {"id": 118, "name": "router"},
  {"id": 119, "name": "switch"},
  {"id": 120, "name": "firewall"},
  {"id": 121, "name": "server"},
  {"id": 122, "name": "keyboard_tray"},
  {"id": 123, "name": "mouse_pad"},
  {"id": 124, "name": "speaker_stand"},
  {"id": 125, "name": "monitor_stand"},
  {"id": 126, "name": "file_folder"},
  {"id": 127, "name": "binder"},
  {"id": 128, "name": "clipboard"},
  {"id": 129, "name": "calculator"},
  {"id": 130, "name": "label_maker"},
  {"id": 131, "name": "hole_punch"},
  {"id": 132, "name": "paper_shredder"},
  {"id": 133, "name": "post_it_note"},
  {"id": 134, "name": "thumbtack"},
  {"id": 135, "name": "magnet"},
  {"id": 136, "name": "ruler"},
  {"id": 137, "name": "protractor"},
  {"id": 138, "name": "compass"},
  {"id": 139, "name": "glue"},
  {"id": 140, "name": "white_out"},
  {"id": 141, "name": "marker"},
  {"id": 142, "name": "highlighter"},
  {"id": 143, "name": "crayon"},
  {"id": 144, "name": "paint"},
  {"id": 145, "name": "paintbrush"},
  {"id": 146, "name": "easel"},
  {"id": 147, "name": "canvas"},
  {"id": 148, "name": "palette"},
  {"id": 149, "name": "sculpting_tools"},
  {"id": 150, "name": "clay"},
  {"id": 151, "name": "sewing_machine"},
  {"id": 152, "name": "thread"},
  {"id": 153, "name": "needle"},
  {"id": 154, "name": "scissors"},
  {"id": 155, "name": "fabric"},
  {"id": 156, "name": "measuring_tape"},
  {"id": 157, "name": "pin_cushion"},
  {"id": 158, "name": "thimble"},
  {"id": 159, "name": "seam_ripper"},
  {"id": 160, "name": "iron"},
  {"id": 161, "name": "pattern"},
  {"id": 162, "name": "ribbon"},
  {"id": 163, "name": "button"},
  {"id": 164, "name": "zipper"},
  {"id": 165, "name": "hook"},
  {"id": 166, "name": "stairs"},
  {"id": 167, "name": "snap"},
  {"id": 168, "name": "velcro"},
  {"id": 169, "name": "elastic"},
  {"id": 170, "name": "lace"},
  {"id": 171, "name": "trim"},
  {"id": 172, "name": "bead"},
  {"id": 173, "name": "sequin"},
  {"id": 174, "name": "glue_gun"},
  {"id": 175, "name": "glue_stick"},
  {"id": 176, "name": "craft_knife"},
  {"id": 177, "name": "cutting_mat"},
  {"id": 178, "name": "ruler"},
  {"id": 179, "name": "scalpel"},
  {"id": 180, "name": "tweezers"},
  {"id": 181, "name": "pliers"},
  {"id": 182, "name": "hammer"},
  {"id": 183, "name": "screwdriver"},
  {"id": 184, "name": "wrench"},
  {"id": 185, "name": "drill"},
  {"id": 186, "name": "saw"},
  {"id": 187, "name": "chisel"},
  {"id": 188, "name": "level"},
  {"id": 189, "name": "tape_measure"},
  {"id": 190, "name": "toolbox"},
  {"id": 191, "name": "nail"},
  {"id": 192, "name": "screw"},
  {"id": 193, "name": "bolt"},
  {"id": 194, "name": "nut"},
  {"id": 195, "name": "washer"},
  {"id": 196, "name": "sandpaper"},
  {"id": 197, "name": "wood_glue"},
  {"id": 198, "name": "clamp"},
  {"id": 199, "name": "vise"},
  {"id": 200, "name": "workbench"}
]

================================================
FILE: llm_utils/gpt_request.py
================================================
import os
from openai import AzureOpenAI,OpenAI
import requests
import base64
import cv2
import numpy as np
from mimetypes import guess_type

gpt4_api_base = os.environ['GPT4_API_BASE']
gpt4_api_key = os.environ['GPT4_API_KEY']
gpt4v_api_base = os.environ['GPT4V_API_BASE']
gpt4v_api_key = os.environ['GPT4V_API_KEY']

# Keep separate deployment names for GPT4 and GPT4V so the helpers below
# do not silently share the last-assigned deployment.
gpt4_deployment = os.environ['GPT4_API_DEPLOY']
gpt4_api_version = os.environ['GPT4_API_VERSION']
gpt4_client = AzureOpenAI(
    api_key=gpt4_api_key,
    api_version=gpt4_api_version,
    base_url=f"{gpt4_api_base}/openai/deployments/{gpt4_deployment}"
)

gpt4v_deployment = os.environ['GPT4V_API_DEPLOY']
gpt4v_api_version = os.environ['GPT4V_API_VERSION']
gpt4v_client = AzureOpenAI(
    api_key=gpt4v_api_key,
    api_version=gpt4v_api_version,
    base_url=f"{gpt4v_api_base}/openai/deployments/{gpt4v_deployment}")

def local_image_to_data_url(image):
    if isinstance(image,str):
        mime_type, _ = guess_type(image)
        with open(image, "rb") as image_file:
            base64_encoded_data = base64.b64encode(image_file.read()).decode('utf-8')
        return f"data:{mime_type};base64,{base64_encoded_data}"
    elif isinstance(image,np.ndarray):
        base64_encoded_data = base64.b64encode(cv2.imencode('.jpg',image)[1]).decode('utf-8')
        return f"data:image/jpeg;base64,{base64_encoded_data}"

def gptv_response(text_prompt,image_prompt,system_prompt=""):
    prompt = [{'role':'system','content':system_prompt},
             {'role':'user','content':[{'type':'text','text':text_prompt},
                                       {'type':'image_url','image_url':{'url':local_image_to_data_url(image_prompt)}}]}]
    response = gpt4v_client.chat.completions.create(model=gpt4v_deployment,
                                                    messages=prompt,
                                                    max_tokens=1000)
    return response.choices[0].message.content

def gpt_response(text_prompt,system_prompt=""):
    prompt = [{'role':'system','content':system_prompt},
              {'role':'user','content':[{'type':'text','text':text_prompt}]}]
    response = gpt4_client.chat.completions.create(model=gpt4_deployment,
                                              messages=prompt,
                                              max_tokens=1000)
    return response.choices[0].message.content




================================================
FILE: llm_utils/nav_prompt.py
================================================
CHAINON_PROMPT = "You are a wheeled mobile robot working in an indoor environment.\
And you are required to finish a navigation task indicated by a human instruction in a new house.\
Your task is to make a navigation plan for finishing the task as soon as possible.\
The navigation plan should be formulated as a chain as {<Action_1> - <Landmark_1> - <Action_2> - <Landmark_2> ...}.\
To make the plan, I will provide you the following elements:\
(1) <Navigation Instruction>: The navigation instruction given by the human.\
(2) <Previous Plan>: The completed steps in the plan recording your history trajectory.\
(3) <Semantic Clue>: The list recording all the observed rooms and objects in this house from your perspective.\
The allowed <Action> in the plan contains ['Explore','Approach','Move Forward','Turn Left','Turn Right','Turn Around'].\
The action 'Explore' will lead you to the exploration frontiers to help unlock new areas.\
The action 'Approach' will lead you close to a specific object or room for more detailed observations.\
The allowed <Landmark> should be a semantic instance that has appeared in the <Navigation Instruction> or <Semantic Clue>.\
Do not output an imagined instance as <Landmark> that has not been observed in <Semantic Clue> or mentioned in the <Navigation Instruction>.\
To select the landmark, you should consider the common house layout, human habits of object placement and the navigation instruction.\
For example, the sofa is often close to a television; therefore, the sofa is a good landmark for finding the television and satisfying the human entertainment demand.\
If the action and landmark are clearly specified in the instruction, like 'walk forward to the television', then you can directly decompose the instruction into 'Move_Forward' - 'Television' without an 'Explore' action.\
You only need to plan one <Action> and one <Landmark> ahead; besides, you should output a flag to indicate whether you have finished the navigation task.\
Therefore, your output answer should be formatted as Answer={'Reason':<Your Reason>, 'Action':<Chosen Action>, 'Landmark':<Chosen Instance>, 'Flag':<Task Finished Flag>}.\
If you find a specific instance of the target object or a synonymous object, the output 'Flag' should be True.\
Try to select the <Landmark> that is closely related to the <Navigation Instruction> according to human habits.\
Try not to repeatedly select the same <Landmark> as in the <Previous Plan>."

GPT4V_PROMPT = f"You are an indoor navigation agent. I give you a panoramic observation image, complete navigation instruction and the sub-instruction you should execute now.  \
Direction 1 and 11 are ahead, Direction 5 and 7 are back, Direction 3 is to the right, and Direction 9 is to the left. Please carefully analyze visual information in each direction \
and judge which direction is most suitable for the next movement according to the action and landmark mentioned in the sub-instruction. \
Your answer should follow \"Thinking Process\" and \"Judgement\". In the \"Judgement: \" field, you should only write down the direction ID you choose. \
If you think you have arrived at the destination, you can answer \"Stop\" in the \"Judgement: \" field. Note that the \"Direction 5\" and \"Direction 7\" are the directions you just came from. \
Generally, the direction with more navigation landmarks in the complete navigation instruction is better."
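
# Hypothetical usage sketch (not part of the original repo file): shows how the
# DCoN prompt above pairs with gpt_response from llm_utils/gpt_request.py and the
# answer format it requests. The example instruction and semantic clue are illustrative.
if __name__ == "__main__":
    from llm_utils.gpt_request import gpt_response
    user_prompt = ("<Navigation Instruction>: Find a chair.\n"
                   "<Previous Plan>: None\n"
                   "<Semantic Clue>: ['sofa', 'table', 'doorway']")
    answer = gpt_response(user_prompt, system_prompt=CHAINON_PROMPT)
    # Expected format: Answer={'Reason':..., 'Action':..., 'Landmark':..., 'Flag':...}
    print(answer)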




================================================
FILE: mapper.py
================================================
from mapping_utils.geometry import *
from mapping_utils.preprocess import *
from mapping_utils.projection import *
from mapping_utils.transform import *
from mapping_utils.path_planning import *
from cv_utils.image_percevior import GLEE_Percevior
from matplotlib import colormaps
from habitat_sim.utils.common import d3_40_colors_rgb
from constants import *
import open3d as o3d
from lavis.models import load_model_and_preprocess
from PIL import Image
class Instruct_Mapper:
    def __init__(self,
                 camera_intrinsic,
                 pcd_resolution=0.05,
                 grid_resolution=0.1,
                 grid_size=5,
                 floor_height=-0.8,
                 ceiling_height=0.8,
                 translation_func=habitat_translation,
                 rotation_func=habitat_rotation,
                 rotate_axis=[0,1,0],
                 device='cuda:0'):
        self.device = device
        self.camera_intrinsic = camera_intrinsic
        self.pcd_resolution = pcd_resolution
        self.grid_resolution = grid_resolution
        self.grid_size = grid_size
        self.floor_height = floor_height
        self.ceiling_height = ceiling_height
        self.translation_func = translation_func
        self.rotation_func = rotation_func
        self.rotate_axis = np.array(rotate_axis)
        self.object_percevior = GLEE_Percevior(device=device)
        self.pcd_device = o3d.core.Device(device.upper())
    
    def reset(self,position,rotation):
        self.update_iterations = 0
        self.initial_position = self.translation_func(position)
        self.current_position = self.translation_func(position) - self.initial_position
        self.current_rotation = self.rotation_func(rotation)
        self.scene_pcd = o3d.t.geometry.PointCloud(self.pcd_device)
        self.navigable_pcd = o3d.t.geometry.PointCloud(self.pcd_device)
        self.object_pcd = o3d.t.geometry.PointCloud(self.pcd_device)
        self.object_entities = []
        self.trajectory_position = []
    
    def update(self,rgb,depth,position,rotation):
        self.current_position = self.translation_func(position) - self.initial_position
        self.current_rotation = self.rotation_func(rotation)
        self.current_depth = preprocess_depth(depth)
        self.current_rgb = preprocess_image(rgb)
        self.trajectory_position.append(self.current_position)
        # skip the geometric update if there is no valid depth value (can happen in the real world)
        if np.sum(self.current_depth) > 0:
            camera_points,camera_colors = get_pointcloud_from_depth(self.current_rgb,self.current_depth,self.camera_intrinsic)
            world_points = translate_to_world(camera_points,self.current_position,self.current_rotation)
            self.current_pcd = gpu_pointcloud_from_array(world_points,camera_colors,self.pcd_device).voxel_down_sample(self.pcd_resolution)
        else:
            return
        # semantic masking and project object mask to pointcloud
        classes,masks,confidences,visualization = self.object_percevior.perceive(self.current_rgb)
        self.segmentation = visualization[0]
        current_object_entities = self.get_object_entities(self.current_depth,classes,masks,confidences)
        self.object_entities = self.associate_object_entities(self.object_entities,current_object_entities)
        self.object_pcd = self.update_object_pcd()
        # pointcloud update
        self.scene_pcd = gpu_merge_pointcloud(self.current_pcd,self.scene_pcd).voxel_down_sample(self.pcd_resolution)
        self.scene_pcd = self.scene_pcd.select_by_index((self.scene_pcd.point.positions[:,2]>self.floor_height-0.2).nonzero()[0])
        self.useful_pcd = self.scene_pcd.select_by_index((self.scene_pcd.point.positions[:,2]<self.ceiling_height).nonzero()[0])
        
        # all the stairs will be regarded as navigable
        for entity in current_object_entities:
            if entity['class'] == 'stairs':
                self.navigable_pcd = gpu_merge_pointcloud(self.navigable_pcd,entity['pcd'])
        # geometry: estimate navigable floor points by interpolating from the standing position toward observed floor points
        current_navigable_point = self.current_pcd.select_by_index((self.current_pcd.point.positions[:,2]<self.floor_height).nonzero()[0])
        current_navigable_position = current_navigable_point.point.positions.cpu().numpy()
        standing_position = np.array([self.current_position[0],self.current_position[1],current_navigable_position[:,2].mean()])
        interpolate_points = np.linspace(np.ones_like(current_navigable_position)*standing_position,current_navigable_position,25).reshape(-1,3)
        interpolate_points = interpolate_points[(interpolate_points[:,2] > self.floor_height-0.2) & (interpolate_points[:,2] < self.floor_height+0.2)]
        interpolate_colors = np.ones_like(interpolate_points) * 100
        try:
            current_navigable_pcd = gpu_pointcloud_from_array(interpolate_points,interpolate_colors,self.pcd_device).voxel_down_sample(self.grid_resolution)
            self.navigable_pcd = gpu_merge_pointcloud(self.navigable_pcd,current_navigable_pcd).voxel_down_sample(self.pcd_resolution)
        except:
            self.navigable_pcd = self.useful_pcd.select_by_index((self.useful_pcd.point.positions[:,2]<self.floor_height).nonzero()[0])
       
        
            
        # filter the obstacle pointcloud
        self.obstacle_pcd = self.useful_pcd.select_by_index((self.useful_pcd.point.positions[:,2]>self.floor_height+0.1).nonzero()[0])
        self.trajectory_pcd = gpu_pointcloud_from_array(np.array(self.trajectory_position),np.zeros((len(self.trajectory_position),3)),self.pcd_device)
        self.frontier_pcd = project_frontier(self.obstacle_pcd,self.navigable_pcd,self.floor_height+0.2,self.grid_resolution)
        self.frontier_pcd[:,2] = self.navigable_pcd.point.positions.cpu().numpy()[:,2].mean()
        self.frontier_pcd = gpu_pointcloud_from_array(self.frontier_pcd,np.ones((self.frontier_pcd.shape[0],3))*np.array([[255,0,0]]),self.pcd_device)
        self.update_iterations += 1
    
    def update_object_pcd(self):
        object_pcd = o3d.geometry.PointCloud()
        for entity in self.object_entities:
            points = entity['pcd'].point.positions.cpu().numpy()
            colors = entity['pcd'].point.colors.cpu().numpy()
            new_pcd = o3d.geometry.PointCloud()
            new_pcd.points = o3d.utility.Vector3dVector(points)
            new_pcd.colors = o3d.utility.Vector3dVector(colors)
            object_pcd = object_pcd + new_pcd
        try:
            return gpu_pointcloud(object_pcd,self.pcd_device)
        except:
            return self.scene_pcd
    
    def get_view_pointcloud(self,rgb,depth,translation,rotation):
        current_position = self.translation_func(translation) - self.initial_position
        current_rotation = self.rotation_func(rotation)
        current_depth = preprocess_depth(depth)
        current_rgb = preprocess_image(rgb)
        camera_points,camera_colors = get_pointcloud_from_depth(current_rgb,current_depth,self.camera_intrinsic)
        world_points = translate_to_world(camera_points,current_position,current_rotation)
        current_pcd = gpu_pointcloud_from_array(world_points,camera_colors,self.pcd_device).voxel_down_sample(self.pcd_resolution)
        return current_pcd
    
    def get_object_entities(self,depth,classes,masks,confidences):
        entities = []
        exist_objects = np.unique([ent['class'] for ent in self.object_entities]).tolist()
        for cls,mask,score in zip(classes,masks,confidences):
            if depth[mask>0].min() < 1.0 and score < 0.5:
                continue
            if cls not in exist_objects:
                exist_objects.append(cls)
            camera_points = get_pointcloud_from_depth_mask(depth,mask,self.camera_intrinsic)
            world_points = translate_to_world(camera_points,self.current_position,self.current_rotation)
            point_colors = np.array([d3_40_colors_rgb[exist_objects.index(cls)%40]]*world_points.shape[0])
            if world_points.shape[0] < 10:
                continue
            object_pcd = gpu_pointcloud_from_array(world_points,point_colors,self.pcd_device).voxel_down_sample(self.pcd_resolution)
            object_pcd = gpu_cluster_filter(object_pcd)
            if object_pcd.point.positions.shape[0] < 10:
                continue
            entity = {'class':cls,'pcd':object_pcd,'confidence':score}
            entities.append(entity)
        return entities
    
    def associate_object_entities(self,ref_entities,eval_entities):        
        for entity in eval_entities:
            if len(ref_entities) == 0:
                ref_entities.append(entity)
                continue
            overlap_score = []
            eval_pcd = entity['pcd']
            for ref_entity in ref_entities:
                if eval_pcd.point.positions.shape[0] == 0:
                    break
                cdist = pointcloud_distance(eval_pcd,ref_entity['pcd'])
                overlap_condition = (cdist < 0.1)
                nonoverlap_condition = overlap_condition.logical_not()
                eval_pcd = eval_pcd.select_by_index(o3d.core.Tensor(nonoverlap_condition.cpu().numpy(),device=self.pcd_device).nonzero()[0])
                overlap_score.append((overlap_condition.sum()/(overlap_condition.shape[0]+1e-6)).cpu().numpy())
            max_overlap_score = np.max(overlap_score)
            arg_overlap_index = np.argmax(overlap_score)
            if max_overlap_score < 0.25:
                entity['pcd'] = eval_pcd
                ref_entities.append(entity)
            else:
                argmax_entity = ref_entities[arg_overlap_index]
                argmax_entity['pcd'] = gpu_merge_pointcloud(argmax_entity['pcd'],eval_pcd)
                if argmax_entity['pcd'].point.positions.shape[0] < entity['pcd'].point.positions.shape[0] or entity['class'] in INTEREST_OBJECTS:
                    argmax_entity['class'] = entity['class']
                ref_entities[arg_overlap_index] = argmax_entity
        return ref_entities
    
    def get_obstacle_affordance(self):
        try:
            distance = pointcloud_distance(self.navigable_pcd,self.obstacle_pcd)
            affordance = (distance - distance.min())/(distance.max() - distance.min() + 1e-6)
            affordance[distance < 0.25] = 0
            return affordance.cpu().numpy()
        except:
            return np.zeros((self.navigable_pcd.point.positions.shape[0],),dtype=np.float32)
    
    def get_trajectory_affordance(self):
        try:
            distance = pointcloud_distance(self.navigable_pcd,self.trajectory_pcd)
            affordance = (distance - distance.min()) / (distance.max() - distance.min() + 1e-6)
            return affordance.cpu().numpy()
        except:
            return np.zeros((self.navigable_pcd.point.positions.shape[0],),dtype=np.float32)
    
    def get_semantic_affordance(self,target_class,threshold=0.1):
        semantic_pointcloud = o3d.t.geometry.PointCloud()
        for entity in self.object_entities:
            if entity['class'] in target_class:
                semantic_pointcloud = gpu_merge_pointcloud(semantic_pointcloud,entity['pcd'])
        try:
            distance = pointcloud_2d_distance(self.navigable_pcd,semantic_pointcloud) 
            affordance = 1 - (distance - distance.min()) / (distance.max() - distance.min() + 1e-6)
            affordance[distance > threshold] = 0
            affordance = affordance.cpu().numpy()
            return affordance
        except:
            return np.zeros((self.navigable_pcd.point.positions.shape[0],),dtype=np.float32)
    
    def get_gpt4v_affordance(self,gpt4v_pcd):
        try:
            distance = pointcloud_distance(self.navigable_pcd,gpt4v_pcd)
            affordance = 1 - (distance - distance.min()) / (distance.max() - distance.min() + 1e-6)
            affordance[distance > 0.1] = 0
            return affordance.cpu().numpy()
        except:
            return np.zeros((self.navigable_pcd.point.positions.shape[0],),dtype=np.float32)
    
    def get_action_affordance(self,action):
        try:
            if action == 'Explore':
                distance = pointcloud_2d_distance(self.navigable_pcd,self.frontier_pcd)
                affordance = 1 - (distance - distance.min()) / (distance.max() - distance.min() + 1e-6)
                affordance[distance > 0.2] = 0
                return affordance.cpu().numpy()
            elif action == 'Move_Forward':
                pixel_x,pixel_z,depth_values = project_to_camera(self.navigable_pcd,self.camera_intrinsic,self.current_position,self.current_rotation)
                filter_condition = (pixel_x >= 0) & (pixel_x < self.camera_intrinsic[0][2]*2) & (pixel_z >= 0) & (pixel_z < self.camera_intrinsic[1][2]*2) & (depth_values > 1.5) & (depth_values < 2.5)
                filter_pcd = self.navigable_pcd.select_by_index(o3d.core.Tensor(np.where(filter_condition==1)[0],device=self.navigable_pcd.device))
                distance = pointcloud_distance(self.navigable_pcd,filter_pcd)
                affordance = 1 - (distance - distance.min()) / (distance.max() - distance.min() + 1e-6)
                affordance[distance > 0.1] = 0
                return affordance.cpu().numpy()
            elif action == 'Turn_Around':
                R = np.array([np.pi,np.pi,np.pi]) * self.rotate_axis
                turn_extrinsic = np.matmul(self.current_rotation,quaternion.as_rotation_matrix(quaternion.from_euler_angles(R)))
                pixel_x,pixel_z,depth_values = project_to_camera(self.navigable_pcd,self.camera_intrinsic,self.current_position,turn_extrinsic)
                filter_condition = (pixel_x >= 0) & (pixel_x < self.camera_intrinsic[0][2]*2) & (pixel_z >= 0) & (pixel_z < self.camera_intrinsic[1][2]*2) & (depth_values > 1.5) & (depth_values < 2.5)
                filter_pcd = self.navigable_pcd.select_by_index(o3d.core.Tensor(np.where(filter_condition==1)[0],device=self.navigable_pcd.device))
                distance = pointcloud_distance(self.navigable_pcd,filter_pcd)
                affordance = 1 - (distance - distance.min()) / (distance.max() - distance.min() + 1e-6)
                affordance[distance > 0.1] = 0
                return affordance.cpu().numpy()
            elif action == 'Turn_Left':
                R = np.array([np.pi/2,np.pi/2,np.pi/2]) * self.rotate_axis
                turn_extrinsic = np.matmul(self.current_rotation,quaternion.as_rotation_matrix(quaternion.from_euler_angles(R)))
                pixel_x,pixel_z,depth_values = project_to_camera(self.navigable_pcd,self.camera_intrinsic,self.current_position,turn_extrinsic)
                filter_condition = (pixel_x >= 0) & (pixel_x < self.camera_intrinsic[0][2]*2) & (pixel_z >= 0) & (pixel_z < self.camera_intrinsic[1][2]*2) & (depth_values > 1.5) & (depth_values < 2.5)
                filter_pcd = self.navigable_pcd.select_by_index(o3d.core.Tensor(np.where(filter_condition==1)[0],device=self.navigable_pcd.device))
                distance = pointcloud_distance(self.navigable_pcd,filter_pcd)
                affordance = 1 - (distance - distance.min()) / (distance.max() - distance.min() + 1e-6)
                affordance[distance > 0.1] = 0
                return affordance.cpu().numpy()
            elif action == 'Turn_Right':
                R = np.array([-np.pi/2,-np.pi/2,-np.pi/2]) * self.rotate_axis
                turn_extrinsic = np.matmul(self.current_rotation,quaternion.as_rotation_matrix(quaternion.from_euler_angles(R)))
                pixel_x,pixel_z,depth_values = project_to_camera(self.navigable_pcd,self.camera_intrinsic,self.current_position,turn_extrinsic)
                filter_condition = (pixel_x >= 0) & (pixel_x < self.camera_intrinsic[0][2]*2) & (pixel_z >= 0) & (pixel_z < self.camera_intrinsic[1][2]*2) & (depth_values > 1.5) & (depth_values < 2.5)
                filter_pcd = self.navigable_pcd.select_by_index(o3d.core.Tensor(np.where(filter_condition==1)[0],device=self.navigable_pcd.device))
                distance = pointcloud_distance(self.navigable_pcd,filter_pcd)
                affordance = 1 - (distance - distance.min()) / (distance.max() - distance.min() + 1e-6)
                affordance[distance > 0.1] = 0
                return affordance.cpu().numpy()
            elif action == 'Enter':
                return self.get_semantic_affordance(['doorway','door','entrance','exit'])
            elif action == 'Exit':
                return self.get_semantic_affordance(['doorway','door','entrance','exit'])
            else:
                return np.zeros((self.navigable_pcd.point.positions.shape[0],),dtype=np.float32) 
        except:
            return np.zeros((self.navigable_pcd.point.positions.shape[0],),dtype=np.float32) 

    def get_objnav_affordance_map(self,action,target_class,gpt4v_pcd,complete_flag=False,failure_mode=False):
        if failure_mode:
            obstacle_affordance = self.get_obstacle_affordance()
            affordance = self.get_action_affordance('Explore')
            affordance = np.clip(affordance,0.1,1.0)
            affordance[obstacle_affordance == 0] = 0
            return affordance,self.visualize_affordance(affordance)
        elif complete_flag:
            affordance = self.get_semantic_affordance([target_class],threshold=0.1)
            return affordance,self.visualize_affordance(affordance)
        else:
            obstacle_affordance = self.get_obstacle_affordance()
            semantic_affordance = self.get_semantic_affordance([target_class],threshold=1.5)
            action_affordance = self.get_action_affordance(action)
            gpt4v_affordance = self.get_gpt4v_affordance(gpt4v_pcd)
            history_affordance = self.get_trajectory_affordance()
            affordance = 0.25*semantic_affordance + 0.25*action_affordance + 0.25*gpt4v_affordance + 0.25*history_affordance
            affordance = np.clip(affordance,0.1,1.0)
            affordance[obstacle_affordance == 0] = 0
            return affordance,self.visualize_affordance(affordance/(affordance.max()+1e-6))

    def get_debug_affordance_map(self,action,target_class,gpt4v_pcd):
        obstacle_affordance = self.get_obstacle_affordance()
        semantic_affordance = self.get_semantic_affordance([target_class],threshold=1.5)
        action_affordance = self.get_action_affordance(action)
        gpt4v_affordance = self.get_gpt4v_affordance(gpt4v_pcd)
        history_affordance = self.get_trajectory_affordance()
        return self.visualize_affordance(semantic_affordance/(semantic_affordance.max()+1e-6)),\
               self.visualize_affordance(history_affordance/(history_affordance.max()+1e-6)),\
               self.visualize_affordance(action_affordance/(action_affordance.max()+1e-6)),\
               self.visualize_affordance(gpt4v_affordance/(gpt4v_affordance.max()+1e-6)),\
               self.visualize_affordance(obstacle_affordance/(obstacle_affordance.max()+1e-6))

    def visualize_affordance(self,affordance):
        cmap = colormaps.get('jet')
        color_affordance = cmap(affordance)[:,0:3]
        color_affordance = cpu_pointcloud_from_array(self.navigable_pcd.point.positions.cpu().numpy(),color_affordance)
        return color_affordance
    
    def get_appeared_objects(self):
        return [entity['class'] for entity in self.object_entities]

    def save_pointcloud_debug(self,path="./"):
        save_pcd = o3d.geometry.PointCloud()
        try:
            assert self.useful_pcd.point.positions.shape[0] > 0
            save_pcd.points = o3d.utility.Vector3dVector(self.useful_pcd.point.positions.cpu().numpy())
            save_pcd.colors = o3d.utility.Vector3dVector(self.useful_pcd.point.colors.cpu().numpy())
            o3d.io.write_point_cloud(path + "scene.ply",save_pcd)
        except:
            pass
        try:
            assert self.navigable_pcd.point.positions.shape[0] > 0
            save_pcd.points = o3d.utility.Vector3dVector(self.navigable_pcd.point.positions.cpu().numpy())
            save_pcd.colors = o3d.utility.Vector3dVector(self.navigable_pcd.point.colors.cpu().numpy())
            o3d.io.write_point_cloud(path + "navigable.ply",save_pcd)
        except:
            pass
        try:
            assert self.obstacle_pcd.point.positions.shape[0] > 0
            save_pcd.points = o3d.utility.Vector3dVector(self.obstacle_pcd.point.positions.cpu().numpy())
            save_pcd.colors = o3d.utility.Vector3dVector(self.obstacle_pcd.point.colors.cpu().numpy())
            o3d.io.write_point_cloud(path + "obstacle.ply",save_pcd)
        except:
            pass
        
        object_pcd = o3d.geometry.PointCloud()
        for entity in self.object_entities:
            points = entity['pcd'].point.positions.cpu().numpy()
            colors = entity['pcd'].point.colors.cpu().numpy()
            new_pcd = o3d.geometry.PointCloud()
            new_pcd.points = o3d.utility.Vector3dVector(points)
            new_pcd.colors = o3d.utility.Vector3dVector(colors)
            object_pcd = object_pcd + new_pcd
        if len(object_pcd.points) > 0:
            o3d.io.write_point_cloud(path + "object.ply",object_pcd)
    
   


================================================
FILE: mapping_utils/geometry.py
================================================
import numpy as np
import open3d as o3d
import quaternion
import time
import torch
import cv2

def get_pointcloud_from_depth(rgb:np.ndarray,depth:np.ndarray,intrinsic:np.ndarray):
    if len(depth.shape) == 3:
        depth = depth[:,:,0]
    filter_z,filter_x = np.where(depth>0)
    depth_values = depth[filter_z,filter_x]
    pixel_z = (depth.shape[0] - 1 - filter_z - intrinsic[1][2]) * depth_values / intrinsic[1][1]
    pixel_x = (filter_x - intrinsic[0][2])*depth_values / intrinsic[0][0]
    pixel_y = depth_values
    color_values = rgb[filter_z,filter_x]
    point_values = np.stack([pixel_x,pixel_z,-pixel_y],axis=-1)
    return point_values,color_values

def get_pointcloud_from_depth_mask(depth:np.ndarray,mask:np.ndarray,intrinsic:np.ndarray):
    if len(depth.shape) == 3:
        depth = depth[:,:,0]
    if len(mask.shape) == 3:
        mask = mask[:,:,0]
    filter_z,filter_x = np.where((depth>0) & (mask>0))
    depth_values = depth[filter_z,filter_x]
    pixel_z = (depth.shape[0] - 1 - filter_z - intrinsic[1][2]) * depth_values / intrinsic[1][1]
    pixel_x = (filter_x - intrinsic[0][2])*depth_values / intrinsic[0][0]
    pixel_y = depth_values
    point_values = np.stack([pixel_x,pixel_z,-pixel_y],axis=-1)
    return point_values

def translate_to_world(pointcloud,position,rotation):
    extrinsic = np.eye(4)
    extrinsic[0:3,0:3] = rotation 
    extrinsic[0:3,3] = position
    world_points = np.matmul(extrinsic,np.concatenate((pointcloud,np.ones((pointcloud.shape[0],1))),axis=-1).T).T
    return world_points[:,0:3]

def project_to_camera(pcd,intrinsic,position,rotation):
    extrinsic = np.eye(4)
    extrinsic[0:3,0:3] = rotation
    extrinsic[0:3,3] = position
    extrinsic = np.linalg.inv(extrinsic)
    try:
        # tensor-based (o3d.t) point cloud
        camera_points = np.concatenate((pcd.point.positions.cpu().numpy(),np.ones((pcd.point.positions.shape[0],1))),axis=-1)
    except AttributeError:
        # legacy o3d.geometry.PointCloud
        camera_points = np.concatenate((pcd.points,np.ones((np.array(pcd.points).shape[0],1))),axis=-1)
    camera_points = np.matmul(extrinsic,camera_points.T).T[:,0:3]
    depth_values = -camera_points[:,2]
    filter_x = (camera_points[:,0] * intrinsic[0][0] / depth_values + intrinsic[0][2]).astype(np.int32)
    filter_z = (-camera_points[:,1] * intrinsic[1][1] / depth_values - intrinsic[1][2] + intrinsic[1][2]*2 - 1).astype(np.int32)
    return filter_x,filter_z,depth_values
    
def pointcloud_distance(pcdA,pcdB,device='cpu'):
    try:
        # tensor-based (o3d.t) point clouds
        pointsA = torch.tensor(pcdA.point.positions.cpu().numpy(),device=device)
        pointsB = torch.tensor(pcdB.point.positions.cpu().numpy(),device=device)
    except AttributeError:
        # legacy o3d.geometry.PointCloud inputs
        pointsA = torch.tensor(np.array(pcdA.points),device=device)
        pointsB = torch.tensor(np.array(pcdB.points),device=device)
    cdist = torch.cdist(pointsA,pointsB)
    min_distances1, _ = cdist.min(dim=1)
    return min_distances1

def pointcloud_2d_distance(pcdA,pcdB,device='cpu'):
    pointsA = torch.tensor(pcdA.point.positions.cpu().numpy(),device=device)
    pointsA[:,2] = 0
    pointsB = torch.tensor(pcdB.point.positions.cpu().numpy(),device=device)
    pointsB[:,2] = 0
    cdist = torch.cdist(pointsA,pointsB)
    min_distances1, _ = cdist.min(dim=1)
    return min_distances1

def cpu_pointcloud_from_array(points,colors):
    pointcloud = o3d.geometry.PointCloud()
    pointcloud.points = o3d.utility.Vector3dVector(points)
    pointcloud.colors = o3d.utility.Vector3dVector(colors)
    return pointcloud

def gpu_pointcloud_from_array(points,colors,device):
    pointcloud = o3d.t.geometry.PointCloud(device)
    pointcloud.point.positions = o3d.core.Tensor(points,dtype=o3d.core.Dtype.Float32,device=device)
    pointcloud.point.colors = o3d.core.Tensor(colors.astype(np.float32)/255.0,dtype=o3d.core.Dtype.Float32,device=device)
    return pointcloud

def gpu_pointcloud(pointcloud,device):
    new_pointcloud = o3d.t.geometry.PointCloud(device)
    new_pointcloud.point.positions = o3d.core.Tensor(np.asarray(pointcloud.points),device=device)
    new_pointcloud.point.colors = o3d.core.Tensor(np.asarray(pointcloud.colors),device=device)
    return new_pointcloud
    
def cpu_pointcloud(pointcloud):
    new_pointcloud = o3d.geometry.PointCloud()
    new_pointcloud.points = o3d.utility.Vector3dVector(pointcloud.point.positions.cpu().numpy())
    new_pointcloud.colors = o3d.utility.Vector3dVector(pointcloud.point.colors.cpu().numpy())
    return new_pointcloud

def cpu_merge_pointcloud(pcdA,pcdB):
    return pcdA + pcdB

def gpu_merge_pointcloud(pcdA,pcdB):
    if pcdA.is_empty():
        return pcdB
    if pcdB.is_empty():
        return pcdA
    return pcdA + pcdB

def gpu_cluster_filter(pointcloud,eps=0.3,min_points=20):
    labels = pointcloud.cluster_dbscan(eps=eps, min_points=min_points, print_progress=False)
    numpy_labels = labels.cpu().numpy()
    unique_labels = np.unique(numpy_labels)
    largest_cluster_label = max(unique_labels, key=lambda x: np.sum(numpy_labels == x))
    largest_cluster_pc = pointcloud.select_by_index((labels == largest_cluster_label).nonzero()[0])
    return largest_cluster_pc

def cpu_cluster_filter(pointcloud,eps=0.3,min_points=20):
    labels = pointcloud.cluster_dbscan(eps=eps, min_points=min_points, print_progress=False)
    unique_labels = np.unique(labels)
    largest_cluster_label = max(unique_labels, key=lambda x: np.sum(labels == x))
    largest_cluster_pc = pointcloud.select_by_index((labels == largest_cluster_label).nonzero()[0])
    return largest_cluster_pc

def quat2array(quat):
    return np.array([quat.w,quat.x,quat.y,quat.z],np.float32)

def quaternion_distance(quatA,quatB):
    # M*4, N*4
    dot = np.dot(quatA,quatB.T)
    dot[dot<0] = -dot[dot<0]
    angle = 2*np.arccos(dot)
    return angle/np.pi*180

def eculidean_distance(posA,posB):
    posA_reshaped = posA[:, np.newaxis, :]
    posB_reshaped = posB[np.newaxis, :, :]
    pairwise_distance = np.sqrt(np.sum((posA_reshaped - posB_reshaped)**2, axis=2))
    return pairwise_distance
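
# A minimal usage sketch (illustrative only, not part of the mapping pipeline):
# back-project a synthetic 4x4 depth frame with a hypothetical pinhole intrinsic,
# then move the points into the world frame with an identity pose.
if __name__ == "__main__":
    h, w = 4, 4
    rgb = np.zeros((h, w, 3), dtype=np.uint8)
    depth = np.full((h, w), 2.0, dtype=np.float32)
    intrinsic = np.array([[2.0, 0.0, (w - 1) / 2.0],
                          [0.0, 2.0, (h - 1) / 2.0],
                          [0.0, 0.0, 1.0]], dtype=np.float32)
    points, colors = get_pointcloud_from_depth(rgb, depth, intrinsic)
    world_points = translate_to_world(points, np.zeros(3), np.eye(3))
    print(points.shape, world_points.shape)  # (16, 3) (16, 3)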
    



================================================
FILE: mapping_utils/path_planning.py
================================================
import numpy as np
import cv2
from pathfinding.core.diagonal_movement import DiagonalMovement
from pathfinding.core.grid import Grid
from pathfinding.finder.a_star import AStarFinder
from .projection import *

def path_planning(costmap,start_index,goal_index):
    planmap = costmap.copy()
    planmap[planmap == 1] = 10
    grid = Grid(matrix=(planmap*100).astype(np.int32))
    finder = AStarFinder(diagonal_movement=DiagonalMovement.always)
    start_index[0][1] = np.clip(start_index[0][1],0,costmap.shape[1]-1)
    start_index[0][0] = np.clip(start_index[0][0],0,costmap.shape[0]-1)
    goal_index[0][1] = np.clip(goal_index[0][1],0,costmap.shape[1]-1)
    goal_index[0][0] = np.clip(goal_index[0][0],0,costmap.shape[0]-1)
    start = grid.node(start_index[0][1],start_index[0][0])
    goal = grid.node(goal_index[0][1],goal_index[0][0])
    path,_ = finder.find_path(start,goal,grid)
    return path

def visualize_path(costmap,path):
    visualize_costmap = costmap.copy()
    for waypoint in path:
        x = waypoint.y
        y = waypoint.x
        visualize_costmap[x,y] = 10
    visualize_costmap = cv2.resize(visualize_costmap,(0,0),fx=10,fy=10,interpolation=cv2.INTER_NEAREST)
    visualize_costmap = cv2.applyColorMap((255*visualize_costmap/10).astype(np.uint8),cv2.COLORMAP_JET)
    return visualize_costmap    
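
# A minimal usage sketch (illustrative only, not part of the benchmark): plan
# across a toy 10x10 costmap. Indices are given as [[row, col]]; cells whose
# scaled weight becomes 0 inside path_planning act as obstacles for the A* grid.
if __name__ == "__main__":
    toy_costmap = np.full((10, 10), 0.5, dtype=np.float32)
    toy_costmap[5, 0:9] = 0.0            # a wall with a single gap at column 9
    start = np.array([[0, 0]])
    goal = np.array([[9, 0]])
    toy_path = path_planning(toy_costmap, start, goal)
    print(len(toy_path), "waypoints")
    cv2.imwrite("toy_costmap_path.jpg", visualize_path(toy_costmap, toy_path))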

    
    
    
    
    
    
    
    

================================================
FILE: mapping_utils/preprocess.py
================================================
import numpy as np

def preprocess_depth(depth:np.ndarray,lower_bound:float=0.1,upper_bound:float=4.9):
    depth[np.where((depth<lower_bound)|(depth>upper_bound))] = 0
    return depth

def preprocess_image(image:np.ndarray):
    return image
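
# A quick illustrative check with hypothetical readings: values outside the
# [lower_bound, upper_bound] range (in meters) are zeroed, the rest pass through.
if __name__ == "__main__":
    sample = np.array([0.05, 0.5, 3.0, 5.2], dtype=np.float32)
    print(preprocess_depth(sample))  # the 0.05 and 5.2 readings become 0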

================================================
FILE: mapping_utils/projection.py
================================================
import numpy as np
import open3d as o3d
import cv2
# obstacle = 0
# unknown = 1
# position = 2
# navigable = 3
# frontier = 4

def project_frontier(obstacle_pcd,navigable_pcd,obstacle_height=-0.7,grid_resolution=0.25):
    np_obstacle_points = obstacle_pcd.point.positions.cpu().numpy()
    np_navigable_points = navigable_pcd.point.positions.cpu().numpy()
    np_all_points = np.concatenate((np_obstacle_points,np_navigable_points),axis=0)
    max_bound = np.max(np_all_points,axis=0)
    min_bound = np.min(np_all_points,axis=0)
    grid_dimensions = np.ceil((max_bound - min_bound) / grid_resolution).astype(int)
    grid_map = np.ones((grid_dimensions[0],grid_dimensions[1]),dtype=np.int32)
    # get navigable occupancy
    navigable_points = np_navigable_points
    navigable_indices = np.floor((navigable_points - min_bound) / grid_resolution).astype(int)
    navigable_indices[:,0] = np.clip(navigable_indices[:,0],0,grid_dimensions[0]-1)
    navigable_indices[:,1] = np.clip(navigable_indices[:,1],0,grid_dimensions[1]-1)
    navigable_indices[:,2] = np.clip(navigable_indices[:,2],0,grid_dimensions[2]-1)
    navigable_voxels = np.zeros(grid_dimensions,dtype=np.int32)
    navigable_voxels[navigable_indices[:,0],navigable_indices[:,1],navigable_indices[:,2]] = 1
    navigable_map = (navigable_voxels.sum(axis=2) > 0)
    grid_map[np.where(navigable_map>0)] = 3
    # get obstacle occupancy
    obstacle_points = np_obstacle_points
    obstacle_indices = np.floor((obstacle_points - min_bound) / grid_resolution).astype(int)
    obstacle_indices[:,0] = np.clip(obstacle_indices[:,0],0,grid_dimensions[0]-1)
    obstacle_indices[:,1] = np.clip(obstacle_indices[:,1],0,grid_dimensions[1]-1)
    obstacle_indices[:,2] = np.clip(obstacle_indices[:,2],0,grid_dimensions[2]-1)
    obstacle_voxels = np.zeros(grid_dimensions,dtype=np.int32)
    obstacle_voxels[obstacle_indices[:,0],obstacle_indices[:,1],obstacle_indices[:,2]] = 1
    obstacle_map = (obstacle_voxels.sum(axis=2) > 0)
    grid_map[np.where(obstacle_map>0)] = 0
    # get outer-border of navigable areas
    outer_border_navigable = ((grid_map == 3)*255).astype(np.uint8)
    contours,hierarchy = cv2.findContours(outer_border_navigable,cv2.RETR_EXTERNAL,cv2.CHAIN_APPROX_SIMPLE)
    outer_border_navigable = cv2.drawContours(np.zeros((grid_map.shape[0],grid_map.shape[1])),contours,-1,(255,255,255),1).astype(np.float32)
    obstacles = ((grid_map == 0)*255).astype(np.float32)
    obstacles = cv2.dilate(obstacles.astype(np.uint8),np.ones((3,3)))
    outer_border_navigable = ((outer_border_navigable - obstacles) > 0)
    grid_map_x,grid_map_y = np.where(outer_border_navigable>0)
    grid_indexes = np.stack((grid_map_x,grid_map_y,obstacle_height*np.ones((grid_map_x.shape[0],))),axis=1)
    frontier_points = grid_indexes * grid_resolution + min_bound
    return frontier_points
    
def translate_grid_to_point(pointcloud,grid_indexes,grid_resolution=0.25):
    np_all_points = pointcloud.point.positions.cpu().numpy()
    min_bound = np.min(np_all_points,axis=0)
    translate_points = grid_indexes * grid_resolution + min_bound
    return translate_points

def translate_point_to_grid(pointcloud,point_poses,grid_resolution=0.25):
    if len(point_poses.shape) == 1:
        point_poses = point_poses[np.newaxis,:]
    np_all_points = pointcloud.point.positions.cpu().numpy()
    min_bound = np.min(np_all_points,axis=0)
    grid_index = np.floor((point_poses - min_bound) / grid_resolution).astype(int)
    return grid_index[:,0:2]

def project_costmap(navigable_pcd,affordance_value,grid_resolution=0.25):
    navigable_points = navigable_pcd.point.positions.cpu().numpy()
    max_bound = np.max(navigable_points,axis=0)
    min_bound = np.min(navigable_points,axis=0)
    grid_dimensions = np.ceil((max_bound - min_bound) / grid_resolution).astype(int)
    navigable_voxels = np.zeros(grid_dimensions,dtype=np.float32)
    navigable_indices = np.floor((navigable_points - min_bound) / grid_resolution).astype(int)
    navigable_indices[:,0] = np.clip(navigable_indices[:,0],0,grid_dimensions[0]-1)
    navigable_indices[:,1] = np.clip(navigable_indices[:,1],0,grid_dimensions[1]-1)
    navigable_indices[:,2] = np.clip(navigable_indices[:,2],0,grid_dimensions[2]-1)
    navigable_voxels[navigable_indices[:,0],navigable_indices[:,1],navigable_indices[:,2]] = affordance_value
    navigable_costmap = navigable_voxels.max(axis=2)
    navigable_costmap = 1 - navigable_costmap
    color_navigable_costmap = cv2.applyColorMap((navigable_costmap*255).astype(np.uint8),cv2.COLORMAP_JET)
    color_navigable_costmap = cv2.resize(color_navigable_costmap,(0,0),fx=5,fy=5,interpolation=cv2.INTER_NEAREST)
    return navigable_costmap,color_navigable_costmap
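
# A minimal round-trip sketch (illustrative only): build a small tensor point
# cloud, map a metric position to its grid cell, and map the cell back to a
# metric point. The recovered point only matches up to grid_resolution.
if __name__ == "__main__":
    pts = (np.random.rand(100, 3) * 5.0).astype(np.float32)
    pcd = o3d.t.geometry.PointCloud()
    pcd.point.positions = o3d.core.Tensor(pts)
    cell = translate_point_to_grid(pcd, pts[0], grid_resolution=0.25)
    approx = translate_grid_to_point(pcd, np.array([[cell[0, 0], cell[0, 1], 0]]), grid_resolution=0.25)
    print(pts[0], cell, approx)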



================================================
FILE: mapping_utils/transform.py
================================================
import numpy as np
import quaternion
def habitat_camera_intrinsic(config):
    assert config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.width == config.habitat.simulator.agents.main_agent.sim_sensors.rgb_sensor.width, 'The depth camera width must match the RGB camera width.'
    assert config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.height == config.habitat.simulator.agents.main_agent.sim_sensors.rgb_sensor.height, 'The depth camera height must match the RGB camera height.'
    assert config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.hfov == config.habitat.simulator.agents.main_agent.sim_sensors.rgb_sensor.hfov, 'The depth camera hfov must match the RGB camera hfov.'
    width = config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.width
    height = config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.height
    hfov = config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.hfov
    xc = (width - 1.) / 2.
    zc = (height - 1.) / 2.
    f = (width / 2.) / np.tan(np.deg2rad(hfov / 2.))
    intrinsic_matrix = np.array([[f,0,xc],
                                 [0,f,zc],
                                 [0,0,1]],np.float32)
    return intrinsic_matrix

def habitat_translation(position):
    return np.array([position[0],position[2],position[1]])

def habitat_rotation(rotation):
    rotation_matrix = quaternion.as_rotation_matrix(rotation)
    transform_matrix = np.array([[1,0,0],
                                 [0,0,1],
                                 [0,1,0]])
    rotation_matrix = np.matmul(transform_matrix,rotation_matrix)
    return rotation_matrix
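
# Worked example with hypothetical sensor parameters (640x480, 90-degree hfov),
# using the same pinhole relation as habitat_camera_intrinsic:
# f = (width / 2) / tan(hfov / 2), principal point at the image centre.
if __name__ == "__main__":
    width, height, hfov = 640, 480, 90.0
    f = (width / 2.0) / np.tan(np.deg2rad(hfov / 2.0))   # = 320.0 for a 90-degree hfov
    K = np.array([[f, 0, (width - 1) / 2.0],
                  [0, f, (height - 1) / 2.0],
                  [0, 0, 1]], np.float32)
    print(K)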


================================================
FILE: objnav_agent.py
================================================
import habitat
import numpy as np
import cv2
import ast
import open3d as o3d
from mapping_utils.geometry import *
from mapping_utils.projection import *
from mapping_utils.path_planning import *
from habitat.tasks.nav.shortest_path_follower import ShortestPathFollower
from mapper import Instruct_Mapper
from habitat.utils.visualizations.maps import colorize_draw_agent_and_fit_to_height
from llm_utils.nav_prompt import CHAINON_PROMPT,GPT4V_PROMPT
from llm_utils.gpt_request import gpt_response,gptv_response

class HM3D_Objnav_Agent:
    def __init__(self,env:habitat.Env,mapper:Instruct_Mapper):
        self.env = env
        self.mapper = mapper
        self.episode_samples = 0
        self.planner = ShortestPathFollower(env.sim,0.5,False)

    def translate_objnav(self,object_goal):
        if object_goal.lower() == 'plant':
            return "Find the <%s>."%"potted_plant"
        elif object_goal.lower() == "tv_monitor":
            return "Find the <%s>."%"television_set"
        else:
            return "Find the <%s>."%object_goal
    
    def reset_debug_probes(self):
        self.rgb_trajectory = []
        self.depth_trajectory = []
        self.topdown_trajectory = []
        self.segmentation_trajectory = []

        self.gpt_trajectory = []
        self.gptv_trajectory = []
        self.panoramic_trajectory = []
        
        self.obstacle_affordance_trajectory = []
        self.semantic_affordance_trajectory = []
        self.history_affordance_trajectory = []
        self.action_affordance_trajectory = []
        self.gpt4v_affordance_trajectory = []
        self.affordance_trajectory = []

    def reset(self):
        self.episode_samples += 1
        self.episode_steps = 0
        self.obs = self.env.reset()
        self.mapper.reset(self.env.sim.get_agent_state().sensor_states['rgb'].position,self.env.sim.get_agent_state().sensor_states['rgb'].rotation)
        self.instruct_goal = self.translate_objnav(self.env.current_episode.object_category)
        self.trajectory_summary = ""
        self.reset_debug_probes()     
       
    def rotate_panoramic(self,rotate_times = 12):
        self.temporary_pcd = []
        self.temporary_images = []
        for i in range(rotate_times):
            if self.env.episode_over:
                break
            self.update_trajectory()
            self.temporary_pcd.append(self.mapper.current_pcd)
            self.temporary_images.append(self.rgb_trajectory[-1])
            self.obs = self.env.step(3)
            
    def concat_panoramic(self,images):
        try:
            height,width = images[0].shape[0],images[0].shape[1]
        except:
            # fall back to the default sensor resolution if no frames were captured
            height,width = 480,640
        background_image = np.zeros((2*height + 3*10, 3*width + 4*10, 3),np.uint8)
        copy_images = np.array(images,dtype=np.uint8)
        for i in range(len(copy_images)):
            if i % 2 != 0:
                row = (i//6)
                col = ((i%6)//2)
                copy_images[i] = cv2.putText(copy_images[i],"Direction %d"%i,(100,100),cv2.FONT_HERSHEY_SIMPLEX, 2, (255, 0, 0), 6, cv2.LINE_AA)
                background_image[10*(row+1)+row*height:10*(row+1)+row*height+height:,col*width + col * 10:col*width+col*10+width,:] = copy_images[i]
                
        return background_image
    
    def update_trajectory(self):
        self.episode_steps += 1
        self.metrics = self.env.get_metrics()
        self.rgb_trajectory.append(cv2.cvtColor(self.obs['rgb'],cv2.COLOR_BGR2RGB))
        self.depth_trajectory.append((self.obs['depth']/5.0 * 255.0).astype(np.uint8))
        
        topdown_image = cv2.cvtColor(colorize_draw_agent_and_fit_to_height(self.metrics['top_down_map'],1024),cv2.COLOR_BGR2RGB)
        topdown_image = cv2.putText(topdown_image,'Success:%.2f,SPL:%.2f,SoftSPL:%.2f,DTS:%.2f'%(self.metrics['success'],self.metrics['spl'],self.metrics['soft_spl'],self.metrics['distance_to_goal']),(0,100),cv2.FONT_HERSHEY_SIMPLEX,2,(0,0,0),2,cv2.LINE_AA)
        self.topdown_trajectory.append(topdown_image)
        
        self.position = self.env.sim.get_agent_state().sensor_states['rgb'].position
        self.rotation = self.env.sim.get_agent_state().sensor_states['rgb'].rotation

        self.mapper.update(self.rgb_trajectory[-1],self.obs['depth'],self.position,self.rotation)
        self.segmentation_trajectory.append(self.mapper.segmentation)
        self.observed_objects = self.mapper.get_appeared_objects()

        cv2.imwrite("monitor-rgb.jpg",self.rgb_trajectory[-1])
        cv2.imwrite("monitor-depth.jpg",self.depth_trajectory[-1])
        cv2.imwrite("monitor-segmentation.jpg",self.segmentation_trajectory[-1])
            
    def save_trajectory(self,dir="./tmp_objnav/"):
        import imageio
        import os
        os.makedirs(dir, exist_ok=True)

        self.mapper.save_pointcloud_debug(dir) 
        fps_writer = imageio.get_writer(dir+"fps.mp4", fps=4)
        dps_writer = imageio.get_writer(dir+"depth.mp4", fps=4)
        seg_writer = imageio.get_writer(dir+"segmentation.mp4", fps=4)
        metric_writer = imageio.get_writer(dir+"metrics.mp4",fps=4)
        for i,img,dep,seg,met in zip(np.arange(len(self.rgb_trajectory)),self.rgb_trajectory,self.depth_trajectory,self.segmentation_trajectory,self.topdown_trajectory):
            fps_writer.append_data(cv2.cvtColor(img,cv2.COLOR_BGR2RGB))
            dps_writer.append_data(dep)
            seg_writer.append_data(cv2.cvtColor(seg,cv2.COLOR_BGR2RGB))
            metric_writer.append_data(cv2.cvtColor(met,cv2.COLOR_BGR2RGB))

        for index,pano_img in enumerate(self.panoramic_trajectory):
            cv2.imwrite(dir+"%d-pano.jpg"%index,pano_img)
        with open(dir+"gpt4_history.txt",'w') as file:
            file.write("".join(self.gpt_trajectory))
        with open(dir+"gpt4v_history.txt",'w') as file:
            file.write("".join(self.gptv_trajectory))

        for i,afford,safford,hafford,cafford,gafford,oafford in zip(np.arange(len(self.affordance_trajectory)),self.affordance_trajectory,self.semantic_affordance_trajectory,self.history_affordance_trajectory,self.action_affordance_trajectory,self.gpt4v_affordance_trajectory,self.obstacle_affordance_trajectory):
            o3d.io.write_point_cloud(dir+"afford-%d-plan.ply"%i,afford)
            o3d.io.write_point_cloud(dir+"semantic-afford-%d-plan.ply"%i,safford)
            o3d.io.write_point_cloud(dir+"history-afford-%d-plan.ply"%i,hafford)
            o3d.io.write_point_cloud(dir+"action-afford-%d-plan.ply"%i,cafford)
            o3d.io.write_point_cloud(dir+"gpt4v-afford-%d-plan.ply"%i,gafford)
            o3d.io.write_point_cloud(dir+"obstacle-afford-%d-plan.ply"%i,oafford)

        fps_writer.close()
        dps_writer.close()
        seg_writer.close()
        metric_writer.close()
    
    def query_chainon(self):
        semantic_clue = {'observed object':self.observed_objects}
        query_content = "<Navigation Instruction>:{}, <Previous Plan>:{}, <Semantic Clue>:{}".format(self.instruct_goal,"{" + self.trajectory_summary + "}",semantic_clue)
        self.gpt_trajectory.append("Input:\n%s \n"%query_content)
        for i in range(10):
            try:
                raw_answer = gpt_response(query_content,CHAINON_PROMPT)
                print("GPT-4 Output Response: %s"%raw_answer)
                answer = raw_answer.replace(" ","")
                answer = answer[answer.index("{"):answer.index("}")+1]
                answer = ast.literal_eval(answer)
                if 'Action' in answer.keys() and 'Landmark' in answer.keys() and 'Flag' in answer.keys():
                    break
            except:
                # retry on API errors or malformed responses
                continue
        self.gpt_trajectory.append("\nGPT-4 Answer:\n%s"%raw_answer)
        if self.trajectory_summary == "":
            self.trajectory_summary = self.trajectory_summary + str(answer['Action']) + '-' + str(answer['Landmark'])
        else:
            self.trajectory_summary = self.trajectory_summary + '-' + str(answer['Action']) + '-' + str(answer['Landmark'])
        return answer
    
    def query_gpt4v(self):
        images = self.temporary_images
        inference_image = self.concat_panoramic(images)
        cv2.imwrite("monitor-panoramic.jpg",inference_image)
        text_content = "<Navigation Instruction>:{}\n <Sub Instruction>:{}".format(self.instruct_goal,self.trajectory_summary.split("-")[-2] + "-" + self.trajectory_summary.split("-")[-1])
        self.gptv_trajectory.append("\nInput:\n%s \n"%text_content)
        for i in range(10):
            try:
                raw_answer = gptv_response(text_content,inference_image,GPT4V_PROMPT)
                print("GPT-4V Output Response: %s"%raw_answer)
                answer = raw_answer[raw_answer.index("Judgement: Direction"):]
                answer = answer.replace(" ","")
                answer = int(answer.split("Direction")[-1])
                break
            except:
                # retry on API errors or malformed responses
                continue
        self.gptv_trajectory.append("GPT-4V Answer:\n%s"%raw_answer)
        self.panoramic_trajectory.append(inference_image)
        try:
            return answer
        except:
            return np.random.randint(0,12)
    
    def make_plan(self,rotate=True,failed=False):
        if rotate == True:
            self.rotate_panoramic()
        self.chainon_answer = self.query_chainon()
        self.gpt4v_answer = self.query_gpt4v()
        self.gpt4v_pcd = o3d.t.geometry.PointCloud(self.mapper.pcd_device)
        self.gpt4v_pcd = gpu_merge_pointcloud(self.gpt4v_pcd,self.temporary_pcd[self.gpt4v_answer])
        self.found_goal = bool(self.chainon_answer['Flag'])
        self.affordance_pcd,self.colored_affordance_pcd = self.mapper.get_objnav_affordance_map(self.chainon_answer['Action'],self.chainon_answer['Landmark'],self.gpt4v_pcd,self.chainon_answer['Flag'],failure_mode=failed)
        self.semantic_afford,self.history_afford,self.action_afford,self.gpt4v_afford,self.obs_afford = self.mapper.get_debug_affordance_map(self.chainon_answer['Action'],self.chainon_answer['Landmark'],self.gpt4v_pcd)
        if self.affordance_pcd.max() == 0:
            self.affordance_pcd,self.colored_affordance_pcd = self.mapper.get_objnav_affordance_map(self.chainon_answer['Action'],self.chainon_answer['Landmark'],self.gpt4v_pcd,False,failure_mode=failed)
            self.found_goal = False
            
        self.affordance_map,self.colored_affordance_map = project_costmap(self.mapper.navigable_pcd,self.affordance_pcd,self.mapper.grid_resolution)
        self.target_point = self.mapper.navigable_pcd.point.positions[self.affordance_pcd.argmax()].cpu().numpy()
        self.plan_position = self.mapper.current_position.copy()
        target_index = translate_point_to_grid(self.mapper.navigable_pcd,self.target_point,self.mapper.grid_resolution)
        start_index = translate_point_to_grid(self.mapper.navigable_pcd,self.mapper.current_position,self.mapper.grid_resolution)
        self.path = path_planning(self.affordance_map,start_index,target_index)
        self.path = [translate_grid_to_point(self.mapper.navigable_pcd,np.array([[waypoint.y,waypoint.x,0]]),self.mapper.grid_resolution)[0] for waypoint in self.path]
        if len(self.path) == 0:
            self.waypoint = self.mapper.navigable_pcd.point.positions.cpu().numpy()[np.argmax(self.affordance_pcd)]
            self.waypoint[2] = self.mapper.current_position[2]
        elif len(self.path) < 5: 
            self.waypoint = self.path[-1]
            self.waypoint[2] = self.mapper.current_position[2]
        else:
            self.waypoint = self.path[4]
            self.waypoint[2] = self.mapper.current_position[2]

        self.affordance_trajectory.append(self.colored_affordance_pcd)
        self.obstacle_affordance_trajectory.append(self.obs_afford)
        self.semantic_affordance_trajectory.append(self.semantic_afford)
        self.history_affordance_trajectory.append(self.history_afford)
        self.action_affordance_trajectory.append(self.action_afford)
        self.gpt4v_affordance_trajectory.append(self.gpt4v_afford)
    
    def step(self):
        to_target_distance = np.sqrt(np.sum(np.square(self.mapper.current_position - self.waypoint)))
        if to_target_distance < 0.6 and len(self.path) > 0:
            self.path = self.path[min(5,len(self.path)-1):]
            if len(self.path) < 3:
                self.waypoint = self.path[-1]
                self.waypoint[2] = self.mapper.current_position[2]
            else:
                self.waypoint = self.path[2]
                self.waypoint[2] = self.mapper.current_position[2]

        pid_waypoint = self.waypoint + self.mapper.initial_position
        pid_waypoint = np.array([pid_waypoint[0],self.env.sim.get_agent_state().position[1],pid_waypoint[1]])
        act = self.planner.get_next_action(pid_waypoint)
        move_distance =  np.sqrt(np.sum(np.square(self.mapper.current_position - self.plan_position)))
        if (act == 0 or move_distance > 3.0) and not self.found_goal:
            self.make_plan(rotate=True)
            pid_waypoint = self.waypoint + self.mapper.initial_position
            pid_waypoint = np.array([pid_waypoint[0],self.env.sim.get_agent_state().position[1],pid_waypoint[1]])
            act = self.planner.get_next_action(pid_waypoint)
        if act == 0 and not self.found_goal:
            self.make_plan(False,True)
            pid_waypoint = self.waypoint + self.mapper.initial_position
            pid_waypoint = np.array([pid_waypoint[0],self.env.sim.get_agent_state().position[1],pid_waypoint[1]])
            act = self.planner.get_next_action(pid_waypoint)
            print("Warning: Failure locomotion and action = %d"%act)
        if not self.env.episode_over:
            self.obs = self.env.step(act)
            self.update_trajectory()
       


================================================
FILE: objnav_benchmark.py
================================================
import habitat
import os
import argparse
import csv
from tqdm import tqdm
from config_utils import hm3d_config,mp3d_config
from mapping_utils.transform import habitat_camera_intrinsic
from mapper import Instruct_Mapper
from objnav_agent import HM3D_Objnav_Agent
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ["MAGNUM_LOG"] = "quiet"
os.environ["HABITAT_SIM_LOG"] = "quiet"
def write_metrics(metrics,path="objnav_hm3d.csv"):
    with open(path, mode="w", newline="") as csv_file:
        fieldnames = metrics[0].keys()
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(metrics)

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--eval_episodes",type=int,default=500)
    parser.add_argument("--mapper_resolution",type=float,default=0.05)
    parser.add_argument("--path_resolution",type=float,default=0.2)
    parser.add_argument("--path_scale",type=int,default=5)
    return parser.parse_known_args()[0]

if __name__ == "__main__":
    args = get_args()
    habitat_config = hm3d_config(stage='val',episodes=args.eval_episodes)
    habitat_env = habitat.Env(habitat_config)
    habitat_mapper = Instruct_Mapper(habitat_camera_intrinsic(habitat_config),
                                    pcd_resolution=args.mapper_resolution,
                                    grid_resolution=args.path_resolution,
                                    grid_size=args.path_scale)
    habitat_agent = HM3D_Objnav_Agent(habitat_env,habitat_mapper)
    evaluation_metrics = []
    for i in tqdm(range(args.eval_episodes)):
        habitat_agent.reset()
        habitat_agent.make_plan()
        while not habitat_env.episode_over and habitat_agent.episode_steps < 495:
            habitat_agent.step()
        habitat_agent.save_trajectory("./tmp/episode-%d/"%i)
        evaluation_metrics.append({'success':habitat_agent.metrics['success'],
                                'spl':habitat_agent.metrics['spl'],
                                'distance_to_goal':habitat_agent.metrics['distance_to_goal'],
                                'object_goal':habitat_agent.instruct_goal})
        write_metrics(evaluation_metrics)

================================================
FILE: requirements.txt
================================================
apex==0.9.10dev
einops==0.8.0
fairscale==0.4.4
fvcore==0.1.5.post20221221
imageio==2.34.1
matplotlib==3.8.4
MultiScaleDeformableAttention==1.0
numpy==1.23.5
numpy_quaternion==2023.0.3
omegaconf==2.3.0
open3d==0.18.0
openai==1.45.0
opencv_python==4.4.0.46
opencv_python_headless==4.5.5.64
pathfinding==1.0.9
Pillow==10.4.0
Requests==2.32.3
salesforce_lavis==1.0.2
scipy==1.14.1
setuptools==60.2.0
timm==0.4.12
torch==2.2.2+cu121
torchvision==0.17.2+cu121
tqdm==4.65.2
transformers==4.26.1
xformers==0.0.28.post1


================================================
FILE: thirdparty/GLEE/configs/R50.yaml
================================================
MODEL:
  META_ARCHITECTURE: "GLEE"
  MASK_ON: True
  BACKBONE:
    FREEZE_AT: 0
    NAME: "build_resnet_backbone"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
  RESNETS:
    DEPTH: 50
    STEM_TYPE: "basic"  # not used
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: False
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]
    # NORM: "SyncBN"
    RES5_MULTI_GRID: [1, 1, 1]  # not used
  SEM_SEG_HEAD:
    NAME: "MaskDINOHead"
    IGNORE_VALUE: 255
    NUM_CLASSES: 80
    LOSS_WEIGHT: 1.0
    CONVS_DIM: 256
    MASK_DIM: 256
    NORM: "GN"
    # pixel decoder
    PIXEL_DECODER_NAME: "MaskDINOEncoder"
    DIM_FEEDFORWARD: 2048
    NUM_FEATURE_LEVELS: 3
    TOTAL_NUM_FEATURE_LEVELS: 4
    IN_FEATURES: ["res2", "res3", "res4", "res5"]
    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: ["res3", "res4", "res5"]
    COMMON_STRIDE: 4
    TRANSFORMER_ENC_LAYERS: 6
    FEATURE_ORDER: "low2high"
  MaskDINO:
    TRANSFORMER_DECODER_NAME: "MaskDINODecoder"
    DEEP_SUPERVISION: True
    NO_OBJECT_WEIGHT: 0.1
    CLASS_WEIGHT: 4.0
    MASK_WEIGHT: 5.0
    DICE_WEIGHT: 5.0
    BOX_WEIGHT: 5.0
    GIOU_WEIGHT: 2.0
    HIDDEN_DIM: 256
    NUM_OBJECT_QUERIES: 300
    NHEADS: 8
    DROPOUT: 0.0
    DIM_FEEDFORWARD: 2048
    ENC_LAYERS: 0
    PRE_NORM: False
    ENFORCE_INPUT_PROJ: False
    SIZE_DIVISIBILITY: 32
    DEC_LAYERS: 9  # 9+1, 9 decoder layers, add one for the loss on learnable query
    TRAIN_NUM_POINTS: 12544
    OVERSAMPLE_RATIO: 3.0
    IMPORTANCE_SAMPLE_RATIO: 0.75
    INITIAL_PRED: True
    TWO_STAGE: True
    DN: "standard"
    DN_NUM: 100
    INITIALIZE_BOX_TYPE: "no"
    TEST:
      SEMANTIC_ON: False
      INSTANCE_ON: True
      PANOPTIC_ON: False
      OVERLAP_THRESHOLD: 0.8
      OBJECT_MASK_THRESHOLD: 0.25
  TEXT:
    ARCH: clip_teacher
  LANGUAGE_BACKBONE:
    LANG_DIM: 512


================================================
FILE: thirdparty/GLEE/configs/SwinL.yaml
================================================
MODEL:
  META_ARCHITECTURE: "GLEE"
  MASK_ON: True
  BACKBONE:
    NAME: "D2SwinTransformer"
  SWIN:
    EMBED_DIM: 192
    DEPTHS: [2, 2, 18, 2]
    NUM_HEADS: [6, 12, 24, 48]
    WINDOW_SIZE: 12
    APE: False
    DROP_PATH_RATE: 0.3
    PATCH_NORM: True
    PRETRAIN_IMG_SIZE: 384
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
  RESNETS:
    DEPTH: 50
    STEM_TYPE: "basic"  # not used
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: False
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]
    # NORM: "SyncBN"
    RES5_MULTI_GRID: [1, 1, 1]  # not used
  SEM_SEG_HEAD:
    NAME: "MaskDINOHead"
    IGNORE_VALUE: 255
    NUM_CLASSES: 80
    LOSS_WEIGHT: 1.0
    CONVS_DIM: 256
    MASK_DIM: 256
    NORM: "GN"
    # pixel decoder
    PIXEL_DECODER_NAME: "MaskDINOEncoder"
    DIM_FEEDFORWARD: 2048
    NUM_FEATURE_LEVELS: 3
    TOTAL_NUM_FEATURE_LEVELS: 4
    IN_FEATURES: ["res2", "res3", "res4", "res5"]
    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: ["res3", "res4", "res5"]
    COMMON_STRIDE: 4
    TRANSFORMER_ENC_LAYERS: 6
    FEATURE_ORDER: "low2high"
  MaskDINO:
    TRANSFORMER_DECODER_NAME: "MaskDINODecoder"
    DEEP_SUPERVISION: True
    NO_OBJECT_WEIGHT: 0.1
    CLASS_WEIGHT: 4.0
    MASK_WEIGHT: 5.0
    DICE_WEIGHT: 5.0
    BOX_WEIGHT: 5.0
    GIOU_WEIGHT: 2.0
    HIDDEN_DIM: 256
    NUM_OBJECT_QUERIES: 300
    NHEADS: 8
    DROPOUT: 0.0
    DIM_FEEDFORWARD: 2048
    ENC_LAYERS: 0
    PRE_NORM: False
    ENFORCE_INPUT_PROJ: False
    SIZE_DIVISIBILITY: 32
    DEC_LAYERS: 9  # 9+1, 9 decoder layers, add one for the loss on learnable query
    TRAIN_NUM_POINTS: 12544
    OVERSAMPLE_RATIO: 3.0
    IMPORTANCE_SAMPLE_RATIO: 0.75
    INITIAL_PRED: True
    TWO_STAGE: True
    DN: "standard"
    DN_NUM: 100
    INITIALIZE_BOX_TYPE: "no"
    TEST:
      SEMANTIC_ON: False
      INSTANCE_ON: True
      PANOPTIC_ON: False
      OVERLAP_THRESHOLD: 0.8
      OBJECT_MASK_THRESHOLD: 0.25
  TEXT:
    ARCH: clip_teacher
  LANGUAGE_BACKBONE:
    LANG_DIM: 512


================================================
FILE: thirdparty/GLEE/glee/__init__.py
================================================
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function


from .config import add_glee_config
from .config_deeplab import add_deeplab_config
# from .GLEE import GLEE
# from .data import build_detection_train_loader, build_detection_test_loader
from .backbone.swin import D2SwinTransformer
from .backbone.eva02 import D2_EVA02



================================================
FILE: thirdparty/GLEE/glee/backbone/__init__.py
================================================
from .build import build_backbone

from .resnet import *
from .swin import *
# from .focal import *
# from .focal_dw import *
from .backbone import *

================================================
FILE: thirdparty/GLEE/glee/backbone/backbone.py
================================================
# Copyright (c) Facebook, Inc. and its affiliates.
import torch.nn as nn

from detectron2.modeling import ShapeSpec

__all__ = ["Backbone"]


class Backbone(nn.Module):
    """
    Abstract base class for network backbones.
    """

    def __init__(self):
        """
        The `__init__` method of any subclass can specify its own set of arguments.
        """
        super().__init__()

    def forward(self):
        """
        Subclasses must override this method, but adhere to the same return type.

        Returns:
            dict[str->Tensor]: mapping from feature name (e.g., "res2") to tensor
        """
        pass

    @property
    def size_divisibility(self) -> int:
        """
        Some backbones require the input height and width to be divisible by a
        specific integer. This is typically true for encoder / decoder type networks
        with lateral connection (e.g., FPN) for which feature maps need to match
        dimension in the "bottom up" and "top down" paths. Set to 0 if no specific
        input size divisibility is required.
        """
        return 0

    def output_shape(self):
        """
        Returns:
            dict[str->ShapeSpec]
        """
        # this is a backward-compatible default
        return {
            name: ShapeSpec(
                channels=self._out_feature_channels[name], stride=self._out_feature_strides[name]
            )
            for name in self._out_features
        }


================================================
FILE: thirdparty/GLEE/glee/backbone/build.py
================================================
from .registry import model_entrypoints
from .registry import is_model

from .backbone import *

def build_backbone(config, **kwargs):
    model_name = config['MODEL']['BACKBONE']['NAME']
    if not is_model(model_name):
        raise ValueError(f'Unknown model: {model_name}')
    model = model_entrypoints(model_name)(config, **kwargs)
    return model

================================================
FILE: thirdparty/GLEE/glee/backbone/davit.py
================================================
import os
import itertools
import logging

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint as checkpoint
from collections import OrderedDict

from einops import rearrange
from timm.models.layers import DropPath, trunc_normal_

from detectron2.utils.file_io import PathManager
from detectron2.modeling import BACKBONE_REGISTRY, Backbone, ShapeSpec

from .registry import register_backbone

logger = logging.getLogger(__name__)


class MySequential(nn.Sequential):
    def forward(self, *inputs):
        for module in self._modules.values():
            if type(inputs) == tuple:
                inputs = module(*inputs)
            else:
                inputs = module(inputs)
        return inputs


class PreNorm(nn.Module):
    def __init__(self, norm, fn, drop_path=None):
        super().__init__()
        self.norm = norm
        self.fn = fn
        self.drop_path = drop_path

    def forward(self, x, *args, **kwargs):
        shortcut = x
        if self.norm != None:
            x, size = self.fn(self.norm(x), *args, **kwargs)
        else:
            x, size = self.fn(x, *args, **kwargs)

        if self.drop_path:
            x = self.drop_path(x)

        x = shortcut + x

        return x, size


class Mlp(nn.Module):
    def __init__(
        self,
        in_features,
        hidden_features=None,
        out_features=None,
        act_layer=nn.GELU,
    ):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.net = nn.Sequential(OrderedDict([
            ("fc1", nn.Linear(in_features, hidden_features)),
            ("act", act_layer()),
            ("fc2", nn.Linear(hidden_features, out_features))
        ]))

    def forward(self, x, size):
        return self.net(x), size


class DepthWiseConv2d(nn.Module):
    def __init__(
        self,
        dim_in,
        kernel_size,
        padding,
        stride,
        bias=True,
    ):
        super().__init__()
        self.dw = nn.Conv2d(
            dim_in, dim_in,
            kernel_size=kernel_size,
            padding=padding,
            groups=dim_in,
            stride=stride,
            bias=bias
        )

    def forward(self, x, size):
        B, N, C = x.shape
        H, W = size
        assert N == H * W

        x = self.dw(x.transpose(1, 2).view(B, C, H, W))
        size = (x.size(-2), x.size(-1))
        x = x.flatten(2).transpose(1, 2)
        return x, size


class ConvEmbed(nn.Module):
    """ Image to Patch Embedding
    """

    def __init__(
        self,
        patch_size=7,
        in_chans=3,
        embed_dim=64,
        stride=4,
        padding=2,
        norm_layer=None,
        pre_norm=True
    ):
        super().__init__()
        self.patch_size = patch_size

        self.proj = nn.Conv2d(
            in_chans, embed_dim,
            kernel_size=patch_size,
            stride=stride,
            padding=padding
        )

        dim_norm = in_chans if pre_norm else embed_dim
        self.norm = norm_layer(dim_norm) if norm_layer else None

        self.pre_norm = pre_norm

    def forward(self, x, size):
        H, W = size
        if len(x.size()) == 3:
            if self.norm and self.pre_norm:
                x = self.norm(x)
            x = rearrange(
                x, 'b (h w) c -> b c h w',
                h=H, w=W
            )

        x = self.proj(x)

        _, _, H, W = x.shape
        x = rearrange(x, 'b c h w -> b (h w) c')
        if self.norm and not self.pre_norm:
            x = self.norm(x)

        return x, (H, W)


class ChannelAttention(nn.Module):

    def __init__(self, dim, groups=8, qkv_bias=True):
        super().__init__()

        self.groups = groups
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, size):
        B, N, C = x.shape

        qkv = self.qkv(x).reshape(B, N, 3, self.groups, C // self.groups).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]

        q = q * (N ** -0.5)
        attention = q.transpose(-1, -2) @ k
        attention = attention.softmax(dim=-1)
        x = (attention @ v.transpose(-1, -2)).transpose(-1, -2)
        x = x.transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        return x, size


class ChannelBlock(nn.Module):

    def __init__(self, dim, groups, mlp_ratio=4., qkv_bias=True,
                 drop_path_rate=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm,
                 conv_at_attn=True, conv_at_ffn=True):
        super().__init__()

        drop_path = DropPath(drop_path_rate) if drop_path_rate > 0. else nn.Identity()

        self.conv1 = PreNorm(None, DepthWiseConv2d(dim, 3, 1, 1)) if conv_at_attn else None
        self.channel_attn = PreNorm(
            norm_layer(dim),
            ChannelAttention(dim, groups=groups, qkv_bias=qkv_bias),
            drop_path
        )
        self.conv2 = PreNorm(None, DepthWiseConv2d(dim, 3, 1, 1)) if conv_at_ffn else None
        self.ffn = PreNorm(
            norm_layer(dim),
            Mlp(in_features=dim, hidden_features=int(dim*mlp_ratio), act_layer=act_layer),
            drop_path
        )

    def forward(self, x, size):
        if self.conv1:
            x, size = self.conv1(x, size)
        x, size = self.channel_attn(x, size)

        if self.conv2:
            x, size = self.conv2(x, size)
        x, size = self.ffn(x, size)

        return x, size


def window_partition(x, window_size: int):
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)
    return windows


def window_reverse(windows, window_size: int, H: int, W: int):
    B = int(windows.shape[0] / (H * W / window_size / window_size))
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)
    return x


class WindowAttention(nn.Module):
    def __init__(self, dim, num_heads, window_size, qkv_bias=True):

        super().__init__()
        self.dim = dim
        self.window_size = window_size
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.proj = nn.Linear(dim, dim)

        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x, size):

        H, W = size
        B, L, C = x.shape
        assert L == H * W, "input feature has wrong size"

        x = x.view(B, H, W, C)

        pad_l = pad_t = 0
        pad_r = (self.window_size - W % self.window_size) % self.window_size
        pad_b = (self.window_size - H % self.window_size) % self.window_size
        x = F.pad(x, (0, 0, pad_l, pad_r, pad_t, pad_b))
        _, Hp, Wp, _ = x.shape

        x = window_partition(x, self.window_size)
        x = x.view(-1, self.window_size * self.window_size, C)

        # W-MSA/SW-MSA
        # attn_windows = self.attn(x_windows)

        B_, N, C = x.shape
        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]

        q = q * self.scale
        attn = (q @ k.transpose(-2, -1))
        attn = self.softmax(attn)

        x = (attn @ v).transpose(1, 2).reshape(B_, N, C)
        x = self.proj(x)

        # merge windows
        x = x.view(
            -1, self.window_size, self.window_size, C
        )
        x = window_reverse(x, self.window_size, Hp, Wp)

        if pad_r > 0 or pad_b > 0:
            x = x[:, :H, :W, :].contiguous()

        x = x.view(B, H * W, C)

        return x, size


class SpatialBlock(nn.Module):

    def __init__(self, dim, num_heads, window_size,
                 mlp_ratio=4., qkv_bias=True, drop_path_rate=0., act_layer=nn.GELU,
                 norm_layer=nn.LayerNorm, conv_at_attn=True, conv_at_ffn=True):
        super().__init__()

        drop_path = DropPath(drop_path_rate) if drop_path_rate > 0. else nn.Identity()

        self.conv1 = PreNorm(None, DepthWiseConv2d(dim, 3, 1, 1)) if conv_at_attn else None
        self.window_attn = PreNorm(
            norm_layer(dim),
            WindowAttention(dim, num_heads, window_size, qkv_bias=qkv_bias),
            drop_path
        )
        self.conv2 = PreNorm(None, DepthWiseConv2d(dim, 3, 1, 1)) if conv_at_ffn else None
        self.ffn = PreNorm(
            norm_layer(dim),
            Mlp(in_features=dim, hidden_features=int(dim*mlp_ratio), act_layer=act_layer),
            drop_path
        )

    def forward(self, x, size):
        if self.conv1:
            x, size = self.conv1(x, size)
        x, size = self.window_attn(x, size)

        if self.conv2:
            x, size = self.conv2(x, size)
        x, size = self.ffn(x, size)
        return x, size


class DaViT(nn.Module):
    """ DaViT: Dual-Attention Transformer

    Args:
        img_size (int): Image size, Default: 224.
        in_chans (int): Number of input image channels. Default: 3.
        num_classes (int): Number of classes for classification head. Default: 1000.
        patch_size (tuple(int)): Patch size of convolution in different stages. Default: (7, 2, 2, 2).
        patch_stride (tuple(int)): Patch stride of convolution in different stages. Default: (4, 2, 2, 2).
        patch_padding (tuple(int)): Patch padding of convolution in different stages. Default: (3, 0, 0, 0).
        patch_prenorm (tuple(bool)): If True, perform norm before the convolution layer. Default: (False, False, False, False).
        embed_dims (tuple(int)): Patch embedding dimension in different stages. Default: (64, 128, 192, 256).
        num_heads (tuple(int)): Number of spatial attention heads in different stages. Default: (3, 6, 12, 24).
        num_groups (tuple(int)): Number of channel groups in different stages. Default: (3, 6, 12, 24).
        window_size (int): Window size. Default: 7.
        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4.
        qkv_bias (bool): If True, add a learnable bias to query, key, value. Default: True.
        drop_path_rate (float): Stochastic depth rate. Default: 0.1.
        norm_layer (nn.Module): Normalization layer. Default: nn.LayerNorm.
        enable_checkpoint (bool): If True, enable checkpointing. Default: False.
        conv_at_attn (bool): If True, perform depthwise convolution before attention layer. Default: True.
        conv_at_ffn (bool): If True, perform depthwise convolution before ffn layer. Default: True.
    """

    def __init__(
        self,
        img_size=224,
        in_chans=3,
        num_classes=1000,
        depths=(1, 1, 3, 1),
        patch_size=(7, 2, 2, 2),
        patch_stride=(4, 2, 2, 2),
        patch_padding=(3, 0, 0, 0),
        patch_prenorm=(False, False, False, False),
        embed_dims=(64, 128, 192, 256),
        num_heads=(3, 6, 12, 24),
        num_groups=(3, 6, 12, 24),
        window_size=7,
        mlp_ratio=4.,
        qkv_bias=True,
        drop_path_rate=0.1,
        norm_layer=nn.LayerNorm,
        enable_checkpoint=False,
        conv_at_attn=True,
        conv_at_ffn=True,
        out_indices=[],
     ):
        super().__init__()

        self.num_classes = num_classes
        self.embed_dims = embed_dims
        self.num_heads = num_heads
        self.num_groups = num_groups
        self.num_stages = len(self.embed_dims)
        self.enable_checkpoint = enable_checkpoint
        assert self.num_stages == len(self.num_heads) == len(self.num_groups)

        num_stages = len(embed_dims)
        self.img_size = img_size
        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths)*2)]


        depth_offset = 0
        convs = []
        blocks = []
        for i in range(num_stages):
            conv_embed = ConvEmbed(
                patch_size=patch_size[i],
                stride=patch_stride[i],
                padding=patch_padding[i],
                in_chans=in_chans if i == 0 else self.embed_dims[i - 1],
                embed_dim=self.embed_dims[i],
                norm_layer=norm_layer,
                pre_norm=patch_prenorm[i]
            )
            convs.append(conv_embed)

            print(f'=> Depth offset in stage {i}: {depth_offset}')
            block = MySequential(
                *[
                    MySequential(OrderedDict([
                        (
                            'spatial_block', SpatialBlock(
                                embed_dims[i],
                                num_heads[i],
                                window_size,
                                drop_path_rate=dpr[depth_offset+j*2],
                                qkv_bias=qkv_bias,
                                mlp_ratio=mlp_ratio,
                                conv_at_attn=conv_at_attn,
                                conv_at_ffn=conv_at_ffn,
                            )
                        ),
                        (
                            'channel_block', ChannelBlock(
                                embed_dims[i],
                                num_groups[i],
                                drop_path_rate=dpr[depth_offset+j*2+1],
                                qkv_bias=qkv_bias,
                                mlp_ratio=mlp_ratio,
                                conv_at_attn=conv_at_attn,
                                conv_at_ffn=conv_at_ffn,
                            )
                        )
                    ])) for j in range(depths[i])
                ]
            )
            blocks.append(block)
            depth_offset += depths[i]*2

        self.convs = nn.ModuleList(convs)
        self.blocks = nn.ModuleList(blocks)

        self.out_indices = out_indices
        # self.norms = norm_layer(self.embed_dims[-1])
        # self.avgpool = nn.AdaptiveAvgPool1d(1)
        # self.head = nn.Linear(self.embed_dims[-1], num_classes) if num_classes > 0 else nn.Identity()
        self.apply(self._init_weights)

    @property
    def dim_out(self):
        return self.embed_dims[-1]

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight, std=0.02)
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.Conv2d):
            nn.init.normal_(m.weight, std=0.02)
            for name, _ in m.named_parameters():
                if name in ['bias']:
                    nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.weight, 1.0)
            nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.BatchNorm2d):
            nn.init.constant_(m.weight, 1.0)
            nn.init.constant_(m.bias, 0)

    def _try_remap_keys(self, pretrained_dict):
        remap_keys = {
            "conv_embeds": "convs",
            "main_blocks": "blocks",
            "0.cpe.0.proj": "spatial_block.conv1.fn.dw",
            "0.attn": "spatial_block.window_attn.fn",
            "0.cpe.1.proj": "spatial_block.conv2.fn.dw",
            "0.mlp": "spatial_block.ffn.fn.net",
            "1.cpe.0.proj": "channel_block.conv1.fn.dw",
            "1.attn": "channel_block.channel_attn.fn",
            "1.cpe.1.proj": "channel_block.conv2.fn.dw",
            "1.mlp": "channel_block.ffn.fn.net",
            "0.norm1": "spatial_block.window_attn.norm",
            "0.norm2": "spatial_block.ffn.norm",
            "1.norm1": "channel_block.channel_attn.norm",
            "1.norm2": "channel_block.ffn.norm"
        }

        full_key_mappings = {}
        for k in pretrained_dict.keys():
            old_k = k
            for remap_key in remap_keys.keys():
                if remap_key in k:
                    print(f'=> Replace {remap_key} with {remap_keys[remap_key]}')
                    k = k.replace(remap_key, remap_keys[remap_key])

            full_key_mappings[old_k] = k

        return full_key_mappings

    def from_state_dict(self, pretrained_dict, pretrained_layers=[], verbose=True):
        model_dict = self.state_dict()
        stripped_key = lambda x: x[14:] if x.startswith('image_encoder.') else x
        full_key_mappings = self._try_remap_keys(pretrained_dict)

        pretrained_dict = {
            stripped_key(full_key_mappings[k]): v for k, v in pretrained_dict.items()
            if stripped_key(full_key_mappings[k]) in model_dict.keys()
        }
        need_init_state_dict = {}
        for k, v in pretrained_dict.items():
            need_init = (
                k.split('.')[0] in pretrained_layers
                or pretrained_layers[0] == '*'
            )
            if need_init:
                if verbose:
                    print(f'=> init {k} from pretrained state dict')

                need_init_state_dict[k] = v
        self.load_state_dict(need_init_state_dict, strict=False)

    def from_pretrained(self, pretrained='', pretrained_layers=[], verbose=True):
        if os.path.isfile(pretrained):
            print(f'=> loading pretrained model {pretrained}')
            pretrained_dict = torch.load(pretrained, map_location='cpu')

            self.from_state_dict(pretrained_dict, pretrained_layers, verbose)

    def forward_features(self, x):
        input_size = (x.size(2), x.size(3))

        outs = {}
        for i, (conv, block) in enumerate(zip(self.convs, self.blocks)):
            x, input_size = conv(x, input_size)
            if self.enable_checkpoint:
                x, input_size = checkpoint.checkpoint(block, x, input_size)
            else:
                x, input_size = block(x, input_size)
            if i in self.out_indices:
                out = x.view(-1, *input_size, self.embed_dims[i]).permute(0, 3, 1, 2).contiguous()
                outs["res{}".format(i + 2)] = out       

        if len(self.out_indices) == 0:
            outs["res5"] = x.view(-1, *input_size, self.embed_dims[-1]).permute(0, 3, 1, 2).contiguous()
        
        return outs

    def forward(self, x):
        x = self.forward_features(x)
        # x = self.head(x)
        return x

class D2DaViT(DaViT, Backbone):
    def __init__(self, cfg, input_shape):

        spec = cfg['BACKBONE']['DAVIT']

        super().__init__(
            num_classes=0,
            depths=spec['DEPTHS'],
            embed_dims=spec['DIM_EMBED'],
            num_heads=spec['NUM_HEADS'],
            num_groups=spec['NUM_GROUPS'],
            patch_size=spec['PATCH_SIZE'],
            patch_stride=spec['PATCH_STRIDE'],
            patch_padding=spec['PATCH_PADDING'],
            patch_prenorm=spec['PATCH_PRENORM'],
            drop_path_rate=spec['DROP_PATH_RATE'],
            img_size=input_shape,
            window_size=spec.get('WINDOW_SIZE', 7),
            enable_checkpoint=spec.get('ENABLE_CHECKPOINT', False),
            conv_at_attn=spec.get('CONV_AT_ATTN', True),
            conv_at_ffn=spec.get('CONV_AT_FFN', True),
            out_indices=spec.get('OUT_INDICES', []),
        )

        self._out_features = cfg['BACKBONE']['DAVIT']['OUT_FEATURES']

        self._out_feature_strides = {
            "res2": 4,
            "res3": 8,
            "res4": 16,
            "res5": 32,
        }
        self._out_feature_channels = {
            "res2": self.embed_dims[0],
            "res3": self.embed_dims[1],
            "res4": self.embed_dims[2],
            "res5": self.embed_dims[3],
        }

    def forward(self, x):
        """
        Args:
            x: Tensor of shape (N,C,H,W). H, W must be a multiple of ``self.size_divisibility``.
        Returns:
            dict[str->Tensor]: names and the corresponding features
        """
        assert (
            x.dim() == 4
        ), f"SwinTransformer takes an input of shape (N, C, H, W). Got {x.shape} instead!"
        outputs = {}
        y = super().forward(x)

        for k in y.keys():
            if k in self._out_features:
                outputs[k] = y[k]
        return outputs

    def output_shape(self):
        return {
            name: ShapeSpec(
                channels=self._out_feature_channels[name], stride=self._out_feature_strides[name]
            )
            for name in self._out_features
        }

    @property
    def size_divisibility(self):
        return 32

@register_backbone
def get_davit_backbone(cfg):
    davit = D2DaViT(cfg['MODEL'], 224)    

    if cfg['MODEL']['BACKBONE']['LOAD_PRETRAINED'] is True:
        filename = cfg['MODEL']['BACKBONE']['PRETRAINED']
        logger.info(f'=> init from {filename}')
        davit.from_pretrained(
            filename, 
            cfg['MODEL']['BACKBONE']['DAVIT'].get('PRETRAINED_LAYERS', ['*']), 
            cfg['VERBOSE'])

    return davit
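
# --------------------------------------------------------------------------
# Illustrative usage sketch (not part of the original file). It builds the
# DaViT backbone from a minimal config dict containing only the keys read by
# get_davit_backbone / D2DaViT above. The hyper-parameter values below are
# assumptions (DaViT-Tiny-like); the real values come from the GLEE configs.
# --------------------------------------------------------------------------
def _example_build_davit_backbone():
    example_cfg = {
        'VERBOSE': True,
        'MODEL': {
            'BACKBONE': {
                'LOAD_PRETRAINED': False,  # skip from_pretrained() in this sketch
                'PRETRAINED': '',
                'DAVIT': {
                    'DEPTHS': [1, 1, 3, 1],
                    'DIM_EMBED': [96, 192, 384, 768],
                    'NUM_HEADS': [3, 6, 12, 24],
                    'NUM_GROUPS': [3, 6, 12, 24],
                    'PATCH_SIZE': [7, 2, 2, 2],
                    'PATCH_STRIDE': [4, 2, 2, 2],
                    'PATCH_PADDING': [3, 0, 0, 0],
                    'PATCH_PRENORM': [False, False, False, False],
                    'DROP_PATH_RATE': 0.1,
                    'OUT_INDICES': [0, 1, 2, 3],
                    'OUT_FEATURES': ['res2', 'res3', 'res4', 'res5'],
                },
            },
        },
    }
    backbone = get_davit_backbone(example_cfg)
    feats = backbone(torch.randn(1, 3, 224, 224))
    # Expected strides (see _out_feature_strides): res2=4, res3=8, res4=16, res5=32.
    return {name: tuple(f.shape) for name, f in feats.items()}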

================================================
FILE: thirdparty/GLEE/glee/backbone/eva01.py
================================================
import logging
import math
from functools import partial

import fvcore.nn.weight_init as weight_init
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor, Size
from typing import Union, List
from torch.nn.parameter import Parameter
import numbers

from detectron2.layers import CNNBlockBase, Conv2d, get_norm
from detectron2.modeling.backbone.fpn import _assert_strides_are_log2_contiguous

from fairscale.nn.checkpoint import checkpoint_wrapper
from timm.models.layers import DropPath, Mlp, trunc_normal_

# from detectron2.modeling.backbone import Backbone
from detectron2.modeling import BACKBONE_REGISTRY, Backbone, ShapeSpec

from .eva_01_utils import (
    PatchEmbed,
    add_decomposed_rel_pos,
    get_abs_pos,
    window_partition,
    window_unpartition,
)
from detectron2.modeling.backbone.fpn import LastLevelMaxPool

logger = logging.getLogger(__name__)


__all__ = ["EVAViT", "SimpleFeaturePyramid", "get_vit_lr_decay_rate"]


_shape_t = Union[int, List[int], Size]


# steal from beit https://github.com/microsoft/unilm/tree/master/beit
class LayerNormWithForceFP32(nn.Module):
    __constants__ = ['normalized_shape', 'eps', 'elementwise_affine']
    normalized_shape: _shape_t
    eps: float
    elementwise_affine: bool

    def __init__(self, normalized_shape: _shape_t, eps: float = 1e-5, elementwise_affine: bool = True) -> None:
        super(LayerNormWithForceFP32, self).__init__()
        if isinstance(normalized_shape, numbers.Integral):
            normalized_shape = (normalized_shape,)
        self.normalized_shape = tuple(normalized_shape)
        self.eps = eps
        self.elementwise_affine = elementwise_affine
        if self.elementwise_affine:
            self.weight = Parameter(torch.Tensor(*normalized_shape))
            self.bias = Parameter(torch.Tensor(*normalized_shape))
        else:
            self.register_parameter('weight', None)
            self.register_parameter('bias', None)
        self.reset_parameters()

    def reset_parameters(self) -> None:
        if self.elementwise_affine:
            nn.init.ones_(self.weight)
            nn.init.zeros_(self.bias)

    def forward(self, input: Tensor) -> Tensor:
        return F.layer_norm(
            input.float(), self.normalized_shape, self.weight.float(), self.bias.float(), self.eps).type_as(input)

    def extra_repr(self) -> Tensor:
        return '{normalized_shape}, eps={eps}, ' \
               'elementwise_affine={elementwise_affine}'.format(**self.__dict__)
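
# Illustrative sketch (not part of the original file): regardless of the input
# dtype, LayerNormWithForceFP32 normalizes in float32 and casts the result back
# with .type_as(input), which keeps half-precision activations numerically stable.
def _example_force_fp32_layernorm():
    ln = LayerNormWithForceFP32(8)
    x = torch.randn(2, 4, 8, dtype=torch.float16)
    y = ln(x)
    assert y.dtype == torch.float16  # output dtype follows the input
    return y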


class Attention(nn.Module):
    """Multi-head Attention block with relative position embeddings."""

    def __init__(
        self,
        dim,
        num_heads=8,
        qkv_bias=True,
        beit_like_qkv_bias=False,
        use_rel_pos=False,
        rel_pos_zero_init=True,
        input_size=None,
        interp_type="vitdet",
    ):
        """
        Args:
            dim (int): Number of input channels.
            num_heads (int): Number of attention heads.
            qkv_bias (bool): If True, add a learnable bias to query, key, value.
            use_rel_pos (bool): If True, add relative positional embeddings to the attention map.
            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.
            input_size (int or None): Input resolution for calculating the relative positional
                parameter size.
        """
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim**-0.5

        self.beit_like_qkv_bias = beit_like_qkv_bias
        if beit_like_qkv_bias:
            self.q_bias = nn.Parameter(torch.zeros(dim))
            self.v_bias = nn.Parameter(torch.zeros(dim))

        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.proj = nn.Linear(dim, dim)

        self.use_rel_pos = use_rel_pos
        self.interp_type = interp_type
        if self.use_rel_pos:
            # initialize relative positional embeddings
            self.rel_pos_h = nn.Parameter(torch.zeros(2 * input_size[0] - 1, head_dim))
            self.rel_pos_w = nn.Parameter(torch.zeros(2 * input_size[1] - 1, head_dim))

            if not rel_pos_zero_init:
                trunc_normal_(self.rel_pos_h, std=0.02)
                trunc_normal_(self.rel_pos_w, std=0.02)
        self.qk_float = False

    def forward(self, x):
        B, H, W, _ = x.shape
        # qkv with shape (3, B, nHead, H * W, C)
        if self.beit_like_qkv_bias:
            qkv_bias = torch.cat((self.q_bias, torch.zeros_like(self.v_bias, requires_grad=False), self.v_bias))
            qkv = torch.nn.functional.linear(input=x, weight=self.qkv.weight, bias=qkv_bias)
            qkv = qkv.reshape(B, H * W, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
        else:
            qkv = self.qkv(x).reshape(B, H * W, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
        # q, k, v with shape (B * nHead, H * W, C)
        q, k, v = qkv.reshape(3, B * self.num_heads, H * W, -1).unbind(0)

        if self.qk_float:
            attn = (q.float() * self.scale) @ k.float().transpose(-2, -1)
            if self.use_rel_pos:
                attn = add_decomposed_rel_pos(attn, q, self.rel_pos_h, self.rel_pos_w, (H, W), (H, W), self.interp_type)
            attn = attn.softmax(dim=-1).type_as(x)
        else:
            attn = (q * self.scale) @ k.transpose(-2, -1)
            if self.use_rel_pos:
                attn = add_decomposed_rel_pos(attn, q, self.rel_pos_h, self.rel_pos_w, (H, W), (H, W), self.interp_type)
            attn = attn.softmax(dim=-1)
        x = (attn @ v).view(B, self.num_heads, H, W, -1).permute(0, 2, 3, 1, 4).reshape(B, H, W, -1)
        x = self.proj(x)

        return x
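
# Illustrative sketch (not part of the original file): the block above attends
# over (B, H, W, C) feature maps; with use_rel_pos=True it adds the decomposed
# relative-position bias built from rel_pos_h / rel_pos_w (shapes (2*H-1, head_dim)
# and (2*W-1, head_dim)) to the attention logits.
def _example_rel_pos_attention():
    attn = Attention(dim=64, num_heads=4, use_rel_pos=True, input_size=(7, 7))
    x = torch.randn(1, 7, 7, 64)
    return attn(x)  # shape preserved: (1, 7, 7, 64)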


class ResBottleneckBlock(CNNBlockBase):
    """
    The standard bottleneck residual block without the last activation layer.
    It contains 3 conv layers with kernels 1x1, 3x3, 1x1.
    """

    def __init__(
        self,
        in_channels,
        out_channels,
        bottleneck_channels,
        norm="LN",
        act_layer=nn.GELU,
    ):
        """
        Args:
            in_channels (int): Number of input channels.
            out_channels (int): Number of output channels.
            bottleneck_channels (int): number of output channels for the 3x3
                "bottleneck" conv layers.
            norm (str or callable): normalization for all conv layers.
                See :func:`layers.get_norm` for supported format.
            act_layer (callable): activation for all conv layers.
        """
        super().__init__(in_channels, out_channels, 1)

        self.conv1 = Conv2d(in_channels, bottleneck_channels, 1, bias=False)
        self.norm1 = get_norm(norm, bottleneck_channels)
        self.act1 = act_layer()

        self.conv2 = Conv2d(
            bottleneck_channels,
            bottleneck_channels,
            3,
            padding=1,
            bias=False,
        )
        self.norm2 = get_norm(norm, bottleneck_channels)
        self.act2 = act_layer()

        self.conv3 = Conv2d(bottleneck_channels, out_channels, 1, bias=False)
        self.norm3 = get_norm(norm, out_channels)

        for layer in [self.conv1, self.conv2, self.conv3]:
            weight_init.c2_msra_fill(layer)
        for layer in [self.norm1, self.norm2]:
            layer.weight.data.fill_(1.0)
            layer.bias.data.zero_()
        # zero init last norm layer.
        self.norm3.weight.data.zero_()
        self.norm3.bias.data.zero_()

    def forward(self, x):
        out = x
        for layer in self.children():
            out = layer(out)

        out = x + out
        return out


class Block(nn.Module):
    """Transformer blocks with support of window attention and residual propagation blocks"""

    def __init__(
        self,
        dim,
        num_heads,
        mlp_ratio=4.0,
        qkv_bias=True,
        drop_path=0.0,
        norm_layer=LayerNormWithForceFP32,
        act_layer=nn.GELU,
        use_rel_pos=False,
        rel_pos_zero_init=True,
        window_size=0,
        use_residual_block=False,
        input_size=None,
        beit_like_qkv_bias=False,
        beit_like_gamma=False,
        interp_type="vitdet",
    ):
        """
        Args:
            dim (int): Number of input channels.
            num_heads (int): Number of attention heads in each ViT block.
            mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
            qkv_bias (bool): If True, add a learnable bias to query, key, value.
            drop_path (float): Stochastic depth rate.
            norm_layer (nn.Module): Normalization layer.
            act_layer (nn.Module): Activation layer.
            use_rel_pos (bool): If True, add relative positional embeddings to the attention map.
            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.
            window_size (int): Window size for window attention blocks. If it equals 0, window
                attention is not used.
            use_residual_block (bool): If True, use a residual block after the MLP block.
            input_size (int or None): Input resolution for calculating the relative positional
                parameter size.
            beit_like_qkv_bias (bool): If True, use BEiT-style separate q / v bias parameters
                in the attention (the fused qkv bias is disabled).
            beit_like_gamma (bool): If True, scale the attention and MLP branches with learnable
                gamma_1 / gamma_2 parameters.
        """
        super().__init__()
        self.norm1 = norm_layer(dim)
        self.attn = Attention(
            dim,
            num_heads=num_heads,
            qkv_bias=qkv_bias,
            use_rel_pos=use_rel_pos,
            rel_pos_zero_init=rel_pos_zero_init,
            input_size=input_size if window_size == 0 else (window_size, window_size),
            beit_like_qkv_bias=beit_like_qkv_bias,
            interp_type=interp_type,
        )

        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
        self.norm2 = norm_layer(dim)
        self.mlp = Mlp(in_features=dim, hidden_features=int(dim * mlp_ratio), act_layer=act_layer)

        self.window_size = window_size

        self.use_residual_block = use_residual_block
        if use_residual_block:
            # Use a residual block with bottleneck channel as dim // 2
            self.residual = ResBottleneckBlock(
                in_channels=dim,
                out_channels=dim,
                bottleneck_channels=dim // 2,
                norm="LN",
                act_layer=act_layer,
            )

        self.beit_like_gamma = beit_like_gamma
        if beit_like_gamma:
            self.gamma_1 = nn.Parameter(torch.ones((dim)), requires_grad=True)
            self.gamma_2 = nn.Parameter(torch.ones((dim)), requires_grad=True)

    def forward(self, x):
        shortcut = x
        x = self.norm1(x)
        # Window partition
        if self.window_size > 0:
            H, W = x.shape[1], x.shape[2]
            x, pad_hw = window_partition(x, self.window_size)

        x = self.attn(x)
        # Reverse window partition
        if self.window_size > 0:
            x = window_unpartition(x, self.window_size, pad_hw, (H, W))

        if self.beit_like_gamma:
            x = shortcut + self.drop_path(self.gamma_1 * x)
            x = x + self.drop_path(self.gamma_2 * self.mlp(self.norm2(x)))
        else:
            x = shortcut + self.drop_path(x)
            x = x + self.drop_path(self.mlp(self.norm2(x)))

        if self.use_residual_block:
            x = self.residual(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)

        return x
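
# Illustrative sketch (not part of the original file): with window_size > 0 the
# block runs attention inside non-overlapping windows (padding the feature map if
# needed) and stitches them back afterwards; with window_size == 0 it attends
# globally over the whole map.
def _example_windowed_block():
    blk = Block(dim=64, num_heads=4, window_size=7)
    x = torch.randn(1, 14, 14, 64)  # 14x14 map -> four 7x7 windows
    return blk(x)  # shape preserved: (1, 14, 14, 64)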


class EVAViT(Backbone):
    """
    This module implements Vision Transformer (ViT) backbone in :paper:`vitdet`.
    "Exploring Plain Vision Transformer Backbones for Object Detection",
    https://arxiv.org/abs/2203.16527
    """

    def __init__(
        self,
        img_size=1024,
        patch_size=16,
        in_chans=3,
        embed_dim=768,
        depth=12,
        num_heads=12,
        mlp_ratio=4.0,
        qkv_bias=True,
        drop_path_rate=0.0,
        norm_layer=LayerNormWithForceFP32,
        act_layer=nn.GELU,
        use_abs_pos=True,
        use_rel_pos=False,
        rel_pos_zero_init=True,
        window_size=0,
        window_block_indexes=(),
        residual_block_indexes=(),
        use_act_checkpoint=False,
        pretrain_img_size=224,
        pretrain_use_cls_token=True,
        out_feature="last_feat",
        beit_like_qkv_bias=True,
        beit_like_gamma=False,
        freeze_patch_embed=False,
        interp_type="vitdet", 
    ):
        """
        Args:
            img_size (int): Input image size.
            patch_size (int): Patch size.
            in_chans (int): Number of input image channels.
            embed_dim (int): Patch embedding dimension.
            depth (int): Depth of ViT.
            num_heads (int): Number of attention heads in each ViT block.
            mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
            qkv_bias (bool): If True, add a learnable bias to query, key, value.
            drop_path_rate (float): Stochastic depth rate.
            norm_layer (nn.Module): Normalization layer.
            act_layer (nn.Module): Activation layer.
            use_abs_pos (bool): If True, use absolute positional embeddings.
            use_rel_pos (bool): If True, add relative positional embeddings to the attention map.
            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.
            window_size (int): Window size for window attention blocks.
            window_block_indexes (list): Indexes for blocks using window attention.
            residual_block_indexes (list): Indexes for blocks using conv propagation.
            use_act_checkpoint (bool): If True, use activation checkpointing.
            pretrain_img_size (int): input image size for pretraining models.
            pretrain_use_cls_token (bool): If True, pretraining models use class token.
            out_feature (str): name of the feature from the last block.
            beit_like_qkv_bias (bool): If True, use BEiT-style separate q / v bias parameters
                in the attention blocks (qkv_bias is forced to False).
            beit_like_gamma (bool): If True, scale the attention and MLP branches with learnable
                gamma_1 / gamma_2 parameters.
            freeze_patch_embed (bool): If True, freeze the patch embedding parameters.
            interp_type (str): "vitdet" for training / fine-tuning, "beit" for evaluation
                (slight improvement at higher resolutions).
        """
        super().__init__()
        self.pretrain_use_cls_token = pretrain_use_cls_token

        self.patch_embed = PatchEmbed(
            kernel_size=(patch_size, patch_size),
            stride=(patch_size, patch_size),
            in_chans=in_chans,
            embed_dim=embed_dim,
        )

        if use_abs_pos:
            # Initialize absolute positional embedding with pretrain image size.
            num_patches = (pretrain_img_size // patch_size) * (pretrain_img_size // patch_size)
            num_positions = (num_patches + 1) if pretrain_use_cls_token else num_patches
            self.pos_embed = nn.Parameter(torch.zeros(1, num_positions, embed_dim))
        else:
            self.pos_embed = None

        # stochastic depth decay rule
        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]

        self.blocks = nn.ModuleList()
        if beit_like_qkv_bias:
            qkv_bias = False
        for i in range(depth):
            block = Block(
                dim=embed_dim,
                num_heads=num_heads,
                mlp_ratio=mlp_ratio,
                qkv_bias=qkv_bias,
                drop_path=dpr[i],
                norm_layer=norm_layer,
                act_layer=act_layer,
                use_rel_pos=use_rel_pos,
                rel_pos_zero_init=rel_pos_zero_init,
                window_size=window_size if i in window_block_indexes else 0,
                use_residual_block=i in residual_block_indexes,
                input_size=(img_size // patch_size, img_size // patch_size),
                beit_like_qkv_bias=beit_like_qkv_bias,
                beit_like_gamma=beit_like_gamma,
                interp_type=interp_type,
            )
            if use_act_checkpoint:
                block = checkpoint_wrapper(block)
            self.blocks.append(block)

        self._out_feature_channels = {out_feature: embed_dim}
        self._out_feature_strides = {out_feature: patch_size}
        self._out_features = [out_feature]

        if self.pos_embed is not None:
            trunc_normal_(self.pos_embed, std=0.02)

        self.freeze_patch_embed = freeze_patch_embed
        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight, std=0.02)
            if isinstance(m, nn.Linear) and m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, LayerNormWithForceFP32):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

        if self.freeze_patch_embed:
            for n, p in self.patch_embed.named_parameters():
                p.requires_grad = False

    def forward(self, x):
        x = self.patch_embed(x)
        if self.pos_embed is not None:
            x = x + get_abs_pos(
                self.pos_embed, self.pretrain_use_cls_token, (x.shape[1], x.shape[2])
            )

        for blk in self.blocks:
            x = blk(x)

        outputs = {self._out_features[0]: x.permute(0, 3, 1, 2)}
        return outputs
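
# Illustrative sketch (not part of the original file): a deliberately tiny EVAViT
# (the real GLEE configs use much larger settings). The backbone returns a single
# "last_feat" map at stride = patch_size.
def _example_tiny_evavit():
    vit = EVAViT(
        img_size=64,
        patch_size=16,
        embed_dim=32,
        depth=2,
        num_heads=2,
        use_rel_pos=False,
        beit_like_qkv_bias=False,
        use_act_checkpoint=False,
    )
    out = vit(torch.randn(1, 3, 64, 64))
    return out["last_feat"].shape  # torch.Size([1, 32, 4, 4])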


class SimpleFeaturePyramid(Backbone):
    """
    This module implements SimpleFeaturePyramid in :paper:`vitdet`.
    It creates pyramid features built on top of the input feature map.
    """

    def __init__(
        self,
        net,
        in_feature,
        out_channels,
        scale_factors,
        top_block=None,
        norm="LN",
        square_pad=0,
    ):
        """
        Args:
            net (Backbone): module representing the subnetwork backbone.
                Must be a subclass of :class:`Backbone`.
            in_feature (str): names of the input feature maps coming
                from the net.
            out_channels (int): number of channels in the output feature maps.
            scale_factors (list[float]): list of scaling factors to upsample or downsample
                the input features for creating pyramid features.
            top_block (nn.Module or None): if provided, an extra operation will
                be performed on the output of the last (smallest resolution)
                pyramid output, and the result will extend the result list. The top_block
                further downsamples the feature map. It must have an attribute
                "num_levels", meaning the number of extra pyramid levels added by
                this block, and "in_feature", which is a string representing
                its input feature (e.g., p5).
            norm (str): the normalization to use.
            square_pad (int): If > 0, require input images to be padded to specific square size.
        """
        super(SimpleFeaturePyramid, self).__init__()
        assert isinstance(net, Backbone)
        self.scale_factors = scale_factors

        input_shapes = net.output_shape()
        strides = [int(input_shapes[in_feature].stride / scale) for scale in scale_factors]
        _assert_strides_are_log2_contiguous(strides)

        dim = input_shapes[in_feature].channels
        self.stages = []
        use_bias = norm == ""
        for idx, scale in enumerate(scale_factors):
            out_dim = dim
            if scale == 4.0:
                layers = [
                    nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2),
                    get_norm(norm, dim // 2),
                    nn.GELU(),
                    nn.ConvTranspose2d(dim // 2, dim // 4, kernel_size=2, stride=2),
                ]
                out_dim = dim // 4
            elif scale == 2.0:
                layers = [nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2)]
                out_dim = dim // 2
            elif scale == 1.0:
                layers = []
            elif scale == 0.5:
                layers = [nn.MaxPool2d(kernel_size=2, stride=2)]
            else:
                raise NotImplementedError(f"scale_factor={scale} is not supported yet.")

            layers.extend(
                [
                    Conv2d(
                        out_dim,
                        out_channels,
                        kernel_size=1,
                        bias=use_bias,
                        norm=get_norm(norm, out_channels),
                    ),
                    Conv2d(
                        out_channels,
                        out_channels,
                        kernel_size=3,
                        padding=1,
                        bias=use_bias,
                        norm=get_norm(norm, out_channels),
                    ),
                ]
            )
            layers = nn.Sequential(*layers)

            stage = int(math.log2(strides[idx]))
            self.add_module(f"simfp_{stage}", layers)
            self.stages.append(layers)

        self.net = net
        self.in_feature = in_feature
        self.top_block = top_block
        # Return feature names are "p<stage>", like ["p2", "p3", ..., "p6"]
        self._out_feature_strides = {"p{}".format(int(math.log2(s))): s for s in strides}
        # top block output feature maps.
        if self.top_block is not None:
            for s in range(stage, stage + self.top_block.num_levels):
                self._out_feature_strides["p{}".format(s + 1)] = 2 ** (s + 1)

        self._out_features = list(self._out_feature_strides.keys())
        self._out_feature_channels = {k: out_channels for k in self._out_features}
        self._size_divisibility = strides[-1]
        self._square_pad = square_pad

    @property
    def padding_constraints(self):
        return {
            "size_divisiblity": self._size_divisibility,
            "square_size": self._square_pad,
        }

    def forward(self, x):
        """
        Args:
            x: Tensor of shape (N,C,H,W). H, W must be a multiple of ``self.size_divisibility``.

        Returns:
            dict[str->Tensor]:
                mapping from feature map name to pyramid feature map tensor
                in high to low resolution order. Returned feature names follow the FPN
                convention: "p<stage>", where stage has stride = 2 ** stage e.g.,
                ["p2", "p3", ..., "p6"].
        """
        bottom_up_features = self.net(x)
        features = bottom_up_features[self.in_feature]
        results = []

        for stage in self.stages:
            results.append(stage(features))

        if self.top_block is not None:
            if self.top_block.in_feature in bottom_up_features:
                top_block_in_feature = bottom_up_features[self.top_block.in_feature]
            else:
                top_block_in_feature = results[self._out_features.index(self.top_block.in_feature)]
            results.extend(self.top_block(top_block_in_feature))
        assert len(self._out_features) == len(results)
        return {f: res for f, res in zip(self._out_features, results)}
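
# Illustrative note (not part of the original file): with a ViT trunk at stride 16
# and scale_factors=(2.0, 1.0, 0.5), as in D2_EVA01 below, the pyramid strides are
# 16/scale = (8, 16, 32), i.e. features "p3", "p4", "p5"; LastLevelMaxPool as
# top_block then appends a stride-64 "p6".
def _example_pyramid_levels(trunk_stride=16, scale_factors=(2.0, 1.0, 0.5)):
    strides = [int(trunk_stride / s) for s in scale_factors]
    return {"p{}".format(int(math.log2(s))): s for s in strides}  # {'p3': 8, 'p4': 16, 'p5': 32}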



@BACKBONE_REGISTRY.register()
class D2_EVA01(SimpleFeaturePyramid):
    def __init__(self, cfg, input_shape):
        
        super().__init__(
            net = EVAViT(
                img_size= cfg.MODEL.EVA01.IMAGE_SIZE,
                patch_size=cfg.MODEL.EVA01.PATCH_SIZE,
                window_size= cfg.MODEL.EVA01.WINDOW_SIZE,
                embed_dim= cfg.MODEL.EVA01.DMBED_DIM,
                depth= cfg.MODEL.EVA01.DEPTH,
                num_heads= cfg.MODEL.EVA01.NUM_HEADS ,
                drop_path_rate= cfg.MODEL.EVA01.DROP_PATH_RATE,
                mlp_ratio= cfg.MODEL.EVA01.MLP_RATIO,
                qkv_bias=True,
                norm_layer=partial(nn.LayerNorm, eps=1e-6),
                window_block_indexes= cfg.MODEL.EVA01.WINDOW_BLOCK_INDEXES,
                residual_block_indexes=[],
                use_act_checkpoint = True,
                use_rel_pos = True,
                out_feature="last_feat",
                beit_like_qkv_bias=cfg.MODEL.EVA01.BEIT_LIKE_QKV_BIAS ,
                beit_like_gamma= cfg.MODEL.EVA01.BEIT_LIKE_GAMMA,
                freeze_patch_embed= cfg.MODEL.EVA01.FREEZE_PATH_EMBED,
            ),
            in_feature = "last_feat",
            out_channels=256,
            scale_factors=(2.0, 1.0, 0.5),  # (4.0, 2.0, 1.0, 0.5) in ViTDet
            top_block=LastLevelMaxPool(),
            norm="LN",
            square_pad=cfg.MODEL.EVA01.IMAGE_SIZE,

        )
        pretrained_weight = cfg.MODEL.EVA01.PRETRAINED_WEIGHT 
        if pretrained_weight:
            checkpoint = torch.load(pretrained_weight, map_location='cpu')
            print(f'\nload pretrain weight from {pretrained_weight} \n') 
            self.load_state_dict(checkpoint['model'], strict=False)
 
    def output_shape(self):
        return {
            name: ShapeSpec(
                channels=self._out_feature_channels[name], stride=self._out_feature_strides[name]
            )
            for name in self._out_features
        }

    @property
    def size_divisibility(self):
        return 32



def get_vit_lr_decay_rate(name, lr_decay_rate=1.0, num_layers=12):
    """
    Calculate lr decay rate for different ViT blocks.
    Args:
        name (string): parameter name.
        lr_decay_rate (float): base lr decay rate.
        num_layers (int): number of ViT blocks.

    Returns:
        lr decay rate for the given parameter.
    """
    layer_id = num_layers + 1
    if 'backbone' in name: #name.startswith("backbone"):
        if ".pos_embed" in name or ".patch_embed" in name:
            layer_id = 0
        elif ".blocks." in name and ".residual." not in name:
            layer_id = int(name[name.find(".blocks.") :].split(".")[2]) + 1

    return lr_decay_rate ** (num_layers + 1 - layer_id)
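
# Illustrative sketch (not part of the original file): with lr_decay_rate=0.9 and
# num_layers=12, earlier blocks get smaller multipliers (layer_id 0 for pos/patch
# embeddings, i + 1 for block i, num_layers + 1 for everything else). The
# parameter names below are hypothetical examples, not taken from a real model.
def _example_lr_decay():
    return {
        name: get_vit_lr_decay_rate(name, lr_decay_rate=0.9, num_layers=12)
        for name in (
            "backbone.net.patch_embed.proj.weight",   # 0.9 ** 13
            "backbone.net.blocks.0.attn.qkv.weight",  # 0.9 ** 12
            "backbone.net.blocks.11.mlp.fc2.weight",  # 0.9 ** 1
            "head.cls_score.weight",                  # 1.0 (not a backbone parameter)
        )
    }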


================================================
FILE: thirdparty/GLEE/glee/backbone/eva02-dino.py
================================================
import logging
import math
from functools import partial

import fvcore.nn.weight_init as weight_init
import torch
import torch.nn as nn
import torch.nn.functional as F

from detectron2.layers import CNNBlockBase, Conv2d, get_norm
from detectron2.modeling.backbone.fpn import _assert_strides_are_log2_contiguous

from detectron2.modeling.backbone import Backbone
from .eva_02_utils import (
    PatchEmbed,
    add_decomposed_rel_pos,
    get_abs_pos,
    window_partition,
    window_unpartition,
    VisionRotaryEmbeddingFast,
)

try:
    import xformers.ops as xops
    HAS_XFORMER = True
except ImportError:
    # Fall back to the plain softmax attention path when xformers is unavailable.
    HAS_XFORMER = False


logger = logging.getLogger(__name__)



__all__ = ["EVA02_ViT", "SimpleFeaturePyramid", "get_vit_lr_decay_rate"]



class SwiGLU(nn.Module):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.SiLU, drop=0., 
                norm_layer=nn.LayerNorm, subln=False
            ):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features

        self.w1 = nn.Linear(in_features, hidden_features)
        self.w2 = nn.Linear(in_features, hidden_features)

        self.act = act_layer()
        self.ffn_ln = norm_layer(hidden_features) if subln else nn.Identity()
        self.w3 = nn.Linear(hidden_features, out_features)
        
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x1 = self.w1(x)
        x2 = self.w2(x)
        hidden = self.act(x1) * x2
        x = self.ffn_ln(hidden)
        x = self.w3(x)
        x = self.drop(x)
        return x
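
# Illustrative sketch (not part of the original file): SwiGLU replaces the usual
# MLP with a gated unit, out = w3(ln(SiLU(w1 x) * w2 x)). With mlp_ratio = 4*2/3,
# as used by the EVA-02 blocks below, the hidden width is about 8/3 * dim, so the
# three projections cost roughly as much as a standard 4x two-projection MLP.
def _example_swiglu():
    ffn = SwiGLU(in_features=64, hidden_features=int(64 * 4 * 2 / 3), subln=True)
    return ffn(torch.randn(2, 10, 64)).shape  # torch.Size([2, 10, 64])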
    

class Attention(nn.Module):
    def __init__(
            self, 
            dim, 
            num_heads=8, 
            qkv_bias=True, 
            qk_scale=None, 
            attn_head_dim=None, 
            rope=None,
            xattn=True,
        ):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        if attn_head_dim is not None:
            head_dim = attn_head_dim
        all_head_dim = head_dim * self.num_heads
        self.scale = qk_scale or head_dim ** -0.5

        self.q_proj = nn.Linear(dim, all_head_dim, bias=False)
        self.k_proj = nn.Linear(dim, all_head_dim, bias=False)
        self.v_proj = nn.Linear(dim, all_head_dim, bias=False)

        if qkv_bias:
            self.q_bias = nn.Parameter(torch.zeros(all_head_dim))
            self.v_bias = nn.Parameter(torch.zeros(all_head_dim))
        else:
            self.q_bias = None
            self.v_bias = None

        self.rope = rope
        self.xattn = xattn
        self.proj = nn.Linear(all_head_dim, dim)

        if not HAS_XFORMER:
            self.xattn = False

    def forward(self, x):
        B, H, W, C = x.shape
        x = x.view(B, -1, C)
        N = H * W

        q = F.linear(input=x, weight=self.q_proj.weight, bias=self.q_bias)
        k = F.linear(input=x, weight=self.k_proj.weight, bias=None)
        v = F.linear(input=x, weight=self.v_proj.weight, bias=self.v_bias)

        q = q.reshape(B, N, self.num_heads, -1).permute(0, 2, 1, 3)     # B, num_heads, N, C
        k = k.reshape(B, N, self.num_heads, -1).permute(0, 2, 1, 3)  
        v = v.reshape(B, N, self.num_heads, -1).permute(0, 2, 1, 3) 

        ## rope
        q = self.rope(q).type_as(v)
        k = self.rope(k).type_as(v)

        if self.xattn:
            q = q.permute(0, 2, 1, 3)   # B, num_heads, N, C -> B, N, num_heads, C
            k = k.permute(0, 2, 1, 3)
            v = v.permute(0, 2, 1, 3)
            
            x = xops.memory_efficient_attention(q, k, v)
            x = x.reshape(B, N, -1)
        else:
            q = q * self.scale
            attn = (q @ k.transpose(-2, -1))
            attn = attn.softmax(dim=-1).type_as(x)
            x = (attn @ v).transpose(1, 2).reshape(B, N, -1)

        x = self.proj(x)
        x = x.view(B, H, W, C)

        return x
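
# Illustrative sketch (not part of the original file): the attention above expects
# a rotary position embedding (VisionRotaryEmbeddingFast) whose sequence length
# matches the H = W of the feature map; xattn=False forces the plain softmax path
# so this sketch does not require xformers.
def _example_rope_attention():
    dim, num_heads, hw = 64, 4, 16
    rope = VisionRotaryEmbeddingFast(
        dim=dim // num_heads // 2,  # half the per-head dim, as in EVA02_ViT below
        pt_seq_len=16,
        ft_seq_len=hw,
    )
    attn = Attention(dim, num_heads=num_heads, rope=rope, xattn=False)
    x = torch.randn(1, hw, hw, dim)
    return attn(x).shape  # torch.Size([1, 16, 16, 64])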


class ResBottleneckBlock(CNNBlockBase):
    """
    The standard bottleneck residual block without the last activation layer.
    It contains 3 conv layers with kernels 1x1, 3x3, 1x1.
    """

    def __init__(
        self,
        in_channels,
        out_channels,
        bottleneck_channels,
        norm="LN",
        act_layer=nn.GELU,
    ):
        """
        Args:
            in_channels (int): Number of input channels.
            out_channels (int): Number of output channels.
            bottleneck_channels (int): number of output channels for the 3x3
                "bottleneck" conv layers.
            norm (str or callable): normalization for all conv layers.
                See :func:`layers.get_norm` for supported format.
            act_layer (callable): activation for all conv layers.
        """
        super().__init__(in_channels, out_channels, 1)

        self.conv1 = Conv2d(in_channels, bottleneck_channels, 1, bias=False)
        self.norm1 = get_norm(norm, bottleneck_channels)
        self.act1 = act_layer()

        self.conv2 = Conv2d(
            bottleneck_channels,
            bottleneck_channels,
            3,
            padding=1,
            bias=False,
        )
        self.norm2 = get_norm(norm, bottleneck_channels)
        self.act2 = act_layer()

        self.conv3 = Conv2d(bottleneck_channels, out_channels, 1, bias=False)
        self.norm3 = get_norm(norm, out_channels)

        for layer in [self.conv1, self.conv2, self.conv3]:
            weight_init.c2_msra_fill(layer)
        for layer in [self.norm1, self.norm2]:
            layer.weight.data.fill_(1.0)
            layer.bias.data.zero_()
        # zero init last norm layer.
        self.norm3.weight.data.zero_()
        self.norm3.bias.data.zero_()

    def forward(self, x):
        out = x
        for layer in self.children():
            out = layer(out)

        out = x + out
        return out


class Block(nn.Module):
    """Transformer blocks with support of window attention and residual propagation blocks"""

    def __init__(
        self,
        dim,
        num_heads,
        mlp_ratio=4*2/3,
        qkv_bias=True,
        drop_path=0.0,
        norm_layer=partial(nn.LayerNorm, eps=1e-6), 
        window_size=0,
        use_residual_block=False,
        rope=None,
        xattn=True,
    ):
        """
        Args:
            dim (int): Number of input channels.
            num_heads (int): Number of attention heads in each ViT block.
            mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
            qkv_bias (bool): If True, add a learnable bias to query, key, value.
            drop_path (float): Stochastic depth rate.
            norm_layer (nn.Module): Normalization layer.
            window_size (int): Window size for window attention blocks. If it equals 0, window
                attention is not used.
            use_residual_block (bool): If True, use a residual block after the MLP block.
            rope (nn.Module or None): Rotary position embedding applied to q and k inside the
                attention.
            xattn (bool): If True, use xformers memory-efficient attention when available.
        """
        super().__init__()
        self.norm1 = norm_layer(dim)
        self.attn = Attention(
            dim,
            num_heads=num_heads,
            qkv_bias=qkv_bias,
            rope=rope,
            xattn=xattn,
        )

        from timm.models.layers import DropPath

        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
        self.norm2 = norm_layer(dim)
        self.mlp = SwiGLU(
                in_features=dim, 
                hidden_features=int(dim * mlp_ratio), 
                subln=True,
                norm_layer=norm_layer,
            )

        self.window_size = window_size

        self.use_residual_block = use_residual_block
        if use_residual_block:
            # Use a residual block with bottleneck channel as dim // 2
            self.residual = ResBottleneckBlock(
                in_channels=dim,
                out_channels=dim,
                bottleneck_channels=dim // 2,
                norm="LN",
            )

    def forward(self, x):
        shortcut = x
        x = self.norm1(x)

        # Window partition
        if self.window_size > 0:
            H, W = x.shape[1], x.shape[2]
            x, pad_hw = window_partition(x, self.window_size)

        x = self.attn(x)

        # Reverse window partition
        if self.window_size > 0:
            x = window_unpartition(x, self.window_size, pad_hw, (H, W))

        x = shortcut + self.drop_path(x)
        x = x + self.drop_path(self.mlp(self.norm2(x)))

        if self.use_residual_block:
            x = self.residual(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)

        return x


class EVA02_ViT(Backbone):
    """
    This module implements Vision Transformer (ViT) backbone in :paper:`vitdet`.
    "Exploring Plain Vision Transformer Backbones for Object Detection",
    https://arxiv.org/abs/2203.16527
    """

    def __init__(
        self,
        img_size=1024,
        patch_size=16,
        in_chans=3,
        embed_dim=768,
        depth=12,
        num_heads=12,
        mlp_ratio=4*2/3,
        qkv_bias=True,
        drop_path_rate=0.0,
        norm_layer=partial(nn.LayerNorm, eps=1e-6),
        act_layer=nn.GELU,
        use_abs_pos=True,
        use_rel_pos=False,
        rope=True,
        pt_hw_seq_len=16,
        intp_freq=True,
        window_size=0,
        window_block_indexes=(),
        residual_block_indexes=(),
        use_act_checkpoint=False,
        pretrain_img_size=224,
        pretrain_use_cls_token=True,
        out_feature="last_feat",
        xattn=True,
    ):
        """
        Args:
            img_size (int): Input image size.
            patch_size (int): Patch size.
            in_chans (int): Number of input image channels.
            embed_dim (int): Patch embedding dimension.
            depth (int): Depth of ViT.
            num_heads (int): Number of attention heads in each ViT block.
            mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
            qkv_bias (bool): If True, add a learnable bias to query, key, value.
            drop_path_rate (float): Stochastic depth rate.
            norm_layer (nn.Module): Normalization layer.
            act_layer (nn.Module): Activation layer.
            use_abs_pos (bool): If True, use absolute positional embeddings.
            use_rel_pos (bool): If True, add relative positional embeddings to the attention map.
            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.
            window_size (int): Window size for window attention blocks.
            window_block_indexes (list): Indexes for blocks using window attention.
            residual_block_indexes (list): Indexes for blocks using conv propagation.
            use_act_checkpoint (bool): If True, use activation checkpointing.
            pretrain_img_size (int): input image size for pretraining models.
            pretrain_use_cls_token (bool): If True, pretraining models use class token.
            out_feature (str): name of the feature from the last block.
        """
        super().__init__()
        self.pretrain_use_cls_token = pretrain_use_cls_token

        self.patch_embed = PatchEmbed(
            kernel_size=(patch_size, patch_size),
            stride=(patch_size, patch_size),
            in_chans=in_chans,
            embed_dim=embed_dim,
        )

        if use_abs_pos:
            # Initialize absolute positional embedding with pretrain image size.
            num_patches = (pretrain_img_size // patch_size) * (pretrain_img_size // patch_size)
            num_positions = (num_patches + 1) if pretrain_use_cls_token else num_patches
            self.pos_embed = nn.Parameter(torch.zeros(1, num_positions, embed_dim))
        else:
            self.pos_embed = None


        half_head_dim = embed_dim // num_heads // 2
        hw_seq_len = img_size // patch_size

        self.rope_win = VisionRotaryEmbeddingFast(
            dim=half_head_dim,
            pt_seq_len=pt_hw_seq_len,
            ft_seq_len=window_size if intp_freq else None,
        )
        self.rope_glb = VisionRotaryEmbeddingFast(
            dim=half_head_dim,
            pt_seq_len=pt_hw_seq_len,
            ft_seq_len=hw_seq_len if intp_freq else None,
        )

        # stochastic depth decay rule
        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]

        self.blocks = nn.ModuleList()
        for i in range(depth):
            block = Block(
                dim=embed_dim,
                num_heads=num_heads,
                mlp_ratio=mlp_ratio,
                qkv_bias=qkv_bias,
                drop_path=dpr[i],
                norm_layer=norm_layer,
                window_size=window_size if i in window_block_indexes else 0,
                use_residual_block=i in residual_block_indexes,
                rope=self.rope_win if i in window_block_indexes else self.rope_glb,
                xattn=xattn
            )
            if use_act_checkpoint:
                # TODO: use torch.utils.checkpoint
                from fairscale.nn.checkpoint import checkpoint_wrapper

                block = checkpoint_wrapper(block)
            self.blocks.append(block)

        self._out_feature_channels = {out_feature: embed_dim}
        self._out_feature_strides = {out_feature: patch_size}
        self._out_features = [out_feature]

        if self.pos_embed is not None:
            nn.init.trunc_normal_(self.pos_embed, std=0.02)

        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            nn.init.trunc_normal_(m.weight, std=0.02)
            if isinstance(m, nn.Linear) and m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

    def forward(self, x):
        x = self.patch_embed(x)
        if self.pos_embed is not None:
            x = x + get_abs_pos(
                self.pos_embed, self.pretrain_use_cls_token, (x.shape[1], x.shape[2])
            )

        for blk in self.blocks:
            x = blk(x)

        outputs = {self._out_features[0]: x.permute(0, 3, 1, 2)}
        return outputs


class SimpleFeaturePyramid(Backbone):
    """
    This module implements SimpleFeaturePyramid in :paper:`vitdet`.
    It creates pyramid features built on top of the input feature map.
    """

    def __init__(
        self,
        net,
        in_feature,
        out_channels,
        scale_factors,
        top_block=None,
        norm="LN",
        square_pad=0,
    ):
        """
        Args:
            net (Backbone): module representing the subnetwork backbone.
                Must be a subclass of :class:`Backbone`.
            in_feature (str): names of the input feature maps coming
                from the net.
            out_channels (int): number of channels in the output feature maps.
            scale_factors (list[float]): list of scaling factors to upsample or downsample
                the input features for creating pyramid features.
            top_block (nn.Module or None): if provided, an extra operation will
                be performed on the output of the last (smallest resolution)
                pyramid output, and the result will extend the result list. The top_block
                further downsamples the feature map. It must have an attribute
                "num_levels", meaning the number of extra pyramid levels added by
                this block, and "in_feature", which is a string representing
                its input feature (e.g., p5).
            norm (str): the normalization to use.
            square_pad (int): If > 0, require input images to be padded to specific square size.
        """
        super(SimpleFeaturePyramid, self).__init__()
        assert isinstance(net, Backbone)

        self.scale_factors = scale_factors

        input_shapes = net.output_shape()
        strides = [int(input_shapes[in_feature].stride / scale) for scale in scale_factors]
        _assert_strides_are_log2_contiguous(strides)

        dim = input_shapes[in_feature].channels
        self.stages = []
        use_bias = norm == ""
        for idx, scale in enumerate(scale_factors):
            out_dim = dim
            if scale == 4.0:
                layers = [
                    nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2),
                    get_norm(norm, dim // 2),
                    nn.GELU(),
                    nn.ConvTranspose2d(dim // 2, dim // 4, kernel_size=2, stride=2),
                ]
                out_dim = dim // 4
            elif scale == 2.0:
                layers = [nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2)]
                out_dim = dim // 2
            elif scale == 1.0:
                layers = []
            elif scale == 0.5:
                layers = [nn.MaxPool2d(kernel_size=2, stride=2)]
            else:
                raise NotImplementedError(f"scale_factor={scale} is not supported yet.")

            layers.extend(
                [
                    Conv2d(
                        out_dim,
                        out_channels,
                        kernel_size=1,
                        bias=use_bias,
                        norm=get_norm(norm, out_channels),
                    ),
                    Conv2d(
                        out_channels,
                        out_channels,
                        kernel_size=3,
                        padding=1,
                        bias=use_bias,
                        norm=get_norm(norm, out_channels),
                    ),
                ]
            )
            layers = nn.Sequential(*layers)

            stage = int(math.log2(strides[idx]))
            self.add_module(f"simfp_{stage}", layers)
            self.stages.append(layers)

        self.net = net
        self.in_feature = in_feature
        self.top_block = top_block
        # Return feature names are "p<stage>", like ["p2", "p3", ..., "p6"]
        self._out_feature_strides = {"p{}".format(int(math.log2(s))): s for s in strides}
        # top block output feature maps.
        if self.top_block is not None:
            for s in range(stage, stage + self.top_block.num_levels):
                self._out_feature_strides["p{}".format(s + 1)] = 2 ** (s + 1)

        self._out_features = list(self._out_feature_strides.keys())
        self._out_feature_channels = {k: out_channels for k in self._out_features}
        self._size_divisibility = strides[-1]
        self._square_pad = square_pad

    @property
    def padding_constraints(self):
        return {
            "size_divisiblity": self._size_divisibility,
            "square_size": self._square_pad,
        }

    def forward(self, x):
        """
        Args:
            x: Tensor of shape (N,C,H,W). H, W must be a multiple of ``self.size_divisibility``.

        Returns:
            dict[str->Tensor]:
                mapping from feature map name to pyramid feature map tensor
                in high to low resolution order. Returned feature names follow the FPN
                convention: "p<stage>", where stage has stride = 2 ** stage e.g.,
                ["p2", "p3", ..., "p6"].
        """
        bottom_up_features = self.net(x)
        features = bottom_up_features[self.in_feature]
        results = []

        for stage in self.stages:
            results.append(stage(features))

        if self.top_block is not None:
            if self.top_block.in_feature in bottom_up_features:
                top_block_in_feature = bottom_up_features[self.top_block.in_feature]
            else:
                top_block_in_feature = results[self._out_features.index(self.top_block.in_feature)]
            results.extend(self.top_block(top_block_in_feature))
        assert len(self._out_features) == len(results)
        return {f: res for f, res in zip(self._out_features, results)}


def get_vit_lr_decay_rate(name, lr_decay_rate=1.0, num_layers=12):
    """
    Calculate lr decay rate for different ViT blocks.
    Args:
        name (string): parameter name.
        lr_decay_rate (float): base lr decay rate.
        num_layers (int): number of ViT blocks.

    Returns:
        lr decay rate for the given parameter.
    """
    layer_id = num_layers + 1
    if name.startswith("backbone"):
        if ".pos_embed" in name or ".patch_embed" in name:
            layer_id = 0
        elif ".blocks." in name and ".residual." not in name:
            layer_id = int(name[name.find(".blocks.") :].split(".")[2]) + 1

    return lr_decay_rate ** (num_layers + 1 - layer_id)

================================================
FILE: thirdparty/GLEE/glee/backbone/eva02.py
================================================
# --------------------------------------------------------
# EVA02
# --------------------------------------------------------
import logging
import math
from functools import partial

import fvcore.nn.weight_init as weight_init
import torch
import torch.nn as nn
import torch.nn.functional as F

from detectron2.layers import CNNBlockBase, Conv2d, get_norm
from detectron2.modeling.backbone.fpn import _assert_strides_are_log2_contiguous

from detectron2.modeling.backbone import Backbone
from timm.models.layers import DropPath, Mlp, trunc_normal_
from detectron2.modeling import BACKBONE_REGISTRY, Backbone, ShapeSpec


from .eva_02_utils import (
    PatchEmbed,
    add_decomposed_rel_pos,
    get_abs_pos,
    window_partition,
    window_unpartition,
    VisionRotaryEmbeddingFast,
)
from detectron2.modeling.backbone.fpn import LastLevelMaxPool


try:
    import xformers.ops as xops
    HAS_XFORMER = True
except ImportError:
    # Fall back to the plain softmax attention path when xformers is unavailable.
    HAS_XFORMER = False

try:
    from apex.normalization import FusedLayerNorm
except ImportError:
    # apex is optional; ignore the import failure when it is not installed.
    pass

logger = logging.getLogger(__name__)



__all__ = ["EVA02_ViT", "SimpleFeaturePyramid", "get_vit_lr_decay_rate"]



class SwiGLU(nn.Module):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.SiLU, drop=0., 
                norm_layer=nn.LayerNorm, subln=False
            ):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features

        self.w1 = nn.Linear(in_features, hidden_features)
        self.w2 = nn.Linear(in_features, hidden_features)

        self.act = act_layer()
        self.ffn_ln = norm_layer(hidden_features) if subln else nn.Identity()
        self.w3 = nn.Linear(hidden_features, out_features)
        
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x1 = self.w1(x)
        x2 = self.w2(x)
        hidden = self.act(x1) * x2
        x = self.ffn_ln(hidden)
        x = self.w3(x)
        x = self.drop(x)
        return x
    

class Attention(nn.Module):
    def __init__(
            self, 
            dim, 
            num_heads=8, 
            qkv_bias=True, 
            qk_scale=None, 
            attn_head_dim=None, 
            rope=None,
            xattn=True,
        ):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        if attn_head_dim is not None:
            head_dim = attn_head_dim
        all_head_dim = head_dim * self.num_heads
        self.scale = qk_scale or head_dim ** -0.5

        self.q_proj = nn.Linear(dim, all_head_dim, bias=False)
        self.k_proj = nn.Linear(dim, all_head_dim, bias=False)
        self.v_proj = nn.Linear(dim, all_head_dim, bias=False)

        if qkv_bias:
            self.q_bias = nn.Parameter(torch.zeros(all_head_dim))
            self.v_bias = nn.Parameter(torch.zeros(all_head_dim))
        else:
            self.q_bias = None
            self.v_bias = None

        self.rope = rope
        self.xattn = xattn
        self.proj = nn.Linear(all_head_dim, dim)
        if not HAS_XFORMER:
            self.xattn = False

    def forward(self, x):
        B, H, W, C = x.shape
        x = x.view(B, -1, C)
        N = H * W

        q = F.linear(input=x, weight=self.q_proj.weight, bias=self.q_bias)
        k = F.linear(input=x, weight=self.k_proj.weight, bias=None)
        v = F.linear(input=x, weight=self.v_proj.weight, bias=self.v_bias)

        q = q.reshape(B, N, self.num_heads, -1).permute(0, 2, 1, 3)     # B, num_heads, N, C
        k = k.reshape(B, N, self.num_heads, -1).permute(0, 2, 1, 3)  
        v = v.reshape(B, N, self.num_heads, -1).permute(0, 2, 1, 3) 

        ## rope
        q = self.rope(q).type_as(v)
        k = self.rope(k).type_as(v)

        if self.xattn:
            q = q.permute(0, 2, 1, 3)   # B, num_heads, N, C -> B, N, num_heads, C
            k = k.permute(0, 2, 1, 3)
            v = v.permute(0, 2, 1, 3)
            
            x = xops.memory_efficient_attention(q, k, v)
            x = x.reshape(B, N, -1)
        else:
            q = q * self.scale
            attn = (q @ k.transpose(-2, -1))
            attn = attn.softmax(dim=-1).type_as(x)
            x = (attn @ v).transpose(1, 2).reshape(B, N, -1)

        x = self.proj(x)
        x = x.view(B, H, W, C)

        return x


class ResBottleneckBlock(CNNBlockBase):
    """
    The standard bottleneck residual block without the last activation layer.
    It contains 3 conv layers with kernels 1x1, 3x3, 1x1.
    """

    def __init__(
        self,
        in_channels,
        out_channels,
        bottleneck_channels,
        norm="LN",
        act_layer=nn.GELU,
    ):
        """
        Args:
            in_channels (int): Number of input channels.
            out_channels (int): Number of output channels.
            bottleneck_channels (int): number of output channels for the 3x3
                "bottleneck" conv layers.
            norm (str or callable): normalization for all conv layers.
                See :func:`layers.get_norm` for supported format.
            act_layer (callable): activation for all conv layers.
        """
        super().__init__(in_channels, out_channels, 1)

        self.conv1 = Conv2d(in_channels, bottleneck_channels, 1, bias=False)
        self.norm1 = get_norm(norm, bottleneck_channels)
        self.act1 = act_layer()

        self.conv2 = Conv2d(
            bottleneck_channels,
            bottleneck_channels,
            3,
            padding=1,
            bias=False,
        )
        self.norm2 = get_norm(norm, bottleneck_channels)
        self.act2 = act_layer()

        self.conv3 = Conv2d(bottleneck_channels, out_channels, 1, bias=False)
        self.norm3 = get_norm(norm, out_channels)

        for layer in [self.conv1, self.conv2, self.conv3]:
            weight_init.c2_msra_fill(layer)
        for layer in [self.norm1, self.norm2]:
            layer.weight.data.fill_(1.0)
            layer.bias.data.zero_()
        # zero init last norm layer.
        self.norm3.weight.data.zero_()
        self.norm3.bias.data.zero_()

    def forward(self, x):
        out = x
        for layer in self.children():
            out = layer(out)

        out = x + out
        return out


class Block(nn.Module):
    """Transformer blocks with support of window attention and residual propagation blocks"""

    def __init__(
        self,
        dim,
        num_heads,
        mlp_ratio=4*2/3,
        qkv_bias=True,
        drop_path=0.0,
        norm_layer=partial(nn.LayerNorm, eps=1e-6), 
        window_size=0,
        use_residual_block=False,
        rope=None,
        xattn=True,
    ):
        """
        Args:
            dim (int): Number of input channels.
            num_heads (int): Number of attention heads in each ViT block.
            mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
            qkv_bias (bool): If True, add a learnable bias to query, key, value.
            drop_path (float): Stochastic depth rate.
            norm_layer (nn.Module): Normalization layer.
            window_size (int): Window size for window attention blocks. If it equals 0, window
                attention is not used.
            use_residual_block (bool): If True, use a residual block after the MLP block.
            rope (nn.Module or None): Rotary position embedding applied to q and k inside the
                attention.
            xattn (bool): If True, use xformers memory-efficient attention when available.
        """
        super().__init__()
        self.norm1 = norm_layer(dim)
        self.attn = Attention(
            dim,
            num_heads=num_heads,
            qkv_bias=qkv_bias,
            rope=rope,
            xattn=xattn,
        )

        from timm.models.layers import DropPath

        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
        self.norm2 = norm_layer(dim)
        self.mlp = SwiGLU(
                in_features=dim, 
                hidden_features=int(dim * mlp_ratio), 
                subln=True,
                norm_layer=norm_layer,
            )

        self.window_size = window_size

        self.use_residual_block = use_residual_block
        if use_residual_block:
            # Use a residual block with bottleneck channel as dim // 2
            self.residual = ResBottleneckBlock(
                in_channels=dim,
                out_channels=dim,
                bottleneck_channels=dim // 2,
                norm="LN",
            )

    def forward(self, x):
        shortcut = x
        x = self.norm1(x)

        # Window partition
        if self.window_size > 0:
            H, W = x.shape[1], x.shape[2]
            x, pad_hw = window_partition(x, self.window_size)

        x = self.attn(x)

        # Reverse window partition
        if self.window_size > 0:
            x = window_unpartition(x, self.window_size, pad_hw, (H, W))

        x = shortcut + self.drop_path(x)
        x = x + self.drop_path(self.mlp(self.norm2(x)))

        if self.use_residual_block:
            x = self.residual(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)

        return x


class EVA02_ViT(Backbone):
    """
    This module implements Vision Transformer (ViT) backbone in :paper:`vitdet`.
    "Exploring Plain Vision Transformer Backbones for Object Detection",
    https://arxiv.org/abs/2203.16527
    """

    def __init__(
        self,
        img_size=1024,
        patch_size=16,
        in_chans=3,
        embed_dim=768,
        depth=12,
        num_heads=12,
        mlp_ratio=4*2/3,
        qkv_bias=True,
        drop_path_rate=0.0,
        norm_layer=partial(nn.LayerNorm, eps=1e-6),
        act_layer=nn.GELU,
        use_abs_pos=True,
        use_rel_pos=False,
        rope=True,
        pt_hw_seq_len=16,
        intp_freq=True,
        window_size=0,
        window_block_indexes=(),
        residual_block_indexes=(),
        use_act_checkpoint=False,
        pretrain_img_size=224,
        pretrain_use_cls_token=True,
        out_feature="last_feat",
        xattn=True,
    ):
        """
        Args:
            img_size (int): Input image size.
            patch_size (int): Patch size.
            in_chans (int): Number of input image channels.
            embed_dim (int): Patch embedding dimension.
            depth (int): Depth of ViT.
            num_heads (int): Number of attention heads in each ViT block.
            mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
            qkv_bias (bool): If True, add a learnable bias to query, key, value.
            drop_path_rate (float): Stochastic depth rate.
            norm_layer (nn.Module): Normalization layer.
            act_layer (nn.Module): Activation layer.
            use_abs_pos (bool): If True, use absolute positional embeddings.
            use_rel_pos (bool): If True, add relative positional embeddings to the attention map.
            rope (bool): If True, use rotary position embeddings (RoPE) in attention.
            pt_hw_seq_len (int): Per-axis token count of the pretraining grid, used to build
                the rotary embeddings.
            intp_freq (bool): If True, interpolate the rotary frequencies to the fine-tuning
                sequence length.
            xattn (bool): If True, use xformers' memory-efficient attention.
            window_size (int): Window size for window attention blocks.
            window_block_indexes (list): Indexes for blocks using window attention.
            residual_block_indexes (list): Indexes for blocks using conv propagation.
            use_act_checkpoint (bool): If True, use activation checkpointing.
            pretrain_img_size (int): input image size for pretraining models.
            pretrain_use_cls_token (bool): If True, the pretraining model uses a class token.
            out_feature (str): name of the feature from the last block.
        """
        super().__init__()
        self.pretrain_use_cls_token = pretrain_use_cls_token

        self.patch_embed = PatchEmbed(
            kernel_size=(patch_size, patch_size),
            stride=(patch_size, patch_size),
            in_chans=in_chans,
            embed_dim=embed_dim,
        )

        if use_abs_pos:
            # Initialize absolute positional embedding with pretrain image size.
            num_patches = (pretrain_img_size // patch_size) * (pretrain_img_size // patch_size)
            num_positions = (num_patches + 1) if pretrain_use_cls_token else num_patches
            self.pos_embed = nn.Parameter(torch.zeros(1, num_positions, embed_dim))
        else:
            self.pos_embed = None


        half_head_dim = embed_dim // num_heads // 2
        hw_seq_len = img_size // patch_size

        self.rope_win = VisionRotaryEmbeddingFast(
            dim=half_head_dim,
            pt_seq_len=pt_hw_seq_len,
            ft_seq_len=window_size if intp_freq else None,
        )
        self.rope_glb = VisionRotaryEmbeddingFast(
            dim=half_head_dim,
            pt_seq_len=pt_hw_seq_len,
            ft_seq_len=hw_seq_len if intp_freq else None,
        )

        # stochastic depth decay rule
        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]

        self.blocks = nn.ModuleList()
        for i in range(depth):
            block = Block(
                dim=embed_dim,
                num_heads=num_heads,
                mlp_ratio=mlp_ratio,
                qkv_bias=qkv_bias,
                drop_path=dpr[i],
                norm_layer=norm_layer,
                window_size=window_size if i in window_block_indexes else 0,
                use_residual_block=i in residual_block_indexes,
                rope=self.rope_win if i in window_block_indexes else self.rope_glb,
                xattn=xattn
            )
            if use_act_checkpoint:
                # TODO: use torch.utils.checkpoint
                from fairscale.nn.checkpoint import checkpoint_wrapper

                block = checkpoint_wrapper(block)
            self.blocks.append(block)

        self._out_feature_channels = {out_feature: embed_dim}
        self._out_feature_strides = {out_feature: patch_size}
        self._out_features = [out_feature]

        if self.pos_embed is not None:
            nn.init.trunc_normal_(self.pos_embed, std=0.02)

        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            nn.init.trunc_normal_(m.weight, std=0.02)
            if isinstance(m, nn.Linear) and m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

    def forward(self, x):
        x = self.patch_embed(x)
        if self.pos_embed is not None:
            x = x + get_abs_pos(
                self.pos_embed, self.pretrain_use_cls_token, (x.shape[1], x.shape[2])
            )

        for blk in self.blocks:
            x = blk(x)

        outputs = {self._out_features[0]: x.permute(0, 3, 1, 2)}
        return outputs


class SimpleFeaturePyramid(Backbone):
    """
    This module implements SimpleFeaturePyramid in :paper:`vitdet`.
    It creates pyramid features built on top of the input feature map.
    """

    def __init__(
        self,
        net,
        in_feature,
        out_channels,
        scale_factors,
        top_block=None,
        norm="LN",
        square_pad=0,
    ):
        """
        Args:
            net (Backbone): module representing the subnetwork backbone.
                Must be a subclass of :class:`Backbone`.
            in_feature (str): names of the input feature maps coming
                from the net.
            out_channels (int): number of channels in the output feature maps.
            scale_factors (list[float]): list of scaling factors to upsample or downsample
                the input features for creating pyramid features.
            top_block (nn.Module or None): if provided, an extra operation will
                be performed on the output of the last (smallest resolution)
                pyramid output, and the result will extend the result list. The top_block
                further downsamples the feature map. It must have an attribute
                "num_levels", meaning the number of extra pyramid levels added by
                this block, and "in_feature", which is a string representing
                its input feature (e.g., p5).
            norm (str): the normalization to use.
            square_pad (int): If > 0, require input images to be padded to specific square size.
        """
        super(SimpleFeaturePyramid, self).__init__()
        assert isinstance(net, Backbone)

        self.scale_factors = scale_factors

        input_shapes = net.output_shape()
        strides = [int(input_shapes[in_feature].stride / scale) for scale in scale_factors]
        _assert_strides_are_log2_contiguous(strides)

        dim = input_shapes[in_feature].channels
        self.stages = []
        use_bias = norm == ""
        for idx, scale in enumerate(scale_factors):
            out_dim = dim
            if scale == 4.0:
                layers = [
                    nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2),
                    get_norm(norm, dim // 2),
                    nn.GELU(),
                    nn.ConvTranspose2d(dim // 2, dim // 4, kernel_size=2, stride=2),
                ]
                out_dim = dim // 4
            elif scale == 2.0:
                layers = [nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2)]
                out_dim = dim // 2
            elif scale == 1.0:
                layers = []
            elif scale == 0.5:
                layers = [nn.MaxPool2d(kernel_size=2, stride=2)]
            else:
                raise NotImplementedError(f"scale_factor={scale} is not supported yet.")

            layers.extend(
                [
                    Conv2d(
                        out_dim,
                        out_channels,
                        kernel_size=1,
                        bias=use_bias,
                        norm=get_norm(norm, out_channels),
                    ),
                    Conv2d(
                        out_channels,
                        out_channels,
                        kernel_size=3,
                        padding=1,
                        bias=use_bias,
                        norm=get_norm(norm, out_channels),
                    ),
                ]
            )
            layers = nn.Sequential(*layers)

            stage = int(math.log2(strides[idx]))
            self.add_module(f"simfp_{stage}", layers)
            self.stages.append(layers)

        self.net = net
        self.in_feature = in_feature
        self.top_block = top_block
        # Return feature names are "p<stage>", like ["p2", "p3", ..., "p6"]
        self._out_feature_strides = {"p{}".format(int(math.log2(s))): s for s in strides}
        # top block output feature maps.
        if self.top_block is not None:
            for s in range(stage, stage + self.top_block.num_levels):
                self._out_feature_strides["p{}".format(s + 1)] = 2 ** (s + 1)

        self._out_features = list(self._out_feature_strides.keys())
        self._out_feature_channels = {k: out_channels for k in self._out_features}
        self._size_divisibility = strides[-1]
        self._square_pad = square_pad

    @property
    def padding_constraints(self):
        return {
            "size_divisiblity": self._size_divisibility,
            "square_size": self._square_pad,
        }

    def forward(self, x):
        """
        Args:
            x: Tensor of shape (N,C,H,W). H, W must be a multiple of ``self.size_divisibility``.

        Returns:
            dict[str->Tensor]:
                mapping from feature map name to pyramid feature map tensor
                in high to low resolution order. Returned feature names follow the FPN
                convention: "p<stage>", where stage has stride = 2 ** stage e.g.,
                ["p2", "p3", ..., "p6"].
        """
        bottom_up_features = self.net(x)
        features = bottom_up_features[self.in_feature]
        results = []

        for stage in self.stages:
            results.append(stage(features))

        if self.top_block is not None:
            if self.top_block.in_feature in bottom_up_features:
                top_block_in_feature = bottom_up_features[self.top_block.in_feature]
            else:
                top_block_in_feature = results[self._out_features.index(self.top_block.in_feature)]
            results.extend(self.top_block(top_block_in_feature))
        assert len(self._out_features) == len(results)
        return {f: res for f, res in zip(self._out_features, results)}
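

# --- Editor's note: illustrative sketch, not part of the original GLEE/detectron2 source. ---
# How scale_factors turn into FPN levels: each factor divides the backbone stride, and the
# resulting level is named "p<log2(stride)>".  The demo name `_demo_fpn_level_names` is
# hypothetical.
def _demo_fpn_level_names():
    import math
    backbone_stride = 16                    # patch_size of the ViT feeding this pyramid
    scale_factors = (4.0, 2.0, 1.0, 0.5)    # ViTDet default; D2_EVA02 below uses (2.0, 1.0, 0.5)
    strides = [int(backbone_stride / s) for s in scale_factors]
    names = [f"p{int(math.log2(s))}" for s in strides]
    # strides == [4, 8, 16, 32] -> names == ['p2', 'p3', 'p4', 'p5']; LastLevelMaxPool adds 'p6'.
    return dict(zip(names, strides))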



@BACKBONE_REGISTRY.register()
class D2_EVA02(SimpleFeaturePyramid):
    def __init__(self, cfg, input_shape):

        super().__init__(
            net=EVA02_ViT(
                img_size=cfg.MODEL.EVA02.IMAGE_SIZE,
                patch_size=cfg.MODEL.EVA02.PATCH_SIZE,
                window_size=cfg.MODEL.EVA02.WINDOW_SIZE,
                embed_dim=cfg.MODEL.EVA02.DMBED_DIM,
                depth=cfg.MODEL.EVA02.DEPTH,
                num_heads=cfg.MODEL.EVA02.NUM_HEADS,
                drop_path_rate=cfg.MODEL.EVA02.DROP_PATH_RATE,
                mlp_ratio=cfg.MODEL.EVA02.MLP_RATIO,
                # qkv_bias=True,
                norm_layer=partial(nn.LayerNorm, eps=1e-6),
                window_block_indexes=cfg.MODEL.EVA02.WINDOW_BLOCK_INDEXES,
                # residual_block_indexes=[],
                # use_rel_pos=False,
                use_act_checkpoint=cfg.MODEL.EVA02.CHECKPOINT,
                out_feature="last_feat",
                # intp_freq=True,
            ),
            in_feature="last_feat",
            out_channels=256,
            scale_factors=(2.0, 1.0, 0.5),  # (4.0, 2.0, 1.0, 0.5) in ViTDet
            top_block=LastLevelMaxPool(),
            norm="LN",
            square_pad=cfg.MODEL.EVA02.IMAGE_SIZE,
        )

        pretrained_weight = cfg.MODEL.EVA02.PRETRAINED_WEIGHT
        if pretrained_weight:
            checkpoint = torch.load(pretrained_weight, map_location='cpu')
            print(f'\nloading pretrained weights from {pretrained_weight}\n')
            self.load_state_dict(checkpoint['model'], strict=False)

    def output_shape(self):
        return {
            name: ShapeSpec(
                channels=self._out_feature_channels[name], stride=self._out_feature_strides[name]
            )
            for name in self._out_features
        }

    @property
    def size_divisibility(self):
        return 32




================================================
FILE: thirdparty/GLEE/glee/backbone/eva_01_utils.py
================================================
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
import math
import numpy as np
from scipy import interpolate
import torch
import torch.nn as nn
import torch.nn.functional as F

__all__ = [
    "window_partition",
    "window_unpartition",
    "add_decomposed_rel_pos",
    "get_abs_pos",
    "PatchEmbed",
]


def window_partition(x, window_size):
    """
    Partition into non-overlapping windows with padding if needed.
    Args:
        x (tensor): input tokens with [B, H, W, C].
        window_size (int): window size.

    Returns:
        windows: windows after partition with [B * num_windows, window_size, window_size, C].
        (Hp, Wp): padded height and width before partition
    """
    B, H, W, C = x.shape

    pad_h = (window_size - H % window_size) % window_size
    pad_w = (window_size - W % window_size) % window_size
    if pad_h > 0 or pad_w > 0:
        x = F.pad(x, (0, 0, 0, pad_w, 0, pad_h))
    Hp, Wp = H + pad_h, W + pad_w

    x = x.view(B, Hp // window_size, window_size, Wp // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)
    return windows, (Hp, Wp)


def window_unpartition(windows, window_size, pad_hw, hw):
    """
    Window unpartition into original sequences and remove padding.
    Args:
        windows (tensor): input tokens with [B * num_windows, window_size, window_size, C].
        window_size (int): window size.
        pad_hw (Tuple): padded height and width (Hp, Wp).
        hw (Tuple): original height and width (H, W) before padding.

    Returns:
        x: unpartitioned sequences with [B, H, W, C].
    """
    Hp, Wp = pad_hw
    H, W = hw
    B = windows.shape[0] // (Hp * Wp // window_size // window_size)
    x = windows.view(B, Hp // window_size, Wp // window_size, window_size, window_size, -1)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, Hp, Wp, -1)

    if Hp > H or Wp > W:
        x = x[:, :H, :W, :].contiguous()
    return x
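

# --- Editor's note: illustrative sketch, not part of the original EVA/detectron2 source. ---
# window_partition pads H and W up to multiples of window_size and splits the feature map
# into independent windows; window_unpartition reverses both steps.  The demo name
# `_demo_window_roundtrip` is hypothetical.
def _demo_window_roundtrip():
    import torch
    x = torch.randn(2, 10, 10, 32)                       # [B, H, W, C]
    windows, pad_hw = window_partition(x, window_size=7)
    # 10 is padded up to 14, giving 2 * (14 // 7) ** 2 = 8 windows of 7x7.
    assert pad_hw == (14, 14) and windows.shape == (8, 7, 7, 32)
    y = window_unpartition(windows, 7, pad_hw, (10, 10))
    assert torch.equal(y, x)                             # padding is cropped away again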


def get_rel_pos(q_size, k_size, rel_pos, interp_type):
    """
    Get relative positional embeddings according to the relative positions of
        query and key sizes.
    Args:
        q_size (int): size of query q.
        k_size (int): size of key k.
        rel_pos (Tensor): relative position embeddings (L, C).
        interp_type (str): interpolation scheme used to resize rel_pos, "vitdet" or "beit".

    Returns:
        Extracted positional embeddings according to relative positions.
    """
    max_rel_dist = int(2 * max(q_size, k_size) - 1)
    # Interpolate rel pos if needed.
    if rel_pos.shape[0] != max_rel_dist:
        if interp_type == "vitdet":
            # the vitdet impl: 
            # https://github.com/facebookresearch/detectron2/blob/96c752ce821a3340e27edd51c28a00665dd32a30/detectron2/modeling/backbone/utils.py#L77.

            rel_pos_resized = F.interpolate(
                rel_pos.reshape(1, rel_pos.shape[0], -1).permute(0, 2, 1),
                size=max_rel_dist,
                mode="linear",
            )
            rel_pos_resized = rel_pos_resized.reshape(-1, max_rel_dist).permute(1, 0)
        elif interp_type == "beit":
            # steal from beit https://github.com/microsoft/unilm/tree/master/beit
            # modified by Yuxin Fang

            src_size = rel_pos.shape[0]
            dst_size = max_rel_dist

            q = 1.0903078
            dis = []

            cur = 1
            for i in range(src_size // 2):
                dis.append(cur)
                cur += q ** (i + 1)

            r_ids = [-_ for _ in reversed(dis)]
            x = r_ids + [0] + dis
            t = dst_size // 2.0
            dx = np.arange(-t, t + 0.1, 1.0)

            all_rel_pos_bias = []
            for i in range(rel_pos.shape[1]):
                # a hack from https://github.com/baaivision/EVA/issues/8,
                # could also be used in fine-tuning, but the performance hasn't been tested.
                z = rel_pos[:, i].view(src_size).cpu().float().detach().numpy()
                f = interpolate.interp1d(x, z, kind='cubic', fill_value="extrapolate")
                all_rel_pos_bias.append(
                    torch.Tensor(f(dx)).contiguous().view(-1, 1).to(rel_pos.device))
            rel_pos_resized = torch.cat(all_rel_pos_bias, dim=-1)
        else:
            raise NotImplementedError()
    else:
        rel_pos_resized = rel_pos

    # Scale the coords with short length if shapes for q and k are different.
    q_coords = torch.arange(q_size)[:, None] * max(k_size / q_size, 1.0)
    k_coords = torch.arange(k_size)[None, :] * max(q_size / k_size, 1.0)
    relative_coords = (q_coords - k_coords) + (k_size - 1) * max(q_size / k_size, 1.0)

    return rel_pos_resized[relative_coords.long()]
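

# --- Editor's note: illustrative sketch, not part of the original EVA/detectron2 source. ---
# When rel_pos already has length 2*max(q,k)-1, no interpolation happens and the function
# simply gathers the embedding at relative offset (i - j) + (k_size - 1) for every
# query/key pair.  The demo name `_demo_relative_coordinates` is hypothetical.
def _demo_relative_coordinates():
    import torch
    q_size = k_size = 3
    rel_pos = torch.arange(2 * k_size - 1).float().unsqueeze(-1)   # (5, 1), already full length
    out = get_rel_pos(q_size, k_size, rel_pos, interp_type="vitdet")
    # out[i, j, 0] == (i - j) + 2, i.e. constant along anti-diagonals.
    assert out.shape == (3, 3, 1) and out[0, 0, 0] == 2 and out[2, 0, 0] == 4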


def add_decomposed_rel_pos(attn, q, rel_pos_h, rel_pos_w, q_size, k_size, interp_type):
    """
    Calculate decomposed Relative Positional Embeddings from :paper:`mvitv2`.
    https://github.com/facebookresearch/mvit/blob/19786631e330df9f3622e5402b4a419a263a2c80/mvit/models/attention.py   # noqa B950
    Args:
        attn (Tensor): attention map.
        q (Tensor): query q in the attention layer with shape (B, q_h * q_w, C).
        rel_pos_h (Tensor): relative position embeddings (Lh, C) for height axis.
        rel_pos_w (Tensor): relative position embeddings (Lw, C) for width axis.
        q_size (Tuple): spatial sequence size of query q with (q_h, q_w).
        k_size (Tuple): spatial sequence size of key k with (k_h, k_w).

    Returns:
        attn (Tensor): attention map with added relative positional embeddings.
    """
    q_h, q_w = q_size
    k_h, k_w = k_size
    Rh = get_rel_pos(q_h, k_h, rel_pos_h, interp_type)
    Rw = get_rel_pos(q_w, k_w, rel_pos_w, interp_type)

    B, _, dim = q.shape
    r_q = q.reshape(B, q_h, q_w, dim)
    rel_h = torch.einsum("bhwc,hkc->bhwk", r_q, Rh)
    rel_w = torch.einsum("bhwc,wkc->bhwk", r_q, Rw)

    attn = (
        attn.view(B, q_h, q_w, k_h, k_w) + rel_h[:, :, :, :, None] + rel_w[:, :, :, None, :]
    ).view(B, q_h * q_w, k_h * k_w)

    return attn
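

# --- Editor's note: illustrative sketch, not part of the original EVA/detectron2 source. ---
# The decomposed scheme adds one bias per axis: rel_h is broadcast over key columns and
# rel_w over key rows, then both are folded back into the flat (q_h*q_w, k_h*k_w) attention
# map.  The demo name `_demo_decomposed_rel_pos_shapes` is hypothetical.
def _demo_decomposed_rel_pos_shapes():
    import torch
    q_h = q_w = k_h = k_w = 3
    B, C = 2, 8
    attn = torch.zeros(B, q_h * q_w, k_h * k_w)
    q = torch.randn(B, q_h * q_w, C)
    rel_pos = torch.randn(2 * k_h - 1, C)      # one (2*size-1, C) table per axis
    out = add_decomposed_rel_pos(attn, q, rel_pos, rel_pos, (q_h, q_w), (k_h, k_w), "vitdet")
    assert out.shape == (B, 9, 9)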


def get_abs_pos(abs_pos, has_cls_token, hw):
    """
    Calculate absolute positional embeddings. If needed, resize embeddings and remove cls_token
        dimension for the original embeddings.
    Args:
        abs_pos (Tensor): absolute positional embeddings with (1, num_position, C).
        has_cls_token (bool): If true, has 1 embedding in abs_pos for cls token.
        hw (Tuple): size of input image tokens.

    Returns:
        Absolute positional embeddings after processing with shape (1, H, W, C)
    """
    h, w = hw
    if has_cls_token:
        abs_pos = abs_pos[:, 1:]
    xy_num = abs_pos.shape[1]
    size = int(math.sqrt(xy_num))
    assert size * size == xy_num

    if size != h or size != w:
        new_abs_pos = F.interpolate(
            abs_pos.reshape(1, size, size, -1).permute(0, 3, 1, 2),
            size=(h, w),
            mode="bicubic",
            align_corners=False,
        )

        return new_abs_pos.permute(0, 2, 3, 1)
    else:
        return abs_pos.reshape(1, h, w, -1)
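

# --- Editor's note: illustrative sketch, not part of the original EVA/detectron2 source. ---
# The pretrained position table (optionally with a cls token) is reshaped to its square
# grid and bicubically resized to the current token grid.  The demo name
# `_demo_abs_pos_interpolation` is hypothetical.
def _demo_abs_pos_interpolation():
    import torch
    abs_pos = torch.randn(1, 1 + 14 * 14, 8)   # cls token + 14x14 pretraining grid, C=8
    out = get_abs_pos(abs_pos, has_cls_token=True, hw=(7, 7))
    # cls token dropped, 14x14 resized to 7x7, returned channels-last as (1, H, W, C).
    assert out.shape == (1, 7, 7, 8)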


class PatchEmbed(nn.Module):
    """
    Image to Patch Embedding.
    """

    def __init__(
        self, kernel_size=(16, 16), stride=(16, 16), padding=(0, 0), in_chans=3, embed_dim=768
    ):
        """
        Args:
            kernel_size (Tuple): kernel size of the projection layer.
            stride (Tuple): stride of the projection layer.
            padding (Tuple): padding size of the projection layer.
            in_chans (int): Number of input image channels.
            embed_dim (int): Patch embedding dimension.
        """
        super().__init__()

        self.proj = nn.Conv2d(
            in_chans, embed_dim, kernel_size=kernel_size, stride=stride, padding=padding
        )

    def forward(self, x):
        x = self.proj(x)
        # B C H W -> B H W C
        x = x.permute(0, 2, 3, 1)
        return x
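

# --- Editor's note: illustrative sketch, not part of the original EVA/detectron2 source. ---
# PatchEmbed is a strided convolution followed by a permute to channels-last, so a
# 64x64 image with 16x16 patches becomes a 4x4 grid of embed_dim-dimensional tokens.
# The demo name `_demo_patch_embed_shape` is hypothetical.
def _demo_patch_embed_shape():
    import torch
    embed = PatchEmbed(kernel_size=(16, 16), stride=(16, 16), in_chans=3, embed_dim=32)
    tokens = embed(torch.randn(1, 3, 64, 64))  # B, C, H, W
    assert tokens.shape == (1, 4, 4, 32)       # B, H/16, W/16, embed_dim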

================================================
FILE: thirdparty/GLEE/glee/backbone/eva_02_utils.py
================================================
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
import math
import numpy as np
from scipy import interpolate
import torch
import torch.nn as nn
import torch.nn.functional as F

__all__ = [
    "window_partition",
    "window_unpartition",
    "add_decomposed_rel_pos",
    "get_abs_pos",
    "PatchEmbed",
    "VisionRotaryEmbeddingFast",
]


def window_partition(x, window_size):
    """
    Partition into non-overlapping windows with padding if needed.
    Args:
        x (tensor): input tokens with [B, H, W, C].
        window_size (int): window size.

    Returns:
        windows: windows after partition with [B * num_windows, window_size, window_size, C].
        (Hp, Wp): padded height and width before partition
    """
    B, H, W, C = x.shape

    pad_h = (window_size - H % window_size) % window_size
    pad_w = (window_size - W % window_size) % window_size
    if pad_h > 0 or pad_w > 0:
        x = F.pad(x, (0, 0, 0, pad_w, 0, pad_h))
    Hp, Wp = H + pad_h, W + pad_w

    x = x.view(B, Hp // window_size, window_size, Wp // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)
    return windows, (Hp, Wp)


def window_unpartition(windows, window_size, pad_hw, hw):
    """
    Window unpartition into original sequences and remove padding.
    Args:
        windows (tensor): input tokens with [B * num_windows, window_size, window_size, C].
        window_size (int): window size.
        pad_hw (Tuple): padded height and width (Hp, Wp).
        hw (Tuple): original height and width (H, W) before padding.

    Returns:
        x: unpartitioned sequences with [B, H, W, C].
    """
    Hp, Wp = pad_hw
    H, W = hw
    B = windows.shape[0] // (Hp * Wp // window_size // window_size)
    x = windows.view(B, Hp // window_size, Wp // window_size, window_size, window_size, -1)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, Hp, Wp, -1)

    if Hp > H or Wp > W:
        x = x[:, :H, :W, :].contiguous()
    return x


def get_rel_pos(q_size, k_size, rel_pos):
    """
    Get relative positional embeddings according to the relative positions of
        query and key sizes.
    Args:
        q_size (int): size of query q.
        k_size (int): size of key k.
        rel_pos (Tensor): relative position embeddings (L, C).

    Returns:
        Extracted positional embeddings according to relative positions.
    """
    max_rel_dist = int(2 * max(q_size, k_size) - 1)
    use_log_interpolation = True

    # Interpolate rel pos if needed.
    if rel_pos.shape[0] != max_rel_dist:
        if not use_log_interpolation:
            # Interpolate rel pos.
            rel_pos_resized = F.interpolate(
                rel_pos.reshape(1, rel_pos.shape[0], -1).permute(0, 2, 1),
                size=max_rel_dist,
                mode="linear",
            )
            rel_pos_resized = rel_pos_resized.reshape(-1, max_rel_dist).permute(1, 0)
        else:
            src_size = rel_pos.shape[0]
            dst_size = max_rel_dist

            # q = 1.13492
            q = 1.0903078
            dis = []

            cur = 1
            for i in range(src_size // 2):
                dis.append(cur)
                cur += q ** (i + 1)

            r_ids = [-_ for _ in reversed(dis)]
            x = r_ids + [0] + dis
            t = dst_size // 2.0
            dx = np.arange(-t, t + 0.1, 1.0)
            # print("x = %s" % str(x))
            # print("dx = %s" % str(dx))
            all_rel_pos_bias = []
            for i in range(rel_pos.shape[1]):
                z = rel_pos[:, i].view(src_size).cpu().float().numpy()
                f = interpolate.interp1d(x, z, kind='cubic', fill_value="extrapolate")
                all_rel_pos_bias.append(
                    torch.Tensor(f(dx)).contiguous().view(-1, 1).to(rel_pos.device))
            rel_pos_resized = torch.cat(all_rel_pos_bias, dim=-1)
    else:
        rel_pos_resized = rel_pos

    # Scale the coords with short length if shapes for q and k are different.
    q_coords = torch.arange(q_size)[:, None] * max(k_size / q_size, 1.0)
    k_coords = torch.arange(k_size)[None, :] * max(q_size / k_size, 1.0)
    relative_coords = (q_coords - k_coords) + (k_size - 1) * max(q_size / k_size, 1.0)

    return rel_pos_resized[relative_coords.long()]


def add_decomposed_rel_pos(attn, q, rel_pos_h, rel_pos_w, q_size, k_size):
    """
    Calculate decomposed Relative Positional Embeddings from :paper:`mvitv2`.
    https://github.com/facebookresearch/mvit/blob/19786631e330df9f3622e5402b4a419a263a2c80/mvit/models/attention.py   # noqa B950
    Args:
        attn (Tensor): attention map.
        q (Tensor): query q in the attention layer with shape (B, q_h * q_w, C).
        rel_pos_h (Tensor): relative position embeddings (Lh, C) for height axis.
        rel_pos_w (Tensor): relative position embeddings (Lw, C) for width axis.
        q_size (Tuple): spatial sequence size of query q with (q_h, q_w).
        k_size (Tuple): spatial sequence size of key k with (k_h, k_w).

    Returns:
        attn (Tensor): attention map with added relative positional embeddings.
    """
    q_h, q_w = q_size
    k_h, k_w = k_size
    Rh = get_rel_pos(q_h, k_h, rel_pos_h)
    Rw = get_rel_pos(q_w, k_w, rel_pos_w)

    B, _, dim = q.shape
    r_q = q.reshape(B, q_h, q_w, dim)
    rel_h = torch.einsum("bhwc,hkc->bhwk", r_q, Rh)
    rel_w = torch.einsum("bhwc,wkc->bhwk", r_q, Rw)

    attn = (
        attn.view(B, q_h, q_w, k_h, k_w) + rel_h[:, :, :, :, None] + rel_w[:, :, :, None, :]
    ).view(B, q_h * q_w, k_h * k_w)

    return attn


def get_abs_pos(abs_pos, has_cls_token, hw):
    """
    Calculate absolute positional embeddings. If needed, resize embeddings and remove cls_token
        dimension for the original embeddings.
    Args:
        abs_pos (Tensor): absolute positional embeddings with (1, num_position, C).
        has_cls_token (bool): If true, has 1 embedding in abs_pos for cls token.
        hw (Tuple): size of input image tokens.

    Returns:
        Absolute positional embeddings after processing with shape (1, H, W, C)
    """
    h, w = hw
    if has_cls_token:
        abs_pos = abs_pos[:, 1:]
    xy_num = abs_pos.shape[1]
    size = int(math.sqrt(xy_num))
    assert size * size == xy_num

    if size != h or size != w:
        new_abs_pos = F.interpolate(
            abs_pos.reshape(1, size, size, -1).permute(0, 3, 1, 2),
            size=(h, w),
            mode="bicubic",
            align_corners=False,
        )

        return new_abs_pos.permute(0, 2, 3, 1)
    else:
        return abs_pos.reshape(1, h, w, -1)


class PatchEmbed(nn.Module):
    """
    Image to Patch Embedding.
    """

    def __init__(
        self, kernel_size=(16, 16), stride=(16, 16), padding=(0, 0), in_chans=3, embed_dim=768
    ):
        """
        Args:
            kernel_size (Tuple): kernel size of the projection layer.
            stride (Tuple): stride of the projection layer.
            padding (Tuple): padding size of the projection layer.
            in_chans (int): Number of input image channels.
            embed_dim (int): Patch embedding dimension.
        """
        super().__init__()

        self.proj = nn.Conv2d(
            in_chans, embed_dim, kernel_size=kernel_size, stride=stride, padding=padding
        )

    def forward(self, x):
        x = self.proj(x)
        # B C H W -> B H W C
        x = x.permute(0, 2, 3, 1)
        return x
    



from math import pi

import torch
from torch import nn

from einops import rearrange, repeat



def broadcat(tensors, dim = -1):
    num_tensors = len(tensors)
    shape_lens = set(list(map(lambda t: len(t.shape), tensors)))
    assert len(shape_lens) == 1, 'tensors must all have the same number of dimensions'
    shape_len = list(shape_lens)[0]
    dim = (dim + shape_len) if dim < 0 else dim
    dims = list(zip(*map(lambda t: list(t.shape), tensors)))
    expandable_dims = [(i, val) for i, val in enumerate(dims) if i != dim]
    assert all([*map(lambda t: len(set(t[1])) <= 2, expandable_dims)]), 'invalid dimensions for broadcastable concatenation'
    max_dims = list(map(lambda t: (t[0], max(t[1])), expandable_dims))
    expanded_dims = list(map(lambda t: (t[0], (t[1],) * num_tensors), max_dims))
    expanded_dims.insert(dim, (dim, dims[dim]))
    expandable_shapes = list(zip(*map(lambda t: t[1], expanded_dims)))
    tensors = list(map(lambda t: t[0].expand(*t[1]), zip(tensors, expandable_shapes)))
    return torch.cat(tensors, dim = dim)
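

# --- Editor's note: illustrative sketch, not part of the original EVA rope source. ---
# broadcat broadcasts every non-concatenated dimension to its maximum size before
# concatenating; VisionRotaryEmbedding uses it to combine per-row and per-column
# frequencies into one (n, n, 2d) grid.  The demo name `_demo_broadcat` is hypothetical.
def _demo_broadcat():
    import torch
    a = torch.zeros(3, 1, 4)
    b = torch.zeros(1, 5, 4)
    out = broadcat((a, b), dim=-1)
    # dims 0 and 1 are broadcast to (3, 5); the last dims 4 + 4 are concatenated.
    assert out.shape == (3, 5, 8)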



def rotate_half(x):
    x = rearrange(x, '... (d r) -> ... d r', r = 2)
    x1, x2 = x.unbind(dim = -1)
    x = torch.stack((-x2, x1), dim = -1)
    return rearrange(x, '... d r -> ... (d r)')
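

# --- Editor's note: illustrative sketch, not part of the original EVA rope source. ---
# rotate_half maps each adjacent pair (x1, x2) to (-x2, x1), the 90-degree rotation that
# rotary position embeddings combine with the cos/sin tables below.  The demo name
# `_demo_rotate_half` is hypothetical.
def _demo_rotate_half():
    import torch
    x = torch.tensor([1., 2., 3., 4.])
    assert torch.equal(rotate_half(x), torch.tensor([-2., 1., -4., 3.]))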



class VisionRotaryEmbedding(nn.Module):
    def __init__(
        self,
        dim,
        pt_seq_len,
        ft_seq_len=None,
        custom_freqs = None,
        freqs_for = 'lang',
        theta = 10000,
        max_freq = 10,
        num_freqs = 1,
    ):
        super().__init__()
        if custom_freqs:
            freqs = custom_freqs
        elif freqs_for == 'lang':
            freqs = 1. / (theta ** (torch.arange(0, dim, 2)[:(dim // 2)].float() / dim))
        elif freqs_for == 'pixel':
            freqs = torch.linspace(1., max_freq / 2, dim // 2) * pi
        elif freqs_for == 'constant':
            freqs = torch.ones(num_freqs).float()
        else:
            raise ValueError(f'unknown modality {freqs_for}')

        if ft_seq_len is None: ft_seq_len = pt_seq_len
        t = torch.arange(ft_seq_len) / ft_seq_len * pt_seq_len

        freqs_h = torch.einsum('..., f -> ... f', t, freqs)
        freqs_h = repeat(freqs_h, '... n -> ... (n r)', r = 2)

        freqs_w = torch.einsum('..., f -> ... f', t, freqs)
        freqs_w = repeat(freqs_w, '... n -> ... (n r)', r = 2)

        freqs = broadcat((freqs_h[:, None, :], freqs_w[None, :, :]), dim = -1)

        self.register_buffer("freqs_cos", freqs.cos())
        self.register_buffer("freqs_sin", freqs.sin())

        print('======== shape of rope freq', self.freqs_cos.shape, '========')

================================================
SYMBOL INDEX (498 symbols across 51 files)
================================================

FILE: config_utils.py
  function hm3d_config (line 12) | def hm3d_config(path:str=HM3D_CONFIG_PATH,stage:str='val',episodes=200):
  function mp3d_config (line 44) | def mp3d_config(path:str=MP3D_CONFIG_PATH,stage:str='val',episodes=200):
  function r2r_config (line 76) | def r2r_config(path:str=R2R_CONFIG_PATH,stage:str='val_seen',episodes=200):

FILE: cv_utils/glee_detector.py
  function initialize_glee (line 15) | def initialize_glee(glee_config=GLEE_CONFIG_PATH,
  function glee_segmentation (line 32) | def glee_segmentation(img,
  function visualize_segmentation (line 79) | def visualize_segmentation(image,classes,masks):
  function visualize_detection (line 90) | def visualize_detection(image,classes,bboxes):

FILE: cv_utils/image_percevior.py
  class GLEE_Percevior (line 3) | class GLEE_Percevior:
    method __init__ (line 4) | def __init__(self,
    method perceive (line 10) | def perceive(self,image,confidence_threshold=0.25,area_threshold=2500):

FILE: llm_utils/gpt_request.py
  function local_image_to_data_url (line 29) | def local_image_to_data_url(image):
  function gptv_response (line 39) | def gptv_response(text_prompt,image_prompt,system_prompt=""):
  function gpt_response (line 48) | def gpt_response(text_prompt,system_prompt=""):

FILE: mapper.py
  class Instruct_Mapper (line 13) | class Instruct_Mapper:
    method __init__ (line 14) | def __init__(self,
    method reset (line 38) | def reset(self,position,rotation):
    method update (line 49) | def update(self,rgb,depth,position,rotation):
    method update_object_pcd (line 106) | def update_object_pcd(self):
    method get_view_pointcloud (line 120) | def get_view_pointcloud(self,rgb,depth,translation,rotation):
    method get_object_entities (line 130) | def get_object_entities(self,depth,classes,masks,confidences):
    method associate_object_entities (line 151) | def associate_object_entities(self,ref_entities,eval_entities):
    method get_obstacle_affordance (line 179) | def get_obstacle_affordance(self):
    method get_trajectory_affordance (line 188) | def get_trajectory_affordance(self):
    method get_semantic_affordance (line 196) | def get_semantic_affordance(self,target_class,threshold=0.1):
    method get_gpt4v_affordance (line 210) | def get_gpt4v_affordance(self,gpt4v_pcd):
    method get_action_affordance (line 219) | def get_action_affordance(self,action):
    method get_objnav_affordance_map (line 273) | def get_objnav_affordance_map(self,action,target_class,gpt4v_pcd,compl...
    method get_debug_affordance_map (line 294) | def get_debug_affordance_map(self,action,target_class,gpt4v_pcd):
    method visualize_affordance (line 306) | def visualize_affordance(self,affordance):
    method get_appeared_objects (line 312) | def get_appeared_objects(self):
    method save_pointcloud_debug (line 315) | def save_pointcloud_debug(self,path="./"):

FILE: mapping_utils/geometry.py
  function get_pointcloud_from_depth (line 8) | def get_pointcloud_from_depth(rgb:np.ndarray,depth:np.ndarray,intrinsic:...
  function get_pointcloud_from_depth_mask (line 20) | def get_pointcloud_from_depth_mask(depth:np.ndarray,mask:np.ndarray,intr...
  function translate_to_world (line 33) | def translate_to_world(pointcloud,position,rotation):
  function project_to_camera (line 40) | def project_to_camera(pcd,intrinsic,position,rotation):
  function pointcloud_distance (line 55) | def pointcloud_distance(pcdA,pcdB,device='cpu'):
  function pointcloud_2d_distance (line 66) | def pointcloud_2d_distance(pcdA,pcdB,device='cpu'):
  function cpu_pointcloud_from_array (line 75) | def cpu_pointcloud_from_array(points,colors):
  function gpu_pointcloud_from_array (line 81) | def gpu_pointcloud_from_array(points,colors,device):
  function gpu_pointcloud (line 87) | def gpu_pointcloud(pointcloud,device):
  function cpu_pointcloud (line 93) | def cpu_pointcloud(pointcloud):
  function cpu_merge_pointcloud (line 99) | def cpu_merge_pointcloud(pcdA,pcdB):
  function gpu_merge_pointcloud (line 102) | def gpu_merge_pointcloud(pcdA,pcdB):
  function gpu_cluster_filter (line 109) | def gpu_cluster_filter(pointcloud,eps=0.3,min_points=20):
  function cpu_cluster_filter (line 117) | def cpu_cluster_filter(pointcloud,eps=0.3,min_points=20):
  function quat2array (line 124) | def quat2array(quat):
  function quaternion_distance (line 127) | def quaternion_distance(quatA,quatB):
  function eculidean_distance (line 134) | def eculidean_distance(posA,posB):

FILE: mapping_utils/path_planning.py
  function path_planning (line 8) | def path_planning(costmap,start_index,goal_index):
  function visualize_path (line 22) | def visualize_path(costmap,path):

FILE: mapping_utils/preprocess.py
  function preprocess_depth (line 2) | def preprocess_depth(depth:np.ndarray,lower_bound:float=0.1,upper_bound:...
  function preprocess_image (line 5) | def preprocess_image(image:np.ndarray):

FILE: mapping_utils/projection.py
  function project_frontier (line 10) | def project_frontier(obstacle_pcd,navigable_pcd,obstacle_height=-0.7,gri...
  function translate_grid_to_point (line 50) | def translate_grid_to_point(pointcloud,grid_indexes,grid_resolution=0.25):
  function translate_point_to_grid (line 56) | def translate_point_to_grid(pointcloud,point_poses,grid_resolution=0.25):
  function project_costmap (line 64) | def project_costmap(navigable_pcd,affordance_value,grid_resolution=0.25):

FILE: mapping_utils/transform.py
  function habitat_camera_intrinsic (line 3) | def habitat_camera_intrinsic(config):
  function habitat_translation (line 18) | def habitat_translation(position):
  function habitat_rotation (line 21) | def habitat_rotation(rotation):

FILE: objnav_agent.py
  class HM3D_Objnav_Agent (line 15) | class HM3D_Objnav_Agent:
    method __init__ (line 16) | def __init__(self,env:habitat.Env,mapper:Instruct_Mapper):
    method translate_objnav (line 22) | def translate_objnav(self,object_goal):
    method reset_debug_probes (line 30) | def reset_debug_probes(self):
    method reset (line 47) | def reset(self):
    method rotate_panoramic (line 56) | def rotate_panoramic(self,rotate_times = 12):
    method concat_panoramic (line 67) | def concat_panoramic(self,images):
    method update_trajectory (line 83) | def update_trajectory(self):
    method save_trajectory (line 104) | def save_trajectory(self,dir="./tmp_objnav/"):
    method query_chainon (line 140) | def query_chainon(self):
    method query_gpt4v (line 162) | def query_gpt4v(self):
    method make_plan (line 185) | def make_plan(self,rotate=True,failed=False):
    method step (line 223) | def step(self):

FILE: objnav_benchmark.py
  function write_metrics (line 13) | def write_metrics(metrics,path="objnav_hm3d.csv"):
  function get_args (line 20) | def get_args():

FILE: thirdparty/GLEE/glee/backbone/backbone.py
  class Backbone (line 9) | class Backbone(nn.Module):
    method __init__ (line 14) | def __init__(self):
    method forward (line 20) | def forward(self):
    method size_divisibility (line 30) | def size_divisibility(self) -> int:
    method output_shape (line 40) | def output_shape(self):

FILE: thirdparty/GLEE/glee/backbone/build.py
  function build_backbone (line 6) | def build_backbone(config, **kwargs):

FILE: thirdparty/GLEE/glee/backbone/davit.py
  class MySequential (line 22) | class MySequential(nn.Sequential):
    method forward (line 23) | def forward(self, *inputs):
  class PreNorm (line 32) | class PreNorm(nn.Module):
    method __init__ (line 33) | def __init__(self, norm, fn, drop_path=None):
    method forward (line 39) | def forward(self, x, *args, **kwargs):
  class Mlp (line 54) | class Mlp(nn.Module):
    method __init__ (line 55) | def __init__(
    method forward (line 71) | def forward(self, x, size):
  class DepthWiseConv2d (line 75) | class DepthWiseConv2d(nn.Module):
    method __init__ (line 76) | def __init__(
    method forward (line 94) | def forward(self, x, size):
  class ConvEmbed (line 105) | class ConvEmbed(nn.Module):
    method __init__ (line 109) | def __init__(
    method forward (line 134) | def forward(self, x, size):
  class ChannelAttention (line 154) | class ChannelAttention(nn.Module):
    method __init__ (line 156) | def __init__(self, dim, groups=8, qkv_bias=True):
    method forward (line 163) | def forward(self, x, size):
  class ChannelBlock (line 178) | class ChannelBlock(nn.Module):
    method __init__ (line 180) | def __init__(self, dim, groups, mlp_ratio=4., qkv_bias=True,
    method forward (line 200) | def forward(self, x, size):
  function window_partition (line 212) | def window_partition(x, window_size: int):
  function window_reverse (line 219) | def window_reverse(windows, window_size: int, H: int, W: int):
  class WindowAttention (line 226) | class WindowAttention(nn.Module):
    method __init__ (line 227) | def __init__(self, dim, num_heads, window_size, qkv_bias=True):
    method forward (line 241) | def forward(self, x, size):
  class SpatialBlock (line 286) | class SpatialBlock(nn.Module):
    method __init__ (line 288) | def __init__(self, dim, num_heads, window_size,
    method forward (line 308) | def forward(self, x, size):
  class DaViT (line 319) | class DaViT(nn.Module):
    method __init__ (line 343) | def __init__(
    method dim_out (line 439) | def dim_out(self):
    method _init_weights (line 442) | def _init_weights(self, m):
    method _try_remap_keys (line 459) | def _try_remap_keys(self, pretrained_dict):
    method from_state_dict (line 489) | def from_state_dict(self, pretrained_dict, pretrained_layers=[], verbo...
    method from_pretrained (line 511) | def from_pretrained(self, pretrained='', pretrained_layers=[], verbose...
    method forward_features (line 518) | def forward_features(self, x):
    method forward (line 537) | def forward(self, x):
  class D2DaViT (line 542) | class D2DaViT(DaViT, Backbone):
    method __init__ (line 543) | def __init__(self, cfg, input_shape):
    method forward (line 581) | def forward(self, x):
    method output_shape (line 599) | def output_shape(self):
    method size_divisibility (line 608) | def size_divisibility(self):
  function get_davit_backbone (line 612) | def get_davit_backbone(cfg):

FILE: thirdparty/GLEE/glee/backbone/eva01.py
  class LayerNormWithForceFP32 (line 42) | class LayerNormWithForceFP32(nn.Module):
    method __init__ (line 48) | def __init__(self, normalized_shape: _shape_t, eps: float = 1e-5, elem...
    method reset_parameters (line 63) | def reset_parameters(self) -> None:
    method forward (line 68) | def forward(self, input: Tensor) -> Tensor:
    method extra_repr (line 72) | def extra_repr(self) -> Tensor:
  class Attention (line 77) | class Attention(nn.Module):
    method __init__ (line 80) | def __init__(
    method forward (line 126) | def forward(self, x):
  class ResBottleneckBlock (line 154) | class ResBottleneckBlock(CNNBlockBase):
    method __init__ (line 160) | def __init__(
    method forward (line 206) | def forward(self, x):
  class Block (line 215) | class Block(nn.Module):
    method __init__ (line 218) | def __init__(
    method forward (line 290) | def forward(self, x):
  class EVAViT (line 316) | class EVAViT(Backbone):
    method __init__ (line 323) | def __init__(
    method _init_weights (line 435) | def _init_weights(self, m):
    method forward (line 448) | def forward(self, x):
  class SimpleFeaturePyramid (line 462) | class SimpleFeaturePyramid(Backbone):
    method __init__ (line 468) | def __init__(
    method padding_constraints (line 569) | def padding_constraints(self):
    method forward (line 575) | def forward(self, x):
  class D2_EVA01 (line 606) | class D2_EVA01(SimpleFeaturePyramid):
    method __init__ (line 607) | def __init__(self, cfg, input_shape):
    method output_shape (line 644) | def output_shape(self):
    method size_divisibility (line 653) | def size_divisibility(self):
  function get_vit_lr_decay_rate (line 658) | def get_vit_lr_decay_rate(name, lr_decay_rate=1.0, num_layers=12):

FILE: thirdparty/GLEE/glee/backbone/eva02-dino.py
  class SwiGLU (line 39) | class SwiGLU(nn.Module):
    method __init__ (line 40) | def __init__(self, in_features, hidden_features=None, out_features=Non...
    method forward (line 56) | def forward(self, x):
  class Attention (line 66) | class Attention(nn.Module):
    method __init__ (line 67) | def __init__(
    method forward (line 103) | def forward(self, x):
  class ResBottleneckBlock (line 139) | class ResBottleneckBlock(CNNBlockBase):
    method __init__ (line 145) | def __init__(
    method forward (line 191) | def forward(self, x):
  class Block (line 200) | class Block(nn.Module):
    method __init__ (line 203) | def __init__(
    method forward (line 266) | def forward(self, x):
  class EVA02_ViT (line 290) | class EVA02_ViT(Backbone):
    method __init__ (line 297) | def __init__(
    method _init_weights (line 414) | def _init_weights(self, m):
    method forward (line 423) | def forward(self, x):
  class SimpleFeaturePyramid (line 437) | class SimpleFeaturePyramid(Backbone):
    method __init__ (line 443) | def __init__(
    method padding_constraints (line 545) | def padding_constraints(self):
    method forward (line 551) | def forward(self, x):
  function get_vit_lr_decay_rate (line 580) | def get_vit_lr_decay_rate(name, lr_decay_rate=1.0, num_layers=12):

FILE: thirdparty/GLEE/glee/backbone/eva02.py
  class SwiGLU (line 52) | class SwiGLU(nn.Module):
    method __init__ (line 53) | def __init__(self, in_features, hidden_features=None, out_features=Non...
    method forward (line 69) | def forward(self, x):
  class Attention (line 79) | class Attention(nn.Module):
    method __init__ (line 80) | def __init__(
    method forward (line 115) | def forward(self, x):
  class ResBottleneckBlock (line 151) | class ResBottleneckBlock(CNNBlockBase):
    method __init__ (line 157) | def __init__(
    method forward (line 203) | def forward(self, x):
  class Block (line 212) | class Block(nn.Module):
    method __init__ (line 215) | def __init__(
    method forward (line 278) | def forward(self, x):
  class EVA02_ViT (line 302) | class EVA02_ViT(Backbone):
    method __init__ (line 309) | def __init__(
    method _init_weights (line 426) | def _init_weights(self, m):
    method forward (line 435) | def forward(self, x):
  class SimpleFeaturePyramid (line 449) | class SimpleFeaturePyramid(Backbone):
    method __init__ (line 455) | def __init__(
    method padding_constraints (line 557) | def padding_constraints(self):
    method forward (line 563) | def forward(self, x):
  class D2_EVA02 (line 594) | class D2_EVA02(SimpleFeaturePyramid):
    method __init__ (line 595) | def __init__(self, cfg, input_shape):
    method output_shape (line 635) | def output_shape(self):
    method size_divisibility (line 644) | def size_divisibility(self):

FILE: thirdparty/GLEE/glee/backbone/eva_01_utils.py
  function window_partition (line 18) | def window_partition(x, window_size):
  function window_unpartition (line 42) | def window_unpartition(windows, window_size, pad_hw, hw):
  function get_rel_pos (line 65) | def get_rel_pos(q_size, k_size, rel_pos, interp_type):
  function add_decomposed_rel_pos (line 132) | def add_decomposed_rel_pos(attn, q, rel_pos_h, rel_pos_w, q_size, k_size...
  function get_abs_pos (line 164) | def get_abs_pos(abs_pos, has_cls_token, hw):
  class PatchEmbed (line 196) | class PatchEmbed(nn.Module):
    method __init__ (line 201) | def __init__(
    method forward (line 218) | def forward(self, x):

FILE: thirdparty/GLEE/glee/backbone/eva_02_utils.py
  function window_partition (line 19) | def window_partition(x, window_size):
  function window_unpartition (line 43) | def window_unpartition(windows, window_size, pad_hw, hw):
  function get_rel_pos (line 66) | def get_rel_pos(q_size, k_size, rel_pos):
  function add_decomposed_rel_pos (line 128) | def add_decomposed_rel_pos(attn, q, rel_pos_h, rel_pos_w, q_size, k_size):
  function get_abs_pos (line 160) | def get_abs_pos(abs_pos, has_cls_token, hw):
  class PatchEmbed (line 192) | class PatchEmbed(nn.Module):
    method __init__ (line 197) | def __init__(
    method forward (line 214) | def forward(self, x):
  function broadcat (line 232) | def broadcat(tensors, dim = -1):
  function rotate_half (line 250) | def rotate_half(x):
  class VisionRotaryEmbedding (line 258) | class VisionRotaryEmbedding(nn.Module):
    method __init__ (line 259) | def __init__(
    method forward (line 298) | def forward(self, t, start_index = 0):
  class VisionRotaryEmbeddingFast (line 309) | class VisionRotaryEmbeddingFast(nn.Module):
    method __init__ (line 310) | def __init__(
    method forward (line 349) | def forward(self, t):

FILE: thirdparty/GLEE/glee/backbone/internimage.py
  class to_channels_first (line 22) | class to_channels_first(nn.Module):
    method __init__ (line 24) | def __init__(self):
    method forward (line 27) | def forward(self, x):
  class to_channels_last (line 31) | class to_channels_last(nn.Module):
    method __init__ (line 33) | def __init__(self):
    method forward (line 36) | def forward(self, x):
  function build_norm_layer (line 40) | def build_norm_layer(dim,
  function build_act_layer (line 64) | def build_act_layer(act_layer):
  class CrossAttention (line 75) | class CrossAttention(nn.Module):
    method __init__ (line 91) | def __init__(self,
    method forward (line 128) | def forward(self, x, k=None, v=None):
  class AttentiveBlock (line 165) | class AttentiveBlock(nn.Module):
    method __init__ (line 183) | def __init__(self,
    method forward (line 211) | def forward(self,
  class AttentionPoolingBlock (line 227) | class AttentionPoolingBlock(AttentiveBlock):
    method forward (line 229) | def forward(self, x):
  class StemLayer (line 240) | class StemLayer(nn.Module):
    method __init__ (line 249) | def __init__(self,
    method forward (line 271) | def forward(self, x):
  class DownsampleLayer (line 280) | class DownsampleLayer(nn.Module):
    method __init__ (line 287) | def __init__(self, channels, norm_layer='LN'):
    method forward (line 298) | def forward(self, x):
  class MLPLayer (line 304) | class MLPLayer(nn.Module):
    method __init__ (line 314) | def __init__(self,
    method forward (line 328) | def forward(self, x):
  class InternImageLayer (line 337) | class InternImageLayer(nn.Module):
    method __init__ (line 354) | def __init__(self,
    method forward (line 408) | def forward(self, x):
  class InternImageBlock (line 437) | class InternImageBlock(nn.Module):
    method __init__ (line 455) | def __init__(self,
    method forward (line 510) | def forward(self, x, return_wo_downsample=False):
  class InternImage (line 527) | class InternImage(Backbone):
    method __init__ (line 551) | def __init__(self,
    method _init_weights (line 647) | def _init_weights(self, m):
    method _init_deform_weights (line 656) | def _init_deform_weights(self, m):
    method forward (line 660) | def forward(self, x):
  class D2InternImage (line 675) | class D2InternImage(InternImage):
    method __init__ (line 676) | def __init__(self, cfg, input_shape):
    method forward (line 708) | def forward(self, x):
    method output_shape (line 725) | def output_shape(self):
    method size_divisibility (line 734) | def size_divisibility(self):

FILE: thirdparty/GLEE/glee/backbone/registry.py
  function register_backbone (line 4) | def register_backbone(fn):
  function model_entrypoints (line 10) | def model_entrypoints(model_name):
  function is_model (line 13) | def is_model(model_name):

FILE: thirdparty/GLEE/glee/backbone/resnet.py
  class BasicBlock (line 36) | class BasicBlock(CNNBlockBase):
    method __init__ (line 42) | def __init__(self, in_channels, out_channels, *, stride=1, norm="BN"):
    method forward (line 89) | def forward(self, x):
  class BottleneckBlock (line 104) | class BottleneckBlock(CNNBlockBase):
    method __init__ (line 111) | def __init__(
    method forward (line 198) | def forward(self, x):
  class DeformBottleneckBlock (line 217) | class DeformBottleneckBlock(CNNBlockBase):
    method __init__ (line 223) | def __init__(
    method forward (line 307) | def forward(self, x):
  class BasicStem (line 334) | class BasicStem(CNNBlockBase):
    method __init__ (line 340) | def __init__(self, in_channels=3, out_channels=64, norm="BN"):
    method forward (line 359) | def forward(self, x):
  class ResNet (line 366) | class ResNet(Backbone):
    method __init__ (line 371) | def __init__(self, stem, stages, num_classes=None, out_features=None, ...
    method forward (line 439) | def forward(self, x):
    method output_shape (line 464) | def output_shape(self):
    method freeze (line 472) | def freeze(self, freeze_at=0):
    method make_stage (line 497) | def make_stage(block_class, num_blocks, *, in_channels, out_channels, ...
    method make_default_stages (line 552) | def make_default_stages(depth, block_class=None, **kwargs):
  function make_stage (line 610) | def make_stage(*args, **kwargs):
  function _convert_ndarray_to_tensor (line 617) | def _convert_ndarray_to_tensor(state_dict: Dict[str, Any]) -> None:
  function get_resnet_backbone (line 638) | def get_resnet_backbone(cfg):

FILE: thirdparty/GLEE/glee/backbone/swin.py
  class Mlp (line 20) | class Mlp(nn.Module):
    method __init__ (line 23) | def __init__(
    method forward (line 34) | def forward(self, x):
  function window_partition (line 43) | def window_partition(x, window_size):
  function window_reverse (line 57) | def window_reverse(windows, window_size, H, W):
  class WindowAttention (line 73) | class WindowAttention(nn.Module):
    method __init__ (line 86) | def __init__(
    method forward (line 130) | def forward(self, x, mask=None):
  class SwinTransformerBlock (line 173) | class SwinTransformerBlock(nn.Module):
    method __init__ (line 190) | def __init__(
    method forward (line 234) | def forward(self, x, mask_matrix):
  class PatchMerging (line 297) | class PatchMerging(nn.Module):
    method __init__ (line 304) | def __init__(self, dim, norm_layer=nn.LayerNorm):
    method forward (line 310) | def forward(self, x, H, W):
  class BasicLayer (line 339) | class BasicLayer(nn.Module):
    method __init__ (line 357) | def __init__(
    method forward (line 405) | def forward(self, x, H, W):
  class PatchEmbed (line 455) | class PatchEmbed(nn.Module):
    method __init__ (line 464) | def __init__(self, patch_size=4, in_chans=3, embed_dim=96, norm_layer=...
    method forward (line 478) | def forward(self, x):
  class SwinTransformer (line 497) | class SwinTransformer(nn.Module):
    method __init__ (line 525) | def __init__(
    method _freeze_stages (line 617) | def _freeze_stages(self):
    method init_weights (line 634) | def init_weights(self, pretrained=None):
    method forward (line 660) | def forward(self, x):
    method train (line 689) | def train(self, mode=True):
  class D2SwinTransformer (line 696) | class D2SwinTransformer(SwinTransformer, Backbone):
    method __init__ (line 697) | def __init__(self, cfg, input_shape):
    method forward (line 754) | def forward(self, x):
    method output_shape (line 771) | def output_shape(self):
    method size_divisibility (line 780) | def size_divisibility(self):
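
For orientation, window_partition and window_reverse listed above are the standard Swin Transformer helpers that cut a (B, H, W, C) feature map into non-overlapping windows and stitch it back together after windowed attention. The sketch below follows the public reference implementation; this repository's copy may differ cosmetically.

def window_partition(x, window_size):
    # x: (B, H, W, C) torch tensor -> (num_windows * B, window_size, window_size, C)
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)

def window_reverse(windows, window_size, H, W):
    # windows: (num_windows * B, window_size, window_size, C) -> x: (B, H, W, C)
    B = int(windows.shape[0] / (H * W / window_size / window_size))
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)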

FILE: thirdparty/GLEE/glee/backbone/vit.py
  class Attention (line 28) | class Attention(nn.Module):
    method __init__ (line 31) | def __init__(
    method forward (line 68) | def forward(self, x):
  class ResBottleneckBlock (line 91) | class ResBottleneckBlock(CNNBlockBase):
    method __init__ (line 97) | def __init__(
    method forward (line 143) | def forward(self, x):
  class Block (line 152) | class Block(nn.Module):
    method __init__ (line 155) | def __init__(
    method forward (line 217) | def forward(self, x):
  class ViT (line 238) | class ViT(Backbone):
    method __init__ (line 245) | def __init__(
    method _init_weights (line 353) | def _init_weights(self, m):
    method forward (line 362) | def forward(self, x):
  class D2ViT (line 384) | class D2ViT(ViT, Backbone):
    method __init__ (line 385) | def __init__(self, cfg, input_shape):
    method forward (line 445) | def forward(self, x):
    method output_shape (line 462) | def output_shape(self):
    method size_divisibility (line 471) | def size_divisibility(self):

FILE: thirdparty/GLEE/glee/backbone/vit_utils.py
  function window_partition (line 18) | def window_partition(x, window_size):
  function window_unpartition (line 42) | def window_unpartition(windows, window_size, pad_hw, hw):
  function get_rel_pos (line 65) | def get_rel_pos(q_size, k_size, rel_pos, interp_type):
  function add_decomposed_rel_pos (line 132) | def add_decomposed_rel_pos(attn, q, rel_pos_h, rel_pos_w, q_size, k_size...
  function get_abs_pos (line 164) | def get_abs_pos(abs_pos, has_cls_token, hw):
  class PatchEmbed (line 196) | class PatchEmbed(nn.Module):
    method __init__ (line 201) | def __init__(
    method forward (line 218) | def forward(self, x):

FILE: thirdparty/GLEE/glee/config.py
  function add_glee_config (line 3) | def add_glee_config(cfg):

FILE: thirdparty/GLEE/glee/config_deeplab.py
  function add_deeplab_config (line 5) | def add_deeplab_config(cfg):

FILE: thirdparty/GLEE/glee/models/glee_model.py
  function rand_sample (line 26) | def rand_sample(x, max_len):
  function agg_lang_feat (line 34) | def agg_lang_feat(features, mask, pool_type="average"):
  class GLEE_Model (line 51) | class GLEE_Model(nn.Module):
    method __init__ (line 55) | def __init__(self, cfg, matcher, device, video_info, contras_mean):
    method device (line 98) | def device(self):
    method forward (line 101) | def forward(self, images, prompts, task, targets=None, batch_name_list...
    method get_template (line 249) | def get_template(self, imgs, pad_masks, prompt_mode='scribble'):

FILE: thirdparty/GLEE/glee/models/pixel_decoder/early_fusion.py
  class VLFuse (line 9) | class VLFuse(torch.nn.Module):
    method __init__ (line 14) | def __init__(self, ):
    method init_configs (line 28) | def init_configs(self, ):
    method forward (line 41) | def forward(self, x, task=None):
  class BiMultiHeadAttention (line 57) | class BiMultiHeadAttention(nn.Module):
    method __init__ (line 58) | def __init__(self, v_dim, l_dim, embed_dim, num_heads, dropout=0.1):
    method _shape (line 87) | def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
    method _reset_parameters (line 90) | def _reset_parameters(self):
    method forward (line 104) | def forward(self, v, l, attention_mask_l=None):
  class BiAttentionBlockForCheckpoint (line 192) | class BiAttentionBlockForCheckpoint(nn.Module):
    method __init__ (line 193) | def __init__(self, v_dim, l_dim, embed_dim, num_heads, dropout=0.1,
    method forward (line 219) | def forward(self, v, l, attention_mask_l=None, task=None):

FILE: thirdparty/GLEE/glee/models/pixel_decoder/maskdino_encoder.py
  function build_pixel_decoder (line 27) | def build_pixel_decoder(cfg, input_shape):
  class MSDeformAttnTransformerEncoderOnly (line 43) | class MSDeformAttnTransformerEncoderOnly(nn.Module):
    method __init__ (line 44) | def __init__(self, d_model=256, nhead=8,
    method _reset_parameters (line 64) | def _reset_parameters(self):
    method get_valid_ratio (line 73) | def get_valid_ratio(self, mask):
    method forward (line 82) | def forward(self, srcs, masks, pos_embeds, early_fusion=None):
  class MSDeformAttnTransformerEncoderLayer (line 119) | class MSDeformAttnTransformerEncoderLayer(nn.Module):
    method __init__ (line 120) | def __init__(self,
    method with_pos_embed (line 140) | def with_pos_embed(tensor, pos):
    method forward_ffn (line 143) | def forward_ffn(self, src):
    method forward (line 149) | def forward(self, src, pos, reference_points, spatial_shapes, level_st...
  class MSDeformAttnTransformerEncoder (line 161) | class MSDeformAttnTransformerEncoder(nn.Module):
    method __init__ (line 162) | def __init__(self, vl_fusion_layer, encoder_layer, num_layers):
    method get_reference_points (line 171) | def get_reference_points(spatial_shapes, valid_ratios, device):
    method forward (line 185) | def forward(self, src, spatial_shapes, level_start_index, valid_ratios...
  class MaskDINOEncoder (line 208) | class MaskDINOEncoder(nn.Module):
    method __init__ (line 213) | def __init__(
    method from_config (line 361) | def from_config(cls, cfg, input_shape: Dict[str, ShapeSpec]):
    method forward_features (line 384) | def forward_features(self, features, masks, early_fusion=None):

FILE: thirdparty/GLEE/glee/models/pixel_decoder/ops/functions/ms_deform_attn_func.py
  class MSDeformAttnFunction (line 32) | class MSDeformAttnFunction(Function):
    method forward (line 34) | def forward(ctx, value, value_spatial_shapes, value_level_start_index,...
    method backward (line 43) | def backward(ctx, grad_output):
  function ms_deform_attn_core_pytorch (line 52) | def ms_deform_attn_core_pytorch(value, value_spatial_shapes, sampling_lo...
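
ms_deform_attn_core_pytorch is the pure-PyTorch fallback used when the compiled CUDA kernel is unavailable (and by ops/test.py to check the kernel against a reference). The sketch below follows the Deformable DETR reference implementation; the shape comments use the usual conventions and are not verified against this copy.

import torch
import torch.nn.functional as F

def ms_deform_attn_core_pytorch(value, value_spatial_shapes,
                                sampling_locations, attention_weights):
    # value: (N, S, M, D) flattened multi-scale features, M heads of dim D
    # sampling_locations: (N, Lq, M, L, P, 2) in [0, 1]
    # attention_weights:  (N, Lq, M, L, P)
    N_, S_, M_, D_ = value.shape
    _, Lq_, M_, L_, P_, _ = sampling_locations.shape
    value_list = value.split([H_ * W_ for H_, W_ in value_spatial_shapes], dim=1)
    sampling_grids = 2 * sampling_locations - 1  # grid_sample expects [-1, 1]
    sampling_value_list = []
    for lid_, (H_, W_) in enumerate(value_spatial_shapes):
        # (N, H*W, M, D) -> (N*M, D, H, W)
        value_l_ = value_list[lid_].flatten(2).transpose(1, 2).reshape(N_ * M_, D_, H_, W_)
        # (N, Lq, M, P, 2) -> (N*M, Lq, P, 2)
        sampling_grid_l_ = sampling_grids[:, :, :, lid_].transpose(1, 2).flatten(0, 1)
        sampling_value_list.append(
            F.grid_sample(value_l_, sampling_grid_l_, mode='bilinear',
                          padding_mode='zeros', align_corners=False))
    attention_weights = attention_weights.transpose(1, 2).reshape(N_ * M_, 1, Lq_, L_ * P_)
    output = (torch.stack(sampling_value_list, dim=-2).flatten(-2) * attention_weights) \
        .sum(-1).view(N_, M_ * D_, Lq_)
    return output.transpose(1, 2).contiguous()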

FILE: thirdparty/GLEE/glee/models/pixel_decoder/ops/modules/ms_deform_attn.py
  function _is_power_of_2 (line 28) | def _is_power_of_2(n):
  class MSDeformAttn (line 34) | class MSDeformAttn(nn.Module):
    method __init__ (line 35) | def __init__(self, d_model=256, n_levels=4, n_heads=8, n_points=4):
    method _reset_parameters (line 66) | def _reset_parameters(self):
    method forward (line 82) | def forward(self, query, reference_points, input_flatten, input_spatia...

FILE: thirdparty/GLEE/glee/models/pixel_decoder/ops/setup.py
  function get_extensions (line 26) | def get_extensions():

FILE: thirdparty/GLEE/glee/models/pixel_decoder/ops/src/cpu/ms_deform_attn_cpu.cpp
  function ms_deform_attn_cpu_forward (line 22) | at::Tensor
  function ms_deform_attn_cpu_backward (line 34) | std::vector<at::Tensor>

FILE: thirdparty/GLEE/glee/models/pixel_decoder/ops/src/ms_deform_attn.h
  function im2col_step (line 32) | int im2col_step)

FILE: thirdparty/GLEE/glee/models/pixel_decoder/ops/src/vision.cpp
  function PYBIND11_MODULE (line 18) | PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {

FILE: thirdparty/GLEE/glee/models/pixel_decoder/ops/test.py
  function check_forward_equal_with_pytorch_double (line 35) | def check_forward_equal_with_pytorch_double():
  function check_forward_equal_with_pytorch_float (line 51) | def check_forward_equal_with_pytorch_float():
  function check_gradient_numerical (line 66) | def check_gradient_numerical(channels=4, grad_value=True, grad_sampling_...

FILE: thirdparty/GLEE/glee/models/pixel_decoder/position_encoding.py
  class PositionEmbeddingSine (line 15) | class PositionEmbeddingSine(nn.Module):
    method __init__ (line 21) | def __init__(self, num_pos_feats=64, temperature=10000, normalize=Fals...
    method forward (line 32) | def forward(self, x, mask=None):
    method __repr__ (line 57) | def __repr__(self, _repr_indent=4):
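
PositionEmbeddingSine is the DETR-style fixed 2D sinusoidal position embedding used by the pixel decoder. The sketch below follows the common reference implementation; the normalize/temperature/scale defaults mirror DETR and may differ slightly in this copy.

import math
import torch
from torch import nn

class PositionEmbeddingSine(nn.Module):
    # Fixed 2D sinusoidal position embedding over an image feature map.
    def __init__(self, num_pos_feats=64, temperature=10000, normalize=False, scale=None):
        super().__init__()
        self.num_pos_feats = num_pos_feats
        self.temperature = temperature
        self.normalize = normalize
        self.scale = 2 * math.pi if scale is None else scale

    def forward(self, x, mask=None):
        # x: (B, C, H, W); mask: (B, H, W), True on padded pixels.
        if mask is None:
            mask = torch.zeros((x.size(0), x.size(2), x.size(3)),
                               device=x.device, dtype=torch.bool)
        not_mask = ~mask
        y_embed = not_mask.cumsum(1, dtype=torch.float32)
        x_embed = not_mask.cumsum(2, dtype=torch.float32)
        if self.normalize:
            eps = 1e-6
            y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale
            x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale
        dim_t = torch.arange(self.num_pos_feats, dtype=torch.float32, device=x.device)
        dim_t = self.temperature ** (2 * torch.div(dim_t, 2, rounding_mode='floor') / self.num_pos_feats)
        pos_x = x_embed[:, :, :, None] / dim_t
        pos_y = y_embed[:, :, :, None] / dim_t
        pos_x = torch.stack((pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4).flatten(3)
        pos_y = torch.stack((pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4).flatten(3)
        # (B, 2 * num_pos_feats, H, W)
        return torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)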

FILE: thirdparty/GLEE/glee/models/transformer_decoder/dino_decoder.py
  class TransformerDecoder (line 18) | class TransformerDecoder(nn.Module):
    method __init__ (line 20) | def __init__(self, decoder_layer, num_layers, norm=None,
    method _reset_parameters (line 95) | def _reset_parameters(self):
    method with_pos_embed (line 103) | def with_pos_embed(tensor, pos):
    method forward (line 107) | def forward(self, tgt, memory,
  class DeformableTransformerDecoderLayer (line 200) | class DeformableTransformerDecoderLayer(nn.Module):
    method __init__ (line 202) | def __init__(self, d_model=256, d_ffn=1024,
    method rm_self_attn_modules (line 234) | def rm_self_attn_modules(self):
    method with_pos_embed (line 240) | def with_pos_embed(tensor, pos):
    method forward_ffn (line 243) | def forward_ffn(self, tgt):
    method forward (line 250) | def forward(self,

FILE: thirdparty/GLEE/glee/models/transformer_decoder/maskdino_decoder.py
  function build_transformer_decoder (line 29) | def build_transformer_decoder(cfg, in_channels, lang_encoder, mask_class...
  class MaskDINODecoder (line 38) | class MaskDINODecoder(nn.Module):
    method __init__ (line 40) | def __init__(
    method from_config (line 245) | def from_config(cls, cfg, in_channels, lang_encoder, mask_classificati...
    method prepare_for_dn (line 272) | def prepare_for_dn(self, targets, tgt, refpoint_emb, batch_size,task):
    method dn_post_process (line 396) | def dn_post_process(self,outputs_class,outputs_score,outputs_coord,mas...
    method get_valid_ratio (line 418) | def get_valid_ratio(self, mask):
    method pred_box (line 427) | def pred_box(self, reference, hs, ref0=None):
    method forward (line 447) | def forward(self, x, mask_features, extra, task, masks, targets=None):
    method forward_prediction_heads (line 602) | def forward_prediction_heads(self, output, mask_features, task, extra,...
    method _set_aux_loss (line 624) | def _set_aux_loss(self, outputs_class, outputs_score, outputs_seg_mask...

FILE: thirdparty/GLEE/glee/models/vos_utils.py
  class VLFuse (line 9) | class VLFuse(torch.nn.Module):
    method __init__ (line 14) | def __init__(self, ):
    method init_configs (line 28) | def init_configs(self, ):
    method forward (line 41) | def forward(self, x, task=None):
  function masks_to_boxes (line 58) | def masks_to_boxes(masks):
  class FeatureFuser (line 84) | class FeatureFuser(nn.Module):
    method __init__ (line 88) | def __init__(self, in_channels, channels=256):
    method forward (line 95) | def forward(self, features):
  function aligned_bilinear (line 112) | def aligned_bilinear(tensor, factor):
  class BiMultiHeadAttention (line 139) | class BiMultiHeadAttention(nn.Module):
    method __init__ (line 140) | def __init__(self, v_dim, l_dim, embed_dim, num_heads, dropout=0.1):
    method _shape (line 169) | def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
    method _reset_parameters (line 172) | def _reset_parameters(self):
    method forward (line 186) | def forward(self, v, l, attention_mask_l=None):
  class BiAttentionBlockForCheckpoint (line 274) | class BiAttentionBlockForCheckpoint(nn.Module):
    method __init__ (line 275) | def __init__(self, v_dim, l_dim, embed_dim, num_heads, dropout=0.1,
    method forward (line 301) | def forward(self, v, l, attention_mask_l=None, task=None):

FILE: thirdparty/GLEE/glee/modules/attention.py
  function multi_head_attention_forward (line 15) | def multi_head_attention_forward(
  class _LinearWithBias (line 326) | class _LinearWithBias(nn.Linear):
    method __init__ (line 329) | def __init__(self, in_features: int, out_features: int) -> None:
  class MultiheadAttention (line 333) | class MultiheadAttention(nn.Module):
    method __init__ (line 366) | def __init__(self, embed_dim, num_heads, dropout=0., bias=True, add_bi...
    method _reset_parameters (line 405) | def _reset_parameters(self):
    method __setstate__ (line 421) | def __setstate__(self, state):
    method forward (line 428) | def forward(self, query: Tensor, key: Tensor, value: Tensor, key_paddi...

FILE: thirdparty/GLEE/glee/modules/point_features.py
  function point_sample (line 21) | def point_sample(input, point_coords, **kwargs):
  function generate_regular_grid_point_coords (line 47) | def generate_regular_grid_point_coords(R, side_size, device):
  function get_uncertain_point_coords_with_randomness (line 65) | def get_uncertain_point_coords_with_randomness(
  function get_uncertain_point_coords_on_grid (line 121) | def get_uncertain_point_coords_on_grid(uncertainty_map, num_points):
  function point_sample_fine_grained_features (line 148) | def point_sample_fine_grained_features(features_list, feature_scales, bo...
  function get_point_coords_wrt_image (line 194) | def get_point_coords_wrt_image(boxes_coords, point_coords):
  function sample_point_labels (line 221) | def sample_point_labels(instances, point_coords):
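
point_sample, the first helper listed above, is a thin wrapper around F.grid_sample that takes point coordinates in [0, 1] x [0, 1] instead of [-1, 1]; the other functions in this file build uncertainty-based point selection on top of it. The sketch below mirrors the detectron2 PointRend helper this module appears to derive from.

import torch.nn.functional as F

def point_sample(input, point_coords, **kwargs):
    # input: (N, C, H, W); point_coords: (N, P, 2) with values in [0, 1].
    add_dim = False
    if point_coords.dim() == 3:
        add_dim = True
        point_coords = point_coords.unsqueeze(2)          # (N, P, 1, 2)
    # grid_sample expects coordinates in [-1, 1].
    output = F.grid_sample(input, 2.0 * point_coords - 1.0, **kwargs)
    if add_dim:
        output = output.squeeze(3)                        # (N, C, P)
    return output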

FILE: thirdparty/GLEE/glee/modules/position_encoding.py
  class PositionEmbeddingSine (line 12) | class PositionEmbeddingSine(nn.Module):
    method __init__ (line 18) | def __init__(self, num_pos_feats=64, temperature=10000, normalize=Fals...
    method forward (line 29) | def forward(self, x, mask=None):
    method __repr__ (line 54) | def __repr__(self, _repr_indent=4):

FILE: thirdparty/GLEE/glee/modules/postprocessing.py
  function detector_postprocess (line 9) | def detector_postprocess(
  function bbox_postprocess (line 77) | def bbox_postprocess(result, input_size, img_size, output_height, output...
  function sem_seg_postprocess (line 99) | def sem_seg_postprocess(result, img_size, output_height, output_width):
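
sem_seg_postprocess undoes the padding and resizing applied to the input before inference: it crops the logits to the valid region and interpolates back to the original image resolution. A sketch following the detectron2 helper of the same name (assumed, not verified against this copy):

import torch.nn.functional as F

def sem_seg_postprocess(result, img_size, output_height, output_width):
    # result: (C, H_pad, W_pad) logits; img_size: valid (H, W) before padding.
    result = result[:, : img_size[0], : img_size[1]].expand(1, -1, -1, -1)
    # Resize back to the original image resolution.
    result = F.interpolate(
        result, size=(output_height, output_width),
        mode="bilinear", align_corners=False)[0]
    return result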

FILE: thirdparty/GLEE/glee/utils/box_ops.py
  function box_cxcywh_to_xyxy (line 9) | def box_cxcywh_to_xyxy(x):
  function box_xyxy_to_cxcywh (line 16) | def box_xyxy_to_cxcywh(x):
  function box_xywh_to_xyxy (line 22) | def box_xywh_to_xyxy(x):
  function box_iou (line 29) | def box_iou(boxes1, boxes2):
  function generalized_box_iou (line 45) | def generalized_box_iou(boxes1, boxes2):
  function masks_to_boxes (line 69) | def masks_to_boxes(masks):
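
box_ops.py collects the usual DETR-style box utilities. For reference, the center-size/corner conversions and pairwise IoU typically look like the sketch below; this repository's copy may differ in minor details.

import torch
from torchvision.ops.boxes import box_area

def box_cxcywh_to_xyxy(x):
    x_c, y_c, w, h = x.unbind(-1)
    return torch.stack([x_c - 0.5 * w, y_c - 0.5 * h,
                        x_c + 0.5 * w, y_c + 0.5 * h], dim=-1)

def box_xyxy_to_cxcywh(x):
    x0, y0, x1, y1 = x.unbind(-1)
    return torch.stack([(x0 + x1) / 2, (y0 + y1) / 2,
                        x1 - x0, y1 - y0], dim=-1)

def box_iou(boxes1, boxes2):
    # Pairwise IoU between two sets of xyxy boxes: (N, 4) x (M, 4) -> (N, M).
    area1, area2 = box_area(boxes1), box_area(boxes2)
    lt = torch.max(boxes1[:, None, :2], boxes2[:, :2])
    rb = torch.min(boxes1[:, None, 2:], boxes2[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, :, 0] * wh[:, :, 1]
    union = area1[:, None] + area2 - inter
    return inter / union, union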

FILE: thirdparty/GLEE/glee/utils/config.py
  function configurable (line 7) | def configurable(init_func=None, *, from_config=None):
  function _called_with_cfg (line 94) | def _called_with_cfg(*args, **kwargs):
  function _get_args_from_config (line 110) | def _get_args_from_config(from_config_func, *args, **kwargs):

FILE: thirdparty/GLEE/glee/utils/it_contrastive.py
  function is_dist_initialized (line 5) | def is_dist_initialized():
  function get_world_size (line 8) | def get_world_size():
  function all_gather_grad (line 13) | def all_gather_grad(x):
  function all_gather_nograd (line 22) | def all_gather_nograd(tensor):
  function image_text_contrastive_loss (line 36) | def image_text_contrastive_loss(image_feat, text_feat, temperature, imag...
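
image_text_contrastive_loss (its signature is truncated above) presumably computes a symmetric image-text InfoNCE objective over features gathered across ranks via all_gather_grad. The sketch below shows only the generic symmetric objective, not the repository's exact code; the function name and the assumption that features are L2-normalized are mine.

import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(image_feat, text_feat, temperature):
    # image_feat, text_feat: (B, D), assumed L2-normalized.
    logits = image_feat @ text_feat.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(image_feat.size(0), device=image_feat.device)
    # Cross-entropy in both directions: image -> text and text -> image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2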

FILE: thirdparty/GLEE/glee/utils/misc.py
  function _max_by_axis (line 16) | def _max_by_axis(the_list):
  class NestedTensor (line 24) | class NestedTensor(object):
    method __init__ (line 25) | def __init__(self, tensors, mask: Optional[Tensor]):
    method to (line 29) | def to(self, device):
    method decompose (line 40) | def decompose(self):
    method __repr__ (line 43) | def __repr__(self):
  function nested_tensor_from_tensor_list (line 46) | def nested_tensor_from_tensor_list(tensor_list: List[Tensor]):
  function _collate_and_pad_divisibility (line 88) | def _collate_and_pad_divisibility(tensor_list: list, div=32):
  function _onnx_nested_tensor_from_tensor_list (line 122) | def _onnx_nested_tensor_from_tensor_list(tensor_list: List[Tensor]) -> N...
  function is_dist_avail_and_initialized (line 151) | def is_dist_avail_and_initialized():
  function get_iou (line 158) | def get_iou(gt_masks, pred_masks, ignore_label=-1):
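
NestedTensor is the familiar DETR container pairing a padded batch tensor with its padding mask, which nested_tensor_from_tensor_list builds from a list of variably sized images. A sketch of its core, following the DETR reference (details here may differ):

from typing import Optional
from torch import Tensor

class NestedTensor(object):
    def __init__(self, tensors, mask: Optional[Tensor]):
        self.tensors = tensors   # (B, C, H_max, W_max) padded batch
        self.mask = mask         # (B, H_max, W_max), True on padded pixels

    def to(self, device):
        cast_mask = self.mask.to(device) if self.mask is not None else None
        return NestedTensor(self.tensors.to(device), cast_mask)

    def decompose(self):
        return self.tensors, self.mask

    def __repr__(self):
        return str(self.tensors)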

FILE: thirdparty/GLEE/glee/utils/utils.py
  class MLP (line 11) | class MLP(nn.Module):
    method __init__ (line 14) | def __init__(self, input_dim, hidden_dim, output_dim, num_layers):
    method forward (line 20) | def forward(self, x):
  function inverse_sigmoid (line 26) | def inverse_sigmoid(x, eps=1e-5):
  function gen_encoder_output_proposals (line 33) | def gen_encoder_output_proposals(memory:Tensor, memory_padding_mask:Tens...
  function gen_sineembed_for_position (line 74) | def gen_sineembed_for_position(pos_tensor):
  function _get_activation_fn (line 103) | def _get_activation_fn(activation):
  function _get_clones (line 118) | def _get_clones(module, N, layer_share=False):
  function _get_clones_advanced (line 125) | def _get_clones_advanced(module, N, N_valid):
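
Two of the helpers listed for utils.py are ubiquitous in DETR-family decoders and worth spelling out: the small MLP used for prediction heads and inverse_sigmoid used to update box reference points in logit space. The sketch below follows the reference versions; this copy may differ slightly.

import torch
import torch.nn.functional as F
from torch import nn

class MLP(nn.Module):
    # Simple feed-forward head: num_layers linear layers with ReLU in between.
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers):
        super().__init__()
        self.num_layers = num_layers
        h = [hidden_dim] * (num_layers - 1)
        self.layers = nn.ModuleList(
            nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim]))

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x)
        return x

def inverse_sigmoid(x, eps=1e-5):
    # Numerically stable logit; maps coordinates in (0, 1) back to unbounded space.
    x = x.clamp(min=0, max=1)
    x1 = x.clamp(min=eps)
    x2 = (1 - x).clamp(min=eps)
    return torch.log(x1 / x2)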

Condensed preview — 72 files, each showing path, character count, and a content snippet (the full structured content totals about 602K chars).
[
  {
    "path": ".gitignore",
    "chars": 33,
    "preview": "./tmp\n*.pth\n*.pyc\n**/__pycache__\n"
  },
  {
    "path": "README.md",
    "chars": 4374,
    "preview": "# InstructNav\n\nEnabling robots to navigate following diverse language instructions in unexplored environments is an attr"
  },
  {
    "path": "config_utils.py",
    "chars": 5559,
    "preview": "import habitat\nfrom habitat.config.read_write import read_write\nfrom habitat.config.default_structured_configs import (\n"
  },
  {
    "path": "constants.py",
    "chars": 319,
    "preview": "from cv_utils.object_list import categories\nGLEE_CONFIG_PATH = \"./thirdparty/GLEE/configs/SwinL.yaml\"\nGLEE_CHECKPOINT_PA"
  },
  {
    "path": "cv_utils/glee_detector.py",
    "chars": 5098,
    "preview": "from thirdparty.GLEE.glee.models.glee_model import GLEE_Model\nfrom thirdparty.GLEE.glee.config_deeplab import add_deepla"
  },
  {
    "path": "cv_utils/image_percevior.py",
    "chars": 1184,
    "preview": "from constants import *\nfrom .glee_detector import *\nclass GLEE_Percevior:\n    def __init__(self,\n                 glee_"
  },
  {
    "path": "cv_utils/object_list.py",
    "chars": 6913,
    "preview": "categories = [\n  {\"id\": 1, \"name\": \"bed\"},\n  {\"id\": 2, \"name\": \"sofa\"},\n  {\"id\": 3, \"name\": \"chair\"},\n  {\"id\": 4, \"name\""
  },
  {
    "path": "llm_utils/gpt_request.py",
    "chars": 2329,
    "preview": "import os\nfrom openai import AzureOpenAI,OpenAI\nimport requests\nimport base64\nimport cv2\nimport numpy as np\nfrom mimetyp"
  },
  {
    "path": "llm_utils/nav_prompt.py",
    "chars": 3390,
    "preview": "CHAINON_PROMPT = \"You are a wheeled mobile robot working in an indoor environment.\\\nAnd you are required to finish a nav"
  },
  {
    "path": "mapper.py",
    "chars": 21615,
    "preview": "from mapping_utils.geometry import *\nfrom mapping_utils.preprocess import *\nfrom mapping_utils.projection import *\nfrom "
  },
  {
    "path": "mapping_utils/geometry.py",
    "chars": 5967,
    "preview": "import numpy as np\nimport open3d as o3d\nimport quaternion\nimport time\nimport torch\nimport cv2\n\ndef get_pointcloud_from_d"
  },
  {
    "path": "mapping_utils/path_planning.py",
    "chars": 1369,
    "preview": "import numpy as np\nimport cv2\nfrom pathfinding.core.diagonal_movement import DiagonalMovement\nfrom pathfinding.core.grid"
  },
  {
    "path": "mapping_utils/preprocess.py",
    "chars": 241,
    "preview": "import numpy as np\ndef preprocess_depth(depth:np.ndarray,lower_bound:float=0.1,upper_bound:float=4.9):\n    depth[np.wher"
  },
  {
    "path": "mapping_utils/projection.py",
    "chars": 4733,
    "preview": "import numpy as np\nimport open3d as o3d\nimport cv2\n# obstacle = 0\n# unknown = 1\n# position = 2\n# navigable = 3\n# frontie"
  },
  {
    "path": "mapping_utils/transform.py",
    "chars": 1724,
    "preview": "import numpy as np\nimport quaternion\ndef habitat_camera_intrinsic(config):\n    assert config.habitat.simulator.agents.ma"
  },
  {
    "path": "objnav_agent.py",
    "chars": 13800,
    "preview": "import habitat\nimport numpy as np\nimport cv2\nimport ast\nimport open3d as o3d\nfrom mapping_utils.geometry import *\nfrom m"
  },
  {
    "path": "objnav_benchmark.py",
    "chars": 2208,
    "preview": "import habitat\nimport os\nimport argparse\nimport csv\nfrom tqdm import tqdm\nfrom config_utils import hm3d_config,mp3d_conf"
  },
  {
    "path": "requirements.txt",
    "chars": 511,
    "preview": "apex==0.9.10dev\neinops==0.8.0\nfairscale==0.4.4\nfvcore==0.1.5.post20221221\nimageio==2.34.1\nmatplotlib==3.8.4\nMultiScaleDe"
  },
  {
    "path": "thirdparty/GLEE/configs/R50.yaml",
    "chars": 1840,
    "preview": "MODEL:\n  META_ARCHITECTURE: \"GLEE\"\n  MASK_ON: True\n  BACKBONE:\n    FREEZE_AT: 0\n    NAME: \"build_resnet_backbone\"\n  PIXE"
  },
  {
    "path": "thirdparty/GLEE/configs/SwinL.yaml",
    "chars": 2010,
    "preview": "MODEL:\n  META_ARCHITECTURE: \"GLEE\"\n  MASK_ON: True\n  BACKBONE:\n    NAME: \"D2SwinTransformer\"\n  SWIN:\n    EMBED_DIM: 192\n"
  },
  {
    "path": "thirdparty/GLEE/glee/__init__.py",
    "chars": 380,
    "preview": "from __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\n\nfrom .con"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/__init__.py",
    "chars": 149,
    "preview": "from .build import build_backbone\n\nfrom .resnet import *\nfrom .swin import *\n# from .focal import *\n# from .focal_dw imp"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/backbone.py",
    "chars": 1466,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\nimport torch.nn as nn\n\nfrom detectron2.modeling import ShapeSpec\n\n__a"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/build.py",
    "chars": 353,
    "preview": "from .registry import model_entrypoints\nfrom .registry import is_model\n\nfrom .backbone import *\n\ndef build_backbone(conf"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/davit.py",
    "chars": 20864,
    "preview": "import os\nimport itertools\nimport logging\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport tor"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/eva01.py",
    "chars": 25875,
    "preview": "import logging\nimport math\nfrom functools import partial\n\nimport fvcore.nn.weight_init as weight_init\nimport torch\nimpor"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/eva02-dino.py",
    "chars": 20996,
    "preview": "import logging\nimport math\nfrom functools import partial\n\nimport fvcore.nn.weight_init as weight_init\nimport torch\nimpor"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/eva02.py",
    "chars": 22632,
    "preview": "# --------------------------------------------------------\n# EVA02\n# ---------------------------------------------------"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/eva_01_utils.py",
    "chars": 7905,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved\nimport math\nimport numpy as np\nfrom scipy import "
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/eva_02_utils.py",
    "chars": 12410,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved\nimport math\nimport numpy as np\nfrom scipy import "
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/internimage.py",
    "chars": 27896,
    "preview": "# --------------------------------------------------------\n# InternImage\n# Copyright (c) 2022 OpenGVLab\n# Licensed under"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/registry.py",
    "chars": 344,
    "preview": "_model_entrypoints = {}\n\n\ndef register_backbone(fn):\n    module_name_split = fn.__module__.split('.')\n    model_name = m"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/resnet.py",
    "chars": 24866,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\nimport pickle\nimport numpy as np\nfrom typing import Any, Dict\nimport "
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/swin.py",
    "chars": 27846,
    "preview": "# --------------------------------------------------------\n# Swin Transformer\n# Copyright (c) 2021 Microsoft\n# Licensed "
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/vit.py",
    "chars": 16498,
    "preview": "import logging\nimport math\nimport fvcore.nn.weight_init as weight_init\nimport torch\nimport torch.nn as nn\n\nfrom detectro"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/vit_utils.py",
    "chars": 7906,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved\nimport math\nimport numpy as np\nfrom scipy import "
  },
  {
    "path": "thirdparty/GLEE/glee/config.py",
    "chars": 13626,
    "preview": "# -*- coding: utf-8 -*-\nfrom detectron2.config import CfgNode as CN\ndef add_glee_config(cfg):\n    \"\"\"\n    Add config for"
  },
  {
    "path": "thirdparty/GLEE/glee/config_deeplab.py",
    "chars": 1157,
    "preview": "# -*- coding: utf-8 -*-\n# Copyright (c) Facebook, Inc. and its affiliates.\n\n\ndef add_deeplab_config(cfg):\n    \"\"\"\n    Ad"
  },
  {
    "path": "thirdparty/GLEE/glee/models/glee_model.py",
    "chars": 15040,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved\n\"\"\"\n\"\"\"\nimport torch\nimport torch.nn.functional a"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/__init__.py",
    "chars": 47,
    "preview": "# Copyright (c) IDEA, Inc. and its affiliates.\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/early_fusion.py",
    "chars": 9841,
    "preview": "import torch\nimport torch.nn.functional as F\nfrom torch import nn\nfrom timm.models.layers import DropPath\n\n\n\n\nclass VLFu"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/maskdino_encoder.py",
    "chars": 20550,
    "preview": "# ------------------------------------------------------------------------\n# DINO\n# Copyright (c) 2022 IDEA. All Rights "
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/functions/__init__.py",
    "chars": 734,
    "preview": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# C"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/functions/ms_deform_attn_func.py",
    "chars": 3727,
    "preview": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# C"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/make.sh",
    "chars": 736,
    "preview": "#!/usr/bin/env bash\n# ------------------------------------------------------------------------------------------------\n#"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/modules/__init__.py",
    "chars": 720,
    "preview": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# C"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/modules/ms_deform_attn.py",
    "chars": 6805,
    "preview": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# C"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/setup.py",
    "chars": 3038,
    "preview": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# C"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/src/cpu/ms_deform_attn_cpu.cpp",
    "chars": 1399,
    "preview": "/*!\n**************************************************************************************************\n* Deformable DETR"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/src/cpu/ms_deform_attn_cpu.h",
    "chars": 1282,
    "preview": "/*!\n**************************************************************************************************\n* Deformable DETR"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/src/cuda/ms_deform_attn_cuda.cu",
    "chars": 7459,
    "preview": "/*!\n**************************************************************************************************\n* Deformable DETR"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/src/cuda/ms_deform_attn_cuda.h",
    "chars": 1283,
    "preview": "/*!\n**************************************************************************************************\n* Deformable DETR"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/src/cuda/ms_deform_im2col_cuda.cuh",
    "chars": 54837,
    "preview": "/*!\n**************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 Se"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/src/ms_deform_attn.h",
    "chars": 1981,
    "preview": "/*!\n**************************************************************************************************\n* Deformable DETR"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/src/vision.cpp",
    "chars": 942,
    "preview": "/*!\n**************************************************************************************************\n* Deformable DETR"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/test.py",
    "chars": 4223,
    "preview": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# C"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/position_encoding.py",
    "chars": 2723,
    "preview": "# ------------------------------------------------------------------------\n# Copyright (c) 2022 IDEA. All Rights Reserve"
  },
  {
    "path": "thirdparty/GLEE/glee/models/transformer_decoder/__init__.py",
    "chars": 94,
    "preview": "# Copyright (c) IDEA, Inc. and its affiliates.\nfrom .maskdino_decoder import MaskDINODecoder\n\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/transformer_decoder/dino_decoder.py",
    "chars": 16192,
    "preview": "# ------------------------------------------------------------------------\r\n# DINO\r\n# Copyright (c) 2022 IDEA. All Right"
  },
  {
    "path": "thirdparty/GLEE/glee/models/transformer_decoder/maskdino_decoder.py",
    "chars": 30566,
    "preview": "# ------------------------------------------------------------------------\n# DINO\n# Copyright (c) 2022 IDEA. All Rights "
  },
  {
    "path": "thirdparty/GLEE/glee/models/vos_utils.py",
    "chars": 12316,
    "preview": "import torch\nimport torch.nn.functional as F\nfrom torch import nn\nfrom timm.models.layers import DropPath\n\n\n\n\nclass VLFu"
  },
  {
    "path": "thirdparty/GLEE/glee/modules/__init__.py",
    "chars": 117,
    "preview": "from .position_encoding import *\nfrom .attention import *\nfrom .postprocessing import *\nfrom .point_features import *"
  },
  {
    "path": "thirdparty/GLEE/glee/modules/attention.py",
    "chars": 22876,
    "preview": "# Code copy from PyTorch, modified by Xueyan Zou\n\nimport warnings\nfrom typing import Optional, Tuple\n\nimport torch\nimpor"
  },
  {
    "path": "thirdparty/GLEE/glee/modules/point_features.py",
    "chars": 11822,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\nimport torch\nfrom torch.nn import functional as F\n\nfrom detectron2.la"
  },
  {
    "path": "thirdparty/GLEE/glee/modules/position_encoding.py",
    "chars": 2499,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\n## Modified by Bowen Cheng from: https://github.com/facebookresearch/"
  },
  {
    "path": "thirdparty/GLEE/glee/modules/postprocessing.py",
    "chars": 4900,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\nimport torch\nfrom torch.nn import functional as F\n\nfrom detectron2.st"
  },
  {
    "path": "thirdparty/GLEE/glee/utils/__init__.py",
    "chars": 95,
    "preview": "from .config import *\nfrom .misc import *\nfrom .box_ops import *\nfrom .it_contrastive import *\n"
  },
  {
    "path": "thirdparty/GLEE/glee/utils/box_ops.py",
    "chars": 2693,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved\n\"\"\"\nUtilities for bounding box manipulation and G"
  },
  {
    "path": "thirdparty/GLEE/glee/utils/config.py",
    "chars": 5152,
    "preview": "# -*- coding: utf-8 -*-\n# Copyright (c) Facebook, Inc. and its affiliates.\n\nimport functools\nimport inspect\n\ndef configu"
  },
  {
    "path": "thirdparty/GLEE/glee/utils/it_contrastive.py",
    "chars": 2076,
    "preview": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\ndef is_dist_initialized():\n    return torch.distribu"
  },
  {
    "path": "thirdparty/GLEE/glee/utils/misc.py",
    "chars": 6466,
    "preview": "# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/facebookresearch/de"
  },
  {
    "path": "thirdparty/GLEE/glee/utils/utils.py",
    "chars": 5046,
    "preview": "import torch\nimport copy\nfrom torch import nn, Tensor\nimport os\n\nimport math\nimport torch.nn.functional as F\nfrom torch "
  }
]

About this extraction

This page contains the full source code of the LYX0501/InstructNav GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 72 files (565.1 KB), approximately 145.9k tokens, and a symbol index with 498 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract, a free GitHub repo-to-text converter for AI. Built by Nikandr Surkov.
