[
  {
    "path": ".gitignore",
    "content": "./tmp\n*.pth\n*.pyc\n**/__pycache__\n"
  },
  {
    "path": "README.md",
    "content": "# InstructNav\n\nEnabling robots to navigate following diverse language instructions in unexplored environments is an attractive goal for human-robot interaction. In this work, we propose InstructNav, a generic instruction navigation system. InstructNav makes the first endeavor to handle various instruction navigation tasks without any navigation training or pre-built maps. To reach this goal, we introduce **Dynamic Chain-of-Navigation (DCoN)** to unify the planning process for different types of navigation instructions. Furthermore, we propose **Multi-sourced Value Maps** to model key elements in instruction navigation so that linguistic DCoN planning can be converted into robot actionable trajectories. \n\n[InstructNav](https://github.com/LYX0501/InstructNav/blob/main/InstructNav.png)\n\nWith InstructNav, we complete the R2R-CE task in a zero-shot way for the first time and outperform many task-training methods. Besides, InstructNav also surpasses the previous SOTA method by 10.48% on the zero-shot Habitat ObjNav and by 86.34% on demand-driven navigation DDN. Real robot experiments on diverse indoor scenes further demonstrate our method's robustness in coping with the environment and instruction variations. Please refer to more details in our paper: \n![InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment](https://arxiv.org/abs/2406.04882).\n## 🔥 News\n- 2024.9.11: The HM3D objnav benchmark code is released.\n- 2024.9.5: Our paper is accepted by CoRL 2024. Codes will be released in the recent.\n\n### Dependency ###\nOur project is based on the [habitat-sim](https://github.com/facebookresearch/habitat-sim?tab=readme-ov-file), [habitat-lab](https://github.com/facebookresearch/habitat-lab), and [Detectron2](https://github.com/facebookresearch/detectron2). Please follow the guides to install them in your python environment. You can directly install the latest version of habitat-lab and habitat-sim. And make sure you have properly download the navigation scenes [(HM3D, MP3D)](https://github.com/facebookresearch/habitat-lab/blob/main/DATASETS.md) and the episode dataset for both visual-language navigation (VLN-CE) and object navigation.\n\n### Installation ###\nFirstly, clone our repo as:\n```\ngit clone https://github.com/LYX0501/InstructNav.git\ncd InstructNav\npip install -r requirements.txt\n```\nOur method depends on an open-vocalbulary detection and segmentation model [GLEE](https://github.com/FoundationVision/GLEE). Please check the original repo or try to use the copy located in the ./thirdparty/ directory.\n### \n\n### Prepare your GPT4 and GPT4V API Keys ###\nPlease prepare your keys for calling the API for large-language model and large vision-language model.\nWe prefer to use the GPT4 and GPT4V to do the inference. And our code follows the AzureOpenAI calling process.\nBefore running the benchmark, you should prepare for your own api-keys and api-endpoint and api-version. 
The full request logic can be found in ./llm_utils/gpt_request.py.\n```\nexport GPT4_API_BASE=<YOUR_GPT4_ENDPOINT>\nexport GPT4_API_KEY=<YOUR_GPT4_KEY>\nexport GPT4_API_DEPLOY=<GPT4_MODEL_NAME>\nexport GPT4_API_VERSION=<GPT4_MODEL_VERSION>\nexport GPT4V_API_BASE=<YOUR_GPT4V_ENDPOINT>\nexport GPT4V_API_KEY=<YOUR_GPT4V_KEY>\nexport GPT4V_API_DEPLOY=<GPT4V_MODEL_NAME>\nexport GPT4V_API_VERSION=<GPT4V_MODEL_VERSION>\n```\n\n### Running our Benchmark Code ###\nIf everything goes well, you can directly run the evaluation code for the different navigation tasks. For example:\n```\npython objnav_benchmark.py\n```\nAll episode results and intermediate results, such as GPT4 inputs/outputs and value maps, will be saved in the /tmp/ directory. The agent's real-time first-person RGB, depth, and segmentation observations will be saved in the project root directory. Examples are shown below:\n![test](https://github.com/user-attachments/assets/51a65b07-70e2-49f3-a850-815b0ec151d0)\n\nhttps://github.com/user-attachments/assets/04e37b91-c524-4c51-86d1-8fb72325f612\n\n## BibTeX\nPlease cite our paper if you find it helpful :)\n```\n@misc{InstructNav,\n      title={InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment}, \n      author={Yuxing Long and Wenzhe Cai and Hongcheng Wang and Guanqi Zhan and Hao Dong},\n      year={2024},\n      eprint={2406.04882},\n      archivePrefix={arXiv},\n      primaryClass={cs.RO},\n}\n```\n"
  },
  {
    "path": "config_utils.py",
    "content": "import habitat\nfrom habitat.config.read_write import read_write\nfrom habitat.config.default_structured_configs import (\n    CollisionsMeasurementConfig,\n    FogOfWarConfig,\n    TopDownMapMeasurementConfig,\n)\nHM3D_CONFIG_PATH = \"<YOUR SAVE PATH>/habitat-lab/habitat-lab/habitat/config/benchmark/nav/objectnav/objectnav_hm3d.yaml\"\nMP3D_CONFIG_PATH = \"<YOUR SAVE PATH>/habitat-lab/habitat-lab/habitat/config/benchmark/nav/objectnav/objectnav_mp3d.yaml\"\nR2R_CONFIG_PATH = \"<YOUR SAVE PATH>/habitat-lab/habitat-lab/habitat/config/benchmark/nav/vln_r2r.yaml\"\n\ndef hm3d_config(path:str=HM3D_CONFIG_PATH,stage:str='val',episodes=200):\n    habitat_config = habitat.get_config(path)\n    with read_write(habitat_config):\n        habitat_config.habitat.dataset.split = stage\n        habitat_config.habitat.dataset.scenes_dir = \"/home/PJLAB/caiwenzhe/Desktop/dataset/scenes\"\n        habitat_config.habitat.dataset.data_path = \"/home/PJLAB/caiwenzhe/Desktop/dataset/habitat_task/objectnav/hm3d/v2/{split}/{split}.json.gz\"\n        habitat_config.habitat.simulator.scene_dataset = \"/home/PJLAB/caiwenzhe/Desktop/dataset/scenes/hm3d_v0.2/hm3d_annotated_basis.scene_dataset_config.json\"\n        habitat_config.habitat.environment.iterator_options.num_episode_sample = episodes\n        habitat_config.habitat.task.measurements.update(\n        {\n            \"top_down_map\": TopDownMapMeasurementConfig(\n                map_padding=3,\n                map_resolution=1024,\n                draw_source=True,\n                draw_border=True,\n                draw_shortest_path=False,\n                draw_view_points=True,\n                draw_goal_positions=True,\n                draw_goal_aabbs=True,\n                fog_of_war=FogOfWarConfig(\n                    draw=True,\n                    visibility_dist=5.0,\n                    fov=90,\n                ),\n            ),\n            \"collisions\": CollisionsMeasurementConfig(),\n        })\n        habitat_config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.max_depth=5.0\n        habitat_config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.normalize_depth=False\n        habitat_config.habitat.task.measurements.success.success_distance = 0.25\n    return habitat_config\n    \ndef mp3d_config(path:str=MP3D_CONFIG_PATH,stage:str='val',episodes=200):\n    habitat_config = habitat.get_config(path)\n    with read_write(habitat_config):\n        habitat_config.habitat.dataset.split = stage\n        habitat_config.habitat.dataset.scenes_dir = \"/home/PJLAB/caiwenzhe/Desktop/dataset/scenes\"\n        habitat_config.habitat.dataset.data_path = \"/home/PJLAB/caiwenzhe/Desktop/dataset/habitat_task/objectnav/mp3d/v1/{split}/{split}.json.gz\"\n        habitat_config.habitat.simulator.scene_dataset = \"/home/PJLAB/caiwenzhe/Desktop/dataset/scenes/mp3d/mp3d.scene_dataset_config.json\"\n        habitat_config.habitat.environment.iterator_options.num_episode_sample = episodes\n        habitat_config.habitat.task.measurements.update(\n        {\n            \"top_down_map\": TopDownMapMeasurementConfig(\n                map_padding=3,\n                map_resolution=1024,\n                draw_source=True,\n                draw_border=True,\n                draw_shortest_path=False,\n                draw_view_points=True,\n                draw_goal_positions=True,\n                draw_goal_aabbs=True,\n                fog_of_war=FogOfWarConfig(\n                    draw=True,\n                    
visibility_dist=5.0,\n                    fov=79,\n                ),\n            ),\n            \"collisions\": CollisionsMeasurementConfig(),\n        })\n        habitat_config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.max_depth=5.0\n        habitat_config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.normalize_depth=False\n        habitat_config.habitat.task.measurements.success.success_distance = 0.25\n    return habitat_config\n\ndef r2r_config(path:str=R2R_CONFIG_PATH,stage:str='val_seen',episodes=200):\n    habitat_config = habitat.get_config(path)\n    with read_write(habitat_config):\n        habitat_config.habitat.dataset.split = stage\n        habitat_config.habitat.dataset.scenes_dir = \"/home/PJLAB/caiwenzhe/Desktop/dataset/scenes\"\n        habitat_config.habitat.dataset.data_path = \"/home/PJLAB/caiwenzhe/Desktop/dataset/habitat_task/vln/r2r/{split}/{split}.json.gz\"\n        habitat_config.habitat.simulator.scene_dataset = \"/home/PJLAB/caiwenzhe/Desktop/dataset/scenes/mp3d/mp3d.scene_dataset_config.json\"\n        habitat_config.habitat.environment.iterator_options.num_episode_sample = episodes\n        habitat_config.habitat.task.measurements.update(\n        {\n            \"top_down_map\": TopDownMapMeasurementConfig(\n                map_padding=3,\n                map_resolution=1024,\n                draw_source=True,\n                draw_border=True,\n                draw_shortest_path=False,\n                draw_view_points=True,\n                draw_goal_positions=True,\n                draw_goal_aabbs=True,\n                fog_of_war=FogOfWarConfig(\n                    draw=True,\n                    visibility_dist=5.0,\n                    fov=79,\n                ),\n            ),\n            \"collisions\": CollisionsMeasurementConfig(),\n        })  \n        habitat_config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.max_depth=5.0\n        habitat_config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.normalize_depth=False\n        habitat_config.habitat.task.measurements.success.success_distance = 0.25\n    return habitat_config\n"
  },
  {
    "path": "constants.py",
    "content": "from cv_utils.object_list import categories\nGLEE_CONFIG_PATH = \"./thirdparty/GLEE/configs/SwinL.yaml\"\nGLEE_CHECKPOINT_PATH = \"./thirdparty/GLEE/GLEE_SwinL_Scaleup10m.pth\"\nDETECT_OBJECTS = [[cat['name'].lower()] for cat in categories]\nINTEREST_OBJECTS = ['bed','chair','toilet','potted_plant','television_set','sofa']\n\n\n"
  },
  {
    "path": "cv_utils/glee_detector.py",
    "content": "from thirdparty.GLEE.glee.models.glee_model import GLEE_Model\nfrom thirdparty.GLEE.glee.config_deeplab import add_deeplab_config\nfrom thirdparty.GLEE.glee.config import add_glee_config\nfrom habitat_sim.utils.common import d3_40_colors_rgb\nfrom constants import *\nfrom detectron2.config import get_cfg\nfrom .object_list import categories as CATEGORIES\nimport torch\nimport torch.nn.functional as F\nimport torchvision\nimport cv2\nimport numpy as np\nCATEGORIES = [cat['name'].lower() for cat in CATEGORIES]\n\ndef initialize_glee(glee_config=GLEE_CONFIG_PATH,\n                    glee_checkpoint=GLEE_CHECKPOINT_PATH,\n                    device=\"cuda:0\"):\n    cfg_swin = get_cfg()\n    add_deeplab_config(cfg_swin)\n    add_glee_config(cfg_swin)\n    conf_files_swin = glee_config\n    checkpoints_swin = torch.load(glee_checkpoint) \n    cfg_swin.merge_from_file(conf_files_swin)\n    GLEEmodel_swin = GLEE_Model(cfg_swin, None, device, None, True).to(device)\n    GLEEmodel_swin.load_state_dict(checkpoints_swin, strict=False)\n    GLEEmodel_swin.eval()\n    return GLEEmodel_swin\n\n# prompt_mode=\"categories\", \n# results_select=[\"box\", \"mask\", \"name\", \"score\"],\n\ndef glee_segmentation(img, \n                      GLEEmodel, \n                      custom_category=CATEGORIES,\n                      num_inst_select=15,\n                      threshold_select=0.2,\n                      device=\"cuda:0\"):        \n    pixel_mean = torch.Tensor([123.675, 116.28, 103.53]).to(device).view(3, 1, 1)\n    pixel_std = torch.Tensor([58.395, 57.12, 57.375]).to(device).view(3, 1, 1)\n    normalizer = lambda x: (x - pixel_mean) / pixel_std\n    ori_image = torch.as_tensor(np.ascontiguousarray(img.transpose(2, 0, 1)))\n    ori_image = normalizer(ori_image.to(device))[None,]\n    _,_, ori_height, ori_width = ori_image.shape\n    resizer = torchvision.transforms.Resize(800)\n    resize_image = resizer(ori_image)\n    image_size = torch.as_tensor((resize_image.shape[-2],resize_image.shape[-1]))\n    re_size = resize_image.shape[-2:]\n    stride = 32\n    # the last two dims are H,W, both subject to divisibility requirement\n    padding_size = ((image_size + (stride - 1)).div(stride, rounding_mode=\"floor\") * stride).tolist()\n    infer_image = torch.zeros(1,3,padding_size[0],padding_size[1]).to(resize_image)\n    infer_image[0,:,:image_size[0],:image_size[1]] = resize_image\n    batch_category_name = custom_category\n    prompt_list = []\n    with torch.no_grad():\n        (outputs,_) = GLEEmodel(infer_image, prompt_list, task=\"coco\", batch_name_list=batch_category_name, is_train=False)\n    topK_instance = max(num_inst_select,1)\n    bbox_pred = outputs['pred_boxes'][0]\n    bbox_pred[:,0],bbox_pred[:,2] = bbox_pred[:,0] * img.shape[1] - bbox_pred[:,2] * img.shape[1] * 0.5, bbox_pred[:,0] * img.shape[1] + bbox_pred[:,2] * img.shape[1] * 0.5\n    bbox_pred[:,1],bbox_pred[:,3] = bbox_pred[:,1] * img.shape[0] - bbox_pred[:,3] * img.shape[0] * 0.5, bbox_pred[:,1] * img.shape[0] + bbox_pred[:,3] * img.shape[0] * 0.5\n    mask_pred = outputs['pred_masks'][0]\n    mask_cls = outputs['pred_logits'][0]\n    scores = mask_cls.sigmoid().max(-1)[0]\n    scores_per_image, topk_indices = scores.topk(topK_instance, sorted=True)\n    valid = scores_per_image>threshold_select\n    topk_indices = topk_indices[valid]\n    scores_per_image = scores_per_image[valid]\n    pred_class = mask_cls[topk_indices].max(-1)[1].tolist()\n    if len(pred_class) == 0: \n        return [], [], [], []\n    
mask_pred = mask_pred[topk_indices]\n    bbox_pred = bbox_pred[topk_indices].cpu().numpy()\n    pred_masks = F.interpolate( mask_pred[None,], size=(padding_size[0], padding_size[1]), mode=\"bilinear\", align_corners=False)\n    pred_masks = pred_masks[:,:,:re_size[0],:re_size[1]]\n    pred_masks = F.interpolate( pred_masks, size=(ori_height,ori_width), mode=\"bilinear\", align_corners=False  )\n    pred_masks = (pred_masks>0).detach().cpu().numpy()[0]\n    return bbox_pred, pred_masks, np.array(batch_category_name)[pred_class], scores_per_image\n\ndef visualize_segmentation(image,classes,masks):\n    copy_image = image.copy()\n    label_classes = np.unique(classes)\n    for cls,mask in zip(classes,masks):\n        if len(np.unique(mask)) != 2: continue\n        copy_image[np.where(mask == 1)] = d3_40_colors_rgb[label_classes.tolist().index(cls)]\n        x, y = int(np.mean(np.where(mask)[1])), int(np.mean(np.where(mask)[0])) \n        cv2.putText(copy_image, str(cls), (x, y), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255,255,255), 2)\n    ret_image = cv2.addWeighted(image,0.2,copy_image,0.8,0)\n    return ret_image\n\ndef visualize_detection(image,classes,bboxes):\n    copy_image = image.copy()\n    label_classes = np.unique(classes)\n    for cls,bbox in zip(classes,bboxes):\n        color = d3_40_colors_rgb[label_classes.tolist().index(cls)%40]\n        copy_image = cv2.rectangle(copy_image,(int(bbox[0]),int(bbox[1])),(int(bbox[2]),int(bbox[3])),color.tolist(),2)\n        x, y = int(bbox[0]*0.5+bbox[2]*0.5), int(bbox[1]*0.5+bbox[3]*0.5)\n        cv2.putText(copy_image, str(cls), (x, y), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255,255,255), 2)\n    return copy_image\n\n\n\n    "
  },
  {
    "path": "cv_utils/image_percevior.py",
    "content": "from constants import *\nfrom .glee_detector import *\nclass GLEE_Percevior:\n    def __init__(self,\n                 glee_config=GLEE_CONFIG_PATH,\n                 glee_checkpoint=GLEE_CHECKPOINT_PATH,\n                 device = \"cuda:0\"):\n        self.device = device\n        self.glee_model = initialize_glee(glee_config,glee_checkpoint,device)\n    def perceive(self,image,confidence_threshold=0.25,area_threshold=2500):\n        pred_bboxes, pred_masks, pred_class, pred_confidence = glee_segmentation(image,self.glee_model,threshold_select=confidence_threshold,device=self.device)\n        try:\n            mask_area = np.array([mask.sum() for mask in pred_masks])\n            bbox_trust = np.array([(bbox[0] > 20) & (bbox[2] < image.shape[1] - 20) for bbox in pred_bboxes])\n            visualization = visualize_segmentation(image,pred_class[(mask_area>area_threshold) & bbox_trust],pred_masks[(mask_area>area_threshold) & bbox_trust])\n            return pred_class[(mask_area>area_threshold) & bbox_trust],pred_masks[(mask_area>area_threshold) & bbox_trust],pred_confidence[(mask_area>area_threshold) & bbox_trust],[visualization]\n        except:\n            return [],[],[],[image]\n"
  },
  {
    "path": "cv_utils/object_list.py",
    "content": "categories = [\n  {\"id\": 1, \"name\": \"bed\"},\n  {\"id\": 2, \"name\": \"sofa\"},\n  {\"id\": 3, \"name\": \"chair\"},\n  {\"id\": 4, \"name\": \"television_set\"},\n  {\"id\": 5, \"name\": \"potted_plant\"},\n  {\"id\": 6, \"name\": \"toilet\"},\n  {\"id\": 7, \"name\": \"lamp\"},\n  {\"id\": 8, \"name\": \"desk\"},\n  {\"id\": 9, \"name\": \"bookshelf\"},\n  {\"id\": 10, \"name\": \"cupboard\"},\n  {\"id\": 11, \"name\": \"drawer\"},\n  {\"id\": 12, \"name\": \"refrigerator\"},\n  {\"id\": 13, \"name\": \"oven\"},\n  {\"id\": 14, \"name\": \"microwave\"},\n  {\"id\": 15, \"name\": \"toaster\"},\n  {\"id\": 16, \"name\": \"sink\"},\n  {\"id\": 17, \"name\": \"dishwasher\"},\n  {\"id\": 18, \"name\": \"coffee_machine\"},\n  {\"id\": 19, \"name\": \"kettle\"},\n  {\"id\": 20, \"name\": \"stove\"},\n  {\"id\": 21, \"name\": \"washing_machine\"},\n  {\"id\": 22, \"name\": \"dryer\"},\n  {\"id\": 23, \"name\": \"mirror\"},\n  {\"id\": 24, \"name\": \"clock\"},\n  {\"id\": 25, \"name\": \"curtains\"},\n  {\"id\": 26, \"name\": \"blinds\"},\n  {\"id\": 27, \"name\": \"bathtub\"},\n  {\"id\": 28, \"name\": \"shower\"},\n  {\"id\": 29, \"name\": \"table\"},\n  {\"id\": 30, \"name\": \"towel\"},\n  {\"id\": 31, \"name\": \"soap_dispenser\"},\n  {\"id\": 32, \"name\": \"toothbrush\"},\n  {\"id\": 33, \"name\": \"toothpaste\"},\n  {\"id\": 34, \"name\": \"shampoo\"},\n  {\"id\": 35, \"name\": \"conditioner\"},\n  {\"id\": 36, \"name\": \"hair_dryer\"},\n  {\"id\": 37, \"name\": \"razor\"},\n  {\"id\": 38, \"name\": \"makeup\"},\n  {\"id\": 39, \"name\": \"tissue_box\"},\n  {\"id\": 40, \"name\": \"trash_can\"},\n  {\"id\": 41, \"name\": \"vacuum_cleaner\"},\n  {\"id\": 42, \"name\": \"mop\"},\n  {\"id\": 43, \"name\": \"broom\"},\n  {\"id\": 44, \"name\": \"bucket\"},\n  {\"id\": 45, \"name\": \"sponge\"},\n  {\"id\": 46, \"name\": \"detergent\"},\n  {\"id\": 47, \"name\": \"iron\"},\n  {\"id\": 48, \"name\": \"ironing_board\"},\n  {\"id\": 49, \"name\": \"laundry_basket\"},\n  {\"id\": 50, \"name\": \"clothes_hanger\"},\n  {\"id\": 51, \"name\": \"coat_rack\"},\n  {\"id\": 52, \"name\": \"shoe_rack\"},\n  {\"id\": 53, \"name\": \"umbrella\"},\n  {\"id\": 54, \"name\": \"fire_extinguisher\"},\n  {\"id\": 55, \"name\": \"first_aid_kit\"},\n  {\"id\": 56, \"name\": \"thermometer\"},\n  {\"id\": 57, \"name\": \"scale\"},\n  {\"id\": 58, \"name\": \"fan\"},\n  {\"id\": 59, \"name\": \"heater\"},\n  {\"id\": 60, \"name\": \"air_conditioner\"},\n  {\"id\": 61, \"name\": \"humidifier\"},\n  {\"id\": 62, \"name\": \"dehumidifier\"},\n  {\"id\": 63, \"name\": \"light_switch\"},\n  {\"id\": 64, \"name\": \"electrical_outlet\"},\n  {\"id\": 65, \"name\": \"extension_cord\"},\n  {\"id\": 66, \"name\": \"remote_control\"},\n  {\"id\": 67, \"name\": \"game_console\"},\n  {\"id\": 68, \"name\": \"router\"},\n  {\"id\": 69, \"name\": \"modem\"},\n  {\"id\": 70, \"name\": \"computer\"},\n  {\"id\": 71, \"name\": \"laptop\"},\n  {\"id\": 72, \"name\": \"printer\"},\n  {\"id\": 73, \"name\": \"scanner\"},\n  {\"id\": 74, \"name\": \"fax_machine\"},\n  {\"id\": 75, \"name\": \"telephone\"},\n  {\"id\": 76, \"name\": \"smartphone\"},\n  {\"id\": 77, \"name\": \"tablet\"},\n  {\"id\": 78, \"name\": \"keyboard\"},\n  {\"id\": 79, \"name\": \"mouse\"},\n  {\"id\": 80, \"name\": \"monitor\"},\n  {\"id\": 81, \"name\": \"notebook\"},\n  {\"id\": 82, \"name\": \"pen\"},\n  {\"id\": 83, \"name\": \"pencil\"},\n  {\"id\": 84, \"name\": \"eraser\"},\n  {\"id\": 85, \"name\": \"stapler\"},\n  {\"id\": 86, 
\"name\": \"scissors\"},\n  {\"id\": 87, \"name\": \"tape_dispenser\"},\n  {\"id\": 88, \"name\": \"paper_clip\"},\n  {\"id\": 89, \"name\": \"envelope\"},\n  {\"id\": 90, \"name\": \"letter_opener\"},\n  {\"id\": 91, \"name\": \"cabinet\"},\n  {\"id\": 92, \"name\": \"whiteboard\"},\n  {\"id\": 93, \"name\": \"calendar\"},\n  {\"id\": 94, \"name\": \"photo_frame\"},\n  {\"id\": 95, \"name\": \"vase\"},\n  {\"id\": 96, \"name\": \"candle\"},\n  {\"id\": 97, \"name\": \"incense\"},\n  {\"id\": 98, \"name\": \"book\"},\n  {\"id\": 99, \"name\": \"magazine\"},\n  {\"id\": 100, \"name\": \"newspaper\"},\n  {\"id\": 101, \"name\": \"album\"},\n  {\"id\": 102, \"name\": \"record_player\"},\n  {\"id\": 103, \"name\": \"cd_player\"},\n  {\"id\": 104, \"name\": \"dvd_player\"},\n  {\"id\": 105, \"name\": \"blu_ray_player\"},\n  {\"id\": 106, \"name\": \"speaker\"},\n  {\"id\": 107, \"name\": \"headphones\"},\n  {\"id\": 108, \"name\": \"microphone\"},\n  {\"id\": 109, \"name\": \"camera\"},\n  {\"id\": 110, \"name\": \"camcorder\"},\n  {\"id\": 111, \"name\": \"tripod\"},\n  {\"id\": 112, \"name\": \"flashlight\"},\n  {\"id\": 113, \"name\": \"batteries\"},\n  {\"id\": 114, \"name\": \"charger\"},\n  {\"id\": 115, \"name\": \"cable\"},\n  {\"id\": 116, \"name\": \"usb_drive\"},\n  {\"id\": 117, \"name\": \"hard_drive\"},\n  {\"id\": 118, \"name\": \"router\"},\n  {\"id\": 119, \"name\": \"switch\"},\n  {\"id\": 120, \"name\": \"firewall\"},\n  {\"id\": 121, \"name\": \"server\"},\n  {\"id\": 122, \"name\": \"keyboard_tray\"},\n  {\"id\": 123, \"name\": \"mouse_pad\"},\n  {\"id\": 124, \"name\": \"speaker_stand\"},\n  {\"id\": 125, \"name\": \"monitor_stand\"},\n  {\"id\": 126, \"name\": \"file_folder\"},\n  {\"id\": 127, \"name\": \"binder\"},\n  {\"id\": 128, \"name\": \"clipboard\"},\n  {\"id\": 129, \"name\": \"calculator\"},\n  {\"id\": 130, \"name\": \"label_maker\"},\n  {\"id\": 131, \"name\": \"hole_punch\"},\n  {\"id\": 132, \"name\": \"paper_shredder\"},\n  {\"id\": 133, \"name\": \"post_it_note\"},\n  {\"id\": 134, \"name\": \"thumbtack\"},\n  {\"id\": 135, \"name\": \"magnet\"},\n  {\"id\": 136, \"name\": \"ruler\"},\n  {\"id\": 137, \"name\": \"protractor\"},\n  {\"id\": 138, \"name\": \"compass\"},\n  {\"id\": 139, \"name\": \"glue\"},\n  {\"id\": 140, \"name\": \"white_out\"},\n  {\"id\": 141, \"name\": \"marker\"},\n  {\"id\": 142, \"name\": \"highlighter\"},\n  {\"id\": 143, \"name\": \"crayon\"},\n  {\"id\": 144, \"name\": \"paint\"},\n  {\"id\": 145, \"name\": \"paintbrush\"},\n  {\"id\": 146, \"name\": \"easel\"},\n  {\"id\": 147, \"name\": \"canvas\"},\n  {\"id\": 148, \"name\": \"palette\"},\n  {\"id\": 149, \"name\": \"sculpting_tools\"},\n  {\"id\": 150, \"name\": \"clay\"},\n  {\"id\": 151, \"name\": \"sewing_machine\"},\n  {\"id\": 152, \"name\": \"thread\"},\n  {\"id\": 153, \"name\": \"needle\"},\n  {\"id\": 154, \"name\": \"scissors\"},\n  {\"id\": 155, \"name\": \"fabric\"},\n  {\"id\": 156, \"name\": \"measuring_tape\"},\n  {\"id\": 157, \"name\": \"pin_cushion\"},\n  {\"id\": 158, \"name\": \"thimble\"},\n  {\"id\": 159, \"name\": \"seam_ripper\"},\n  {\"id\": 160, \"name\": \"iron\"},\n  {\"id\": 161, \"name\": \"pattern\"},\n  {\"id\": 162, \"name\": \"ribbon\"},\n  {\"id\": 163, \"name\": \"button\"},\n  {\"id\": 164, \"name\": \"zipper\"},\n  {\"id\": 165, \"name\": \"hook\"},\n  {\"id\": 166, \"name\": \"stairs\"},\n  {\"id\": 167, \"name\": \"snap\"},\n  {\"id\": 168, \"name\": \"velcro\"},\n  {\"id\": 169, \"name\": \"elastic\"},\n  {\"id\": 170, \"name\": 
\"lace\"},\n  {\"id\": 171, \"name\": \"trim\"},\n  {\"id\": 172, \"name\": \"bead\"},\n  {\"id\": 173, \"name\": \"sequin\"},\n  {\"id\": 174, \"name\": \"glue_gun\"},\n  {\"id\": 175, \"name\": \"glue_stick\"},\n  {\"id\": 176, \"name\": \"craft_knife\"},\n  {\"id\": 177, \"name\": \"cutting_mat\"},\n  {\"id\": 178, \"name\": \"ruler\"},\n  {\"id\": 179, \"name\": \"scalpel\"},\n  {\"id\": 180, \"name\": \"tweezers\"},\n  {\"id\": 181, \"name\": \"pliers\"},\n  {\"id\": 182, \"name\": \"hammer\"},\n  {\"id\": 183, \"name\": \"screwdriver\"},\n  {\"id\": 184, \"name\": \"wrench\"},\n  {\"id\": 185, \"name\": \"drill\"},\n  {\"id\": 186, \"name\": \"saw\"},\n  {\"id\": 187, \"name\": \"chisel\"},\n  {\"id\": 188, \"name\": \"level\"},\n  {\"id\": 189, \"name\": \"tape_measure\"},\n  {\"id\": 190, \"name\": \"toolbox\"},\n  {\"id\": 191, \"name\": \"nail\"},\n  {\"id\": 192, \"name\": \"screw\"},\n  {\"id\": 193, \"name\": \"bolt\"},\n  {\"id\": 194, \"name\": \"nut\"},\n  {\"id\": 195, \"name\": \"washer\"},\n  {\"id\": 196, \"name\": \"sandpaper\"},\n  {\"id\": 197, \"name\": \"wood_glue\"},\n  {\"id\": 198, \"name\": \"clamp\"},\n  {\"id\": 199, \"name\": \"vise\"},\n  {\"id\": 200, \"name\": \"workbench\"}\n]"
  },
  {
    "path": "llm_utils/gpt_request.py",
    "content": "import os\nfrom openai import AzureOpenAI,OpenAI\nimport requests\nimport base64\nimport cv2\nimport numpy as np\nfrom mimetypes import guess_type\n\ngpt4_api_base = os.environ['GPT4_API_BASE']\ngpt4_api_key = os.environ['GPT4_API_KEY']\ngpt4v_api_base = os.environ['GPT4V_API_BASE']\ngpt4v_api_key = os.environ['GPT4V_API_KEY']\n\ndeployment_name = os.environ['GPT4_API_DEPLOY']\napi_version = os.environ['GPT4_API_VERSION']\ngpt4_client = AzureOpenAI(\n    api_key=gpt4_api_key,  \n    api_version=api_version,\n    base_url=f\"{gpt4_api_base}/openai/deployments/{deployment_name}\"\n)\n\ndeployment_name = os.environ['GPT4V_API_DEPLOY']\napi_version = os.environ['GPT4V_API_VERSION']\ngpt4v_client = AzureOpenAI(\n    api_key=gpt4v_api_key,  \n    api_version=api_version,\n    base_url=f\"{gpt4v_api_base}/openai/deployments/{deployment_name}\")\n\ndef local_image_to_data_url(image):\n    if isinstance(image,str):\n        mime_type, _ = guess_type(image)\n        with open(image, \"rb\") as image_file:\n            base64_encoded_data = base64.b64encode(image_file.read()).decode('utf-8')\n        return f\"data:{mime_type};base64,{base64_encoded_data}\"\n    elif isinstance(image,np.ndarray):\n        base64_encoded_data = base64.b64encode(cv2.imencode('.jpg',image)[1]).decode('utf-8')\n        return f\"data:image/jpeg;base64,{base64_encoded_data}\"\n\ndef gptv_response(text_prompt,image_prompt,system_prompt=\"\"):\n    prompt = [{'role':'system','content':system_prompt},\n             {'role':'user','content':[{'type':'text','text':text_prompt},\n                                       {'type':'image_url','image_url':{'url':local_image_to_data_url(image_prompt)}}]}]\n    response = gpt4v_client.chat.completions.create(model=deployment_name,\n                                                    messages=prompt,\n                                                    max_tokens=1000)\n    return response.choices[0].message.content\n\ndef gpt_response(text_prompt,system_prompt=\"\"):\n    prompt = [{'role':'system','content':system_prompt},\n              {'role':'user','content':[{'type':'text','text':text_prompt}]}]\n    response = gpt4_client.chat.completions.create(model=deployment_name,\n                                              messages=prompt,\n                                              max_tokens=1000)\n    return response.choices[0].message.content\n\n\n"
  },
  {
    "path": "llm_utils/nav_prompt.py",
    "content": "CHAINON_PROMPT = \"You are a wheeled mobile robot working in an indoor environment.\\\nAnd you are required to finish a navigation task indicated by a human instruction in a new house.\\\nYour task is to make a navigation plan for finishing the task as soon as possible.\\\nThe navigation plan should be formulated as a chain as {<Action_1> - <Landmark_1> - <Action_2> - <Landmark_2> ...}.\\\nTo make the plan, I will provide you the following elements:\\\n(1) <Navigation Instruction>: The navigation instruction given by the human.\\\n(2) <Previous Plan>: The completed steps in the plan recording your history trajectory.\\\n(3) <Semantic Clue>: The list recording all the observed rooms and objects in this house from your perspective.\\\nThe allowed <Action> in the plan contains ['Explore','Approach','Move Forward','Turn Left','Turn Right','Turn Around'].\\\nThe action 'Explore' will lead you to the exploration frontiers to help unlock new areas.\\\nThe action 'Approach' will lead you close to a specific object or room for more detailed observations.\\\nThe allowed <Landmark> should be one appeared semantic instance in the <Navigation Instruction> or <Semantic Clue>.\\\nDo not output an imagined instance as <Landmark> which has not been observerd in <Semantic Clue> or mentioned in the <Navigation Instruction>.\\\nTo select the landmark, you should consider the common house layout, human's habit of objects placement and the task navigation instruction.\\\nFor example, the sofa is often close to a television, therefore, sofa is a good landmark for finding the television and to satisfy the human entertainment demand.\\\nIf the action and landmark is clearly specified in the instruction like 'walk forward to the television', then you can directly decompose the instruction into 'Move_Forward' - 'Television' without 'Explore' action.\\\nYou only need to plan one <Action> and one <Landmark> ahead, besides, you should output a flag to indicate whether you have finished the navigation task.\\\nTherefore, your output answer should be formatted as Answer={'Reason':<Your Reason>, 'Action':<Chosen Action>, 'Landmark':<Chosen Instance>, 'Flag':<Task Finished Flag>}.\\\nIf you find a specific instance of the target object or a synonyms object, the output 'Flag' should be True.\\\nTry to select the <Landmark> that is closely related to the <Navigation Instruction> according to the human habit.\\\nTry not repeatly select the same <Landmark> as the <Previous Plan>.\"\n\nGPT4V_PROMPT = f\"You are an indoor navigation agent. I give you a panoramic observation image, complete navigation instruction and the sub-instruction you should execute now.  \\\nDirection 1 and 11 are ahead, Direction 5 and 7 are back, Direction 3 is to the right, and Direction 9 is to the left. Please carefully analyze visual information in each direction \\\nand judge which direction is most suitable for next movement according to the act and landmark mentioned in the sub-instruction. \\\nYou answer should follow \\\"Thinking Process\\\" and \\\"Judgement\\\". In the \\\"Judgement: \\\" field, you should only write down direction ID you choose. \\\nIf you think you have arrived the destination, you can answer \\\"Stop\\\" in the \\\"Judgement: \\\" field. Note that the \\\"Direction 5\\\" and \\\"Direction 7\\\" are the directions you just came from. \\\nGenerally, the direction with more navigation landmarks in the complete navigation instruction is better.\"\n\n\n"
  },
  {
    "path": "mapper.py",
    "content": "from mapping_utils.geometry import *\nfrom mapping_utils.preprocess import *\nfrom mapping_utils.projection import *\nfrom mapping_utils.transform import *\nfrom mapping_utils.path_planning import *\nfrom cv_utils.image_percevior import GLEE_Percevior\nfrom matplotlib import colormaps\nfrom habitat_sim.utils.common import d3_40_colors_rgb\nfrom constants import *\nimport open3d as o3d\nfrom lavis.models import load_model_and_preprocess\nfrom PIL import Image\nclass Instruct_Mapper:\n    def __init__(self,\n                 camera_intrinsic,\n                 pcd_resolution=0.05,\n                 grid_resolution=0.1,\n                 grid_size=5,\n                 floor_height=-0.8,\n                 ceiling_height=0.8,\n                 translation_func=habitat_translation,\n                 rotation_func=habitat_rotation,\n                 rotate_axis=[0,1,0],\n                 device='cuda:0'):\n        self.device = device\n        self.camera_intrinsic = camera_intrinsic\n        self.pcd_resolution = pcd_resolution\n        self.grid_resolution = grid_resolution\n        self.grid_size = grid_size\n        self.floor_height = floor_height\n        self.ceiling_height = ceiling_height\n        self.translation_func = translation_func\n        self.rotation_func = rotation_func\n        self.rotate_axis = np.array(rotate_axis)\n        self.object_percevior = GLEE_Percevior(device=device)\n        self.pcd_device = o3d.core.Device(device.upper())\n    \n    def reset(self,position,rotation):\n        self.update_iterations = 0\n        self.initial_position = self.translation_func(position)\n        self.current_position = self.translation_func(position) - self.initial_position\n        self.current_rotation = self.rotation_func(rotation)\n        self.scene_pcd = o3d.t.geometry.PointCloud(self.pcd_device)\n        self.navigable_pcd = o3d.t.geometry.PointCloud(self.pcd_device)\n        self.object_pcd = o3d.t.geometry.PointCloud(self.pcd_device)\n        self.object_entities = []\n        self.trajectory_position = []\n    \n    def update(self,rgb,depth,position,rotation):\n        self.current_position = self.translation_func(position) - self.initial_position\n        self.current_rotation = self.rotation_func(rotation)\n        self.current_depth = preprocess_depth(depth)\n        self.current_rgb = preprocess_image(rgb)\n        self.trajectory_position.append(self.current_position)\n        # to avoid there is no valid depth value (especially in real-world)\n        if np.sum(self.current_depth) > 0:\n            camera_points,camera_colors = get_pointcloud_from_depth(self.current_rgb,self.current_depth,self.camera_intrinsic)\n            world_points = translate_to_world(camera_points,self.current_position,self.current_rotation)\n            self.current_pcd = gpu_pointcloud_from_array(world_points,camera_colors,self.pcd_device).voxel_down_sample(self.pcd_resolution)\n        else:\n            return\n        # semantic masking and project object mask to pointcloud\n        classes,masks,confidences,visualization = self.object_percevior.perceive(self.current_rgb)\n        self.segmentation = visualization[0]\n        current_object_entities = self.get_object_entities(self.current_depth,classes,masks,confidences)\n        self.object_entities = self.associate_object_entities(self.object_entities,current_object_entities)\n        self.object_pcd = self.update_object_pcd()\n        # pointcloud update\n        self.scene_pcd = 
gpu_merge_pointcloud(self.current_pcd,self.scene_pcd).voxel_down_sample(self.pcd_resolution)\n        self.scene_pcd = self.scene_pcd.select_by_index((self.scene_pcd.point.positions[:,2]>self.floor_height-0.2).nonzero()[0])\n        self.useful_pcd = self.scene_pcd.select_by_index((self.scene_pcd.point.positions[:,2]<self.ceiling_height).nonzero()[0])\n        \n        # all the stairs will be regarded as navigable\n        for entity in current_object_entities:\n            if entity['class'] == 'stairs':\n                self.navigable_pcd = gpu_merge_pointcloud(self.navigable_pcd,entity['pcd'])\n        # geometry \n        current_navigable_point = self.current_pcd.select_by_index((self.current_pcd.point.positions[:,2]<self.floor_height).nonzero()[0])\n        current_navigable_position = current_navigable_point.point.positions.cpu().numpy()\n        standing_position = np.array([self.current_position[0],self.current_position[1],current_navigable_position[:,2].mean()])\n        interpolate_points = np.linspace(np.ones_like(current_navigable_position)*standing_position,current_navigable_position,25).reshape(-1,3)\n        interpolate_points = interpolate_points[(interpolate_points[:,2] > self.floor_height-0.2) & (interpolate_points[:,2] < self.floor_height+0.2)]\n        interpolate_colors = np.ones_like(interpolate_points) * 100\n        try:\n            current_navigable_pcd = gpu_pointcloud_from_array(interpolate_points,interpolate_colors,self.pcd_device).voxel_down_sample(self.grid_resolution)\n            self.navigable_pcd = gpu_merge_pointcloud(self.navigable_pcd,current_navigable_pcd).voxel_down_sample(self.pcd_resolution)\n        except:\n            self.navigable_pcd = self.useful_pcd.select_by_index((self.useful_pcd.point.positions[:,2]<self.floor_height).nonzero()[0])\n       \n        \n        # try:\n        #     self.navigable_pcd = self.navigable_pcd.voxel_down_sample(self.pcd_resolution)\n        # except:\n        #     self.navigable_pcd = self.useful_pcd.select_by_index((self.useful_pcd.point.positions[:,2]<self.floor_height).nonzero()[0])\n        #print(\"Warning: hello world\")\n        # self.navigable_pcd = self.useful_pcd.select_by_index((self.useful_pcd.point.positions[:,2]<self.floor_height).nonzero()[0])\n            \n        # filter the obstacle pointcloud\n        self.obstacle_pcd = self.useful_pcd.select_by_index((self.useful_pcd.point.positions[:,2]>self.floor_height+0.1).nonzero()[0])\n        self.trajectory_pcd = gpu_pointcloud_from_array(np.array(self.trajectory_position),np.zeros((len(self.trajectory_position),3)),self.pcd_device)\n        self.frontier_pcd = project_frontier(self.obstacle_pcd,self.navigable_pcd,self.floor_height+0.2,self.grid_resolution)\n        self.frontier_pcd[:,2] = self.navigable_pcd.point.positions.cpu().numpy()[:,2].mean()\n        self.frontier_pcd = gpu_pointcloud_from_array(self.frontier_pcd,np.ones((self.frontier_pcd.shape[0],3))*np.array([[255,0,0]]),self.pcd_device)\n        self.update_iterations += 1\n    \n    def update_object_pcd(self):\n        object_pcd = o3d.geometry.PointCloud()\n        for entity in self.object_entities:\n            points = entity['pcd'].point.positions.cpu().numpy()\n            colors = entity['pcd'].point.colors.cpu().numpy()\n            new_pcd = o3d.geometry.PointCloud()\n            new_pcd.points = o3d.utility.Vector3dVector(points)\n            new_pcd.colors = o3d.utility.Vector3dVector(colors)\n            object_pcd = object_pcd + new_pcd\n        try:\n            
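# Move the merged per-object point cloud onto the GPU tensor backend;\n            # fall back to the whole scene cloud if the conversion fails (e.g. no objects yet).\n            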
return gpu_pointcloud(object_pcd,self.pcd_device)\n        except:\n            return self.scene_pcd\n    \n    def get_view_pointcloud(self,rgb,depth,translation,rotation):\n        current_position = self.translation_func(translation) - self.initial_position\n        current_rotation = self.rotation_func(rotation)\n        current_depth = preprocess_depth(depth)\n        current_rgb = preprocess_image(rgb)\n        camera_points,camera_colors = get_pointcloud_from_depth(current_rgb,current_depth,self.camera_intrinsic)\n        world_points = translate_to_world(camera_points,current_position,current_rotation)\n        current_pcd = gpu_pointcloud_from_array(world_points,camera_colors,self.pcd_device).voxel_down_sample(self.pcd_resolution)\n        return current_pcd\n    \n    def get_object_entities(self,depth,classes,masks,confidences):\n        entities = []\n        exist_objects = np.unique([ent['class'] for ent in self.object_entities]).tolist()\n        for cls,mask,score in zip(classes,masks,confidences):\n            if depth[mask>0].min() < 1.0 and score < 0.5:\n                continue\n            if cls not in exist_objects:\n                exist_objects.append(cls)\n            camera_points = get_pointcloud_from_depth_mask(depth,mask,self.camera_intrinsic)\n            world_points = translate_to_world(camera_points,self.current_position,self.current_rotation)\n            point_colors = np.array([d3_40_colors_rgb[exist_objects.index(cls)%40]]*world_points.shape[0])\n            if world_points.shape[0] < 10:\n                continue\n            object_pcd = gpu_pointcloud_from_array(world_points,point_colors,self.pcd_device).voxel_down_sample(self.pcd_resolution)\n            object_pcd = gpu_cluster_filter(object_pcd)\n            if object_pcd.point.positions.shape[0] < 10:\n                continue\n            entity = {'class':cls,'pcd':object_pcd,'confidence':score}\n            entities.append(entity)\n        return entities\n    \n    def associate_object_entities(self,ref_entities,eval_entities):        \n        for entity in eval_entities:\n            if len(ref_entities) == 0:\n                ref_entities.append(entity)\n                continue\n            overlap_score = []\n            eval_pcd = entity['pcd']\n            for ref_entity in ref_entities:\n                if eval_pcd.point.positions.shape[0] == 0:\n                    break\n                cdist = pointcloud_distance(eval_pcd,ref_entity['pcd'])\n                overlap_condition = (cdist < 0.1)\n                nonoverlap_condition = overlap_condition.logical_not()\n                eval_pcd = eval_pcd.select_by_index(o3d.core.Tensor(nonoverlap_condition.cpu().numpy(),device=self.pcd_device).nonzero()[0])\n                overlap_score.append((overlap_condition.sum()/(overlap_condition.shape[0]+1e-6)).cpu().numpy())\n            max_overlap_score = np.max(overlap_score)\n            arg_overlap_index = np.argmax(overlap_score)\n            if max_overlap_score < 0.25:\n                entity['pcd'] = eval_pcd\n                ref_entities.append(entity)\n            else:\n                argmax_entity = ref_entities[arg_overlap_index]\n                argmax_entity['pcd'] = gpu_merge_pointcloud(argmax_entity['pcd'],eval_pcd)\n                if argmax_entity['pcd'].point.positions.shape[0] < entity['pcd'].point.positions.shape[0] or entity['class'] in INTEREST_OBJECTS:\n                    argmax_entity['class'] = entity['class']\n                ref_entities[arg_overlap_index] = 
argmax_entity\n        return ref_entities\n    \n    def get_obstacle_affordance(self):\n        try:\n            distance = pointcloud_distance(self.navigable_pcd,self.obstacle_pcd)\n            affordance = (distance - distance.min())/(distance.max() - distance.min() + 1e-6)\n            affordance[distance < 0.25] = 0\n            return affordance.cpu().numpy()\n        except:\n            return np.zeros((self.navigable_pcd.point.positions.shape[0],),dtype=np.float32)\n    \n    def get_trajectory_affordance(self):\n        try:\n            distance = pointcloud_distance(self.navigable_pcd,self.trajectory_pcd)\n            affordance = (distance - distance.min()) / (distance.max() - distance.min() + 1e-6)\n            return affordance.cpu().numpy()\n        except:\n            return np.zeros((self.navigable_pcd.point.positions.shape[0],),dtype=np.float32)\n    \n    def get_semantic_affordance(self,target_class,threshold=0.1):\n        semantic_pointcloud = o3d.t.geometry.PointCloud()\n        for entity in self.object_entities:\n            if entity['class'] in target_class:\n                semantic_pointcloud = gpu_merge_pointcloud(semantic_pointcloud,entity['pcd'])\n        try:\n            distance = pointcloud_2d_distance(self.navigable_pcd,semantic_pointcloud) \n            affordance = 1 - (distance - distance.min()) / (distance.max() - distance.min() + 1e-6)\n            affordance[distance > threshold] = 0\n            affordance = affordance.cpu().numpy()\n            return affordance\n        except:\n            return np.zeros((self.navigable_pcd.point.positions.shape[0],),dtype=np.float32)\n    \n    def get_gpt4v_affordance(self,gpt4v_pcd):\n        try:\n            distance = pointcloud_distance(self.navigable_pcd,gpt4v_pcd)\n            affordance = 1 - (distance - distance.min()) / (distance.max() - distance.min() + 1e-6)\n            affordance[distance > 0.1] = 0\n            return affordance.cpu().numpy()\n        except:\n            return np.zeros((self.navigable_pcd.point.positions.shape[0],),dtype=np.float32)\n    \n    def get_action_affordance(self,action):\n        try:\n            if action == 'Explore':\n                distance = pointcloud_2d_distance(self.navigable_pcd,self.frontier_pcd)\n                affordance = 1 - (distance - distance.min()) / (distance.max() - distance.min() + 1e-6)\n                affordance[distance > 0.2] = 0\n                return affordance.cpu().numpy()\n            elif action == 'Move_Forward':\n                pixel_x,pixel_z,depth_values = project_to_camera(self.navigable_pcd,self.camera_intrinsic,self.current_position,self.current_rotation)\n                filter_condition = (pixel_x >= 0) & (pixel_x < self.camera_intrinsic[0][2]*2) & (pixel_z >= 0) & (pixel_z < self.camera_intrinsic[1][2]*2) & (depth_values > 1.5) & (depth_values < 2.5)\n                filter_pcd = self.navigable_pcd.select_by_index(o3d.core.Tensor(np.where(filter_condition==1)[0],device=self.navigable_pcd.device))\n                distance = pointcloud_distance(self.navigable_pcd,filter_pcd)\n                affordance = 1 - (distance - distance.min()) / (distance.max() - distance.min() + 1e-6)\n                affordance[distance > 0.1] = 0\n                return affordance.cpu().numpy()\n            elif action == 'Turn_Around':\n                R = np.array([np.pi,np.pi,np.pi]) * self.rotate_axis\n                turn_extrinsic = np.matmul(self.current_rotation,quaternion.as_rotation_matrix(quaternion.from_euler_angles(R)))\n 
               pixel_x,pixel_z,depth_values = project_to_camera(self.navigable_pcd,self.camera_intrinsic,self.current_position,turn_extrinsic)\n                filter_condition = (pixel_x >= 0) & (pixel_x < self.camera_intrinsic[0][2]*2) & (pixel_z >= 0) & (pixel_z < self.camera_intrinsic[1][2]*2) & (depth_values > 1.5) & (depth_values < 2.5)\n                filter_pcd = self.navigable_pcd.select_by_index(o3d.core.Tensor(np.where(filter_condition==1)[0],device=self.navigable_pcd.device))\n                distance = pointcloud_distance(self.navigable_pcd,filter_pcd)\n                affordance = 1 - (distance - distance.min()) / (distance.max() - distance.min() + 1e-6)\n                affordance[distance > 0.1] = 0\n                return affordance.cpu().numpy()\n            elif action == 'Turn_Left':\n                R = np.array([np.pi/2,np.pi/2,np.pi/2]) * self.rotate_axis\n                turn_extrinsic = np.matmul(self.current_rotation,quaternion.as_rotation_matrix(quaternion.from_euler_angles(R)))\n                pixel_x,pixel_z,depth_values = project_to_camera(self.navigable_pcd,self.camera_intrinsic,self.current_position,turn_extrinsic)\n                filter_condition = (pixel_x >= 0) & (pixel_x < self.camera_intrinsic[0][2]*2) & (pixel_z >= 0) & (pixel_z < self.camera_intrinsic[1][2]*2) & (depth_values > 1.5) & (depth_values < 2.5)\n                filter_pcd = self.navigable_pcd.select_by_index(o3d.core.Tensor(np.where(filter_condition==1)[0],device=self.navigable_pcd.device))\n                distance = pointcloud_distance(self.navigable_pcd,filter_pcd)\n                affordance = 1 - (distance - distance.min()) / (distance.max() - distance.min() + 1e-6)\n                affordance[distance > 0.1] = 0\n                return affordance.cpu().numpy()\n            elif action == 'Turn_Right':\n                R = np.array([-np.pi/2,-np.pi/2,-np.pi/2]) * self.rotate_axis\n                turn_extrinsic = np.matmul(self.current_rotation,quaternion.as_rotation_matrix(quaternion.from_euler_angles(R)))\n                pixel_x,pixel_z,depth_values = project_to_camera(self.navigable_pcd,self.camera_intrinsic,self.current_position,turn_extrinsic)\n                filter_condition = (pixel_x >= 0) & (pixel_x < self.camera_intrinsic[0][2]*2) & (pixel_z >= 0) & (pixel_z < self.camera_intrinsic[1][2]*2) & (depth_values > 1.5) & (depth_values < 2.5)\n                filter_pcd = self.navigable_pcd.select_by_index(o3d.core.Tensor(np.where(filter_condition==1)[0],device=self.navigable_pcd.device))\n                distance = pointcloud_distance(self.navigable_pcd,filter_pcd)\n                affordance = 1 - (distance - distance.min()) / (distance.max() - distance.min() + 1e-6)\n                affordance[distance > 0.1] = 0\n                return affordance.cpu().numpy()\n            elif action == 'Enter':\n                return self.get_semantic_affordance(['doorway','door','entrance','exit'])\n            elif action == 'Exit':\n                return self.get_semantic_affordance(['doorway','door','entrance','exit'])\n            else:\n                return np.zeros((self.navigable_pcd.point.positions.shape[0],),dtype=np.float32) \n        except:\n            return np.zeros((self.navigable_pcd.point.positions.shape[0],),dtype=np.float32) \n\n    def get_objnav_affordance_map(self,action,target_class,gpt4v_pcd,complete_flag=False,failure_mode=False):\n        if failure_mode:\n            obstacle_affordance = self.get_obstacle_affordance()\n            affordance = 
self.get_action_affordance('Explore')\n            affordance = np.clip(affordance,0.1,1.0)\n            affordance[obstacle_affordance == 0] = 0\n            return affordance,self.visualize_affordance(affordance)\n        elif complete_flag:\n            affordance = self.get_semantic_affordance([target_class],threshold=0.1)\n            return affordance,self.visualize_affordance(affordance)\n        else:\n            obstacle_affordance = self.get_obstacle_affordance()\n            semantic_affordance = self.get_semantic_affordance([target_class],threshold=1.5)\n            action_affordance = self.get_action_affordance(action)\n            gpt4v_affordance = self.get_gpt4v_affordance(gpt4v_pcd)\n            history_affordance = self.get_trajectory_affordance()\n            affordance = 0.25*semantic_affordance + 0.25*action_affordance + 0.25*gpt4v_affordance + 0.25*history_affordance\n            affordance = np.clip(affordance,0.1,1.0)\n            affordance[obstacle_affordance == 0] = 0\n            return affordance,self.visualize_affordance(affordance/(affordance.max()+1e-6))\n\n    def get_debug_affordance_map(self,action,target_class,gpt4v_pcd):\n        obstacle_affordance = self.get_obstacle_affordance()\n        semantic_affordance = self.get_semantic_affordance([target_class],threshold=1.5)\n        action_affordance = self.get_action_affordance(action)\n        gpt4v_affordance = self.get_gpt4v_affordance(gpt4v_pcd)\n        history_affordance = self.get_trajectory_affordance()\n        return self.visualize_affordance(semantic_affordance/(semantic_affordance.max()+1e-6)),\\\n               self.visualize_affordance(history_affordance/(history_affordance.max()+1e-6)),\\\n               self.visualize_affordance(action_affordance/(action_affordance.max()+1e-6)),\\\n               self.visualize_affordance(gpt4v_affordance/(gpt4v_affordance.max()+1e-6)),\\\n               self.visualize_affordance(obstacle_affordance/(obstacle_affordance.max()+1e-6))\n\n    def visualize_affordance(self,affordance):\n        cmap = colormaps.get('jet')\n        color_affordance = cmap(affordance)[:,0:3]\n        color_affordance = cpu_pointcloud_from_array(self.navigable_pcd.point.positions.cpu().numpy(),color_affordance)\n        return color_affordance\n    \n    def get_appeared_objects(self):\n        return [entity['class'] for entity in self.object_entities]\n\n    def save_pointcloud_debug(self,path=\"./\"):\n        save_pcd = o3d.geometry.PointCloud()\n        try:\n            assert self.useful_pcd.point.positions.shape[0] > 0\n            save_pcd.points = o3d.utility.Vector3dVector(self.useful_pcd.point.positions.cpu().numpy())\n            save_pcd.colors = o3d.utility.Vector3dVector(self.useful_pcd.point.colors.cpu().numpy())\n            o3d.io.write_point_cloud(path + \"scene.ply\",save_pcd)\n        except:\n            pass\n        try:\n            assert self.navigable_pcd.point.positions.shape[0] > 0\n            save_pcd.points = o3d.utility.Vector3dVector(self.navigable_pcd.point.positions.cpu().numpy())\n            save_pcd.colors = o3d.utility.Vector3dVector(self.navigable_pcd.point.colors.cpu().numpy())\n            o3d.io.write_point_cloud(path + \"navigable.ply\",save_pcd)\n        except:\n            pass\n        try:\n            assert self.obstacle_pcd.point.positions.shape[0] > 0\n            save_pcd.points = o3d.utility.Vector3dVector(self.obstacle_pcd.point.positions.cpu().numpy())\n            save_pcd.colors = 
o3d.utility.Vector3dVector(self.obstacle_pcd.point.colors.cpu().numpy())\n            o3d.io.write_point_cloud(path + \"obstacle.ply\",save_pcd)\n        except:\n            pass\n        \n        object_pcd = o3d.geometry.PointCloud()\n        for entity in self.object_entities:\n            points = entity['pcd'].point.positions.cpu().numpy()\n            colors = entity['pcd'].point.colors.cpu().numpy()\n            new_pcd = o3d.geometry.PointCloud()\n            new_pcd.points = o3d.utility.Vector3dVector(points)\n            new_pcd.colors = o3d.utility.Vector3dVector(colors)\n            object_pcd = object_pcd + new_pcd\n        if len(object_pcd.points) > 0:\n            o3d.io.write_point_cloud(path + \"object.ply\",object_pcd)\n    \n   \n"
  },
  {
    "path": "mapping_utils/geometry.py",
    "content": "import numpy as np\nimport open3d as o3d\nimport quaternion\nimport time\nimport torch\nimport cv2\n\ndef get_pointcloud_from_depth(rgb:np.ndarray,depth:np.ndarray,intrinsic:np.ndarray):\n    if len(depth.shape) == 3:\n        depth = depth[:,:,0]\n    filter_z,filter_x = np.where(depth>0)\n    depth_values = depth[filter_z,filter_x]\n    pixel_z = (depth.shape[0] - 1 - filter_z - intrinsic[1][2]) * depth_values / intrinsic[1][1]\n    pixel_x = (filter_x - intrinsic[0][2])*depth_values / intrinsic[0][0]\n    pixel_y = depth_values\n    color_values = rgb[filter_z,filter_x]\n    point_values = np.stack([pixel_x,pixel_z,-pixel_y],axis=-1)\n    return point_values,color_values\n\ndef get_pointcloud_from_depth_mask(depth:np.ndarray,mask:np.ndarray,intrinsic:np.ndarray):\n    if len(depth.shape) == 3:\n        depth = depth[:,:,0]\n    if len(mask.shape) == 3:\n        mask = mask[:,:,0]\n    filter_z,filter_x = np.where((depth>0) & (mask>0))\n    depth_values = depth[filter_z,filter_x]\n    pixel_z = (depth.shape[0] - 1 - filter_z - intrinsic[1][2]) * depth_values / intrinsic[1][1]\n    pixel_x = (filter_x - intrinsic[0][2])*depth_values / intrinsic[0][0]\n    pixel_y = depth_values\n    point_values = np.stack([pixel_x,pixel_z,-pixel_y],axis=-1)\n    return point_values\n\ndef translate_to_world(pointcloud,position,rotation):\n    extrinsic = np.eye(4)\n    extrinsic[0:3,0:3] = rotation \n    extrinsic[0:3,3] = position\n    world_points = np.matmul(extrinsic,np.concatenate((pointcloud,np.ones((pointcloud.shape[0],1))),axis=-1).T).T\n    return world_points[:,0:3]\n\ndef project_to_camera(pcd,intrinsic,position,rotation):\n    extrinsic = np.eye(4)\n    extrinsic[0:3,0:3] = rotation\n    extrinsic[0:3,3] = position\n    extrinsic = np.linalg.inv(extrinsic)\n    try:\n        camera_points = np.concatenate((pcd.point.positions.cpu().numpy(),np.ones((pcd.point.positions.shape[0],1))),axis=-1)\n    except:\n        camera_points = np.concatenate((pcd.points,np.ones((np.array(pcd.points).shape[0],1))),axis=-1)\n    camera_points = np.matmul(extrinsic,camera_points.T).T[:,0:3]\n    depth_values = -camera_points[:,2]\n    filter_x = (camera_points[:,0] * intrinsic[0][0] / depth_values + intrinsic[0][2]).astype(np.int32)\n    filter_z = (-camera_points[:,1] * intrinsic[1][1] / depth_values - intrinsic[1][2] + intrinsic[1][2]*2 - 1).astype(np.int32)\n    return filter_x,filter_z,depth_values\n    \ndef pointcloud_distance(pcdA,pcdB,device='cpu'):\n    try:\n        pointsA = torch.tensor(pcdA.point.positions.cpu().numpy(),device=device)\n        pointsB = torch.tensor(pcdB.point.positions.cpu().numpy(),device=device)\n    except:\n        pointsA = torch.tensor(np.array(pcdA.points),device=device)\n        pointsB = torch.tensor(np.array(pcdB.points),device=device)\n    cdist = torch.cdist(pointsA,pointsB)\n    min_distances1, _ = cdist.min(dim=1)\n    return min_distances1\n\ndef pointcloud_2d_distance(pcdA,pcdB,device='cpu'):\n    pointsA = torch.tensor(pcdA.point.positions.cpu().numpy(),device=device)\n    pointsA[:,2] = 0\n    pointsB = torch.tensor(pcdB.point.positions.cpu().numpy(),device=device)\n    pointsB[:,2] = 0\n    cdist = torch.cdist(pointsA,pointsB)\n    min_distances1, _ = cdist.min(dim=1)\n    return min_distances1\n\ndef cpu_pointcloud_from_array(points,colors):\n    pointcloud = o3d.geometry.PointCloud()\n    pointcloud.points = o3d.utility.Vector3dVector(points)\n    pointcloud.colors = o3d.utility.Vector3dVector(colors)\n    return pointcloud\n\ndef 
gpu_pointcloud_from_array(points,colors,device):\n    pointcloud = o3d.t.geometry.PointCloud(device)\n    pointcloud.point.positions = o3d.core.Tensor(points,dtype=o3d.core.Dtype.Float32,device=device)\n    pointcloud.point.colors = o3d.core.Tensor(colors.astype(np.float32)/255.0,dtype=o3d.core.Dtype.Float32,device=device)\n    return pointcloud\n\ndef gpu_pointcloud(pointcloud,device):\n    new_pointcloud = o3d.t.geometry.PointCloud(device)\n    new_pointcloud.point.positions = o3d.core.Tensor(np.asarray(pointcloud.points),device=device)\n    new_pointcloud.point.colors = o3d.core.Tensor(np.asarray(pointcloud.colors),device=device)\n    return new_pointcloud\n    \ndef cpu_pointcloud(pointcloud):\n    new_pointcloud = o3d.geometry.PointCloud()\n    new_pointcloud.points = o3d.utility.Vector3dVector(pointcloud.point.positions.cpu().numpy())\n    new_pointcloud.colors = o3d.utility.Vector3dVector(pointcloud.point.colors.cpu().numpy())\n    return new_pointcloud\n\ndef cpu_merge_pointcloud(pcdA,pcdB):\n    return pcdA + pcdB\n\ndef gpu_merge_pointcloud(pcdA,pcdB):\n    if pcdA.is_empty():\n        return pcdB\n    if pcdB.is_empty():\n        return pcdA\n    return pcdA + pcdB\n\ndef gpu_cluster_filter(pointcloud,eps=0.3,min_points=20):\n    labels = pointcloud.cluster_dbscan(eps=eps, min_points=min_points, print_progress=False)\n    numpy_labels = labels.cpu().numpy()\n    unique_labels = np.unique(numpy_labels)\n    largest_cluster_label = max(unique_labels, key=lambda x: np.sum(numpy_labels == x))\n    largest_cluster_pc = pointcloud.select_by_index((labels == largest_cluster_label).nonzero()[0])\n    return largest_cluster_pc\n\ndef cpu_cluster_filter(pointcloud,eps=0.3,min_points=20):\n    labels = pointcloud.cluster_dbscan(eps=eps, min_points=min_points, print_progress=False)\n    unique_labels = np.unique(labels)\n    largest_cluster_label = max(unique_labels, key=lambda x: np.sum(labels == x))\n    largest_cluster_pc = pointcloud.select_by_index((labels == largest_cluster_label).nonzero()[0])\n    return largest_cluster_pc\n\ndef quat2array(quat):\n    return np.array([quat.w,quat.x,quat.y,quat.z],np.float32)\n\ndef quaternion_distance(quatA,quatB):\n    # M*4, N*4\n    dot = np.dot(quatA,quatB.T)\n    dot[dot<0] = -dot[dot<0]\n    angle = 2*np.arccos(dot)\n    return angle/np.pi*180\n\ndef eculidean_distance(posA,posB):\n    posA_reshaped = posA[:, np.newaxis, :]\n    posB_reshaped = posB[np.newaxis, :, :]\n    pairwise_distance = np.sqrt(np.sum((posA_reshaped - posB_reshaped)**2, axis=2))\n    return pairwise_distance\n    \n\n"
  },
  {
    "path": "mapping_utils/path_planning.py",
    "content": "import numpy as np\nimport cv2\nfrom pathfinding.core.diagonal_movement import DiagonalMovement\nfrom pathfinding.core.grid import Grid\nfrom pathfinding.finder.a_star import AStarFinder\nfrom .projection import *\n\ndef path_planning(costmap,start_index,goal_index):\n    planmap = costmap.copy()\n    planmap[planmap == 1] = 10\n    grid = Grid(matrix=(planmap*100).astype(np.int32))\n    finder = AStarFinder(diagonal_movement=DiagonalMovement.always)\n    start_index[0][1] = np.clip(start_index[0][1],0,costmap.shape[1]-1)\n    start_index[0][0] = np.clip(start_index[0][0],0,costmap.shape[0]-1)\n    goal_index[0][1] = np.clip(goal_index[0][1],0,costmap.shape[1]-1)\n    goal_index[0][0] = np.clip(goal_index[0][0],0,costmap.shape[0]-1)\n    start = grid.node(start_index[0][1],start_index[0][0])\n    goal = grid.node(goal_index[0][1],goal_index[0][0])\n    path,_ = finder.find_path(start,goal,grid)\n    return path\n\ndef visualize_path(costmap,path):\n    visualize_costmap = costmap.copy()\n    for waypoint in path:\n        x = waypoint.y\n        y = waypoint.x\n        visualize_costmap[x,y] = 10\n    visualize_costmap = cv2.resize(visualize_costmap,(0,0),fx=10,fy=10,interpolation=cv2.INTER_NEAREST)\n    visualize_costmap = cv2.applyColorMap((255*visualize_costmap/10).astype(np.uint8),cv2.COLORMAP_JET)\n    return visualize_costmap    \n\n    \n    \n    \n    \n    \n    \n    \n    "
  },
  {
    "path": "mapping_utils/preprocess.py",
    "content": "import numpy as np\ndef preprocess_depth(depth:np.ndarray,lower_bound:float=0.1,upper_bound:float=4.9):\n    depth[np.where((depth<lower_bound)|(depth>upper_bound))] = 0\n    return depth\ndef preprocess_image(image:np.ndarray):\n    return image"
  },
  {
    "path": "mapping_utils/projection.py",
    "content": "import numpy as np\nimport open3d as o3d\nimport cv2\n# obstacle = 0\n# unknown = 1\n# position = 2\n# navigable = 3\n# frontier = 4\n\ndef project_frontier(obstacle_pcd,navigable_pcd,obstacle_height=-0.7,grid_resolution=0.25):\n    np_obstacle_points = obstacle_pcd.point.positions.cpu().numpy()\n    np_navigable_points = navigable_pcd.point.positions.cpu().numpy()\n    np_all_points = np.concatenate((np_obstacle_points,np_navigable_points),axis=0)\n    max_bound = np.max(np_all_points,axis=0)\n    min_bound = np.min(np_all_points,axis=0)\n    grid_dimensions = np.ceil((max_bound - min_bound) / grid_resolution).astype(int)\n    grid_map = np.ones((grid_dimensions[0],grid_dimensions[1]),dtype=np.int32)\n    # get navigable occupancy\n    navigable_points = np_navigable_points\n    navigable_indices = np.floor((navigable_points - min_bound) / grid_resolution).astype(int)\n    navigable_indices[:,0] = np.clip(navigable_indices[:,0],0,grid_dimensions[0]-1)\n    navigable_indices[:,1] = np.clip(navigable_indices[:,1],0,grid_dimensions[1]-1)\n    navigable_indices[:,2] = np.clip(navigable_indices[:,2],0,grid_dimensions[2]-1)\n    navigable_voxels = np.zeros(grid_dimensions,dtype=np.int32)\n    navigable_voxels[navigable_indices[:,0],navigable_indices[:,1],navigable_indices[:,2]] = 1\n    navigable_map = (navigable_voxels.sum(axis=2) > 0)\n    grid_map[np.where(navigable_map>0)] = 3\n    # get obstacle occupancy\n    obstacle_points = np_obstacle_points\n    obstacle_indices = np.floor((obstacle_points - min_bound) / grid_resolution).astype(int)\n    obstacle_indices[:,0] = np.clip(obstacle_indices[:,0],0,grid_dimensions[0]-1)\n    obstacle_indices[:,1] = np.clip(obstacle_indices[:,1],0,grid_dimensions[1]-1)\n    obstacle_indices[:,2] = np.clip(obstacle_indices[:,2],0,grid_dimensions[2]-1)\n    obstacle_voxels = np.zeros(grid_dimensions,dtype=np.int32)\n    obstacle_voxels[obstacle_indices[:,0],obstacle_indices[:,1],obstacle_indices[:,2]] = 1\n    obstacle_map = (obstacle_voxels.sum(axis=2) > 0)\n    grid_map[np.where(obstacle_map>0)] = 0\n     # get outer-border of navigable areas\n    outer_border_navigable = ((grid_map == 3)*255).astype(np.uint8)\n    contours,hierarchiy = cv2.findContours(outer_border_navigable,cv2.RETR_EXTERNAL,cv2.CHAIN_APPROX_SIMPLE)\n    outer_border_navigable = cv2.drawContours(np.zeros((grid_map.shape[0],grid_map.shape[1])),contours,-1,(255,255,255),1).astype(np.float32)\n    obstacles = ((grid_map == 0)*255).astype(np.float32)\n    obstacles = cv2.dilate(obstacles.astype(np.uint8),np.ones((3,3)))\n    outer_border_navigable = ((outer_border_navigable - obstacles) > 0)\n    grid_map_x,grid_map_y = np.where(outer_border_navigable>0)\n    grid_indexes = np.stack((grid_map_x,grid_map_y,obstacle_height*np.ones((grid_map_x.shape[0],))),axis=1)\n    frontier_points = grid_indexes * grid_resolution + min_bound\n    return frontier_points\n    \ndef translate_grid_to_point(pointcloud,grid_indexes,grid_resolution=0.25):\n    np_all_points = pointcloud.point.positions.cpu().numpy()\n    min_bound = np.min(np_all_points,axis=0)\n    translate_points = grid_indexes * grid_resolution + min_bound\n    return translate_points\n\ndef translate_point_to_grid(pointcloud,point_poses,grid_resolution=0.25):\n    if len(point_poses.shape) == 1:\n        point_poses = point_poses[np.newaxis,:]\n    np_all_points = pointcloud.point.positions.cpu().numpy()\n    min_bound = np.min(np_all_points,axis=0)\n    grid_index = np.floor((point_poses - min_bound) / 
grid_resolution).astype(int)\n    return grid_index[:,0:2]\n\ndef project_costmap(navigable_pcd,affordance_value,grid_resolution=0.25):\n    navigable_points = navigable_pcd.point.positions.cpu().numpy()\n    max_bound = np.max(navigable_points,axis=0)\n    min_bound = np.min(navigable_points,axis=0)\n    grid_dimensions = np.ceil((max_bound - min_bound) / grid_resolution).astype(int)\n    navigable_voxels = np.zeros(grid_dimensions,dtype=np.float32)\n    navigable_indices = np.floor((navigable_points - min_bound) / grid_resolution).astype(int)\n    navigable_indices[:,0] = np.clip(navigable_indices[:,0],0,grid_dimensions[0]-1)\n    navigable_indices[:,1] = np.clip(navigable_indices[:,1],0,grid_dimensions[1]-1)\n    navigable_indices[:,2] = np.clip(navigable_indices[:,2],0,grid_dimensions[2]-1)\n    navigable_voxels[navigable_indices[:,0],navigable_indices[:,1],navigable_indices[:,2]] = affordance_value\n    navigable_costmap = navigable_voxels.max(axis=2)\n    navigable_costmap = 1 - navigable_costmap\n    color_navigable_costmap = cv2.applyColorMap((navigable_costmap*255).astype(np.uint8),cv2.COLORMAP_JET)\n    color_navigable_costmap = cv2.resize(color_navigable_costmap,(0,0),fx=5,fy=5,interpolation=cv2.INTER_NEAREST)\n    return navigable_costmap,color_navigable_costmap\n\n"
  },
  {
    "path": "mapping_utils/transform.py",
    "content": "import numpy as np\nimport quaternion\ndef habitat_camera_intrinsic(config):\n    assert config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.width == config.habitat.simulator.agents.main_agent.sim_sensors.rgb_sensor.width, 'The configuration of the depth camera should be the same as rgb camera.'\n    assert config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.height == config.habitat.simulator.agents.main_agent.sim_sensors.rgb_sensor.height, 'The configuration of the depth camera should be the same as rgb camera.'\n    assert config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.hfov == config.habitat.simulator.agents.main_agent.sim_sensors.rgb_sensor.hfov, 'The configuration of the depth camera should be the same as rgb camera.'\n    width = config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.width\n    height = config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.height\n    hfov = config.habitat.simulator.agents.main_agent.sim_sensors.depth_sensor.hfov\n    xc = (width - 1.) / 2.\n    zc = (height - 1.) / 2.\n    f = (width / 2.) / np.tan(np.deg2rad(hfov / 2.))\n    intrinsic_matrix = np.array([[f,0,xc],\n                                 [0,f,zc],\n                                 [0,0,1]],np.float32)\n    return intrinsic_matrix\n\ndef habitat_translation(position):\n    return np.array([position[0],position[2],position[1]])\n\ndef habitat_rotation(rotation):\n    rotation_matrix = quaternion.as_rotation_matrix(rotation)\n    transform_matrix = np.array([[1,0,0],\n                                 [0,0,1],\n                                 [0,1,0]])\n    rotation_matrix = np.matmul(transform_matrix,rotation_matrix)\n    return rotation_matrix\n"
  },
  {
    "path": "objnav_agent.py",
    "content": "import habitat\nimport numpy as np\nimport cv2\nimport ast\nimport open3d as o3d\nfrom mapping_utils.geometry import *\nfrom mapping_utils.projection import *\nfrom mapping_utils.path_planning import *\nfrom habitat.tasks.nav.shortest_path_follower import ShortestPathFollower\nfrom mapper import Instruct_Mapper\nfrom habitat.utils.visualizations.maps import colorize_draw_agent_and_fit_to_height\nfrom llm_utils.nav_prompt import CHAINON_PROMPT,GPT4V_PROMPT\nfrom llm_utils.gpt_request import gpt_response,gptv_response\n\nclass HM3D_Objnav_Agent:\n    def __init__(self,env:habitat.Env,mapper:Instruct_Mapper):\n        self.env = env\n        self.mapper = mapper\n        self.episode_samples = 0\n        self.planner = ShortestPathFollower(env.sim,0.5,False)\n\n    def translate_objnav(self,object_goal):\n        if object_goal.lower() == 'plant':\n            return \"Find the <%s>.\"%\"potted_plant\"\n        elif object_goal.lower() == \"tv_monitor\":\n            return \"Find the <%s>.\"%\"television_set\"\n        else:\n            return \"Find the <%s>.\"%object_goal\n    \n    def reset_debug_probes(self):\n        self.rgb_trajectory = []\n        self.depth_trajectory = []\n        self.topdown_trajectory = []\n        self.segmentation_trajectory = []\n\n        self.gpt_trajectory = []\n        self.gptv_trajectory = []\n        self.panoramic_trajectory = []\n        \n        self.obstacle_affordance_trajectory = []\n        self.semantic_affordance_trajectory = []\n        self.history_affordance_trajectory = []\n        self.action_affordance_trajectory = []\n        self.gpt4v_affordance_trajectory = []\n        self.affordance_trajectory = []\n\n    def reset(self):\n        self.episode_samples += 1\n        self.episode_steps = 0\n        self.obs = self.env.reset()\n        self.mapper.reset(self.env.sim.get_agent_state().sensor_states['rgb'].position,self.env.sim.get_agent_state().sensor_states['rgb'].rotation)\n        self.instruct_goal = self.translate_objnav(self.env.current_episode.object_category)\n        self.trajectory_summary = \"\"\n        self.reset_debug_probes()     \n       \n    def rotate_panoramic(self,rotate_times = 12):\n        self.temporary_pcd = []\n        self.temporary_images = []\n        for i in range(rotate_times):\n            if self.env.episode_over:\n                break\n            self.update_trajectory()\n            self.temporary_pcd.append(self.mapper.current_pcd)\n            self.temporary_images.append(self.rgb_trajectory[-1])\n            self.obs = self.env.step(3)\n            \n    def concat_panoramic(self,images):\n        try:\n            height,width = images[0].shape[0],images[0].shape[1]\n        except:\n            height,width = 480,640\n        background_image = np.zeros((2*height + 3*10, 3*width + 4*10, 3),np.uint8)\n        copy_images = np.array(images,dtype=np.uint8)\n        for i in range(len(copy_images)):\n            if i % 2 != 0:\n                row = (i//6)\n                col = ((i%6)//2)\n                copy_images[i] = cv2.putText(copy_images[i],\"Direction %d\"%i,(100,100),cv2.FONT_HERSHEY_SIMPLEX, 2, (255, 0, 0), 6, cv2.LINE_AA)\n                background_image[10*(row+1)+row*height:10*(row+1)+row*height+height:,col*width + col * 10:col*width+col*10+width,:] = copy_images[i]\n                \n        return background_image\n    \n    def update_trajectory(self):\n        self.episode_steps += 1\n        self.metrics = self.env.get_metrics()\n        
self.rgb_trajectory.append(cv2.cvtColor(self.obs['rgb'],cv2.COLOR_BGR2RGB))\n        self.depth_trajectory.append((self.obs['depth']/5.0 * 255.0).astype(np.uint8))\n        \n        topdown_image = cv2.cvtColor(colorize_draw_agent_and_fit_to_height(self.metrics['top_down_map'],1024),cv2.COLOR_BGR2RGB)\n        topdown_image = cv2.putText(topdown_image,'Success:%.2f,SPL:%.2f,SoftSPL:%.2f,DTS:%.2f'%(self.metrics['success'],self.metrics['spl'],self.metrics['soft_spl'],self.metrics['distance_to_goal']),(0,100),cv2.FONT_HERSHEY_SIMPLEX,2,(0,0,0),2,cv2.LINE_AA)\n        self.topdown_trajectory.append(topdown_image)\n        \n        self.position = self.env.sim.get_agent_state().sensor_states['rgb'].position\n        self.rotation = self.env.sim.get_agent_state().sensor_states['rgb'].rotation\n\n        self.mapper.update(self.rgb_trajectory[-1],self.obs['depth'],self.position,self.rotation)\n        self.segmentation_trajectory.append(self.mapper.segmentation)\n        self.observed_objects = self.mapper.get_appeared_objects()\n\n        cv2.imwrite(\"monitor-rgb.jpg\",self.rgb_trajectory[-1])\n        cv2.imwrite(\"monitor-depth.jpg\",self.depth_trajectory[-1])\n        cv2.imwrite(\"monitor-segmentation.jpg\",self.segmentation_trajectory[-1])\n            \n    def save_trajectory(self,dir=\"./tmp_objnav/\"):\n        import imageio\n        import os\n        os.makedirs(dir)\n\n        self.mapper.save_pointcloud_debug(dir) \n        fps_writer = imageio.get_writer(dir+\"fps.mp4\", fps=4)\n        dps_writer = imageio.get_writer(dir+\"depth.mp4\", fps=4)\n        seg_writer = imageio.get_writer(dir+\"segmentation.mp4\", fps=4)\n        metric_writer = imageio.get_writer(dir+\"metrics.mp4\",fps=4)\n        for i,img,dep,seg,met in zip(np.arange(len(self.rgb_trajectory)),self.rgb_trajectory,self.depth_trajectory,self.segmentation_trajectory,self.topdown_trajectory):\n            fps_writer.append_data(cv2.cvtColor(img,cv2.COLOR_BGR2RGB))\n            dps_writer.append_data(dep)\n            seg_writer.append_data(cv2.cvtColor(seg,cv2.COLOR_BGR2RGB))\n            metric_writer.append_data(cv2.cvtColor(met,cv2.COLOR_BGR2RGB))\n\n        for index,pano_img in enumerate(self.panoramic_trajectory):\n            cv2.imwrite(dir+\"%d-pano.jpg\"%index,pano_img)\n        with open(dir+\"gpt4_history.txt\",'w') as file:\n            file.write(\"\".join(self.gpt_trajectory))\n        with open(dir+\"gpt4v_history.txt\",'w') as file:\n            file.write(\"\".join(self.gptv_trajectory))\n\n        for i,afford,safford,hafford,cafford,gafford,oafford in zip(np.arange(len(self.affordance_trajectory)),self.affordance_trajectory,self.semantic_affordance_trajectory,self.history_affordance_trajectory,self.action_affordance_trajectory,self.gpt4v_affordance_trajectory,self.obstacle_affordance_trajectory):\n            o3d.io.write_point_cloud(dir+\"afford-%d-plan.ply\"%i,afford)\n            o3d.io.write_point_cloud(dir+\"semantic-afford-%d-plan.ply\"%i,safford)\n            o3d.io.write_point_cloud(dir+\"history-afford-%d-plan.ply\"%i,hafford)\n            o3d.io.write_point_cloud(dir+\"action-afford-%d-plan.ply\"%i,cafford)\n            o3d.io.write_point_cloud(dir+\"gpt4v-afford-%d-plan.ply\"%i,gafford)\n            o3d.io.write_point_cloud(dir+\"obstacle-afford-%d-plan.ply\"%i,oafford)\n\n        fps_writer.close()\n        dps_writer.close()\n        seg_writer.close()\n        metric_writer.close()\n    \n    def query_chainon(self):\n        semantic_clue = {'observed 
object':self.observed_objects}\n        query_content = \"<Navigation Instruction>:{}, <Previous Plan>:{}, <Semantic Clue>:{}\".format(self.instruct_goal,\"{\" + self.trajectory_summary + \"}\",semantic_clue)\n        self.gpt_trajectory.append(\"Input:\\n%s \\n\"%query_content)\n        for i in range(10):\n            try:\n                raw_answer = gpt_response(query_content,CHAINON_PROMPT)\n                print(\"GPT-4 Output Response: %s\"%raw_answer)\n                answer = raw_answer.replace(\" \",\"\")\n                answer = answer[answer.index(\"{\"):answer.index(\"}\")+1]\n                answer = ast.literal_eval(answer)\n                if 'Action' in answer.keys() and 'Landmark' in answer.keys() and 'Flag' in answer.keys():\n                    break\n            except:\n                continue\n        self.gpt_trajectory.append(\"\\nGPT-4 Answer:\\n%s\"%raw_answer)\n        if self.trajectory_summary == \"\":\n            self.trajectory_summary = self.trajectory_summary + str(answer['Action']) + '-' + str(answer['Landmark'])\n        else:\n            self.trajectory_summary = self.trajectory_summary + '-' + str(answer['Action']) + '-' + str(answer['Landmark'])\n        return answer\n    \n    def query_gpt4v(self):\n        images = self.temporary_images\n        inference_image = self.concat_panoramic(images)\n        cv2.imwrite(\"monitor-panoramic.jpg\",inference_image)\n        text_content = \"<Navigation Instruction>:{}\\n <Sub Instruction>:{}\".format(self.instruct_goal,self.trajectory_summary.split(\"-\")[-2] + \"-\" + self.trajectory_summary.split(\"-\")[-1])\n        self.gptv_trajectory.append(\"\\nInput:\\n%s \\n\"%text_content)\n        for i in range(10):\n            try:\n                raw_answer = gptv_response(text_content,inference_image,GPT4V_PROMPT)\n                print(\"GPT-4V Output Response: %s\"%raw_answer)\n                answer = raw_answer[raw_answer.index(\"Judgement: Direction\"):]\n                answer = answer.replace(\" \",\"\")\n                answer = int(answer.split(\"Direction\")[-1])\n                break\n            except:\n                continue\n        self.gptv_trajectory.append(\"GPT-4V Answer:\\n%s\"%raw_answer)\n        self.panoramic_trajectory.append(inference_image)\n        try:\n            return answer\n        except:\n            return np.random.randint(0,12)\n    \n    def make_plan(self,rotate=True,failed=False):\n        if rotate == True:\n            self.rotate_panoramic()\n        self.chainon_answer = self.query_chainon()\n        self.gpt4v_answer = self.query_gpt4v()\n        self.gpt4v_pcd = o3d.t.geometry.PointCloud(self.mapper.pcd_device)\n        self.gpt4v_pcd = gpu_merge_pointcloud(self.gpt4v_pcd,self.temporary_pcd[self.gpt4v_answer])\n        self.found_goal = bool(self.chainon_answer['Flag'])\n        self.affordance_pcd,self.colored_affordance_pcd = self.mapper.get_objnav_affordance_map(self.chainon_answer['Action'],self.chainon_answer['Landmark'],self.gpt4v_pcd,self.chainon_answer['Flag'],failure_mode=failed)\n        self.semantic_afford,self.history_afford,self.action_afford,self.gpt4v_afford,self.obs_afford = self.mapper.get_debug_affordance_map(self.chainon_answer['Action'],self.chainon_answer['Landmark'],self.gpt4v_pcd)\n        if self.affordance_pcd.max() == 0:\n            self.affordance_pcd,self.colored_affordance_pcd = self.mapper.get_objnav_affordance_map(self.chainon_answer['Action'],self.chainon_answer['Landmark'],self.gpt4v_pcd,False,failure_mode=failed)\n 
           self.found_goal = False\n            \n        self.affordance_map,self.colored_affordance_map = project_costmap(self.mapper.navigable_pcd,self.affordance_pcd,self.mapper.grid_resolution)\n        self.target_point = self.mapper.navigable_pcd.point.positions[self.affordance_pcd.argmax()].cpu().numpy()\n        self.plan_position = self.mapper.current_position.copy()\n        target_index = translate_point_to_grid(self.mapper.navigable_pcd,self.target_point,self.mapper.grid_resolution)\n        start_index = translate_point_to_grid(self.mapper.navigable_pcd,self.mapper.current_position,self.mapper.grid_resolution)\n        self.path = path_planning(self.affordance_map,start_index,target_index)\n        self.path = [translate_grid_to_point(self.mapper.navigable_pcd,np.array([[waypoint.y,waypoint.x,0]]),self.mapper.grid_resolution)[0] for waypoint in self.path]\n        if len(self.path) == 0:\n            self.waypoint = self.mapper.navigable_pcd.point.positions.cpu().numpy()[np.argmax(self.affordance_pcd)]\n            self.waypoint[2] = self.mapper.current_position[2]\n        elif len(self.path) < 5: \n            self.waypoint = self.path[-1]\n            self.waypoint[2] = self.mapper.current_position[2]\n        else:\n            self.waypoint = self.path[4]\n            self.waypoint[2] = self.mapper.current_position[2]\n\n        self.affordance_trajectory.append(self.colored_affordance_pcd)\n        self.obstacle_affordance_trajectory.append(self.obs_afford)\n        self.semantic_affordance_trajectory.append(self.semantic_afford)\n        self.history_affordance_trajectory.append(self.history_afford)\n        self.action_affordance_trajectory.append(self.action_afford)\n        self.gpt4v_affordance_trajectory.append(self.gpt4v_afford)\n    \n    def step(self):\n        to_target_distance = np.sqrt(np.sum(np.square(self.mapper.current_position - self.waypoint)))\n        if to_target_distance < 0.6 and len(self.path) > 0:\n            self.path = self.path[min(5,len(self.path)-1):]\n            if len(self.path) < 3:\n                self.waypoint = self.path[-1]\n                self.waypoint[2] = self.mapper.current_position[2]\n            else:\n                self.waypoint = self.path[2]\n                self.waypoint[2] = self.mapper.current_position[2]\n\n        pid_waypoint = self.waypoint + self.mapper.initial_position\n        pid_waypoint = np.array([pid_waypoint[0],self.env.sim.get_agent_state().position[1],pid_waypoint[1]])\n        act = self.planner.get_next_action(pid_waypoint)\n        move_distance =  np.sqrt(np.sum(np.square(self.mapper.current_position - self.plan_position)))\n        if (act == 0 or move_distance > 3.0) and not self.found_goal:\n            self.make_plan(rotate=True)\n            pid_waypoint = self.waypoint + self.mapper.initial_position\n            pid_waypoint = np.array([pid_waypoint[0],self.env.sim.get_agent_state().position[1],pid_waypoint[1]])\n            act = self.planner.get_next_action(pid_waypoint)\n        if act == 0 and not self.found_goal:\n            self.make_plan(False,True)\n            pid_waypoint = self.waypoint + self.mapper.initial_position\n            pid_waypoint = np.array([pid_waypoint[0],self.env.sim.get_agent_state().position[1],pid_waypoint[1]])\n            act = self.planner.get_next_action(pid_waypoint)\n            print(\"Warning: Failure locomotion and action = %d\"%act)\n        if not self.env.episode_over:\n            self.obs = self.env.step(act)\n            
self.update_trajectory()\n       \n"
  },
  {
    "path": "objnav_benchmark.py",
    "content": "import habitat\nimport os\nimport argparse\nimport csv\nfrom tqdm import tqdm\nfrom config_utils import hm3d_config,mp3d_config\nfrom mapping_utils.transform import habitat_camera_intrinsic\nfrom mapper import Instruct_Mapper\nfrom objnav_agent import HM3D_Objnav_Agent\nos.environ['CUDA_VISIBLE_DEVICES'] = '0'\nos.environ[\"MAGNUM_LOG\"] = \"quiet\"\nos.environ[\"HABITAT_SIM_LOG\"] = \"quiet\"\ndef write_metrics(metrics,path=\"objnav_hm3d.csv\"):\n    with open(path, mode=\"w\", newline=\"\") as csv_file:\n        fieldnames = metrics[0].keys()\n        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)\n        writer.writeheader()\n        writer.writerows(metrics)\n\ndef get_args():\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"--eval_episodes\",type=int,default=500)\n    parser.add_argument(\"--mapper_resolution\",type=float,default=0.05)\n    parser.add_argument(\"--path_resolution\",type=float,default=0.2)\n    parser.add_argument(\"--path_scale\",type=int,default=5)\n    return parser.parse_known_args()[0]\n\nif __name__ == \"__main__\":\n    args = get_args()\n    habitat_config = hm3d_config(stage='val',episodes=args.eval_episodes)\n    habitat_env = habitat.Env(habitat_config)\n    habitat_mapper = Instruct_Mapper(habitat_camera_intrinsic(habitat_config),\n                                    pcd_resolution=args.mapper_resolution,\n                                    grid_resolution=args.path_resolution,\n                                    grid_size=args.path_scale)\n    habitat_agent = HM3D_Objnav_Agent(habitat_env,habitat_mapper)\n    evaluation_metrics = []\n    for i in tqdm(range(args.eval_episodes)):\n        habitat_agent.reset()\n        habitat_agent.make_plan()\n        while not habitat_env.episode_over and habitat_agent.episode_steps < 495:\n            habitat_agent.step()\n        habitat_agent.save_trajectory(\"./tmp/episode-%d/\"%i)\n        evaluation_metrics.append({'success':habitat_agent.metrics['success'],\n                                'spl':habitat_agent.metrics['spl'],\n                                'distance_to_goal':habitat_agent.metrics['distance_to_goal'],\n                                'object_goal':habitat_agent.instruct_goal})\n        write_metrics(evaluation_metrics)"
  },
  {
    "path": "requirements.txt",
    "content": "apex==0.9.10dev\neinops==0.8.0\nfairscale==0.4.4\nfvcore==0.1.5.post20221221\nimageio==2.34.1\nmatplotlib==3.8.4\nMultiScaleDeformableAttention==1.0\nnumpy==1.23.5\nnumpy_quaternion==2023.0.3\nomegaconf==2.3.0\nopen3d==0.18.0\nopenai==1.45.0\nopencv_python==4.4.0.46\nopencv_python_headless==4.5.5.64\npathfinding==1.0.9\nPillow==10.4.0\nRequests==2.32.3\nsalesforce_lavis==1.0.2\nscipy==1.14.1\nsetuptools==60.2.0\ntimm==0.4.12\ntorch==2.2.2+cu121\ntorchvision==0.17.2+cu121\ntqdm==4.65.2\ntransformers==4.26.1\nxformers==0.0.28.post1\n"
  },
  {
    "path": "thirdparty/GLEE/configs/R50.yaml",
    "content": "MODEL:\n  META_ARCHITECTURE: \"GLEE\"\n  MASK_ON: True\n  BACKBONE:\n    FREEZE_AT: 0\n    NAME: \"build_resnet_backbone\"\n  PIXEL_MEAN: [123.675, 116.280, 103.530]\n  PIXEL_STD: [58.395, 57.120, 57.375]\n  RESNETS:\n    DEPTH: 50\n    STEM_TYPE: \"basic\"  # not used\n    STEM_OUT_CHANNELS: 64\n    STRIDE_IN_1X1: False\n    OUT_FEATURES: [\"res2\", \"res3\", \"res4\", \"res5\"]\n    # NORM: \"SyncBN\"\n    RES5_MULTI_GRID: [1, 1, 1]  # not used\n  SEM_SEG_HEAD:\n    NAME: \"MaskDINOHead\"\n    IGNORE_VALUE: 255\n    NUM_CLASSES: 80\n    LOSS_WEIGHT: 1.0\n    CONVS_DIM: 256\n    MASK_DIM: 256\n    NORM: \"GN\"\n    # pixel decoder\n    PIXEL_DECODER_NAME: \"MaskDINOEncoder\"\n    DIM_FEEDFORWARD: 2048\n    NUM_FEATURE_LEVELS: 3\n    TOTAL_NUM_FEATURE_LEVELS: 4\n    IN_FEATURES: [\"res2\", \"res3\", \"res4\", \"res5\"]\n    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: [\"res3\", \"res4\", \"res5\"]\n    COMMON_STRIDE: 4\n    TRANSFORMER_ENC_LAYERS: 6\n    FEATURE_ORDER: \"low2high\"\n  MaskDINO:\n    TRANSFORMER_DECODER_NAME: \"MaskDINODecoder\"\n    DEEP_SUPERVISION: True\n    NO_OBJECT_WEIGHT: 0.1\n    CLASS_WEIGHT: 4.0\n    MASK_WEIGHT: 5.0\n    DICE_WEIGHT: 5.0\n    BOX_WEIGHT: 5.0\n    GIOU_WEIGHT: 2.0\n    HIDDEN_DIM: 256\n    NUM_OBJECT_QUERIES: 300\n    NHEADS: 8\n    DROPOUT: 0.0\n    DIM_FEEDFORWARD: 2048\n    ENC_LAYERS: 0\n    PRE_NORM: False\n    ENFORCE_INPUT_PROJ: False\n    SIZE_DIVISIBILITY: 32\n    DEC_LAYERS: 9  # 9+1, 9 decoder layers, add one for the loss on learnable query\n    TRAIN_NUM_POINTS: 12544\n    OVERSAMPLE_RATIO: 3.0\n    IMPORTANCE_SAMPLE_RATIO: 0.75\n    INITIAL_PRED: True\n    TWO_STAGE: True\n    DN: \"standard\"\n    DN_NUM: 100\n    INITIALIZE_BOX_TYPE: \"no\"\n    TEST:\n      SEMANTIC_ON: False\n      INSTANCE_ON: True\n      PANOPTIC_ON: False\n      OVERLAP_THRESHOLD: 0.8\n      OBJECT_MASK_THRESHOLD: 0.25\n  TEXT:\n    ARCH: clip_teacher\n  LANGUAGE_BACKBONE:\n    LANG_DIM: 512\n"
  },
  {
    "path": "thirdparty/GLEE/configs/SwinL.yaml",
    "content": "MODEL:\n  META_ARCHITECTURE: \"GLEE\"\n  MASK_ON: True\n  BACKBONE:\n    NAME: \"D2SwinTransformer\"\n  SWIN:\n    EMBED_DIM: 192\n    DEPTHS: [2, 2, 18, 2]\n    NUM_HEADS: [6, 12, 24, 48]\n    WINDOW_SIZE: 12\n    APE: False\n    DROP_PATH_RATE: 0.3\n    PATCH_NORM: True\n    PRETRAIN_IMG_SIZE: 384\n  PIXEL_MEAN: [123.675, 116.280, 103.530]\n  PIXEL_STD: [58.395, 57.120, 57.375]\n  RESNETS:\n    DEPTH: 50\n    STEM_TYPE: \"basic\"  # not used\n    STEM_OUT_CHANNELS: 64\n    STRIDE_IN_1X1: False\n    OUT_FEATURES: [\"res2\", \"res3\", \"res4\", \"res5\"]\n    # NORM: \"SyncBN\"\n    RES5_MULTI_GRID: [1, 1, 1]  # not used\n  SEM_SEG_HEAD:\n    NAME: \"MaskDINOHead\"\n    IGNORE_VALUE: 255\n    NUM_CLASSES: 80\n    LOSS_WEIGHT: 1.0\n    CONVS_DIM: 256\n    MASK_DIM: 256\n    NORM: \"GN\"\n    # pixel decoder\n    PIXEL_DECODER_NAME: \"MaskDINOEncoder\"\n    DIM_FEEDFORWARD: 2048\n    NUM_FEATURE_LEVELS: 3\n    TOTAL_NUM_FEATURE_LEVELS: 4\n    IN_FEATURES: [\"res2\", \"res3\", \"res4\", \"res5\"]\n    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: [\"res3\", \"res4\", \"res5\"]\n    COMMON_STRIDE: 4\n    TRANSFORMER_ENC_LAYERS: 6\n    FEATURE_ORDER: \"low2high\"\n  MaskDINO:\n    TRANSFORMER_DECODER_NAME: \"MaskDINODecoder\"\n    DEEP_SUPERVISION: True\n    NO_OBJECT_WEIGHT: 0.1\n    CLASS_WEIGHT: 4.0\n    MASK_WEIGHT: 5.0\n    DICE_WEIGHT: 5.0\n    BOX_WEIGHT: 5.0\n    GIOU_WEIGHT: 2.0\n    HIDDEN_DIM: 256\n    NUM_OBJECT_QUERIES: 300\n    NHEADS: 8\n    DROPOUT: 0.0\n    DIM_FEEDFORWARD: 2048\n    ENC_LAYERS: 0\n    PRE_NORM: False\n    ENFORCE_INPUT_PROJ: False\n    SIZE_DIVISIBILITY: 32\n    DEC_LAYERS: 9  # 9+1, 9 decoder layers, add one for the loss on learnable query\n    TRAIN_NUM_POINTS: 12544\n    OVERSAMPLE_RATIO: 3.0\n    IMPORTANCE_SAMPLE_RATIO: 0.75\n    INITIAL_PRED: True\n    TWO_STAGE: True\n    DN: \"standard\"\n    DN_NUM: 100\n    INITIALIZE_BOX_TYPE: \"no\"\n    TEST:\n      SEMANTIC_ON: False\n      INSTANCE_ON: True\n      PANOPTIC_ON: False\n      OVERLAP_THRESHOLD: 0.8\n      OBJECT_MASK_THRESHOLD: 0.25\n  TEXT:\n    ARCH: clip_teacher\n  LANGUAGE_BACKBONE:\n    LANG_DIM: 512\n"
  },
  {
    "path": "thirdparty/GLEE/glee/__init__.py",
    "content": "from __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\n\nfrom .config import add_glee_config\nfrom .config_deeplab import add_deeplab_config\n# from .GLEE import GLEE\n# from .data import build_detection_train_loader, build_detection_test_loader\nfrom .backbone.swin import D2SwinTransformer\nfrom .backbone.eva02 import D2_EVA02\n\n"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/__init__.py",
    "content": "from .build import build_backbone\n\nfrom .resnet import *\nfrom .swin import *\n# from .focal import *\n# from .focal_dw import *\nfrom .backbone import *"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/backbone.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\nimport torch.nn as nn\n\nfrom detectron2.modeling import ShapeSpec\n\n__all__ = [\"Backbone\"]\n\n\nclass Backbone(nn.Module):\n    \"\"\"\n    Abstract base class for network backbones.\n    \"\"\"\n\n    def __init__(self):\n        \"\"\"\n        The `__init__` method of any subclass can specify its own set of arguments.\n        \"\"\"\n        super().__init__()\n\n    def forward(self):\n        \"\"\"\n        Subclasses must override this method, but adhere to the same return type.\n\n        Returns:\n            dict[str->Tensor]: mapping from feature name (e.g., \"res2\") to tensor\n        \"\"\"\n        pass\n\n    @property\n    def size_divisibility(self) -> int:\n        \"\"\"\n        Some backbones require the input height and width to be divisible by a\n        specific integer. This is typically true for encoder / decoder type networks\n        with lateral connection (e.g., FPN) for which feature maps need to match\n        dimension in the \"bottom up\" and \"top down\" paths. Set to 0 if no specific\n        input size divisibility is required.\n        \"\"\"\n        return 0\n\n    def output_shape(self):\n        \"\"\"\n        Returns:\n            dict[str->ShapeSpec]\n        \"\"\"\n        # this is a backward-compatible default\n        return {\n            name: ShapeSpec(\n                channels=self._out_feature_channels[name], stride=self._out_feature_strides[name]\n            )\n            for name in self._out_features\n        }\n"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/build.py",
    "content": "from .registry import model_entrypoints\nfrom .registry import is_model\n\nfrom .backbone import *\n\ndef build_backbone(config, **kwargs):\n    model_name = config['MODEL']['BACKBONE']['NAME']\n    if not is_model(model_name):\n        raise ValueError(f'Unkown model: {model_name}')\n    model = model_entrypoints(model_name)(config, **kwargs)\n    return model"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/davit.py",
    "content": "import os\nimport itertools\nimport logging\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport torch.utils.checkpoint as checkpoint\nfrom collections import OrderedDict\n\nfrom einops import rearrange\nfrom timm.models.layers import DropPath, trunc_normal_\n\nfrom detectron2.utils.file_io import PathManager\nfrom detectron2.modeling import BACKBONE_REGISTRY, Backbone, ShapeSpec\n\nfrom .registry import register_backbone\n\nlogger = logging.getLogger(__name__)\n\n\nclass MySequential(nn.Sequential):\n    def forward(self, *inputs):\n        for module in self._modules.values():\n            if type(inputs) == tuple:\n                inputs = module(*inputs)\n            else:\n                inputs = module(inputs)\n        return inputs\n\n\nclass PreNorm(nn.Module):\n    def __init__(self, norm, fn, drop_path=None):\n        super().__init__()\n        self.norm = norm\n        self.fn = fn\n        self.drop_path = drop_path\n\n    def forward(self, x, *args, **kwargs):\n        shortcut = x\n        if self.norm != None:\n            x, size = self.fn(self.norm(x), *args, **kwargs)\n        else:\n            x, size = self.fn(x, *args, **kwargs)\n\n        if self.drop_path:\n            x = self.drop_path(x)\n\n        x = shortcut + x\n\n        return x, size\n\n\nclass Mlp(nn.Module):\n    def __init__(\n        self,\n        in_features,\n        hidden_features=None,\n        out_features=None,\n        act_layer=nn.GELU,\n    ):\n        super().__init__()\n        out_features = out_features or in_features\n        hidden_features = hidden_features or in_features\n        self.net = nn.Sequential(OrderedDict([\n            (\"fc1\", nn.Linear(in_features, hidden_features)),\n            (\"act\", act_layer()),\n            (\"fc2\", nn.Linear(hidden_features, out_features))\n        ]))\n\n    def forward(self, x, size):\n        return self.net(x), size\n\n\nclass DepthWiseConv2d(nn.Module):\n    def __init__(\n        self,\n        dim_in,\n        kernel_size,\n        padding,\n        stride,\n        bias=True,\n    ):\n        super().__init__()\n        self.dw = nn.Conv2d(\n            dim_in, dim_in,\n            kernel_size=kernel_size,\n            padding=padding,\n            groups=dim_in,\n            stride=stride,\n            bias=bias\n        )\n\n    def forward(self, x, size):\n        B, N, C = x.shape\n        H, W = size\n        assert N == H * W\n\n        x = self.dw(x.transpose(1, 2).view(B, C, H, W))\n        size = (x.size(-2), x.size(-1))\n        x = x.flatten(2).transpose(1, 2)\n        return x, size\n\n\nclass ConvEmbed(nn.Module):\n    \"\"\" Image to Patch Embedding\n    \"\"\"\n\n    def __init__(\n        self,\n        patch_size=7,\n        in_chans=3,\n        embed_dim=64,\n        stride=4,\n        padding=2,\n        norm_layer=None,\n        pre_norm=True\n    ):\n        super().__init__()\n        self.patch_size = patch_size\n\n        self.proj = nn.Conv2d(\n            in_chans, embed_dim,\n            kernel_size=patch_size,\n            stride=stride,\n            padding=padding\n        )\n\n        dim_norm = in_chans if pre_norm else embed_dim\n        self.norm = norm_layer(dim_norm) if norm_layer else None\n\n        self.pre_norm = pre_norm\n\n    def forward(self, x, size):\n        H, W = size\n        if len(x.size()) == 3:\n            if self.norm and self.pre_norm:\n                x = self.norm(x)\n            x = rearrange(\n                x, 'b (h w) c -> b c 
h w',\n                h=H, w=W\n            )\n\n        x = self.proj(x)\n\n        _, _, H, W = x.shape\n        x = rearrange(x, 'b c h w -> b (h w) c')\n        if self.norm and not self.pre_norm:\n            x = self.norm(x)\n\n        return x, (H, W)\n\n\nclass ChannelAttention(nn.Module):\n\n    def __init__(self, dim, groups=8, qkv_bias=True):\n        super().__init__()\n\n        self.groups = groups\n        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)\n        self.proj = nn.Linear(dim, dim)\n\n    def forward(self, x, size):\n        B, N, C = x.shape\n\n        qkv = self.qkv(x).reshape(B, N, 3, self.groups, C // self.groups).permute(2, 0, 3, 1, 4)\n        q, k, v = qkv[0], qkv[1], qkv[2]\n\n        q = q * (N ** -0.5)\n        attention = q.transpose(-1, -2) @ k\n        attention = attention.softmax(dim=-1)\n        x = (attention @ v.transpose(-1, -2)).transpose(-1, -2)\n        x = x.transpose(1, 2).reshape(B, N, C)\n        x = self.proj(x)\n        return x, size\n\n\nclass ChannelBlock(nn.Module):\n\n    def __init__(self, dim, groups, mlp_ratio=4., qkv_bias=True,\n                 drop_path_rate=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm,\n                 conv_at_attn=True, conv_at_ffn=True):\n        super().__init__()\n\n        drop_path = DropPath(drop_path_rate) if drop_path_rate > 0. else nn.Identity()\n\n        self.conv1 = PreNorm(None, DepthWiseConv2d(dim, 3, 1, 1)) if conv_at_attn else None\n        self.channel_attn = PreNorm(\n            norm_layer(dim),\n            ChannelAttention(dim, groups=groups, qkv_bias=qkv_bias),\n            drop_path\n        )\n        self.conv2 = PreNorm(None, DepthWiseConv2d(dim, 3, 1, 1)) if conv_at_ffn else None\n        self.ffn = PreNorm(\n            norm_layer(dim),\n            Mlp(in_features=dim, hidden_features=int(dim*mlp_ratio), act_layer=act_layer),\n            drop_path\n        )\n\n    def forward(self, x, size):\n        if self.conv1:\n            x, size = self.conv1(x, size)\n        x, size = self.channel_attn(x, size)\n\n        if self.conv2:\n            x, size = self.conv2(x, size)\n        x, size = self.ffn(x, size)\n\n        return x, size\n\n\ndef window_partition(x, window_size: int):\n    B, H, W, C = x.shape\n    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)\n    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)\n    return windows\n\n\ndef window_reverse(windows, window_size: int, H: int, W: int):\n    B = int(windows.shape[0] / (H * W / window_size / window_size))\n    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)\n    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)\n    return x\n\n\nclass WindowAttention(nn.Module):\n    def __init__(self, dim, num_heads, window_size, qkv_bias=True):\n\n        super().__init__()\n        self.dim = dim\n        self.window_size = window_size\n        self.num_heads = num_heads\n        head_dim = dim // num_heads\n        self.scale = head_dim ** -0.5\n\n        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)\n        self.proj = nn.Linear(dim, dim)\n\n        self.softmax = nn.Softmax(dim=-1)\n\n    def forward(self, x, size):\n\n        H, W = size\n        B, L, C = x.shape\n        assert L == H * W, \"input feature has wrong size\"\n\n        x = x.view(B, H, W, C)\n\n        pad_l = pad_t = 0\n        pad_r = (self.window_size - W % self.window_size) % self.window_size\n        pad_b = (self.window_size 
- H % self.window_size) % self.window_size\n        x = F.pad(x, (0, 0, pad_l, pad_r, pad_t, pad_b))\n        _, Hp, Wp, _ = x.shape\n\n        x = window_partition(x, self.window_size)\n        x = x.view(-1, self.window_size * self.window_size, C)\n\n        # W-MSA/SW-MSA\n        # attn_windows = self.attn(x_windows)\n\n        B_, N, C = x.shape\n        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)\n        q, k, v = qkv[0], qkv[1], qkv[2]\n\n        q = q * self.scale\n        attn = (q @ k.transpose(-2, -1))\n        attn = self.softmax(attn)\n\n        x = (attn @ v).transpose(1, 2).reshape(B_, N, C)\n        x = self.proj(x)\n\n        # merge windows\n        x = x.view(\n            -1, self.window_size, self.window_size, C\n        )\n        x = window_reverse(x, self.window_size, Hp, Wp)\n\n        if pad_r > 0 or pad_b > 0:\n            x = x[:, :H, :W, :].contiguous()\n\n        x = x.view(B, H * W, C)\n\n        return x, size\n\n\nclass SpatialBlock(nn.Module):\n\n    def __init__(self, dim, num_heads, window_size,\n                 mlp_ratio=4., qkv_bias=True, drop_path_rate=0., act_layer=nn.GELU,\n                 norm_layer=nn.LayerNorm, conv_at_attn=True, conv_at_ffn=True):\n        super().__init__()\n\n        drop_path = DropPath(drop_path_rate) if drop_path_rate > 0. else nn.Identity()\n\n        self.conv1 = PreNorm(None, DepthWiseConv2d(dim, 3, 1, 1)) if conv_at_attn else None\n        self.window_attn = PreNorm(\n            norm_layer(dim),\n            WindowAttention(dim, num_heads, window_size, qkv_bias=qkv_bias),\n            drop_path\n        )\n        self.conv2 = PreNorm(None, DepthWiseConv2d(dim, 3, 1, 1)) if conv_at_ffn else None\n        self.ffn = PreNorm(\n            norm_layer(dim),\n            Mlp(in_features=dim, hidden_features=int(dim*mlp_ratio), act_layer=act_layer),\n            drop_path\n        )\n\n    def forward(self, x, size):\n        if self.conv1:\n            x, size = self.conv1(x, size)\n        x, size = self.window_attn(x, size)\n\n        if self.conv2:\n            x, size = self.conv2(x, size)\n        x, size = self.ffn(x, size)\n        return x, size\n\n\nclass DaViT(nn.Module):\n    \"\"\" DaViT: Dual-Attention Transformer\n\n    Args:\n        img_size (int): Image size, Default: 224.\n        in_chans (int): Number of input image channels. Default: 3.\n        num_classes (int): Number of classes for classification head. Default: 1000.\n        patch_size (tuple(int)): Patch size of convolution in different stages. Default: (7, 2, 2, 2).\n        patch_stride (tuple(int)): Patch stride of convolution in different stages. Default: (4, 2, 2, 2).\n        patch_padding (tuple(int)): Patch padding of convolution in different stages. Default: (3, 0, 0, 0).\n        patch_prenorm (tuple(bool)): If True, perform norm before convlution layer. Default: (True, False, False, False).\n        embed_dims (tuple(int)): Patch embedding dimension in different stages. Default: (64, 128, 192, 256).\n        num_heads (tuple(int)): Number of spatial attention heads in different stages. Default: (4, 8, 12, 16).\n        num_groups (tuple(int)): Number of channel groups in different stages. Default: (4, 8, 12, 16).\n        window_size (int): Window size. Default: 7.\n        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4.\n        qkv_bias (bool): If True, add a learnable bias to query, key, value. 
Default: True.\n        drop_path_rate (float): Stochastic depth rate. Default: 0.1.\n        norm_layer (nn.Module): Normalization layer. Default: nn.LayerNorm.\n        enable_checkpoint (bool): If True, enable checkpointing. Default: False.\n        conv_at_attn (bool): If True, performe depthwise convolution before attention layer. Default: True.\n        conv_at_ffn (bool): If True, performe depthwise convolution before ffn layer. Default: True.\n    \"\"\"\n\n    def __init__(\n        self,\n        img_size=224,\n        in_chans=3,\n        num_classes=1000,\n        depths=(1, 1, 3, 1),\n        patch_size=(7, 2, 2, 2),\n        patch_stride=(4, 2, 2, 2),\n        patch_padding=(3, 0, 0, 0),\n        patch_prenorm=(False, False, False, False),\n        embed_dims=(64, 128, 192, 256),\n        num_heads=(3, 6, 12, 24),\n        num_groups=(3, 6, 12, 24),\n        window_size=7,\n        mlp_ratio=4.,\n        qkv_bias=True,\n        drop_path_rate=0.1,\n        norm_layer=nn.LayerNorm,\n        enable_checkpoint=False,\n        conv_at_attn=True,\n        conv_at_ffn=True,\n        out_indices=[],\n     ):\n        super().__init__()\n\n        self.num_classes = num_classes\n        self.embed_dims = embed_dims\n        self.num_heads = num_heads\n        self.num_groups = num_groups\n        self.num_stages = len(self.embed_dims)\n        self.enable_checkpoint = enable_checkpoint\n        assert self.num_stages == len(self.num_heads) == len(self.num_groups)\n\n        num_stages = len(embed_dims)\n        self.img_size = img_size\n        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths)*2)]\n\n\n        depth_offset = 0\n        convs = []\n        blocks = []\n        for i in range(num_stages):\n            conv_embed = ConvEmbed(\n                patch_size=patch_size[i],\n                stride=patch_stride[i],\n                padding=patch_padding[i],\n                in_chans=in_chans if i == 0 else self.embed_dims[i - 1],\n                embed_dim=self.embed_dims[i],\n                norm_layer=norm_layer,\n                pre_norm=patch_prenorm[i]\n            )\n            convs.append(conv_embed)\n\n            print(f'=> Depth offset in stage {i}: {depth_offset}')\n            block = MySequential(\n                *[\n                    MySequential(OrderedDict([\n                        (\n                            'spatial_block', SpatialBlock(\n                                embed_dims[i],\n                                num_heads[i],\n                                window_size,\n                                drop_path_rate=dpr[depth_offset+j*2],\n                                qkv_bias=qkv_bias,\n                                mlp_ratio=mlp_ratio,\n                                conv_at_attn=conv_at_attn,\n                                conv_at_ffn=conv_at_ffn,\n                            )\n                        ),\n                        (\n                            'channel_block', ChannelBlock(\n                                embed_dims[i],\n                                num_groups[i],\n                                drop_path_rate=dpr[depth_offset+j*2+1],\n                                qkv_bias=qkv_bias,\n                                mlp_ratio=mlp_ratio,\n                                conv_at_attn=conv_at_attn,\n                                conv_at_ffn=conv_at_ffn,\n                            )\n                        )\n                    ])) for j in range(depths[i])\n                ]\n          
  )\n            blocks.append(block)\n            depth_offset += depths[i]*2\n\n        self.convs = nn.ModuleList(convs)\n        self.blocks = nn.ModuleList(blocks)\n\n        self.out_indices = out_indices\n        # self.norms = norm_layer(self.embed_dims[-1])\n        # self.avgpool = nn.AdaptiveAvgPool1d(1)\n        # self.head = nn.Linear(self.embed_dims[-1], num_classes) if num_classes > 0 else nn.Identity()\n        self.apply(self._init_weights)\n\n    @property\n    def dim_out(self):\n        return self.embed_dims[-1]\n\n    def _init_weights(self, m):\n        if isinstance(m, nn.Linear):\n            trunc_normal_(m.weight, std=0.02)\n            if m.bias is not None:\n                nn.init.constant_(m.bias, 0)\n        elif isinstance(m, nn.Conv2d):\n            nn.init.normal_(m.weight, std=0.02)\n            for name, _ in m.named_parameters():\n                if name in ['bias']:\n                    nn.init.constant_(m.bias, 0)\n        elif isinstance(m, nn.LayerNorm):\n            nn.init.constant_(m.weight, 1.0)\n            nn.init.constant_(m.bias, 0)\n        elif isinstance(m, nn.BatchNorm2d):\n            nn.init.constant_(m.weight, 1.0)\n            nn.init.constant_(m.bias, 0)\n\n    def _try_remap_keys(self, pretrained_dict):\n        remap_keys = {\n            \"conv_embeds\": \"convs\",\n            \"main_blocks\": \"blocks\",\n            \"0.cpe.0.proj\": \"spatial_block.conv1.fn.dw\",\n            \"0.attn\": \"spatial_block.window_attn.fn\",\n            \"0.cpe.1.proj\": \"spatial_block.conv2.fn.dw\",\n            \"0.mlp\": \"spatial_block.ffn.fn.net\",\n            \"1.cpe.0.proj\": \"channel_block.conv1.fn.dw\",\n            \"1.attn\": \"channel_block.channel_attn.fn\",\n            \"1.cpe.1.proj\": \"channel_block.conv2.fn.dw\",\n            \"1.mlp\": \"channel_block.ffn.fn.net\",\n            \"0.norm1\": \"spatial_block.window_attn.norm\",\n            \"0.norm2\": \"spatial_block.ffn.norm\",\n            \"1.norm1\": \"channel_block.channel_attn.norm\",\n            \"1.norm2\": \"channel_block.ffn.norm\"\n        }\n\n        full_key_mappings = {}\n        for k in pretrained_dict.keys():\n            old_k = k\n            for remap_key in remap_keys.keys():\n                if remap_key in k:\n                    print(f'=> Repace {remap_key} with {remap_keys[remap_key]}')\n                    k = k.replace(remap_key, remap_keys[remap_key])\n\n            full_key_mappings[old_k] = k\n\n        return full_key_mappings\n\n    def from_state_dict(self, pretrained_dict, pretrained_layers=[], verbose=True):\n        model_dict = self.state_dict()\n        stripped_key = lambda x: x[14:] if x.startswith('image_encoder.') else x\n        full_key_mappings = self._try_remap_keys(pretrained_dict)\n\n        pretrained_dict = {\n            stripped_key(full_key_mappings[k]): v for k, v in pretrained_dict.items()\n            if stripped_key(full_key_mappings[k]) in model_dict.keys()\n        }\n        need_init_state_dict = {}\n        for k, v in pretrained_dict.items():\n            need_init = (\n                k.split('.')[0] in pretrained_layers\n                or pretrained_layers[0] == '*'\n            )\n            if need_init:\n                if verbose:\n                    print(f'=> init {k} from pretrained state dict')\n\n                need_init_state_dict[k] = v\n        self.load_state_dict(need_init_state_dict, strict=False)\n\n    def from_pretrained(self, pretrained='', pretrained_layers=[], verbose=True):\n     
   if os.path.isfile(pretrained):\n            print(f'=> loading pretrained model {pretrained}')\n            pretrained_dict = torch.load(pretrained, map_location='cpu')\n\n            self.from_state_dict(pretrained_dict, pretrained_layers, verbose)\n\n    def forward_features(self, x):\n        input_size = (x.size(2), x.size(3))\n\n        outs = {}\n        for i, (conv, block) in enumerate(zip(self.convs, self.blocks)):\n            x, input_size = conv(x, input_size)\n            if self.enable_checkpoint:\n                x, input_size = checkpoint.checkpoint(block, x, input_size)\n            else:\n                x, input_size = block(x, input_size)\n            if i in self.out_indices:\n                out = x.view(-1, *input_size, self.embed_dims[i]).permute(0, 3, 1, 2).contiguous()\n                outs[\"res{}\".format(i + 2)] = out       \n\n        if len(self.out_indices) == 0:\n            outs[\"res5\"] = x.view(-1, *input_size, self.embed_dims[-1]).permute(0, 3, 1, 2).contiguous()\n        \n        return outs\n\n    def forward(self, x):\n        x = self.forward_features(x)\n        # x = self.head(x)\n        return x\n\nclass D2DaViT(DaViT, Backbone):\n    def __init__(self, cfg, input_shape):\n\n        spec = cfg['BACKBONE']['DAVIT']\n\n        super().__init__(\n            num_classes=0,\n            depths=spec['DEPTHS'],\n            embed_dims=spec['DIM_EMBED'],\n            num_heads=spec['NUM_HEADS'],\n            num_groups=spec['NUM_GROUPS'],\n            patch_size=spec['PATCH_SIZE'],\n            patch_stride=spec['PATCH_STRIDE'],\n            patch_padding=spec['PATCH_PADDING'],\n            patch_prenorm=spec['PATCH_PRENORM'],\n            drop_path_rate=spec['DROP_PATH_RATE'],\n            img_size=input_shape,\n            window_size=spec.get('WINDOW_SIZE', 7),\n            enable_checkpoint=spec.get('ENABLE_CHECKPOINT', False),\n            conv_at_attn=spec.get('CONV_AT_ATTN', True),\n            conv_at_ffn=spec.get('CONV_AT_FFN', True),\n            out_indices=spec.get('OUT_INDICES', []),\n        )\n\n        self._out_features = cfg['BACKBONE']['DAVIT']['OUT_FEATURES']\n\n        self._out_feature_strides = {\n            \"res2\": 4,\n            \"res3\": 8,\n            \"res4\": 16,\n            \"res5\": 32,\n        }\n        self._out_feature_channels = {\n            \"res2\": self.embed_dims[0],\n            \"res3\": self.embed_dims[1],\n            \"res4\": self.embed_dims[2],\n            \"res5\": self.embed_dims[3],\n        }\n\n    def forward(self, x):\n        \"\"\"\n        Args:\n            x: Tensor of shape (N,C,H,W). H, W must be a multiple of ``self.size_divisibility``.\n        Returns:\n            dict[str->Tensor]: names and the corresponding features\n        \"\"\"\n        assert (\n            x.dim() == 4\n        ), f\"SwinTransformer takes an input of shape (N, C, H, W). 
Got {x.shape} instead!\"\n        outputs = {}\n        y = super().forward(x)\n\n        for k in y.keys():\n            if k in self._out_features:\n                outputs[k] = y[k]\n        return outputs\n\n    def output_shape(self):\n        return {\n            name: ShapeSpec(\n                channels=self._out_feature_channels[name], stride=self._out_feature_strides[name]\n            )\n            for name in self._out_features\n        }\n\n    @property\n    def size_divisibility(self):\n        return 32\n\n@register_backbone\ndef get_davit_backbone(cfg):\n    davit = D2DaViT(cfg['MODEL'], 224)    \n\n    if cfg['MODEL']['BACKBONE']['LOAD_PRETRAINED'] is True:\n        filename = cfg['MODEL']['BACKBONE']['PRETRAINED']\n        logger.info(f'=> init from {filename}')\n        davit.from_pretrained(\n            filename, \n            cfg['MODEL']['BACKBONE']['DAVIT'].get('PRETRAINED_LAYERS', ['*']), \n            cfg['VERBOSE'])\n\n    return davit"
  },
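The `davit.py` backbone above is driven entirely by the nested config dictionary that `get_davit_backbone` / `D2DaViT` read (`MODEL.BACKBONE.DAVIT` plus `MODEL.BACKBONE.LOAD_PRETRAINED` and `VERBOSE`). Below is a minimal, hedged sketch of building such a config and running one forward pass; the import path `glee.backbone.davit` is an assumption (it depends on how the `thirdparty/GLEE` package is put on `PYTHONPATH`), and the hyperparameter values are illustrative DaViT-Tiny-style numbers, not necessarily the ones GLEE ships in its own configs.
```python
import torch
# Assumed import path; adjust to wherever thirdparty/GLEE/glee is installed.
from glee.backbone.davit import get_davit_backbone

# Only the keys actually read by get_davit_backbone / D2DaViT are provided.
cfg = {
    "VERBOSE": False,
    "MODEL": {
        "BACKBONE": {
            "LOAD_PRETRAINED": False,   # skip loading a .pth checkpoint in this sketch
            "PRETRAINED": "",
            "DAVIT": {
                # Illustrative DaViT-Tiny-like settings (four stages).
                "DEPTHS": [1, 1, 3, 1],
                "DIM_EMBED": [96, 192, 384, 768],
                "NUM_HEADS": [3, 6, 12, 24],
                "NUM_GROUPS": [3, 6, 12, 24],
                "PATCH_SIZE": [7, 2, 2, 2],
                "PATCH_STRIDE": [4, 2, 2, 2],
                "PATCH_PADDING": [3, 0, 0, 0],
                "PATCH_PRENORM": [False, True, True, True],
                "DROP_PATH_RATE": 0.1,
                "OUT_INDICES": [0, 1, 2, 3],            # emit res2..res5
                "OUT_FEATURES": ["res2", "res3", "res4", "res5"],
            },
        }
    },
}

backbone = get_davit_backbone(cfg)
feats = backbone(torch.randn(1, 3, 224, 224))  # H, W multiples of 32
print({k: tuple(v.shape) for k, v in feats.items()})
```
The stride/channel bookkeeping in `D2DaViT.output_shape()` (res2 at stride 4 through res5 at stride 32) is what lets this backbone plug into a detectron2-style pixel decoder downstream.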
  {
    "path": "thirdparty/GLEE/glee/backbone/eva01.py",
    "content": "import logging\nimport math\nfrom functools import partial\n\nimport fvcore.nn.weight_init as weight_init\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom torch import Tensor, Size\nfrom typing import Union, List\nfrom torch.nn.parameter import Parameter\nimport numbers\n\nfrom detectron2.layers import CNNBlockBase, Conv2d, get_norm\nfrom detectron2.modeling.backbone.fpn import _assert_strides_are_log2_contiguous\n\nfrom fairscale.nn.checkpoint import checkpoint_wrapper\nfrom timm.models.layers import DropPath, Mlp, trunc_normal_\n\n# from detectron2.modeling.backbone import Backbone\nfrom detectron2.modeling import BACKBONE_REGISTRY, Backbone, ShapeSpec\n\nfrom .eva_01_utils import (\n    PatchEmbed,\n    add_decomposed_rel_pos,\n    get_abs_pos,\n    window_partition,\n    window_unpartition,\n)\nfrom detectron2.modeling.backbone.fpn import LastLevelMaxPool\n\nlogger = logging.getLogger(__name__)\n\n\n__all__ = [\"EVAViT\", \"SimpleFeaturePyramid\", \"get_vit_lr_decay_rate\"]\n\n\n_shape_t = Union[int, List[int], Size]\n\n\n# steal from beit https://github.com/microsoft/unilm/tree/master/beit\nclass LayerNormWithForceFP32(nn.Module):\n    __constants__ = ['normalized_shape', 'eps', 'elementwise_affine']\n    normalized_shape: _shape_t\n    eps: float\n    elementwise_affine: bool\n\n    def __init__(self, normalized_shape: _shape_t, eps: float = 1e-5, elementwise_affine: bool = True) -> None:\n        super(LayerNormWithForceFP32, self).__init__()\n        if isinstance(normalized_shape, numbers.Integral):\n            normalized_shape = (normalized_shape,)\n        self.normalized_shape = tuple(normalized_shape)\n        self.eps = eps\n        self.elementwise_affine = elementwise_affine\n        if self.elementwise_affine:\n            self.weight = Parameter(torch.Tensor(*normalized_shape))\n            self.bias = Parameter(torch.Tensor(*normalized_shape))\n        else:\n            self.register_parameter('weight', None)\n            self.register_parameter('bias', None)\n        self.reset_parameters()\n\n    def reset_parameters(self) -> None:\n        if self.elementwise_affine:\n            nn.init.ones_(self.weight)\n            nn.init.zeros_(self.bias)\n\n    def forward(self, input: Tensor) -> Tensor:\n        return F.layer_norm(\n            input.float(), self.normalized_shape, self.weight.float(), self.bias.float(), self.eps).type_as(input)\n\n    def extra_repr(self) -> Tensor:\n        return '{normalized_shape}, eps={eps}, ' \\\n               'elementwise_affine={elementwise_affine}'.format(**self.__dict__)\n\n\nclass Attention(nn.Module):\n    \"\"\"Multi-head Attention block with relative position embeddings.\"\"\"\n\n    def __init__(\n        self,\n        dim,\n        num_heads=8,\n        qkv_bias=True,\n        beit_like_qkv_bias=False,\n        use_rel_pos=False,\n        rel_pos_zero_init=True,\n        input_size=None,\n        interp_type=\"vitdet\",\n    ):\n        \"\"\"\n        Args:\n            dim (int): Number of input channels.\n            num_heads (int): Number of attention heads.\n            qkv_bias (bool:  If True, add a learnable bias to query, key, value.\n            rel_pos (bool): If True, add relative positional embeddings to the attention map.\n            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.\n            input_size (int or None): Input resolution for calculating the relative positional\n                parameter size.\n        \"\"\"\n      
  super().__init__()\n        self.num_heads = num_heads\n        head_dim = dim // num_heads\n        self.scale = head_dim**-0.5\n\n        self.beit_like_qkv_bias = beit_like_qkv_bias\n        if beit_like_qkv_bias:\n            self.q_bias = nn.Parameter(torch.zeros(dim))\n            self.v_bias = nn.Parameter(torch.zeros(dim))\n\n        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)\n        self.proj = nn.Linear(dim, dim)\n\n        self.use_rel_pos = use_rel_pos\n        self.interp_type = interp_type\n        if self.use_rel_pos:\n            # initialize relative positional embeddings\n            self.rel_pos_h = nn.Parameter(torch.zeros(2 * input_size[0] - 1, head_dim))\n            self.rel_pos_w = nn.Parameter(torch.zeros(2 * input_size[1] - 1, head_dim))\n\n            if not rel_pos_zero_init:\n                trunc_normal_(self.rel_pos_h, std=0.02)\n                trunc_normal_(self.rel_pos_w, std=0.02)\n        self.qk_float = False\n\n    def forward(self, x):\n        B, H, W, _ = x.shape\n        # qkv with shape (3, B, nHead, H * W, C)\n        if self.beit_like_qkv_bias:\n            qkv_bias = torch.cat((self.q_bias, torch.zeros_like(self.v_bias, requires_grad=False), self.v_bias))\n            qkv = torch.nn.functional.linear(input=x, weight=self.qkv.weight, bias=qkv_bias)\n            qkv = qkv.reshape(B, H * W, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)\n        else:\n            qkv = self.qkv(x).reshape(B, H * W, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)\n        # q, k, v with shape (B * nHead, H * W, C)\n        q, k, v = qkv.reshape(3, B * self.num_heads, H * W, -1).unbind(0)\n\n        if self.qk_float:\n            attn = (q.float() * self.scale) @ k.float().transpose(-2, -1)\n            if self.use_rel_pos:\n                attn = add_decomposed_rel_pos(attn, q, self.rel_pos_h, self.rel_pos_w, (H, W), (H, W), self.interp_type)\n            attn = attn.softmax(dim=-1).type_as(x)\n        else:\n            attn = (q * self.scale) @ k.transpose(-2, -1)\n            if self.use_rel_pos:\n                attn = add_decomposed_rel_pos(attn, q, self.rel_pos_h, self.rel_pos_w, (H, W), (H, W), self.interp_type)\n            attn = attn.softmax(dim=-1)\n        x = (attn @ v).view(B, self.num_heads, H, W, -1).permute(0, 2, 3, 1, 4).reshape(B, H, W, -1)\n        x = self.proj(x)\n\n        return x\n\n\nclass ResBottleneckBlock(CNNBlockBase):\n    \"\"\"\n    The standard bottleneck residual block without the last activation layer.\n    It contains 3 conv layers with kernels 1x1, 3x3, 1x1.\n    \"\"\"\n\n    def __init__(\n        self,\n        in_channels,\n        out_channels,\n        bottleneck_channels,\n        norm=\"LN\",\n        act_layer=nn.GELU,\n    ):\n        \"\"\"\n        Args:\n            in_channels (int): Number of input channels.\n            out_channels (int): Number of output channels.\n            bottleneck_channels (int): number of output channels for the 3x3\n                \"bottleneck\" conv layers.\n            norm (str or callable): normalization for all conv layers.\n                See :func:`layers.get_norm` for supported format.\n            act_layer (callable): activation for all conv layers.\n        \"\"\"\n        super().__init__(in_channels, out_channels, 1)\n\n        self.conv1 = Conv2d(in_channels, bottleneck_channels, 1, bias=False)\n        self.norm1 = get_norm(norm, bottleneck_channels)\n        self.act1 = act_layer()\n\n        self.conv2 = Conv2d(\n            bottleneck_channels,\n            
bottleneck_channels,\n            3,\n            padding=1,\n            bias=False,\n        )\n        self.norm2 = get_norm(norm, bottleneck_channels)\n        self.act2 = act_layer()\n\n        self.conv3 = Conv2d(bottleneck_channels, out_channels, 1, bias=False)\n        self.norm3 = get_norm(norm, out_channels)\n\n        for layer in [self.conv1, self.conv2, self.conv3]:\n            weight_init.c2_msra_fill(layer)\n        for layer in [self.norm1, self.norm2]:\n            layer.weight.data.fill_(1.0)\n            layer.bias.data.zero_()\n        # zero init last norm layer.\n        self.norm3.weight.data.zero_()\n        self.norm3.bias.data.zero_()\n\n    def forward(self, x):\n        out = x\n        for layer in self.children():\n            out = layer(out)\n\n        out = x + out\n        return out\n\n\nclass Block(nn.Module):\n    \"\"\"Transformer blocks with support of window attention and residual propagation blocks\"\"\"\n\n    def __init__(\n        self,\n        dim,\n        num_heads,\n        mlp_ratio=4.0,\n        qkv_bias=True,\n        drop_path=0.0,\n        norm_layer=LayerNormWithForceFP32,\n        act_layer=nn.GELU,\n        use_rel_pos=False,\n        rel_pos_zero_init=True,\n        window_size=0,\n        use_residual_block=False,\n        input_size=None,\n        beit_like_qkv_bias=False,\n        beit_like_gamma=False,\n        interp_type=\"vitdet\",\n    ):\n        \"\"\"\n        Args:\n            dim (int): Number of input channels.\n            num_heads (int): Number of attention heads in each ViT block.\n            mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.\n            qkv_bias (bool): If True, add a learnable bias to query, key, value.\n            drop_path (float): Stochastic depth rate.\n            norm_layer (nn.Module): Normalization layer.\n            act_layer (nn.Module): Activation layer.\n            use_rel_pos (bool): If True, add relative positional embeddings to the attention map.\n            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.\n            window_size (int): Window size for window attention blocks. 
If it equals 0, then not\n                use window attention.\n            use_residual_block (bool): If True, use a residual block after the MLP block.\n            input_size (int or None): Input resolution for calculating the relative positional\n                parameter size.\n            beit_like_qkv_bias (bool)\n            beit_like_gamma (bool)\n        \"\"\"\n        super().__init__()\n        self.norm1 = norm_layer(dim)\n        self.attn = Attention(\n            dim,\n            num_heads=num_heads,\n            qkv_bias=qkv_bias,\n            use_rel_pos=use_rel_pos,\n            rel_pos_zero_init=rel_pos_zero_init,\n            input_size=input_size if window_size == 0 else (window_size, window_size),\n            beit_like_qkv_bias=beit_like_qkv_bias,\n            interp_type=interp_type,\n        )\n\n        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()\n        self.norm2 = norm_layer(dim)\n        self.mlp = Mlp(in_features=dim, hidden_features=int(dim * mlp_ratio), act_layer=act_layer)\n\n        self.window_size = window_size\n\n        self.use_residual_block = use_residual_block\n        if use_residual_block:\n            # Use a residual block with bottleneck channel as dim // 2\n            self.residual = ResBottleneckBlock(\n                in_channels=dim,\n                out_channels=dim,\n                bottleneck_channels=dim // 2,\n                norm=\"LN\",\n                act_layer=act_layer,\n            )\n\n        self.beit_like_gamma = beit_like_gamma\n        if beit_like_gamma:\n            self.gamma_1 = nn.Parameter(torch.ones((dim)), requires_grad=True)\n            self.gamma_2 = nn.Parameter(torch.ones((dim)), requires_grad=True)\n\n    def forward(self, x):\n        shortcut = x\n        x = self.norm1(x)\n        # Window partition\n        if self.window_size > 0:\n            H, W = x.shape[1], x.shape[2]\n            x, pad_hw = window_partition(x, self.window_size)\n\n        x = self.attn(x)\n        # Reverse window partition\n        if self.window_size > 0:\n            x = window_unpartition(x, self.window_size, pad_hw, (H, W))\n\n        if self.beit_like_gamma:\n            x = shortcut + self.drop_path(self.gamma_1 * x)\n            x = x + self.drop_path(self.gamma_2 * self.mlp(self.norm2(x)))\n        else:\n            x = shortcut + self.drop_path(x)\n            x = x + self.drop_path(self.mlp(self.norm2(x)))\n\n        if self.use_residual_block:\n            x = self.residual(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)\n\n        return x\n\n\nclass EVAViT(Backbone):\n    \"\"\"\n    This module implements Vision Transformer (ViT) backbone in :paper:`vitdet`.\n    \"Exploring Plain Vision Transformer Backbones for Object Detection\",\n    https://arxiv.org/abs/2203.16527\n    \"\"\"\n\n    def __init__(\n        self,\n        img_size=1024,\n        patch_size=16,\n        in_chans=3,\n        embed_dim=768,\n        depth=12,\n        num_heads=12,\n        mlp_ratio=4.0,\n        qkv_bias=True,\n        drop_path_rate=0.0,\n        norm_layer=LayerNormWithForceFP32,\n        act_layer=nn.GELU,\n        use_abs_pos=True,\n        use_rel_pos=False,\n        rel_pos_zero_init=True,\n        window_size=0,\n        window_block_indexes=(),\n        residual_block_indexes=(),\n        use_act_checkpoint=False,\n        pretrain_img_size=224,\n        pretrain_use_cls_token=True,\n        out_feature=\"last_feat\",\n        beit_like_qkv_bias=True,\n        beit_like_gamma=False,\n    
    freeze_patch_embed=False,\n        interp_type=\"vitdet\", \n    ):\n        \"\"\"\n        Args:\n            img_size (int): Input image size.\n            patch_size (int): Patch size.\n            in_chans (int): Number of input image channels.\n            embed_dim (int): Patch embedding dimension.\n            depth (int): Depth of ViT.\n            num_heads (int): Number of attention heads in each ViT block.\n            mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.\n            qkv_bias (bool): If True, add a learnable bias to query, key, value.\n            drop_path_rate (float): Stochastic depth rate.\n            norm_layer (nn.Module): Normalization layer.\n            act_layer (nn.Module): Activation layer.\n            use_abs_pos (bool): If True, use absolute positional embeddings.\n            use_rel_pos (bool): If True, add relative positional embeddings to the attention map.\n            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.\n            window_size (int): Window size for window attention blocks.\n            window_block_indexes (list): Indexes for blocks using window attention.\n            residual_block_indexes (list): Indexes for blocks using conv propagation.\n            use_act_checkpoint (bool): If True, use activation checkpointing.\n            pretrain_img_size (int): input image size for pretraining models.\n            pretrain_use_cls_token (bool): If True, pretrainig models use class token.\n            out_feature (str): name of the feature from the last block.\n            beit_like_qkv_bias (bool): beit_like_model that has gamma_1 and gamma_2 in blocks and qkv_bias=False\n            beit_like_gamma (bool)\n            freeze_patch_embed (bool)\n            interp_type: \"vitdet\" for training / fine-ting, \"beit\" for eval (slightly improvement at a higher res)\n        \"\"\"\n        super().__init__()\n        self.pretrain_use_cls_token = pretrain_use_cls_token\n\n        self.patch_embed = PatchEmbed(\n            kernel_size=(patch_size, patch_size),\n            stride=(patch_size, patch_size),\n            in_chans=in_chans,\n            embed_dim=embed_dim,\n        )\n\n        if use_abs_pos:\n            # Initialize absolute positional embedding with pretrain image size.\n            num_patches = (pretrain_img_size // patch_size) * (pretrain_img_size // patch_size)\n            num_positions = (num_patches + 1) if pretrain_use_cls_token else num_patches\n            self.pos_embed = nn.Parameter(torch.zeros(1, num_positions, embed_dim))\n        else:\n            self.pos_embed = None\n\n        # stochastic depth decay rule\n        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]\n\n        self.blocks = nn.ModuleList()\n        if beit_like_qkv_bias:\n            qkv_bias = False\n        for i in range(depth):\n            block = Block(\n                dim=embed_dim,\n                num_heads=num_heads,\n                mlp_ratio=mlp_ratio,\n                qkv_bias=qkv_bias,\n                drop_path=dpr[i],\n                norm_layer=norm_layer,\n                act_layer=act_layer,\n                use_rel_pos=use_rel_pos,\n                rel_pos_zero_init=rel_pos_zero_init,\n                window_size=window_size if i in window_block_indexes else 0,\n                use_residual_block=i in residual_block_indexes,\n                input_size=(img_size // patch_size, img_size // patch_size),\n                
beit_like_qkv_bias=beit_like_qkv_bias,\n                beit_like_gamma=beit_like_gamma,\n                interp_type=interp_type,\n            )\n            if use_act_checkpoint:\n                block = checkpoint_wrapper(block)\n            self.blocks.append(block)\n\n        self._out_feature_channels = {out_feature: embed_dim}\n        self._out_feature_strides = {out_feature: patch_size}\n        self._out_features = [out_feature]\n\n        if self.pos_embed is not None:\n            trunc_normal_(self.pos_embed, std=0.02)\n\n        self.freeze_patch_embed = freeze_patch_embed\n        self.apply(self._init_weights)\n\n    def _init_weights(self, m):\n        if isinstance(m, nn.Linear):\n            trunc_normal_(m.weight, std=0.02)\n            if isinstance(m, nn.Linear) and m.bias is not None:\n                nn.init.constant_(m.bias, 0)\n        elif isinstance(m, LayerNormWithForceFP32):\n            nn.init.constant_(m.bias, 0)\n            nn.init.constant_(m.weight, 1.0)\n\n        if self.freeze_patch_embed:\n            for n, p in self.patch_embed.named_parameters():\n                p.requires_grad = False\n\n    def forward(self, x):\n        x = self.patch_embed(x)\n        if self.pos_embed is not None:\n            x = x + get_abs_pos(\n                self.pos_embed, self.pretrain_use_cls_token, (x.shape[1], x.shape[2])\n            )\n\n        for blk in self.blocks:\n            x = blk(x)\n\n        outputs = {self._out_features[0]: x.permute(0, 3, 1, 2)}\n        return outputs\n\n\nclass SimpleFeaturePyramid(Backbone):\n    \"\"\"\n    This module implements SimpleFeaturePyramid in :paper:`vitdet`.\n    It creates pyramid features built on top of the input feature map.\n    \"\"\"\n\n    def __init__(\n        self,\n        net,\n        in_feature,\n        out_channels,\n        scale_factors,\n        top_block=None,\n        norm=\"LN\",\n        square_pad=0,\n    ):\n        \"\"\"\n        Args:\n            net (Backbone): module representing the subnetwork backbone.\n                Must be a subclass of :class:`Backbone`.\n            in_feature (str): names of the input feature maps coming\n                from the net.\n            out_channels (int): number of channels in the output feature maps.\n            scale_factors (list[float]): list of scaling factors to upsample or downsample\n                the input features for creating pyramid features.\n            top_block (nn.Module or None): if provided, an extra operation will\n                be performed on the output of the last (smallest resolution)\n                pyramid output, and the result will extend the result list. The top_block\n                further downsamples the feature map. 
It must have an attribute\n                \"num_levels\", meaning the number of extra pyramid levels added by\n                this block, and \"in_feature\", which is a string representing\n                its input feature (e.g., p5).\n            norm (str): the normalization to use.\n            square_pad (int): If > 0, require input images to be padded to specific square size.\n        \"\"\"\n        super(SimpleFeaturePyramid, self).__init__()\n        assert isinstance(net, Backbone)\n        self.scale_factors = scale_factors\n\n        input_shapes = net.output_shape()\n        strides = [int(input_shapes[in_feature].stride / scale) for scale in scale_factors]\n        _assert_strides_are_log2_contiguous(strides)\n\n        dim = input_shapes[in_feature].channels\n        self.stages = []\n        use_bias = norm == \"\"\n        for idx, scale in enumerate(scale_factors):\n            out_dim = dim\n            if scale == 4.0:\n                layers = [\n                    nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2),\n                    get_norm(norm, dim // 2),\n                    nn.GELU(),\n                    nn.ConvTranspose2d(dim // 2, dim // 4, kernel_size=2, stride=2),\n                ]\n                out_dim = dim // 4\n            elif scale == 2.0:\n                layers = [nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2)]\n                out_dim = dim // 2\n            elif scale == 1.0:\n                layers = []\n            elif scale == 0.5:\n                layers = [nn.MaxPool2d(kernel_size=2, stride=2)]\n            else:\n                raise NotImplementedError(f\"scale_factor={scale} is not supported yet.\")\n\n            layers.extend(\n                [\n                    Conv2d(\n                        out_dim,\n                        out_channels,\n                        kernel_size=1,\n                        bias=use_bias,\n                        norm=get_norm(norm, out_channels),\n                    ),\n                    Conv2d(\n                        out_channels,\n                        out_channels,\n                        kernel_size=3,\n                        padding=1,\n                        bias=use_bias,\n                        norm=get_norm(norm, out_channels),\n                    ),\n                ]\n            )\n            layers = nn.Sequential(*layers)\n\n            stage = int(math.log2(strides[idx]))\n            self.add_module(f\"simfp_{stage}\", layers)\n            self.stages.append(layers)\n\n        self.net = net\n        self.in_feature = in_feature\n        self.top_block = top_block\n        # Return feature names are \"p<stage>\", like [\"p2\", \"p3\", ..., \"p6\"]\n        self._out_feature_strides = {\"p{}\".format(int(math.log2(s))): s for s in strides}\n        # top block output feature maps.\n        if self.top_block is not None:\n            for s in range(stage, stage + self.top_block.num_levels):\n                self._out_feature_strides[\"p{}\".format(s + 1)] = 2 ** (s + 1)\n\n        self._out_features = list(self._out_feature_strides.keys())\n        self._out_feature_channels = {k: out_channels for k in self._out_features}\n        self._size_divisibility = strides[-1]\n        self._square_pad = square_pad\n\n    @property\n    def padding_constraints(self):\n        return {\n            \"size_divisiblity\": self._size_divisibility,\n            \"square_size\": self._square_pad,\n        }\n\n    def forward(self, x):\n        \"\"\"\n       
 Args:\n            x: Tensor of shape (N,C,H,W). H, W must be a multiple of ``self.size_divisibility``.\n\n        Returns:\n            dict[str->Tensor]:\n                mapping from feature map name to pyramid feature map tensor\n                in high to low resolution order. Returned feature names follow the FPN\n                convention: \"p<stage>\", where stage has stride = 2 ** stage e.g.,\n                [\"p2\", \"p3\", ..., \"p6\"].\n        \"\"\"\n        bottom_up_features = self.net(x)\n        features = bottom_up_features[self.in_feature]\n        results = []\n\n        for stage in self.stages:\n            results.append(stage(features))\n\n        if self.top_block is not None:\n            if self.top_block.in_feature in bottom_up_features:\n                top_block_in_feature = bottom_up_features[self.top_block.in_feature]\n            else:\n                top_block_in_feature = results[self._out_features.index(self.top_block.in_feature)]\n            results.extend(self.top_block(top_block_in_feature))\n        assert len(self._out_features) == len(results)\n        return {f: res for f, res in zip(self._out_features, results)}\n\n\n\n@BACKBONE_REGISTRY.register()\nclass D2_EVA01(SimpleFeaturePyramid):\n    def __init__(self, cfg, input_shape):\n        \n        super().__init__(\n            net = EVAViT(\n                img_size= cfg.MODEL.EVA01.IMAGE_SIZE,\n                patch_size=cfg.MODEL.EVA01.PATCH_SIZE,\n                window_size= cfg.MODEL.EVA01.WINDOW_SIZE,\n                embed_dim= cfg.MODEL.EVA01.DMBED_DIM,\n                depth= cfg.MODEL.EVA01.DEPTH,\n                num_heads= cfg.MODEL.EVA01.NUM_HEADS ,\n                drop_path_rate= cfg.MODEL.EVA01.DROP_PATH_RATE,\n                mlp_ratio= cfg.MODEL.EVA01.MLP_RATIO,\n                qkv_bias=True,\n                norm_layer=partial(nn.LayerNorm, eps=1e-6),\n                window_block_indexes= cfg.MODEL.EVA01.WINDOW_BLOCK_INDEXES,\n                residual_block_indexes=[],\n                use_act_checkpoint = True,\n                use_rel_pos = True,\n                out_feature=\"last_feat\",\n                beit_like_qkv_bias=cfg.MODEL.EVA01.BEIT_LIKE_QKV_BIAS ,\n                beit_like_gamma= cfg.MODEL.EVA01.BEIT_LIKE_GAMMA,\n                freeze_patch_embed= cfg.MODEL.EVA01.FREEZE_PATH_EMBED,\n            ),\n            in_feature = \"last_feat\",\n            out_channels=256,\n            scale_factors=(2.0, 1.0, 0.5),  # (4.0, 2.0, 1.0, 0.5) in ViTDet\n            top_block=LastLevelMaxPool(),\n            norm=\"LN\",\n            square_pad=cfg.MODEL.EVA01.IMAGE_SIZE,\n\n        )\n        pretrained_weight = cfg.MODEL.EVA01.PRETRAINED_WEIGHT \n        if pretrained_weight:\n            checkpoint = torch.load(pretrained_weight, map_location='cpu')\n            print(f'\\nload pretrain weight from {pretrained_weight} \\n') \n            self.load_state_dict(checkpoint['model'], strict=False)\n \n    def output_shape(self):\n        return {\n            name: ShapeSpec(\n                channels=self._out_feature_channels[name], stride=self._out_feature_strides[name]\n            )\n            for name in self._out_features\n        }\n\n    @property\n    def size_divisibility(self):\n        return 32\n\n\n\ndef get_vit_lr_decay_rate(name, lr_decay_rate=1.0, num_layers=12):\n    \"\"\"\n    Calculate lr decay rate for different ViT blocks.\n    Args:\n        name (string): parameter name.\n        lr_decay_rate (float): base lr decay rate.\n        
num_layers (int): number of ViT blocks.\n\n    Returns:\n        lr decay rate for the given parameter.\n    \"\"\"\n    layer_id = num_layers + 1\n    if 'backbone' in name: #name.startswith(\"backbone\"):\n        if \".pos_embed\" in name or \".patch_embed\" in name:\n            layer_id = 0\n        elif \".blocks.\" in name and \".residual.\" not in name:\n            layer_id = int(name[name.find(\".blocks.\") :].split(\".\")[2]) + 1\n\n    return lr_decay_rate ** (num_layers + 1 - layer_id)\n"
  },
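`eva01.py` ends with `get_vit_lr_decay_rate`, the ViTDet-style layer-wise learning-rate decay rule: patch/positional embeddings get the deepest decay, each transformer block gets progressively less, and non-backbone or FPN parameters get a multiplier of 1.0. A small, hedged sketch of how the multipliers come out is below; the import path is assumed, and the parameter names are hypothetical examples of how a `D2_EVA01` backbone's parameters would typically be named inside a detectron2 model.
```python
# Sketch: per-parameter LR multipliers from the layer-wise decay rule in eva01.py.
from glee.backbone.eva01 import get_vit_lr_decay_rate  # assumed import path

names = [
    "backbone.net.pos_embed",                  # embeddings -> layer_id 0 (strongest decay)
    "backbone.net.blocks.0.attn.qkv.weight",   # first block -> layer_id 1
    "backbone.net.blocks.11.mlp.fc2.bias",     # last block  -> layer_id 12
    "backbone.net.simfp_4.0.weight",           # pyramid conv -> layer_id num_layers + 1
]
for n in names:
    mult = get_vit_lr_decay_rate(n, lr_decay_rate=0.7, num_layers=12)
    print(f"{n}: lr multiplier = {mult:.4f}")
# With decay 0.7 and 12 layers: pos_embed -> 0.7**13, block 0 -> 0.7**12,
# block 11 -> 0.7**1, and the simfp_* parameters -> 1.0.
```
These multipliers are typically fed into an optimizer's per-parameter-group learning rates when fine-tuning the EVA backbone.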
  {
    "path": "thirdparty/GLEE/glee/backbone/eva02-dino.py",
    "content": "import logging\nimport math\nfrom functools import partial\n\nimport fvcore.nn.weight_init as weight_init\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom detectron2.layers import CNNBlockBase, Conv2d, get_norm\nfrom detectron2.modeling.backbone.fpn import _assert_strides_are_log2_contiguous\n\nfrom detectron2.modeling.backbone import Backbone\nfrom .eva_02_utils import (\n    PatchEmbed,\n    add_decomposed_rel_pos,\n    get_abs_pos,\n    window_partition,\n    window_unpartition,\n    VisionRotaryEmbeddingFast,\n)\n\ntry:\n    import xformers.ops as xops\n    HAS_XFORMER=True\nexcept:\n    HAS_XFORMER=False\n    pass\n\n\nlogger = logging.getLogger(__name__)\n\n\n\n__all__ = [\"EVA02_ViT\", \"SimpleFeaturePyramid\", \"get_vit_lr_decay_rate\"]\n\n\n\nclass SwiGLU(nn.Module):\n    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.SiLU, drop=0., \n                norm_layer=nn.LayerNorm, subln=False\n            ):\n        super().__init__()\n        out_features = out_features or in_features\n        hidden_features = hidden_features or in_features\n\n        self.w1 = nn.Linear(in_features, hidden_features)\n        self.w2 = nn.Linear(in_features, hidden_features)\n\n        self.act = act_layer()\n        self.ffn_ln = norm_layer(hidden_features) if subln else nn.Identity()\n        self.w3 = nn.Linear(hidden_features, out_features)\n        \n        self.drop = nn.Dropout(drop)\n\n    def forward(self, x):\n        x1 = self.w1(x)\n        x2 = self.w2(x)\n        hidden = self.act(x1) * x2\n        x = self.ffn_ln(hidden)\n        x = self.w3(x)\n        x = self.drop(x)\n        return x\n    \n\nclass Attention(nn.Module):\n    def __init__(\n            self, \n            dim, \n            num_heads=8, \n            qkv_bias=True, \n            qk_scale=None, \n            attn_head_dim=None, \n            rope=None,\n            xattn=True,\n        ):\n        super().__init__()\n        self.num_heads = num_heads\n        head_dim = dim // num_heads\n        if attn_head_dim is not None:\n            head_dim = attn_head_dim\n        all_head_dim = head_dim * self.num_heads\n        self.scale = qk_scale or head_dim ** -0.5\n\n        self.q_proj = nn.Linear(dim, all_head_dim, bias=False)\n        self.k_proj = nn.Linear(dim, all_head_dim, bias=False)\n        self.v_proj = nn.Linear(dim, all_head_dim, bias=False)\n\n        if qkv_bias:\n            self.q_bias = nn.Parameter(torch.zeros(all_head_dim))\n            self.v_bias = nn.Parameter(torch.zeros(all_head_dim))\n        else:\n            self.q_bias = None\n            self.v_bias = None\n\n        self.rope = rope\n        self.xattn = xattn\n        self.proj = nn.Linear(all_head_dim, dim)\n\n        if not HAS_XFORMER:\n            self.xattn = False\n\n    def forward(self, x):\n        B, H, W, C = x.shape\n        x = x.view(B, -1, C)\n        N = H * W\n\n        q = F.linear(input=x, weight=self.q_proj.weight, bias=self.q_bias)\n        k = F.linear(input=x, weight=self.k_proj.weight, bias=None)\n        v = F.linear(input=x, weight=self.v_proj.weight, bias=self.v_bias)\n\n        q = q.reshape(B, N, self.num_heads, -1).permute(0, 2, 1, 3)     # B, num_heads, N, C\n        k = k.reshape(B, N, self.num_heads, -1).permute(0, 2, 1, 3)  \n        v = v.reshape(B, N, self.num_heads, -1).permute(0, 2, 1, 3) \n\n        ## rope\n        q = self.rope(q).type_as(v)\n        k = self.rope(k).type_as(v)\n\n        if self.xattn:\n         
   q = q.permute(0, 2, 1, 3)   # B, num_heads, N, C -> B, N, num_heads, C\n            k = k.permute(0, 2, 1, 3)\n            v = v.permute(0, 2, 1, 3)\n            \n            x = xops.memory_efficient_attention(q, k, v)\n            x = x.reshape(B, N, -1)\n        else:\n            q = q * self.scale\n            attn = (q @ k.transpose(-2, -1))\n            attn = attn.softmax(dim=-1).type_as(x)\n            x = (attn @ v).transpose(1, 2).reshape(B, N, -1)\n\n        x = self.proj(x)\n        x = x.view(B, H, W, C)\n\n        return x\n\n\nclass ResBottleneckBlock(CNNBlockBase):\n    \"\"\"\n    The standard bottleneck residual block without the last activation layer.\n    It contains 3 conv layers with kernels 1x1, 3x3, 1x1.\n    \"\"\"\n\n    def __init__(\n        self,\n        in_channels,\n        out_channels,\n        bottleneck_channels,\n        norm=\"LN\",\n        act_layer=nn.GELU,\n    ):\n        \"\"\"\n        Args:\n            in_channels (int): Number of input channels.\n            out_channels (int): Number of output channels.\n            bottleneck_channels (int): number of output channels for the 3x3\n                \"bottleneck\" conv layers.\n            norm (str or callable): normalization for all conv layers.\n                See :func:`layers.get_norm` for supported format.\n            act_layer (callable): activation for all conv layers.\n        \"\"\"\n        super().__init__(in_channels, out_channels, 1)\n\n        self.conv1 = Conv2d(in_channels, bottleneck_channels, 1, bias=False)\n        self.norm1 = get_norm(norm, bottleneck_channels)\n        self.act1 = act_layer()\n\n        self.conv2 = Conv2d(\n            bottleneck_channels,\n            bottleneck_channels,\n            3,\n            padding=1,\n            bias=False,\n        )\n        self.norm2 = get_norm(norm, bottleneck_channels)\n        self.act2 = act_layer()\n\n        self.conv3 = Conv2d(bottleneck_channels, out_channels, 1, bias=False)\n        self.norm3 = get_norm(norm, out_channels)\n\n        for layer in [self.conv1, self.conv2, self.conv3]:\n            weight_init.c2_msra_fill(layer)\n        for layer in [self.norm1, self.norm2]:\n            layer.weight.data.fill_(1.0)\n            layer.bias.data.zero_()\n        # zero init last norm layer.\n        self.norm3.weight.data.zero_()\n        self.norm3.bias.data.zero_()\n\n    def forward(self, x):\n        out = x\n        for layer in self.children():\n            out = layer(out)\n\n        out = x + out\n        return out\n\n\nclass Block(nn.Module):\n    \"\"\"Transformer blocks with support of window attention and residual propagation blocks\"\"\"\n\n    def __init__(\n        self,\n        dim,\n        num_heads,\n        mlp_ratio=4*2/3,\n        qkv_bias=True,\n        drop_path=0.0,\n        norm_layer=partial(nn.LayerNorm, eps=1e-6), \n        window_size=0,\n        use_residual_block=False,\n        rope=None,\n        xattn=True,\n    ):\n        \"\"\"\n        Args:\n            dim (int): Number of input channels.\n            num_heads (int): Number of attention heads in each ViT block.\n            mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.\n            qkv_bias (bool): If True, add a learnable bias to query, key, value.\n            drop_path (float): Stochastic depth rate.\n            norm_layer (nn.Module): Normalization layer.\n            act_layer (nn.Module): Activation layer.\n            use_rel_pos (bool): If True, add relative positional embeddings to the 
attention map.\n            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.\n            window_size (int): Window size for window attention blocks. If it equals 0, then not\n                use window attention.\n            use_residual_block (bool): If True, use a residual block after the MLP block.\n            input_size (int or None): Input resolution for calculating the relative positional\n                parameter size.\n        \"\"\"\n        super().__init__()\n        self.norm1 = norm_layer(dim)\n        self.attn = Attention(\n            dim,\n            num_heads=num_heads,\n            qkv_bias=qkv_bias,\n            rope=rope,\n            xattn=xattn,\n        )\n\n        from timm.models.layers import DropPath\n\n        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()\n        self.norm2 = norm_layer(dim)\n        self.mlp = SwiGLU(\n                in_features=dim, \n                hidden_features=int(dim * mlp_ratio), \n                subln=True,\n                norm_layer=norm_layer,\n            )\n\n        self.window_size = window_size\n\n        self.use_residual_block = use_residual_block\n        if use_residual_block:\n            # Use a residual block with bottleneck channel as dim // 2\n            self.residual = ResBottleneckBlock(\n                in_channels=dim,\n                out_channels=dim,\n                bottleneck_channels=dim // 2,\n                norm=\"LN\",\n            )\n\n    def forward(self, x):\n        shortcut = x\n        x = self.norm1(x)\n\n        # Window partition\n        if self.window_size > 0:\n            H, W = x.shape[1], x.shape[2]\n            x, pad_hw = window_partition(x, self.window_size)\n\n        x = self.attn(x)\n\n        # Reverse window partition\n        if self.window_size > 0:\n            x = window_unpartition(x, self.window_size, pad_hw, (H, W))\n\n        x = shortcut + self.drop_path(x)\n        x = x + self.drop_path(self.mlp(self.norm2(x)))\n\n        if self.use_residual_block:\n            x = self.residual(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)\n\n        return x\n\n\nclass EVA02_ViT(Backbone):\n    \"\"\"\n    This module implements Vision Transformer (ViT) backbone in :paper:`vitdet`.\n    \"Exploring Plain Vision Transformer Backbones for Object Detection\",\n    https://arxiv.org/abs/2203.16527\n    \"\"\"\n\n    def __init__(\n        self,\n        img_size=1024,\n        patch_size=16,\n        in_chans=3,\n        embed_dim=768,\n        depth=12,\n        num_heads=12,\n        mlp_ratio=4*2/3,\n        qkv_bias=True,\n        drop_path_rate=0.0,\n        norm_layer=partial(nn.LayerNorm, eps=1e-6),\n        act_layer=nn.GELU,\n        use_abs_pos=True,\n        use_rel_pos=False,\n        rope=True,\n        pt_hw_seq_len=16,\n        intp_freq=True,\n        window_size=0,\n        window_block_indexes=(),\n        residual_block_indexes=(),\n        use_act_checkpoint=False,\n        pretrain_img_size=224,\n        pretrain_use_cls_token=True,\n        out_feature=\"last_feat\",\n        xattn=True,\n    ):\n        \"\"\"\n        Args:\n            img_size (int): Input image size.\n            patch_size (int): Patch size.\n            in_chans (int): Number of input image channels.\n            embed_dim (int): Patch embedding dimension.\n            depth (int): Depth of ViT.\n            num_heads (int): Number of attention heads in each ViT block.\n            mlp_ratio (float): Ratio of mlp hidden 
dim to embedding dim.\n            qkv_bias (bool): If True, add a learnable bias to query, key, value.\n            drop_path_rate (float): Stochastic depth rate.\n            norm_layer (nn.Module): Normalization layer.\n            act_layer (nn.Module): Activation layer.\n            use_abs_pos (bool): If True, use absolute positional embeddings.\n            use_rel_pos (bool): If True, add relative positional embeddings to the attention map.\n            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.\n            window_size (int): Window size for window attention blocks.\n            window_block_indexes (list): Indexes for blocks using window attention.\n            residual_block_indexes (list): Indexes for blocks using conv propagation.\n            use_act_checkpoint (bool): If True, use activation checkpointing.\n            pretrain_img_size (int): input image size for pretraining models.\n            pretrain_use_cls_token (bool): If True, pretrainig models use class token.\n            out_feature (str): name of the feature from the last block.\n        \"\"\"\n        super().__init__()\n        self.pretrain_use_cls_token = pretrain_use_cls_token\n\n        self.patch_embed = PatchEmbed(\n            kernel_size=(patch_size, patch_size),\n            stride=(patch_size, patch_size),\n            in_chans=in_chans,\n            embed_dim=embed_dim,\n        )\n\n        if use_abs_pos:\n            # Initialize absolute positional embedding with pretrain image size.\n            num_patches = (pretrain_img_size // patch_size) * (pretrain_img_size // patch_size)\n            num_positions = (num_patches + 1) if pretrain_use_cls_token else num_patches\n            self.pos_embed = nn.Parameter(torch.zeros(1, num_positions, embed_dim))\n        else:\n            self.pos_embed = None\n\n\n        half_head_dim = embed_dim // num_heads // 2\n        hw_seq_len = img_size // patch_size\n\n        self.rope_win = VisionRotaryEmbeddingFast(\n            dim=half_head_dim,\n            pt_seq_len=pt_hw_seq_len,\n            ft_seq_len=window_size if intp_freq else None,\n        )\n        self.rope_glb = VisionRotaryEmbeddingFast(\n            dim=half_head_dim,\n            pt_seq_len=pt_hw_seq_len,\n            ft_seq_len=hw_seq_len if intp_freq else None,\n        )\n\n        # stochastic depth decay rule\n        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]\n\n        self.blocks = nn.ModuleList()\n        for i in range(depth):\n            block = Block(\n                dim=embed_dim,\n                num_heads=num_heads,\n                mlp_ratio=mlp_ratio,\n                qkv_bias=qkv_bias,\n                drop_path=dpr[i],\n                norm_layer=norm_layer,\n                window_size=window_size if i in window_block_indexes else 0,\n                use_residual_block=i in residual_block_indexes,\n                rope=self.rope_win if i in window_block_indexes else self.rope_glb,\n                xattn=xattn\n            )\n            if use_act_checkpoint:\n                # TODO: use torch.utils.checkpoint\n                from fairscale.nn.checkpoint import checkpoint_wrapper\n\n                block = checkpoint_wrapper(block)\n            self.blocks.append(block)\n\n        self._out_feature_channels = {out_feature: embed_dim}\n        self._out_feature_strides = {out_feature: patch_size}\n        self._out_features = [out_feature]\n\n        if self.pos_embed is not None:\n            
nn.init.trunc_normal_(self.pos_embed, std=0.02)\n\n        self.apply(self._init_weights)\n\n    def _init_weights(self, m):\n        if isinstance(m, nn.Linear):\n            nn.init.trunc_normal_(m.weight, std=0.02)\n            if isinstance(m, nn.Linear) and m.bias is not None:\n                nn.init.constant_(m.bias, 0)\n        elif isinstance(m, nn.LayerNorm):\n            nn.init.constant_(m.bias, 0)\n            nn.init.constant_(m.weight, 1.0)\n\n    def forward(self, x):\n        x = self.patch_embed(x)\n        if self.pos_embed is not None:\n            x = x + get_abs_pos(\n                self.pos_embed, self.pretrain_use_cls_token, (x.shape[1], x.shape[2])\n            )\n\n        for blk in self.blocks:\n            x = blk(x)\n\n        outputs = {self._out_features[0]: x.permute(0, 3, 1, 2)}\n        return outputs\n\n\nclass SimpleFeaturePyramid(Backbone):\n    \"\"\"\n    This module implements SimpleFeaturePyramid in :paper:`vitdet`.\n    It creates pyramid features built on top of the input feature map.\n    \"\"\"\n\n    def __init__(\n        self,\n        net,\n        in_feature,\n        out_channels,\n        scale_factors,\n        top_block=None,\n        norm=\"LN\",\n        square_pad=0,\n    ):\n        \"\"\"\n        Args:\n            net (Backbone): module representing the subnetwork backbone.\n                Must be a subclass of :class:`Backbone`.\n            in_feature (str): names of the input feature maps coming\n                from the net.\n            out_channels (int): number of channels in the output feature maps.\n            scale_factors (list[float]): list of scaling factors to upsample or downsample\n                the input features for creating pyramid features.\n            top_block (nn.Module or None): if provided, an extra operation will\n                be performed on the output of the last (smallest resolution)\n                pyramid output, and the result will extend the result list. The top_block\n                further downsamples the feature map. 
It must have an attribute\n                \"num_levels\", meaning the number of extra pyramid levels added by\n                this block, and \"in_feature\", which is a string representing\n                its input feature (e.g., p5).\n            norm (str): the normalization to use.\n            square_pad (int): If > 0, require input images to be padded to specific square size.\n        \"\"\"\n        super(SimpleFeaturePyramid, self).__init__()\n        assert isinstance(net, Backbone)\n\n        self.scale_factors = scale_factors\n\n        input_shapes = net.output_shape()\n        strides = [int(input_shapes[in_feature].stride / scale) for scale in scale_factors]\n        _assert_strides_are_log2_contiguous(strides)\n\n        dim = input_shapes[in_feature].channels\n        self.stages = []\n        use_bias = norm == \"\"\n        for idx, scale in enumerate(scale_factors):\n            out_dim = dim\n            if scale == 4.0:\n                layers = [\n                    nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2),\n                    get_norm(norm, dim // 2),\n                    nn.GELU(),\n                    nn.ConvTranspose2d(dim // 2, dim // 4, kernel_size=2, stride=2),\n                ]\n                out_dim = dim // 4\n            elif scale == 2.0:\n                layers = [nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2)]\n                out_dim = dim // 2\n            elif scale == 1.0:\n                layers = []\n            elif scale == 0.5:\n                layers = [nn.MaxPool2d(kernel_size=2, stride=2)]\n            else:\n                raise NotImplementedError(f\"scale_factor={scale} is not supported yet.\")\n\n            layers.extend(\n                [\n                    Conv2d(\n                        out_dim,\n                        out_channels,\n                        kernel_size=1,\n                        bias=use_bias,\n                        norm=get_norm(norm, out_channels),\n                    ),\n                    Conv2d(\n                        out_channels,\n                        out_channels,\n                        kernel_size=3,\n                        padding=1,\n                        bias=use_bias,\n                        norm=get_norm(norm, out_channels),\n                    ),\n                ]\n            )\n            layers = nn.Sequential(*layers)\n\n            stage = int(math.log2(strides[idx]))\n            self.add_module(f\"simfp_{stage}\", layers)\n            self.stages.append(layers)\n\n        self.net = net\n        self.in_feature = in_feature\n        self.top_block = top_block\n        # Return feature names are \"p<stage>\", like [\"p2\", \"p3\", ..., \"p6\"]\n        self._out_feature_strides = {\"p{}\".format(int(math.log2(s))): s for s in strides}\n        # top block output feature maps.\n        if self.top_block is not None:\n            for s in range(stage, stage + self.top_block.num_levels):\n                self._out_feature_strides[\"p{}\".format(s + 1)] = 2 ** (s + 1)\n\n        self._out_features = list(self._out_feature_strides.keys())\n        self._out_feature_channels = {k: out_channels for k in self._out_features}\n        self._size_divisibility = strides[-1]\n        self._square_pad = square_pad\n\n    @property\n    def padding_constraints(self):\n        return {\n            \"size_divisiblity\": self._size_divisibility,\n            \"square_size\": self._square_pad,\n        }\n\n    def forward(self, x):\n        \"\"\"\n     
   Args:\n            x: Tensor of shape (N,C,H,W). H, W must be a multiple of ``self.size_divisibility``.\n\n        Returns:\n            dict[str->Tensor]:\n                mapping from feature map name to pyramid feature map tensor\n                in high to low resolution order. Returned feature names follow the FPN\n                convention: \"p<stage>\", where stage has stride = 2 ** stage e.g.,\n                [\"p2\", \"p3\", ..., \"p6\"].\n        \"\"\"\n        bottom_up_features = self.net(x)\n        features = bottom_up_features[self.in_feature]\n        results = []\n\n        for stage in self.stages:\n            results.append(stage(features))\n\n        if self.top_block is not None:\n            if self.top_block.in_feature in bottom_up_features:\n                top_block_in_feature = bottom_up_features[self.top_block.in_feature]\n            else:\n                top_block_in_feature = results[self._out_features.index(self.top_block.in_feature)]\n            results.extend(self.top_block(top_block_in_feature))\n        assert len(self._out_features) == len(results)\n        return {f: res for f, res in zip(self._out_features, results)}\n\n\ndef get_vit_lr_decay_rate(name, lr_decay_rate=1.0, num_layers=12):\n    \"\"\"\n    Calculate lr decay rate for different ViT blocks.\n    Args:\n        name (string): parameter name.\n        lr_decay_rate (float): base lr decay rate.\n        num_layers (int): number of ViT blocks.\n\n    Returns:\n        lr decay rate for the given parameter.\n    \"\"\"\n    layer_id = num_layers + 1\n    if name.startswith(\"backbone\"):\n        if \".pos_embed\" in name or \".patch_embed\" in name:\n            layer_id = 0\n        elif \".blocks.\" in name and \".residual.\" not in name:\n            layer_id = int(name[name.find(\".blocks.\") :].split(\".\")[2]) + 1\n\n    return lr_decay_rate ** (num_layers + 1 - layer_id)"
  },
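`eva02-dino.py` (and `eva02.py` below) pair `EVA02_ViT` with `SimpleFeaturePyramid` in the ViTDet fashion: the plain ViT emits a single `last_feat` map at the patch stride, and the pyramid re-scales it into p2..p6. The following is a minimal sketch of that wiring; the import path `glee.backbone.eva02` is an assumption, `xattn=False` is used so the example does not require xformers, and the model size, image size, and window/global block split are illustrative rather than GLEE's actual training configuration.
```python
import torch
from detectron2.modeling.backbone.fpn import LastLevelMaxPool
# Assumed import path; eva02.py defines the same EVA02_ViT / SimpleFeaturePyramid classes.
from glee.backbone.eva02 import EVA02_ViT, SimpleFeaturePyramid

vit = EVA02_ViT(
    img_size=512, patch_size=16, embed_dim=768, depth=12, num_heads=12,
    drop_path_rate=0.1,
    window_size=16,
    window_block_indexes=[0, 1, 3, 4, 6, 7, 9, 10],  # global attention at blocks 2, 5, 8, 11
    xattn=False,                                      # plain attention; no xformers needed
    out_feature="last_feat",
)

backbone = SimpleFeaturePyramid(
    net=vit,
    in_feature="last_feat",
    out_channels=256,
    scale_factors=(4.0, 2.0, 1.0, 0.5),   # strides 4, 8, 16, 32 from the stride-16 ViT map
    top_block=LastLevelMaxPool(),          # adds p6 on top of p5
    norm="LN",
    square_pad=512,
)

feats = backbone(torch.randn(1, 3, 512, 512))  # H, W multiples of size_divisibility (32)
print({k: tuple(v.shape) for k, v in feats.items()})  # p2..p6, each with 256 channels
```
The rotary embeddings (`rope_win` / `rope_glb`) are chosen per block inside `EVA02_ViT`, so the only window-related knob the caller sets is which block indexes use windowed attention.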
  {
    "path": "thirdparty/GLEE/glee/backbone/eva02.py",
    "content": "# --------------------------------------------------------\n# EVA02\n# --------------------------------------------------------\nimport logging\nimport math\nfrom functools import partial\n\nimport fvcore.nn.weight_init as weight_init\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom detectron2.layers import CNNBlockBase, Conv2d, get_norm\nfrom detectron2.modeling.backbone.fpn import _assert_strides_are_log2_contiguous\n\nfrom detectron2.modeling.backbone import Backbone\nfrom timm.models.layers import DropPath, Mlp, trunc_normal_\nfrom detectron2.modeling import BACKBONE_REGISTRY, Backbone, ShapeSpec\n\n\nfrom .eva_02_utils import (\n    PatchEmbed,\n    add_decomposed_rel_pos,\n    get_abs_pos,\n    window_partition,\n    window_unpartition,\n    VisionRotaryEmbeddingFast,\n)\nfrom detectron2.modeling.backbone.fpn import LastLevelMaxPool\n\n\ntry:\n    import xformers.ops as xops\n    HAS_XFORMER=True\nexcept:\n    HAS_XFORMER=False\n    pass\n\ntry:\n    from apex.normalization import FusedLayerNorm\nexcept:\n    pass\n\nlogger = logging.getLogger(__name__)\n\n\n\n__all__ = [\"EVA02_ViT\", \"SimpleFeaturePyramid\", \"get_vit_lr_decay_rate\"]\n\n\n\nclass SwiGLU(nn.Module):\n    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.SiLU, drop=0., \n                norm_layer=nn.LayerNorm, subln=False\n            ):\n        super().__init__()\n        out_features = out_features or in_features\n        hidden_features = hidden_features or in_features\n\n        self.w1 = nn.Linear(in_features, hidden_features)\n        self.w2 = nn.Linear(in_features, hidden_features)\n\n        self.act = act_layer()\n        self.ffn_ln = norm_layer(hidden_features) if subln else nn.Identity()\n        self.w3 = nn.Linear(hidden_features, out_features)\n        \n        self.drop = nn.Dropout(drop)\n\n    def forward(self, x):\n        x1 = self.w1(x)\n        x2 = self.w2(x)\n        hidden = self.act(x1) * x2\n        x = self.ffn_ln(hidden)\n        x = self.w3(x)\n        x = self.drop(x)\n        return x\n    \n\nclass Attention(nn.Module):\n    def __init__(\n            self, \n            dim, \n            num_heads=8, \n            qkv_bias=True, \n            qk_scale=None, \n            attn_head_dim=None, \n            rope=None,\n            xattn=True,\n        ):\n        super().__init__()\n        self.num_heads = num_heads\n        head_dim = dim // num_heads\n        if attn_head_dim is not None:\n            head_dim = attn_head_dim\n        all_head_dim = head_dim * self.num_heads\n        self.scale = qk_scale or head_dim ** -0.5\n\n        self.q_proj = nn.Linear(dim, all_head_dim, bias=False)\n        self.k_proj = nn.Linear(dim, all_head_dim, bias=False)\n        self.v_proj = nn.Linear(dim, all_head_dim, bias=False)\n\n        if qkv_bias:\n            self.q_bias = nn.Parameter(torch.zeros(all_head_dim))\n            self.v_bias = nn.Parameter(torch.zeros(all_head_dim))\n        else:\n            self.q_bias = None\n            self.v_bias = None\n\n        self.rope = rope\n        self.xattn = xattn\n        self.proj = nn.Linear(all_head_dim, dim)\n        if not HAS_XFORMER:\n            self.xattn = False\n\n    def forward(self, x):\n        B, H, W, C = x.shape\n        x = x.view(B, -1, C)\n        N = H * W\n\n        q = F.linear(input=x, weight=self.q_proj.weight, bias=self.q_bias)\n        k = F.linear(input=x, weight=self.k_proj.weight, bias=None)\n        v = F.linear(input=x, 
weight=self.v_proj.weight, bias=self.v_bias)\n\n        q = q.reshape(B, N, self.num_heads, -1).permute(0, 2, 1, 3)     # B, num_heads, N, C\n        k = k.reshape(B, N, self.num_heads, -1).permute(0, 2, 1, 3)  \n        v = v.reshape(B, N, self.num_heads, -1).permute(0, 2, 1, 3) \n\n        ## rope\n        q = self.rope(q).type_as(v)\n        k = self.rope(k).type_as(v)\n\n        if self.xattn:\n            q = q.permute(0, 2, 1, 3)   # B, num_heads, N, C -> B, N, num_heads, C\n            k = k.permute(0, 2, 1, 3)\n            v = v.permute(0, 2, 1, 3)\n            \n            x = xops.memory_efficient_attention(q, k, v)\n            x = x.reshape(B, N, -1)\n        else:\n            q = q * self.scale\n            attn = (q @ k.transpose(-2, -1))\n            attn = attn.softmax(dim=-1).type_as(x)\n            x = (attn @ v).transpose(1, 2).reshape(B, N, -1)\n\n        x = self.proj(x)\n        x = x.view(B, H, W, C)\n\n        return x\n\n\nclass ResBottleneckBlock(CNNBlockBase):\n    \"\"\"\n    The standard bottleneck residual block without the last activation layer.\n    It contains 3 conv layers with kernels 1x1, 3x3, 1x1.\n    \"\"\"\n\n    def __init__(\n        self,\n        in_channels,\n        out_channels,\n        bottleneck_channels,\n        norm=\"LN\",\n        act_layer=nn.GELU,\n    ):\n        \"\"\"\n        Args:\n            in_channels (int): Number of input channels.\n            out_channels (int): Number of output channels.\n            bottleneck_channels (int): number of output channels for the 3x3\n                \"bottleneck\" conv layers.\n            norm (str or callable): normalization for all conv layers.\n                See :func:`layers.get_norm` for supported format.\n            act_layer (callable): activation for all conv layers.\n        \"\"\"\n        super().__init__(in_channels, out_channels, 1)\n\n        self.conv1 = Conv2d(in_channels, bottleneck_channels, 1, bias=False)\n        self.norm1 = get_norm(norm, bottleneck_channels)\n        self.act1 = act_layer()\n\n        self.conv2 = Conv2d(\n            bottleneck_channels,\n            bottleneck_channels,\n            3,\n            padding=1,\n            bias=False,\n        )\n        self.norm2 = get_norm(norm, bottleneck_channels)\n        self.act2 = act_layer()\n\n        self.conv3 = Conv2d(bottleneck_channels, out_channels, 1, bias=False)\n        self.norm3 = get_norm(norm, out_channels)\n\n        for layer in [self.conv1, self.conv2, self.conv3]:\n            weight_init.c2_msra_fill(layer)\n        for layer in [self.norm1, self.norm2]:\n            layer.weight.data.fill_(1.0)\n            layer.bias.data.zero_()\n        # zero init last norm layer.\n        self.norm3.weight.data.zero_()\n        self.norm3.bias.data.zero_()\n\n    def forward(self, x):\n        out = x\n        for layer in self.children():\n            out = layer(out)\n\n        out = x + out\n        return out\n\n\nclass Block(nn.Module):\n    \"\"\"Transformer blocks with support of window attention and residual propagation blocks\"\"\"\n\n    def __init__(\n        self,\n        dim,\n        num_heads,\n        mlp_ratio=4*2/3,\n        qkv_bias=True,\n        drop_path=0.0,\n        norm_layer=partial(nn.LayerNorm, eps=1e-6), \n        window_size=0,\n        use_residual_block=False,\n        rope=None,\n        xattn=True,\n    ):\n        \"\"\"\n        Args:\n            dim (int): Number of input channels.\n            num_heads (int): Number of attention heads in each ViT 
block.\n            mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.\n            qkv_bias (bool): If True, add a learnable bias to query, key, value.\n            drop_path (float): Stochastic depth rate.\n            norm_layer (nn.Module): Normalization layer.\n            act_layer (nn.Module): Activation layer.\n            use_rel_pos (bool): If True, add relative positional embeddings to the attention map.\n            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.\n            window_size (int): Window size for window attention blocks. If it equals 0, then not\n                use window attention.\n            use_residual_block (bool): If True, use a residual block after the MLP block.\n            input_size (int or None): Input resolution for calculating the relative positional\n                parameter size.\n        \"\"\"\n        super().__init__()\n        self.norm1 = norm_layer(dim)\n        self.attn = Attention(\n            dim,\n            num_heads=num_heads,\n            qkv_bias=qkv_bias,\n            rope=rope,\n            xattn=xattn,\n        )\n\n        from timm.models.layers import DropPath\n\n        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()\n        self.norm2 = norm_layer(dim)\n        self.mlp = SwiGLU(\n                in_features=dim, \n                hidden_features=int(dim * mlp_ratio), \n                subln=True,\n                norm_layer=norm_layer,\n            )\n\n        self.window_size = window_size\n\n        self.use_residual_block = use_residual_block\n        if use_residual_block:\n            # Use a residual block with bottleneck channel as dim // 2\n            self.residual = ResBottleneckBlock(\n                in_channels=dim,\n                out_channels=dim,\n                bottleneck_channels=dim // 2,\n                norm=\"LN\",\n            )\n\n    def forward(self, x):\n        shortcut = x\n        x = self.norm1(x)\n\n        # Window partition\n        if self.window_size > 0:\n            H, W = x.shape[1], x.shape[2]\n            x, pad_hw = window_partition(x, self.window_size)\n\n        x = self.attn(x)\n\n        # Reverse window partition\n        if self.window_size > 0:\n            x = window_unpartition(x, self.window_size, pad_hw, (H, W))\n\n        x = shortcut + self.drop_path(x)\n        x = x + self.drop_path(self.mlp(self.norm2(x)))\n\n        if self.use_residual_block:\n            x = self.residual(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)\n\n        return x\n\n\nclass EVA02_ViT(Backbone):\n    \"\"\"\n    This module implements Vision Transformer (ViT) backbone in :paper:`vitdet`.\n    \"Exploring Plain Vision Transformer Backbones for Object Detection\",\n    https://arxiv.org/abs/2203.16527\n    \"\"\"\n\n    def __init__(\n        self,\n        img_size=1024,\n        patch_size=16,\n        in_chans=3,\n        embed_dim=768,\n        depth=12,\n        num_heads=12,\n        mlp_ratio=4*2/3,\n        qkv_bias=True,\n        drop_path_rate=0.0,\n        norm_layer=partial(nn.LayerNorm, eps=1e-6),\n        act_layer=nn.GELU,\n        use_abs_pos=True,\n        use_rel_pos=False,\n        rope=True,\n        pt_hw_seq_len=16,\n        intp_freq=True,\n        window_size=0,\n        window_block_indexes=(),\n        residual_block_indexes=(),\n        use_act_checkpoint=False,\n        pretrain_img_size=224,\n        pretrain_use_cls_token=True,\n        out_feature=\"last_feat\",\n        xattn=True,\n  
  ):\n        \"\"\"\n        Args:\n            img_size (int): Input image size.\n            patch_size (int): Patch size.\n            in_chans (int): Number of input image channels.\n            embed_dim (int): Patch embedding dimension.\n            depth (int): Depth of ViT.\n            num_heads (int): Number of attention heads in each ViT block.\n            mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.\n            qkv_bias (bool): If True, add a learnable bias to query, key, value.\n            drop_path_rate (float): Stochastic depth rate.\n            norm_layer (nn.Module): Normalization layer.\n            act_layer (nn.Module): Activation layer.\n            use_abs_pos (bool): If True, use absolute positional embeddings.\n            use_rel_pos (bool): If True, add relative positional embeddings to the attention map.\n            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.\n            window_size (int): Window size for window attention blocks.\n            window_block_indexes (list): Indexes for blocks using window attention.\n            residual_block_indexes (list): Indexes for blocks using conv propagation.\n            use_act_checkpoint (bool): If True, use activation checkpointing.\n            pretrain_img_size (int): input image size for pretraining models.\n            pretrain_use_cls_token (bool): If True, pretrainig models use class token.\n            out_feature (str): name of the feature from the last block.\n        \"\"\"\n        super().__init__()\n        self.pretrain_use_cls_token = pretrain_use_cls_token\n\n        self.patch_embed = PatchEmbed(\n            kernel_size=(patch_size, patch_size),\n            stride=(patch_size, patch_size),\n            in_chans=in_chans,\n            embed_dim=embed_dim,\n        )\n\n        if use_abs_pos:\n            # Initialize absolute positional embedding with pretrain image size.\n            num_patches = (pretrain_img_size // patch_size) * (pretrain_img_size // patch_size)\n            num_positions = (num_patches + 1) if pretrain_use_cls_token else num_patches\n            self.pos_embed = nn.Parameter(torch.zeros(1, num_positions, embed_dim))\n        else:\n            self.pos_embed = None\n\n\n        half_head_dim = embed_dim // num_heads // 2\n        hw_seq_len = img_size // patch_size\n\n        self.rope_win = VisionRotaryEmbeddingFast(\n            dim=half_head_dim,\n            pt_seq_len=pt_hw_seq_len,\n            ft_seq_len=window_size if intp_freq else None,\n        )\n        self.rope_glb = VisionRotaryEmbeddingFast(\n            dim=half_head_dim,\n            pt_seq_len=pt_hw_seq_len,\n            ft_seq_len=hw_seq_len if intp_freq else None,\n        )\n\n        # stochastic depth decay rule\n        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]\n\n        self.blocks = nn.ModuleList()\n        for i in range(depth):\n            block = Block(\n                dim=embed_dim,\n                num_heads=num_heads,\n                mlp_ratio=mlp_ratio,\n                qkv_bias=qkv_bias,\n                drop_path=dpr[i],\n                norm_layer=norm_layer,\n                window_size=window_size if i in window_block_indexes else 0,\n                use_residual_block=i in residual_block_indexes,\n                rope=self.rope_win if i in window_block_indexes else self.rope_glb,\n                xattn=xattn\n            )\n            if use_act_checkpoint:\n                # TODO: use 
torch.utils.checkpoint\n                from fairscale.nn.checkpoint import checkpoint_wrapper\n\n                block = checkpoint_wrapper(block)\n            self.blocks.append(block)\n\n        self._out_feature_channels = {out_feature: embed_dim}\n        self._out_feature_strides = {out_feature: patch_size}\n        self._out_features = [out_feature]\n\n        if self.pos_embed is not None:\n            nn.init.trunc_normal_(self.pos_embed, std=0.02)\n\n        self.apply(self._init_weights)\n\n    def _init_weights(self, m):\n        if isinstance(m, nn.Linear):\n            nn.init.trunc_normal_(m.weight, std=0.02)\n            if isinstance(m, nn.Linear) and m.bias is not None:\n                nn.init.constant_(m.bias, 0)\n        elif isinstance(m, nn.LayerNorm):\n            nn.init.constant_(m.bias, 0)\n            nn.init.constant_(m.weight, 1.0)\n\n    def forward(self, x):\n        x = self.patch_embed(x)\n        if self.pos_embed is not None:\n            x = x + get_abs_pos(\n                self.pos_embed, self.pretrain_use_cls_token, (x.shape[1], x.shape[2])\n            )\n\n        for blk in self.blocks:\n            x = blk(x)\n\n        outputs = {self._out_features[0]: x.permute(0, 3, 1, 2)}\n        return outputs\n\n\nclass SimpleFeaturePyramid(Backbone):\n    \"\"\"\n    This module implements SimpleFeaturePyramid in :paper:`vitdet`.\n    It creates pyramid features built on top of the input feature map.\n    \"\"\"\n\n    def __init__(\n        self,\n        net,\n        in_feature,\n        out_channels,\n        scale_factors,\n        top_block=None,\n        norm=\"LN\",\n        square_pad=0,\n    ):\n        \"\"\"\n        Args:\n            net (Backbone): module representing the subnetwork backbone.\n                Must be a subclass of :class:`Backbone`.\n            in_feature (str): names of the input feature maps coming\n                from the net.\n            out_channels (int): number of channels in the output feature maps.\n            scale_factors (list[float]): list of scaling factors to upsample or downsample\n                the input features for creating pyramid features.\n            top_block (nn.Module or None): if provided, an extra operation will\n                be performed on the output of the last (smallest resolution)\n                pyramid output, and the result will extend the result list. The top_block\n                further downsamples the feature map. 
It must have an attribute\n                \"num_levels\", meaning the number of extra pyramid levels added by\n                this block, and \"in_feature\", which is a string representing\n                its input feature (e.g., p5).\n            norm (str): the normalization to use.\n            square_pad (int): If > 0, require input images to be padded to specific square size.\n        \"\"\"\n        super(SimpleFeaturePyramid, self).__init__()\n        assert isinstance(net, Backbone)\n\n        self.scale_factors = scale_factors\n\n        input_shapes = net.output_shape()\n        strides = [int(input_shapes[in_feature].stride / scale) for scale in scale_factors]\n        _assert_strides_are_log2_contiguous(strides)\n\n        dim = input_shapes[in_feature].channels\n        self.stages = []\n        use_bias = norm == \"\"\n        for idx, scale in enumerate(scale_factors):\n            out_dim = dim\n            if scale == 4.0:\n                layers = [\n                    nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2),\n                    get_norm(norm, dim // 2),\n                    nn.GELU(),\n                    nn.ConvTranspose2d(dim // 2, dim // 4, kernel_size=2, stride=2),\n                ]\n                out_dim = dim // 4\n            elif scale == 2.0:\n                layers = [nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2)]\n                out_dim = dim // 2\n            elif scale == 1.0:\n                layers = []\n            elif scale == 0.5:\n                layers = [nn.MaxPool2d(kernel_size=2, stride=2)]\n            else:\n                raise NotImplementedError(f\"scale_factor={scale} is not supported yet.\")\n\n            layers.extend(\n                [\n                    Conv2d(\n                        out_dim,\n                        out_channels,\n                        kernel_size=1,\n                        bias=use_bias,\n                        norm=get_norm(norm, out_channels),\n                    ),\n                    Conv2d(\n                        out_channels,\n                        out_channels,\n                        kernel_size=3,\n                        padding=1,\n                        bias=use_bias,\n                        norm=get_norm(norm, out_channels),\n                    ),\n                ]\n            )\n            layers = nn.Sequential(*layers)\n\n            stage = int(math.log2(strides[idx]))\n            self.add_module(f\"simfp_{stage}\", layers)\n            self.stages.append(layers)\n\n        self.net = net\n        self.in_feature = in_feature\n        self.top_block = top_block\n        # Return feature names are \"p<stage>\", like [\"p2\", \"p3\", ..., \"p6\"]\n        self._out_feature_strides = {\"p{}\".format(int(math.log2(s))): s for s in strides}\n        # top block output feature maps.\n        if self.top_block is not None:\n            for s in range(stage, stage + self.top_block.num_levels):\n                self._out_feature_strides[\"p{}\".format(s + 1)] = 2 ** (s + 1)\n\n        self._out_features = list(self._out_feature_strides.keys())\n        self._out_feature_channels = {k: out_channels for k in self._out_features}\n        self._size_divisibility = strides[-1]\n        self._square_pad = square_pad\n\n    @property\n    def padding_constraints(self):\n        return {\n            \"size_divisiblity\": self._size_divisibility,\n            \"square_size\": self._square_pad,\n        }\n\n    def forward(self, x):\n        \"\"\"\n     
   Args:\n            x: Tensor of shape (N,C,H,W). H, W must be a multiple of ``self.size_divisibility``.\n\n        Returns:\n            dict[str->Tensor]:\n                mapping from feature map name to pyramid feature map tensor\n                in high to low resolution order. Returned feature names follow the FPN\n                convention: \"p<stage>\", where stage has stride = 2 ** stage e.g.,\n                [\"p2\", \"p3\", ..., \"p6\"].\n        \"\"\"\n        bottom_up_features = self.net(x)\n        features = bottom_up_features[self.in_feature]\n        results = []\n\n        for stage in self.stages:\n            results.append(stage(features))\n\n        if self.top_block is not None:\n            if self.top_block.in_feature in bottom_up_features:\n                top_block_in_feature = bottom_up_features[self.top_block.in_feature]\n            else:\n                top_block_in_feature = results[self._out_features.index(self.top_block.in_feature)]\n            results.extend(self.top_block(top_block_in_feature))\n        assert len(self._out_features) == len(results)\n        return {f: res for f, res in zip(self._out_features, results)}\n\n\n\n@BACKBONE_REGISTRY.register()\nclass D2_EVA02(SimpleFeaturePyramid):\n    def __init__(self, cfg, input_shape):\n\n        super().__init__(\n            \n            net = EVA02_ViT(\n                img_size= cfg.MODEL.EVA02.IMAGE_SIZE,\n                patch_size=cfg.MODEL.EVA02.PATCH_SIZE,\n                window_size= cfg.MODEL.EVA02.WINDOW_SIZE,\n                embed_dim= cfg.MODEL.EVA02.DMBED_DIM,\n                depth= cfg.MODEL.EVA02.DEPTH,\n                num_heads= cfg.MODEL.EVA02.NUM_HEADS ,\n                drop_path_rate= cfg.MODEL.EVA02.DROP_PATH_RATE,\n                mlp_ratio= cfg.MODEL.EVA02.MLP_RATIO,\n                # qkv_bias=True,\n                norm_layer=partial(nn.LayerNorm, eps=1e-6),\n                window_block_indexes= cfg.MODEL.EVA02.WINDOW_BLOCK_INDEXES,\n                # residual_block_indexes=[],\n                # use_rel_pos=False,\n                use_act_checkpoint = cfg.MODEL.EVA02.CHECKPOINT,\n                out_feature=\"last_feat\",\n                # intp_freq=True,\n            ),\n            in_feature = \"last_feat\",\n            out_channels=256,\n            scale_factors=(2.0, 1.0, 0.5),  # (4.0, 2.0, 1.0, 0.5) in ViTDet\n            top_block=LastLevelMaxPool(),\n            norm=\"LN\",\n            square_pad=cfg.MODEL.EVA02.IMAGE_SIZE,\n\n        )\n\n        pretrained_weight = cfg.MODEL.EVA02.PRETRAINED_WEIGHT \n        if pretrained_weight:\n            checkpoint = torch.load(pretrained_weight, map_location='cpu')\n            print(f'\\nload pretrain weight from {pretrained_weight} \\n') \n\n            self.load_state_dict(checkpoint['model'], strict=False)\n            \n\n \n    def output_shape(self):\n        return {\n            name: ShapeSpec(\n                channels=self._out_feature_channels[name], stride=self._out_feature_strides[name]\n            )\n            for name in self._out_features\n        }\n\n    @property\n    def size_divisibility(self):\n        return 32\n\n\n"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/eva_01_utils.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved\nimport math\nimport numpy as np\nfrom scipy import interpolate\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n__all__ = [\n    \"window_partition\",\n    \"window_unpartition\",\n    \"add_decomposed_rel_pos\",\n    \"get_abs_pos\",\n    \"PatchEmbed\",\n]\n\n\ndef window_partition(x, window_size):\n    \"\"\"\n    Partition into non-overlapping windows with padding if needed.\n    Args:\n        x (tensor): input tokens with [B, H, W, C].\n        window_size (int): window size.\n\n    Returns:\n        windows: windows after partition with [B * num_windows, window_size, window_size, C].\n        (Hp, Wp): padded height and width before partition\n    \"\"\"\n    B, H, W, C = x.shape\n\n    pad_h = (window_size - H % window_size) % window_size\n    pad_w = (window_size - W % window_size) % window_size\n    if pad_h > 0 or pad_w > 0:\n        x = F.pad(x, (0, 0, 0, pad_w, 0, pad_h))\n    Hp, Wp = H + pad_h, W + pad_w\n\n    x = x.view(B, Hp // window_size, window_size, Wp // window_size, window_size, C)\n    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)\n    return windows, (Hp, Wp)\n\n\ndef window_unpartition(windows, window_size, pad_hw, hw):\n    \"\"\"\n    Window unpartition into original sequences and removing padding.\n    Args:\n        x (tensor): input tokens with [B * num_windows, window_size, window_size, C].\n        window_size (int): window size.\n        pad_hw (Tuple): padded height and width (Hp, Wp).\n        hw (Tuple): original height and width (H, W) before padding.\n\n    Returns:\n        x: unpartitioned sequences with [B, H, W, C].\n    \"\"\"\n    Hp, Wp = pad_hw\n    H, W = hw\n    B = windows.shape[0] // (Hp * Wp // window_size // window_size)\n    x = windows.view(B, Hp // window_size, Wp // window_size, window_size, window_size, -1)\n    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, Hp, Wp, -1)\n\n    if Hp > H or Wp > W:\n        x = x[:, :H, :W, :].contiguous()\n    return x\n\n\ndef get_rel_pos(q_size, k_size, rel_pos, interp_type):\n    \"\"\"\n    Get relative positional embeddings according to the relative positions of\n        query and key sizes.\n    Args:\n        q_size (int): size of query q.\n        k_size (int): size of key k.\n        rel_pos (Tensor): relative position embeddings (L, C).\n\n    Returns:\n        Extracted positional embeddings according to relative positions.\n    \"\"\"\n    max_rel_dist = int(2 * max(q_size, k_size) - 1)\n    # Interpolate rel pos if needed.\n    if rel_pos.shape[0] != max_rel_dist:\n        if interp_type == \"vitdet\":\n            # the vitdet impl: \n            # https://github.com/facebookresearch/detectron2/blob/96c752ce821a3340e27edd51c28a00665dd32a30/detectron2/modeling/backbone/utils.py#L77.\n\n            rel_pos_resized = F.interpolate(\n                rel_pos.reshape(1, rel_pos.shape[0], -1).permute(0, 2, 1),\n                size=max_rel_dist,\n                mode=\"linear\",\n            )\n            rel_pos_resized = rel_pos_resized.reshape(-1, max_rel_dist).permute(1, 0)\n        elif interp_type == \"beit\":\n            # steal from beit https://github.com/microsoft/unilm/tree/master/beit\n            # modified by Yuxin Fang\n\n            src_size = rel_pos.shape[0]\n            dst_size = max_rel_dist\n\n            q = 1.0903078\n            dis = []\n\n            cur = 1\n            for i in range(src_size // 2):\n   
             dis.append(cur)\n                cur += q ** (i + 1)\n\n            r_ids = [-_ for _ in reversed(dis)]\n            x = r_ids + [0] + dis\n            t = dst_size // 2.0\n            dx = np.arange(-t, t + 0.1, 1.0)\n\n            all_rel_pos_bias = []\n            for i in range(rel_pos.shape[1]):\n                # a hack from https://github.com/baaivision/EVA/issues/8,\n                # could also be used in fine-tuning but the performance haven't been tested.\n                z = rel_pos[:, i].view(src_size).cpu().float().detach().numpy()\n                f = interpolate.interp1d(x, z, kind='cubic', fill_value=\"extrapolate\")\n                all_rel_pos_bias.append(\n                    torch.Tensor(f(dx)).contiguous().view(-1, 1).to(rel_pos.device))\n            rel_pos_resized = torch.cat(all_rel_pos_bias, dim=-1)\n        else:\n            raise NotImplementedError()\n    else:\n        rel_pos_resized = rel_pos\n\n    # Scale the coords with short length if shapes for q and k are different.\n    q_coords = torch.arange(q_size)[:, None] * max(k_size / q_size, 1.0)\n    k_coords = torch.arange(k_size)[None, :] * max(q_size / k_size, 1.0)\n    relative_coords = (q_coords - k_coords) + (k_size - 1) * max(q_size / k_size, 1.0)\n\n    return rel_pos_resized[relative_coords.long()]\n\n\ndef add_decomposed_rel_pos(attn, q, rel_pos_h, rel_pos_w, q_size, k_size, interp_type):\n    \"\"\"\n    Calculate decomposed Relative Positional Embeddings from :paper:`mvitv2`.\n    https://github.com/facebookresearch/mvit/blob/19786631e330df9f3622e5402b4a419a263a2c80/mvit/models/attention.py   # noqa B950\n    Args:\n        attn (Tensor): attention map.\n        q (Tensor): query q in the attention layer with shape (B, q_h * q_w, C).\n        rel_pos_h (Tensor): relative position embeddings (Lh, C) for height axis.\n        rel_pos_w (Tensor): relative position embeddings (Lw, C) for width axis.\n        q_size (Tuple): spatial sequence size of query q with (q_h, q_w).\n        k_size (Tuple): spatial sequence size of key k with (k_h, k_w).\n\n    Returns:\n        attn (Tensor): attention map with added relative positional embeddings.\n    \"\"\"\n    q_h, q_w = q_size\n    k_h, k_w = k_size\n    Rh = get_rel_pos(q_h, k_h, rel_pos_h, interp_type)\n    Rw = get_rel_pos(q_w, k_w, rel_pos_w, interp_type)\n\n    B, _, dim = q.shape\n    r_q = q.reshape(B, q_h, q_w, dim)\n    rel_h = torch.einsum(\"bhwc,hkc->bhwk\", r_q, Rh)\n    rel_w = torch.einsum(\"bhwc,wkc->bhwk\", r_q, Rw)\n\n    attn = (\n        attn.view(B, q_h, q_w, k_h, k_w) + rel_h[:, :, :, :, None] + rel_w[:, :, :, None, :]\n    ).view(B, q_h * q_w, k_h * k_w)\n\n    return attn\n\n\ndef get_abs_pos(abs_pos, has_cls_token, hw):\n    \"\"\"\n    Calculate absolute positional embeddings. 
If needed, resize embeddings and remove cls_token\n        dimension for the original embeddings.\n    Args:\n        abs_pos (Tensor): absolute positional embeddings with (1, num_position, C).\n        has_cls_token (bool): If true, has 1 embedding in abs_pos for cls token.\n        hw (Tuple): size of input image tokens.\n\n    Returns:\n        Absolute positional embeddings after processing with shape (1, H, W, C)\n    \"\"\"\n    h, w = hw\n    if has_cls_token:\n        abs_pos = abs_pos[:, 1:]\n    xy_num = abs_pos.shape[1]\n    size = int(math.sqrt(xy_num))\n    assert size * size == xy_num\n\n    if size != h or size != w:\n        new_abs_pos = F.interpolate(\n            abs_pos.reshape(1, size, size, -1).permute(0, 3, 1, 2),\n            size=(h, w),\n            mode=\"bicubic\",\n            align_corners=False,\n        )\n\n        return new_abs_pos.permute(0, 2, 3, 1)\n    else:\n        return abs_pos.reshape(1, h, w, -1)\n\n\nclass PatchEmbed(nn.Module):\n    \"\"\"\n    Image to Patch Embedding.\n    \"\"\"\n\n    def __init__(\n        self, kernel_size=(16, 16), stride=(16, 16), padding=(0, 0), in_chans=3, embed_dim=768\n    ):\n        \"\"\"\n        Args:\n            kernel_size (Tuple): kernel size of the projection layer.\n            stride (Tuple): stride of the projection layer.\n            padding (Tuple): padding size of the projection layer.\n            in_chans (int): Number of input image channels.\n            embed_dim (int): Patch embedding dimension.\n        \"\"\"\n        super().__init__()\n\n        self.proj = nn.Conv2d(\n            in_chans, embed_dim, kernel_size=kernel_size, stride=stride, padding=padding\n        )\n\n    def forward(self, x):\n        x = self.proj(x)\n        # B C H W -> B H W C\n        x = x.permute(0, 2, 3, 1)\n        return x"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/eva_02_utils.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved\nimport math\nimport numpy as np\nfrom scipy import interpolate\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n__all__ = [\n    \"window_partition\",\n    \"window_unpartition\",\n    \"add_decomposed_rel_pos\",\n    \"get_abs_pos\",\n    \"PatchEmbed\",\n    \"VisionRotaryEmbeddingFast\",\n]\n\n\ndef window_partition(x, window_size):\n    \"\"\"\n    Partition into non-overlapping windows with padding if needed.\n    Args:\n        x (tensor): input tokens with [B, H, W, C].\n        window_size (int): window size.\n\n    Returns:\n        windows: windows after partition with [B * num_windows, window_size, window_size, C].\n        (Hp, Wp): padded height and width before partition\n    \"\"\"\n    B, H, W, C = x.shape\n\n    pad_h = (window_size - H % window_size) % window_size\n    pad_w = (window_size - W % window_size) % window_size\n    if pad_h > 0 or pad_w > 0:\n        x = F.pad(x, (0, 0, 0, pad_w, 0, pad_h))\n    Hp, Wp = H + pad_h, W + pad_w\n\n    x = x.view(B, Hp // window_size, window_size, Wp // window_size, window_size, C)\n    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)\n    return windows, (Hp, Wp)\n\n\ndef window_unpartition(windows, window_size, pad_hw, hw):\n    \"\"\"\n    Window unpartition into original sequences and removing padding.\n    Args:\n        x (tensor): input tokens with [B * num_windows, window_size, window_size, C].\n        window_size (int): window size.\n        pad_hw (Tuple): padded height and width (Hp, Wp).\n        hw (Tuple): original height and width (H, W) before padding.\n\n    Returns:\n        x: unpartitioned sequences with [B, H, W, C].\n    \"\"\"\n    Hp, Wp = pad_hw\n    H, W = hw\n    B = windows.shape[0] // (Hp * Wp // window_size // window_size)\n    x = windows.view(B, Hp // window_size, Wp // window_size, window_size, window_size, -1)\n    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, Hp, Wp, -1)\n\n    if Hp > H or Wp > W:\n        x = x[:, :H, :W, :].contiguous()\n    return x\n\n\ndef get_rel_pos(q_size, k_size, rel_pos):\n    \"\"\"\n    Get relative positional embeddings according to the relative positions of\n        query and key sizes.\n    Args:\n        q_size (int): size of query q.\n        k_size (int): size of key k.\n        rel_pos (Tensor): relative position embeddings (L, C).\n\n    Returns:\n        Extracted positional embeddings according to relative positions.\n    \"\"\"\n    max_rel_dist = int(2 * max(q_size, k_size) - 1)\n    use_log_interpolation = True\n\n    # Interpolate rel pos if needed.\n    if rel_pos.shape[0] != max_rel_dist:\n        if not use_log_interpolation:\n            # Interpolate rel pos.\n            rel_pos_resized = F.interpolate(\n                rel_pos.reshape(1, rel_pos.shape[0], -1).permute(0, 2, 1),\n                size=max_rel_dist,\n                mode=\"linear\",\n            )\n            rel_pos_resized = rel_pos_resized.reshape(-1, max_rel_dist).permute(1, 0)\n        else:\n            src_size = rel_pos.shape[0]\n            dst_size = max_rel_dist\n\n            # q = 1.13492\n            q = 1.0903078\n            dis = []\n\n            cur = 1\n            for i in range(src_size // 2):\n                dis.append(cur)\n                cur += q ** (i + 1)\n\n            r_ids = [-_ for _ in reversed(dis)]\n            x = r_ids + [0] + dis\n            t = dst_size // 2.0\n            dx = 
np.arange(-t, t + 0.1, 1.0)\n            # print(\"x = %s\" % str(x))\n            # print(\"dx = %s\" % str(dx))\n            all_rel_pos_bias = []\n            for i in range(rel_pos.shape[1]):\n                z = rel_pos[:, i].view(src_size).cpu().float().numpy()\n                f = interpolate.interp1d(x, z, kind='cubic', fill_value=\"extrapolate\")\n                all_rel_pos_bias.append(\n                    torch.Tensor(f(dx)).contiguous().view(-1, 1).to(rel_pos.device))\n            rel_pos_resized = torch.cat(all_rel_pos_bias, dim=-1)\n    else:\n        rel_pos_resized = rel_pos\n\n    # Scale the coords with short length if shapes for q and k are different.\n    q_coords = torch.arange(q_size)[:, None] * max(k_size / q_size, 1.0)\n    k_coords = torch.arange(k_size)[None, :] * max(q_size / k_size, 1.0)\n    relative_coords = (q_coords - k_coords) + (k_size - 1) * max(q_size / k_size, 1.0)\n\n    return rel_pos_resized[relative_coords.long()]\n\n\ndef add_decomposed_rel_pos(attn, q, rel_pos_h, rel_pos_w, q_size, k_size):\n    \"\"\"\n    Calculate decomposed Relative Positional Embeddings from :paper:`mvitv2`.\n    https://github.com/facebookresearch/mvit/blob/19786631e330df9f3622e5402b4a419a263a2c80/mvit/models/attention.py   # noqa B950\n    Args:\n        attn (Tensor): attention map.\n        q (Tensor): query q in the attention layer with shape (B, q_h * q_w, C).\n        rel_pos_h (Tensor): relative position embeddings (Lh, C) for height axis.\n        rel_pos_w (Tensor): relative position embeddings (Lw, C) for width axis.\n        q_size (Tuple): spatial sequence size of query q with (q_h, q_w).\n        k_size (Tuple): spatial sequence size of key k with (k_h, k_w).\n\n    Returns:\n        attn (Tensor): attention map with added relative positional embeddings.\n    \"\"\"\n    q_h, q_w = q_size\n    k_h, k_w = k_size\n    Rh = get_rel_pos(q_h, k_h, rel_pos_h)\n    Rw = get_rel_pos(q_w, k_w, rel_pos_w)\n\n    B, _, dim = q.shape\n    r_q = q.reshape(B, q_h, q_w, dim)\n    rel_h = torch.einsum(\"bhwc,hkc->bhwk\", r_q, Rh)\n    rel_w = torch.einsum(\"bhwc,wkc->bhwk\", r_q, Rw)\n\n    attn = (\n        attn.view(B, q_h, q_w, k_h, k_w) + rel_h[:, :, :, :, None] + rel_w[:, :, :, None, :]\n    ).view(B, q_h * q_w, k_h * k_w)\n\n    return attn\n\n\ndef get_abs_pos(abs_pos, has_cls_token, hw):\n    \"\"\"\n    Calculate absolute positional embeddings. 
If needed, resize embeddings and remove cls_token\n        dimension for the original embeddings.\n    Args:\n        abs_pos (Tensor): absolute positional embeddings with (1, num_position, C).\n        has_cls_token (bool): If true, has 1 embedding in abs_pos for cls token.\n        hw (Tuple): size of input image tokens.\n\n    Returns:\n        Absolute positional embeddings after processing with shape (1, H, W, C)\n    \"\"\"\n    h, w = hw\n    if has_cls_token:\n        abs_pos = abs_pos[:, 1:]\n    xy_num = abs_pos.shape[1]\n    size = int(math.sqrt(xy_num))\n    assert size * size == xy_num\n\n    if size != h or size != w:\n        new_abs_pos = F.interpolate(\n            abs_pos.reshape(1, size, size, -1).permute(0, 3, 1, 2),\n            size=(h, w),\n            mode=\"bicubic\",\n            align_corners=False,\n        )\n\n        return new_abs_pos.permute(0, 2, 3, 1)\n    else:\n        return abs_pos.reshape(1, h, w, -1)\n\n\nclass PatchEmbed(nn.Module):\n    \"\"\"\n    Image to Patch Embedding.\n    \"\"\"\n\n    def __init__(\n        self, kernel_size=(16, 16), stride=(16, 16), padding=(0, 0), in_chans=3, embed_dim=768\n    ):\n        \"\"\"\n        Args:\n            kernel_size (Tuple): kernel size of the projection layer.\n            stride (Tuple): stride of the projection layer.\n            padding (Tuple): padding size of the projection layer.\n            in_chans (int): Number of input image channels.\n            embed_dim (int):  embed_dim (int): Patch embedding dimension.\n        \"\"\"\n        super().__init__()\n\n        self.proj = nn.Conv2d(\n            in_chans, embed_dim, kernel_size=kernel_size, stride=stride, padding=padding\n        )\n\n    def forward(self, x):\n        x = self.proj(x)\n        # B C H W -> B H W C\n        x = x.permute(0, 2, 3, 1)\n        return x\n    \n\n\n\nfrom math import pi\n\nimport torch\nfrom torch import nn\n\nfrom einops import rearrange, repeat\n\n\n\ndef broadcat(tensors, dim = -1):\n    num_tensors = len(tensors)\n    shape_lens = set(list(map(lambda t: len(t.shape), tensors)))\n    assert len(shape_lens) == 1, 'tensors must all have the same number of dimensions'\n    shape_len = list(shape_lens)[0]\n    dim = (dim + shape_len) if dim < 0 else dim\n    dims = list(zip(*map(lambda t: list(t.shape), tensors)))\n    expandable_dims = [(i, val) for i, val in enumerate(dims) if i != dim]\n    assert all([*map(lambda t: len(set(t[1])) <= 2, expandable_dims)]), 'invalid dimensions for broadcastable concatentation'\n    max_dims = list(map(lambda t: (t[0], max(t[1])), expandable_dims))\n    expanded_dims = list(map(lambda t: (t[0], (t[1],) * num_tensors), max_dims))\n    expanded_dims.insert(dim, (dim, dims[dim]))\n    expandable_shapes = list(zip(*map(lambda t: t[1], expanded_dims)))\n    tensors = list(map(lambda t: t[0].expand(*t[1]), zip(tensors, expandable_shapes)))\n    return torch.cat(tensors, dim = dim)\n\n\n\ndef rotate_half(x):\n    x = rearrange(x, '... (d r) -> ... d r', r = 2)\n    x1, x2 = x.unbind(dim = -1)\n    x = torch.stack((-x2, x1), dim = -1)\n    return rearrange(x, '... d r -> ... 
(d r)')\n\n\n\nclass VisionRotaryEmbedding(nn.Module):\n    def __init__(\n        self,\n        dim,\n        pt_seq_len,\n        ft_seq_len=None,\n        custom_freqs = None,\n        freqs_for = 'lang',\n        theta = 10000,\n        max_freq = 10,\n        num_freqs = 1,\n    ):\n        super().__init__()\n        if custom_freqs:\n            freqs = custom_freqs\n        elif freqs_for == 'lang':\n            freqs = 1. / (theta ** (torch.arange(0, dim, 2)[:(dim // 2)].float() / dim))\n        elif freqs_for == 'pixel':\n            freqs = torch.linspace(1., max_freq / 2, dim // 2) * pi\n        elif freqs_for == 'constant':\n            freqs = torch.ones(num_freqs).float()\n        else:\n            raise ValueError(f'unknown modality {freqs_for}')\n\n        if ft_seq_len is None: ft_seq_len = pt_seq_len\n        t = torch.arange(ft_seq_len) / ft_seq_len * pt_seq_len\n\n        freqs_h = torch.einsum('..., f -> ... f', t, freqs)\n        freqs_h = repeat(freqs_h, '... n -> ... (n r)', r = 2)\n\n        freqs_w = torch.einsum('..., f -> ... f', t, freqs)\n        freqs_w = repeat(freqs_w, '... n -> ... (n r)', r = 2)\n\n        freqs = broadcat((freqs_h[:, None, :], freqs_w[None, :, :]), dim = -1)\n\n        self.register_buffer(\"freqs_cos\", freqs.cos())\n        self.register_buffer(\"freqs_sin\", freqs.sin())\n\n        print('======== shape of rope freq', self.freqs_cos.shape, '========')\n\n    def forward(self, t, start_index = 0):\n        rot_dim = self.freqs_cos.shape[-1]\n        end_index = start_index + rot_dim\n        assert rot_dim <= t.shape[-1], f'feature dimension {t.shape[-1]} is not of sufficient size to rotate in all the positions {rot_dim}'\n        t_left, t, t_right = t[..., :start_index], t[..., start_index:end_index], t[..., end_index:]\n        t = (t * self.freqs_cos) + (rotate_half(t) * self.freqs_sin)\n        return torch.cat((t_left, t, t_right), dim = -1)\n\n\n\n\nclass VisionRotaryEmbeddingFast(nn.Module):\n    def __init__(\n        self,\n        dim,\n        pt_seq_len=16,\n        ft_seq_len=None,\n        custom_freqs = None,\n        freqs_for = 'lang',\n        theta = 10000,\n        max_freq = 10,\n        num_freqs = 1,\n    ):\n        super().__init__()\n        if custom_freqs:\n            freqs = custom_freqs\n        elif freqs_for == 'lang':\n            freqs = 1. / (theta ** (torch.arange(0, dim, 2)[:(dim // 2)].float() / dim))\n        elif freqs_for == 'pixel':\n            freqs = torch.linspace(1., max_freq / 2, dim // 2) * pi\n        elif freqs_for == 'constant':\n            freqs = torch.ones(num_freqs).float()\n        else:\n            raise ValueError(f'unknown modality {freqs_for}')\n\n        if ft_seq_len is None: ft_seq_len = pt_seq_len\n        t = torch.arange(ft_seq_len) / ft_seq_len * pt_seq_len\n\n        freqs = torch.einsum('..., f -> ... f', t, freqs)\n        freqs = repeat(freqs, '... n -> ... 
(n r)', r = 2)\n        freqs = broadcat((freqs[:, None, :], freqs[None, :, :]), dim = -1)\n\n        freqs_cos = freqs.cos().view(-1, freqs.shape[-1])\n        freqs_sin = freqs.sin().view(-1, freqs.shape[-1])\n\n        self.register_buffer(\"freqs_cos\", freqs_cos)\n        self.register_buffer(\"freqs_sin\", freqs_sin)\n\n        print('======== shape of rope freq', self.freqs_cos.shape, '========')\n\n    # def forward(self, t): return  t * self.freqs_cos + rotate_half(t) * self.freqs_sin\n    def forward(self, t): \n        if t.shape[2] != self.freqs_cos.shape[0]:\n            t_len = t.shape[2]\n            output = t * self.freqs_cos[:t_len] + rotate_half(t) * self.freqs_sin[:t_len]\n        else:\n            output = t * self.freqs_cos + rotate_half(t) * self.freqs_sin\n        return  output\n\n"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/internimage.py",
    "content": "# --------------------------------------------------------\n# InternImage\n# Copyright (c) 2022 OpenGVLab\n# Licensed under The MIT License [see LICENSE for details]\n# --------------------------------------------------------\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport torch.utils.checkpoint as checkpoint\nfrom timm.models.layers import trunc_normal_, DropPath\n\nfrom detectron2.utils.logger import setup_logger\nfrom detectron2.modeling.backbone import Backbone\n\n\nfrom detectron2.modeling import BACKBONE_REGISTRY, Backbone, ShapeSpec\nfrom .ops_dcnv3 import modules as opsm\n\n\n\nclass to_channels_first(nn.Module):\n\n    def __init__(self):\n        super().__init__()\n\n    def forward(self, x):\n        return x.permute(0, 3, 1, 2)\n\n\nclass to_channels_last(nn.Module):\n\n    def __init__(self):\n        super().__init__()\n\n    def forward(self, x):\n        return x.permute(0, 2, 3, 1)\n\n\ndef build_norm_layer(dim,\n                     norm_layer,\n                     in_format='channels_last',\n                     out_format='channels_last',\n                     eps=1e-6):\n    layers = []\n    if norm_layer == 'BN':\n        if in_format == 'channels_last':\n            layers.append(to_channels_first())\n        layers.append(nn.BatchNorm2d(dim))\n        if out_format == 'channels_last':\n            layers.append(to_channels_last())\n    elif norm_layer == 'LN':\n        if in_format == 'channels_first':\n            layers.append(to_channels_last())\n        layers.append(nn.LayerNorm(dim, eps=eps))\n        if out_format == 'channels_first':\n            layers.append(to_channels_first())\n    else:\n        raise NotImplementedError(\n            f'build_norm_layer does not support {norm_layer}')\n    return nn.Sequential(*layers)\n\n\ndef build_act_layer(act_layer):\n    if act_layer == 'ReLU':\n        return nn.ReLU(inplace=True)\n    elif act_layer == 'SiLU':\n        return nn.SiLU(inplace=True)\n    elif act_layer == 'GELU':\n        return nn.GELU()\n\n    raise NotImplementedError(f'build_act_layer does not support {act_layer}')\n\n\nclass CrossAttention(nn.Module):\n    r\"\"\" Cross Attention Module\n    Args:\n        dim (int): Number of input channels.\n        num_heads (int): Number of attention heads. Default: 8\n        qkv_bias (bool, optional):  If True, add a learnable bias to q, k, v.\n            Default: False.\n        qk_scale (float | None, optional): Override default qk scale of\n            head_dim ** -0.5 if set. Default: None.\n        attn_drop (float, optional): Dropout ratio of attention weight.\n            Default: 0.0\n        proj_drop (float, optional): Dropout ratio of output. 
Default: 0.0\n        attn_head_dim (int, optional): Dimension of attention head.\n        out_dim (int, optional): Dimension of output.\n    \"\"\"\n    \n    def __init__(self,\n                 dim,\n                 num_heads=8,\n                 qkv_bias=False,\n                 qk_scale=None,\n                 attn_drop=0.,\n                 proj_drop=0.,\n                 attn_head_dim=None,\n                 out_dim=None):\n        super().__init__()\n        if out_dim is None:\n            out_dim = dim\n        self.num_heads = num_heads\n        head_dim = dim // num_heads\n        if attn_head_dim is not None:\n            head_dim = attn_head_dim\n        all_head_dim = head_dim * self.num_heads\n        self.scale = qk_scale or head_dim ** -0.5\n        assert all_head_dim == dim\n\n        self.q = nn.Linear(dim, all_head_dim, bias=False)\n        self.k = nn.Linear(dim, all_head_dim, bias=False)\n        self.v = nn.Linear(dim, all_head_dim, bias=False)\n\n        if qkv_bias:\n            self.q_bias = nn.Parameter(torch.zeros(all_head_dim))\n            self.k_bias = nn.Parameter(torch.zeros(all_head_dim))\n            self.v_bias = nn.Parameter(torch.zeros(all_head_dim))\n        else:\n            self.q_bias = None\n            self.k_bias = None\n            self.v_bias = None\n\n        self.attn_drop = nn.Dropout(attn_drop)\n        self.proj = nn.Linear(all_head_dim, out_dim)\n        self.proj_drop = nn.Dropout(proj_drop)\n\n    def forward(self, x, k=None, v=None):\n        B, N, C = x.shape\n        N_k = k.shape[1]\n        N_v = v.shape[1]\n\n        q_bias, k_bias, v_bias = None, None, None\n        if self.q_bias is not None:\n            q_bias = self.q_bias\n            k_bias = self.k_bias\n            v_bias = self.v_bias\n\n        q = F.linear(input=x, weight=self.q.weight, bias=q_bias)\n        q = q.reshape(B, N, 1, self.num_heads,\n                      -1).permute(2, 0, 3, 1,\n                                  4).squeeze(0)  # (B, N_head, N_q, dim)\n\n        k = F.linear(input=k, weight=self.k.weight, bias=k_bias)\n        k = k.reshape(B, N_k, 1, self.num_heads, -1).permute(2, 0, 3, 1,\n                                                             4).squeeze(0)\n\n        v = F.linear(input=v, weight=self.v.weight, bias=v_bias)\n        v = v.reshape(B, N_v, 1, self.num_heads, -1).permute(2, 0, 3, 1,\n                                                             4).squeeze(0)\n\n        q = q * self.scale\n        attn = (q @ k.transpose(-2, -1))  # (B, N_head, N_q, N_k)\n\n        attn = attn.softmax(dim=-1)\n        attn = self.attn_drop(attn)\n\n        x = (attn @ v).transpose(1, 2).reshape(B, N, -1)\n        x = self.proj(x)\n        x = self.proj_drop(x)\n\n        return x\n\n\nclass AttentiveBlock(nn.Module):\n    r\"\"\"Attentive Block\n    Args:\n        dim (int): Number of input channels.\n        num_heads (int): Number of attention heads. Default: 8\n        qkv_bias (bool, optional):  If True, add a learnable bias to q, k, v.\n            Default: False.\n        qk_scale (float | None, optional): Override default qk scale of\n            head_dim ** -0.5 if set. Default: None.\n        drop (float, optional): Dropout rate. Default: 0.0.\n        attn_drop (float, optional): Attention dropout rate. Default: 0.0.\n        drop_path (float | tuple[float], optional): Stochastic depth rate.\n            Default: 0.0.\n        norm_layer (nn.Module, optional): Normalization layer.  
Default: nn.LayerNorm.\n        attn_head_dim (int, optional): Dimension of attention head. Default: None.\n        out_dim (int, optional): Dimension of output. Default: None.\n    \"\"\"\n    \n    def __init__(self,\n                 dim,\n                 num_heads,\n                 qkv_bias=False,\n                 qk_scale=None,\n                 drop=0.,\n                 attn_drop=0.,\n                 drop_path=0.,\n                 norm_layer=\"LN\",\n                 attn_head_dim=None,\n                 out_dim=None):\n        super().__init__()\n\n        self.norm1_q = build_norm_layer(dim, norm_layer, eps=1e-6)\n        self.norm1_k = build_norm_layer(dim, norm_layer, eps=1e-6)\n        self.norm1_v = build_norm_layer(dim, norm_layer, eps=1e-6)\n        self.cross_dcn = CrossAttention(dim,\n                                        num_heads=num_heads,\n                                        qkv_bias=qkv_bias,\n                                        qk_scale=qk_scale,\n                                        attn_drop=attn_drop,\n                                        proj_drop=drop,\n                                        attn_head_dim=attn_head_dim,\n                                        out_dim=out_dim)\n\n        self.drop_path = DropPath(\n            drop_path) if drop_path > 0. else nn.Identity()\n\n    def forward(self,\n                x_q,\n                x_kv,\n                pos_q,\n                pos_k,\n                bool_masked_pos,\n                rel_pos_bias=None):\n        x_q = self.norm1_q(x_q + pos_q)\n        x_k = self.norm1_k(x_kv + pos_k)\n        x_v = self.norm1_v(x_kv)\n\n        x = self.cross_dcn(x_q, k=x_k, v=x_v)\n\n        return x\n\n\nclass AttentionPoolingBlock(AttentiveBlock):\n\n    def forward(self, x):\n        x_q = x.mean(1, keepdim=True)\n        x_kv = x\n        pos_q, pos_k = 0, 0\n        x = super().forward(x_q, x_kv, pos_q, pos_k,\n                            bool_masked_pos=None,\n                            rel_pos_bias=None)\n        x = x.squeeze(1)\n        return x\n\n\nclass StemLayer(nn.Module):\n    r\"\"\" Stem layer of InternImage\n    Args:\n        in_chans (int): number of input channels\n        out_chans (int): number of output channels\n        act_layer (str): activation layer\n        norm_layer (str): normalization layer\n    \"\"\"\n\n    def __init__(self,\n                 in_chans=3,\n                 out_chans=96,\n                 act_layer='GELU',\n                 norm_layer='BN'):\n        super().__init__()\n        self.conv1 = nn.Conv2d(in_chans,\n                               out_chans // 2,\n                               kernel_size=3,\n                               stride=2,\n                               padding=1)\n        self.norm1 = build_norm_layer(out_chans // 2, norm_layer,\n                                      'channels_first', 'channels_first')\n        self.act = build_act_layer(act_layer)\n        self.conv2 = nn.Conv2d(out_chans // 2,\n                               out_chans,\n                               kernel_size=3,\n                               stride=2,\n                               padding=1)\n        self.norm2 = build_norm_layer(out_chans, norm_layer, 'channels_first',\n                                      'channels_last')\n\n    def forward(self, x):\n        x = self.conv1(x)\n        x = self.norm1(x)\n        x = self.act(x)\n        x = self.conv2(x)\n        x = self.norm2(x)\n        return x\n\n\nclass DownsampleLayer(nn.Module):\n    
r\"\"\" Downsample layer of InternImage\n    Args:\n        channels (int): number of input channels\n        norm_layer (str): normalization layer\n    \"\"\"\n\n    def __init__(self, channels, norm_layer='LN'):\n        super().__init__()\n        self.conv = nn.Conv2d(channels,\n                              2 * channels,\n                              kernel_size=3,\n                              stride=2,\n                              padding=1,\n                              bias=False)\n        self.norm = build_norm_layer(2 * channels, norm_layer,\n                                     'channels_first', 'channels_last')\n\n    def forward(self, x):\n        x = self.conv(x.permute(0, 3, 1, 2))\n        x = self.norm(x)\n        return x\n\n\nclass MLPLayer(nn.Module):\n    r\"\"\" MLP layer of InternImage\n    Args:\n        in_features (int): number of input features\n        hidden_features (int): number of hidden features\n        out_features (int): number of output features\n        act_layer (str): activation layer\n        drop (float): dropout rate\n    \"\"\"\n\n    def __init__(self,\n                 in_features,\n                 hidden_features=None,\n                 out_features=None,\n                 act_layer='GELU',\n                 drop=0.):\n        super().__init__()\n        out_features = out_features or in_features\n        hidden_features = hidden_features or in_features\n        self.fc1 = nn.Linear(in_features, hidden_features)\n        self.act = build_act_layer(act_layer)\n        self.fc2 = nn.Linear(hidden_features, out_features)\n        self.drop = nn.Dropout(drop)\n\n    def forward(self, x):\n        x = self.fc1(x)\n        x = self.act(x)\n        x = self.drop(x)\n        x = self.fc2(x)\n        x = self.drop(x)\n        return x\n\n\nclass InternImageLayer(nn.Module):\n    r\"\"\" Basic layer of InternImage\n    Args:\n        core_op (nn.Module): core operation of InternImage\n        channels (int): number of input channels\n        groups (list): Groups of each block.\n        mlp_ratio (float): ratio of mlp hidden features to input channels\n        drop (float): dropout rate\n        drop_path (float): drop path rate\n        act_layer (str): activation layer\n        norm_layer (str): normalization layer\n        post_norm (bool): whether to use post normalization\n        layer_scale (float): layer scale\n        offset_scale (float): offset scale\n        with_cp (bool): whether to use checkpoint\n    \"\"\"\n\n    def __init__(self,\n                 core_op,\n                 channels,\n                 groups,\n                 mlp_ratio=4.,\n                 drop=0.,\n                 drop_path=0.,\n                 act_layer='GELU',\n                 norm_layer='LN',\n                 post_norm=False,\n                 layer_scale=None,\n                 offset_scale=1.0,\n                 with_cp=False,\n                 dw_kernel_size=None, # for InternImage-H/G\n                 res_post_norm=False, # for InternImage-H/G\n                 center_feature_scale=False): # for InternImage-H/G\n        super().__init__()\n        self.channels = channels\n        self.groups = groups\n        self.mlp_ratio = mlp_ratio\n        self.with_cp = with_cp\n\n        self.norm1 = build_norm_layer(channels, 'LN')\n        self.post_norm = post_norm\n        self.dcn = core_op(\n            channels=channels,\n            kernel_size=3,\n            stride=1,\n            pad=1,\n            dilation=1,\n            group=groups,\n      
      offset_scale=offset_scale,\n            act_layer=act_layer,\n            norm_layer=norm_layer,\n            dw_kernel_size=dw_kernel_size, # for InternImage-H/G\n            center_feature_scale=center_feature_scale) # for InternImage-H/G\n        self.drop_path = DropPath(drop_path) if drop_path > 0. \\\n            else nn.Identity()\n        self.norm2 = build_norm_layer(channels, 'LN')\n        self.mlp = MLPLayer(in_features=channels,\n                            hidden_features=int(channels * mlp_ratio),\n                            act_layer=act_layer,\n                            drop=drop)\n        self.layer_scale = layer_scale is not None\n        if self.layer_scale:\n            self.gamma1 = nn.Parameter(layer_scale * torch.ones(channels),\n                                       requires_grad=True)\n            self.gamma2 = nn.Parameter(layer_scale * torch.ones(channels),\n                                       requires_grad=True)\n        self.res_post_norm = res_post_norm\n        if res_post_norm:\n            self.res_post_norm1 = build_norm_layer(channels, 'LN')\n            self.res_post_norm2 = build_norm_layer(channels, 'LN')\n\n    def forward(self, x):\n\n        def _inner_forward(x):\n            if not self.layer_scale:\n                if self.post_norm:\n                    x = x + self.drop_path(self.norm1(self.dcn(x)))\n                    x = x + self.drop_path(self.norm2(self.mlp(x)))\n                elif self.res_post_norm: # for InternImage-H/G\n                    x = x + self.drop_path(self.res_post_norm1(self.dcn(self.norm1(x))))\n                    x = x + self.drop_path(self.res_post_norm2(self.mlp(self.norm2(x))))\n                else:\n                    x = x + self.drop_path(self.dcn(self.norm1(x)))\n                    x = x + self.drop_path(self.mlp(self.norm2(x)))\n                return x\n            if self.post_norm:\n                x = x + self.drop_path(self.gamma1 * self.norm1(self.dcn(x)))\n                x = x + self.drop_path(self.gamma2 * self.norm2(self.mlp(x)))\n            else:\n                x = x + self.drop_path(self.gamma1 * self.dcn(self.norm1(x)))\n                x = x + self.drop_path(self.gamma2 * self.mlp(self.norm2(x)))\n            return x\n\n        if self.with_cp and x.requires_grad:\n            x = checkpoint.checkpoint(_inner_forward, x)\n        else:\n            x = _inner_forward(x)\n        return x\n\n\nclass InternImageBlock(nn.Module):\n    r\"\"\" Block of InternImage\n    Args:\n        core_op (nn.Module): core operation of InternImage\n        channels (int): number of input channels\n        depths (list): Depth of each block.\n        groups (list): Groups of each block.\n        mlp_ratio (float): ratio of mlp hidden features to input channels\n        drop (float): dropout rate\n        drop_path (float): drop path rate\n        act_layer (str): activation layer\n        norm_layer (str): normalization layer\n        post_norm (bool): whether to use post normalization\n        layer_scale (float): layer scale\n        offset_scale (float): offset scale\n        with_cp (bool): whether to use checkpoint\n    \"\"\"\n\n    def __init__(self,\n                 core_op,\n                 channels,\n                 depth,\n                 groups,\n                 downsample=True,\n                 mlp_ratio=4.,\n                 drop=0.,\n                 drop_path=0.,\n                 act_layer='GELU',\n                 norm_layer='LN',\n                 post_norm=False,\n      
           offset_scale=1.0,\n                 layer_scale=None,\n                 with_cp=False,\n                 dw_kernel_size=None, # for InternImage-H/G\n                 post_norm_block_ids=None, # for InternImage-H/G\n                 res_post_norm=False, # for InternImage-H/G\n                 center_feature_scale=False): # for InternImage-H/G\n        super().__init__()\n        self.channels = channels\n        self.depth = depth\n        self.post_norm = post_norm\n        self.center_feature_scale = center_feature_scale\n\n        self.blocks = nn.ModuleList([\n            InternImageLayer(\n                core_op=core_op,\n                channels=channels,\n                groups=groups,\n                mlp_ratio=mlp_ratio,\n                drop=drop,\n                drop_path=drop_path[i] if isinstance(\n                    drop_path, list) else drop_path,\n                act_layer=act_layer,\n                norm_layer=norm_layer,\n                post_norm=post_norm,\n                layer_scale=layer_scale,\n                offset_scale=offset_scale,\n                with_cp=with_cp,\n                dw_kernel_size=dw_kernel_size, # for InternImage-H/G\n                res_post_norm=res_post_norm, # for InternImage-H/G\n                center_feature_scale=center_feature_scale # for InternImage-H/G\n            ) for i in range(depth)\n        ])\n        if not self.post_norm or center_feature_scale:\n            self.norm = build_norm_layer(channels, 'LN')\n        self.post_norm_block_ids = post_norm_block_ids\n        if post_norm_block_ids is not None: # for InternImage-H/G\n            self.post_norms = nn.ModuleList(\n                [build_norm_layer(channels, 'LN', eps=1e-6) for _ in post_norm_block_ids]\n            )\n        self.downsample = DownsampleLayer(\n            channels=channels, norm_layer=norm_layer) if downsample else None\n\n    def forward(self, x, return_wo_downsample=False):\n        for i, blk in enumerate(self.blocks):\n            x = blk(x)\n            if (self.post_norm_block_ids is not None) and (i in self.post_norm_block_ids):\n                index = self.post_norm_block_ids.index(i)\n                x = self.post_norms[index](x) # for InternImage-H/G\n        if not self.post_norm or self.center_feature_scale:\n            x = self.norm(x)\n        if return_wo_downsample:\n            x_ = x\n        if self.downsample is not None:\n            x = self.downsample(x)\n\n        if return_wo_downsample:\n            return x, x_\n        return x\n\nclass InternImage(Backbone):\n    r\"\"\" InternImage\n        A PyTorch impl of : `InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions`  -\n          https://arxiv.org/pdf/2103.14030\n    Args:\n        core_op (str): Core operator. Default: 'DCNv3'\n        channels (int): Number of the first stage. Default: 64\n        depths (list): Depth of each block. Default: [3, 4, 18, 5]\n        groups (list): Groups of each block. Default: [3, 6, 12, 24]\n        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4.\n        drop_rate (float): Probability of an element to be zeroed. Default: 0.\n        drop_path_rate (float): Stochastic depth rate. Default: 0.\n        act_layer (str): Activation layer. Default: 'GELU'\n        norm_layer (str): Normalization layer. Default: 'LN'\n        layer_scale (bool): Whether to use layer scale. Default: False\n        cls_scale (bool): Whether to use class scale. 
Default: False\n        with_cp (bool): Use checkpoint or not. Using checkpoint will save some\n        dw_kernel_size (int): Size of the dwconv. Default: None\n        level2_post_norm (bool): Whether to use level2 post norm. Default: False\n        level2_post_norm_block_ids (list): Indexes of post norm blocks. Default: None\n        res_post_norm (bool): Whether to use res post norm. Default: False\n        center_feature_scale (bool): Whether to use center feature scale. Default: False\n    \"\"\"\n\n    def __init__(self,\n                 core_op='DCNv3',\n                 channels=64,\n                 depths=[3, 4, 18, 5],\n                 groups=[3, 6, 12, 24],\n                 mlp_ratio=4.,\n                 drop_rate=0.,\n                 drop_path_rate=0.2,\n                 drop_path_type='linear',\n                 act_layer='GELU',\n                 norm_layer='LN',\n                 layer_scale=None,\n                 offset_scale=1.0,\n                 post_norm=False,\n                 with_cp=False,\n                 dw_kernel_size=None,  # for InternImage-H/G\n                 level2_post_norm=False,  # for InternImage-H/G\n                 level2_post_norm_block_ids=None,  # for InternImage-H/G\n                 res_post_norm=False,  # for InternImage-H/G\n                 center_feature_scale=False,  # for InternImage-H/G\n                 out_indices=(0, 1, 2, 3),\n                 init_cfg=None,\n                 **kwargs):\n        super().__init__()\n        self.core_op = core_op\n        self.num_levels = len(depths)\n        self.depths = depths\n        self.channels = channels\n        self.num_features = int(channels * 2**(self.num_levels - 1))\n        self.post_norm = post_norm\n        self.mlp_ratio = mlp_ratio\n        self.init_cfg = init_cfg\n        self.out_indices = out_indices\n        self.level2_post_norm_block_ids = level2_post_norm_block_ids\n        logger = setup_logger(name=\"InternImage\")\n        logger.info(f'using core type: {core_op}')\n        logger.info(f'using activation layer: {act_layer}')\n        logger.info(f'using main norm layer: {norm_layer}')\n        logger.info(f'using dpr: {drop_path_type}, {drop_path_rate}')\n        logger.info(f\"level2_post_norm: {level2_post_norm}\")\n        logger.info(f\"level2_post_norm_block_ids: {level2_post_norm_block_ids}\")\n        logger.info(f\"res_post_norm: {res_post_norm}\")\n\n        in_chans = 3\n        self.patch_embed = StemLayer(in_chans=in_chans,\n                                     out_chans=channels,\n                                     act_layer=act_layer,\n                                     norm_layer=norm_layer)\n        self.pos_drop = nn.Dropout(p=drop_rate)\n\n        dpr = [\n            x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))\n        ]\n        if drop_path_type == 'uniform':\n            for i in range(len(dpr)):\n                dpr[i] = drop_path_rate\n\n        self.levels = nn.ModuleList()\n        for i in range(self.num_levels):\n            post_norm_block_ids = level2_post_norm_block_ids if level2_post_norm and (\n                i == 2) else None # for InternImage-H/G\n            level = InternImageBlock(\n                core_op=getattr(opsm, core_op),\n                channels=int(channels * 2**i),\n                depth=depths[i],\n                groups=groups[i],\n                mlp_ratio=self.mlp_ratio,\n                drop=drop_rate,\n                drop_path=dpr[sum(depths[:i]):sum(depths[:i + 1])],\n            
    act_layer=act_layer,\n                norm_layer=norm_layer,\n                post_norm=post_norm,\n                downsample=(i < self.num_levels - 1),\n                layer_scale=layer_scale,\n                offset_scale=offset_scale,\n                with_cp=with_cp,\n                dw_kernel_size=dw_kernel_size,  # for InternImage-H/G\n                post_norm_block_ids=post_norm_block_ids, # for InternImage-H/G\n                res_post_norm=res_post_norm, # for InternImage-H/G\n                center_feature_scale=center_feature_scale # for InternImage-H/G\n            )\n            self.levels.append(level)\n\n        self.num_layers = len(depths)\n        self.apply(self._init_weights)\n        self.apply(self._init_deform_weights)\n\n        # add basic info for d2 backbone\n        self._out_features = [\"res{}\".format(i+2) for i in self.out_indices]\n        self._out_feature_channels = {\n            \"res{}\".format(i+2): self.channels * 2**i for i in self.out_indices\n        }\n        self._out_feature_strides = {\"res{}\".format(i+2): 2 ** (i + 2) for i in self.out_indices}\n        self._size_devisibility = 32\n\n\n    def _init_weights(self, m):\n        if isinstance(m, nn.Linear):\n            trunc_normal_(m.weight, std=.02)\n            if isinstance(m, nn.Linear) and m.bias is not None:\n                nn.init.constant_(m.bias, 0)\n        elif isinstance(m, nn.LayerNorm):\n            nn.init.constant_(m.bias, 0)\n            nn.init.constant_(m.weight, 1.0)\n\n    def _init_deform_weights(self, m):\n        if isinstance(m, getattr(opsm, self.core_op)):\n            m._reset_parameters()\n\n    def forward(self, x):\n        x = self.patch_embed(x)\n        x = self.pos_drop(x)\n\n        # d2 need dict output\n        # seq_out = []\n        seq_out = {}\n        for level_idx, level in enumerate(self.levels):\n            x, x_ = level(x, return_wo_downsample=True)\n            if level_idx in self.out_indices:\n                # seq_out.append(x_.permute(0, 3, 1, 2).contiguous())\n                seq_out[\"res{}\".format(level_idx+2)] = x_.permute(0, 3, 1, 2).contiguous()\n        return seq_out\n\n@BACKBONE_REGISTRY.register()\nclass D2InternImage(InternImage):\n    def __init__(self, cfg, input_shape):\n\n        super().__init__(\n            core_op= cfg.MODEL.INTERNIMAGE.CORE_OP ,\n            channels=cfg.MODEL.INTERNIMAGE.CHANNELS,\n            depths=cfg.MODEL.INTERNIMAGE.DEPTHS,\n            groups=cfg.MODEL.INTERNIMAGE.GROUPS,\n            mlp_ratio= cfg.MODEL.INTERNIMAGE.MLP_RATIO ,\n            drop_path_rate=cfg.MODEL.INTERNIMAGE.DROP_PATH_RATE,\n            norm_layer=cfg.MODEL.INTERNIMAGE.NORM_LAYER,\n            layer_scale=cfg.MODEL.INTERNIMAGE.LAYER_SCALE ,\n            offset_scale=cfg.MODEL.INTERNIMAGE.OFFSET_SCALE,\n            post_norm=cfg.MODEL.INTERNIMAGE.POST_NORM,\n            with_cp=cfg.MODEL.INTERNIMAGE.WITH_CP ,\n            out_indices=cfg.MODEL.INTERNIMAGE.OUT_IINDICES,\n            dw_kernel_size= cfg.MODEL.INTERNIMAGE.DW_KERNEL_SIZE, # for InternImage-H/G\n            res_post_norm= cfg.MODEL.INTERNIMAGE.RES_POST_NORM, # for InternImage-H/G\n            level2_post_norm= cfg.MODEL.INTERNIMAGE.LEVEL2_POST_NORM, # for InternImage-H/G\n            level2_post_norm_block_ids= cfg.MODEL.INTERNIMAGE.LEVEL2_POST_NORM_BLOCK_IDS, # for InternImage-H/G\n            center_feature_scale= cfg.MODEL.INTERNIMAGE.CENTER_FEATURE_SCALE, # for InternImage-H/G\n            \n\n        )\n\n\n        pretrained_weight = 
cfg.MODEL.INTERNIMAGE.PRETRAINED_WEIGHT\n        if pretrained_weight:\n            checkpoint = torch.load(pretrained_weight, map_location='cpu')\n            print(f'\\nload pretrained weight from {pretrained_weight} \\n') \n            self.load_state_dict(checkpoint['model'], strict=False)\n        \n\n    def forward(self, x):\n        \"\"\"\n        Args:\n            x: Tensor of shape (N,C,H,W). H, W must be a multiple of ``self.size_divisibility``.\n        Returns:\n            dict[str->Tensor]: names and the corresponding features\n        \"\"\"\n        assert (\n            x.dim() == 4\n        ), f\"D2InternImage takes an input of shape (N, C, H, W). Got {x.shape} instead!\"\n        outputs = {}\n        y = super().forward(x)\n        for k in y.keys():\n            if k in self._out_features:\n                outputs[k] = y[k]\n        return outputs\n\n    def output_shape(self):\n        return {\n            name: ShapeSpec(\n                channels=self._out_feature_channels[name], stride=self._out_feature_strides[name]\n            )\n            for name in self._out_features\n        }\n\n    @property\n    def size_divisibility(self):\n        return 32\n\n\n"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/registry.py",
    "content": "_model_entrypoints = {}\n\n\ndef register_backbone(fn):\n    module_name_split = fn.__module__.split('.')\n    model_name = module_name_split[-1]\n    _model_entrypoints[model_name] = fn\n    return fn\n\ndef model_entrypoints(model_name):\n    return _model_entrypoints[model_name]\n\ndef is_model(model_name):\n    return model_name in _model_entrypoints\n"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/resnet.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\nimport pickle\nimport numpy as np\nfrom typing import Any, Dict\nimport fvcore.nn.weight_init as weight_init\nimport torch\nimport torch.nn.functional as F\nfrom torch import nn\n\n\nfrom .backbone import Backbone\nfrom .registry import register_backbone\n\nfrom detectron2.layers import (\n    CNNBlockBase,\n    Conv2d,\n    DeformConv,\n    ModulatedDeformConv,\n    ShapeSpec,\n    get_norm,\n)\nfrom detectron2.utils.file_io import PathManager\n\n__all__ = [\n    \"ResNetBlockBase\",\n    \"BasicBlock\",\n    \"BottleneckBlock\",\n    \"DeformBottleneckBlock\",\n    \"BasicStem\",\n    \"ResNet\",\n    \"make_stage\",\n    \"get_resnet_backbone\",\n]\n\n\nclass BasicBlock(CNNBlockBase):\n    \"\"\"\n    The basic residual block for ResNet-18 and ResNet-34 defined in :paper:`ResNet`,\n    with two 3x3 conv layers and a projection shortcut if needed.\n    \"\"\"\n\n    def __init__(self, in_channels, out_channels, *, stride=1, norm=\"BN\"):\n        \"\"\"\n        Args:\n            in_channels (int): Number of input channels.\n            out_channels (int): Number of output channels.\n            stride (int): Stride for the first conv.\n            norm (str or callable): normalization for all conv layers.\n                See :func:`layers.get_norm` for supported format.\n        \"\"\"\n        super().__init__(in_channels, out_channels, stride)\n\n        if in_channels != out_channels:\n            self.shortcut = Conv2d(\n                in_channels,\n                out_channels,\n                kernel_size=1,\n                stride=stride,\n                bias=False,\n                norm=get_norm(norm, out_channels),\n            )\n        else:\n            self.shortcut = None\n\n        self.conv1 = Conv2d(\n            in_channels,\n            out_channels,\n            kernel_size=3,\n            stride=stride,\n            padding=1,\n            bias=False,\n            norm=get_norm(norm, out_channels),\n        )\n\n        self.conv2 = Conv2d(\n            out_channels,\n            out_channels,\n            kernel_size=3,\n            stride=1,\n            padding=1,\n            bias=False,\n            norm=get_norm(norm, out_channels),\n        )\n\n        for layer in [self.conv1, self.conv2, self.shortcut]:\n            if layer is not None:  # shortcut can be None\n                weight_init.c2_msra_fill(layer)\n\n    def forward(self, x):\n        out = self.conv1(x)\n        out = F.relu_(out)\n        out = self.conv2(out)\n\n        if self.shortcut is not None:\n            shortcut = self.shortcut(x)\n        else:\n            shortcut = x\n\n        out += shortcut\n        out = F.relu_(out)\n        return out\n\n\nclass BottleneckBlock(CNNBlockBase):\n    \"\"\"\n    The standard bottleneck residual block used by ResNet-50, 101 and 152\n    defined in :paper:`ResNet`.  
It contains 3 conv layers with kernels\n    1x1, 3x3, 1x1, and a projection shortcut if needed.\n    \"\"\"\n\n    def __init__(\n        self,\n        in_channels,\n        out_channels,\n        *,\n        bottleneck_channels,\n        stride=1,\n        num_groups=1,\n        norm=\"BN\",\n        stride_in_1x1=False,\n        dilation=1,\n    ):\n        \"\"\"\n        Args:\n            bottleneck_channels (int): number of output channels for the 3x3\n                \"bottleneck\" conv layers.\n            num_groups (int): number of groups for the 3x3 conv layer.\n            norm (str or callable): normalization for all conv layers.\n                See :func:`layers.get_norm` for supported format.\n            stride_in_1x1 (bool): when stride>1, whether to put stride in the\n                first 1x1 convolution or the bottleneck 3x3 convolution.\n            dilation (int): the dilation rate of the 3x3 conv layer.\n        \"\"\"\n        super().__init__(in_channels, out_channels, stride)\n\n        if in_channels != out_channels:\n            self.shortcut = Conv2d(\n                in_channels,\n                out_channels,\n                kernel_size=1,\n                stride=stride,\n                bias=False,\n                norm=get_norm(norm, out_channels),\n            )\n        else:\n            self.shortcut = None\n\n        # The original MSRA ResNet models have stride in the first 1x1 conv\n        # The subsequent fb.torch.resnet and Caffe2 ResNe[X]t implementations have\n        # stride in the 3x3 conv\n        stride_1x1, stride_3x3 = (stride, 1) if stride_in_1x1 else (1, stride)\n\n        self.conv1 = Conv2d(\n            in_channels,\n            bottleneck_channels,\n            kernel_size=1,\n            stride=stride_1x1,\n            bias=False,\n            norm=get_norm(norm, bottleneck_channels),\n        )\n\n        self.conv2 = Conv2d(\n            bottleneck_channels,\n            bottleneck_channels,\n            kernel_size=3,\n            stride=stride_3x3,\n            padding=1 * dilation,\n            bias=False,\n            groups=num_groups,\n            dilation=dilation,\n            norm=get_norm(norm, bottleneck_channels),\n        )\n\n        self.conv3 = Conv2d(\n            bottleneck_channels,\n            out_channels,\n            kernel_size=1,\n            bias=False,\n            norm=get_norm(norm, out_channels),\n        )\n\n        for layer in [self.conv1, self.conv2, self.conv3, self.shortcut]:\n            if layer is not None:  # shortcut can be None\n                weight_init.c2_msra_fill(layer)\n\n        # Zero-initialize the last normalization in each residual branch,\n        # so that at the beginning, the residual branch starts with zeros,\n        # and each residual block behaves like an identity.\n        # See Sec 5.1 in \"Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour\":\n        # \"For BN layers, the learnable scaling coefficient γ is initialized\n        # to be 1, except for each residual block's last BN\n        # where γ is initialized to be 0.\"\n\n        # nn.init.constant_(self.conv3.norm.weight, 0)\n        # TODO this somehow hurts performance when training GN models from scratch.\n        # Add it as an option when we need to use this code to train a backbone.\n\n    def forward(self, x):\n        out = self.conv1(x)\n        out = F.relu_(out)\n\n        out = self.conv2(out)\n        out = F.relu_(out)\n\n        out = self.conv3(out)\n\n        if self.shortcut is 
not None:\n            shortcut = self.shortcut(x)\n        else:\n            shortcut = x\n\n        out += shortcut\n        out = F.relu_(out)\n        return out\n\n\nclass DeformBottleneckBlock(CNNBlockBase):\n    \"\"\"\n    Similar to :class:`BottleneckBlock`, but with :paper:`deformable conv <deformconv>`\n    in the 3x3 convolution.\n    \"\"\"\n\n    def __init__(\n        self,\n        in_channels,\n        out_channels,\n        *,\n        bottleneck_channels,\n        stride=1,\n        num_groups=1,\n        norm=\"BN\",\n        stride_in_1x1=False,\n        dilation=1,\n        deform_modulated=False,\n        deform_num_groups=1,\n    ):\n        super().__init__(in_channels, out_channels, stride)\n        self.deform_modulated = deform_modulated\n\n        if in_channels != out_channels:\n            self.shortcut = Conv2d(\n                in_channels,\n                out_channels,\n                kernel_size=1,\n                stride=stride,\n                bias=False,\n                norm=get_norm(norm, out_channels),\n            )\n        else:\n            self.shortcut = None\n\n        stride_1x1, stride_3x3 = (stride, 1) if stride_in_1x1 else (1, stride)\n\n        self.conv1 = Conv2d(\n            in_channels,\n            bottleneck_channels,\n            kernel_size=1,\n            stride=stride_1x1,\n            bias=False,\n            norm=get_norm(norm, bottleneck_channels),\n        )\n\n        if deform_modulated:\n            deform_conv_op = ModulatedDeformConv\n            # offset channels are 2 or 3 (if with modulated) * kernel_size * kernel_size\n            offset_channels = 27\n        else:\n            deform_conv_op = DeformConv\n            offset_channels = 18\n\n        self.conv2_offset = Conv2d(\n            bottleneck_channels,\n            offset_channels * deform_num_groups,\n            kernel_size=3,\n            stride=stride_3x3,\n            padding=1 * dilation,\n            dilation=dilation,\n        )\n        self.conv2 = deform_conv_op(\n            bottleneck_channels,\n            bottleneck_channels,\n            kernel_size=3,\n            stride=stride_3x3,\n            padding=1 * dilation,\n            bias=False,\n            groups=num_groups,\n            dilation=dilation,\n            deformable_groups=deform_num_groups,\n            norm=get_norm(norm, bottleneck_channels),\n        )\n\n        self.conv3 = Conv2d(\n            bottleneck_channels,\n            out_channels,\n            kernel_size=1,\n            bias=False,\n            norm=get_norm(norm, out_channels),\n        )\n\n        for layer in [self.conv1, self.conv2, self.conv3, self.shortcut]:\n            if layer is not None:  # shortcut can be None\n                weight_init.c2_msra_fill(layer)\n\n        nn.init.constant_(self.conv2_offset.weight, 0)\n        nn.init.constant_(self.conv2_offset.bias, 0)\n\n    def forward(self, x):\n        out = self.conv1(x)\n        out = F.relu_(out)\n\n        if self.deform_modulated:\n            offset_mask = self.conv2_offset(out)\n            offset_x, offset_y, mask = torch.chunk(offset_mask, 3, dim=1)\n            offset = torch.cat((offset_x, offset_y), dim=1)\n            mask = mask.sigmoid()\n            out = self.conv2(out, offset, mask)\n        else:\n            offset = self.conv2_offset(out)\n            out = self.conv2(out, offset)\n        out = F.relu_(out)\n\n        out = self.conv3(out)\n\n        if self.shortcut is not None:\n            shortcut = 
self.shortcut(x)\n        else:\n            shortcut = x\n\n        out += shortcut\n        out = F.relu_(out)\n        return out\n\n\nclass BasicStem(CNNBlockBase):\n    \"\"\"\n    The standard ResNet stem (layers before the first residual block),\n    with a conv, relu and max_pool.\n    \"\"\"\n\n    def __init__(self, in_channels=3, out_channels=64, norm=\"BN\"):\n        \"\"\"\n        Args:\n            norm (str or callable): norm after the first conv layer.\n                See :func:`layers.get_norm` for supported format.\n        \"\"\"\n        super().__init__(in_channels, out_channels, 4)\n        self.in_channels = in_channels\n        self.conv1 = Conv2d(\n            in_channels,\n            out_channels,\n            kernel_size=7,\n            stride=2,\n            padding=3,\n            bias=False,\n            norm=get_norm(norm, out_channels),\n        )\n        weight_init.c2_msra_fill(self.conv1)\n\n    def forward(self, x):\n        x = self.conv1(x)\n        x = F.relu_(x)\n        x = F.max_pool2d(x, kernel_size=3, stride=2, padding=1)\n        return x\n\n\nclass ResNet(Backbone):\n    \"\"\"\n    Implement :paper:`ResNet`.\n    \"\"\"\n\n    def __init__(self, stem, stages, num_classes=None, out_features=None, freeze_at=0):\n        \"\"\"\n        Args:\n            stem (nn.Module): a stem module\n            stages (list[list[CNNBlockBase]]): several (typically 4) stages,\n                each contains multiple :class:`CNNBlockBase`.\n            num_classes (None or int): if None, will not perform classification.\n                Otherwise, will create a linear layer.\n            out_features (list[str]): name of the layers whose outputs should\n                be returned in forward. Can be anything in \"stem\", \"linear\", or \"res2\" ...\n                If None, will return the output of the last layer.\n            freeze_at (int): The number of stages at the beginning to freeze.\n                see :meth:`freeze` for detailed explanation.\n        \"\"\"\n        super().__init__()\n        self.stem = stem\n        self.num_classes = num_classes\n\n        current_stride = self.stem.stride\n        self._out_feature_strides = {\"stem\": current_stride}\n        self._out_feature_channels = {\"stem\": self.stem.out_channels}\n\n        self.stage_names, self.stages = [], []\n\n        if out_features is not None:\n            # Avoid keeping unused layers in this module. 
They consume extra memory\n            # and may cause allreduce to fail\n            num_stages = max(\n                [{\"res2\": 1, \"res3\": 2, \"res4\": 3, \"res5\": 4}.get(f, 0) for f in out_features]\n            )\n            stages = stages[:num_stages]\n        for i, blocks in enumerate(stages):\n            assert len(blocks) > 0, len(blocks)\n            for block in blocks:\n                assert isinstance(block, CNNBlockBase), block\n\n            name = \"res\" + str(i + 2)\n            stage = nn.Sequential(*blocks)\n\n            self.add_module(name, stage)\n            self.stage_names.append(name)\n            self.stages.append(stage)\n\n            self._out_feature_strides[name] = current_stride = int(\n                current_stride * np.prod([k.stride for k in blocks])\n            )\n            self._out_feature_channels[name] = curr_channels = blocks[-1].out_channels\n        self.stage_names = tuple(self.stage_names)  # Make it static for scripting\n\n        if num_classes is not None:\n            self.avgpool = nn.AdaptiveAvgPool2d((1, 1))\n            self.linear = nn.Linear(curr_channels, num_classes)\n\n            # Sec 5.1 in \"Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour\":\n            # \"The 1000-way fully-connected layer is initialized by\n            # drawing weights from a zero-mean Gaussian with standard deviation of 0.01.\"\n            nn.init.normal_(self.linear.weight, std=0.01)\n            name = \"linear\"\n\n        if out_features is None:\n            out_features = [name]\n        self._out_features = out_features\n        assert len(self._out_features)\n        children = [x[0] for x in self.named_children()]\n        for out_feature in self._out_features:\n            assert out_feature in children, \"Available children: {}\".format(\", \".join(children))\n        self.freeze(freeze_at)\n\n    def forward(self, x):\n        \"\"\"\n        Args:\n            x: Tensor of shape (N,C,H,W). H, W must be a multiple of ``self.size_divisibility``.\n\n        Returns:\n            dict[str->Tensor]: names and the corresponding features\n        \"\"\"\n        assert x.dim() == 4, f\"ResNet takes an input of shape (N, C, H, W). Got {x.shape} instead!\"\n        outputs = {}\n        x = self.stem(x)\n        if \"stem\" in self._out_features:\n            outputs[\"stem\"] = x\n        for name, stage in zip(self.stage_names, self.stages):\n            x = stage(x)\n            if name in self._out_features:\n                outputs[name] = x\n        if self.num_classes is not None:\n            x = self.avgpool(x)\n            x = torch.flatten(x, 1)\n            x = self.linear(x)\n            if \"linear\" in self._out_features:\n                outputs[\"linear\"] = x\n        return outputs\n\n    def output_shape(self):\n        return {\n            name: ShapeSpec(\n                channels=self._out_feature_channels[name], stride=self._out_feature_strides[name]\n            )\n            for name in self._out_features\n        }\n\n    def freeze(self, freeze_at=0):\n        \"\"\"\n        Freeze the first several stages of the ResNet. Commonly used in\n        fine-tuning.\n\n        Layers that produce the same feature map spatial size are defined as one\n        \"stage\" by :paper:`FPN`.\n\n        Args:\n            freeze_at (int): number of stages to freeze.\n                `1` means freezing the stem. 
`2` means freezing the stem and\n                one residual stage, etc.\n\n        Returns:\n            nn.Module: this ResNet itself\n        \"\"\"\n        if freeze_at >= 1:\n            self.stem.freeze()\n        for idx, stage in enumerate(self.stages, start=2):\n            if freeze_at >= idx:\n                for block in stage.children():\n                    block.freeze()\n        return self\n\n    @staticmethod\n    def make_stage(block_class, num_blocks, *, in_channels, out_channels, **kwargs):\n        \"\"\"\n        Create a list of blocks of the same type that forms one ResNet stage.\n\n        Args:\n            block_class (type): a subclass of CNNBlockBase that's used to create all blocks in this\n                stage. A module of this type must not change spatial resolution of inputs unless its\n                stride != 1.\n            num_blocks (int): number of blocks in this stage\n            in_channels (int): input channels of the entire stage.\n            out_channels (int): output channels of **every block** in the stage.\n            kwargs: other arguments passed to the constructor of\n                `block_class`. If the argument name is \"xx_per_block\", the\n                argument is a list of values to be passed to each block in the\n                stage. Otherwise, the same argument is passed to every block\n                in the stage.\n\n        Returns:\n            list[CNNBlockBase]: a list of block module.\n\n        Examples:\n        ::\n            stage = ResNet.make_stage(\n                BottleneckBlock, 3, in_channels=16, out_channels=64,\n                bottleneck_channels=16, num_groups=1,\n                stride_per_block=[2, 1, 1],\n                dilations_per_block=[1, 1, 2]\n            )\n\n        Usually, layers that produce the same feature map spatial size are defined as one\n        \"stage\" (in :paper:`FPN`). Under such definition, ``stride_per_block[1:]`` should\n        all be 1.\n        \"\"\"\n        blocks = []\n        for i in range(num_blocks):\n            curr_kwargs = {}\n            for k, v in kwargs.items():\n                if k.endswith(\"_per_block\"):\n                    assert len(v) == num_blocks, (\n                        f\"Argument '{k}' of make_stage should have the \"\n                        f\"same length as num_blocks={num_blocks}.\"\n                    )\n                    newk = k[: -len(\"_per_block\")]\n                    assert newk not in kwargs, f\"Cannot call make_stage with both {k} and {newk}!\"\n                    curr_kwargs[newk] = v[i]\n                else:\n                    curr_kwargs[k] = v\n\n            blocks.append(\n                block_class(in_channels=in_channels, out_channels=out_channels, **curr_kwargs)\n            )\n            in_channels = out_channels\n        return blocks\n\n    @staticmethod\n    def make_default_stages(depth, block_class=None, **kwargs):\n        \"\"\"\n        Created list of ResNet stages from pre-defined depth (one of 18, 34, 50, 101, 152).\n        If it doesn't create the ResNet variant you need, please use :meth:`make_stage`\n        instead for fine-grained customization.\n\n        Args:\n            depth (int): depth of ResNet\n            block_class (type): the CNN block class. 
Has to accept\n                `bottleneck_channels` argument for depth > 50.\n                By default it is BasicBlock or BottleneckBlock, based on the\n                depth.\n            kwargs:\n                other arguments to pass to `make_stage`. Should not contain\n                stride and channels, as they are predefined for each depth.\n\n        Returns:\n            list[list[CNNBlockBase]]: modules in all stages; see arguments of\n                :class:`ResNet.__init__`.\n        \"\"\"\n        num_blocks_per_stage = {\n            18: [2, 2, 2, 2],\n            34: [3, 4, 6, 3],\n            50: [3, 4, 6, 3],\n            101: [3, 4, 23, 3],\n            152: [3, 8, 36, 3],\n        }[depth]\n        if block_class is None:\n            block_class = BasicBlock if depth < 50 else BottleneckBlock\n        if depth < 50:\n            in_channels = [64, 64, 128, 256]\n            out_channels = [64, 128, 256, 512]\n        else:\n            in_channels = [64, 256, 512, 1024]\n            out_channels = [256, 512, 1024, 2048]\n        ret = []\n        for (n, s, i, o) in zip(num_blocks_per_stage, [1, 2, 2, 2], in_channels, out_channels):\n            if depth >= 50:\n                kwargs[\"bottleneck_channels\"] = o // 4\n            ret.append(\n                ResNet.make_stage(\n                    block_class=block_class,\n                    num_blocks=n,\n                    stride_per_block=[s] + [1] * (n - 1),\n                    in_channels=i,\n                    out_channels=o,\n                    **kwargs,\n                )\n            )\n        return ret\n\n\nResNetBlockBase = CNNBlockBase\n\"\"\"\nAlias for backward compatibiltiy.\n\"\"\"\n\n\ndef make_stage(*args, **kwargs):\n    \"\"\"\n    Deprecated alias for backward compatibiltiy.\n    \"\"\"\n    return ResNet.make_stage(*args, **kwargs)\n\n\ndef _convert_ndarray_to_tensor(state_dict: Dict[str, Any]) -> None:\n    \"\"\"\n    In-place convert all numpy arrays in the state_dict to torch tensor.\n    Args:\n        state_dict (dict): a state-dict to be loaded to the model.\n            Will be modified.\n    \"\"\"\n    # model could be an OrderedDict with _metadata attribute\n    # (as returned by Pytorch's state_dict()). We should preserve these\n    # properties.\n    for k in list(state_dict.keys()):\n        v = state_dict[k]\n        if not isinstance(v, np.ndarray) and not isinstance(v, torch.Tensor):\n            raise ValueError(\n                \"Unsupported type found in checkpoint! 
{}: {}\".format(k, type(v))\n            )\n        if not isinstance(v, torch.Tensor):\n            state_dict[k] = torch.from_numpy(v)\n\n\n@register_backbone\ndef get_resnet_backbone(cfg):\n    \"\"\"\n    Create a ResNet instance from config.\n\n    Returns:\n        ResNet: a :class:`ResNet` instance.\n    \"\"\"\n    res_cfg = cfg['MODEL']['BACKBONE']['RESNETS']\n\n    # need registration of new blocks/stems?\n    norm = res_cfg['NORM']\n    stem = BasicStem(\n        in_channels=res_cfg['STEM_IN_CHANNELS'],\n        out_channels=res_cfg['STEM_OUT_CHANNELS'],\n        norm=norm,\n    )\n\n    # fmt: off\n    freeze_at           = res_cfg['FREEZE_AT']\n    out_features        = res_cfg['OUT_FEATURES']\n    depth               = res_cfg['DEPTH']\n    num_groups          = res_cfg['NUM_GROUPS']\n    width_per_group     = res_cfg['WIDTH_PER_GROUP']\n    bottleneck_channels = num_groups * width_per_group\n    in_channels         = res_cfg['STEM_OUT_CHANNELS']\n    out_channels        = res_cfg['RES2_OUT_CHANNELS']\n    stride_in_1x1       = res_cfg['STRIDE_IN_1X1']\n    res5_dilation       = res_cfg['RES5_DILATION']\n    deform_on_per_stage = res_cfg['DEFORM_ON_PER_STAGE']\n    deform_modulated    = res_cfg['DEFORM_MODULATED']\n    deform_num_groups   = res_cfg['DEFORM_NUM_GROUPS']\n    # fmt: on\n    assert res5_dilation in {1, 2}, \"res5_dilation cannot be {}.\".format(res5_dilation)\n\n    num_blocks_per_stage = {\n        18: [2, 2, 2, 2],\n        34: [3, 4, 6, 3],\n        50: [3, 4, 6, 3],\n        101: [3, 4, 23, 3],\n        152: [3, 8, 36, 3],\n    }[depth]\n\n    if depth in [18, 34]:\n        assert out_channels == 64, \"Must set MODEL.RESNETS.RES2_OUT_CHANNELS = 64 for R18/R34\"\n        assert not any(\n            deform_on_per_stage\n        ), \"MODEL.RESNETS.DEFORM_ON_PER_STAGE unsupported for R18/R34\"\n        assert res5_dilation == 1, \"Must set MODEL.RESNETS.RES5_DILATION = 1 for R18/R34\"\n        assert num_groups == 1, \"Must set MODEL.RESNETS.NUM_GROUPS = 1 for R18/R34\"\n\n    stages = []\n\n    for idx, stage_idx in enumerate(range(2, 6)):\n        # res5_dilation is used this way as a convention in R-FCN & Deformable Conv paper\n        dilation = res5_dilation if stage_idx == 5 else 1\n        first_stride = 1 if idx == 0 or (stage_idx == 5 and dilation == 2) else 2\n        stage_kargs = {\n            \"num_blocks\": num_blocks_per_stage[idx],\n            \"stride_per_block\": [first_stride] + [1] * (num_blocks_per_stage[idx] - 1),\n            \"in_channels\": in_channels,\n            \"out_channels\": out_channels,\n            \"norm\": norm,\n        }\n        # Use BasicBlock for R18 and R34.\n        if depth in [18, 34]:\n            stage_kargs[\"block_class\"] = BasicBlock\n        else:\n            stage_kargs[\"bottleneck_channels\"] = bottleneck_channels\n            stage_kargs[\"stride_in_1x1\"] = stride_in_1x1\n            stage_kargs[\"dilation\"] = dilation\n            stage_kargs[\"num_groups\"] = num_groups\n            if deform_on_per_stage[idx]:\n                stage_kargs[\"block_class\"] = DeformBottleneckBlock\n                stage_kargs[\"deform_modulated\"] = deform_modulated\n                stage_kargs[\"deform_num_groups\"] = deform_num_groups\n            else:\n                stage_kargs[\"block_class\"] = BottleneckBlock\n        blocks = ResNet.make_stage(**stage_kargs)\n        in_channels = out_channels\n        out_channels *= 2\n        bottleneck_channels *= 2\n        stages.append(blocks)\n    backbone = 
ResNet(stem, stages, out_features=out_features, freeze_at=freeze_at)\n\n    if cfg['MODEL']['BACKBONE']['LOAD_PRETRAINED'] is True:\n        filename = cfg['MODEL']['BACKBONE']['PRETRAINED']\n        with PathManager.open(filename, \"rb\") as f:\n            ckpt = pickle.load(f, encoding=\"latin1\")['model']\n        _convert_ndarray_to_tensor(ckpt)\n        ckpt.pop('stem.fc.weight')\n        ckpt.pop('stem.fc.bias')\n        backbone.load_state_dict(ckpt)\n\n    return backbone\n"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/swin.py",
    "content": "# --------------------------------------------------------\n# Swin Transformer\n# Copyright (c) 2021 Microsoft\n# Licensed under The MIT License [see LICENSE for details]\n# Written by Ze Liu, Yutong Lin, Yixuan Wei\n# --------------------------------------------------------\n\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport torch.utils.checkpoint as checkpoint\nfrom timm.models.layers import DropPath, to_2tuple, trunc_normal_\n\nfrom detectron2.modeling import BACKBONE_REGISTRY, Backbone, ShapeSpec\nfrom .registry import register_backbone\n\n\nclass Mlp(nn.Module):\n    \"\"\"Multilayer perceptron.\"\"\"\n\n    def __init__(\n        self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.0\n    ):\n        super().__init__()\n        out_features = out_features or in_features\n        hidden_features = hidden_features or in_features\n        self.fc1 = nn.Linear(in_features, hidden_features)\n        self.act = act_layer()\n        self.fc2 = nn.Linear(hidden_features, out_features)\n        self.drop = nn.Dropout(drop)\n\n    def forward(self, x):\n        x = self.fc1(x)\n        x = self.act(x)\n        x = self.drop(x)\n        x = self.fc2(x)\n        x = self.drop(x)\n        return x\n\n\ndef window_partition(x, window_size):\n    \"\"\"\n    Args:\n        x: (B, H, W, C)\n        window_size (int): window size\n    Returns:\n        windows: (num_windows*B, window_size, window_size, C)\n    \"\"\"\n    B, H, W, C = x.shape\n    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)\n    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)\n    return windows\n\n\ndef window_reverse(windows, window_size, H, W):\n    \"\"\"\n    Args:\n        windows: (num_windows*B, window_size, window_size, C)\n        window_size (int): Window size\n        H (int): Height of image\n        W (int): Width of image\n    Returns:\n        x: (B, H, W, C)\n    \"\"\"\n    B = int(windows.shape[0] / (H * W / window_size / window_size))\n    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)\n    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)\n    return x\n\n\nclass WindowAttention(nn.Module):\n    \"\"\"Window based multi-head self attention (W-MSA) module with relative position bias.\n    It supports both of shifted and non-shifted window.\n    Args:\n        dim (int): Number of input channels.\n        window_size (tuple[int]): The height and width of the window.\n        num_heads (int): Number of attention heads.\n        qkv_bias (bool, optional):  If True, add a learnable bias to query, key, value. Default: True\n        qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set\n        attn_drop (float, optional): Dropout ratio of attention weight. Default: 0.0\n        proj_drop (float, optional): Dropout ratio of output. 
Default: 0.0\n    \"\"\"\n\n    def __init__(\n        self,\n        dim,\n        window_size,\n        num_heads,\n        qkv_bias=True,\n        qk_scale=None,\n        attn_drop=0.0,\n        proj_drop=0.0,\n    ):\n\n        super().__init__()\n        self.dim = dim\n        self.window_size = window_size  # Wh, Ww\n        self.num_heads = num_heads\n        head_dim = dim // num_heads\n        self.scale = qk_scale or head_dim ** -0.5\n\n        # define a parameter table of relative position bias\n        self.relative_position_bias_table = nn.Parameter(\n            torch.zeros((2 * window_size[0] - 1) * (2 * window_size[1] - 1), num_heads)\n        )  # 2*Wh-1 * 2*Ww-1, nH\n\n        # get pair-wise relative position index for each token inside the window\n        coords_h = torch.arange(self.window_size[0])\n        coords_w = torch.arange(self.window_size[1])\n        coords = torch.stack(torch.meshgrid([coords_h, coords_w]))  # 2, Wh, Ww\n        coords_flatten = torch.flatten(coords, 1)  # 2, Wh*Ww\n        relative_coords = coords_flatten[:, :, None] - coords_flatten[:, None, :]  # 2, Wh*Ww, Wh*Ww\n        relative_coords = relative_coords.permute(1, 2, 0).contiguous()  # Wh*Ww, Wh*Ww, 2\n        relative_coords[:, :, 0] += self.window_size[0] - 1  # shift to start from 0\n        relative_coords[:, :, 1] += self.window_size[1] - 1\n        relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1\n        relative_position_index = relative_coords.sum(-1)  # Wh*Ww, Wh*Ww\n        self.register_buffer(\"relative_position_index\", relative_position_index)\n\n        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)\n        self.attn_drop = nn.Dropout(attn_drop)\n        self.proj = nn.Linear(dim, dim)\n        self.proj_drop = nn.Dropout(proj_drop)\n\n        trunc_normal_(self.relative_position_bias_table, std=0.02)\n        self.softmax = nn.Softmax(dim=-1)\n\n    def forward(self, x, mask=None):\n        \"\"\"Forward function.\n        Args:\n            x: input features with shape of (num_windows*B, N, C)\n            mask: (0/-inf) mask with shape of (num_windows, Wh*Ww, Wh*Ww) or None\n        \"\"\"\n        B_, N, C = x.shape\n        qkv = (\n            self.qkv(x)\n            .reshape(B_, N, 3, self.num_heads, C // self.num_heads)\n            .permute(2, 0, 3, 1, 4)\n        )\n        q, k, v = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)\n\n        q = q * self.scale\n        attn = q @ k.transpose(-2, -1)\n\n        relative_position_bias = self.relative_position_bias_table[\n            self.relative_position_index.view(-1)\n        ].view(\n            self.window_size[0] * self.window_size[1], self.window_size[0] * self.window_size[1], -1\n        )  # Wh*Ww,Wh*Ww,nH\n        relative_position_bias = relative_position_bias.permute(\n            2, 0, 1\n        ).contiguous()  # nH, Wh*Ww, Wh*Ww\n        attn = attn + relative_position_bias.unsqueeze(0)\n\n        if mask is not None:\n            nW = mask.shape[0]\n            attn = attn.view(B_ // nW, nW, self.num_heads, N, N) + mask.unsqueeze(1).unsqueeze(0)\n            attn = attn.view(-1, self.num_heads, N, N)\n            attn = self.softmax(attn)\n        else:\n            attn = self.softmax(attn)\n\n        attn = self.attn_drop(attn)\n\n        x = (attn @ v).transpose(1, 2).reshape(B_, N, C)\n        x = self.proj(x)\n        x = self.proj_drop(x)\n        return x\n\n\nclass SwinTransformerBlock(nn.Module):\n    \"\"\"Swin Transformer Block.\n    Args:\n   
     dim (int): Number of input channels.\n        num_heads (int): Number of attention heads.\n        window_size (int): Window size.\n        shift_size (int): Shift size for SW-MSA.\n        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.\n        qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True\n        qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set.\n        drop (float, optional): Dropout rate. Default: 0.0\n        attn_drop (float, optional): Attention dropout rate. Default: 0.0\n        drop_path (float, optional): Stochastic depth rate. Default: 0.0\n        act_layer (nn.Module, optional): Activation layer. Default: nn.GELU\n        norm_layer (nn.Module, optional): Normalization layer.  Default: nn.LayerNorm\n    \"\"\"\n\n    def __init__(\n        self,\n        dim,\n        num_heads,\n        window_size=7,\n        shift_size=0,\n        mlp_ratio=4.0,\n        qkv_bias=True,\n        qk_scale=None,\n        drop=0.0,\n        attn_drop=0.0,\n        drop_path=0.0,\n        act_layer=nn.GELU,\n        norm_layer=nn.LayerNorm,\n    ):\n        super().__init__()\n        self.dim = dim\n        self.num_heads = num_heads\n        self.window_size = window_size\n        self.shift_size = shift_size\n        self.mlp_ratio = mlp_ratio\n        assert 0 <= self.shift_size < self.window_size, \"shift_size must in 0-window_size\"\n\n        self.norm1 = norm_layer(dim)\n        self.attn = WindowAttention(\n            dim,\n            window_size=to_2tuple(self.window_size),\n            num_heads=num_heads,\n            qkv_bias=qkv_bias,\n            qk_scale=qk_scale,\n            attn_drop=attn_drop,\n            proj_drop=drop,\n        )\n\n        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()\n        self.norm2 = norm_layer(dim)\n        mlp_hidden_dim = int(dim * mlp_ratio)\n        self.mlp = Mlp(\n            in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop\n        )\n\n        self.H = None\n        self.W = None\n\n    def forward(self, x, mask_matrix):\n        \"\"\"Forward function.\n        Args:\n            x: Input feature, tensor size (B, H*W, C).\n            H, W: Spatial resolution of the input feature.\n            mask_matrix: Attention mask for cyclic shift.\n        \"\"\"\n        B, L, C = x.shape\n        H, W = self.H, self.W\n        assert L == H * W, \"input feature has wrong size\"\n\n        shortcut = x\n        x = self.norm1(x)\n        x = x.view(B, H, W, C)\n\n        # pad feature maps to multiples of window size\n        pad_l = pad_t = 0\n        pad_r = (self.window_size - W % self.window_size) % self.window_size\n        pad_b = (self.window_size - H % self.window_size) % self.window_size\n        x = F.pad(x, (0, 0, pad_l, pad_r, pad_t, pad_b))\n        _, Hp, Wp, _ = x.shape\n\n        # cyclic shift\n        if self.shift_size > 0:\n            shifted_x = torch.roll(x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))\n            attn_mask = mask_matrix\n        else:\n            shifted_x = x\n            attn_mask = None\n\n        # partition windows\n        x_windows = window_partition(\n            shifted_x, self.window_size\n        )  # nW*B, window_size, window_size, C\n        x_windows = x_windows.view(\n            -1, self.window_size * self.window_size, C\n        )  # nW*B, window_size*window_size, C\n\n        # W-MSA/SW-MSA\n        
attn_windows = self.attn(x_windows, mask=attn_mask)  # nW*B, window_size*window_size, C\n\n        # merge windows\n        attn_windows = attn_windows.view(-1, self.window_size, self.window_size, C)\n        shifted_x = window_reverse(attn_windows, self.window_size, Hp, Wp)  # B H' W' C\n\n        # reverse cyclic shift\n        if self.shift_size > 0:\n            x = torch.roll(shifted_x, shifts=(self.shift_size, self.shift_size), dims=(1, 2))\n        else:\n            x = shifted_x\n\n        if pad_r > 0 or pad_b > 0:\n            x = x[:, :H, :W, :].contiguous()\n\n        x = x.view(B, H * W, C)\n\n        # FFN\n        x = shortcut + self.drop_path(x)\n        x = x + self.drop_path(self.mlp(self.norm2(x)))\n\n        return x\n\n\nclass PatchMerging(nn.Module):\n    \"\"\"Patch Merging Layer\n    Args:\n        dim (int): Number of input channels.\n        norm_layer (nn.Module, optional): Normalization layer.  Default: nn.LayerNorm\n    \"\"\"\n\n    def __init__(self, dim, norm_layer=nn.LayerNorm):\n        super().__init__()\n        self.dim = dim\n        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)\n        self.norm = norm_layer(4 * dim)\n\n    def forward(self, x, H, W):\n        \"\"\"Forward function.\n        Args:\n            x: Input feature, tensor size (B, H*W, C).\n            H, W: Spatial resolution of the input feature.\n        \"\"\"\n        B, L, C = x.shape\n        assert L == H * W, \"input feature has wrong size\"\n\n        x = x.view(B, H, W, C)\n\n        # padding\n        pad_input = (H % 2 == 1) or (W % 2 == 1)\n        if pad_input:\n            x = F.pad(x, (0, 0, 0, W % 2, 0, H % 2))\n\n        x0 = x[:, 0::2, 0::2, :]  # B H/2 W/2 C\n        x1 = x[:, 1::2, 0::2, :]  # B H/2 W/2 C\n        x2 = x[:, 0::2, 1::2, :]  # B H/2 W/2 C\n        x3 = x[:, 1::2, 1::2, :]  # B H/2 W/2 C\n        x = torch.cat([x0, x1, x2, x3], -1)  # B H/2 W/2 4*C\n        x = x.view(B, -1, 4 * C)  # B H/2*W/2 4*C\n\n        x = self.norm(x)\n        x = self.reduction(x)\n\n        return x\n\n\nclass BasicLayer(nn.Module):\n    \"\"\"A basic Swin Transformer layer for one stage.\n    Args:\n        dim (int): Number of feature channels\n        depth (int): Depths of this stage.\n        num_heads (int): Number of attention head.\n        window_size (int): Local window size. Default: 7.\n        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4.\n        qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True\n        qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set.\n        drop (float, optional): Dropout rate. Default: 0.0\n        attn_drop (float, optional): Attention dropout rate. Default: 0.0\n        drop_path (float | tuple[float], optional): Stochastic depth rate. Default: 0.0\n        norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm\n        downsample (nn.Module | None, optional): Downsample layer at the end of the layer. Default: None\n        use_checkpoint (bool): Whether to use checkpointing to save memory. 
Default: False.\n    \"\"\"\n\n    def __init__(\n        self,\n        dim,\n        depth,\n        num_heads,\n        window_size=7,\n        mlp_ratio=4.0,\n        qkv_bias=True,\n        qk_scale=None,\n        drop=0.0,\n        attn_drop=0.0,\n        drop_path=0.0,\n        norm_layer=nn.LayerNorm,\n        downsample=None,\n        use_checkpoint=False,\n    ):\n        super().__init__()\n        self.window_size = window_size\n        self.shift_size = window_size // 2\n        self.depth = depth\n        self.use_checkpoint = use_checkpoint\n\n        # build blocks\n        self.blocks = nn.ModuleList(\n            [\n                SwinTransformerBlock(\n                    dim=dim,\n                    num_heads=num_heads,\n                    window_size=window_size,\n                    shift_size=0 if (i % 2 == 0) else window_size // 2,\n                    mlp_ratio=mlp_ratio,\n                    qkv_bias=qkv_bias,\n                    qk_scale=qk_scale,\n                    drop=drop,\n                    attn_drop=attn_drop,\n                    drop_path=drop_path[i] if isinstance(drop_path, list) else drop_path,\n                    norm_layer=norm_layer,\n                )\n                for i in range(depth)\n            ]\n        )\n\n        # patch merging layer\n        if downsample is not None:\n            self.downsample = downsample(dim=dim, norm_layer=norm_layer)\n        else:\n            self.downsample = None\n\n    def forward(self, x, H, W):\n        \"\"\"Forward function.\n        Args:\n            x: Input feature, tensor size (B, H*W, C).\n            H, W: Spatial resolution of the input feature.\n        \"\"\"\n\n        # calculate attention mask for SW-MSA\n        Hp = int(np.ceil(H / self.window_size)) * self.window_size\n        Wp = int(np.ceil(W / self.window_size)) * self.window_size\n        img_mask = torch.zeros((1, Hp, Wp, 1), device=x.device)  # 1 Hp Wp 1\n        h_slices = (\n            slice(0, -self.window_size),\n            slice(-self.window_size, -self.shift_size),\n            slice(-self.shift_size, None),\n        )\n        w_slices = (\n            slice(0, -self.window_size),\n            slice(-self.window_size, -self.shift_size),\n            slice(-self.shift_size, None),\n        )\n        cnt = 0\n        for h in h_slices:\n            for w in w_slices:\n                img_mask[:, h, w, :] = cnt\n                cnt += 1\n\n        mask_windows = window_partition(\n            img_mask, self.window_size\n        )  # nW, window_size, window_size, 1\n        mask_windows = mask_windows.view(-1, self.window_size * self.window_size)\n        attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)\n        attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0)).masked_fill(\n            attn_mask == 0, float(0.0)\n        )\n\n        for blk in self.blocks:\n            blk.H, blk.W = H, W\n            if self.use_checkpoint:\n                x = checkpoint.checkpoint(blk, x, attn_mask)\n            else:\n                x = blk(x, attn_mask)\n        if self.downsample is not None:\n            x_down = self.downsample(x, H, W)\n            Wh, Ww = (H + 1) // 2, (W + 1) // 2\n            return x, H, W, x_down, Wh, Ww\n        else:\n            return x, H, W, x, H, W\n\n\nclass PatchEmbed(nn.Module):\n    \"\"\"Image to Patch Embedding\n    Args:\n        patch_size (int): Patch token size. Default: 4.\n        in_chans (int): Number of input image channels. 
Default: 3.\n        embed_dim (int): Number of linear projection output channels. Default: 96.\n        norm_layer (nn.Module, optional): Normalization layer. Default: None\n    \"\"\"\n\n    def __init__(self, patch_size=4, in_chans=3, embed_dim=96, norm_layer=None):\n        super().__init__()\n        patch_size = to_2tuple(patch_size)\n        self.patch_size = patch_size\n\n        self.in_chans = in_chans\n        self.embed_dim = embed_dim\n\n        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)\n        if norm_layer is not None:\n            self.norm = norm_layer(embed_dim)\n        else:\n            self.norm = None\n\n    def forward(self, x):\n        \"\"\"Forward function.\"\"\"\n        # padding\n        _, _, H, W = x.size()\n        if W % self.patch_size[1] != 0:\n            x = F.pad(x, (0, self.patch_size[1] - W % self.patch_size[1]))\n        if H % self.patch_size[0] != 0:\n            x = F.pad(x, (0, 0, 0, self.patch_size[0] - H % self.patch_size[0]))\n\n        x = self.proj(x)  # B C Wh Ww\n        if self.norm is not None:\n            Wh, Ww = x.size(2), x.size(3)\n            x = x.flatten(2).transpose(1, 2)\n            x = self.norm(x)\n            x = x.transpose(1, 2).view(-1, self.embed_dim, Wh, Ww)\n\n        return x\n\n\nclass SwinTransformer(nn.Module):\n    \"\"\"Swin Transformer backbone.\n        A PyTorch impl of : `Swin Transformer: Hierarchical Vision Transformer using Shifted Windows`  -\n          https://arxiv.org/pdf/2103.14030\n    Args:\n        pretrain_img_size (int): Input image size for training the pretrained model,\n            used in absolute postion embedding. Default 224.\n        patch_size (int | tuple(int)): Patch size. Default: 4.\n        in_chans (int): Number of input image channels. Default: 3.\n        embed_dim (int): Number of linear projection output channels. Default: 96.\n        depths (tuple[int]): Depths of each Swin Transformer stage.\n        num_heads (tuple[int]): Number of attention head of each stage.\n        window_size (int): Window size. Default: 7.\n        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4.\n        qkv_bias (bool): If True, add a learnable bias to query, key, value. Default: True\n        qk_scale (float): Override default qk scale of head_dim ** -0.5 if set.\n        drop_rate (float): Dropout rate.\n        attn_drop_rate (float): Attention dropout rate. Default: 0.\n        drop_path_rate (float): Stochastic depth rate. Default: 0.2.\n        norm_layer (nn.Module): Normalization layer. Default: nn.LayerNorm.\n        ape (bool): If True, add absolute position embedding to the patch embedding. Default: False.\n        patch_norm (bool): If True, add normalization after patch embedding. Default: True.\n        out_indices (Sequence[int]): Output from which stages.\n        frozen_stages (int): Stages to be frozen (stop grad and set eval mode).\n            -1 means not freezing any parameters.\n        use_checkpoint (bool): Whether to use checkpointing to save memory. 
Default: False.\n    \"\"\"\n\n    def __init__(\n        self,\n        pretrain_img_size=224,\n        patch_size=4,\n        in_chans=3,\n        embed_dim=96,\n        depths=[2, 2, 6, 2],\n        num_heads=[3, 6, 12, 24],\n        window_size=7,\n        mlp_ratio=4.0,\n        qkv_bias=True,\n        qk_scale=None,\n        drop_rate=0.0,\n        attn_drop_rate=0.0,\n        drop_path_rate=0.2,\n        norm_layer=nn.LayerNorm,\n        ape=False,\n        patch_norm=True,\n        out_indices=(0, 1, 2, 3),\n        frozen_stages=-1,\n        use_checkpoint=False,\n    ):\n        super().__init__()\n\n        self.pretrain_img_size = pretrain_img_size\n        self.num_layers = len(depths)\n        self.embed_dim = embed_dim\n        self.ape = ape\n        self.patch_norm = patch_norm\n        self.out_indices = out_indices\n        self.frozen_stages = frozen_stages\n\n        # split image into non-overlapping patches\n        self.patch_embed = PatchEmbed(\n            patch_size=patch_size,\n            in_chans=in_chans,\n            embed_dim=embed_dim,\n            norm_layer=norm_layer if self.patch_norm else None,\n        )\n\n        # absolute position embedding\n        if self.ape:\n            pretrain_img_size = to_2tuple(pretrain_img_size)\n            patch_size = to_2tuple(patch_size)\n            patches_resolution = [\n                pretrain_img_size[0] // patch_size[0],\n                pretrain_img_size[1] // patch_size[1],\n            ]\n\n            self.absolute_pos_embed = nn.Parameter(\n                torch.zeros(1, embed_dim, patches_resolution[0], patches_resolution[1])\n            )\n            trunc_normal_(self.absolute_pos_embed, std=0.02)\n\n        self.pos_drop = nn.Dropout(p=drop_rate)\n\n        # stochastic depth\n        dpr = [\n            x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))\n        ]  # stochastic depth decay rule\n\n        # build layers\n        self.layers = nn.ModuleList()\n        for i_layer in range(self.num_layers):\n            layer = BasicLayer(\n                dim=int(embed_dim * 2 ** i_layer),\n                depth=depths[i_layer],\n                num_heads=num_heads[i_layer],\n                window_size=window_size,\n                mlp_ratio=mlp_ratio,\n                qkv_bias=qkv_bias,\n                qk_scale=qk_scale,\n                drop=drop_rate,\n                attn_drop=attn_drop_rate,\n                drop_path=dpr[sum(depths[:i_layer]) : sum(depths[: i_layer + 1])],\n                norm_layer=norm_layer,\n                downsample=PatchMerging if (i_layer < self.num_layers - 1) else None,\n                use_checkpoint=use_checkpoint,\n            )\n            self.layers.append(layer)\n\n        num_features = [int(embed_dim * 2 ** i) for i in range(self.num_layers)]\n        self.num_features = num_features\n\n        # add a norm layer for each output\n        for i_layer in out_indices:\n            layer = norm_layer(num_features[i_layer])\n            layer_name = f\"norm{i_layer}\"\n            self.add_module(layer_name, layer)\n\n        self._freeze_stages()\n\n    def _freeze_stages(self):\n        if self.frozen_stages >= 0:\n            self.patch_embed.eval()\n            for param in self.patch_embed.parameters():\n                param.requires_grad = False\n\n        if self.frozen_stages >= 1 and self.ape:\n            self.absolute_pos_embed.requires_grad = False\n\n        if self.frozen_stages >= 2:\n            self.pos_drop.eval()\n        
    for i in range(0, self.frozen_stages - 1):\n                m = self.layers[i]\n                m.eval()\n                for param in m.parameters():\n                    param.requires_grad = False\n\n    def init_weights(self, pretrained=None):\n        \"\"\"Initialize the weights in backbone.\n        Args:\n            pretrained (str, optional): Path to pre-trained weights.\n                Defaults to None.\n        \"\"\"\n\n        def _init_weights(m):\n            if isinstance(m, nn.Linear):\n                trunc_normal_(m.weight, std=0.02)\n                if isinstance(m, nn.Linear) and m.bias is not None:\n                    nn.init.constant_(m.bias, 0)\n            elif isinstance(m, nn.LayerNorm):\n                nn.init.constant_(m.bias, 0)\n                nn.init.constant_(m.weight, 1.0)\n\n        if isinstance(pretrained, str):\n            self.apply(_init_weights)\n            checkpoint = torch.load(pretrained, map_location='cpu')\n            print(f'\\nload pretrain weight from {pretrained} \\n') \n            self.load_state_dict(checkpoint['model'], strict=False)\n        elif pretrained is None:\n            self.apply(_init_weights)\n        else:\n            raise TypeError('pretrained must be a str or None')\n\n    def forward(self, x):\n        \"\"\"Forward function.\"\"\"\n        x = self.patch_embed(x)\n\n        Wh, Ww = x.size(2), x.size(3)\n        if self.ape:\n            # interpolate the position embedding to the corresponding size\n            absolute_pos_embed = F.interpolate(\n                self.absolute_pos_embed, size=(Wh, Ww), mode=\"bicubic\"\n            )\n            x = (x + absolute_pos_embed).flatten(2).transpose(1, 2)  # B Wh*Ww C\n        else:\n            x = x.flatten(2).transpose(1, 2)\n        x = self.pos_drop(x)\n\n        outs = {}\n        for i in range(self.num_layers):\n            layer = self.layers[i]\n            x_out, H, W, x, Wh, Ww = layer(x, Wh, Ww)\n\n            if i in self.out_indices:\n                norm_layer = getattr(self, f\"norm{i}\")\n                x_out = norm_layer(x_out)\n\n                out = x_out.view(-1, H, W, self.num_features[i]).permute(0, 3, 1, 2).contiguous()\n                outs[\"res{}\".format(i + 2)] = out\n\n        return outs\n\n    def train(self, mode=True):\n        \"\"\"Convert the model into training mode while keep layers freezed.\"\"\"\n        super(SwinTransformer, self).train(mode)\n        self._freeze_stages()\n\n\n@BACKBONE_REGISTRY.register()\nclass D2SwinTransformer(SwinTransformer, Backbone):\n    def __init__(self, cfg, input_shape):\n\n        pretrain_img_size = cfg.MODEL.SWIN.PRETRAIN_IMG_SIZE\n        patch_size = cfg.MODEL.SWIN.PATCH_SIZE\n        in_chans = 3\n        embed_dim = cfg.MODEL.SWIN.EMBED_DIM\n        depths = cfg.MODEL.SWIN.DEPTHS\n        num_heads = cfg.MODEL.SWIN.NUM_HEADS\n        window_size = cfg.MODEL.SWIN.WINDOW_SIZE\n        mlp_ratio = cfg.MODEL.SWIN.MLP_RATIO\n        qkv_bias = cfg.MODEL.SWIN.QKV_BIAS\n        qk_scale = cfg.MODEL.SWIN.QK_SCALE\n        drop_rate = cfg.MODEL.SWIN.DROP_RATE\n        attn_drop_rate = cfg.MODEL.SWIN.ATTN_DROP_RATE\n        drop_path_rate = cfg.MODEL.SWIN.DROP_PATH_RATE\n        norm_layer = nn.LayerNorm\n        ape = cfg.MODEL.SWIN.APE\n        patch_norm = cfg.MODEL.SWIN.PATCH_NORM\n        use_checkpoint = cfg.MODEL.SWIN.USE_CHECKPOINT\n        pretrained_weight = cfg.MODEL.SWIN.PRETRAINED_WEIGHT\n\n\n        super().__init__(\n            pretrain_img_size,\n            
patch_size,\n            in_chans,\n            embed_dim,\n            depths,\n            num_heads,\n            window_size,\n            mlp_ratio,\n            qkv_bias,\n            qk_scale,\n            drop_rate,\n            attn_drop_rate,\n            drop_path_rate,\n            norm_layer,\n            ape,\n            patch_norm,\n            use_checkpoint=use_checkpoint,\n        )\n        self.init_weights(pretrained_weight)\n        self._out_features = cfg.MODEL.SWIN.OUT_FEATURES\n\n        self._out_feature_strides = {\n            \"res2\": 4,\n            \"res3\": 8,\n            \"res4\": 16,\n            \"res5\": 32,\n        }\n        self._out_feature_channels = {\n            \"res2\": self.num_features[0],\n            \"res3\": self.num_features[1],\n            \"res4\": self.num_features[2],\n            \"res5\": self.num_features[3],\n        }\n\n    def forward(self, x):\n        \"\"\"\n        Args:\n            x: Tensor of shape (N,C,H,W). H, W must be a multiple of ``self.size_divisibility``.\n        Returns:\n            dict[str->Tensor]: names and the corresponding features\n        \"\"\"\n        assert (\n            x.dim() == 4\n        ), f\"SwinTransformer takes an input of shape (N, C, H, W). Got {x.shape} instead!\"\n        outputs = {}\n        y = super().forward(x)\n        for k in y.keys():\n            if k in self._out_features:\n                outputs[k] = y[k]\n        return outputs\n\n    def output_shape(self):\n        return {\n            name: ShapeSpec(\n                channels=self._out_feature_channels[name], stride=self._out_feature_strides[name]\n            )\n            for name in self._out_features\n        }\n\n    @property\n    def size_divisibility(self):\n        return 32\n\n\n"
  },
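  {
    "path": "thirdparty/GLEE/glee/backbone/example_build_swin.py",
    "content": "\"\"\"Illustrative sketch only (this file is a hypothetical addition, not part of GLEE).\n\nIt shows how the D2SwinTransformer registered in swin.py can be built through Detectron2's\nBACKBONE_REGISTRY once the GLEE config keys are attached to a default config. Import paths\nassume thirdparty/GLEE is on PYTHONPATH so the package resolves as `glee`; adjust as needed.\n\"\"\"\nimport torch\nfrom detectron2.config import get_cfg\nfrom detectron2.modeling import build_backbone\n\nfrom glee.config import add_glee_config\nfrom glee.config_deeplab import add_deeplab_config\nfrom glee.backbone import swin  # noqa: F401  (importing the module registers D2SwinTransformer)\n\n\ndef main():\n    cfg = get_cfg()\n    add_deeplab_config(cfg)\n    add_glee_config(cfg)  # adds MODEL.SWIN.* with Swin-Tiny defaults, among others\n    cfg.MODEL.BACKBONE.NAME = \"D2SwinTransformer\"\n\n    backbone = build_backbone(cfg)  # input channels come from len(cfg.MODEL.PIXEL_MEAN) == 3\n    backbone.eval()\n\n    # H and W must be multiples of backbone.size_divisibility (32).\n    image = torch.randn(1, 3, 224, 224)\n    with torch.no_grad():\n        feats = backbone(image)\n\n    # With the default OUT_FEATURES this prints res2..res5 at strides 4/8/16/32.\n    for name, feat in feats.items():\n        print(name, tuple(feat.shape))\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },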
  {
    "path": "thirdparty/GLEE/glee/backbone/vit.py",
    "content": "import logging\nimport math\nimport fvcore.nn.weight_init as weight_init\nimport torch\nimport torch.nn as nn\n\nfrom detectron2.layers import CNNBlockBase, Conv2d, get_norm\nfrom detectron2.modeling.backbone.fpn import _assert_strides_are_log2_contiguous\nimport torch.nn.functional as F\n\nfrom detectron2.modeling import BACKBONE_REGISTRY, Backbone, ShapeSpec\nfrom .utils import (\n    PatchEmbed,\n    add_decomposed_rel_pos,\n    get_abs_pos,\n    window_partition,\n    window_unpartition,\n)\nfrom functools import partial\nimport torch.utils.checkpoint as checkpoint\n\nlogger = logging.getLogger(__name__)\n\n\n__all__ = [\"ViT\", \"SimpleFeaturePyramid\", \"get_vit_lr_decay_rate\"]\n\n\nclass Attention(nn.Module):\n    \"\"\"Multi-head Attention block with relative position embeddings.\"\"\"\n\n    def __init__(\n        self,\n        dim,\n        num_heads=8,\n        qkv_bias=True,\n        use_rel_pos=False,\n        rel_pos_zero_init=True,\n        input_size=None,\n    ):\n        \"\"\"\n        Args:\n            dim (int): Number of input channels.\n            num_heads (int): Number of attention heads.\n            qkv_bias (bool:  If True, add a learnable bias to query, key, value.\n            rel_pos (bool): If True, add relative positional embeddings to the attention map.\n            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.\n            input_size (int or None): Input resolution for calculating the relative positional\n                parameter size.\n        \"\"\"\n        super().__init__()\n        self.num_heads = num_heads\n        head_dim = dim // num_heads\n        self.scale = head_dim**-0.5\n\n        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)\n        self.proj = nn.Linear(dim, dim)\n\n        self.use_rel_pos = use_rel_pos\n        if self.use_rel_pos:\n            # initialize relative positional embeddings\n            self.rel_pos_h = nn.Parameter(torch.zeros(2 * input_size[0] - 1, head_dim))\n            self.rel_pos_w = nn.Parameter(torch.zeros(2 * input_size[1] - 1, head_dim))\n\n            if not rel_pos_zero_init:\n                nn.init.trunc_normal_(self.rel_pos_h, std=0.02)\n                nn.init.trunc_normal_(self.rel_pos_w, std=0.02)\n\n    def forward(self, x):\n        B, H, W, _ = x.shape\n        # qkv with shape (3, B, nHead, H * W, C)\n        qkv = self.qkv(x).reshape(B, H * W, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)\n        # q, k, v with shape (B * nHead, H * W, C)\n        q, k, v = qkv.reshape(3, B * self.num_heads, H * W, -1).unbind(0)\n\n        with torch.backends.cuda.sdp_kernel(\n            enable_flash=True, enable_math=False, enable_mem_efficient=True\n        ):\n            x = F.scaled_dot_product_attention(q, k, v)\n        attn = (q * self.scale) @ k.transpose(-2, -1)\n\n        if self.use_rel_pos:\n            attn = add_decomposed_rel_pos(attn, q, self.rel_pos_h, self.rel_pos_w, (H, W), (H, W))\n\n        attn = attn.softmax(dim=-1)\n        x = (attn @ v).view(B, self.num_heads, H, W, -1).permute(0, 2, 3, 1, 4).reshape(B, H, W, -1)\n        x = self.proj(x)\n\n        return x\n\n\nclass ResBottleneckBlock(CNNBlockBase):\n    \"\"\"\n    The standard bottleneck residual block without the last activation layer.\n    It contains 3 conv layers with kernels 1x1, 3x3, 1x1.\n    \"\"\"\n\n    def __init__(\n        self,\n        in_channels,\n        out_channels,\n        bottleneck_channels,\n        norm=\"LN\",\n        act_layer=nn.GELU,\n    
):\n        \"\"\"\n        Args:\n            in_channels (int): Number of input channels.\n            out_channels (int): Number of output channels.\n            bottleneck_channels (int): number of output channels for the 3x3\n                \"bottleneck\" conv layers.\n            norm (str or callable): normalization for all conv layers.\n                See :func:`layers.get_norm` for supported format.\n            act_layer (callable): activation for all conv layers.\n        \"\"\"\n        super().__init__(in_channels, out_channels, 1)\n\n        self.conv1 = Conv2d(in_channels, bottleneck_channels, 1, bias=False)\n        self.norm1 = get_norm(norm, bottleneck_channels)\n        self.act1 = act_layer()\n\n        self.conv2 = Conv2d(\n            bottleneck_channels,\n            bottleneck_channels,\n            3,\n            padding=1,\n            bias=False,\n        )\n        self.norm2 = get_norm(norm, bottleneck_channels)\n        self.act2 = act_layer()\n\n        self.conv3 = Conv2d(bottleneck_channels, out_channels, 1, bias=False)\n        self.norm3 = get_norm(norm, out_channels)\n\n        for layer in [self.conv1, self.conv2, self.conv3]:\n            weight_init.c2_msra_fill(layer)\n        for layer in [self.norm1, self.norm2]:\n            layer.weight.data.fill_(1.0)\n            layer.bias.data.zero_()\n        # zero init last norm layer.\n        self.norm3.weight.data.zero_()\n        self.norm3.bias.data.zero_()\n\n    def forward(self, x):\n        out = x\n        for layer in self.children():\n            out = layer(out)\n\n        out = x + out\n        return out\n\n\nclass Block(nn.Module):\n    \"\"\"Transformer blocks with support of window attention and residual propagation blocks\"\"\"\n\n    def __init__(\n        self,\n        dim,\n        num_heads,\n        mlp_ratio=4.0,\n        qkv_bias=True,\n        drop_path=0.0,\n        norm_layer=nn.LayerNorm,\n        act_layer=nn.GELU,\n        use_rel_pos=False,\n        rel_pos_zero_init=True,\n        window_size=0,\n        use_residual_block=False,\n        input_size=None,\n    ):\n        \"\"\"\n        Args:\n            dim (int): Number of input channels.\n            num_heads (int): Number of attention heads in each ViT block.\n            mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.\n            qkv_bias (bool): If True, add a learnable bias to query, key, value.\n            drop_path (float): Stochastic depth rate.\n            norm_layer (nn.Module): Normalization layer.\n            act_layer (nn.Module): Activation layer.\n            use_rel_pos (bool): If True, add relative positional embeddings to the attention map.\n            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.\n            window_size (int): Window size for window attention blocks. 
If it equals 0, then not\n                use window attention.\n            use_residual_block (bool): If True, use a residual block after the MLP block.\n            input_size (int or None): Input resolution for calculating the relative positional\n                parameter size.\n        \"\"\"\n        super().__init__()\n        self.norm1 = norm_layer(dim)\n        self.attn = Attention(\n            dim,\n            num_heads=num_heads,\n            qkv_bias=qkv_bias,\n            use_rel_pos=use_rel_pos,\n            rel_pos_zero_init=rel_pos_zero_init,\n            input_size=input_size if window_size == 0 else (window_size, window_size),\n        )\n\n        from timm.models.layers import DropPath, Mlp\n\n        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()\n        self.norm2 = norm_layer(dim)\n        self.mlp = Mlp(in_features=dim, hidden_features=int(dim * mlp_ratio), act_layer=act_layer)\n\n        self.window_size = window_size\n\n        self.use_residual_block = use_residual_block\n        if use_residual_block:\n            # Use a residual block with bottleneck channel as dim // 2\n            self.residual = ResBottleneckBlock(\n                in_channels=dim,\n                out_channels=dim,\n                bottleneck_channels=dim // 2,\n                norm=\"LN\",\n                act_layer=act_layer,\n            )\n\n    def forward(self, x):\n        shortcut = x\n        x = self.norm1(x)\n        # Window partition\n        if self.window_size > 0:\n            H, W = x.shape[1], x.shape[2]\n            x, pad_hw = window_partition(x, self.window_size)\n        x = self.attn(x)\n        # Reverse window partition\n        if self.window_size > 0:\n            x = window_unpartition(x, self.window_size, pad_hw, (H, W))\n\n        x = shortcut + self.drop_path(x)\n        x = x + self.drop_path(self.mlp(self.norm2(x)))\n\n        if self.use_residual_block:\n            x = self.residual(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)\n\n        return x\n\n\nclass ViT(Backbone):\n    \"\"\"\n    This module implements Vision Transformer (ViT) backbone in :paper:`vitdet`.\n    \"Exploring Plain Vision Transformer Backbones for Object Detection\",\n    https://arxiv.org/abs/2203.16527\n    \"\"\"\n\n    def __init__(\n        self,\n        img_size=1024,\n        patch_size=16,\n        in_chans=3,\n        embed_dim=768,\n        depth=12,\n        num_heads=12,\n        mlp_ratio=4.0,\n        qkv_bias=True,\n        drop_path_rate=0.0,\n        norm_layer=nn.LayerNorm,\n        act_layer=nn.GELU,\n        use_abs_pos=True,\n        use_rel_pos=False,\n        rel_pos_zero_init=True,\n        window_size=0,\n        window_block_indexes=(),\n        residual_block_indexes=(),\n        use_act_checkpoint=False,\n        pretrain_img_size=224,\n        pretrain_use_cls_token=True,\n        out_feature=\"last_feat\",\n    ):\n        \"\"\"\n        Args:\n            img_size (int): Input image size.\n            patch_size (int): Patch size.\n            in_chans (int): Number of input image channels.\n            embed_dim (int): Patch embedding dimension.\n            depth (int): Depth of ViT.\n            num_heads (int): Number of attention heads in each ViT block.\n            mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.\n            qkv_bias (bool): If True, add a learnable bias to query, key, value.\n            drop_path_rate (float): Stochastic depth rate.\n            norm_layer (nn.Module): 
Normalization layer.\n            act_layer (nn.Module): Activation layer.\n            use_abs_pos (bool): If True, use absolute positional embeddings.\n            use_rel_pos (bool): If True, add relative positional embeddings to the attention map.\n            rel_pos_zero_init (bool): If True, zero initialize relative positional parameters.\n            window_size (int): Window size for window attention blocks.\n            window_block_indexes (list): Indexes for blocks using window attention.\n            residual_block_indexes (list): Indexes for blocks using conv propagation.\n            use_act_checkpoint (bool): If True, use activation checkpointing.\n            pretrain_img_size (int): input image size for pretraining models.\n            pretrain_use_cls_token (bool): If True, pretrainig models use class token.\n            out_feature (str): name of the feature from the last block.\n        \"\"\"\n        super().__init__()\n        self.pretrain_use_cls_token = pretrain_use_cls_token\n\n        self.patch_embed = PatchEmbed(\n            kernel_size=(patch_size, patch_size),\n            stride=(patch_size, patch_size),\n            in_chans=in_chans,\n            embed_dim=embed_dim,\n        )\n\n        if use_abs_pos:\n            # Initialize absolute positional embedding with pretrain image size.\n            num_patches = (pretrain_img_size // patch_size) * (pretrain_img_size // patch_size)\n            num_positions = (num_patches + 1) if pretrain_use_cls_token else num_patches\n            self.pos_embed = nn.Parameter(torch.zeros(1, num_positions, embed_dim))\n        else:\n            self.pos_embed = None\n\n        # stochastic depth decay rule\n        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]\n\n        self.blocks = nn.ModuleList()\n        for i in range(depth):\n            block = Block(\n                dim=embed_dim,\n                num_heads=num_heads,\n                mlp_ratio=mlp_ratio,\n                qkv_bias=qkv_bias,\n                drop_path=dpr[i],\n                norm_layer=norm_layer,\n                act_layer=act_layer,\n                use_rel_pos=use_rel_pos,\n                rel_pos_zero_init=rel_pos_zero_init,\n                window_size=window_size if i in window_block_indexes else 0,\n                use_residual_block=i in residual_block_indexes,\n                input_size=(img_size // patch_size, img_size // patch_size),\n            )\n            if use_act_checkpoint:\n                # TODO: use torch.utils.checkpoint\n                from fairscale.nn.checkpoint import checkpoint_wrapper\n\n                block = checkpoint_wrapper(block)\n            self.blocks.append(block)\n\n        self._out_feature_channels = {out_feature: embed_dim}\n        self._out_feature_strides = {out_feature: patch_size}\n        self._out_features = [out_feature]\n\n        if self.pos_embed is not None:\n            nn.init.trunc_normal_(self.pos_embed, std=0.02)\n\n        # In our method, we don't use backbone feature with stride 4\n        self.fpn1 = nn.Sequential(\n            nn.ConvTranspose2d(embed_dim, embed_dim // 2, kernel_size=2, stride=2),\n        )\n        self.fpn2 = nn.Identity() \n        self.fpn3 = nn.MaxPool2d(kernel_size=2, stride=2)\n\n        self.apply(self._init_weights)\n\n    def _init_weights(self, m):\n        if isinstance(m, nn.Linear):\n            nn.init.trunc_normal_(m.weight, std=0.02)\n            if isinstance(m, nn.Linear) and m.bias is not None:\n                
nn.init.constant_(m.bias, 0)\n        elif isinstance(m, nn.LayerNorm):\n            nn.init.constant_(m.bias, 0)\n            nn.init.constant_(m.weight, 1.0)\n\n    def forward(self, x):\n        x = self.patch_embed(x)\n        if self.pos_embed is not None:\n            x = x + get_abs_pos(\n                self.pos_embed, self.pretrain_use_cls_token, (x.shape[1], x.shape[2])\n            )\n\n        for blk in self.blocks:\n            x = blk(x)\n        xp = x.permute(0, 3, 1, 2) # (b, h, w, c) --> (b, c, h, w)\n        \n        features = []\n        ops = [self.fpn1, self.fpn2, self.fpn3]\n        for i in range(len(ops)):\n            features.append(ops[i](xp))\n        rets = {\"res{}\".format(u + 3): v for (u,v) in enumerate(features)}\n\n        return rets\n\n\n\n@BACKBONE_REGISTRY.register()\nclass D2ViT(ViT, Backbone):\n    def __init__(self, cfg, input_shape):\n        use_checkpoint = cfg.MODEL.VIT.USE_CHECKPOINT\n        if cfg.MODEL.VIT.NAME == \"ViT-Base\":\n            embed_dim=768\n            depth=12\n            drop_path_rate=0.1\n            num_heads=12\n        elif cfg.MODEL.VIT.NAME == \"ViT-Large\":\n            embed_dim=1024\n            depth=24\n            drop_path_rate=0.4\n            num_heads=16\n        elif cfg.MODEL.VIT.NAME == \"ViT-huge\":\n            embed_dim=1280\n            depth=32\n            drop_path_rate=0.5\n            num_heads=16\n        else:\n            raise ValueError(\"Unsupported ViT name\")\n        super().__init__(\n            img_size=1024,\n            patch_size=16,\n            in_chans=input_shape.channels,\n            embed_dim=embed_dim,\n            depth=depth,\n            num_heads=num_heads,\n            drop_path_rate=drop_path_rate,\n            window_size=14,\n            mlp_ratio=4,\n            qkv_bias=True,\n            norm_layer=partial(nn.LayerNorm, eps=1e-6),\n            window_block_indexes=[\n                # 2, 5, 8 11 for global attention\n                0,\n                1,\n                3,\n                4,\n                6,\n                7,\n                9,\n                10,\n            ],\n            residual_block_indexes=[],\n            use_rel_pos=True,\n            out_feature=\"last_feat\",\n            use_act_checkpoint=use_checkpoint)\n\n        self._out_features = cfg.MODEL.VIT.OUT_FEATURES\n\n        self._out_feature_strides = {\n            \"res3\": 8,\n            \"res4\": 16,\n            \"res5\": 32,\n        }\n        self._out_feature_channels = {\n            \"res3\": embed_dim // 2,\n            \"res4\": embed_dim,\n            \"res5\": embed_dim,\n        }\n\n    def forward(self, x):\n        \"\"\"\n        Args:\n            x: Tensor of shape (N,C,H,W). H, W must be a multiple of ``self.size_divisibility``.\n        Returns:\n            dict[str->Tensor]: names and the corresponding features\n        \"\"\"\n        assert (\n            x.dim() == 4\n        ), f\"SwinTransformer takes an input of shape (N, C, H, W). 
Got {x.shape} instead!\"\n        outputs = {}\n        y = super().forward(x)\n        for k in y.keys():\n            if k in self._out_features:\n                outputs[k] = y[k]\n        return outputs\n\n    def output_shape(self):\n        return {\n            name: ShapeSpec(\n                channels=self._out_feature_channels[name], stride=self._out_feature_strides[name]\n            )\n            for name in self._out_features\n        }\n\n    @property\n    def size_divisibility(self):\n        return 32"
  },
  {
    "path": "thirdparty/GLEE/glee/backbone/vit_utils.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved\nimport math\nimport numpy as np\nfrom scipy import interpolate\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n__all__ = [\n    \"window_partition\",\n    \"window_unpartition\",\n    \"add_decomposed_rel_pos\",\n    \"get_abs_pos\",\n    \"PatchEmbed\",\n]\n\n\ndef window_partition(x, window_size):\n    \"\"\"\n    Partition into non-overlapping windows with padding if needed.\n    Args:\n        x (tensor): input tokens with [B, H, W, C].\n        window_size (int): window size.\n\n    Returns:\n        windows: windows after partition with [B * num_windows, window_size, window_size, C].\n        (Hp, Wp): padded height and width before partition\n    \"\"\"\n    B, H, W, C = x.shape\n\n    pad_h = (window_size - H % window_size) % window_size\n    pad_w = (window_size - W % window_size) % window_size\n    if pad_h > 0 or pad_w > 0:\n        x = F.pad(x, (0, 0, 0, pad_w, 0, pad_h))\n    Hp, Wp = H + pad_h, W + pad_w\n\n    x = x.view(B, Hp // window_size, window_size, Wp // window_size, window_size, C)\n    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)\n    return windows, (Hp, Wp)\n\n\ndef window_unpartition(windows, window_size, pad_hw, hw):\n    \"\"\"\n    Window unpartition into original sequences and removing padding.\n    Args:\n        x (tensor): input tokens with [B * num_windows, window_size, window_size, C].\n        window_size (int): window size.\n        pad_hw (Tuple): padded height and width (Hp, Wp).\n        hw (Tuple): original height and width (H, W) before padding.\n\n    Returns:\n        x: unpartitioned sequences with [B, H, W, C].\n    \"\"\"\n    Hp, Wp = pad_hw\n    H, W = hw\n    B = windows.shape[0] // (Hp * Wp // window_size // window_size)\n    x = windows.view(B, Hp // window_size, Wp // window_size, window_size, window_size, -1)\n    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, Hp, Wp, -1)\n\n    if Hp > H or Wp > W:\n        x = x[:, :H, :W, :].contiguous()\n    return x\n\n\ndef get_rel_pos(q_size, k_size, rel_pos, interp_type):\n    \"\"\"\n    Get relative positional embeddings according to the relative positions of\n        query and key sizes.\n    Args:\n        q_size (int): size of query q.\n        k_size (int): size of key k.\n        rel_pos (Tensor): relative position embeddings (L, C).\n\n    Returns:\n        Extracted positional embeddings according to relative positions.\n    \"\"\"\n    max_rel_dist = int(2 * max(q_size, k_size) - 1)\n    # Interpolate rel pos if needed.\n    if rel_pos.shape[0] != max_rel_dist:\n        if interp_type == \"vitdet\":\n            # the vitdet impl: \n            # https://github.com/facebookresearch/detectron2/blob/96c752ce821a3340e27edd51c28a00665dd32a30/detectron2/modeling/backbone/utils.py#L77.\n\n            rel_pos_resized = F.interpolate(\n                rel_pos.reshape(1, rel_pos.shape[0], -1).permute(0, 2, 1),\n                size=max_rel_dist,\n                mode=\"linear\",\n            )\n            rel_pos_resized = rel_pos_resized.reshape(-1, max_rel_dist).permute(1, 0)\n        elif interp_type == \"beit\":\n            # steal from beit https://github.com/microsoft/unilm/tree/master/beit\n            # modified by Yuxin Fang\n\n            src_size = rel_pos.shape[0]\n            dst_size = max_rel_dist\n\n            q = 1.0903078\n            dis = []\n\n            cur = 1\n            for i in range(src_size // 2):\n   
             dis.append(cur)\n                cur += q ** (i + 1)\n\n            r_ids = [-_ for _ in reversed(dis)]\n            x = r_ids + [0] + dis\n            t = dst_size // 2.0\n            dx = np.arange(-t, t + 0.1, 1.0)\n\n            all_rel_pos_bias = []\n            for i in range(rel_pos.shape[1]):\n                # a hack from https://github.com/baaivision/EVA/issues/8,\n                # could also be used in fine-tuning but the performance haven't been tested.\n                z = rel_pos[:, i].view(src_size).cpu().float().detach().numpy()\n                f = interpolate.interp1d(x, z, kind='cubic', fill_value=\"extrapolate\")\n                all_rel_pos_bias.append(\n                    torch.Tensor(f(dx)).contiguous().view(-1, 1).to(rel_pos.device))\n            rel_pos_resized = torch.cat(all_rel_pos_bias, dim=-1)\n        else:\n            raise NotImplementedError()\n    else:\n        rel_pos_resized = rel_pos\n\n    # Scale the coords with short length if shapes for q and k are different.\n    q_coords = torch.arange(q_size)[:, None] * max(k_size / q_size, 1.0)\n    k_coords = torch.arange(k_size)[None, :] * max(q_size / k_size, 1.0)\n    relative_coords = (q_coords - k_coords) + (k_size - 1) * max(q_size / k_size, 1.0)\n\n    return rel_pos_resized[relative_coords.long()]\n\n\ndef add_decomposed_rel_pos(attn, q, rel_pos_h, rel_pos_w, q_size, k_size, interp_type):\n    \"\"\"\n    Calculate decomposed Relative Positional Embeddings from :paper:`mvitv2`.\n    https://github.com/facebookresearch/mvit/blob/19786631e330df9f3622e5402b4a419a263a2c80/mvit/models/attention.py   # noqa B950\n    Args:\n        attn (Tensor): attention map.\n        q (Tensor): query q in the attention layer with shape (B, q_h * q_w, C).\n        rel_pos_h (Tensor): relative position embeddings (Lh, C) for height axis.\n        rel_pos_w (Tensor): relative position embeddings (Lw, C) for width axis.\n        q_size (Tuple): spatial sequence size of query q with (q_h, q_w).\n        k_size (Tuple): spatial sequence size of key k with (k_h, k_w).\n\n    Returns:\n        attn (Tensor): attention map with added relative positional embeddings.\n    \"\"\"\n    q_h, q_w = q_size\n    k_h, k_w = k_size\n    Rh = get_rel_pos(q_h, k_h, rel_pos_h, interp_type)\n    Rw = get_rel_pos(q_w, k_w, rel_pos_w, interp_type)\n\n    B, _, dim = q.shape\n    r_q = q.reshape(B, q_h, q_w, dim)\n    rel_h = torch.einsum(\"bhwc,hkc->bhwk\", r_q, Rh)\n    rel_w = torch.einsum(\"bhwc,wkc->bhwk\", r_q, Rw)\n\n    attn = (\n        attn.view(B, q_h, q_w, k_h, k_w) + rel_h[:, :, :, :, None] + rel_w[:, :, :, None, :]\n    ).view(B, q_h * q_w, k_h * k_w)\n\n    return attn\n\n\ndef get_abs_pos(abs_pos, has_cls_token, hw):\n    \"\"\"\n    Calculate absolute positional embeddings. 
If needed, resize embeddings and remove cls_token\n        dimension for the original embeddings.\n    Args:\n        abs_pos (Tensor): absolute positional embeddings with (1, num_position, C).\n        has_cls_token (bool): If true, has 1 embedding in abs_pos for cls token.\n        hw (Tuple): size of input image tokens.\n\n    Returns:\n        Absolute positional embeddings after processing with shape (1, H, W, C)\n    \"\"\"\n    h, w = hw\n    if has_cls_token:\n        abs_pos = abs_pos[:, 1:]\n    xy_num = abs_pos.shape[1]\n    size = int(math.sqrt(xy_num))\n    assert size * size == xy_num\n\n    if size != h or size != w:\n        new_abs_pos = F.interpolate(\n            abs_pos.reshape(1, size, size, -1).permute(0, 3, 1, 2),\n            size=(h, w),\n            mode=\"bicubic\",\n            align_corners=False,\n        )\n\n        return new_abs_pos.permute(0, 2, 3, 1)\n    else:\n        return abs_pos.reshape(1, h, w, -1)\n\n\nclass PatchEmbed(nn.Module):\n    \"\"\"\n    Image to Patch Embedding.\n    \"\"\"\n\n    def __init__(\n        self, kernel_size=(16, 16), stride=(16, 16), padding=(0, 0), in_chans=3, embed_dim=768\n    ):\n        \"\"\"\n        Args:\n            kernel_size (Tuple): kernel size of the projection layer.\n            stride (Tuple): stride of the projection layer.\n            padding (Tuple): padding size of the projection layer.\n            in_chans (int): Number of input image channels.\n            embed_dim (int):  embed_dim (int): Patch embedding dimension.\n        \"\"\"\n        super().__init__()\n\n        self.proj = nn.Conv2d(\n            in_chans, embed_dim, kernel_size=kernel_size, stride=stride, padding=padding\n        )\n\n    def forward(self, x):\n        x = self.proj(x)\n        # B C H W -> B H W C\n        x = x.permute(0, 2, 3, 1)\n        return x\n"
  },
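  {
    "path": "thirdparty/GLEE/glee/backbone/example_window_partition.py",
    "content": "\"\"\"Illustrative sketch only (this file is a hypothetical addition, not part of GLEE).\n\nIt exercises the window helpers defined in vit_utils.py: a feature map whose width is not a\nmultiple of the window size is padded and split into windows, then restored exactly by\nwindow_unpartition. The import path assumes thirdparty/GLEE is on PYTHONPATH.\n\"\"\"\nimport torch\n\nfrom glee.backbone.vit_utils import window_partition, window_unpartition\n\n\ndef main():\n    B, H, W, C = 2, 21, 30, 8  # W is deliberately not a multiple of the window size\n    window_size = 7\n    x = torch.randn(B, H, W, C)\n\n    windows, (Hp, Wp) = window_partition(x, window_size)\n    # Padded to (21, 35), i.e. 3 * 5 = 15 windows per image, 30 in total.\n    print(windows.shape, (Hp, Wp))  # torch.Size([30, 7, 7, 8]) (21, 35)\n\n    restored = window_unpartition(windows, window_size, (Hp, Wp), (H, W))\n    assert torch.equal(restored, x)  # padding is stripped and the original tokens are recovered\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },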
  {
    "path": "thirdparty/GLEE/glee/config.py",
    "content": "# -*- coding: utf-8 -*-\nfrom detectron2.config import CfgNode as CN\ndef add_glee_config(cfg):\n    \"\"\"\n    Add config for DETR.\n    \"\"\"\n    \n    cfg.FIND_UNUSED_PARAMETERS = True\n    cfg.MODEL.MAX_CATEGORY_LEN = 100\n    cfg.MODEL.PSEUDO_VIDEO = False\n    cfg.MODEL.FREEZE_WHOLE = False\n    cfg.MODEL.CONTRAS_MEAN = False\n    cfg.MODEL.CROSS_TRACK = False\n    cfg.MODEL.TRACK_VERSION = 'v3'\n    \n    cfg.INPUT.SAMPLING_FRAME_NUM = 1\n    cfg.INPUT.SAMPLING_FRAME_RANGE = 10\n    cfg.INPUT.SAMPLING_INTERVAL = 1\n    cfg.INPUT.SAMPLING_FRAME_SHUFFLE = False\n    cfg.INPUT.AUGMENTATIONS = [] # \"brightness\", \"contrast\", \"saturation\", \"rotation\"\n    cfg.INPUT.DATASET_MAPPER_NAME = None\n\n    cfg.DATALOADER.DATASET_RATIO = [1, 1]\n    cfg.DATALOADER.USE_DIFF_BS_SIZE = True\n    cfg.DATALOADER.DATASET_BS = [2, 2]\n    cfg.DATALOADER.DATASET_FILTERS = [True, True]\n    cfg.DATALOADER.USE_RFS = [False, False]\n    cfg.DATALOADER.MULTI_DATASET_GROUPING = True\n    cfg.DATALOADER.DATASET_ANN = ['image']\n\n\n    cfg.INPUT.SIZE_DIVISIBILITY = -1\n\n    cfg.DATALOADER.DATASET_RATIO = [1, 1]\n    cfg.DATALOADER.USE_DIFF_BS_SIZE = True\n    cfg.DATALOADER.DATASET_BS = [2, 2]\n    cfg.DATALOADER.USE_RFS = [False, False]\n    cfg.DATALOADER.MULTI_DATASET_GROUPING = True\n    cfg.DATALOADER.DATASET_ANN = ['box', 'box']\n\n    # Allow different datasets to use different input resolutions\n    cfg.INPUT.MIN_SIZE_TRAIN_MULTI = [(480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800), (320, 352, 392, 416, 448, 480, 512, 544, 576, 608, 640)]\n    cfg.INPUT.MAX_SIZE_TRAIN_MULTI = [1333, 768]\n\n\n    # MaskDINO model config\n    cfg.MODEL.MaskDINO = CN()\n    cfg.MODEL.MaskDINO.LEARN_TGT = False\n\n    # loss\n    cfg.MODEL.MaskDINO.PANO_BOX_LOSS = False\n    cfg.MODEL.MaskDINO.SEMANTIC_CE_LOSS = False\n    cfg.MODEL.MaskDINO.DEEP_SUPERVISION = True\n    cfg.MODEL.MaskDINO.NO_OBJECT_WEIGHT = 0.1\n    cfg.MODEL.MaskDINO.CLASS_WEIGHT = 4.0\n    cfg.MODEL.MaskDINO.DICE_WEIGHT = 5.0\n    cfg.MODEL.MaskDINO.MASK_WEIGHT = 5.0\n    cfg.MODEL.MaskDINO.BOX_WEIGHT = 5.\n    cfg.MODEL.MaskDINO.GIOU_WEIGHT = 2.\n\n    # cost weight\n    cfg.MODEL.MaskDINO.COST_CLASS_WEIGHT = 4.0\n    cfg.MODEL.MaskDINO.COST_DICE_WEIGHT = 5.0\n    cfg.MODEL.MaskDINO.COST_MASK_WEIGHT = 5.0\n    cfg.MODEL.MaskDINO.COST_BOX_WEIGHT = 5.\n    cfg.MODEL.MaskDINO.COST_GIOU_WEIGHT = 2.\n\n    # transformer config\n    cfg.MODEL.MaskDINO.NHEADS = 8\n    cfg.MODEL.MaskDINO.DROPOUT = 0.1\n    cfg.MODEL.MaskDINO.DIM_FEEDFORWARD = 2048\n    cfg.MODEL.MaskDINO.ENC_LAYERS = 0\n    cfg.MODEL.MaskDINO.DEC_LAYERS = 6\n    cfg.MODEL.MaskDINO.INITIAL_PRED = True\n    cfg.MODEL.MaskDINO.PRE_NORM = False\n    cfg.MODEL.MaskDINO.BOX_LOSS = True\n    cfg.MODEL.MaskDINO.HIDDEN_DIM = 256\n    cfg.MODEL.MaskDINO.NUM_OBJECT_QUERIES = 100\n\n    cfg.MODEL.MaskDINO.ENFORCE_INPUT_PROJ = False\n    cfg.MODEL.MaskDINO.TWO_STAGE = True\n    cfg.MODEL.MaskDINO.INITIALIZE_BOX_TYPE = 'no'  # ['no', 'bitmask', 'mask2box']\n    cfg.MODEL.MaskDINO.DN=\"seg\"\n    cfg.MODEL.MaskDINO.DN_NOISE_SCALE=0.4\n    cfg.MODEL.MaskDINO.DN_NUM=100\n    cfg.MODEL.MaskDINO.PRED_CONV=False\n\n    cfg.MODEL.MaskDINO.EVAL_FLAG = 1\n\n    # MSDeformAttn encoder configs\n    cfg.MODEL.SEM_SEG_HEAD.DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES = [\"res3\", \"res4\", \"res5\"]\n    cfg.MODEL.SEM_SEG_HEAD.DEFORMABLE_TRANSFORMER_ENCODER_N_POINTS = 4\n    cfg.MODEL.SEM_SEG_HEAD.DEFORMABLE_TRANSFORMER_ENCODER_N_HEADS = 8\n    cfg.MODEL.SEM_SEG_HEAD.DIM_FEEDFORWARD = 
2048\n    cfg.MODEL.SEM_SEG_HEAD.NUM_FEATURE_LEVELS = 3\n    cfg.MODEL.SEM_SEG_HEAD.TOTAL_NUM_FEATURE_LEVELS = 4\n    cfg.MODEL.SEM_SEG_HEAD.FEATURE_ORDER = 'high2low'  # ['low2high', 'high2low'] high2low: from high level to low level\n\n    #####################\n\n    # MaskDINO inference config\n    cfg.MODEL.MaskDINO.TEST = CN()\n    cfg.MODEL.MaskDINO.TEST.TEST_FOUCUS_ON_BOX = False\n    cfg.MODEL.MaskDINO.TEST.SEMANTIC_ON = True\n    cfg.MODEL.MaskDINO.TEST.INSTANCE_ON = False\n    cfg.MODEL.MaskDINO.TEST.PANOPTIC_ON = False\n    cfg.MODEL.MaskDINO.TEST.OBJECT_MASK_THRESHOLD = 0.0\n    cfg.MODEL.MaskDINO.TEST.OVERLAP_THRESHOLD = 0.0\n    cfg.MODEL.MaskDINO.TEST.SEM_SEG_POSTPROCESSING_BEFORE_INFERENCE = False\n    cfg.MODEL.MaskDINO.TEST.PANO_TRANSFORM_EVAL = True\n    cfg.MODEL.MaskDINO.TEST.PANO_TEMPERATURE = 0.06\n    # cfg.MODEL.MaskDINO.TEST.EVAL_FLAG = 1\n\n    # Sometimes `backbone.size_divisibility` is set to 0 for some backbone (e.g. ResNet)\n    # you can use this config to override\n    cfg.MODEL.MaskDINO.SIZE_DIVISIBILITY = 32\n\n    # pixel decoder config\n    cfg.MODEL.SEM_SEG_HEAD.MASK_DIM = 256\n    # adding transformer in pixel decoder\n    cfg.MODEL.SEM_SEG_HEAD.TRANSFORMER_ENC_LAYERS = 0\n    # pixel decoder\n    cfg.MODEL.SEM_SEG_HEAD.PIXEL_DECODER_NAME = \"MaskDINOEncoder\"\n\n    # transformer module\n    cfg.MODEL.MaskDINO.TRANSFORMER_DECODER_NAME = \"MaskDINODecoder\"\n\n    # LSJ aug\n    cfg.INPUT.IMAGE_SIZE = 1024\n    cfg.INPUT.MIN_SCALE = 0.1\n    cfg.INPUT.MAX_SCALE = 2.0\n\n    # point loss configs\n    # Number of points sampled during training for a mask point head.\n    cfg.MODEL.MaskDINO.TRAIN_NUM_POINTS = 112 * 112\n    # Oversampling parameter for PointRend point sampling during training. Parameter `k` in the\n    # original paper.\n    cfg.MODEL.MaskDINO.OVERSAMPLE_RATIO = 3.0\n    # Importance sampling parameter for PointRend point sampling during training. Parametr `beta` in\n    # the original paper.\n    cfg.MODEL.MaskDINO.IMPORTANCE_SAMPLE_RATIO = 0.75\n\n\n\n\n    cfg.MODEL.DIM_PROJ = 256\n    cfg.MODEL.VISUAL_PROMPT = False\n    cfg.MODEL.TEXT = CN()\n    cfg.MODEL.TEXT.ARCH = 'vlpencoder'\n    cfg.MODEL.TEXT.NAME= 'transformer'\n    cfg.MODEL.TEXT.TOKENIZER= 'clip'\n    cfg.MODEL.TEXT.CONTEXT_LENGTH= 77 # 77\n    cfg.MODEL.TEXT.WIDTH= 512\n    cfg.MODEL.TEXT.HEADS= 8\n    cfg.MODEL.TEXT.LAYERS= 12 # 6\n    cfg.MODEL.TEXT.AUTOGRESSIVE= True\n\n\n\n    cfg.MODEL.LANGUAGE_BACKBONE = CN()\n    cfg.MODEL.LANGUAGE_BACKBONE.USE_CHECKPOINT = False\n    cfg.MODEL.LANGUAGE_BACKBONE.TOKENIZER_TYPE = \"bert-base-uncased\"\n    cfg.MODEL.LANGUAGE_BACKBONE.MODEL_TYPE = \"bert-base-uncased\"\n    cfg.MODEL.LANGUAGE_BACKBONE.LANG_DIM = 768\n    cfg.MODEL.LANGUAGE_BACKBONE.MAX_QUERY_LEN = 77 # max length of the tokenized captions. 
\n    cfg.MODEL.LANGUAGE_BACKBONE.N_LAYERS = 1\n    # cfg.MODEL.LANGUAGE_BACKBONE.UNUSED_TOKEN = 106\n    # cfg.MODEL.LANGUAGE_BACKBONE.MASK_SPECIAL = False\n    cfg.MODEL.LANGUAGE_BACKBONE.PAD_MAX = True\n\n\n\n\n\n    cfg.MODEL.ENCODER = CN()  \n    cfg.MODEL.ENCODER.NAME= 'transformer_encoder_fpn'\n    cfg.MODEL.ENCODER.IGNORE_VALUE= 255\n    cfg.MODEL.ENCODER.NUM_CLASSES= 133\n    cfg.MODEL.ENCODER.LOSS_WEIGHT= 1.0\n    cfg.MODEL.ENCODER.CONVS_DIM= 512\n    cfg.MODEL.ENCODER.MASK_DIM= 512\n    cfg.MODEL.ENCODER.NORM= \"GN\"\n    cfg.MODEL.ENCODER.IN_FEATURES= [\"res2\", \"res3\", \"res4\", \"res5\"]\n    cfg.MODEL.ENCODER.DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES= [\"res3\", \"res4\", \"res5\"]\n    cfg.MODEL.ENCODER.COMMON_STRIDE= 4\n    cfg.MODEL.ENCODER.TRANSFORMER_ENC_LAYERS= 6\n\n    cfg.MODEL.DECODER = CN()  \n    cfg.MODEL.DECODER.TRANSFORMER_IN_FEATURE= \"multi_scale_pixel_decoder\"\n    cfg.MODEL.DECODER.MASK  = True\n    # DETECTION= False\n    # SPATIAL=\n    #   ENABLED= True\n    # GROUNDING=\n    #   ENABLED= False\n    #   MAX_LEN= 5\n    #   TEXT_WEIGHT= 2.0\n    #   CLASS_WEIGHT= 0.5\n    # VISUAL=\n    #   ENABLED= False\n    # AUDIO=\n    #   ENABLED= False\n    # OPENIMAGE=\n    #   ENABLED= False\n    #   NEGATIVE_SAMPLES= 5\n    #   GROUNDING=\n    #     ENABLED= False\n    #     MAX_LEN= 5\n    # CAPTION=\n    #   ENABLED= False\n    #   PHRASE_PROB= 0.5\n    #   SIM_THRES= 0.95\n    cfg.MODEL.DECODER.HIDDEN_DIM= 512\n    cfg.MODEL.DECODER.NUM_OBJECT_QUERIES= 101\n    cfg.MODEL.DECODER.NHEADS= 8\n    cfg.MODEL.DECODER.DROPOUT= 0.0\n    cfg.MODEL.DECODER.DIM_FEEDFORWARD= 2048\n    cfg.MODEL.DECODER.MAX_SPATIAL_LEN= [512, 512, 512, 512]\n    cfg.MODEL.DECODER.PRE_NORM= False\n    cfg.MODEL.DECODER.ENFORCE_INPUT_PROJ= False\n    cfg.MODEL.DECODER.SIZE_DIVISIBILITY= 32\n    cfg.MODEL.DECODER.TRAIN_NUM_POINTS= 12544\n    cfg.MODEL.DECODER.OVERSAMPLE_RATIO= 3.0\n    cfg.MODEL.DECODER.IMPORTANCE_SAMPLE_RATIO= 0.75\n    cfg.MODEL.DECODER.DEC_LAYERS= 10  # 9 decoder layers, add one for the loss on learnable query\n    cfg.MODEL.DECODER.TOP_GROUNDING_LAYERS= 10\n    cfg.MODEL.DECODER.TOP_CAPTION_LAYERS= 10\n    cfg.MODEL.DECODER.TOP_SPATIAL_LAYERS= 10\n    cfg.MODEL.DECODER.TOP_OPENIMAGE_LAYERS= 10\n    # TEST=\n    #   SEMANTIC_ON= True\n    #   INSTANCE_ON= True\n    #   PANOPTIC_ON= True\n    #   OVERLAP_THRESHOLD= 0.8\n    #   OBJECT_MASK_THRESHOLD= 0.4\n    #   SEM_SEG_POSTPROCESSING_BEFORE_INFERENCE= false\n    #   DETECTIONS_PER_IMAGE= 100\n\n    cfg.ATTENTION_ARCH = CN()\n    # cfg.ATTENTION_ARCH.VARIABLE={\n    # 'queries': ['object'],\n    # 'tokens': ['grounding', 'spatial', 'visual', 'audio']}\n\n#   SELF_ATTENTION:\n#     queries:\n#       object: ['queries_object', 'tokens_grounding', 'tokens_spatial', 'tokens_visual', 'tokens_audio']\n#     tokens:\n#       grounding: ['queries_object', 'tokens_grounding']\n#       spatial: ['tokens_spatial']\n#       visual: ['tokens_visual']\n#       audio: ['queries_object', 'tokens_audio']\n#   CROSS_ATTENTION:\n#     queries:\n#       object: True\n#     tokens:\n#       grounding: False\n#       spatial: False\n#       visual: False\n#       audio: False\n#   MASKING: ['tokens_spatial', 'tokens_grounding', 'tokens_visual', 'tokens_audio']\n#   DUPLICATION:\n#     queries:\n#       grounding: 'queries_object'\n#       spatial: 'queries_object'\n#   SPATIAL_MEMORIES: 32\n\n\n\n\n\n\n    cfg.SOLVER.OPTIMIZER = \"ADAMW\"\n    cfg.SOLVER.BACKBONE_MULTIPLIER = 0.1\n    cfg.SOLVER.TEXTENCODER_MULTIPLIER = 1.0\n    
cfg.SOLVER.LR_DECAY_RATE = None\n    cfg.SOLVER.LR_DECAY_RATE_NUM_LAYERS = None\n\n\n    ## support Swin backbone\n    cfg.MODEL.SWIN = CN()\n    cfg.MODEL.SWIN.PRETRAIN_IMG_SIZE = 224\n    cfg.MODEL.SWIN.PATCH_SIZE = 4\n    cfg.MODEL.SWIN.EMBED_DIM = 96\n    cfg.MODEL.SWIN.DEPTHS = [2, 2, 6, 2]\n    cfg.MODEL.SWIN.NUM_HEADS = [3, 6, 12, 24]\n    cfg.MODEL.SWIN.WINDOW_SIZE = 7\n    cfg.MODEL.SWIN.MLP_RATIO = 4.0\n    cfg.MODEL.SWIN.QKV_BIAS = True\n    cfg.MODEL.SWIN.QK_SCALE = None\n    cfg.MODEL.SWIN.DROP_RATE = 0.0\n    cfg.MODEL.SWIN.ATTN_DROP_RATE = 0.0\n    cfg.MODEL.SWIN.DROP_PATH_RATE = 0.3\n    cfg.MODEL.SWIN.APE = False\n    cfg.MODEL.SWIN.PATCH_NORM = True\n    cfg.MODEL.SWIN.OUT_FEATURES = [\"res2\", \"res3\", \"res4\", \"res5\"]\n    cfg.MODEL.SWIN.USE_CHECKPOINT = False\n    cfg.MODEL.SWIN.PRETRAINED_WEIGHT = None\n\n\n    # support InterImage backbone\n    cfg.MODEL.INTERNIMAGE = CN()  # large as base\n\n    #### large\n    cfg.MODEL.INTERNIMAGE.PRETRAINED_WEIGHT = None\n    cfg.MODEL.INTERNIMAGE.CORE_OP = \"DCNv3\"\n    cfg.MODEL.INTERNIMAGE.CHANNELS = 160\n    cfg.MODEL.INTERNIMAGE.DEPTHS = [5, 5, 22, 5]\n    cfg.MODEL.INTERNIMAGE.GROUPS =[10, 20, 40, 80]\n    cfg.MODEL.INTERNIMAGE.MLP_RATIO =4.\n    cfg.MODEL.INTERNIMAGE.DROP_PATH_RATE =0.0\n    cfg.MODEL.INTERNIMAGE.NORM_LAYER = \"LN\"\n    cfg.MODEL.INTERNIMAGE.LAYER_SCALE = 1.0\n    cfg.MODEL.INTERNIMAGE.OFFSET_SCALE = 2.0\n    cfg.MODEL.INTERNIMAGE.POST_NORM = True\n    cfg.MODEL.INTERNIMAGE.WITH_CP = False\n    cfg.MODEL.INTERNIMAGE.OUT_IINDICES = (0, 1, 2, 3)\n    cfg.MODEL.INTERNIMAGE.DW_KERNEL_SIZE = None\n    cfg.MODEL.INTERNIMAGE.RES_POST_NORM = False\n    cfg.MODEL.INTERNIMAGE.LEVEL2_POST_NORM = False\n    cfg.MODEL.INTERNIMAGE.LEVEL2_POST_NORM_BLOCK_IDS = None\n    cfg.MODEL.INTERNIMAGE.CENTER_FEATURE_SCALE = False\n\n    ### huge\n    # cfg.MODEL.INTERNIMAGE.PRETRAINED_WEIGHT = None\n    # cfg.MODEL.INTERNIMAGE.CORE_OP = \"DCNv3\"\n    # cfg.MODEL.INTERNIMAGE.CHANNELS = 320\n    # cfg.MODEL.INTERNIMAGE.DEPTHS = [6, 6, 32, 6]\n    # cfg.MODEL.INTERNIMAGE.GROUPS = [10, 20, 40, 80]\n    # cfg.MODEL.INTERNIMAGE.MLP_RATIO =4.\n    # cfg.MODEL.INTERNIMAGE.DROP_PATH_RATE = 0.5\n    # cfg.MODEL.INTERNIMAGE.NORM_LAYER = \"LN\"\n    # cfg.MODEL.INTERNIMAGE.LAYER_SCALE = None\n    # cfg.MODEL.INTERNIMAGE.OFFSET_SCALE = 1.0\n    # cfg.MODEL.INTERNIMAGE.POST_NORM = False\n    # cfg.MODEL.INTERNIMAGE.WITH_CP = False\n    # cfg.MODEL.INTERNIMAGE.OUT_IINDICES = (0, 1, 2, 3)\n\n    # cfg.MODEL.INTERNIMAGE.DW_KERNEL_SIZE = 5\n    # cfg.MODEL.INTERNIMAGE.RES_POST_NORM = True\n    # cfg.MODEL.INTERNIMAGE.LEVEL2_POST_NORM = True\n    # cfg.MODEL.INTERNIMAGE.LEVEL2_POST_NORM_BLOCK_IDS = [5, 11, 17, 23, 29]\n    # cfg.MODEL.INTERNIMAGE.CENTER_FEATURE_SCALE = True\n\n\n    # support EVA02 backbone\n    cfg.MODEL.EVA02 = CN()  # large as base\n\n    #### large\n    cfg.MODEL.EVA02.PRETRAINED_WEIGHT = None\n    cfg.MODEL.EVA02.IMAGE_SIZE =  1536\n    cfg.MODEL.EVA02.PATCH_SIZE =  16\n    cfg.MODEL.EVA02.WINDOW_SIZE =  16\n    cfg.MODEL.EVA02.DMBED_DIM =1024  \n    cfg.MODEL.EVA02.DEPTH =  24\n    cfg.MODEL.EVA02.NUM_HEADS =  16\n    cfg.MODEL.EVA02.MLP_RATIO =   4*2/3\n    cfg.MODEL.EVA02.DROP_PATH_RATE =  0.3\n    cfg.MODEL.EVA02.CHECKPOINT = True\n    cfg.MODEL.EVA02.WINDOW_BLOCK_INDEXES =  [0, 1, 3, 4, 6, 7, 9, 10, 12, 13, 15, 16, 18, 19, 21, 22]\n    \n \n\n    # support EVA01 backbone\n    cfg.MODEL.EVA01 = CN()  # large as base\n\n    #### large\n    cfg.MODEL.EVA01.PRETRAINED_WEIGHT = None\n\n    
cfg.MODEL.EVA01.BEIT_LIKE_QKV_BIAS = True\n    cfg.MODEL.EVA01.BEIT_LIKE_GAMMA = False\n    cfg.MODEL.EVA01.FREEZE_PATH_EMBED = True\n\n    cfg.MODEL.EVA01.IMAGE_SIZE =  1280  # only for correct dim in pos embed\n    cfg.MODEL.EVA01.PATCH_SIZE =  16\n    cfg.MODEL.EVA01.WINDOW_SIZE =  16\n    cfg.MODEL.EVA01.DMBED_DIM = 1408  \n    cfg.MODEL.EVA01.DEPTH =  40\n    cfg.MODEL.EVA01.NUM_HEADS =  16\n    cfg.MODEL.EVA01.MLP_RATIO =   6144 / 1408\n    cfg.MODEL.EVA01.DROP_PATH_RATE =  0.6\n    cfg.MODEL.EVA01.WINDOW_BLOCK_INDEXES =  [0, 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 16, 17, 18, 20, 21, 22, 24, 25, 26, 28, 29, 30, 32, 33, 34, 36, 37, 38]\n    \n \n\n "
  },
  {
    "path": "thirdparty/GLEE/glee/config_deeplab.py",
    "content": "# -*- coding: utf-8 -*-\n# Copyright (c) Facebook, Inc. and its affiliates.\n\n\ndef add_deeplab_config(cfg):\n    \"\"\"\n    Add config for DeepLab.\n    \"\"\"\n    # We retry random cropping until no single category in semantic segmentation GT occupies more\n    # than `SINGLE_CATEGORY_MAX_AREA` part of the crop.\n    cfg.INPUT.CROP.SINGLE_CATEGORY_MAX_AREA = 1.0\n    # Used for `poly` learning rate schedule.\n    cfg.SOLVER.POLY_LR_POWER = 0.9\n    cfg.SOLVER.POLY_LR_CONSTANT_ENDING = 0.0\n    # Loss type, choose from `cross_entropy`, `hard_pixel_mining`.\n    cfg.MODEL.SEM_SEG_HEAD.LOSS_TYPE = \"hard_pixel_mining\"\n    # DeepLab settings\n    cfg.MODEL.SEM_SEG_HEAD.PROJECT_FEATURES = [\"res2\"]\n    cfg.MODEL.SEM_SEG_HEAD.PROJECT_CHANNELS = [48]\n    cfg.MODEL.SEM_SEG_HEAD.ASPP_CHANNELS = 256\n    cfg.MODEL.SEM_SEG_HEAD.ASPP_DILATIONS = [6, 12, 18]\n    cfg.MODEL.SEM_SEG_HEAD.ASPP_DROPOUT = 0.1\n    cfg.MODEL.SEM_SEG_HEAD.USE_DEPTHWISE_SEPARABLE_CONV = False\n    # Backbone new configs\n    cfg.MODEL.RESNETS.RES4_DILATION = 1\n    cfg.MODEL.RESNETS.RES5_MULTI_GRID = [1, 2, 4]\n    # ResNet stem type from: `basic`, `deeplab`\n    cfg.MODEL.RESNETS.STEM_TYPE = \"deeplab\"\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/glee_model.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved\n\"\"\"\n\"\"\"\nimport torch\nimport torch.nn.functional as F\nfrom torch import nn\n# from ..backbone import build_backbone, Backbone\n# from ..body.encoder import build_encoder\n# from ..body.decoder import build_decoder\n\nfrom detectron2.modeling import  build_backbone\n\nfrom .pixel_decoder.maskdino_encoder import build_pixel_decoder\nfrom .transformer_decoder.maskdino_decoder import build_transformer_decoder\n\nimport random\nfrom transformers import AutoTokenizer\nfrom collections import OrderedDict\nfrom ..modules.point_features import point_sample\nfrom timm.models.layers import trunc_normal_\nfrom transformers import CLIPTokenizer,CLIPTextModel\nfrom .vos_utils import masks_to_boxes, FeatureFuser\nimport numpy as np\nimport math\n\ndef rand_sample(x, max_len):\n    if x.shape[1] <= max_len:\n        return x\n    else:\n        rand_idx = torch.randperm(x.shape[1])[:max_len]\n        return x[:,rand_idx]\n\n\ndef agg_lang_feat(features, mask, pool_type=\"average\"):\n    \"\"\"average pooling of language features\"\"\"\n    # feat: (bs, seq_len, C)\n    # mask: (bs, seq_len)\n    if pool_type == \"average\":\n        embedded = features * mask.unsqueeze(-1).float() # use mask to zero out invalid token features\n        aggregate = embedded.sum(1) / (mask.sum(-1).unsqueeze(-1).float())\n    elif pool_type == \"max\":\n        out = []\n        for i in range(len(features)):\n            pool_feat, _ = torch.max(features[i][mask[i]], 0) # (L, C) -> (C, )\n            out.append(pool_feat)\n        aggregate = torch.stack(out, dim=0) # (bs, C)\n    else:\n        raise ValueError(\"pool_type should be average or max\")\n    return aggregate\n\nclass GLEE_Model(nn.Module):\n    \"\"\"\n    Main class for mask classification semantic segmentation architectures.\n    \"\"\"\n    def __init__(self, cfg, matcher, device, video_info, contras_mean):\n        super().__init__()\n        self.cfg = cfg\n        self.matcher = matcher\n        self.backbone = build_backbone(cfg)\n        output_channels = [v for k,v in self.backbone._out_feature_channels.items()]\n        self.sot_fuser = FeatureFuser(output_channels[-3:], 256)\n        self.tokenizer = CLIPTokenizer.from_pretrained('/home/PJLAB/caiwenzhe/Desktop/checkpoints/clip-vit-base-patch32') \n        self.tokenizer.add_special_tokens({'cls_token': self.tokenizer.eos_token})\n        self.text_encoder = CLIPTextModel.from_pretrained('/home/PJLAB/caiwenzhe/Desktop/checkpoints/clip-vit-base-patch32')\n        # self.text_encoder_teacher = CLIPTextModel.from_pretrained('GLEE/clip_vit_base_patch32')\n        self.lang_encoder = None\n        # for p in self.text_encoder_teacher.parameters():\n            # p.requires_grad = False\n        self.lang_projection = nn.Parameter(torch.rand(cfg.MODEL.LANGUAGE_BACKBONE.LANG_DIM, cfg.MODEL.DIM_PROJ))\n        self.text_encode_type = 'clip_teacher'\n        \n        # self.lang_encoder = None     \n        self.pixel_decoder = build_pixel_decoder(cfg, self.backbone.output_shape())\n        transformer_predictor_in_channels = cfg.MODEL.SEM_SEG_HEAD.CONVS_DIM\n        self.predictor = build_transformer_decoder(cfg, transformer_predictor_in_channels, lang_encoder = self.lang_encoder, mask_classification=True,)\n        self.to(device)\n        \n        self.video_info = video_info\n        self.contras_mean = contras_mean\n\n        self.track_loss_version = cfg.MODEL.TRACK_VERSION\n\n        self.no_mask_tasks = 
['obj365', 'obj365_clip','openimage', 'openimage_clip', 'vg', 'grit', 'bdd_det', 'bdd_track_box'] \n\n\n        # for visual prompt\n        hidden_dim = 256\n        self.max_spatial_len = [512,512,512,512]\n        self.mask_sptial_embed = nn.ParameterList([nn.Parameter(torch.empty(hidden_dim, hidden_dim)) for x in range(4)])\n        trunc_normal_(self.mask_sptial_embed[0], std=.02)\n        trunc_normal_(self.mask_sptial_embed[1], std=.02)\n        trunc_normal_(self.mask_sptial_embed[2], std=.02)\n        trunc_normal_(self.mask_sptial_embed[3], std=.02)\n        # learnable positive negative indicator\n        self.pn_indicator = nn.Embedding(2, hidden_dim)\n\n    @property\n    def device(self):\n        return self.pixel_mean.device\n    \n    def forward(self, images, prompts, task, targets=None, batch_name_list=None, is_train = True, visual_prompt_type='scribble'):\n        extra =  {}\n        # dist_loss = None\n        early_semantic = None\n\n        if self.text_encode_type == \"clip_teacher\":\n            if task not in ['grounding','rvos']:\n                assert batch_name_list\n                calsses_name_list = batch_name_list\n                tokenized = self.tokenizer.batch_encode_plus(calsses_name_list,\n                        max_length=self.cfg.MODEL.LANGUAGE_BACKBONE.MAX_QUERY_LEN, # 256\n                        padding='max_length' if self.cfg.MODEL.LANGUAGE_BACKBONE.PAD_MAX else \"longest\", # max_length\n                        return_special_tokens_mask=True,\n                        return_tensors='pt',\n                        truncation=True).to(images.device)\n                texts = (tokenized['input_ids'], tokenized['attention_mask'])\n                token_x = self.text_encoder(*texts)['last_hidden_state']\n\n                valid_mask = tokenized['attention_mask'].bool()\n                # token_x_teacher = self.text_encoder_teacher(*texts)['last_hidden_state']\n                # if is_train:\n                # dist_loss =  F.mse_loss(token_x[valid_mask], token_x_teacher[valid_mask] )\n                    # F.l2_loss(token_x[valid_mask], token_x_teacher[valid_mask] )  \n                token_x = token_x @ self.lang_projection\n                lang_feat_pool = agg_lang_feat(token_x, tokenized['attention_mask'], pool_type=\"average\")  # (bs,  768)\n                extra['class_embeddings'] = lang_feat_pool \n                if True: # early_fusion\n                    gather_all_classtoken = token_x.flatten(0,1)[tokenized['attention_mask'].flatten(0,1)>0]\n                    gather_all_classtoken = gather_all_classtoken.unsqueeze(0).repeat(len(images),1,1) #[bs,L,C]\n                    gather_all_classtoken_mask = torch.ones_like(gather_all_classtoken[:,:,0])>0  #[bs,L]\n                    early_semantic = {\"hidden\":gather_all_classtoken.float(),\"masks\":gather_all_classtoken_mask} \n\n\n        if 'grounding' in prompts:\n \n            if self.text_encode_type == 'clip_frozen' or self.text_encode_type == 'clip_teacher':\n\n                tokens = self.tokenizer(\n                    prompts['grounding'], padding='max_length', truncation=True, max_length=self.cfg.MODEL.LANGUAGE_BACKBONE.MAX_QUERY_LEN, return_tensors='pt'\n                    )\n                tokens = {key: value.to(images.device) for key, value in tokens.items()}\n\n                texts = (tokens['input_ids'], tokens['attention_mask'])\n                x = self.text_encoder(*texts)\n                token_x = x['last_hidden_state']\n                token_x = token_x @ 
self.lang_projection\n\n                extra['grounding_tokens'] = token_x.permute(1,0,2) #[len,bz,C]\n\n                non_zero_query_mask = tokens['attention_mask']\n                lang_feat_pool = agg_lang_feat(token_x, non_zero_query_mask, pool_type=\"average\").unsqueeze(1) # (bs, 1, 768)\n\n                dist_loss =  (lang_feat_pool*0).sum()\n                \n                extra['grounding_nonzero_mask'] = ~non_zero_query_mask.bool()  # [bz,len]\n                extra['grounding_class'] = lang_feat_pool.squeeze(1) #[bz,C\n                # gather_all_classtoken = token_x.flatten(0,1)[tokenized['attention_mask'].flatten(0,1)>0]\n                # gather_all_classtoken = gather_all_classtoken.unsqueeze(0).repeat(len(images),1,1) #[bs,L,C]\n                # gather_all_classtoken_mask = torch.ones_like(gather_all_classtoken[:,:,0])>0  #[bs,L]\n                # early_semantic = {\"hidden\":gather_all_classtoken.float(),\"masks\":gather_all_classtoken_mask} \n                early_semantic = {\"hidden\":token_x.float(),\"masks\":tokens['attention_mask']>0} \n        \n\n        if isinstance(images,torch.Tensor):\n            features = self.backbone(images)\n        else:\n            features = self.backbone(images.tensor)\n\n\n\n\n        if 'spatial' in prompts:\n            ## setp 1,2,3\n            key_images = [ images ]  #bz*[1,3,H,W]\n            key_promptmasks = [m.unsqueeze(0) for m in prompts['spatial']] #bz*[1,1,H,W]\n\n            prompt_mode = visual_prompt_type            \n            ref_feats, ref_masks = self.get_template(key_images, key_promptmasks, prompt_mode) \n            early_fusion = {\"hidden\":ref_feats,\"masks\":ref_masks} \n            if early_semantic is None:\n                early_semantic = early_fusion\n            else:\n                early_semantic[\"hidden\"] = torch.cat([early_semantic[\"hidden\"],early_fusion[\"hidden\"]],dim=1)\n                early_semantic[\"masks\"] = torch.cat([early_semantic[\"masks\"],early_fusion[\"masks\"]],dim=1)\n\n        \n        # bz = len(images)//2\n        mask_features, _, multi_scale_features, zero_loss = self.pixel_decoder.forward_features(features, masks=None, early_fusion = early_semantic)\n        if 'spatial' in prompts:\n            pos_masks = prompts['spatial']\n            # neg_masks = [~p for p in prompts['spatial']]\n            neg_masks = [p&False for p in prompts['spatial']]\n            \n            extra.update({'spatial_query_pos_mask': pos_masks, 'spatial_query_neg_mask': neg_masks})\n\n\n            _,h,w = extra['spatial_query_pos_mask'][0].shape\n            divisor = torch.tensor([h,w], device=mask_features.device)[None,]\n            # Get mean pos spatial query\n            non_zero_pos_point = [rand_sample((m.nonzero()[:,1:]/divisor).t(), self.max_spatial_len[-1]).t() for m in extra['spatial_query_pos_mask']]\n            non_zero_pos_point = nn.utils.rnn.pad_sequence(non_zero_pos_point, padding_value=-1).permute(1,0,2)  \n            non_zero_pos_mask = (non_zero_pos_point.sum(dim=-1) < 0)  \n            spatial_query_pos = point_sample(mask_features, non_zero_pos_point.flip(dims=(2,)).type(mask_features.dtype), align_corners=True) #[(N, C, P)\n            spatial_query_pos = torch.stack([x[m].mean(dim=0, keepdim=True) for x, m in zip(spatial_query_pos.transpose(1,2), ~non_zero_pos_mask)]).transpose(0,1).nan_to_num() # [1,bz,C]\n            # Get mean neg spatial query\n            non_zero_neg_point = [rand_sample((m.nonzero()[:,1:]/divisor).t(), 
self.max_spatial_len[-1]).t() for m in extra['spatial_query_neg_mask']]\n            non_zero_neg_point = nn.utils.rnn.pad_sequence(non_zero_neg_point, padding_value=-1).permute(1,0,2)\n            non_zero_neg_mask = (non_zero_neg_point.sum(dim=-1) < 0)\n            spatial_query_neg = point_sample(mask_features, non_zero_neg_point.flip(dims=(2,)).type(mask_features.dtype), align_corners=True)\n            spatial_query_neg = torch.stack([x[m].mean(dim=0, keepdim=True) for x, m in zip(spatial_query_neg.transpose(1,2), ~non_zero_neg_mask)]).transpose(0,1).nan_to_num()\n\n            # Get layerwise spatial query\n            src_spatial_queries = []\n            src_spatial_maskings = []\n            for i in range(len(multi_scale_features)):\n                bs,dc,h,w = multi_scale_features[i].shape\n                # src_mask_features = multi_scale_features[i].view(h,w,bs,dc)\n                src_mask_features = multi_scale_features[i].permute(2,3,0,1)\n                src_mask_features = src_mask_features @ self.mask_sptial_embed[i]\n\n                non_zero_query_point_pos = [rand_sample((m.nonzero()[:,1:]/divisor).t(), self.max_spatial_len[i]).t() for m in extra['spatial_query_pos_mask']]\n                non_zero_query_point_neg = [rand_sample((m.nonzero()[:,1:]/divisor).t(), self.max_spatial_len[i]).t() for m in extra['spatial_query_neg_mask']]\n                non_zero_query_point = [torch.cat([x,y], dim=0) for x,y in zip(non_zero_query_point_pos, non_zero_query_point_neg)]\n                pos_neg_indicator = [torch.cat([torch.ones(x.shape[0], device=x.device), -torch.ones(y.shape[0], device=y.device)]) for x,y in zip(non_zero_query_point_pos, non_zero_query_point_neg)]\n                pos_neg_indicator = nn.utils.rnn.pad_sequence(pos_neg_indicator, padding_value=0)\n                non_zero_query_point = nn.utils.rnn.pad_sequence(non_zero_query_point, padding_value=-1).permute(1,0,2)\n                non_zero_query_mask = (non_zero_query_point.sum(dim=-1) < 0)\n                non_zero_query_point[non_zero_query_mask] = 0\n\n                spatial_tokens = point_sample(src_mask_features.permute(2,3,0,1), non_zero_query_point.flip(dims=(2,)).type(src_mask_features.dtype), align_corners=True).permute(2,0,1)\n                spatial_tokens[pos_neg_indicator==1] += self.pn_indicator.weight[0:1]\n                spatial_tokens[pos_neg_indicator==-1] += self.pn_indicator.weight[1:2]\n\n                src_spatial_queries += [spatial_tokens]\n                src_spatial_maskings += [non_zero_query_mask]\n\n            extra['visual_prompt_tokens'] = src_spatial_queries #[len,bz,C]\n            extra['visual_prompt_nonzero_mask'] = src_spatial_maskings  # [bz,len]\n        \n \n        outputs = self.predictor(multi_scale_features, mask_features, extra=extra, task=task, masks=None, targets=targets)\n        return  outputs \n \n\n\n\n     \n\n    def get_template(self, imgs, pad_masks, prompt_mode='scribble'):\n        \"\"\"img: (N, 3, H, W), mask: (N, 1, H, W), bbox: (1, 4)\"\"\"\n        \"\"\"get 4-channel template\"\"\"\n\n        croped_img_with_mask = []\n\n        for image_i, mask_i in zip( imgs, pad_masks):\n\n            if prompt_mode in ['scribble','point']:\n                image_with_mask = image_i + mask_i.to(image_i)\n            else:\n                image_with_mask = image_i \n\n            # image_with_mask = torch.cat([image_i,mask_i.to(image_i)],dim=1) #[1,3,H,W]\n            box_i = masks_to_boxes(mask_i[0])  #[xyxy]\n            box_i[:, 2:] = box_i[:, 2:] - 
box_i[:, :2] #xywh\n            \n\n            x, y, w, h = box_i[0].long().tolist()\n\n            self.search_area_factor=2\n\n            crop_sz = math.ceil(math.sqrt(w * h) * self.search_area_factor)\n            x1 = max(0,round(x + 0.5 * w - crop_sz * 0.5))\n            x2 = x1 + crop_sz\n            y1 = max(0,round(y + 0.5 * h - crop_sz * 0.5))\n            y2 = y1 + crop_sz\n\n            im_crop = image_with_mask[:, :, y1:y2, x1:x2]\n            # resize\n            if im_crop.shape[-1] ==0 or im_crop.shape[-2] ==0 :\n                im_crop = image_with_mask\n            im_crop = F.interpolate(im_crop, (256,256), mode='bilinear', align_corners=False)\n            croped_img_with_mask.append(im_crop)\n        croped_img_with_mask = torch.cat(croped_img_with_mask,dim=0) #[bz,3,256,256]\n        with torch.no_grad():\n            ref_srcs = self.backbone(croped_img_with_mask.contiguous())\n        ref_srcs = [v for k,v in ref_srcs.items()]\n        ref_feats = self.sot_fuser(ref_srcs[1:]).float() #[bz,256,32,32]\n\n        ref_feats = ref_feats.flatten(-2).permute(0, 2, 1) # (bs, L, C)\n        ref_masks = torch.ones_like(ref_feats[:,:,0])>0  #[bs,L]\n        \n        return ref_feats, ref_masks\n\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/__init__.py",
    "content": "# Copyright (c) IDEA, Inc. and its affiliates.\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/early_fusion.py",
    "content": "import torch\nimport torch.nn.functional as F\nfrom torch import nn\nfrom timm.models.layers import DropPath\n\n\n\n\nclass VLFuse(torch.nn.Module):\n    \"\"\"\n    Early Fusion Module\n    \"\"\"\n\n    def __init__(self, ):\n        super(VLFuse, self).__init__()\n        self.init_configs()\n\n        # early fusion module\n        # bi-direction (text->image, image->text)\n        self.b_attn = BiAttentionBlockForCheckpoint(v_dim=self.img_dim, # 256\n                    l_dim=self.lang_dim, # 768\n                    embed_dim=self.embed_dim, # 2048\n                    num_heads=self.n_head, # 8\n                    dropout=0.1,\n                    drop_path=.0,\n                    init_values=1.0 / 6,\n                    )\n    def init_configs(self, ):\n        # common params\n        self.img_dim =  256\n\n        self.max_query_len = 256\n        self.n_layers =1\n\n        # mha params\n        self.n_head = 8\n        self.embed_dim = 2048 # 2048 by default\n        \n        self.lang_dim = 256\n\n    def forward(self, x, task=None):\n        visual_features = x[\"visual\"]\n        language_dict_features = x[\"lang\"]\n\n        fused_visual_features, language_features = self.b_attn(\n                visual_features, language_dict_features['hidden'], language_dict_features['masks'], task)\n\n        language_dict_features['hidden'] = language_features\n        fused_language_dict_features = language_dict_features\n\n        features_dict = {\"visual\": fused_visual_features,\n                         \"lang\": fused_language_dict_features}\n\n        return features_dict\n\n\nclass BiMultiHeadAttention(nn.Module):\n    def __init__(self, v_dim, l_dim, embed_dim, num_heads, dropout=0.1):\n        super(BiMultiHeadAttention, self).__init__()\n\n        self.embed_dim = embed_dim\n        self.num_heads = num_heads\n        self.head_dim = embed_dim // num_heads\n        self.v_dim = v_dim\n        self.l_dim = l_dim\n\n        assert (\n                self.head_dim * self.num_heads == self.embed_dim\n        ), f\"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`: {self.num_heads}).\"\n        self.scale = self.head_dim ** (-0.5)\n        self.dropout = dropout\n\n        self.v_proj = nn.Linear(self.v_dim, self.embed_dim)\n        self.l_proj = nn.Linear(self.l_dim, self.embed_dim)\n        self.values_v_proj = nn.Linear(self.v_dim, self.embed_dim)\n        self.values_l_proj = nn.Linear(self.l_dim, self.embed_dim)\n\n        self.out_v_proj = nn.Linear(self.embed_dim, self.v_dim)\n        self.out_l_proj = nn.Linear(self.embed_dim, self.l_dim)\n\n        self.stable_softmax_2d =  False\n        self.clamp_min_for_underflow = True\n        self.clamp_max_for_overflow = True\n\n        self._reset_parameters()\n\n    def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):\n        return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()\n\n    def _reset_parameters(self):\n        nn.init.xavier_uniform_(self.v_proj.weight)\n        self.v_proj.bias.data.fill_(0)\n        nn.init.xavier_uniform_(self.l_proj.weight)\n        self.l_proj.bias.data.fill_(0)\n        nn.init.xavier_uniform_(self.values_v_proj.weight)\n        self.values_v_proj.bias.data.fill_(0)\n        nn.init.xavier_uniform_(self.values_l_proj.weight)\n        self.values_l_proj.bias.data.fill_(0)\n        nn.init.xavier_uniform_(self.out_v_proj.weight)\n        self.out_v_proj.bias.data.fill_(0)\n        
nn.init.xavier_uniform_(self.out_l_proj.weight)\n        self.out_l_proj.bias.data.fill_(0)\n\n    def forward(self, v, l, attention_mask_l=None):\n        bsz, tgt_len, embed_dim = v.size()\n\n        query_states = self.v_proj(v) * self.scale\n        key_states = self._shape(self.l_proj(l), -1, bsz)\n        value_v_states = self._shape(self.values_v_proj(v), -1, bsz)\n        value_l_states = self._shape(self.values_l_proj(l), -1, bsz)\n\n        proj_shape = (bsz * self.num_heads, -1, self.head_dim) # (bs * 8, -1, embed_dim//8)\n        query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape) # (bs * 8, seq_len_img, embed_dim//8)\n        key_states = key_states.view(*proj_shape) # (bs * 8, seq_len_text, embed_dim//8)\n        value_v_states = value_v_states.view(*proj_shape)\n        value_l_states = value_l_states.view(*proj_shape)\n\n        src_len = key_states.size(1)\n        attn_weights = torch.bmm(query_states, key_states.transpose(1, 2)) # (bs * 8, seq_len_img, seq_len_text)\n\n        if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):\n            raise ValueError(\n                f\"Attention weights should be of size {(bsz * self.num_heads, tgt_len, src_len)}, but is {attn_weights.size()}\"\n            )\n\n        # attn_weights_l = nn.functional.softmax(attn_weights.transpose(1, 2), dim=-1)\n\n        if self.stable_softmax_2d:\n            attn_weights = attn_weights - attn_weights.max()\n        \n        if self.clamp_min_for_underflow:\n            attn_weights = torch.clamp(attn_weights, min=-50000) # Do not increase -50000, data type half has quite limited range\n        if self.clamp_max_for_overflow:\n            attn_weights = torch.clamp(attn_weights, max=50000) # Do not increase 50000, data type half has quite limited range\n\n        attn_weights_T = attn_weights.transpose(1, 2)\n        attn_weights_l = (attn_weights_T - torch.max(attn_weights_T, dim=-1, keepdim=True)[\n            0])\n        if self.clamp_min_for_underflow:\n            attn_weights_l = torch.clamp(attn_weights_l, min=-50000) # Do not increase -50000, data type half has quite limited range\n        if self.clamp_max_for_overflow:\n            attn_weights_l = torch.clamp(attn_weights_l, max=50000) # Do not increase 50000, data type half has quite limited range\n\n        attn_weights_l = attn_weights_l.softmax(dim=-1)\n        # assert attention_mask_l.dtype == torch.int64\n        if attention_mask_l is not None:\n            assert (attention_mask_l.dim() == 2) # (bs, seq_len)\n            attention_mask = attention_mask_l.unsqueeze(1).unsqueeze(1) # (bs, 1, 1, seq_len)\n            attention_mask = attention_mask.expand(bsz, 1, tgt_len, src_len)\n            attention_mask = attention_mask.masked_fill(attention_mask == 0, -9e15)\n\n            if attention_mask.size() != (bsz, 1, tgt_len, src_len):\n                raise ValueError(\n                    f\"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}\"\n                )\n            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attention_mask\n            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)\n\n        attn_weights_v = nn.functional.softmax(attn_weights, dim=-1)\n\n        attn_probs_v = F.dropout(attn_weights_v, p=self.dropout, training=self.training)\n        attn_probs_l = F.dropout(attn_weights_l, p=self.dropout, training=self.training)\n\n        attn_output_v = torch.bmm(attn_probs_v, value_l_states)\n        
attn_output_l = torch.bmm(attn_probs_l, value_v_states)\n\n\n        if attn_output_v.size() != (bsz * self.num_heads, tgt_len, self.head_dim):\n            raise ValueError(\n                f\"`attn_output_v` should be of size {(bsz, self.num_heads, tgt_len, self.head_dim)}, but is {attn_output_v.size()}\"\n            )\n\n        if attn_output_l.size() != (bsz * self.num_heads, src_len, self.head_dim):\n            raise ValueError(\n                f\"`attn_output_l` should be of size {(bsz, self.num_heads, src_len, self.head_dim)}, but is {attn_output_l.size()}\"\n            )\n\n        attn_output_v = attn_output_v.view(bsz, self.num_heads, tgt_len, self.head_dim)\n        attn_output_v = attn_output_v.transpose(1, 2)\n        attn_output_v = attn_output_v.reshape(bsz, tgt_len, self.embed_dim)\n\n        attn_output_l = attn_output_l.view(bsz, self.num_heads, src_len, self.head_dim)\n        attn_output_l = attn_output_l.transpose(1, 2)\n        attn_output_l = attn_output_l.reshape(bsz, src_len, self.embed_dim)\n\n        attn_output_v = self.out_v_proj(attn_output_v)\n        attn_output_l = self.out_l_proj(attn_output_l)\n\n        return attn_output_v, attn_output_l\n\n\nclass BiAttentionBlockForCheckpoint(nn.Module):\n    def __init__(self, v_dim, l_dim, embed_dim, num_heads, dropout=0.1,\n                 drop_path=.0, init_values=1e-4,  ):\n        \"\"\"\n        Inputs:\n            embed_dim - Dimensionality of input and attention feature vectors\n            num_heads - Number of heads to use in the Multi-Head Attention block\n            dropout - Amount of dropout to apply in the feed-forward network\n        \"\"\"\n        super(BiAttentionBlockForCheckpoint, self).__init__()\n\n        # pre layer norm\n        self.layer_norm_v = nn.LayerNorm(v_dim)\n        self.layer_norm_l = nn.LayerNorm(l_dim)\n        self.attn = BiMultiHeadAttention(v_dim=v_dim,\n                                         l_dim=l_dim,\n                                         embed_dim=embed_dim,\n                                         num_heads=num_heads,\n                                         dropout=dropout,\n                                        )\n\n        # add layer scale for training stability\n        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()\n        self.gamma_v = nn.Parameter(init_values * torch.ones((v_dim)), requires_grad=True)\n        self.gamma_l = nn.Parameter(init_values * torch.ones((l_dim)), requires_grad=True)\n\n\n    def forward(self, v, l, attention_mask_l=None, task=None):\n        # v: visual features, (bs, sigma(HW), 256)\n        # l: language features, (bs, seq_len, 768)\n        v = self.layer_norm_v(v)\n        l = self.layer_norm_l(l)\n        delta_v, delta_l = self.attn(v, l, attention_mask_l=attention_mask_l)\n        # v, l = v + delta_v, l + delta_l\n        v = v + self.drop_path(self.gamma_v * delta_v)\n        l = l + self.drop_path(self.gamma_l * delta_l)\n        return v, l\n\n\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/maskdino_encoder.py",
    "content": "# ------------------------------------------------------------------------\n# DINO\n# Copyright (c) 2022 IDEA. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------\n# Modified by Feng Li and Hao Zhang.\nimport logging\nimport numpy as np\nfrom typing import Callable, Dict, List, Optional, Tuple, Union\nimport fvcore.nn.weight_init as weight_init\n\nimport torch\nfrom torch import nn\nfrom torch.nn import functional as F\nfrom torch.nn.init import xavier_uniform_, constant_, uniform_, normal_\nfrom torch.cuda.amp import autocast\n\nfrom detectron2.config import configurable\nfrom detectron2.layers import Conv2d, ShapeSpec, get_norm\nfrom detectron2.modeling import SEM_SEG_HEADS_REGISTRY\n\nfrom .position_encoding import PositionEmbeddingSine\nfrom ...utils.utils import _get_clones, _get_clones_advanced, _get_activation_fn\nfrom .ops.modules import MSDeformAttn\nfrom .early_fusion import VLFuse\n\ndef build_pixel_decoder(cfg, input_shape):\n    \"\"\"\n    Build a pixel decoder from `cfg.MODEL.MaskDINO.PIXEL_DECODER_NAME`.\n    \"\"\"\n    name = cfg.MODEL.SEM_SEG_HEAD.PIXEL_DECODER_NAME\n    model = SEM_SEG_HEADS_REGISTRY.get(name)(cfg, input_shape)\n    forward_features = getattr(model, \"forward_features\", None)\n    if not callable(forward_features):\n        raise ValueError(\n            \"Only SEM_SEG_HEADS with forward_features method can be used as pixel decoder. \"\n            f\"Please implement forward_features for {name} to only return mask features.\"\n        )\n    return model\n\n\n# MSDeformAttn Transformer encoder in deformable detr\nclass MSDeformAttnTransformerEncoderOnly(nn.Module):\n    def __init__(self, d_model=256, nhead=8,\n                 num_encoder_layers=6, dim_feedforward=1024, dropout=0.1,\n                 activation=\"relu\",\n                 num_feature_levels=4, enc_n_points=4,):\n        super().__init__()\n\n        self.d_model = d_model\n        self.nhead = nhead\n\n        vl_fusion_layer = VLFuse()\n\n        encoder_layer = MSDeformAttnTransformerEncoderLayer(d_model, dim_feedforward,\n                                                            dropout, activation,\n                                                            num_feature_levels, nhead, enc_n_points)\n        self.encoder = MSDeformAttnTransformerEncoder(vl_fusion_layer, encoder_layer, num_encoder_layers)\n\n        self.level_embed = nn.Parameter(torch.Tensor(num_feature_levels, d_model))\n\n        self._reset_parameters()\n\n    def _reset_parameters(self):\n        for p in self.parameters():\n            if p.dim() > 1:\n                nn.init.xavier_uniform_(p)\n        for m in self.modules():\n            if isinstance(m, MSDeformAttn):\n                m._reset_parameters()\n        normal_(self.level_embed)\n\n    def get_valid_ratio(self, mask):\n        _, H, W = mask.shape\n        valid_H = torch.sum(~mask[:, :, 0], 1)\n        valid_W = torch.sum(~mask[:, 0, :], 1)\n        valid_ratio_h = valid_H.float() / H\n        valid_ratio_w = valid_W.float() / W\n        valid_ratio = torch.stack([valid_ratio_w, valid_ratio_h], -1)\n        return valid_ratio\n\n    def forward(self, srcs, masks, pos_embeds, early_fusion=None):\n\n        enable_mask=0\n        if masks is not None:\n            for src in srcs:\n                if src.size(2)%32 or src.size(3)%32:\n                    enable_mask = 1\n        if enable_mask==0:\n         
   masks = [torch.zeros((x.size(0), x.size(2), x.size(3)), device=x.device, dtype=torch.bool) for x in srcs]\n        # prepare input for encoder\n        src_flatten = []\n        mask_flatten = []\n        lvl_pos_embed_flatten = []\n        spatial_shapes = []\n        for lvl, (src, mask, pos_embed) in enumerate(zip(srcs, masks, pos_embeds)):\n            bs, c, h, w = src.shape\n            spatial_shape = (h, w)\n            spatial_shapes.append(spatial_shape)\n            src = src.flatten(2).transpose(1, 2)\n            mask = mask.flatten(1)\n            pos_embed = pos_embed.flatten(2).transpose(1, 2)\n            lvl_pos_embed = pos_embed + self.level_embed[lvl].view(1, 1, -1)\n            lvl_pos_embed_flatten.append(lvl_pos_embed)\n            src_flatten.append(src)\n            mask_flatten.append(mask)\n        src_flatten = torch.cat(src_flatten, 1)\n        mask_flatten = torch.cat(mask_flatten, 1)\n        lvl_pos_embed_flatten = torch.cat(lvl_pos_embed_flatten, 1)\n        spatial_shapes = torch.as_tensor(spatial_shapes, dtype=torch.long, device=src_flatten.device)\n        level_start_index = torch.cat((spatial_shapes.new_zeros((1, )), spatial_shapes.prod(1).cumsum(0)[:-1]))\n        valid_ratios = torch.stack([self.get_valid_ratio(m) for m in masks], 1)\n        # encoder\n        memory, zero_loss = self.encoder(src_flatten, spatial_shapes, level_start_index, valid_ratios, lvl_pos_embed_flatten, mask_flatten, early_fusion)\n\n        return memory, spatial_shapes, level_start_index, zero_loss\n\n\nclass MSDeformAttnTransformerEncoderLayer(nn.Module):\n    def __init__(self,\n                 d_model=256, d_ffn=1024,\n                 dropout=0.1, activation=\"relu\",\n                 n_levels=4, n_heads=8, n_points=4):\n        super().__init__()\n\n        # self attention\n        self.self_attn = MSDeformAttn(d_model, n_levels, n_heads, n_points)\n        self.dropout1 = nn.Dropout(dropout)\n        self.norm1 = nn.LayerNorm(d_model)\n\n        # ffn\n        self.linear1 = nn.Linear(d_model, d_ffn)\n        self.activation = _get_activation_fn(activation)\n        self.dropout2 = nn.Dropout(dropout)\n        self.linear2 = nn.Linear(d_ffn, d_model)\n        self.dropout3 = nn.Dropout(dropout)\n        self.norm2 = nn.LayerNorm(d_model)\n\n    @staticmethod\n    def with_pos_embed(tensor, pos):\n        return tensor if pos is None else tensor + pos\n\n    def forward_ffn(self, src):\n        src2 = self.linear2(self.dropout2(self.activation(self.linear1(src))))\n        src = src + self.dropout3(src2)\n        src = self.norm2(src)\n        return src\n\n    def forward(self, src, pos, reference_points, spatial_shapes, level_start_index, padding_mask=None):\n        # self attention\n        src2 = self.self_attn(self.with_pos_embed(src, pos), reference_points, src, spatial_shapes, level_start_index, padding_mask)\n        src = src + self.dropout1(src2)\n        src = self.norm1(src)\n\n        # ffn\n        src = self.forward_ffn(src)\n\n        return src\n\n\nclass MSDeformAttnTransformerEncoder(nn.Module):\n    def __init__(self, vl_fusion_layer, encoder_layer, num_layers):\n        super().__init__()\n        self.layers = _get_clones(encoder_layer, num_layers)\n        self.num_layers = num_layers\n\n        self.vl_layers = _get_clones_advanced(vl_fusion_layer, num_layers, 1)\n\n\n    @staticmethod\n    def get_reference_points(spatial_shapes, valid_ratios, device):\n        reference_points_list = []\n        for lvl, (H_, W_) in 
enumerate(spatial_shapes):\n\n            ref_y, ref_x = torch.meshgrid(torch.linspace(0.5, H_ - 0.5, H_, dtype=torch.float32, device=device),\n                                          torch.linspace(0.5, W_ - 0.5, W_, dtype=torch.float32, device=device))\n            ref_y = ref_y.reshape(-1)[None] / (valid_ratios[:, None, lvl, 1] * H_)\n            ref_x = ref_x.reshape(-1)[None] / (valid_ratios[:, None, lvl, 0] * W_)\n            ref = torch.stack((ref_x, ref_y), -1)\n            reference_points_list.append(ref)\n        reference_points = torch.cat(reference_points_list, 1)\n        reference_points = reference_points[:, :, None] * valid_ratios[:, None]\n        return reference_points\n\n    def forward(self, src, spatial_shapes, level_start_index, valid_ratios, pos=None, padding_mask=None, early_fusion=None):\n        \n        if early_fusion:\n            output = {\"visual\": src, \"lang\": early_fusion}\n        else:\n            output = src\n\n        reference_points = self.get_reference_points(spatial_shapes, valid_ratios, device=src.device)\n        for _, (layer,vl_layer) in enumerate(zip(self.layers, self.vl_layers)):\n            if early_fusion:\n                output = vl_layer(output)\n                output[\"visual\"] = layer(output[\"visual\"], pos, reference_points, spatial_shapes, level_start_index, padding_mask)\n            else:\n                output = layer(output, pos, reference_points, spatial_shapes, level_start_index, padding_mask)\n\n        \n        if early_fusion:\n            return output[\"visual\"] ,  (output['lang']['hidden']*0).sum()\n        else:\n            return output, None\n\n\n@SEM_SEG_HEADS_REGISTRY.register()\nclass MaskDINOEncoder(nn.Module):\n    \"\"\"\n    This is the multi-scale encoder in detection models, also named as pixel decoder in segmentation models.\n    \"\"\"\n    @configurable\n    def __init__(\n        self,\n        input_shape: Dict[str, ShapeSpec],\n        *,\n        transformer_dropout: float,\n        transformer_nheads: int,\n        transformer_dim_feedforward: int,\n        transformer_enc_layers: int,\n        conv_dim: int,\n        mask_dim: int,\n        norm: Optional[Union[str, Callable]] = None,\n        # deformable transformer encoder args\n        transformer_in_features: List[str],\n        common_stride: int,\n        num_feature_levels: int,\n        total_num_feature_levels: int,\n        feature_order: str,\n        ViTBackbone: bool,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            input_shape: shapes (channels and stride) of the input features\n            transformer_dropout: dropout probability in transformer\n            transformer_nheads: number of heads in transformer\n            transformer_dim_feedforward: dimension of feedforward network\n            transformer_enc_layers: number of transformer encoder layers\n            conv_dims: number of output channels for the intermediate conv layers.\n            mask_dim: number of output channels for the final conv layer.\n            norm (str or callable): normalization for all conv layers\n            num_feature_levels: feature scales used\n            total_num_feature_levels: total feautre scales used (include the downsampled features)\n            feature_order: 'low2high' or 'high2low', i.e., 'low2high' means low-resolution features are put in the first.\n        \"\"\"\n        super().__init__()\n        transformer_input_shape = {\n            k: v for k, v in 
input_shape.items() if k in transformer_in_features\n        }\n        # this is the input shape of pixel decoder\n        input_shape = sorted(input_shape.items(), key=lambda x: x[1].stride)\n        self.in_features = [k for k, v in input_shape]  # starting from \"res2\" to \"res5\"\n        self.feature_strides = [v.stride for k, v in input_shape]\n        self.feature_channels = [v.channels for k, v in input_shape]\n        self.feature_order = feature_order\n\n        if feature_order == \"low2high\":\n            transformer_input_shape = sorted(transformer_input_shape.items(), key=lambda x: -x[1].stride)\n        else:\n            transformer_input_shape = sorted(transformer_input_shape.items(), key=lambda x: x[1].stride)\n        self.transformer_in_features = [k for k, v in transformer_input_shape]  # starting from \"res2\" to \"res5\"\n        transformer_in_channels = [v.channels for k, v in transformer_input_shape]\n        self.transformer_feature_strides = [v.stride for k, v in transformer_input_shape]  # to decide extra FPN layers\n        self.maskdino_num_feature_levels = num_feature_levels  # always use 3 scales\n        self.total_num_feature_levels = total_num_feature_levels\n        self.common_stride = common_stride\n\n        self.transformer_num_feature_levels = len(self.transformer_in_features)\n        self.low_resolution_index = transformer_in_channels.index(max(transformer_in_channels))\n        self.high_resolution_index = 0 if self.feature_order == 'low2high' else -1\n        \n        self.isViTBackbone = ViTBackbone\n        if not ViTBackbone:\n            if self.transformer_num_feature_levels > 1:\n                input_proj_list = []\n                for in_channels in transformer_in_channels[::-1]:\n                    input_proj_list.append(nn.Sequential(\n                        nn.Conv2d(in_channels, conv_dim, kernel_size=1),\n                        nn.GroupNorm(32, conv_dim),\n                    ))\n                # input projectino for downsample\n                in_channels = max(transformer_in_channels)\n                for _ in range(self.total_num_feature_levels - self.transformer_num_feature_levels):  # exclude the res2\n                    input_proj_list.append(nn.Sequential(\n                        nn.Conv2d(in_channels, conv_dim, kernel_size=3, stride=2, padding=1),\n                        nn.GroupNorm(32, conv_dim),\n                    ))\n                    in_channels = conv_dim\n                self.input_proj = nn.ModuleList(input_proj_list)\n            else:\n                self.input_proj = nn.ModuleList([\n                    nn.Sequential(\n                        nn.Conv2d(transformer_in_channels[-1], conv_dim, kernel_size=1),\n                        nn.GroupNorm(32, conv_dim),\n                    )])\n\n            for proj in self.input_proj:\n                nn.init.xavier_uniform_(proj[0].weight, gain=1)\n                nn.init.constant_(proj[0].bias, 0)\n\n        self.transformer = MSDeformAttnTransformerEncoderOnly(\n            d_model=conv_dim,\n            dropout=transformer_dropout,\n            nhead=transformer_nheads,\n            dim_feedforward=transformer_dim_feedforward,\n            num_encoder_layers=transformer_enc_layers,\n            num_feature_levels=self.total_num_feature_levels,\n        )\n        N_steps = conv_dim // 2\n        self.pe_layer = PositionEmbeddingSine(N_steps, normalize=True)\n\n        self.mask_dim = mask_dim\n        # use 1x1 conv instead\n        self.mask_features = 
Conv2d(\n            conv_dim,\n            mask_dim,\n            kernel_size=1,\n            stride=1,\n            padding=0,\n        )\n        weight_init.c2_xavier_fill(self.mask_features)\n        # extra fpn levels\n        stride = min(self.transformer_feature_strides)\n        self.num_fpn_levels = max(int(np.log2(stride) - np.log2(self.common_stride)), 1)\n\n        lateral_convs = []\n        output_convs = []\n\n        use_bias = norm == \"\"\n        for idx, in_channels in enumerate(self.feature_channels[:self.num_fpn_levels]):\n            lateral_norm = get_norm(norm, conv_dim)\n            output_norm = get_norm(norm, conv_dim)\n\n            lateral_conv = Conv2d(\n                in_channels, conv_dim, kernel_size=1, bias=use_bias, norm=lateral_norm\n            )\n            output_conv = Conv2d(\n                conv_dim,\n                conv_dim,\n                kernel_size=3,\n                stride=1,\n                padding=1,\n                bias=use_bias,\n                norm=output_norm,\n                activation=F.relu,\n            )\n            weight_init.c2_xavier_fill(lateral_conv)\n            weight_init.c2_xavier_fill(output_conv)\n            self.add_module(\"adapter_{}\".format(idx + 1), lateral_conv)\n            self.add_module(\"layer_{}\".format(idx + 1), output_conv)\n\n            lateral_convs.append(lateral_conv)\n            output_convs.append(output_conv)\n        # Place convs into top-down order (from low to high resolution)\n        # to make the top-down computation in forward clearer.\n        self.lateral_convs = lateral_convs[::-1]\n        self.output_convs = output_convs[::-1]\n\n    @classmethod\n    def from_config(cls, cfg, input_shape: Dict[str, ShapeSpec]):\n        ret = {}\n        ret[\"input_shape\"] = {\n            k: v for k, v in input_shape.items() if k in cfg.MODEL.SEM_SEG_HEAD.IN_FEATURES\n        }\n        ret[\"conv_dim\"] = cfg.MODEL.SEM_SEG_HEAD.CONVS_DIM\n        ret[\"mask_dim\"] = cfg.MODEL.SEM_SEG_HEAD.MASK_DIM\n        ret[\"norm\"] = cfg.MODEL.SEM_SEG_HEAD.NORM\n        ret[\"transformer_dropout\"] = cfg.MODEL.MaskDINO.DROPOUT\n        ret[\"transformer_nheads\"] = cfg.MODEL.MaskDINO.NHEADS\n        ret[\"transformer_dim_feedforward\"] = cfg.MODEL.SEM_SEG_HEAD.DIM_FEEDFORWARD  # deformable transformer encoder\n        ret[\n            \"transformer_enc_layers\"\n        ] = cfg.MODEL.SEM_SEG_HEAD.TRANSFORMER_ENC_LAYERS  # a separate config\n        ret[\"transformer_in_features\"] = cfg.MODEL.SEM_SEG_HEAD.DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES  # ['res3', 'res4', 'res5']\n        ret[\"common_stride\"] = cfg.MODEL.SEM_SEG_HEAD.COMMON_STRIDE\n        ret[\"total_num_feature_levels\"] = cfg.MODEL.SEM_SEG_HEAD.TOTAL_NUM_FEATURE_LEVELS\n        ret[\"num_feature_levels\"] = cfg.MODEL.SEM_SEG_HEAD.NUM_FEATURE_LEVELS\n        ret[\"feature_order\"] = cfg.MODEL.SEM_SEG_HEAD.FEATURE_ORDER\n        ret[\"ViTBackbone\"] = cfg.MODEL.BACKBONE.NAME  in ['D2_EVA02', 'D2_EVA01' , 'D2_ViT']\n        return ret\n\n    @autocast(enabled=False)\n    def forward_features(self, features, masks, early_fusion=None):\n        \"\"\"\n        :param features: multi-scale features from the backbone\n        :param masks: image mask\n        :return: enhanced multi-scale features and mask feature (1/4 resolution) for the decoder to produce binary mask\n        \"\"\"\n        # backbone features\n        srcs = []\n        pos = []\n        # additional downsampled features\n        srcsl = []\n        posl = 
[]\n\n        if self.isViTBackbone:\n            for idx, f in enumerate(self.transformer_in_features[::-1]):\n                x = features[f].float()  # deformable detr does not support half precision\n                srcs.append(x)\n                pos.append(self.pe_layer(x))\n            if self.feature_order != 'low2high':\n                srcs = srcs[::-1]\n                pos = pos[::-1]\n        else:\n            if self.total_num_feature_levels > self.transformer_num_feature_levels:\n                smallest_feat = features[self.transformer_in_features[self.low_resolution_index]].float()\n                _len_srcs = self.transformer_num_feature_levels\n                for l in range(_len_srcs, self.total_num_feature_levels):\n                    if l == _len_srcs:\n                        src = self.input_proj[l](smallest_feat)\n                    else:\n                        src = self.input_proj[l](srcsl[-1])\n                    srcsl.append(src)\n                    posl.append(self.pe_layer(src))\n            srcsl = srcsl[::-1]\n            # Reverse feature maps\n            \n\n            for idx, f in enumerate(self.transformer_in_features[::-1]):\n                x = features[f].float()  # deformable detr does not support half precision\n                srcs.append(self.input_proj[idx](x))\n                pos.append(self.pe_layer(x))\n            srcs.extend(srcsl) if self.feature_order == 'low2high' else srcsl.extend(srcs)\n            pos.extend(posl) if self.feature_order == 'low2high' else posl.extend(pos)\n            if self.feature_order != 'low2high':\n                srcs = srcsl\n                pos = posl\n\n        y, spatial_shapes, level_start_index, zero_loss = self.transformer(srcs, masks, pos, early_fusion)\n        bs = y.shape[0]\n\n        split_size_or_sections = [None] * self.total_num_feature_levels\n        for i in range(self.total_num_feature_levels):\n            if i < self.total_num_feature_levels - 1:\n                split_size_or_sections[i] = level_start_index[i + 1] - level_start_index[i]\n            else:\n                split_size_or_sections[i] = y.shape[1] - level_start_index[i]\n        y = torch.split(y, split_size_or_sections, dim=1)\n\n        out = []\n        multi_scale_features = []\n        num_cur_levels = 0\n        for i, z in enumerate(y):\n            out.append(z.transpose(1, 2).view(bs, -1, spatial_shapes[i][0], spatial_shapes[i][1]))\n\n        # append `out` with extra FPN levels\n        # Reverse feature maps into top-down order (from low to high resolution)\n        for idx, f in enumerate(self.in_features[:self.num_fpn_levels][::-1]):\n            x = features[f].float()\n            lateral_conv = self.lateral_convs[idx]\n            output_conv = self.output_convs[idx]\n            cur_fpn = lateral_conv(x)\n            # Following FPN implementation, we use nearest upsampling here\n            y = cur_fpn + F.interpolate(out[self.high_resolution_index], size=cur_fpn.shape[-2:], mode=\"bilinear\", align_corners=False)\n            y = output_conv(y)\n            out.append(y)\n        for o in out:\n            if num_cur_levels < self.total_num_feature_levels:\n                multi_scale_features.append(o)\n                num_cur_levels += 1\n        return self.mask_features(out[-1]), out[0], multi_scale_features, zero_loss\n\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/functions/__init__.py",
    "content": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\nfrom .ms_deform_attn_func import MSDeformAttnFunction\n\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/functions/ms_deform_attn_func.py",
    "content": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\nfrom __future__ import absolute_import\nfrom __future__ import print_function\nfrom __future__ import division\n\nimport torch\nimport torch.nn.functional as F\nfrom torch.autograd import Function\nfrom torch.autograd.function import once_differentiable\n\ntry:\n    import MultiScaleDeformableAttention as MSDA\nexcept ModuleNotFoundError as e:\n    info_string = (\n        \"\\n\\nPlease compile MultiScaleDeformableAttention CUDA op with the following commands:\\n\"\n        \"\\t`cd maskdino/modeling/pixel_decoder/ops`\\n\"\n        \"\\t`sh make.sh`\\n\"\n    )\n    # raise ModuleNotFoundError(info_string)\n\n\nclass MSDeformAttnFunction(Function):\n    @staticmethod\n    def forward(ctx, value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights, im2col_step):\n        ctx.im2col_step = im2col_step\n        output = MSDA.ms_deform_attn_forward(\n            value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights, ctx.im2col_step)\n        ctx.save_for_backward(value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights)\n        return output\n\n    @staticmethod\n    @once_differentiable\n    def backward(ctx, grad_output):\n        value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights = ctx.saved_tensors\n        grad_value, grad_sampling_loc, grad_attn_weight = \\\n            MSDA.ms_deform_attn_backward(\n                value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights, grad_output, ctx.im2col_step)\n\n        return grad_value, None, None, grad_sampling_loc, grad_attn_weight, None\n\n\ndef ms_deform_attn_core_pytorch(value, value_spatial_shapes, sampling_locations, attention_weights):\n    # for debug and test only,\n    # need to use cuda version instead\n    N_, S_, M_, D_ = value.shape\n    _, Lq_, M_, L_, P_, _ = sampling_locations.shape\n    value_list = value.split([H_ * W_ for H_, W_ in value_spatial_shapes], dim=1)\n    sampling_grids = 2 * sampling_locations - 1\n    sampling_value_list = []\n    for lid_, (H_, W_) in enumerate(value_spatial_shapes):\n        # N_, H_*W_, M_, D_ -> N_, H_*W_, M_*D_ -> N_, M_*D_, H_*W_ -> N_*M_, D_, H_, W_\n        value_l_ = value_list[lid_].flatten(2).transpose(1, 2).reshape(N_*M_, D_, H_, W_)\n        # N_, Lq_, M_, P_, 2 -> N_, M_, Lq_, P_, 2 -> N_*M_, Lq_, P_, 2\n        sampling_grid_l_ = sampling_grids[:, :, :, lid_].transpose(1, 2).flatten(0, 1)\n        # N_*M_, D_, Lq_, P_\n        sampling_value_l_ = F.grid_sample(value_l_, sampling_grid_l_,\n                                          mode='bilinear', padding_mode='zeros', align_corners=False)\n        sampling_value_list.append(sampling_value_l_)\n    # (N_, Lq_, M_, L_, P_) -> (N_, M_, Lq_, L_, P_) -> (N_, 
M_, 1, Lq_, L_*P_)\n    attention_weights = attention_weights.transpose(1, 2).reshape(N_*M_, 1, Lq_, L_*P_)\n    output = (torch.stack(sampling_value_list, dim=-2).flatten(-2) * attention_weights).sum(-1).view(N_, M_*D_, Lq_)\n    return output.transpose(1, 2).contiguous()\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/make.sh",
    "content": "#!/usr/bin/env bash\n# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\npython setup.py build install --user\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/modules/__init__.py",
    "content": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\nfrom .ms_deform_attn import MSDeformAttn\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/modules/ms_deform_attn.py",
    "content": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\nfrom __future__ import absolute_import\nfrom __future__ import print_function\nfrom __future__ import division\n\nimport warnings\nimport math\n\nimport torch\nfrom torch import nn\nimport torch.nn.functional as F\nfrom torch.nn.init import xavier_uniform_, constant_\n\nfrom ..functions import MSDeformAttnFunction\nfrom ..functions.ms_deform_attn_func import ms_deform_attn_core_pytorch\n\n\ndef _is_power_of_2(n):\n    if (not isinstance(n, int)) or (n < 0):\n        raise ValueError(\"invalid input for _is_power_of_2: {} (type: {})\".format(n, type(n)))\n    return (n & (n-1) == 0) and n != 0\n\n\nclass MSDeformAttn(nn.Module):\n    def __init__(self, d_model=256, n_levels=4, n_heads=8, n_points=4):\n        \"\"\"\n        Multi-Scale Deformable Attention Module\n        :param d_model      hidden dimension\n        :param n_levels     number of feature levels\n        :param n_heads      number of attention heads\n        :param n_points     number of sampling points per attention head per feature level\n        \"\"\"\n        super().__init__()\n        if d_model % n_heads != 0:\n            raise ValueError('d_model must be divisible by n_heads, but got {} and {}'.format(d_model, n_heads))\n        _d_per_head = d_model // n_heads\n        # you'd better set _d_per_head to a power of 2 which is more efficient in our CUDA implementation\n        if not _is_power_of_2(_d_per_head):\n            warnings.warn(\"You'd better set d_model in MSDeformAttn to make the dimension of each attention head a power of 2 \"\n                          \"which is more efficient in our CUDA implementation.\")\n\n        self.im2col_step = 128\n\n        self.d_model = d_model\n        self.n_levels = n_levels\n        self.n_heads = n_heads\n        self.n_points = n_points\n\n        self.sampling_offsets = nn.Linear(d_model, n_heads * n_levels * n_points * 2)\n        self.attention_weights = nn.Linear(d_model, n_heads * n_levels * n_points)\n        self.value_proj = nn.Linear(d_model, d_model)\n        self.output_proj = nn.Linear(d_model, d_model)\n\n        self._reset_parameters()\n\n    def _reset_parameters(self):\n        constant_(self.sampling_offsets.weight.data, 0.)\n        thetas = torch.arange(self.n_heads, dtype=torch.float32) * (2.0 * math.pi / self.n_heads)\n        grid_init = torch.stack([thetas.cos(), thetas.sin()], -1)\n        grid_init = (grid_init / grid_init.abs().max(-1, keepdim=True)[0]).view(self.n_heads, 1, 1, 2).repeat(1, self.n_levels, self.n_points, 1)\n        for i in range(self.n_points):\n            grid_init[:, :, i, :] *= i + 1\n        with torch.no_grad():\n            self.sampling_offsets.bias = nn.Parameter(grid_init.view(-1))\n        constant_(self.attention_weights.weight.data, 0.)\n        constant_(self.attention_weights.bias.data, 0.)\n        
xavier_uniform_(self.value_proj.weight.data)\n        constant_(self.value_proj.bias.data, 0.)\n        xavier_uniform_(self.output_proj.weight.data)\n        constant_(self.output_proj.bias.data, 0.)\n\n    def forward(self, query, reference_points, input_flatten, input_spatial_shapes, input_level_start_index, input_padding_mask=None):\n        \"\"\"\n        :param query                       (N, Length_{query}, C)\n        :param reference_points            (N, Length_{query}, n_levels, 2), range in [0, 1], top-left (0,0), bottom-right (1, 1), including padding area\n                                        or (N, Length_{query}, n_levels, 4), add additional (w, h) to form reference boxes\n        :param input_flatten               (N, \\sum_{l=0}^{L-1} H_l \\cdot W_l, C)\n        :param input_spatial_shapes        (n_levels, 2), [(H_0, W_0), (H_1, W_1), ..., (H_{L-1}, W_{L-1})]\n        :param input_level_start_index     (n_levels, ), [0, H_0*W_0, H_0*W_0+H_1*W_1, H_0*W_0+H_1*W_1+H_2*W_2, ..., H_0*W_0+H_1*W_1+...+H_{L-1}*W_{L-1}]\n        :param input_padding_mask          (N, \\sum_{l=0}^{L-1} H_l \\cdot W_l), True for padding elements, False for non-padding elements\n\n        :return output                     (N, Length_{query}, C)\n        \"\"\"\n        N, Len_q, _ = query.shape\n        N, Len_in, _ = input_flatten.shape\n        assert (input_spatial_shapes[:, 0] * input_spatial_shapes[:, 1]).sum() == Len_in\n\n        value = self.value_proj(input_flatten)\n        if input_padding_mask is not None:\n            value = value.masked_fill(input_padding_mask[..., None], float(0))\n        value = value.view(N, Len_in, self.n_heads, self.d_model // self.n_heads)\n        sampling_offsets = self.sampling_offsets(query).view(N, Len_q, self.n_heads, self.n_levels, self.n_points, 2)\n        attention_weights = self.attention_weights(query).view(N, Len_q, self.n_heads, self.n_levels * self.n_points)\n        attention_weights = F.softmax(attention_weights, -1).view(N, Len_q, self.n_heads, self.n_levels, self.n_points)\n        # N, Len_q, n_heads, n_levels, n_points, 2\n        if reference_points.shape[-1] == 2:\n            offset_normalizer = torch.stack([input_spatial_shapes[..., 1], input_spatial_shapes[..., 0]], -1)\n            sampling_locations = reference_points[:, :, None, :, None, :] \\\n                                 + sampling_offsets / offset_normalizer[None, None, None, :, None, :]\n        elif reference_points.shape[-1] == 4:\n            sampling_locations = reference_points[:, :, None, :, None, :2] \\\n                                 + sampling_offsets / self.n_points * reference_points[:, :, None, :, None, 2:] * 0.5\n        else:\n            raise ValueError(\n                'Last dim of reference_points must be 2 or 4, but get {} instead.'.format(reference_points.shape[-1]))\n        try:\n            output = MSDeformAttnFunction.apply(\n                value, input_spatial_shapes, input_level_start_index, sampling_locations, attention_weights, self.im2col_step)\n        except:\n            # CPU\n            output = ms_deform_attn_core_pytorch(value, input_spatial_shapes, sampling_locations, attention_weights)\n        # # For FLOPs calculation only\n        # output = ms_deform_attn_core_pytorch(value, input_spatial_shapes, sampling_locations, attention_weights)\n        output = self.output_proj(output)\n        return output\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/setup.py",
    "content": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\nimport os\nimport glob\n\nimport torch\n\nfrom torch.utils.cpp_extension import CUDA_HOME\nfrom torch.utils.cpp_extension import CppExtension\nfrom torch.utils.cpp_extension import CUDAExtension\n\nfrom setuptools import find_packages\nfrom setuptools import setup\n\nrequirements = [\"torch\", \"torchvision\"]\n\ndef get_extensions():\n    this_dir = os.path.dirname(os.path.abspath(__file__))\n    extensions_dir = os.path.join(this_dir, \"src\")\n\n    main_file = glob.glob(os.path.join(extensions_dir, \"*.cpp\"))\n    source_cpu = glob.glob(os.path.join(extensions_dir, \"cpu\", \"*.cpp\"))\n    source_cuda = glob.glob(os.path.join(extensions_dir, \"cuda\", \"*.cu\"))\n\n    sources = main_file + source_cpu\n    extension = CppExtension\n    extra_compile_args = {\"cxx\": []}\n    define_macros = []\n\n    # Force cuda since torch ask for a device, not if cuda is in fact available.\n    if (os.environ.get('FORCE_CUDA') or torch.cuda.is_available()) and CUDA_HOME is not None:\n        extension = CUDAExtension\n        sources += source_cuda\n        define_macros += [(\"WITH_CUDA\", None)]\n        extra_compile_args[\"nvcc\"] = [\n            \"-DCUDA_HAS_FP16=1\",\n            \"-D__CUDA_NO_HALF_OPERATORS__\",\n            \"-D__CUDA_NO_HALF_CONVERSIONS__\",\n            \"-D__CUDA_NO_HALF2_OPERATORS__\",\n        ]\n    else:\n        if CUDA_HOME is None:\n            raise NotImplementedError('CUDA_HOME is None. Please set environment variable CUDA_HOME.')\n        else:\n            raise NotImplementedError('No CUDA runtime is found. Please set FORCE_CUDA=1 or test it by running torch.cuda.is_available().')\n\n    sources = [os.path.join(extensions_dir, s) for s in sources]\n    include_dirs = [extensions_dir]\n    ext_modules = [\n        extension(\n            \"MultiScaleDeformableAttention\",\n            sources,\n            include_dirs=include_dirs,\n            define_macros=define_macros,\n            extra_compile_args=extra_compile_args,\n        )\n    ]\n    return ext_modules\n\nsetup(\n    name=\"MultiScaleDeformableAttention\",\n    version=\"1.0\",\n    author=\"Weijie Su\",\n    url=\"https://github.com/fundamentalvision/Deformable-DETR\",\n    description=\"PyTorch Wrapper for CUDA Functions of Multi-Scale Deformable Attention\",\n    packages=find_packages(exclude=(\"configs\", \"tests\",)),\n    ext_modules=get_extensions(),\n    cmdclass={\"build_ext\": torch.utils.cpp_extension.BuildExtension},\n)\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/src/cpu/ms_deform_attn_cpu.cpp",
    "content": "/*!\n**************************************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************************************\n* Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n**************************************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#include <vector>\n\n#include <ATen/ATen.h>\n#include <ATen/cuda/CUDAContext.h>\n\n\nat::Tensor\nms_deform_attn_cpu_forward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const int im2col_step)\n{\n    AT_ERROR(\"Not implement on cpu\");\n}\n\nstd::vector<at::Tensor>\nms_deform_attn_cpu_backward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const at::Tensor &grad_output,\n    const int im2col_step)\n{\n    AT_ERROR(\"Not implement on cpu\");\n}\n\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/src/cpu/ms_deform_attn_cpu.h",
    "content": "/*!\n**************************************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************************************\n* Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n**************************************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#pragma once\n#include <torch/extension.h>\n\nat::Tensor\nms_deform_attn_cpu_forward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const int im2col_step);\n\nstd::vector<at::Tensor>\nms_deform_attn_cpu_backward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const at::Tensor &grad_output,\n    const int im2col_step);\n\n\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/src/cuda/ms_deform_attn_cuda.cu",
    "content": "/*!\n**************************************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************************************\n* Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n**************************************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#include <vector>\n#include \"cuda/ms_deform_im2col_cuda.cuh\"\n\n#include <ATen/ATen.h>\n#include <ATen/cuda/CUDAContext.h>\n#include <cuda.h>\n#include <cuda_runtime.h>\n\n\nat::Tensor ms_deform_attn_cuda_forward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const int im2col_step)\n{\n    AT_ASSERTM(value.is_contiguous(), \"value tensor has to be contiguous\");\n    AT_ASSERTM(spatial_shapes.is_contiguous(), \"spatial_shapes tensor has to be contiguous\");\n    AT_ASSERTM(level_start_index.is_contiguous(), \"level_start_index tensor has to be contiguous\");\n    AT_ASSERTM(sampling_loc.is_contiguous(), \"sampling_loc tensor has to be contiguous\");\n    AT_ASSERTM(attn_weight.is_contiguous(), \"attn_weight tensor has to be contiguous\");\n\n    AT_ASSERTM(value.type().is_cuda(), \"value must be a CUDA tensor\");\n    AT_ASSERTM(spatial_shapes.type().is_cuda(), \"spatial_shapes must be a CUDA tensor\");\n    AT_ASSERTM(level_start_index.type().is_cuda(), \"level_start_index must be a CUDA tensor\");\n    AT_ASSERTM(sampling_loc.type().is_cuda(), \"sampling_loc must be a CUDA tensor\");\n    AT_ASSERTM(attn_weight.type().is_cuda(), \"attn_weight must be a CUDA tensor\");\n\n    const int batch = value.size(0);\n    const int spatial_size = value.size(1);\n    const int num_heads = value.size(2);\n    const int channels = value.size(3);\n\n    const int num_levels = spatial_shapes.size(0);\n\n    const int num_query = sampling_loc.size(1);\n    const int num_point = sampling_loc.size(4);\n\n    const int im2col_step_ = std::min(batch, im2col_step);\n\n    AT_ASSERTM(batch % im2col_step_ == 0, \"batch(%d) must divide im2col_step(%d)\", batch, im2col_step_);\n    \n    auto output = at::zeros({batch, num_query, num_heads, channels}, value.options());\n\n    const int batch_n = im2col_step_;\n    auto output_n = output.view({batch/im2col_step_, batch_n, num_query, num_heads, channels});\n    auto per_value_size = spatial_size * num_heads * channels;\n    auto per_sample_loc_size = num_query * num_heads * num_levels * num_point * 2;\n    auto per_attn_weight_size = num_query * num_heads * num_levels * num_point;\n    for (int n = 0; n < batch/im2col_step_; ++n)\n    {\n        auto columns = output_n.select(0, n);\n        AT_DISPATCH_FLOATING_TYPES(value.type(), \"ms_deform_attn_forward_cuda\", ([&] {\n            ms_deformable_im2col_cuda(at::cuda::getCurrentCUDAStream(),\n                value.data<scalar_t>() + n * im2col_step_ * per_value_size,\n                spatial_shapes.data<int64_t>(),\n                level_start_index.data<int64_t>(),\n                sampling_loc.data<scalar_t>() + n * im2col_step_ * per_sample_loc_size,\n      
          attn_weight.data<scalar_t>() + n * im2col_step_ * per_attn_weight_size,\n                batch_n, spatial_size, num_heads, channels, num_levels, num_query, num_point,\n                columns.data<scalar_t>());\n\n        }));\n    }\n\n    output = output.view({batch, num_query, num_heads*channels});\n\n    return output;\n}\n\n\nstd::vector<at::Tensor> ms_deform_attn_cuda_backward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const at::Tensor &grad_output,\n    const int im2col_step)\n{\n\n    AT_ASSERTM(value.is_contiguous(), \"value tensor has to be contiguous\");\n    AT_ASSERTM(spatial_shapes.is_contiguous(), \"spatial_shapes tensor has to be contiguous\");\n    AT_ASSERTM(level_start_index.is_contiguous(), \"level_start_index tensor has to be contiguous\");\n    AT_ASSERTM(sampling_loc.is_contiguous(), \"sampling_loc tensor has to be contiguous\");\n    AT_ASSERTM(attn_weight.is_contiguous(), \"attn_weight tensor has to be contiguous\");\n    AT_ASSERTM(grad_output.is_contiguous(), \"grad_output tensor has to be contiguous\");\n\n    AT_ASSERTM(value.type().is_cuda(), \"value must be a CUDA tensor\");\n    AT_ASSERTM(spatial_shapes.type().is_cuda(), \"spatial_shapes must be a CUDA tensor\");\n    AT_ASSERTM(level_start_index.type().is_cuda(), \"level_start_index must be a CUDA tensor\");\n    AT_ASSERTM(sampling_loc.type().is_cuda(), \"sampling_loc must be a CUDA tensor\");\n    AT_ASSERTM(attn_weight.type().is_cuda(), \"attn_weight must be a CUDA tensor\");\n    AT_ASSERTM(grad_output.type().is_cuda(), \"grad_output must be a CUDA tensor\");\n\n    const int batch = value.size(0);\n    const int spatial_size = value.size(1);\n    const int num_heads = value.size(2);\n    const int channels = value.size(3);\n\n    const int num_levels = spatial_shapes.size(0);\n\n    const int num_query = sampling_loc.size(1);\n    const int num_point = sampling_loc.size(4);\n\n    const int im2col_step_ = std::min(batch, im2col_step);\n\n    AT_ASSERTM(batch % im2col_step_ == 0, \"batch(%d) must divide im2col_step(%d)\", batch, im2col_step_);\n\n    auto grad_value = at::zeros_like(value);\n    auto grad_sampling_loc = at::zeros_like(sampling_loc);\n    auto grad_attn_weight = at::zeros_like(attn_weight);\n\n    const int batch_n = im2col_step_;\n    auto per_value_size = spatial_size * num_heads * channels;\n    auto per_sample_loc_size = num_query * num_heads * num_levels * num_point * 2;\n    auto per_attn_weight_size = num_query * num_heads * num_levels * num_point;\n    auto grad_output_n = grad_output.view({batch/im2col_step_, batch_n, num_query, num_heads, channels});\n    \n    for (int n = 0; n < batch/im2col_step_; ++n)\n    {\n        auto grad_output_g = grad_output_n.select(0, n);\n        AT_DISPATCH_FLOATING_TYPES(value.type(), \"ms_deform_attn_backward_cuda\", ([&] {\n            ms_deformable_col2im_cuda(at::cuda::getCurrentCUDAStream(),\n                                    grad_output_g.data<scalar_t>(),\n                                    value.data<scalar_t>() + n * im2col_step_ * per_value_size,\n                                    spatial_shapes.data<int64_t>(),\n                                    level_start_index.data<int64_t>(),\n                                    sampling_loc.data<scalar_t>() + n * im2col_step_ * per_sample_loc_size,\n                                    attn_weight.data<scalar_t>() + n * im2col_step_ 
* per_attn_weight_size,\n                                    batch_n, spatial_size, num_heads, channels, num_levels, num_query, num_point,\n                                    grad_value.data<scalar_t>() +  n * im2col_step_ * per_value_size,\n                                    grad_sampling_loc.data<scalar_t>() + n * im2col_step_ * per_sample_loc_size,\n                                    grad_attn_weight.data<scalar_t>() + n * im2col_step_ * per_attn_weight_size);\n\n        }));\n    }\n\n    return {\n        grad_value, grad_sampling_loc, grad_attn_weight\n    };\n}"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/src/cuda/ms_deform_attn_cuda.h",
    "content": "/*!\n**************************************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************************************\n* Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n**************************************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#pragma once\n#include <torch/extension.h>\n\nat::Tensor ms_deform_attn_cuda_forward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const int im2col_step);\n\nstd::vector<at::Tensor> ms_deform_attn_cuda_backward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const at::Tensor &grad_output,\n    const int im2col_step);\n\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/src/cuda/ms_deform_im2col_cuda.cuh",
    "content": "/*!\n**************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************\n* Modified from DCN (https://github.com/msracver/Deformable-ConvNets)\n* Copyright (c) 2018 Microsoft\n**************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#include <cstdio>\n#include <algorithm>\n#include <cstring>\n\n#include <ATen/ATen.h>\n#include <ATen/cuda/CUDAContext.h>\n\n#include <THC/THCAtomics.cuh>\n\n#define CUDA_KERNEL_LOOP(i, n)                          \\\n  for (int i = blockIdx.x * blockDim.x + threadIdx.x;   \\\n      i < (n);                                          \\\n      i += blockDim.x * gridDim.x)\n\nconst int CUDA_NUM_THREADS = 1024;\ninline int GET_BLOCKS(const int N, const int num_threads)\n{\n  return (N + num_threads - 1) / num_threads;\n}\n\n\ntemplate <typename scalar_t>\n__device__ scalar_t ms_deform_attn_im2col_bilinear(const scalar_t* &bottom_data, \n                                                   const int &height, const int &width, const int &nheads, const int &channels,\n                                                   const scalar_t &h, const scalar_t &w, const int &m, const int &c)\n{\n  const int h_low = floor(h);\n  const int w_low = floor(w);\n  const int h_high = h_low + 1;\n  const int w_high = w_low + 1;\n\n  const scalar_t lh = h - h_low;\n  const scalar_t lw = w - w_low;\n  const scalar_t hh = 1 - lh, hw = 1 - lw;\n\n  const int w_stride = nheads * channels;\n  const int h_stride = width * w_stride;\n  const int h_low_ptr_offset = h_low * h_stride;\n  const int h_high_ptr_offset = h_low_ptr_offset + h_stride;\n  const int w_low_ptr_offset = w_low * w_stride;\n  const int w_high_ptr_offset = w_low_ptr_offset + w_stride;\n  const int base_ptr = m * channels + c;\n\n  scalar_t v1 = 0;\n  if (h_low >= 0 && w_low >= 0)\n  {\n    const int ptr1 = h_low_ptr_offset + w_low_ptr_offset + base_ptr;\n    v1 = bottom_data[ptr1];\n  }\n  scalar_t v2 = 0;\n  if (h_low >= 0 && w_high <= width - 1)\n  {\n    const int ptr2 = h_low_ptr_offset + w_high_ptr_offset + base_ptr;\n    v2 = bottom_data[ptr2];\n  }\n  scalar_t v3 = 0;\n  if (h_high <= height - 1 && w_low >= 0)\n  {\n    const int ptr3 = h_high_ptr_offset + w_low_ptr_offset + base_ptr;\n    v3 = bottom_data[ptr3];\n  }\n  scalar_t v4 = 0;\n  if (h_high <= height - 1 && w_high <= width - 1)\n  {\n    const int ptr4 = h_high_ptr_offset + w_high_ptr_offset + base_ptr;\n    v4 = bottom_data[ptr4];\n  }\n\n  const scalar_t w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;\n\n  const scalar_t val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);\n  return val;\n}\n\n\ntemplate <typename scalar_t>\n__device__ void ms_deform_attn_col2im_bilinear(const scalar_t* &bottom_data, \n                                                   const int &height, const int &width, const int &nheads, const int &channels,\n                                                   const scalar_t &h, const scalar_t &w, const int &m, const int &c,\n                                                   const scalar_t &top_grad,\n                                                   const scalar_t &attn_weight,\n                                          
         scalar_t* &grad_value, \n                                                   scalar_t* grad_sampling_loc,\n                                                   scalar_t* grad_attn_weight)\n{\n  const int h_low = floor(h);\n  const int w_low = floor(w);\n  const int h_high = h_low + 1;\n  const int w_high = w_low + 1;\n\n  const scalar_t lh = h - h_low;\n  const scalar_t lw = w - w_low;\n  const scalar_t hh = 1 - lh, hw = 1 - lw;\n\n  const int w_stride = nheads * channels;\n  const int h_stride = width * w_stride;\n  const int h_low_ptr_offset = h_low * h_stride;\n  const int h_high_ptr_offset = h_low_ptr_offset + h_stride;\n  const int w_low_ptr_offset = w_low * w_stride;\n  const int w_high_ptr_offset = w_low_ptr_offset + w_stride;\n  const int base_ptr = m * channels + c;\n\n  const scalar_t w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;\n  const scalar_t top_grad_value = top_grad * attn_weight;\n  scalar_t grad_h_weight = 0, grad_w_weight = 0;\n\n  scalar_t v1 = 0;\n  if (h_low >= 0 && w_low >= 0)\n  {\n    const int ptr1 = h_low_ptr_offset + w_low_ptr_offset + base_ptr;\n    v1 = bottom_data[ptr1];\n    grad_h_weight -= hw * v1;\n    grad_w_weight -= hh * v1;\n    atomicAdd(grad_value+ptr1, w1*top_grad_value);\n  }\n  scalar_t v2 = 0;\n  if (h_low >= 0 && w_high <= width - 1)\n  {\n    const int ptr2 = h_low_ptr_offset + w_high_ptr_offset + base_ptr;\n    v2 = bottom_data[ptr2];\n    grad_h_weight -= lw * v2;\n    grad_w_weight += hh * v2;\n    atomicAdd(grad_value+ptr2, w2*top_grad_value);\n  }\n  scalar_t v3 = 0;\n  if (h_high <= height - 1 && w_low >= 0)\n  {\n    const int ptr3 = h_high_ptr_offset + w_low_ptr_offset + base_ptr;\n    v3 = bottom_data[ptr3];\n    grad_h_weight += hw * v3;\n    grad_w_weight -= lh * v3;\n    atomicAdd(grad_value+ptr3, w3*top_grad_value); \n  }\n  scalar_t v4 = 0;\n  if (h_high <= height - 1 && w_high <= width - 1)\n  {\n    const int ptr4 = h_high_ptr_offset + w_high_ptr_offset + base_ptr;\n    v4 = bottom_data[ptr4];\n    grad_h_weight += lw * v4;\n    grad_w_weight += lh * v4;\n    atomicAdd(grad_value+ptr4, w4*top_grad_value);\n  }\n\n  const scalar_t val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);\n  *grad_attn_weight = top_grad * val;\n  *grad_sampling_loc = width * grad_w_weight * top_grad_value;\n  *(grad_sampling_loc + 1) = height * grad_h_weight * top_grad_value;\n}\n\n\ntemplate <typename scalar_t>\n__device__ void ms_deform_attn_col2im_bilinear_gm(const scalar_t* &bottom_data, \n                                                   const int &height, const int &width, const int &nheads, const int &channels,\n                                                   const scalar_t &h, const scalar_t &w, const int &m, const int &c,\n                                                   const scalar_t &top_grad,\n                                                   const scalar_t &attn_weight,\n                                                   scalar_t* &grad_value, \n                                                   scalar_t* grad_sampling_loc,\n                                                   scalar_t* grad_attn_weight)\n{\n  const int h_low = floor(h);\n  const int w_low = floor(w);\n  const int h_high = h_low + 1;\n  const int w_high = w_low + 1;\n\n  const scalar_t lh = h - h_low;\n  const scalar_t lw = w - w_low;\n  const scalar_t hh = 1 - lh, hw = 1 - lw;\n\n  const int w_stride = nheads * channels;\n  const int h_stride = width * w_stride;\n  const int h_low_ptr_offset = h_low * h_stride;\n  const int h_high_ptr_offset = 
h_low_ptr_offset + h_stride;\n  const int w_low_ptr_offset = w_low * w_stride;\n  const int w_high_ptr_offset = w_low_ptr_offset + w_stride;\n  const int base_ptr = m * channels + c;\n\n  const scalar_t w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;\n  const scalar_t top_grad_value = top_grad * attn_weight;\n  scalar_t grad_h_weight = 0, grad_w_weight = 0;\n\n  scalar_t v1 = 0;\n  if (h_low >= 0 && w_low >= 0)\n  {\n    const int ptr1 = h_low_ptr_offset + w_low_ptr_offset + base_ptr;\n    v1 = bottom_data[ptr1];\n    grad_h_weight -= hw * v1;\n    grad_w_weight -= hh * v1;\n    atomicAdd(grad_value+ptr1, w1*top_grad_value);\n  }\n  scalar_t v2 = 0;\n  if (h_low >= 0 && w_high <= width - 1)\n  {\n    const int ptr2 = h_low_ptr_offset + w_high_ptr_offset + base_ptr;\n    v2 = bottom_data[ptr2];\n    grad_h_weight -= lw * v2;\n    grad_w_weight += hh * v2;\n    atomicAdd(grad_value+ptr2, w2*top_grad_value);\n  }\n  scalar_t v3 = 0;\n  if (h_high <= height - 1 && w_low >= 0)\n  {\n    const int ptr3 = h_high_ptr_offset + w_low_ptr_offset + base_ptr;\n    v3 = bottom_data[ptr3];\n    grad_h_weight += hw * v3;\n    grad_w_weight -= lh * v3;\n    atomicAdd(grad_value+ptr3, w3*top_grad_value); \n  }\n  scalar_t v4 = 0;\n  if (h_high <= height - 1 && w_high <= width - 1)\n  {\n    const int ptr4 = h_high_ptr_offset + w_high_ptr_offset + base_ptr;\n    v4 = bottom_data[ptr4];\n    grad_h_weight += lw * v4;\n    grad_w_weight += lh * v4;\n    atomicAdd(grad_value+ptr4, w4*top_grad_value);\n  }\n\n  const scalar_t val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);\n  atomicAdd(grad_attn_weight, top_grad * val); \n  atomicAdd(grad_sampling_loc, width * grad_w_weight * top_grad_value);\n  atomicAdd(grad_sampling_loc + 1, height * grad_h_weight * top_grad_value);\n}\n\n\ntemplate <typename scalar_t>\n__global__ void ms_deformable_im2col_gpu_kernel(const int n,\n                                                const scalar_t *data_value, \n                                                const int64_t *data_spatial_shapes,\n                                                const int64_t *data_level_start_index, \n                                                const scalar_t *data_sampling_loc,\n                                                const scalar_t *data_attn_weight,\n                                                const int batch_size, \n                                                const int spatial_size, \n                                                const int num_heads,\n                                                const int channels, \n                                                const int num_levels,\n                                                const int num_query,\n                                                const int num_point,\n                                                scalar_t *data_col)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    int _temp = index;\n    const int c_col = _temp % channels;\n    _temp /= channels;\n    const int sampling_index = _temp; \n    const int m_col = _temp % num_heads;\n    _temp /= num_heads;\n    const int q_col = _temp % num_query;\n    _temp /= num_query;\n    const int b_col = _temp;\n\n    scalar_t *data_col_ptr = data_col + index;\n    int data_weight_ptr = sampling_index * num_levels * num_point;\n    int data_loc_w_ptr = data_weight_ptr << 1;\n    const int qid_stride = num_heads * channels;\n    const int data_value_ptr_init_offset = b_col * spatial_size * qid_stride;\n    scalar_t col = 0;\n    \n    for (int l_col=0; l_col < 
num_levels; ++l_col)\n    {\n      const int level_start_id = data_level_start_index[l_col];\n      const int spatial_h_ptr = l_col << 1;\n      const int spatial_h = data_spatial_shapes[spatial_h_ptr];\n      const int spatial_w = data_spatial_shapes[spatial_h_ptr + 1];\n      const scalar_t *data_value_ptr = data_value + (data_value_ptr_init_offset + level_start_id * qid_stride);\n      for (int p_col=0; p_col < num_point; ++p_col)\n      {\n        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];\n        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];\n        const scalar_t weight = data_attn_weight[data_weight_ptr];\n\n        const scalar_t h_im = loc_h * spatial_h - 0.5;\n        const scalar_t w_im = loc_w * spatial_w - 0.5;\n\n        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < spatial_w)\n        {\n          col += ms_deform_attn_im2col_bilinear(data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col) * weight;\n        }\n\n        data_weight_ptr += 1;\n        data_loc_w_ptr += 2;\n      }\n    }\n    *data_col_ptr = col;\n  }\n}\n\ntemplate <typename scalar_t, unsigned int blockSize>\n__global__ void ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1(const int n,\n                                                const scalar_t *grad_col,\n                                                const scalar_t *data_value,\n                                                const int64_t *data_spatial_shapes,\n                                                const int64_t *data_level_start_index, \n                                                const scalar_t *data_sampling_loc,\n                                                const scalar_t *data_attn_weight,\n                                                const int batch_size, \n                                                const int spatial_size, \n                                                const int num_heads,\n                                                const int channels, \n                                                const int num_levels,\n                                                const int num_query,\n                                                const int num_point,\n                                                scalar_t *grad_value,\n                                                scalar_t *grad_sampling_loc,\n                                                scalar_t *grad_attn_weight)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    __shared__ scalar_t cache_grad_sampling_loc[blockSize * 2];\n    __shared__ scalar_t cache_grad_attn_weight[blockSize];\n    unsigned int tid = threadIdx.x;\n    int _temp = index;\n    const int c_col = _temp % channels;\n    _temp /= channels;\n    const int sampling_index = _temp; \n    const int m_col = _temp % num_heads;\n    _temp /= num_heads;\n    const int q_col = _temp % num_query;\n    _temp /= num_query;\n    const int b_col = _temp;\n\n    const scalar_t top_grad = grad_col[index];\n\n    int data_weight_ptr = sampling_index * num_levels * num_point;\n    int data_loc_w_ptr = data_weight_ptr << 1;\n    const int grad_sampling_ptr = data_weight_ptr;\n    grad_sampling_loc += grad_sampling_ptr << 1;\n    grad_attn_weight += grad_sampling_ptr;\n    const int grad_weight_stride = 1;\n    const int grad_loc_stride = 2;\n    const int qid_stride = num_heads * channels;\n    const int data_value_ptr_init_offset = b_col * spatial_size * qid_stride;\n\n    for (int l_col=0; l_col < num_levels; 
++l_col)\n    {\n      const int level_start_id = data_level_start_index[l_col];\n      const int spatial_h_ptr = l_col << 1;\n      const int spatial_h = data_spatial_shapes[spatial_h_ptr];\n      const int spatial_w = data_spatial_shapes[spatial_h_ptr + 1];\n      const int value_ptr_offset = data_value_ptr_init_offset + level_start_id * qid_stride;\n      const scalar_t *data_value_ptr = data_value + value_ptr_offset;\n      scalar_t *grad_value_ptr = grad_value + value_ptr_offset;\n\n      for (int p_col=0; p_col < num_point; ++p_col)\n      {\n        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];\n        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];\n        const scalar_t weight = data_attn_weight[data_weight_ptr];\n\n        const scalar_t h_im = loc_h * spatial_h - 0.5;\n        const scalar_t w_im = loc_w * spatial_w - 0.5;\n        *(cache_grad_sampling_loc+(threadIdx.x << 1)) = 0;\n        *(cache_grad_sampling_loc+((threadIdx.x << 1) + 1)) = 0;\n        *(cache_grad_attn_weight+threadIdx.x)=0;\n        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < spatial_w)\n        {\n          ms_deform_attn_col2im_bilinear(\n            data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col,\n            top_grad, weight, grad_value_ptr, \n            cache_grad_sampling_loc+(threadIdx.x << 1), cache_grad_attn_weight+threadIdx.x);\n        }\n        \n        __syncthreads();\n        if (tid == 0)\n        {\n          scalar_t _grad_w=cache_grad_sampling_loc[0], _grad_h=cache_grad_sampling_loc[1], _grad_a=cache_grad_attn_weight[0];\n          int sid=2;\n          for (unsigned int tid = 1; tid < blockSize; ++tid)\n          {\n            _grad_w += cache_grad_sampling_loc[sid];\n            _grad_h += cache_grad_sampling_loc[sid + 1];\n            _grad_a += cache_grad_attn_weight[tid];\n            sid += 2;\n          }\n          \n          \n          *grad_sampling_loc = _grad_w;\n          *(grad_sampling_loc + 1) = _grad_h;\n          *grad_attn_weight = _grad_a;\n        }\n        __syncthreads();\n\n        data_weight_ptr += 1;\n        data_loc_w_ptr += 2;\n        grad_attn_weight += grad_weight_stride;\n        grad_sampling_loc += grad_loc_stride;\n      }\n    }\n  }\n}\n\n\ntemplate <typename scalar_t, unsigned int blockSize>\n__global__ void ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v2(const int n,\n                                                const scalar_t *grad_col,\n                                                const scalar_t *data_value,\n                                                const int64_t *data_spatial_shapes,\n                                                const int64_t *data_level_start_index, \n                                                const scalar_t *data_sampling_loc,\n                                                const scalar_t *data_attn_weight,\n                                                const int batch_size, \n                                                const int spatial_size, \n                                                const int num_heads,\n                                                const int channels, \n                                                const int num_levels,\n                                                const int num_query,\n                                                const int num_point,\n                                                scalar_t *grad_value,\n                                            
    scalar_t *grad_sampling_loc,\n                                                scalar_t *grad_attn_weight)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    __shared__ scalar_t cache_grad_sampling_loc[blockSize * 2];\n    __shared__ scalar_t cache_grad_attn_weight[blockSize];\n    unsigned int tid = threadIdx.x;\n    int _temp = index;\n    const int c_col = _temp % channels;\n    _temp /= channels;\n    const int sampling_index = _temp; \n    const int m_col = _temp % num_heads;\n    _temp /= num_heads;\n    const int q_col = _temp % num_query;\n    _temp /= num_query;\n    const int b_col = _temp;\n\n    const scalar_t top_grad = grad_col[index];\n\n    int data_weight_ptr = sampling_index * num_levels * num_point;\n    int data_loc_w_ptr = data_weight_ptr << 1;\n    const int grad_sampling_ptr = data_weight_ptr;\n    grad_sampling_loc += grad_sampling_ptr << 1;\n    grad_attn_weight += grad_sampling_ptr;\n    const int grad_weight_stride = 1;\n    const int grad_loc_stride = 2;\n    const int qid_stride = num_heads * channels;\n    const int data_value_ptr_init_offset = b_col * spatial_size * qid_stride;\n\n    for (int l_col=0; l_col < num_levels; ++l_col)\n    {\n      const int level_start_id = data_level_start_index[l_col];\n      const int spatial_h_ptr = l_col << 1;\n      const int spatial_h = data_spatial_shapes[spatial_h_ptr];\n      const int spatial_w = data_spatial_shapes[spatial_h_ptr + 1];\n      const int value_ptr_offset = data_value_ptr_init_offset + level_start_id * qid_stride;\n      const scalar_t *data_value_ptr = data_value + value_ptr_offset;\n      scalar_t *grad_value_ptr = grad_value + value_ptr_offset;\n\n      for (int p_col=0; p_col < num_point; ++p_col)\n      {\n        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];\n        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];\n        const scalar_t weight = data_attn_weight[data_weight_ptr];\n\n        const scalar_t h_im = loc_h * spatial_h - 0.5;\n        const scalar_t w_im = loc_w * spatial_w - 0.5;\n        *(cache_grad_sampling_loc+(threadIdx.x << 1)) = 0;\n        *(cache_grad_sampling_loc+((threadIdx.x << 1) + 1)) = 0;\n        *(cache_grad_attn_weight+threadIdx.x)=0;\n        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < spatial_w)\n        {\n          ms_deform_attn_col2im_bilinear(\n            data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col,\n            top_grad, weight, grad_value_ptr, \n            cache_grad_sampling_loc+(threadIdx.x << 1), cache_grad_attn_weight+threadIdx.x);\n        }\n        \n        __syncthreads();\n\n        for (unsigned int s=blockSize/2; s>0; s>>=1)\n        {\n          if (tid < s) {\n            const unsigned int xid1 = tid << 1;\n            const unsigned int xid2 = (tid + s) << 1;\n            cache_grad_attn_weight[tid] += cache_grad_attn_weight[tid + s];\n            cache_grad_sampling_loc[xid1] += cache_grad_sampling_loc[xid2];\n            cache_grad_sampling_loc[xid1 + 1] += cache_grad_sampling_loc[xid2 + 1];\n          }\n          __syncthreads();\n        }\n\n        if (tid == 0)\n        { \n          *grad_sampling_loc = cache_grad_sampling_loc[0];\n          *(grad_sampling_loc + 1) = cache_grad_sampling_loc[1];\n          *grad_attn_weight = cache_grad_attn_weight[0];\n        }\n        __syncthreads();\n\n        data_weight_ptr += 1;\n        data_loc_w_ptr += 2;\n        grad_attn_weight += grad_weight_stride;\n        grad_sampling_loc += grad_loc_stride;\n      }\n    
}\n  }\n}\n\n\ntemplate <typename scalar_t>\n__global__ void ms_deformable_col2im_gpu_kernel_shm_reduce_v1(const int n,\n                                                const scalar_t *grad_col,\n                                                const scalar_t *data_value,\n                                                const int64_t *data_spatial_shapes,\n                                                const int64_t *data_level_start_index, \n                                                const scalar_t *data_sampling_loc,\n                                                const scalar_t *data_attn_weight,\n                                                const int batch_size, \n                                                const int spatial_size, \n                                                const int num_heads,\n                                                const int channels, \n                                                const int num_levels,\n                                                const int num_query,\n                                                const int num_point,\n                                                scalar_t *grad_value,\n                                                scalar_t *grad_sampling_loc,\n                                                scalar_t *grad_attn_weight)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    extern __shared__ int _s[];\n    scalar_t* cache_grad_sampling_loc = (scalar_t*)_s;\n    scalar_t* cache_grad_attn_weight = cache_grad_sampling_loc + 2 * blockDim.x;\n    unsigned int tid = threadIdx.x;\n    int _temp = index;\n    const int c_col = _temp % channels;\n    _temp /= channels;\n    const int sampling_index = _temp; \n    const int m_col = _temp % num_heads;\n    _temp /= num_heads;\n    const int q_col = _temp % num_query;\n    _temp /= num_query;\n    const int b_col = _temp;\n\n    const scalar_t top_grad = grad_col[index];\n\n    int data_weight_ptr = sampling_index * num_levels * num_point;\n    int data_loc_w_ptr = data_weight_ptr << 1;\n    const int grad_sampling_ptr = data_weight_ptr;\n    grad_sampling_loc += grad_sampling_ptr << 1;\n    grad_attn_weight += grad_sampling_ptr;\n    const int grad_weight_stride = 1;\n    const int grad_loc_stride = 2;\n    const int qid_stride = num_heads * channels;\n    const int data_value_ptr_init_offset = b_col * spatial_size * qid_stride;\n\n    for (int l_col=0; l_col < num_levels; ++l_col)\n    {\n      const int level_start_id = data_level_start_index[l_col];\n      const int spatial_h_ptr = l_col << 1;\n      const int spatial_h = data_spatial_shapes[spatial_h_ptr];\n      const int spatial_w = data_spatial_shapes[spatial_h_ptr + 1];\n      const int value_ptr_offset = data_value_ptr_init_offset + level_start_id * qid_stride;\n      const scalar_t *data_value_ptr = data_value + value_ptr_offset;\n      scalar_t *grad_value_ptr = grad_value + value_ptr_offset;\n\n      for (int p_col=0; p_col < num_point; ++p_col)\n      {\n        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];\n        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];\n        const scalar_t weight = data_attn_weight[data_weight_ptr];\n\n        const scalar_t h_im = loc_h * spatial_h - 0.5;\n        const scalar_t w_im = loc_w * spatial_w - 0.5;\n        *(cache_grad_sampling_loc+(threadIdx.x << 1)) = 0;\n        *(cache_grad_sampling_loc+((threadIdx.x << 1) + 1)) = 0;\n        *(cache_grad_attn_weight+threadIdx.x)=0;\n        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < 
spatial_w)\n        {\n          ms_deform_attn_col2im_bilinear(\n            data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col,\n            top_grad, weight, grad_value_ptr, \n            cache_grad_sampling_loc+(threadIdx.x << 1), cache_grad_attn_weight+threadIdx.x);\n        }\n        \n        __syncthreads();\n        if (tid == 0)\n        {\n          scalar_t _grad_w=cache_grad_sampling_loc[0], _grad_h=cache_grad_sampling_loc[1], _grad_a=cache_grad_attn_weight[0];\n          int sid=2;\n          for (unsigned int tid = 1; tid < blockDim.x; ++tid)\n          {\n            _grad_w += cache_grad_sampling_loc[sid];\n            _grad_h += cache_grad_sampling_loc[sid + 1];\n            _grad_a += cache_grad_attn_weight[tid];\n            sid += 2;\n          }\n          \n          \n          *grad_sampling_loc = _grad_w;\n          *(grad_sampling_loc + 1) = _grad_h;\n          *grad_attn_weight = _grad_a;\n        }\n        __syncthreads();\n\n        data_weight_ptr += 1;\n        data_loc_w_ptr += 2;\n        grad_attn_weight += grad_weight_stride;\n        grad_sampling_loc += grad_loc_stride;\n      }\n    }\n  }\n}\n\ntemplate <typename scalar_t>\n__global__ void ms_deformable_col2im_gpu_kernel_shm_reduce_v2(const int n,\n                                                const scalar_t *grad_col,\n                                                const scalar_t *data_value,\n                                                const int64_t *data_spatial_shapes,\n                                                const int64_t *data_level_start_index, \n                                                const scalar_t *data_sampling_loc,\n                                                const scalar_t *data_attn_weight,\n                                                const int batch_size, \n                                                const int spatial_size, \n                                                const int num_heads,\n                                                const int channels, \n                                                const int num_levels,\n                                                const int num_query,\n                                                const int num_point,\n                                                scalar_t *grad_value,\n                                                scalar_t *grad_sampling_loc,\n                                                scalar_t *grad_attn_weight)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    extern __shared__ int _s[];\n    scalar_t* cache_grad_sampling_loc = (scalar_t*)_s;\n    scalar_t* cache_grad_attn_weight = cache_grad_sampling_loc + 2 * blockDim.x;\n    unsigned int tid = threadIdx.x;\n    int _temp = index;\n    const int c_col = _temp % channels;\n    _temp /= channels;\n    const int sampling_index = _temp; \n    const int m_col = _temp % num_heads;\n    _temp /= num_heads;\n    const int q_col = _temp % num_query;\n    _temp /= num_query;\n    const int b_col = _temp;\n\n    const scalar_t top_grad = grad_col[index];\n\n    int data_weight_ptr = sampling_index * num_levels * num_point;\n    int data_loc_w_ptr = data_weight_ptr << 1;\n    const int grad_sampling_ptr = data_weight_ptr;\n    grad_sampling_loc += grad_sampling_ptr << 1;\n    grad_attn_weight += grad_sampling_ptr;\n    const int grad_weight_stride = 1;\n    const int grad_loc_stride = 2;\n    const int qid_stride = num_heads * channels;\n    const int data_value_ptr_init_offset = b_col * spatial_size * 
qid_stride;\n\n    for (int l_col=0; l_col < num_levels; ++l_col)\n    {\n      const int level_start_id = data_level_start_index[l_col];\n      const int spatial_h_ptr = l_col << 1;\n      const int spatial_h = data_spatial_shapes[spatial_h_ptr];\n      const int spatial_w = data_spatial_shapes[spatial_h_ptr + 1];\n      const int value_ptr_offset = data_value_ptr_init_offset + level_start_id * qid_stride;\n      const scalar_t *data_value_ptr = data_value + value_ptr_offset;\n      scalar_t *grad_value_ptr = grad_value + value_ptr_offset;\n\n      for (int p_col=0; p_col < num_point; ++p_col)\n      {\n        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];\n        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];\n        const scalar_t weight = data_attn_weight[data_weight_ptr];\n\n        const scalar_t h_im = loc_h * spatial_h - 0.5;\n        const scalar_t w_im = loc_w * spatial_w - 0.5;\n        *(cache_grad_sampling_loc+(threadIdx.x << 1)) = 0;\n        *(cache_grad_sampling_loc+((threadIdx.x << 1) + 1)) = 0;\n        *(cache_grad_attn_weight+threadIdx.x)=0;\n        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < spatial_w)\n        {\n          ms_deform_attn_col2im_bilinear(\n            data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col,\n            top_grad, weight, grad_value_ptr, \n            cache_grad_sampling_loc+(threadIdx.x << 1), cache_grad_attn_weight+threadIdx.x);\n        }\n        \n        __syncthreads();\n\n        for (unsigned int s=blockDim.x/2, spre=blockDim.x; s>0; s>>=1, spre>>=1)\n        {\n          if (tid < s) {\n            const unsigned int xid1 = tid << 1;\n            const unsigned int xid2 = (tid + s) << 1;\n            cache_grad_attn_weight[tid] += cache_grad_attn_weight[tid + s];\n            cache_grad_sampling_loc[xid1] += cache_grad_sampling_loc[xid2];\n            cache_grad_sampling_loc[xid1 + 1] += cache_grad_sampling_loc[xid2 + 1];\n            if (tid + (s << 1) < spre)\n            {\n              cache_grad_attn_weight[tid] += cache_grad_attn_weight[tid + (s << 1)];\n              cache_grad_sampling_loc[xid1] += cache_grad_sampling_loc[xid2 + (s << 1)];\n              cache_grad_sampling_loc[xid1 + 1] += cache_grad_sampling_loc[xid2 + 1 + (s << 1)];\n            } \n          }\n          __syncthreads();\n        }\n\n        if (tid == 0)\n        {\n          *grad_sampling_loc = cache_grad_sampling_loc[0];\n          *(grad_sampling_loc + 1) = cache_grad_sampling_loc[1];\n          *grad_attn_weight = cache_grad_attn_weight[0];\n        }\n        __syncthreads();\n\n        data_weight_ptr += 1;\n        data_loc_w_ptr += 2;\n        grad_attn_weight += grad_weight_stride;\n        grad_sampling_loc += grad_loc_stride;\n      }\n    }\n  }\n}\n\ntemplate <typename scalar_t>\n__global__ void ms_deformable_col2im_gpu_kernel_shm_reduce_v2_multi_blocks(const int n,\n                                                const scalar_t *grad_col,\n                                                const scalar_t *data_value,\n                                                const int64_t *data_spatial_shapes,\n                                                const int64_t *data_level_start_index, \n                                                const scalar_t *data_sampling_loc,\n                                                const scalar_t *data_attn_weight,\n                                                const int batch_size, \n                                          
      const int spatial_size, \n                                                const int num_heads,\n                                                const int channels, \n                                                const int num_levels,\n                                                const int num_query,\n                                                const int num_point,\n                                                scalar_t *grad_value,\n                                                scalar_t *grad_sampling_loc,\n                                                scalar_t *grad_attn_weight)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    extern __shared__ int _s[];\n    scalar_t* cache_grad_sampling_loc = (scalar_t*)_s;\n    scalar_t* cache_grad_attn_weight = cache_grad_sampling_loc + 2 * blockDim.x;\n    unsigned int tid = threadIdx.x;\n    int _temp = index;\n    const int c_col = _temp % channels;\n    _temp /= channels;\n    const int sampling_index = _temp; \n    const int m_col = _temp % num_heads;\n    _temp /= num_heads;\n    const int q_col = _temp % num_query;\n    _temp /= num_query;\n    const int b_col = _temp;\n\n    const scalar_t top_grad = grad_col[index];\n\n    int data_weight_ptr = sampling_index * num_levels * num_point;\n    int data_loc_w_ptr = data_weight_ptr << 1;\n    const int grad_sampling_ptr = data_weight_ptr;\n    grad_sampling_loc += grad_sampling_ptr << 1;\n    grad_attn_weight += grad_sampling_ptr;\n    const int grad_weight_stride = 1;\n    const int grad_loc_stride = 2;\n    const int qid_stride = num_heads * channels;\n    const int data_value_ptr_init_offset = b_col * spatial_size * qid_stride;\n\n    for (int l_col=0; l_col < num_levels; ++l_col)\n    {\n      const int level_start_id = data_level_start_index[l_col];\n      const int spatial_h_ptr = l_col << 1;\n      const int spatial_h = data_spatial_shapes[spatial_h_ptr];\n      const int spatial_w = data_spatial_shapes[spatial_h_ptr + 1];\n      const int value_ptr_offset = data_value_ptr_init_offset + level_start_id * qid_stride;\n      const scalar_t *data_value_ptr = data_value + value_ptr_offset;\n      scalar_t *grad_value_ptr = grad_value + value_ptr_offset;\n\n      for (int p_col=0; p_col < num_point; ++p_col)\n      {\n        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];\n        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];\n        const scalar_t weight = data_attn_weight[data_weight_ptr];\n\n        const scalar_t h_im = loc_h * spatial_h - 0.5;\n        const scalar_t w_im = loc_w * spatial_w - 0.5;\n        *(cache_grad_sampling_loc+(threadIdx.x << 1)) = 0;\n        *(cache_grad_sampling_loc+((threadIdx.x << 1) + 1)) = 0;\n        *(cache_grad_attn_weight+threadIdx.x)=0;\n        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < spatial_w)\n        {\n          ms_deform_attn_col2im_bilinear(\n            data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col,\n            top_grad, weight, grad_value_ptr, \n            cache_grad_sampling_loc+(threadIdx.x << 1), cache_grad_attn_weight+threadIdx.x);\n        }\n        \n        __syncthreads();\n\n        for (unsigned int s=blockDim.x/2, spre=blockDim.x; s>0; s>>=1, spre>>=1)\n        {\n          if (tid < s) {\n            const unsigned int xid1 = tid << 1;\n            const unsigned int xid2 = (tid + s) << 1;\n            cache_grad_attn_weight[tid] += cache_grad_attn_weight[tid + s];\n            cache_grad_sampling_loc[xid1] += 
cache_grad_sampling_loc[xid2];\n            cache_grad_sampling_loc[xid1 + 1] += cache_grad_sampling_loc[xid2 + 1];\n            if (tid + (s << 1) < spre)\n            {\n              cache_grad_attn_weight[tid] += cache_grad_attn_weight[tid + (s << 1)];\n              cache_grad_sampling_loc[xid1] += cache_grad_sampling_loc[xid2 + (s << 1)];\n              cache_grad_sampling_loc[xid1 + 1] += cache_grad_sampling_loc[xid2 + 1 + (s << 1)];\n            }\n          }\n          __syncthreads();\n        }\n\n        if (tid == 0)\n        {\n          atomicAdd(grad_sampling_loc, cache_grad_sampling_loc[0]);\n          atomicAdd(grad_sampling_loc + 1, cache_grad_sampling_loc[1]);\n          atomicAdd(grad_attn_weight, cache_grad_attn_weight[0]);\n        }\n        __syncthreads();\n\n        data_weight_ptr += 1;\n        data_loc_w_ptr += 2;\n        grad_attn_weight += grad_weight_stride;\n        grad_sampling_loc += grad_loc_stride;\n      }\n    }\n  }\n}\n\n\ntemplate <typename scalar_t>\n__global__ void ms_deformable_col2im_gpu_kernel_gm(const int n,\n                                                const scalar_t *grad_col,\n                                                const scalar_t *data_value,\n                                                const int64_t *data_spatial_shapes,\n                                                const int64_t *data_level_start_index, \n                                                const scalar_t *data_sampling_loc,\n                                                const scalar_t *data_attn_weight,\n                                                const int batch_size, \n                                                const int spatial_size, \n                                                const int num_heads,\n                                                const int channels, \n                                                const int num_levels,\n                                                const int num_query,\n                                                const int num_point,\n                                                scalar_t *grad_value,\n                                                scalar_t *grad_sampling_loc,\n                                                scalar_t *grad_attn_weight)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    int _temp = index;\n    const int c_col = _temp % channels;\n    _temp /= channels;\n    const int sampling_index = _temp; \n    const int m_col = _temp % num_heads;\n    _temp /= num_heads;\n    const int q_col = _temp % num_query;\n    _temp /= num_query;\n    const int b_col = _temp;\n\n    const scalar_t top_grad = grad_col[index];\n\n    int data_weight_ptr = sampling_index * num_levels * num_point;\n    int data_loc_w_ptr = data_weight_ptr << 1;\n    const int grad_sampling_ptr = data_weight_ptr;\n    grad_sampling_loc += grad_sampling_ptr << 1;\n    grad_attn_weight += grad_sampling_ptr;\n    const int grad_weight_stride = 1;\n    const int grad_loc_stride = 2;\n    const int qid_stride = num_heads * channels;\n    const int data_value_ptr_init_offset = b_col * spatial_size * qid_stride;\n\n    for (int l_col=0; l_col < num_levels; ++l_col)\n    {\n      const int level_start_id = data_level_start_index[l_col];\n      const int spatial_h_ptr = l_col << 1;\n      const int spatial_h = data_spatial_shapes[spatial_h_ptr];\n      const int spatial_w = data_spatial_shapes[spatial_h_ptr + 1];\n      const int value_ptr_offset = data_value_ptr_init_offset + level_start_id * qid_stride;\n      const 
scalar_t *data_value_ptr = data_value + value_ptr_offset;\n      scalar_t *grad_value_ptr = grad_value + value_ptr_offset;\n\n      for (int p_col=0; p_col < num_point; ++p_col)\n      {\n        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];\n        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];\n        const scalar_t weight = data_attn_weight[data_weight_ptr];\n\n        const scalar_t h_im = loc_h * spatial_h - 0.5;\n        const scalar_t w_im = loc_w * spatial_w - 0.5;\n        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < spatial_w)\n        {\n          ms_deform_attn_col2im_bilinear_gm(\n            data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col,\n            top_grad, weight, grad_value_ptr, \n            grad_sampling_loc, grad_attn_weight);\n        }\n        data_weight_ptr += 1;\n        data_loc_w_ptr += 2;\n        grad_attn_weight += grad_weight_stride;\n        grad_sampling_loc += grad_loc_stride;\n      }\n    }\n  }\n}\n\n\ntemplate <typename scalar_t>\nvoid ms_deformable_im2col_cuda(cudaStream_t stream,\n                              const scalar_t* data_value,\n                              const int64_t* data_spatial_shapes, \n                              const int64_t* data_level_start_index, \n                              const scalar_t* data_sampling_loc,\n                              const scalar_t* data_attn_weight,\n                              const int batch_size,\n                              const int spatial_size, \n                              const int num_heads, \n                              const int channels, \n                              const int num_levels, \n                              const int num_query,\n                              const int num_point,\n                              scalar_t* data_col)\n{\n  const int num_kernels = batch_size * num_query * num_heads * channels;\n  const int num_actual_kernels = batch_size * num_query * num_heads * channels;\n  const int num_threads = CUDA_NUM_THREADS;\n  ms_deformable_im2col_gpu_kernel<scalar_t>\n      <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n          0, stream>>>(\n      num_kernels, data_value, data_spatial_shapes, data_level_start_index, data_sampling_loc, data_attn_weight, \n      batch_size, spatial_size, num_heads, channels, num_levels, num_query, num_point, data_col);\n  \n  cudaError_t err = cudaGetLastError();\n  if (err != cudaSuccess)\n  {\n    printf(\"error in ms_deformable_im2col_cuda: %s\\n\", cudaGetErrorString(err));\n  }\n\n}\n\ntemplate <typename scalar_t>\nvoid ms_deformable_col2im_cuda(cudaStream_t stream,\n                              const scalar_t* grad_col,\n                              const scalar_t* data_value,\n                              const int64_t * data_spatial_shapes,\n                              const int64_t * data_level_start_index,\n                              const scalar_t * data_sampling_loc,\n                              const scalar_t * data_attn_weight,\n                              const int batch_size, \n                              const int spatial_size, \n                              const int num_heads,\n                              const int channels, \n                              const int num_levels,\n                              const int num_query,\n                              const int num_point, \n                              scalar_t* grad_value,\n                              scalar_t* 
grad_sampling_loc,\n                              scalar_t* grad_attn_weight)\n{\n  const int num_threads = (channels > CUDA_NUM_THREADS)?CUDA_NUM_THREADS:channels;\n  const int num_kernels = batch_size * num_query * num_heads * channels;\n  const int num_actual_kernels = batch_size * num_query * num_heads * channels;\n  if (channels > 1024)\n  {\n    if ((channels & 1023) == 0)\n    {\n      ms_deformable_col2im_gpu_kernel_shm_reduce_v2_multi_blocks<scalar_t>\n          <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n              num_threads*3*sizeof(scalar_t), stream>>>(\n                        num_kernels, \n                        grad_col,\n                        data_value,\n                        data_spatial_shapes,\n                        data_level_start_index, \n                        data_sampling_loc,\n                        data_attn_weight,\n                        batch_size, \n                        spatial_size, \n                        num_heads,\n                        channels, \n                        num_levels,\n                        num_query,\n                        num_point,\n                        grad_value,\n                        grad_sampling_loc,\n                        grad_attn_weight);\n    }\n    else\n    {\n      ms_deformable_col2im_gpu_kernel_gm<scalar_t>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n    }\n  }\n  else{\n    switch(channels)\n    {\n      case 1:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1<scalar_t, 1>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 2:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1<scalar_t, 2>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n        
              num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 4:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1<scalar_t, 4>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 8:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1<scalar_t, 8>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 16:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1<scalar_t, 16>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 32:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1<scalar_t, 32>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      
num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 64:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v2<scalar_t, 64>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 128:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v2<scalar_t, 128>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 256:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v2<scalar_t, 256>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 512:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v2<scalar_t, 512>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      
grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 1024:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v2<scalar_t, 1024>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      default:\n        if (channels < 64)\n        {\n          ms_deformable_col2im_gpu_kernel_shm_reduce_v1<scalar_t>\n          <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n              num_threads*3*sizeof(scalar_t), stream>>>(\n                        num_kernels, \n                        grad_col,\n                        data_value,\n                        data_spatial_shapes,\n                        data_level_start_index, \n                        data_sampling_loc,\n                        data_attn_weight,\n                        batch_size, \n                        spatial_size, \n                        num_heads,\n                        channels, \n                        num_levels,\n                        num_query,\n                        num_point,\n                        grad_value,\n                        grad_sampling_loc,\n                        grad_attn_weight);\n        }\n        else\n        {\n          ms_deformable_col2im_gpu_kernel_shm_reduce_v2<scalar_t>\n          <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n              num_threads*3*sizeof(scalar_t), stream>>>(\n                        num_kernels, \n                        grad_col,\n                        data_value,\n                        data_spatial_shapes,\n                        data_level_start_index, \n                        data_sampling_loc,\n                        data_attn_weight,\n                        batch_size, \n                        spatial_size, \n                        num_heads,\n                        channels, \n                        num_levels,\n                        num_query,\n                        num_point,\n                        grad_value,\n                        grad_sampling_loc,\n                        grad_attn_weight);\n        }\n    }\n  }\n  cudaError_t err = cudaGetLastError();\n  if (err != cudaSuccess)\n  {\n    printf(\"error in ms_deformable_col2im_cuda: %s\\n\", cudaGetErrorString(err));\n  }\n\n}"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/src/ms_deform_attn.h",
    "content": "/*!\n**************************************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************************************\n* Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n**************************************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#pragma once\n\n#include \"cpu/ms_deform_attn_cpu.h\"\n\n#ifdef WITH_CUDA\n#include \"cuda/ms_deform_attn_cuda.h\"\n#endif\n\n\nat::Tensor\nms_deform_attn_forward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const int im2col_step)\n{\n    if (value.type().is_cuda())\n    {\n#ifdef WITH_CUDA\n        return ms_deform_attn_cuda_forward(\n            value, spatial_shapes, level_start_index, sampling_loc, attn_weight, im2col_step);\n#else\n        AT_ERROR(\"Not compiled with GPU support\");\n#endif\n    }\n    AT_ERROR(\"Not implemented on the CPU\");\n}\n\nstd::vector<at::Tensor>\nms_deform_attn_backward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const at::Tensor &grad_output,\n    const int im2col_step)\n{\n    if (value.type().is_cuda())\n    {\n#ifdef WITH_CUDA\n        return ms_deform_attn_cuda_backward(\n            value, spatial_shapes, level_start_index, sampling_loc, attn_weight, grad_output, im2col_step);\n#else\n        AT_ERROR(\"Not compiled with GPU support\");\n#endif\n    }\n    AT_ERROR(\"Not implemented on the CPU\");\n}\n\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/src/vision.cpp",
    "content": "/*!\n**************************************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************************************\n* Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n**************************************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#include \"ms_deform_attn.h\"\n\nPYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {\n  m.def(\"ms_deform_attn_forward\", &ms_deform_attn_forward, \"ms_deform_attn_forward\");\n  m.def(\"ms_deform_attn_backward\", &ms_deform_attn_backward, \"ms_deform_attn_backward\");\n}\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/ops/test.py",
    "content": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\nfrom __future__ import absolute_import\nfrom __future__ import print_function\nfrom __future__ import division\n\nimport time\nimport torch\nimport torch.nn as nn\nfrom torch.autograd import gradcheck\n\nfrom functions.ms_deform_attn_func import MSDeformAttnFunction, ms_deform_attn_core_pytorch\n\n\nN, M, D = 1, 2, 2\nLq, L, P = 2, 2, 2\nshapes = torch.as_tensor([(6, 4), (3, 2)], dtype=torch.long).cuda()\nlevel_start_index = torch.cat((shapes.new_zeros((1, )), shapes.prod(1).cumsum(0)[:-1]))\nS = sum([(H*W).item() for H, W in shapes])\n\n\ntorch.manual_seed(3)\n\n\n@torch.no_grad()\ndef check_forward_equal_with_pytorch_double():\n    value = torch.rand(N, S, M, D).cuda() * 0.01\n    sampling_locations = torch.rand(N, Lq, M, L, P, 2).cuda()\n    attention_weights = torch.rand(N, Lq, M, L, P).cuda() + 1e-5\n    attention_weights /= attention_weights.sum(-1, keepdim=True).sum(-2, keepdim=True)\n    im2col_step = 2\n    output_pytorch = ms_deform_attn_core_pytorch(value.double(), shapes, sampling_locations.double(), attention_weights.double()).detach().cpu()\n    output_cuda = MSDeformAttnFunction.apply(value.double(), shapes, level_start_index, sampling_locations.double(), attention_weights.double(), im2col_step).detach().cpu()\n    fwdok = torch.allclose(output_cuda, output_pytorch)\n    max_abs_err = (output_cuda - output_pytorch).abs().max()\n    max_rel_err = ((output_cuda - output_pytorch).abs() / output_pytorch.abs()).max()\n\n    print(f'* {fwdok} check_forward_equal_with_pytorch_double: max_abs_err {max_abs_err:.2e} max_rel_err {max_rel_err:.2e}')\n\n\n@torch.no_grad()\ndef check_forward_equal_with_pytorch_float():\n    value = torch.rand(N, S, M, D).cuda() * 0.01\n    sampling_locations = torch.rand(N, Lq, M, L, P, 2).cuda()\n    attention_weights = torch.rand(N, Lq, M, L, P).cuda() + 1e-5\n    attention_weights /= attention_weights.sum(-1, keepdim=True).sum(-2, keepdim=True)\n    im2col_step = 2\n    output_pytorch = ms_deform_attn_core_pytorch(value, shapes, sampling_locations, attention_weights).detach().cpu()\n    output_cuda = MSDeformAttnFunction.apply(value, shapes, level_start_index, sampling_locations, attention_weights, im2col_step).detach().cpu()\n    fwdok = torch.allclose(output_cuda, output_pytorch, rtol=1e-2, atol=1e-3)\n    max_abs_err = (output_cuda - output_pytorch).abs().max()\n    max_rel_err = ((output_cuda - output_pytorch).abs() / output_pytorch.abs()).max()\n\n    print(f'* {fwdok} check_forward_equal_with_pytorch_float: max_abs_err {max_abs_err:.2e} max_rel_err {max_rel_err:.2e}')\n\n\ndef check_gradient_numerical(channels=4, grad_value=True, grad_sampling_loc=True, grad_attn_weight=True):\n\n    value = torch.rand(N, S, M, channels).cuda() * 0.01\n    sampling_locations = torch.rand(N, Lq, M, L, P, 2).cuda()\n    attention_weights = torch.rand(N, Lq, M, L, P).cuda() 
+ 1e-5\n    attention_weights /= attention_weights.sum(-1, keepdim=True).sum(-2, keepdim=True)\n    im2col_step = 2\n    func = MSDeformAttnFunction.apply\n\n    value.requires_grad = grad_value\n    sampling_locations.requires_grad = grad_sampling_loc\n    attention_weights.requires_grad = grad_attn_weight\n\n    gradok = gradcheck(func, (value.double(), shapes, level_start_index, sampling_locations.double(), attention_weights.double(), im2col_step))\n\n    print(f'* {gradok} check_gradient_numerical(D={channels})')\n\n\nif __name__ == '__main__':\n    check_forward_equal_with_pytorch_double()\n    check_forward_equal_with_pytorch_float()\n\n    for channels in [30, 32, 64, 71, 1025, 2048, 3096]:\n        check_gradient_numerical(channels, True, True, True)\n\n\n\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/pixel_decoder/position_encoding.py",
    "content": "# ------------------------------------------------------------------------\n# Copyright (c) 2022 IDEA. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------\n# Modified from Mask2Former https://github.com/facebookresearch/Mask2Former by Feng Li and Hao Zhang.\n\"\"\"\nVarious positional encodings for the transformer.\n\"\"\"\nimport math\n\nimport torch\nfrom torch import nn\n\n\nclass PositionEmbeddingSine(nn.Module):\n    \"\"\"\n    This is a more standard version of the position embedding, very similar to the one\n    used by the Attention is all you need paper, generalized to work on images.\n    \"\"\"\n\n    def __init__(self, num_pos_feats=64, temperature=10000, normalize=False, scale=None):\n        super().__init__()\n        self.num_pos_feats = num_pos_feats\n        self.temperature = temperature\n        self.normalize = normalize\n        if scale is not None and normalize is False:\n            raise ValueError(\"normalize should be True if scale is passed\")\n        if scale is None:\n            scale = 2 * math.pi\n        self.scale = scale\n\n    def forward(self, x, mask=None):\n        if mask is None:\n            mask = torch.zeros((x.size(0), x.size(2), x.size(3)), device=x.device, dtype=torch.bool)\n        not_mask = ~mask\n        y_embed = not_mask.cumsum(1, dtype=torch.float32)\n        x_embed = not_mask.cumsum(2, dtype=torch.float32)\n        if self.normalize:\n            eps = 1e-6\n            y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale\n            x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale\n\n        dim_t = torch.arange(self.num_pos_feats, dtype=torch.float32, device=x.device)\n        dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)\n\n        pos_x = x_embed[:, :, :, None] / dim_t\n        pos_y = y_embed[:, :, :, None] / dim_t\n        pos_x = torch.stack(\n            (pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4\n        ).flatten(3)\n        pos_y = torch.stack(\n            (pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4\n        ).flatten(3)\n        pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)\n        return pos\n\n    def __repr__(self, _repr_indent=4):\n        head = \"Positional encoding \" + self.__class__.__name__\n        body = [\n            \"num_pos_feats: {}\".format(self.num_pos_feats),\n            \"temperature: {}\".format(self.temperature),\n            \"normalize: {}\".format(self.normalize),\n            \"scale: {}\".format(self.scale),\n        ]\n        # _repr_indent = 4\n        lines = [head] + [\" \" * _repr_indent + line for line in body]\n        return \"\\n\".join(lines)\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/transformer_decoder/__init__.py",
    "content": "# Copyright (c) IDEA, Inc. and its affiliates.\nfrom .maskdino_decoder import MaskDINODecoder\n\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/transformer_decoder/dino_decoder.py",
    "content": "# ------------------------------------------------------------------------\r\n# DINO\r\n# Copyright (c) 2022 IDEA. All Rights Reserved.\r\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\r\n# ------------------------------------------------------------------------\r\n# Modified from DINO https://github.com/IDEA-Research/DINO by Feng Li and Hao Zhang.\r\n# ------------------------------------------------------------------------\r\n\r\nfrom typing import Optional, List, Union\r\nimport torch\r\nfrom torch import nn, Tensor\r\nfrom torch.cuda.amp import autocast\r\n\r\nfrom ...utils.utils import MLP, _get_clones, _get_activation_fn, gen_sineembed_for_position, inverse_sigmoid\r\nfrom ..pixel_decoder.ops.modules import MSDeformAttn\r\n\r\n\r\nclass TransformerDecoder(nn.Module):\r\n\r\n    def __init__(self, decoder_layer, num_layers, norm=None,\r\n                 return_intermediate=False,\r\n                 d_model=256, query_dim=4,\r\n                 modulate_hw_attn=True,\r\n                 num_feature_levels=1,\r\n                 deformable_decoder=True,\r\n                 decoder_query_perturber=None,\r\n                 dec_layer_number=None,  # number of queries each layer in decoder\r\n                 rm_dec_query_scale=True,\r\n                 dec_layer_share=False,\r\n                 dec_layer_dropout_prob=None,\r\n                 cross_track_layer = False,\r\n                 n_levels = None, \r\n                 n_heads = None, \r\n                 n_points = None,\r\n                 ):\r\n        super().__init__()\r\n        if num_layers > 0:\r\n            self.layers = _get_clones(decoder_layer, num_layers, layer_share=dec_layer_share)\r\n        else:\r\n            self.layers = []\r\n        self.num_layers = num_layers\r\n        self.norm = norm\r\n        self.return_intermediate = return_intermediate\r\n        assert return_intermediate, \"support return_intermediate only\"\r\n        self.query_dim = query_dim\r\n        assert query_dim in [2, 4], \"query_dim should be 2/4 but {}\".format(query_dim)\r\n        self.num_feature_levels = num_feature_levels\r\n\r\n        self.ref_point_head = MLP(query_dim // 2 * d_model, d_model, d_model, 2)\r\n        if not deformable_decoder:\r\n            self.query_pos_sine_scale = MLP(d_model, d_model, d_model, 2)\r\n        else:\r\n            self.query_pos_sine_scale = None\r\n\r\n        if rm_dec_query_scale:\r\n            self.query_scale = None\r\n        else:\r\n            raise NotImplementedError\r\n            self.query_scale = MLP(d_model, d_model, d_model, 2)\r\n        self.bbox_embed = None\r\n        self.class_embed = None\r\n\r\n        self.d_model = d_model\r\n        self.modulate_hw_attn = modulate_hw_attn\r\n        self.deformable_decoder = deformable_decoder\r\n\r\n        if not deformable_decoder and modulate_hw_attn:\r\n            self.ref_anchor_head = MLP(d_model, d_model, 2, 2)\r\n        else:\r\n            self.ref_anchor_head = None\r\n\r\n        self.decoder_query_perturber = decoder_query_perturber\r\n        self.box_pred_damping = None\r\n\r\n        self.dec_layer_number = dec_layer_number\r\n        if dec_layer_number is not None:\r\n            assert isinstance(dec_layer_number, list)\r\n            assert len(dec_layer_number) == num_layers\r\n            # assert dec_layer_number[0] ==\r\n\r\n        self.dec_layer_dropout_prob = dec_layer_dropout_prob\r\n        if dec_layer_dropout_prob is not None:\r\n            
assert isinstance(dec_layer_dropout_prob, list)\r\n            assert len(dec_layer_dropout_prob) == num_layers\r\n            for i in dec_layer_dropout_prob:\r\n                assert 0.0 <= i <= 1.0\r\n        if cross_track_layer: # add a cross-attention-layer before track ffn head\r\n            self.cross_track_attn = MSDeformAttn(d_model, n_levels, n_heads, n_points)\r\n            self.cross_track = True\r\n        else:\r\n            self.cross_track = False\r\n\r\n        self._reset_parameters()\r\n\r\n    def _reset_parameters(self):\r\n        for p in self.parameters():\r\n            if p.dim() > 1:\r\n                nn.init.xavier_uniform_(p)\r\n        for m in self.modules():\r\n            if isinstance(m, MSDeformAttn):\r\n                m._reset_parameters()\r\n    @staticmethod\r\n    def with_pos_embed(tensor, pos):\r\n        return tensor if pos is None else tensor + pos\r\n\r\n\r\n    def forward(self, tgt, memory,\r\n                tgt_mask: Optional[Tensor] = None,\r\n                memory_mask: Optional[Tensor] = None,\r\n                tgt_key_padding_mask: Optional[Tensor] = None,\r\n                memory_key_padding_mask: Optional[Tensor] = None,\r\n                pos: Optional[Tensor] = None,\r\n                refpoints_unsigmoid: Optional[Tensor] = None,  # num_queries, bs, 2\r\n                # for memory\r\n                level_start_index: Optional[Tensor] = None,  # num_levels\r\n                spatial_shapes: Optional[Tensor] = None,  # bs, num_levels, 2\r\n                valid_ratios: Optional[Tensor] = None,\r\n                task = None,\r\n                extra = None,\r\n\r\n                ):\r\n        \"\"\"\r\n        Input:\r\n            - tgt: nq, bs, d_model\r\n            - memory: hw, bs, d_model\r\n            - pos: hw, bs, d_model\r\n            - refpoints_unsigmoid: nq, bs, 2/4\r\n            - valid_ratios/spatial_shapes: bs, nlevel, 2\r\n        \"\"\"\r\n        output = tgt\r\n        device = tgt.device\r\n\r\n        intermediate = []\r\n        reference_points = refpoints_unsigmoid.sigmoid().to(device)\r\n        ref_points = [reference_points]\r\n\r\n        for layer_id, layer in enumerate(self.layers):\r\n            # preprocess ref points\r\n            if self.training and self.decoder_query_perturber is not None and layer_id != 0:\r\n                reference_points = self.decoder_query_perturber(reference_points)\r\n\r\n            reference_points_input = reference_points[:, :, None] \\\r\n                                         * torch.cat([valid_ratios, valid_ratios], -1)[None, :]  # nq, bs, nlevel, 4\r\n            query_sine_embed = gen_sineembed_for_position(reference_points_input[:, :, 0, :]) # nq, bs, 256*2\r\n\r\n            raw_query_pos = self.ref_point_head(query_sine_embed)  # nq, bs, 256\r\n            pos_scale = self.query_scale(output) if self.query_scale is not None else 1\r\n            query_pos = pos_scale * raw_query_pos\r\n\r\n            output = layer(\r\n                tgt=output,\r\n                tgt_query_pos=query_pos,\r\n                tgt_query_sine_embed=query_sine_embed,\r\n                tgt_key_padding_mask=tgt_key_padding_mask,\r\n                tgt_reference_points=reference_points_input,\r\n\r\n                memory=memory,\r\n                memory_key_padding_mask=memory_key_padding_mask,\r\n                memory_level_start_index=level_start_index,\r\n                memory_spatial_shapes=spatial_shapes,\r\n                memory_pos=pos,\r\n\r\n        
        self_attn_mask=tgt_mask,\r\n                cross_attn_mask=memory_mask,\r\n                task = task,\r\n                extra = extra,\r\n                layer_id = layer_id,\r\n            )\r\n\r\n            # iter update\r\n            if self.bbox_embed is not None:\r\n                reference_before_sigmoid = inverse_sigmoid(reference_points)\r\n                delta_unsig = self.bbox_embed[layer_id](output).to(device)\r\n                outputs_unsig = delta_unsig + reference_before_sigmoid\r\n                new_reference_points = outputs_unsig.sigmoid()\r\n\r\n                reference_points = new_reference_points.detach()\r\n                # if layer_id != self.num_layers - 1:\r\n                ref_points.append(new_reference_points)\r\n\r\n            intermediate.append(self.norm(output))\r\n\r\n\r\n        if self.cross_track:\r\n            tgt_track = self.cross_track_attn(self.with_pos_embed(output, query_pos).transpose(0, 1),\r\n                               reference_points_input.transpose(0, 1).contiguous(),\r\n                               memory.transpose(0, 1), spatial_shapes, level_start_index,\r\n                               memory_key_padding_mask).transpose(0, 1)\r\n            tgt_track = tgt_track + output\r\n            tgt_track = tgt_track.transpose(0, 1)\r\n        else:\r\n            tgt_track = None\r\n\r\n        return [\r\n            [itm_out.transpose(0, 1) for itm_out in intermediate],\r\n            [itm_refpoint.transpose(0, 1) for itm_refpoint in ref_points], tgt_track\r\n        ]\r\n\r\n\r\nclass DeformableTransformerDecoderLayer(nn.Module):\r\n\r\n    def __init__(self, d_model=256, d_ffn=1024,\r\n                 dropout=0.1, activation=\"relu\",\r\n                 n_levels=4, n_heads=8, n_points=4,\r\n                 use_deformable_box_attn=False,\r\n                 key_aware_type=None,\r\n                 ):\r\n        super().__init__()\r\n        self.n_heads = n_heads\r\n        # cross attention\r\n        if use_deformable_box_attn:\r\n            raise NotImplementedError\r\n        else:\r\n            self.cross_attn = MSDeformAttn(d_model, n_levels, n_heads, n_points)\r\n        self.dropout1 = nn.Dropout(dropout)\r\n        self.norm1 = nn.LayerNorm(d_model)\r\n\r\n        # self attention\r\n        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)\r\n        self.dropout2 = nn.Dropout(dropout)\r\n        self.norm2 = nn.LayerNorm(d_model)\r\n\r\n        # ffn\r\n        self.linear1 = nn.Linear(d_model, d_ffn)\r\n        self.activation = _get_activation_fn(activation)\r\n        self.dropout3 = nn.Dropout(dropout)\r\n        self.linear2 = nn.Linear(d_ffn, d_model)\r\n        self.dropout4 = nn.Dropout(dropout)\r\n        self.norm3 = nn.LayerNorm(d_model)\r\n\r\n        self.key_aware_type = key_aware_type\r\n        self.key_aware_proj = None\r\n\r\n    def rm_self_attn_modules(self):\r\n        self.self_attn = None\r\n        self.dropout2 = None\r\n        self.norm2 = None\r\n\r\n    @staticmethod\r\n    def with_pos_embed(tensor, pos):\r\n        return tensor if pos is None else tensor + pos\r\n\r\n    def forward_ffn(self, tgt):\r\n        tgt2 = self.linear2(self.dropout3(self.activation(self.linear1(tgt))))\r\n        tgt = tgt + self.dropout4(tgt2)\r\n        tgt = self.norm3(tgt)\r\n        return tgt\r\n\r\n    @autocast(enabled=False)\r\n    def forward(self,\r\n                # for tgt\r\n                tgt: Optional[Tensor],  # nq, bs, d_model\r\n                
tgt_query_pos: Optional[Tensor] = None,  # pos for query. MLP(Sine(pos))\r\n                tgt_query_sine_embed: Optional[Tensor] = None,  # pos for query. Sine(pos)\r\n                tgt_key_padding_mask: Optional[Tensor] = None,\r\n                tgt_reference_points: Optional[Tensor] = None,  # nq, bs, 4\r\n\r\n                # for memory\r\n                memory: Optional[Tensor] = None,  # hw, bs, d_model\r\n                memory_key_padding_mask: Optional[Tensor] = None,\r\n                memory_level_start_index: Optional[Tensor] = None,  # num_levels\r\n                memory_spatial_shapes: Optional[Tensor] = None,  # bs, num_levels, 2\r\n                memory_pos: Optional[Tensor] = None,  # pos for memory\r\n\r\n                # sa\r\n                self_attn_mask: Optional[Tensor] = None,  # mask used for self-attention\r\n                cross_attn_mask: Optional[Tensor] = None,  # mask used for cross-attention\r\n                task = None,\r\n                extra = None,\r\n                layer_id = None,\r\n                ):\r\n        \"\"\"\r\n        Input:\r\n            - tgt/tgt_query_pos: nq, bs, d_model\r\n            -\r\n        \"\"\"\r\n        # self attention\r\n\r\n\r\n        if task in ['grounding', 'rvos'] or  'visual_prompt_tokens' in extra:\r\n            if self_attn_mask is not None: # training with denoising query \r\n\r\n                if 'visual_prompt_tokens' in extra: # has visual prompt \r\n                    level_index = layer_id % 3 # src level :  self.num_feature_levels\r\n                    prompt_tokens = extra['visual_prompt_tokens'][level_index]\r\n                    promot_pos = prompt_tokens.detach().clone()\r\n                    prompt_mask = extra['visual_prompt_nonzero_mask'][level_index]\r\n                else: #grounding\r\n                    prompt_tokens = extra['grounding_tokens']\r\n                    promot_pos = prompt_tokens.detach().clone()\r\n                    prompt_mask = extra['grounding_nonzero_mask']\r\n                ori_size = tgt.shape[0]\r\n                new_mask_size = tgt.shape[0]+prompt_tokens.shape[0]\r\n                new_self_attn_mask = torch.zeros((tgt.shape[1], new_mask_size, new_mask_size), dtype=torch.bool, device=tgt.device)\r\n                \r\n                new_self_attn_mask[:,:ori_size,:ori_size] =  self_attn_mask.unsqueeze(0).repeat(tgt.shape[1],1,1) #denoising matching keepmask\r\n\r\n                # prompt to prompt mask set to True if they are not valid\r\n                # new_self_attn_mask[:,ori_size:,ori_size:][prompt_mask] = True\r\n                # new_self_attn_mask[:,ori_size:,ori_size:].transpose(1,2)[prompt_mask] = True\r\n\r\n                # prompt2obj and obj2prompt mask set to True \r\n                # new_self_attn_mask[:,ori_size-300:ori_size,ori_size:][] = True \r\n                new_self_attn_mask[:,:ori_size,ori_size:].transpose(1,2)[prompt_mask] = True \r\n                \r\n                new_self_attn_mask[:,ori_size:,:ori_size][prompt_mask] = True \r\n                # new_self_attn_mask[:,ori_size:,ori_size-300:ori_size].transpose(1,2)[] = True \r\n\r\n                new_self_attn_mask = new_self_attn_mask.repeat_interleave(self.n_heads, dim=0)\r\n            else: # with out denoising query\r\n                if 'visual_prompt_tokens' in extra: # has visual prompt \r\n                    level_index = layer_id % 3 # src level :  self.num_feature_levels\r\n                    prompt_tokens = 
extra['visual_prompt_tokens'][level_index]\r\n                    promot_pos = prompt_tokens.detach().clone()\r\n                    prompt_mask = extra['visual_prompt_nonzero_mask'][level_index]\r\n                else: #grounding\r\n                    prompt_tokens = extra['grounding_tokens']\r\n                    promot_pos = prompt_tokens.detach().clone()\r\n                    prompt_mask = extra['grounding_nonzero_mask']\r\n                ori_size = tgt.shape[0]\r\n                new_mask_size = tgt.shape[0]+prompt_tokens.shape[0]\r\n                new_self_attn_mask = torch.zeros((tgt.shape[1], new_mask_size, new_mask_size), dtype=torch.bool, device=tgt.device)\r\n                new_self_attn_mask[:,:ori_size,ori_size:].transpose(1,2)[prompt_mask] = True \r\n                new_self_attn_mask[:,ori_size:,:ori_size][prompt_mask] = True \r\n                new_self_attn_mask = new_self_attn_mask.repeat_interleave(self.n_heads, dim=0)\r\n\r\n\r\n            if self.self_attn is not None:\r\n                tgt = torch.cat([tgt,prompt_tokens],dim=0)\r\n                tgt_query_pos = torch.cat([tgt_query_pos,promot_pos],dim=0)\r\n                q = k = self.with_pos_embed(tgt, tgt_query_pos)\r\n                tgt2 = self.self_attn(q, k, tgt, attn_mask=new_self_attn_mask)[0]\r\n                tgt = tgt + self.dropout2(tgt2)\r\n                tgt = self.norm2(tgt)\r\n                tgt = tgt[:ori_size]\r\n                tgt_query_pos = tgt_query_pos[:ori_size]\r\n        else:\r\n            if self.self_attn is not None:\r\n                q = k = self.with_pos_embed(tgt, tgt_query_pos)\r\n                tgt2 = self.self_attn(q, k, tgt, attn_mask=self_attn_mask)[0]\r\n                tgt = tgt + self.dropout2(tgt2)\r\n                tgt = self.norm2(tgt)\r\n\r\n        # cross attention\r\n        if self.key_aware_type is not None:\r\n            if self.key_aware_type == 'mean':\r\n                tgt = tgt + memory.mean(0, keepdim=True)\r\n            elif self.key_aware_type == 'proj_mean':\r\n                tgt = tgt + self.key_aware_proj(memory).mean(0, keepdim=True)\r\n            else:\r\n                raise NotImplementedError(\"Unknown key_aware_type: {}\".format(self.key_aware_type))\r\n        tgt2 = self.cross_attn(self.with_pos_embed(tgt, tgt_query_pos).transpose(0, 1),\r\n                               tgt_reference_points.transpose(0, 1).contiguous(),\r\n                               memory.transpose(0, 1), memory_spatial_shapes, memory_level_start_index,\r\n                               memory_key_padding_mask).transpose(0, 1)\r\n        tgt = tgt + self.dropout1(tgt2)\r\n        tgt = self.norm1(tgt)\r\n\r\n        # ffn\r\n        tgt = self.forward_ffn(tgt)\r\n\r\n        return tgt\r\n\r\n\r\n"
  },
  {
    "path": "thirdparty/GLEE/glee/models/transformer_decoder/maskdino_decoder.py",
    "content": "# ------------------------------------------------------------------------\n# DINO\n# Copyright (c) 2022 IDEA. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------\n# Modified from Mask2Former https://github.com/facebookresearch/Mask2Former by Feng Li and Hao Zhang.\nimport logging\nimport fvcore.nn.weight_init as weight_init\nimport torch\nfrom torch import nn\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.layers import Conv2d\nfrom detectron2.utils.registry import Registry\nfrom detectron2.structures import BitMasks\nfrom timm.models.layers import trunc_normal_\nfrom .dino_decoder import TransformerDecoder, DeformableTransformerDecoderLayer\nfrom ...utils.utils import MLP, gen_encoder_output_proposals, inverse_sigmoid\nfrom ...utils import box_ops\n\n\nTRANSFORMER_DECODER_REGISTRY = Registry(\"TRANSFORMER_MODULE\")\nTRANSFORMER_DECODER_REGISTRY.__doc__ = \"\"\"\nRegistry for transformer module in MaskDINO.\n\"\"\"\n\n\ndef build_transformer_decoder(cfg, in_channels, lang_encoder, mask_classification=True):\n    \"\"\"\n    Build a instance embedding branch from `cfg.MODEL.INS_EMBED_HEAD.NAME`.\n    \"\"\"\n    name = cfg.MODEL.MaskDINO.TRANSFORMER_DECODER_NAME\n    return TRANSFORMER_DECODER_REGISTRY.get(name)(cfg, in_channels, lang_encoder, mask_classification)\n\n\n@TRANSFORMER_DECODER_REGISTRY.register()\nclass MaskDINODecoder(nn.Module):\n    @configurable\n    def __init__(\n            self,\n            in_channels,\n            lang_encoder,\n            mask_classification=True,\n            *,\n            num_classes: int,\n            hidden_dim: int,\n            num_queries: int,\n            nheads: int,\n            dim_feedforward: int,\n            dec_layers: int,\n            mask_dim: int,\n            dim_projection: int,\n            enforce_input_project: bool,\n            two_stage: bool,\n            dn: str,\n            noise_scale:float,\n            dn_num:int,\n            initialize_box_type:bool,\n            initial_pred:bool,\n            learn_tgt: bool,\n            total_num_feature_levels: int = 4,\n            dropout: float = 0.0,\n            activation: str = 'relu',\n            nhead: int = 8,\n            dec_n_points: int = 4,\n            return_intermediate_dec: bool = True,\n            query_dim: int = 4,\n            dec_layer_share: bool = False,\n            semantic_ce_loss: bool = False,\n            cross_track_layer: bool = False,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            in_channels: channels of the input features\n            mask_classification: whether to add mask classifier or not\n            num_classes: number of classes\n            hidden_dim: Transformer feature dimension\n            num_queries: number of queries\n            nheads: number of heads\n            dim_feedforward: feature dimension in feedforward network\n            enc_layers: number of Transformer encoder layers\n            dec_layers: number of Transformer decoder layers\n            pre_norm: whether to use pre-LayerNorm or not\n            mask_dim: mask feature dimension\n            enforce_input_project: add input project 1x1 conv even if input\n                channels and hidden dim is identical\n            d_model: transformer dimension\n            dropout: dropout rate\n            activation: 
activation function\n            nhead: num heads in multi-head attention\n            dec_n_points: number of sampling points in decoder\n            return_intermediate_dec: return the intermediate results of decoder\n            query_dim: 4 -> (x, y, w, h)\n            dec_layer_share: whether to share each decoder layer\n            semantic_ce_loss: use ce loss for semantic segmentation\n        \"\"\"\n        super().__init__()\n\n        assert mask_classification, \"Only support mask classification model\"\n        self.mask_classification = mask_classification\n        self.num_feature_levels = total_num_feature_levels\n        self.initial_pred = initial_pred\n\n        self.lang_encoder = lang_encoder\n        \n        # define Transformer decoder here\n        self.dn=dn\n        self.learn_tgt = learn_tgt\n        self.noise_scale=noise_scale\n        self.dn_num=dn_num\n        self.num_heads = nheads\n        self.num_layers = dec_layers\n        self.two_stage=two_stage\n        self.initialize_box_type = initialize_box_type\n        self.total_num_feature_levels = total_num_feature_levels\n\n        self.num_queries = num_queries\n        self.semantic_ce_loss = semantic_ce_loss\n        # learnable query features\n        if not two_stage or self.learn_tgt:\n            self.query_feat = nn.Embedding(num_queries, hidden_dim)\n        if not two_stage and initialize_box_type == 'no':\n            self.query_embed = nn.Embedding(num_queries, 4)\n        if two_stage:\n            self.enc_output = nn.Linear(hidden_dim, hidden_dim)\n            self.enc_output_norm = nn.LayerNorm(hidden_dim)\n\n        self.input_proj = nn.ModuleList()\n        for _ in range(self.num_feature_levels):\n            if in_channels != hidden_dim or enforce_input_project:\n                self.input_proj.append(Conv2d(in_channels, hidden_dim, kernel_size=1))\n                weight_init.c2_xavier_fill(self.input_proj[-1])\n            else:\n                self.input_proj.append(nn.Sequential())\n        self.num_classes = {\n            'obj365':100,\n            'obj365_clip':100, \n            'lvis':100, \n            'openimage':100,\n            'lvis_clip':100,\n            'openimage_clip':100,\n            'grit':100,\n            'vg':200,\n            'coco':80,\n            'coco_clip':80,\n            'grounding':1, \n            'rvos':1, \n            'sa1b':1, \n            'sa1b_clip':1,\n            'bdd_det':10,\n            'bdd_inst':8,\n            'ytvis19':40,\n            'image_yt19':40, \n            'image_yt21':40,\n            'bdd_track_seg':8,\n            'bdd_track_box':8,\n            'ovis':25,\n            'image_o':25,\n            'ytvis21':40,\n            'uvo_video': 81,\n            'ytbvos':1,\n        }\n        # output FFNs\n        assert self.mask_classification, \"why not class embedding?\"\n \n        self.confidence_score =  MLP(hidden_dim, hidden_dim, 1, 2)\n        self.category_embed = nn.Parameter(torch.rand(hidden_dim, dim_projection))\n        # trunc_normal_(self.category_embed, std=.02)\n        # self.track_embed = MLP(hidden_dim, hidden_dim, hidden_dim, 3)\n\n        self.coco_label_enc = nn.Embedding(80,hidden_dim)\n        self.obj365_label_enc = nn.Embedding(100, hidden_dim)\n        self.vg_label_enc = nn.Embedding(200, hidden_dim)\n        self.grounding_label_enc = nn.Embedding(1,hidden_dim)\n        self.ytvis19_label_enc = nn.Embedding(40,hidden_dim)\n        self.ytvis21_label_enc = nn.Embedding(40,hidden_dim)\n        
self.ovis_label_enc = nn.Embedding(25,hidden_dim)\n        self.uvo_label_enc = nn.Embedding(81,hidden_dim)\n        self.bdd_det = nn.Embedding(10,hidden_dim)\n        self.bdd_inst = nn.Embedding(8,hidden_dim)\n\n  \n        self.label_enc = {\n            'coco': self.coco_label_enc, \n            'coco_clip': self.coco_label_enc, \n            'coconomask': self.coco_label_enc,\n            'obj365': self.obj365_label_enc,\n            'lvis': self.obj365_label_enc,\n            'openimage': self.obj365_label_enc,\n            'grit': self.obj365_label_enc,\n            'vg': self.vg_label_enc,\n            'obj365_clip': self.obj365_label_enc,\n            'lvis_clip': self.obj365_label_enc,\n            'openimage_clip': self.obj365_label_enc,\n            'bdd_det':self.bdd_det,\n            'bdd_inst':self.bdd_inst,\n            'bdd_track_seg':self.bdd_inst, \n            'bdd_track_box':self.bdd_inst,\n            'sa1b': self.grounding_label_enc,\n            'sa1b_clip': self.grounding_label_enc,\n            'grounding': self.grounding_label_enc,\n            'rvos': self.grounding_label_enc,\n            'uvo_video':self.uvo_label_enc,\n            'ytvis19':self.ytvis19_label_enc,\n            'image_yt19': self.ytvis19_label_enc,\n            'ytvis21':self.ytvis21_label_enc,\n            'image_yt21':self.ytvis21_label_enc,\n            'ovis':self.ovis_label_enc,\n            'image_o': self.ovis_label_enc,\n            'burst':self.grounding_label_enc,\n            'ytbvos':self.grounding_label_enc,\n            }\n\n\n\n        self.mask_embed = MLP(hidden_dim, hidden_dim, mask_dim, 3)\n        \n        # init decoder\n        self.decoder_norm = decoder_norm = nn.LayerNorm(hidden_dim)\n        decoder_layer = DeformableTransformerDecoderLayer(hidden_dim, dim_feedforward,\n                                                          dropout, activation,\n                                                          self.num_feature_levels, nhead, dec_n_points)\n        self.decoder = TransformerDecoder(decoder_layer, self.num_layers, decoder_norm,\n                                          return_intermediate=return_intermediate_dec,\n                                          d_model=hidden_dim, query_dim=query_dim,\n                                          num_feature_levels=self.num_feature_levels,\n                                          dec_layer_share=dec_layer_share,\n                                          cross_track_layer = cross_track_layer,\n                                          n_levels=self.num_feature_levels, n_heads=nhead, n_points=dec_n_points\n                                          )\n        self.cross_track_layer = cross_track_layer\n        self.hidden_dim = hidden_dim\n        self._bbox_embed = _bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)\n        \n        nn.init.constant_(_bbox_embed.layers[-1].weight.data, 0)\n        nn.init.constant_(_bbox_embed.layers[-1].bias.data, 0)\n        box_embed_layerlist = [_bbox_embed for i in range(self.num_layers)]  # share box prediction each layer\n        self.bbox_embed = nn.ModuleList(box_embed_layerlist)\n        self.decoder.bbox_embed = self.bbox_embed\n\n\n    @classmethod\n    def from_config(cls, cfg, in_channels, lang_encoder, mask_classification):\n        ret = {}\n        ret[\"in_channels\"] = in_channels\n        ret[\"lang_encoder\"] = lang_encoder\n        ret[\"mask_classification\"] = mask_classification\n        ret[\"dim_projection\"] = cfg.MODEL.DIM_PROJ\n        
ret[\"num_classes\"] = cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES\n        ret[\"hidden_dim\"] = cfg.MODEL.MaskDINO.HIDDEN_DIM\n        ret[\"num_queries\"] = cfg.MODEL.MaskDINO.NUM_OBJECT_QUERIES\n        # Transformer parameters:\n        ret[\"nheads\"] = cfg.MODEL.MaskDINO.NHEADS\n        ret[\"dim_feedforward\"] = cfg.MODEL.MaskDINO.DIM_FEEDFORWARD\n        ret[\"dec_layers\"] = cfg.MODEL.MaskDINO.DEC_LAYERS\n        ret[\"enforce_input_project\"] = cfg.MODEL.MaskDINO.ENFORCE_INPUT_PROJ\n        ret[\"mask_dim\"] = cfg.MODEL.SEM_SEG_HEAD.MASK_DIM\n        ret[\"two_stage\"] =cfg.MODEL.MaskDINO.TWO_STAGE\n        ret[\"initialize_box_type\"] = cfg.MODEL.MaskDINO.INITIALIZE_BOX_TYPE  # ['no', 'bitmask', 'mask2box']\n        ret[\"dn\"]=cfg.MODEL.MaskDINO.DN\n        ret[\"noise_scale\"] =cfg.MODEL.MaskDINO.DN_NOISE_SCALE\n        ret[\"dn_num\"] =cfg.MODEL.MaskDINO.DN_NUM\n        ret[\"initial_pred\"] =cfg.MODEL.MaskDINO.INITIAL_PRED\n        ret[\"learn_tgt\"] = cfg.MODEL.MaskDINO.LEARN_TGT\n        ret[\"total_num_feature_levels\"] = cfg.MODEL.SEM_SEG_HEAD.TOTAL_NUM_FEATURE_LEVELS\n        ret[\"semantic_ce_loss\"] = cfg.MODEL.MaskDINO.TEST.SEMANTIC_ON and cfg.MODEL.MaskDINO.SEMANTIC_CE_LOSS and ~cfg.MODEL.MaskDINO.TEST.PANOPTIC_ON\n        ret[\"cross_track_layer\"] = cfg.MODEL.CROSS_TRACK\n        return ret\n\n    def prepare_for_dn(self, targets, tgt, refpoint_emb, batch_size,task):\n        \"\"\"\n        modified from dn-detr. You can refer to dn-detr\n        https://github.com/IDEA-Research/DN-DETR/blob/main/models/dn_dab_deformable_detr/dn_components.py\n        for more details\n            :param dn_args: scalar, noise_scale\n            :param tgt: original tgt (content) in the matching part\n            :param refpoint_emb: positional anchor queries in the matching part\n            :param batch_size: bs\n            \"\"\"\n        if self.training:\n            scalar, noise_scale = self.dn_num,self.noise_scale\n\n            known = [(torch.ones_like(t['labels'])).cuda() for t in targets]\n            know_idx = [torch.nonzero(t) for t in known]\n            known_num = [sum(k) for k in known]\n\n            # use fix number of dn queries\n            if max(known_num)>0:\n                scalar = scalar//(int(max(known_num)))\n            else:\n                scalar = 0\n            if scalar == 0:\n                input_query_label = None\n                input_query_bbox = None\n                attn_mask = None\n                mask_dict = None\n                return input_query_label, input_query_bbox, attn_mask, mask_dict\n\n            # can be modified to selectively denosie some label or boxes; also known label prediction\n            unmask_bbox = unmask_label = torch.cat(known)\n            labels = torch.cat([t['labels'] for t in targets])\n            boxes = torch.cat([t['boxes'] for t in targets])\n            batch_idx = torch.cat([torch.full_like(t['labels'].long(), i) for i, t in enumerate(targets)])\n            # known\n            known_indice = torch.nonzero(unmask_label + unmask_bbox)\n            known_indice = known_indice.view(-1)\n\n            # noise\n            known_indice = known_indice.repeat(scalar, 1).view(-1)\n            known_labels = labels.repeat(scalar, 1).view(-1)\n            known_bid = batch_idx.repeat(scalar, 1).view(-1)\n            known_bboxs = boxes.repeat(scalar, 1)\n            known_labels_expaned = known_labels.clone()\n            known_bbox_expand = known_bboxs.clone()\n\n            # noise on the label\n            
if noise_scale > 0:\n                p = torch.rand_like(known_labels_expaned.float())\n                chosen_indice = torch.nonzero(p < (noise_scale * 0.5)).view(-1)  # half of bbox prob\n                new_label = torch.randint_like(chosen_indice, 0, self.num_classes[task])  # randomly put a new one here\n                known_labels_expaned.scatter_(0, chosen_indice, new_label)\n            if noise_scale > 0:\n                diff = torch.zeros_like(known_bbox_expand)\n                diff[:, :2] = known_bbox_expand[:, 2:] / 2\n                diff[:, 2:] = known_bbox_expand[:, 2:]\n                known_bbox_expand += torch.mul((torch.rand_like(known_bbox_expand) * 2 - 1.0),\n                                               diff).cuda() * noise_scale\n                known_bbox_expand = known_bbox_expand.clamp(min=0.0, max=1.0)\n\n            m = known_labels_expaned.long().to('cuda')\n            input_label_embed = self.label_enc[task](m)\n            input_bbox_embed = inverse_sigmoid(known_bbox_expand)\n            single_pad = int(max(known_num))\n            pad_size = int(single_pad * scalar)\n\n            padding_label = torch.zeros(pad_size, self.hidden_dim).cuda()\n            padding_bbox = torch.zeros(pad_size, 4).cuda()\n\n            if not refpoint_emb is None:\n                input_query_label = torch.cat([padding_label, tgt], dim=0).repeat(batch_size, 1, 1)\n                input_query_bbox = torch.cat([padding_bbox, refpoint_emb], dim=0).repeat(batch_size, 1, 1)\n            else:\n                input_query_label=padding_label.repeat(batch_size, 1, 1)\n                input_query_bbox = padding_bbox.repeat(batch_size, 1, 1)\n\n            # map\n            map_known_indice = torch.tensor([]).to('cuda')\n            if len(known_num):\n                map_known_indice = torch.cat([torch.tensor(range(num)) for num in known_num])  # [1,2, 1,2,3]\n                map_known_indice = torch.cat([map_known_indice + single_pad * i for i in range(scalar)]).long()\n            if len(known_bid):\n                input_query_label[(known_bid.long(), map_known_indice)] = input_label_embed\n                input_query_bbox[(known_bid.long(), map_known_indice)] = input_bbox_embed\n\n            tgt_size = pad_size + self.num_queries\n            attn_mask = torch.ones(tgt_size, tgt_size).to('cuda') < 0\n            # match query cannot see the reconstruct\n            attn_mask[pad_size:, :pad_size] = True\n            # reconstruct cannot see each other\n            for i in range(scalar):\n                if i == 0:\n                    attn_mask[single_pad * i:single_pad * (i + 1), single_pad * (i + 1):pad_size] = True\n                if i == scalar - 1:\n                    attn_mask[single_pad * i:single_pad * (i + 1), :single_pad * i] = True\n                else:\n                    attn_mask[single_pad * i:single_pad * (i + 1), single_pad * (i + 1):pad_size] = True\n                    attn_mask[single_pad * i:single_pad * (i + 1), :single_pad * i] = True\n            mask_dict = {\n                'known_indice': torch.as_tensor(known_indice).long(),\n                'batch_idx': torch.as_tensor(batch_idx).long(),\n                'map_known_indice': torch.as_tensor(map_known_indice).long(),\n                'known_lbs_bboxes': (known_labels, known_bboxs),\n                'know_idx': know_idx,\n                'pad_size': pad_size,\n                'scalar': scalar,\n            }\n        else:\n            if not refpoint_emb is None:\n                
input_query_label = tgt.repeat(batch_size, 1, 1)\n                input_query_bbox = refpoint_emb.repeat(batch_size, 1, 1)\n            else:\n                input_query_label=None\n                input_query_bbox=None\n            attn_mask = None\n            mask_dict=None\n\n        # 100*batch*256\n        if not input_query_bbox is None:\n            input_query_label = input_query_label\n            input_query_bbox = input_query_bbox\n\n        return input_query_label,input_query_bbox,attn_mask,mask_dict\n\n    def dn_post_process(self,outputs_class,outputs_score,outputs_coord,mask_dict,outputs_mask):\n        \"\"\"\n            post process of dn after output from the transformer\n            put the dn part in the mask_dict\n            \"\"\"\n        assert mask_dict['pad_size'] > 0\n        output_known_class = outputs_class[:, :, :mask_dict['pad_size'], :]\n        outputs_class = outputs_class[:, :, mask_dict['pad_size']:, :]\n        output_known_score = outputs_score[:, :, :mask_dict['pad_size'], :]\n        outputs_score = outputs_score[:, :, mask_dict['pad_size']:, :]\n\n        output_known_coord = outputs_coord[:, :, :mask_dict['pad_size'], :]\n        outputs_coord = outputs_coord[:, :, mask_dict['pad_size']:, :]\n        if outputs_mask is not None:\n            output_known_mask = outputs_mask[:, :, :mask_dict['pad_size'], :]\n            outputs_mask = outputs_mask[:, :, mask_dict['pad_size']:, :]\n        out = {'pred_logits': output_known_class[-1], 'pred_scores':output_known_score[-1],'pred_boxes': output_known_coord[-1],'pred_masks': output_known_mask[-1]}\n\n        out['aux_outputs'] = self._set_aux_loss(output_known_class, output_known_score, output_known_mask, output_known_coord)\n        mask_dict['output_known_lbs_bboxes']=out\n        return outputs_class, outputs_score, outputs_coord, outputs_mask\n\n    def get_valid_ratio(self, mask):\n        _, H, W = mask.shape\n        valid_H = torch.sum(~mask[:, :, 0], 1)\n        valid_W = torch.sum(~mask[:, 0, :], 1)\n        valid_ratio_h = valid_H.float() / H\n        valid_ratio_w = valid_W.float() / W\n        valid_ratio = torch.stack([valid_ratio_w, valid_ratio_h], -1)\n        return valid_ratio\n\n    def pred_box(self, reference, hs, ref0=None):\n        \"\"\"\n        :param reference: reference box coordinates from each decoder layer\n        :param hs: content\n        :param ref0: whether there are prediction from the first layer\n        \"\"\"\n        device = reference[0].device\n\n        if ref0 is None:\n            outputs_coord_list = []\n        else:\n            outputs_coord_list = [ref0.to(device)]\n        for dec_lid, (layer_ref_sig, layer_bbox_embed, layer_hs) in enumerate(zip(reference[:-1], self.bbox_embed, hs)):\n            layer_delta_unsig = layer_bbox_embed(layer_hs).to(device)\n            layer_outputs_unsig = layer_delta_unsig + inverse_sigmoid(layer_ref_sig).to(device)\n            layer_outputs_unsig = layer_outputs_unsig.sigmoid()\n            outputs_coord_list.append(layer_outputs_unsig)\n        outputs_coord_list = torch.stack(outputs_coord_list)\n        return outputs_coord_list\n\n    def forward(self, x, mask_features, extra, task, masks, targets=None):\n        \"\"\"\n        :param x: input, a list of multi-scale feature\n        :param mask_features: is the per-pixel embeddings with resolution 1/4 of the original image,\n        obtained by fusing backbone encoder encoded features. 
This is used to produce binary masks.\n        :param masks: mask in the original image\n        :param targets: used for denoising training\n        \"\"\"\n\n        if 'spatial_query_pos_mask' in extra:\n            visual_P = True\n        else:\n            visual_P = False\n\n        assert len(x) == self.num_feature_levels\n        device = x[0].device\n        size_list = []\n        # disable mask, it does not affect performance\n        enable_mask = 0\n        if masks is not None:\n            for src in x:\n                if src.size(2) % 32 or src.size(3) % 32:\n                    enable_mask = 1\n        if enable_mask == 0:\n            masks = [torch.zeros((src.size(0), src.size(2), src.size(3)), device=src.device, dtype=torch.bool) for src in x]\n        src_flatten = []\n        mask_flatten = []\n        spatial_shapes = []\n        for i in range(self.num_feature_levels):\n            idx=self.num_feature_levels-1-i\n            bs, c , h, w=x[idx].shape\n            size_list.append(x[i].shape[-2:])\n            spatial_shapes.append(x[idx].shape[-2:])\n            src_flatten.append(self.input_proj[idx](x[idx]).flatten(2).transpose(1, 2))\n            mask_flatten.append(masks[i].flatten(1))\n        src_flatten = torch.cat(src_flatten, 1)  # bs, \\sum{hxw}, c\n        mask_flatten = torch.cat(mask_flatten, 1)  # bs, \\sum{hxw}\n        spatial_shapes = torch.as_tensor(spatial_shapes, dtype=torch.long, device=src_flatten.device)\n        level_start_index = torch.cat((spatial_shapes.new_zeros((1,)), spatial_shapes.prod(1).cumsum(0)[:-1]))\n        valid_ratios = torch.stack([self.get_valid_ratio(m) for m in masks], 1)\n        \n        predictions_federate = []\n        predictions_score = []\n        predictions_class = []\n        predictions_mask = []\n        if self.two_stage:\n            output_memory, output_proposals = gen_encoder_output_proposals(src_flatten, mask_flatten, spatial_shapes)\n            output_memory = self.enc_output_norm(self.enc_output(output_memory))\n            \n            if task in ['grounding','rvos']:\n                class_embed = output_memory @ self.category_embed \n                enc_outputs_class_unselected =  torch.einsum(\"bqc,bc->bq\", class_embed, extra['grounding_class']).unsqueeze(-1) #[bz,numq,1]\n            \n            elif visual_P:\n                enc_outputs_class_unselected =  self.confidence_score(output_memory)\n            else:\n                class_embed = output_memory @ self.category_embed  # [bz,num_q,projectdim]\n                enc_outputs_class_unselected = torch.einsum(\"bqc,nc->bqn\", class_embed, extra['class_embeddings'])  #[bz,n,80]\n            enc_outputs_coord_unselected = self._bbox_embed(\n                output_memory) + output_proposals  # (bs, \\sum{hw}, 4) unsigmoid\n            topk = self.num_queries\n            topk_proposals = torch.topk(enc_outputs_class_unselected.max(-1)[0], topk, dim=1)[1]\n            refpoint_embed_undetach = torch.gather(enc_outputs_coord_unselected, 1,\n                                                   topk_proposals.unsqueeze(-1).repeat(1, 1, 4))  # unsigmoid\n            refpoint_embed = refpoint_embed_undetach.detach() #[bz,num_q,4]\n            tgt_undetach = torch.gather(output_memory, 1,\n                                  topk_proposals.unsqueeze(-1).repeat(1, 1, self.hidden_dim))  # unsigmoid  #[bz,num_q.256]\n\n            conf_score, outputs_class, outputs_mask,_ = self.forward_prediction_heads(tgt_undetach.transpose(0, 1), mask_features, 
task, extra, mask_dict = None)\n            tgt = tgt_undetach.detach()\n            if self.learn_tgt:\n                tgt = self.query_feat.weight[None].repeat(bs, 1, 1)\n            interm_outputs=dict()\n            interm_outputs['pred_logits'] = outputs_class\n            interm_outputs['pred_scores'] = conf_score\n            interm_outputs['pred_boxes'] = refpoint_embed_undetach.sigmoid()\n            interm_outputs['pred_masks'] = outputs_mask\n\n        elif not self.two_stage:\n            tgt = self.query_feat.weight[None].repeat(bs, 1, 1)\n            refpoint_embed = self.query_embed.weight[None].repeat(bs, 1, 1)\n        tgt_mask = None\n        mask_dict = None\n        if self.dn != \"no\" and self.training:\n            assert targets is not None\n            input_query_label, input_query_bbox, tgt_mask, mask_dict = \\\n                self.prepare_for_dn(targets, None, None, x[0].shape[0],task)\n            if mask_dict is not None:\n                tgt=torch.cat([input_query_label, tgt],dim=1)\n        # direct prediction from the matching and denoising part in the begining\n        if self.initial_pred:\n            conf_score, outputs_class, outputs_mask, pred_federat = self.forward_prediction_heads(tgt.transpose(0, 1), mask_features, task, extra, mask_dict, self.training)\n            predictions_score.append(conf_score)\n            predictions_class.append(outputs_class)\n            predictions_mask.append(outputs_mask)\n            predictions_federate.append(pred_federat)\n        if self.dn != \"no\" and self.training and mask_dict is not None:\n            refpoint_embed=torch.cat([input_query_bbox,refpoint_embed],dim=1)\n        hs, references, cross_track_embed = self.decoder(\n            tgt=tgt.transpose(0, 1),\n            memory=src_flatten.transpose(0, 1),\n            memory_key_padding_mask=mask_flatten,\n            pos=None,\n            refpoints_unsigmoid=refpoint_embed.transpose(0, 1),\n            level_start_index=level_start_index,\n            spatial_shapes=spatial_shapes,\n            valid_ratios=valid_ratios,\n            tgt_mask=tgt_mask,\n            task=task,\n            extra=extra,\n        )\n        for i, output in enumerate(hs):\n            conf_score, outputs_class, outputs_mask,pred_federat = self.forward_prediction_heads(output.transpose(0, 1), mask_features, task, extra, mask_dict, self.training or (i == len(hs)-1))\n            predictions_score.append(conf_score)\n            predictions_class.append(outputs_class)\n            predictions_mask.append(outputs_mask)\n            predictions_federate.append(pred_federat)\n\n        # iteratively box prediction\n        if self.initial_pred:\n            out_boxes = self.pred_box(references, hs, refpoint_embed.sigmoid())\n            assert len(predictions_class) == self.num_layers + 1\n        else:\n            out_boxes = self.pred_box(references, hs)\n        if mask_dict is not None:\n            predictions_mask=torch.stack(predictions_mask)\n            predictions_class=torch.stack(predictions_class)\n            predictions_score = torch.stack(predictions_score)\n            predictions_class, predictions_score, out_boxes, predictions_mask=\\\n                self.dn_post_process(predictions_class, predictions_score, out_boxes,mask_dict,predictions_mask)\n\n            predictions_class,  predictions_score, predictions_mask=list(predictions_class), list(predictions_score), list(predictions_mask)\n        elif self.training:  # this is to insure self.label_enc 
participate in the model\n            predictions_class[-1] += 0.0*self.label_enc[task].weight.sum()\n        if mask_dict is not None:\n            track_embed =  hs[-1][:, mask_dict['pad_size']:, :] \n        else:\n            track_embed =  hs[-1]  \n\n        out = {\n            'pred_federat':predictions_federate[-1],\n            'pred_logits': predictions_class[-1],\n            'pred_scores': predictions_score[-1],\n            'pred_masks': predictions_mask[-1],\n            'pred_boxes':out_boxes[-1],\n            'pred_track_embed': track_embed,\n            'visual_P': visual_P,\n            'aux_outputs': self._set_aux_loss(\n                predictions_class if self.mask_classification else None, predictions_score, predictions_mask, out_boxes, predictions_federate, visual_P\n            )\n        }\n        if self.two_stage:\n            out['interm_outputs'] = interm_outputs\n        return out, mask_dict\n\n    def forward_prediction_heads(self, output, mask_features, task, extra,mask_dict, pred_mask=True, visual_P=False):\n        decoder_output = self.decoder_norm(output)\n        decoder_output = decoder_output.transpose(0, 1)\n        # outputs_class = self.class_embed(decoder_output)\n\n        conf_score = self.confidence_score(decoder_output) # if visual_P else None\n\n        class_embed = decoder_output @ self.category_embed  # [bz,num_q,projectdim]\n        if task in ['grounding', 'rvos']:\n            outputs_class =  torch.einsum(\"bqc,bc->bq\", class_embed, extra['grounding_class']).unsqueeze(-1) #[bz,numq,1]\n        else:\n            outputs_class = torch.einsum(\"bqc,nc->bqn\", class_embed, extra['class_embeddings'])  #[bz,n,80]\n       \n        outputs_mask = None\n        if pred_mask:\n            mask_embed = self.mask_embed(decoder_output)\n            outputs_mask = torch.einsum(\"bqc,bchw->bqhw\", mask_embed, mask_features)\n\n\n        return conf_score, outputs_class, outputs_mask, None\n\n    @torch.jit.unused\n    def _set_aux_loss(self, outputs_class, outputs_score, outputs_seg_masks, out_boxes, predictions_federate=None, visual_P=False):\n        # this is a workaround to make torchscript happy, as torchscript\n        # doesn't support dictionary with non-homogeneous values, such\n        # as a dict having both a Tensor and a list.\n        # if self.mask_classification:\n        if predictions_federate is None:\n            return [\n                {\"pred_logits\": a, \"pred_scores\": b, \"pred_masks\": c, \"pred_boxes\":d, 'visual_P': visual_P}\n                for a, b, c, d in zip(outputs_class[:-1], outputs_score[:-1], outputs_seg_masks[:-1], out_boxes[:-1])\n            ]\n        else:\n            return [\n                {\"pred_logits\": a, \"pred_scores\": b, \"pred_masks\": c, \"pred_boxes\":d, 'pred_federat':e,'visual_P': visual_P}\n                for a, b, c, d, e in zip(outputs_class[:-1], outputs_score[:-1], outputs_seg_masks[:-1], out_boxes[:-1], predictions_federate[:-1])\n            ]"
  },
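  {
    "path": "thirdparty/GLEE/glee/models/prediction_heads_example.py",
    "content": "# Illustrative sketch, NOT part of the original GLEE release: this file and its\n# location are hypothetical. It only demonstrates, with dummy tensors, the two\n# einsum contractions used by forward_prediction_heads above ('bqc,nc->bqn' for\n# per-query class logits against a bank of class/text embeddings, 'bqc,bchw->bqhw'\n# for per-query mask logits over the pixel-decoder features). The real head first\n# projects the decoder output with self.category_embed / self.mask_embed; that\n# projection is skipped here to keep the shape walk-through minimal.\nimport torch\n\nbs, num_q, dim, num_cls, h, w = 2, 300, 256, 80, 64, 64\n\ndecoder_output = torch.randn(bs, num_q, dim)    # per-query decoder features\nclass_embeddings = torch.randn(num_cls, dim)    # e.g. text embeddings, one per category\nmask_features = torch.randn(bs, dim, h, w)      # pixel-decoder feature map\n\noutputs_class = torch.einsum('bqc,nc->bqn', decoder_output, class_embeddings)\noutputs_mask = torch.einsum('bqc,bchw->bqhw', decoder_output, mask_features)\n\nassert outputs_class.shape == (bs, num_q, num_cls)\nassert outputs_mask.shape == (bs, num_q, h, w)\nprint(outputs_class.shape, outputs_mask.shape)\n"
  },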
  {
    "path": "thirdparty/GLEE/glee/models/vos_utils.py",
    "content": "import torch\nimport torch.nn.functional as F\nfrom torch import nn\nfrom timm.models.layers import DropPath\n\n\n\n\nclass VLFuse(torch.nn.Module):\n    \"\"\"\n    Early Fusion Module\n    \"\"\"\n\n    def __init__(self, ):\n        super(VLFuse, self).__init__()\n        self.init_configs()\n\n        # early fusion module\n        # bi-direction (text->image, image->text)\n        self.b_attn = BiAttentionBlockForCheckpoint(v_dim=self.img_dim, # 256\n                    l_dim=self.lang_dim, # 768\n                    embed_dim=self.embed_dim, # 2048\n                    num_heads=self.n_head, # 8\n                    dropout=0.1,\n                    drop_path=.0,\n                    init_values=1.0 / 6,\n                    )\n    def init_configs(self, ):\n        # common params\n        self.img_dim =  256\n\n        self.max_query_len = 256\n        self.n_layers =1\n\n        # mha params\n        self.n_head = 8\n        self.embed_dim = 2048 # 2048 by default\n        \n        self.lang_dim = 256\n\n    def forward(self, x, task=None):\n        visual_features = x[\"visual\"]\n        language_dict_features = x[\"lang\"]\n\n        fused_visual_features, language_features = self.b_attn(\n                visual_features, language_dict_features['hidden'], language_dict_features['masks'], task)\n\n        language_dict_features['hidden'] = language_features\n        fused_language_dict_features = language_dict_features\n\n        features_dict = {\"visual\": fused_visual_features,\n                         \"lang\": fused_language_dict_features}\n\n        return features_dict\n\n\n\ndef masks_to_boxes(masks):\n    \"\"\"Compute the bounding boxes around the provided masks\n\n    The masks should be in format [N, H, W] where N is the number of masks, (H, W) are the spatial dimensions.\n\n    Returns a [N, 4] tensors, with the boxes in xyxy format\n    \"\"\"\n    if masks.numel() == 0:\n        return torch.zeros((0, 4), device=masks.device)\n\n    h, w = masks.shape[-2:]\n\n    y = torch.arange(0, h, dtype=torch.float, device=masks.device)\n    x = torch.arange(0, w, dtype=torch.float, device=masks.device)\n    y, x = torch.meshgrid(y, x)\n\n    x_mask = (masks * x.unsqueeze(0))\n    x_max = x_mask.flatten(1).max(-1)[0]\n    x_min = x_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1)[0]\n\n    y_mask = (masks * y.unsqueeze(0))\n    y_max = y_mask.flatten(1).max(-1)[0]\n    y_min = y_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1)[0]\n\n    return torch.stack([x_min, y_min, x_max, y_max], 1)\n\nclass FeatureFuser(nn.Module):\n    \"\"\"\n    Feature Fuser for SOT (inspired by CondInst)\n    \"\"\"\n    def __init__(self, in_channels, channels=256):\n        super().__init__()\n\n        self.refine = nn.ModuleList()\n        for in_channel in in_channels:\n            self.refine.append(nn.Conv2d(in_channel, channels, 3, padding=1))\n\n    def forward(self, features):\n        # -4, -3, -2, -1 corresponds to P3, P4, P5, P6\n        for i, f in enumerate([-3, -2, -1]):\n            if i == 0:\n                x = self.refine[i](features[f])\n            else:\n                x_p = self.refine[i](features[f])\n                target_h, target_w = x.size()[2:]\n                h, w = x_p.size()[2:]\n                assert target_h % h == 0\n                assert target_w % w == 0\n                factor_h, factor_w = target_h // h, target_w // w\n                assert factor_h == factor_w\n                x_p = aligned_bilinear(x_p, factor_h)\n 
               x = x + x_p\n        return x\n\ndef aligned_bilinear(tensor, factor):\n    assert tensor.dim() == 4\n    assert factor >= 1\n    assert int(factor) == factor\n\n    if factor == 1:\n        return tensor\n\n    h, w = tensor.size()[2:]\n    tensor = F.pad(tensor, pad=(0, 1, 0, 1), mode=\"replicate\")\n    oh = factor * h + 1\n    ow = factor * w + 1\n    tensor = F.interpolate(\n        tensor, size=(oh, ow),\n        mode='bilinear',\n        align_corners=True\n    )\n    tensor = F.pad(\n        tensor, pad=(factor // 2, 0, factor // 2, 0),\n        mode=\"replicate\"\n    )\n\n    return tensor[:, :, :oh - 1, :ow - 1]\n\n\n\n\nclass BiMultiHeadAttention(nn.Module):\n    def __init__(self, v_dim, l_dim, embed_dim, num_heads, dropout=0.1):\n        super(BiMultiHeadAttention, self).__init__()\n\n        self.embed_dim = embed_dim\n        self.num_heads = num_heads\n        self.head_dim = embed_dim // num_heads\n        self.v_dim = v_dim\n        self.l_dim = l_dim\n\n        assert (\n                self.head_dim * self.num_heads == self.embed_dim\n        ), f\"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`: {self.num_heads}).\"\n        self.scale = self.head_dim ** (-0.5)\n        self.dropout = dropout\n\n        self.v_proj = nn.Linear(self.v_dim, self.embed_dim)\n        self.l_proj = nn.Linear(self.l_dim, self.embed_dim)\n        self.values_v_proj = nn.Linear(self.v_dim, self.embed_dim)\n        self.values_l_proj = nn.Linear(self.l_dim, self.embed_dim)\n\n        self.out_v_proj = nn.Linear(self.embed_dim, self.v_dim)\n        self.out_l_proj = nn.Linear(self.embed_dim, self.l_dim)\n\n        self.stable_softmax_2d =  False\n        self.clamp_min_for_underflow = True\n        self.clamp_max_for_overflow = True\n\n        self._reset_parameters()\n\n    def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):\n        return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()\n\n    def _reset_parameters(self):\n        nn.init.xavier_uniform_(self.v_proj.weight)\n        self.v_proj.bias.data.fill_(0)\n        nn.init.xavier_uniform_(self.l_proj.weight)\n        self.l_proj.bias.data.fill_(0)\n        nn.init.xavier_uniform_(self.values_v_proj.weight)\n        self.values_v_proj.bias.data.fill_(0)\n        nn.init.xavier_uniform_(self.values_l_proj.weight)\n        self.values_l_proj.bias.data.fill_(0)\n        nn.init.xavier_uniform_(self.out_v_proj.weight)\n        self.out_v_proj.bias.data.fill_(0)\n        nn.init.xavier_uniform_(self.out_l_proj.weight)\n        self.out_l_proj.bias.data.fill_(0)\n\n    def forward(self, v, l, attention_mask_l=None):\n        bsz, tgt_len, embed_dim = v.size()\n\n        query_states = self.v_proj(v) * self.scale\n        key_states = self._shape(self.l_proj(l), -1, bsz)\n        value_v_states = self._shape(self.values_v_proj(v), -1, bsz)\n        value_l_states = self._shape(self.values_l_proj(l), -1, bsz)\n\n        proj_shape = (bsz * self.num_heads, -1, self.head_dim) # (bs * 8, -1, embed_dim//8)\n        query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape) # (bs * 8, seq_len_img, embed_dim//8)\n        key_states = key_states.view(*proj_shape) # (bs * 8, seq_len_text, embed_dim//8)\n        value_v_states = value_v_states.view(*proj_shape)\n        value_l_states = value_l_states.view(*proj_shape)\n\n        src_len = key_states.size(1)\n        attn_weights = torch.bmm(query_states, key_states.transpose(1, 2)) 
# (bs * 8, seq_len_img, seq_len_text)\n\n        if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):\n            raise ValueError(\n                f\"Attention weights should be of size {(bsz * self.num_heads, tgt_len, src_len)}, but is {attn_weights.size()}\"\n            )\n\n        # attn_weights_l = nn.functional.softmax(attn_weights.transpose(1, 2), dim=-1)\n\n        if self.stable_softmax_2d:\n            attn_weights = attn_weights - attn_weights.max()\n        \n        if self.clamp_min_for_underflow:\n            attn_weights = torch.clamp(attn_weights, min=-50000) # Do not increase -50000, data type half has quite limited range\n        if self.clamp_max_for_overflow:\n            attn_weights = torch.clamp(attn_weights, max=50000) # Do not increase 50000, data type half has quite limited range\n\n        attn_weights_T = attn_weights.transpose(1, 2)\n        attn_weights_l = (attn_weights_T - torch.max(attn_weights_T, dim=-1, keepdim=True)[\n            0])\n        if self.clamp_min_for_underflow:\n            attn_weights_l = torch.clamp(attn_weights_l, min=-50000) # Do not increase -50000, data type half has quite limited range\n        if self.clamp_max_for_overflow:\n            attn_weights_l = torch.clamp(attn_weights_l, max=50000) # Do not increase 50000, data type half has quite limited range\n\n        attn_weights_l = attn_weights_l.softmax(dim=-1)\n        # assert attention_mask_l.dtype == torch.int64\n        if attention_mask_l is not None:\n            assert (attention_mask_l.dim() == 2) # (bs, seq_len)\n            attention_mask = attention_mask_l.unsqueeze(1).unsqueeze(1) # (bs, 1, 1, seq_len)\n            attention_mask = attention_mask.expand(bsz, 1, tgt_len, src_len)\n            attention_mask = attention_mask.masked_fill(attention_mask == 0, -9e15)\n\n            if attention_mask.size() != (bsz, 1, tgt_len, src_len):\n                raise ValueError(\n                    f\"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}\"\n                )\n            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attention_mask\n            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)\n\n        attn_weights_v = nn.functional.softmax(attn_weights, dim=-1)\n\n        attn_probs_v = F.dropout(attn_weights_v, p=self.dropout, training=self.training)\n        attn_probs_l = F.dropout(attn_weights_l, p=self.dropout, training=self.training)\n\n        attn_output_v = torch.bmm(attn_probs_v, value_l_states)\n        attn_output_l = torch.bmm(attn_probs_l, value_v_states)\n\n\n        if attn_output_v.size() != (bsz * self.num_heads, tgt_len, self.head_dim):\n            raise ValueError(\n                f\"`attn_output_v` should be of size {(bsz, self.num_heads, tgt_len, self.head_dim)}, but is {attn_output_v.size()}\"\n            )\n\n        if attn_output_l.size() != (bsz * self.num_heads, src_len, self.head_dim):\n            raise ValueError(\n                f\"`attn_output_l` should be of size {(bsz, self.num_heads, src_len, self.head_dim)}, but is {attn_output_l.size()}\"\n            )\n\n        attn_output_v = attn_output_v.view(bsz, self.num_heads, tgt_len, self.head_dim)\n        attn_output_v = attn_output_v.transpose(1, 2)\n        attn_output_v = attn_output_v.reshape(bsz, tgt_len, self.embed_dim)\n\n        attn_output_l = attn_output_l.view(bsz, self.num_heads, src_len, self.head_dim)\n        attn_output_l = attn_output_l.transpose(1, 2)\n        attn_output_l = 
attn_output_l.reshape(bsz, src_len, self.embed_dim)\n\n        attn_output_v = self.out_v_proj(attn_output_v)\n        attn_output_l = self.out_l_proj(attn_output_l)\n\n        return attn_output_v, attn_output_l\n\n\nclass BiAttentionBlockForCheckpoint(nn.Module):\n    def __init__(self, v_dim, l_dim, embed_dim, num_heads, dropout=0.1,\n                 drop_path=.0, init_values=1e-4,  ):\n        \"\"\"\n        Inputs:\n            embed_dim - Dimensionality of input and attention feature vectors\n            num_heads - Number of heads to use in the Multi-Head Attention block\n            dropout - Amount of dropout to apply in the feed-forward network\n        \"\"\"\n        super(BiAttentionBlockForCheckpoint, self).__init__()\n\n        # pre layer norm\n        self.layer_norm_v = nn.LayerNorm(v_dim)\n        self.layer_norm_l = nn.LayerNorm(l_dim)\n        self.attn = BiMultiHeadAttention(v_dim=v_dim,\n                                         l_dim=l_dim,\n                                         embed_dim=embed_dim,\n                                         num_heads=num_heads,\n                                         dropout=dropout,\n                                        )\n\n        # add layer scale for training stability\n        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()\n        self.gamma_v = nn.Parameter(init_values * torch.ones((v_dim)), requires_grad=True)\n        self.gamma_l = nn.Parameter(init_values * torch.ones((l_dim)), requires_grad=True)\n\n\n    def forward(self, v, l, attention_mask_l=None, task=None):\n        # v: visual features, (bs, sigma(HW), 256)\n        # l: language features, (bs, seq_len, 768)\n        v = self.layer_norm_v(v)\n        l = self.layer_norm_l(l)\n        delta_v, delta_l = self.attn(v, l, attention_mask_l=attention_mask_l)\n        # v, l = v + delta_v, l + delta_l\n        v = v + self.drop_path(self.gamma_v * delta_v)\n        l = l + self.drop_path(self.gamma_l * delta_l)\n        return v, l\n\n\n"
  },
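  {
    "path": "thirdparty/GLEE/glee/models/vos_utils_example.py",
    "content": "# Illustrative sketch, NOT part of the original GLEE release: this file and its\n# location are hypothetical. It exercises masks_to_boxes and aligned_bilinear\n# from vos_utils.py with dummy inputs; it assumes thirdparty/GLEE is on\n# PYTHONPATH so that the 'glee' package (and its dependencies) can be imported.\nimport torch\n\nfrom glee.models.vos_utils import aligned_bilinear, masks_to_boxes\n\n# a single 32x32 binary mask with a filled rectangle -> xyxy box around it\nmasks = torch.zeros(1, 32, 32)\nmasks[0, 10:25, 8:21] = 1.0\nprint(masks_to_boxes(masks))  # tensor([[ 8., 10., 20., 24.]])\n\n# integer-factor bilinear upsampling that keeps pixel centers aligned\nfeat = torch.randn(1, 256, 16, 16)\nprint(aligned_bilinear(feat, factor=2).shape)  # torch.Size([1, 256, 32, 32])\n"
  },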
  {
    "path": "thirdparty/GLEE/glee/modules/__init__.py",
    "content": "from .position_encoding import *\nfrom .attention import *\nfrom .postprocessing import *\nfrom .point_features import *"
  },
  {
    "path": "thirdparty/GLEE/glee/modules/attention.py",
    "content": "# Code copy from PyTorch, modified by Xueyan Zou\n\nimport warnings\nfrom typing import Optional, Tuple\n\nimport torch\nimport torch.nn as nn\nfrom torch import Tensor\nfrom torch.nn.init import constant_, xavier_normal_, xavier_uniform_\nfrom torch.nn.parameter import Parameter\nfrom torch.overrides import has_torch_function, handle_torch_function\nfrom torch.nn.functional import pad, linear, softmax, dropout\n\n\ndef multi_head_attention_forward(\n    query: Tensor,\n    key: Tensor,\n    value: Tensor,\n    embed_dim_to_check: int,\n    num_heads: int,\n    in_proj_weight: Tensor,\n    in_proj_bias: Tensor,\n    bias_k: Optional[Tensor],\n    bias_v: Optional[Tensor],\n    add_zero_attn: bool,\n    dropout_p: float,\n    out_proj_weight: Tensor,\n    out_proj_bias: Tensor,\n    training: bool = True,\n    key_padding_mask: Optional[Tensor] = None,\n    need_weights: bool = True,\n    attn_mask: Optional[Tensor] = None,\n    use_separate_proj_weight: bool = False,\n    q_proj_weight: Optional[Tensor] = None,\n    k_proj_weight: Optional[Tensor] = None,\n    v_proj_weight: Optional[Tensor] = None,\n    static_k: Optional[Tensor] = None,\n    static_v: Optional[Tensor] = None,\n) -> Tuple[Tensor, Optional[Tensor]]:\n    r\"\"\"\n    Args:\n        query, key, value: map a query and a set of key-value pairs to an output.\n            See \"Attention Is All You Need\" for more details.\n        embed_dim_to_check: total dimension of the model.\n        num_heads: parallel attention heads.\n        in_proj_weight, in_proj_bias: input projection weight and bias.\n        bias_k, bias_v: bias of the key and value sequences to be added at dim=0.\n        add_zero_attn: add a new batch of zeros to the key and\n                       value sequences at dim=1.\n        dropout_p: probability of an element to be zeroed.\n        out_proj_weight, out_proj_bias: the output projection weight and bias.\n        training: apply dropout if is ``True``.\n        key_padding_mask: if provided, specified padding elements in the key will\n            be ignored by the attention. This is an binary mask. When the value is True,\n            the corresponding value on the attention layer will be filled with -inf.\n        need_weights: output attn_output_weights.\n        attn_mask: 2D or 3D mask that prevents attention to certain positions. A 2D mask will be broadcasted for all\n            the batches while a 3D mask allows to specify a different mask for the entries of each batch.\n        use_separate_proj_weight: the function accept the proj. weights for query, key,\n            and value in different forms. 
If false, in_proj_weight will be used, which is\n            a combination of q_proj_weight, k_proj_weight, v_proj_weight.\n        q_proj_weight, k_proj_weight, v_proj_weight, in_proj_bias: input projection weight and bias.\n        static_k, static_v: static key and value used for attention operators.\n\n\n    Shape:\n        Inputs:\n        - query: :math:`(L, N, E)` where L is the target sequence length, N is the batch size, E is\n          the embedding dimension.\n        - key: :math:`(S, N, E)`, where S is the source sequence length, N is the batch size, E is\n          the embedding dimension.\n        - value: :math:`(S, N, E)` where S is the source sequence length, N is the batch size, E is\n          the embedding dimension.\n        - key_padding_mask: :math:`(N, S)` where N is the batch size, S is the source sequence length.\n          If a ByteTensor is provided, the non-zero positions will be ignored while the zero positions\n          will be unchanged. If a BoolTensor is provided, the positions with the\n          value of ``True`` will be ignored while the position with the value of ``False`` will be unchanged.\n        - attn_mask: 2D mask :math:`(L, S)` where L is the target sequence length, S is the source sequence length.\n          3D mask :math:`(N*num_heads, L, S)` where N is the batch size, L is the target sequence length,\n          S is the source sequence length. attn_mask ensures that position i is allowed to attend the unmasked\n          positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend\n          while the zero positions will be unchanged. If a BoolTensor is provided, positions with ``True``\n          are not allowed to attend while ``False`` values will be unchanged. If a FloatTensor\n          is provided, it will be added to the attention weight.\n        - static_k: :math:`(N*num_heads, S, E/num_heads)`, where S is the source sequence length,\n          N is the batch size, E is the embedding dimension. E/num_heads is the head dimension.\n        - static_v: :math:`(N*num_heads, S, E/num_heads)`, where S is the source sequence length,\n          N is the batch size, E is the embedding dimension. 
E/num_heads is the head dimension.\n\n        Outputs:\n        - attn_output: :math:`(L, N, E)` where L is the target sequence length, N is the batch size,\n          E is the embedding dimension.\n        - attn_output_weights: :math:`(N, L, S)` where N is the batch size,\n          L is the target sequence length, S is the source sequence length.\n    \"\"\"\n    tens_ops = (query, key, value, in_proj_weight, in_proj_bias, bias_k, bias_v, out_proj_weight, out_proj_bias)\n    if has_torch_function(tens_ops):\n        return handle_torch_function(\n            multi_head_attention_forward,\n            tens_ops,\n            query,\n            key,\n            value,\n            embed_dim_to_check,\n            num_heads,\n            in_proj_weight,\n            in_proj_bias,\n            bias_k,\n            bias_v,\n            add_zero_attn,\n            dropout_p,\n            out_proj_weight,\n            out_proj_bias,\n            training=training,\n            key_padding_mask=key_padding_mask,\n            need_weights=need_weights,\n            attn_mask=attn_mask,\n            use_separate_proj_weight=use_separate_proj_weight,\n            q_proj_weight=q_proj_weight,\n            k_proj_weight=k_proj_weight,\n            v_proj_weight=v_proj_weight,\n            static_k=static_k,\n            static_v=static_v,\n        )\n    tgt_len, bsz, embed_dim = query.size()\n    assert embed_dim == embed_dim_to_check\n    # allow MHA to have different sizes for the feature dimension\n    assert key.size(0) == value.size(0) and key.size(1) == value.size(1)\n\n    head_dim = embed_dim // num_heads\n    assert head_dim * num_heads == embed_dim, \"embed_dim must be divisible by num_heads\"\n    scaling = float(head_dim) ** -0.5\n\n    if not use_separate_proj_weight:\n        if (query is key or torch.equal(query, key)) and (key is value or torch.equal(key, value)):\n            # self-attention\n            q, k, v = linear(query, in_proj_weight, in_proj_bias).chunk(3, dim=-1)\n\n        elif key is value or torch.equal(key, value):\n            # encoder-decoder attention\n            # This is inline in_proj function with in_proj_weight and in_proj_bias\n            _b = in_proj_bias\n            _start = 0\n            _end = embed_dim\n            _w = in_proj_weight[_start:_end, :]\n            if _b is not None:\n                _b = _b[_start:_end]\n            q = linear(query, _w, _b)\n\n            if key is None:\n                assert value is None\n                k = None\n                v = None\n            else:\n\n                # This is inline in_proj function with in_proj_weight and in_proj_bias\n                _b = in_proj_bias\n                _start = embed_dim\n                _end = None\n                _w = in_proj_weight[_start:, :]\n                if _b is not None:\n                    _b = _b[_start:]\n                k, v = linear(key, _w, _b).chunk(2, dim=-1)\n\n        else:\n            # This is inline in_proj function with in_proj_weight and in_proj_bias\n            _b = in_proj_bias\n            _start = 0\n            _end = embed_dim\n            _w = in_proj_weight[_start:_end, :]\n            if _b is not None:\n                _b = _b[_start:_end]\n            q = linear(query, _w, _b)\n\n            # This is inline in_proj function with in_proj_weight and in_proj_bias\n            _b = in_proj_bias\n            _start = embed_dim\n            _end = embed_dim * 2\n            _w = in_proj_weight[_start:_end, :]\n            if _b 
is not None:\n                _b = _b[_start:_end]\n            k = linear(key, _w, _b)\n\n            # This is inline in_proj function with in_proj_weight and in_proj_bias\n            _b = in_proj_bias\n            _start = embed_dim * 2\n            _end = None\n            _w = in_proj_weight[_start:, :]\n            if _b is not None:\n                _b = _b[_start:]\n            v = linear(value, _w, _b)\n    else:\n        q_proj_weight_non_opt = torch.jit._unwrap_optional(q_proj_weight)\n        len1, len2 = q_proj_weight_non_opt.size()\n        assert len1 == embed_dim and len2 == query.size(-1)\n\n        k_proj_weight_non_opt = torch.jit._unwrap_optional(k_proj_weight)\n        len1, len2 = k_proj_weight_non_opt.size()\n        assert len1 == embed_dim and len2 == key.size(-1)\n\n        v_proj_weight_non_opt = torch.jit._unwrap_optional(v_proj_weight)\n        len1, len2 = v_proj_weight_non_opt.size()\n        assert len1 == embed_dim and len2 == value.size(-1)\n\n        if in_proj_bias is not None:\n            q = linear(query, q_proj_weight_non_opt, in_proj_bias[0:embed_dim])\n            k = linear(key, k_proj_weight_non_opt, in_proj_bias[embed_dim : (embed_dim * 2)])\n            v = linear(value, v_proj_weight_non_opt, in_proj_bias[(embed_dim * 2) :])\n        else:\n            q = linear(query, q_proj_weight_non_opt, in_proj_bias)\n            k = linear(key, k_proj_weight_non_opt, in_proj_bias)\n            v = linear(value, v_proj_weight_non_opt, in_proj_bias)\n    q = q * scaling\n\n    if attn_mask is not None:\n        assert (\n            attn_mask.dtype == torch.float32\n            or attn_mask.dtype == torch.float64\n            or attn_mask.dtype == torch.float16\n            or attn_mask.dtype == torch.uint8\n            or attn_mask.dtype == torch.bool\n        ), \"Only float, byte, and bool types are supported for attn_mask, not {}\".format(attn_mask.dtype)\n        if attn_mask.dtype == torch.uint8:\n            warnings.warn(\"Byte tensor for attn_mask in nn.MultiheadAttention is deprecated. Use bool tensor instead.\")\n            attn_mask = attn_mask.to(torch.bool)\n\n        if attn_mask.dim() == 2:\n            attn_mask = attn_mask.unsqueeze(0)\n            if list(attn_mask.size()) != [1, query.size(0), key.size(0)]:\n                raise RuntimeError(\"The size of the 2D attn_mask is not correct.\")\n        elif attn_mask.dim() == 3:\n            if list(attn_mask.size()) != [bsz * num_heads, query.size(0), key.size(0)]:\n                raise RuntimeError(\"The size of the 3D attn_mask is not correct.\")\n        else:\n            raise RuntimeError(\"attn_mask's dimension {} is not supported\".format(attn_mask.dim()))\n        # attn_mask's dim is 3 now.\n\n    # convert ByteTensor key_padding_mask to bool\n    if key_padding_mask is not None and key_padding_mask.dtype == torch.uint8:\n        warnings.warn(\n            \"Byte tensor for key_padding_mask in nn.MultiheadAttention is deprecated. 
Use bool tensor instead.\"\n        )\n        key_padding_mask = key_padding_mask.to(torch.bool)\n\n    if bias_k is not None and bias_v is not None:\n        if static_k is None and static_v is None:\n            k = torch.cat([k, bias_k.repeat(1, bsz, 1)])\n            v = torch.cat([v, bias_v.repeat(1, bsz, 1)])\n            if attn_mask is not None:\n                attn_mask = pad(attn_mask, (0, 1))\n            if key_padding_mask is not None:\n                key_padding_mask = pad(key_padding_mask, (0, 1))\n        else:\n            assert static_k is None, \"bias cannot be added to static key.\"\n            assert static_v is None, \"bias cannot be added to static value.\"\n    else:\n        assert bias_k is None\n        assert bias_v is None\n\n    q = q.contiguous().view(tgt_len, bsz * num_heads, head_dim).transpose(0, 1)\n    if k is not None:\n        k = k.contiguous().view(-1, bsz * num_heads, head_dim).transpose(0, 1)\n    if v is not None:\n        v = v.contiguous().view(-1, bsz * num_heads, head_dim).transpose(0, 1)\n\n    if static_k is not None:\n        assert static_k.size(0) == bsz * num_heads\n        assert static_k.size(2) == head_dim\n        k = static_k\n\n    if static_v is not None:\n        assert static_v.size(0) == bsz * num_heads\n        assert static_v.size(2) == head_dim\n        v = static_v\n\n    src_len = k.size(1)\n\n    if key_padding_mask is not None:\n        # assert key_padding_mask.size(0) == bsz\n        assert key_padding_mask.size(1) == src_len\n\n    if add_zero_attn:\n        src_len += 1\n        k = torch.cat([k, torch.zeros((k.size(0), 1) + k.size()[2:], dtype=k.dtype, device=k.device)], dim=1)\n        v = torch.cat([v, torch.zeros((v.size(0), 1) + v.size()[2:], dtype=v.dtype, device=v.device)], dim=1)\n        if attn_mask is not None:\n            attn_mask = pad(attn_mask, (0, 1))\n        if key_padding_mask is not None:\n            key_padding_mask = pad(key_padding_mask, (0, 1))\n\n    attn_output_weights = torch.bmm(q, k.transpose(1, 2))\n    assert list(attn_output_weights.size()) == [bsz * num_heads, tgt_len, src_len]\n\n    if attn_mask is not None:\n        if attn_mask.dtype == torch.bool:\n            attn_output_weights.masked_fill_(attn_mask, float(\"-inf\"))\n        else:\n            attn_output_weights += attn_mask\n\n    if key_padding_mask is not None:\n        attn_output_weights = attn_output_weights.view(bsz, num_heads, tgt_len, src_len)\n        attn_output_weights = attn_output_weights.masked_fill(\n            key_padding_mask.unsqueeze(1),\n            float(\"-inf\"),\n        )\n        attn_output_weights = attn_output_weights.view(bsz * num_heads, tgt_len, src_len)\n\n    attn_output_weights = softmax(attn_output_weights, dim=-1)\n    attn_output_weights = dropout(attn_output_weights, p=dropout_p, training=training)\n\n    attn_output = torch.bmm(attn_output_weights, v)\n    assert list(attn_output.size()) == [bsz * num_heads, tgt_len, head_dim]\n    attn_output = attn_output.transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim)\n    attn_output = linear(attn_output, out_proj_weight, out_proj_bias)\n\n    if need_weights:\n        # average attention weights over heads\n        attn_output_weights = attn_output_weights.view(bsz, num_heads, tgt_len, src_len)\n        return attn_output, attn_output_weights.sum(dim=1) / num_heads\n    else:\n        return attn_output, None\n\n\n# This class exists solely for Transformer; it has an annotation stating\n# that bias is never None, which 
appeases TorchScript\nclass _LinearWithBias(nn.Linear):\n    bias: Tensor  # type: ignore\n\n    def __init__(self, in_features: int, out_features: int) -> None:\n        super().__init__(in_features, out_features, bias=True)  # type: ignore\n\n\nclass MultiheadAttention(nn.Module):\n    r\"\"\"Allows the model to jointly attend to information\n    from different representation subspaces.\n    See `Attention Is All You Need <https://arxiv.org/abs/1706.03762>`_\n\n    .. math::\n        \\text{MultiHead}(Q, K, V) = \\text{Concat}(head_1,\\dots,head_h)W^O\n\n    where :math:`head_i = \\text{Attention}(QW_i^Q, KW_i^K, VW_i^V)`.\n\n    Args:\n        embed_dim: total dimension of the model.\n        num_heads: parallel attention heads.\n        dropout: a Dropout layer on attn_output_weights. Default: 0.0.\n        bias: add bias as module parameter. Default: True.\n        add_bias_kv: add bias to the key and value sequences at dim=0.\n        add_zero_attn: add a new batch of zeros to the key and\n                       value sequences at dim=1.\n        kdim: total number of features in key. Default: None.\n        vdim: total number of features in value. Default: None.\n\n    Note that if :attr:`kdim` and :attr:`vdim` are None, they will be set\n    to :attr:`embed_dim` such that query, key, and value have the same\n    number of features.\n\n    Examples::\n\n        >>> multihead_attn = nn.MultiheadAttention(embed_dim, num_heads)\n        >>> attn_output, attn_output_weights = multihead_attn(query, key, value)\n    \"\"\"\n    bias_k: Optional[torch.Tensor]\n    bias_v: Optional[torch.Tensor]\n\n    def __init__(self, embed_dim, num_heads, dropout=0., bias=True, add_bias_kv=False, add_zero_attn=False, kdim=None, vdim=None):\n        super(MultiheadAttention, self).__init__()\n        self.embed_dim = embed_dim\n        self.kdim = kdim if kdim is not None else embed_dim\n        self.vdim = vdim if vdim is not None else embed_dim\n        self._qkv_same_embed_dim = self.kdim == embed_dim and self.vdim == embed_dim\n\n        self.num_heads = num_heads\n        self.dropout = dropout\n        self.head_dim = embed_dim // num_heads\n        assert self.head_dim * num_heads == self.embed_dim, \"embed_dim must be divisible by num_heads\"\n\n        if self._qkv_same_embed_dim is False:\n            self.q_proj_weight = Parameter(torch.Tensor(embed_dim, embed_dim))\n            self.k_proj_weight = Parameter(torch.Tensor(embed_dim, self.kdim))\n            self.v_proj_weight = Parameter(torch.Tensor(embed_dim, self.vdim))\n            self.register_parameter('in_proj_weight', None)\n        else:\n            self.in_proj_weight = Parameter(torch.empty(3 * embed_dim, embed_dim))\n            self.register_parameter('q_proj_weight', None)\n            self.register_parameter('k_proj_weight', None)\n            self.register_parameter('v_proj_weight', None)\n\n        if bias:\n            self.in_proj_bias = Parameter(torch.empty(3 * embed_dim))\n        else:\n            self.register_parameter('in_proj_bias', None)\n        self.out_proj = _LinearWithBias(embed_dim, embed_dim)\n\n        if add_bias_kv:\n            self.bias_k = Parameter(torch.empty(1, 1, embed_dim))\n            self.bias_v = Parameter(torch.empty(1, 1, embed_dim))\n        else:\n            self.bias_k = self.bias_v = None\n\n        self.add_zero_attn = add_zero_attn\n\n        self._reset_parameters()\n\n    def _reset_parameters(self):\n        if self._qkv_same_embed_dim:\n            
xavier_uniform_(self.in_proj_weight)\n        else:\n            xavier_uniform_(self.q_proj_weight)\n            xavier_uniform_(self.k_proj_weight)\n            xavier_uniform_(self.v_proj_weight)\n\n        if self.in_proj_bias is not None:\n            constant_(self.in_proj_bias, 0.)\n            constant_(self.out_proj.bias, 0.)\n        if self.bias_k is not None:\n            xavier_normal_(self.bias_k)\n        if self.bias_v is not None:\n            xavier_normal_(self.bias_v)\n\n    def __setstate__(self, state):\n        # Support loading old MultiheadAttention checkpoints generated by v1.1.0\n        if '_qkv_same_embed_dim' not in state:\n            state['_qkv_same_embed_dim'] = True\n\n        super(MultiheadAttention, self).__setstate__(state)\n\n    def forward(self, query: Tensor, key: Tensor, value: Tensor, key_padding_mask: Optional[Tensor] = None,\n                need_weights: bool = True, attn_mask: Optional[Tensor] = None) -> Tuple[Tensor, Optional[Tensor]]:\n        r\"\"\"\n    Args:\n        query, key, value: map a query and a set of key-value pairs to an output.\n            See \"Attention Is All You Need\" for more details.\n        key_padding_mask: if provided, specified padding elements in the key will\n            be ignored by the attention. When given a binary mask and a value is True,\n            the corresponding value on the attention layer will be ignored. When given\n            a byte mask and a value is non-zero, the corresponding value on the attention\n            layer will be ignored\n        need_weights: output attn_output_weights.\n        attn_mask: 2D or 3D mask that prevents attention to certain positions. A 2D mask will be broadcasted for all\n            the batches while a 3D mask allows to specify a different mask for the entries of each batch.\n\n    Shapes for inputs:\n        - query: :math:`(L, N, E)` where L is the target sequence length, N is the batch size, E is\n          the embedding dimension.\n        - key: :math:`(S, N, E)`, where S is the source sequence length, N is the batch size, E is\n          the embedding dimension.\n        - value: :math:`(S, N, E)` where S is the source sequence length, N is the batch size, E is\n          the embedding dimension.\n        - key_padding_mask: :math:`(N, S)` where N is the batch size, S is the source sequence length.\n          If a ByteTensor is provided, the non-zero positions will be ignored while the position\n          with the zero positions will be unchanged. If a BoolTensor is provided, the positions with the\n          value of ``True`` will be ignored while the position with the value of ``False`` will be unchanged.\n        - attn_mask: if a 2D mask: :math:`(L, S)` where L is the target sequence length, S is the\n          source sequence length.\n\n          If a 3D mask: :math:`(N\\cdot\\text{num\\_heads}, L, S)` where N is the batch size, L is the target sequence\n          length, S is the source sequence length. ``attn_mask`` ensure that position i is allowed to attend\n          the unmasked positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend\n          while the zero positions will be unchanged. If a BoolTensor is provided, positions with ``True``\n          is not allowed to attend while ``False`` values will be unchanged. 
If a FloatTensor\n          is provided, it will be added to the attention weight.\n\n    Shapes for outputs:\n        - attn_output: :math:`(L, N, E)` where L is the target sequence length, N is the batch size,\n          E is the embedding dimension.\n        - attn_output_weights: :math:`(N, L, S)` where N is the batch size,\n          L is the target sequence length, S is the source sequence length.\n        \"\"\"\n        if not self._qkv_same_embed_dim:\n            return multi_head_attention_forward(\n                query, key, value, self.embed_dim, self.num_heads,\n                self.in_proj_weight, self.in_proj_bias,\n                self.bias_k, self.bias_v, self.add_zero_attn,\n                self.dropout, self.out_proj.weight, self.out_proj.bias,\n                training=self.training,\n                key_padding_mask=key_padding_mask, need_weights=need_weights,\n                attn_mask=attn_mask, use_separate_proj_weight=True,\n                q_proj_weight=self.q_proj_weight, k_proj_weight=self.k_proj_weight,\n                v_proj_weight=self.v_proj_weight)\n        else:\n            return multi_head_attention_forward(\n                query, key, value, self.embed_dim, self.num_heads,\n                self.in_proj_weight, self.in_proj_bias,\n                self.bias_k, self.bias_v, self.add_zero_attn,\n                self.dropout, self.out_proj.weight, self.out_proj.bias,\n                training=self.training,\n                key_padding_mask=key_padding_mask, need_weights=need_weights,\n                attn_mask=attn_mask)"
  },
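  {
    "path": "thirdparty/GLEE/glee/modules/attention_example.py",
    "content": "# Illustrative sketch, NOT part of the original GLEE release: this file and its\n# location are hypothetical. It runs the local MultiheadAttention\n# re-implementation from attention.py with the documented (L, N, E) layout;\n# it assumes thirdparty/GLEE is on PYTHONPATH so that 'glee' can be imported.\nimport torch\n\nfrom glee.modules.attention import MultiheadAttention\n\ntgt_len, src_len, bsz, embed_dim = 10, 20, 2, 256\nattn = MultiheadAttention(embed_dim=embed_dim, num_heads=8, dropout=0.0)\n\nquery = torch.randn(tgt_len, bsz, embed_dim)\nkey = torch.randn(src_len, bsz, embed_dim)\nvalue = torch.randn(src_len, bsz, embed_dim)\n\n# need_weights defaults to True, so the attention weights (averaged over heads)\n# are returned alongside the attended output\nout, weights = attn(query, key, value)\nprint(out.shape)      # torch.Size([10, 2, 256])\nprint(weights.shape)  # torch.Size([2, 10, 20])\n"
  },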
  {
    "path": "thirdparty/GLEE/glee/modules/point_features.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\nimport torch\nfrom torch.nn import functional as F\n\nfrom detectron2.layers import cat, shapes_to_tensor\nfrom detectron2.structures import BitMasks, Boxes\n\n# from ..layers import cat, shapes_to_tensor\n# from ..structures import BitMasks, Boxes\n\n\"\"\"\nShape shorthand in this module:\n\n    N: minibatch dimension size, i.e. the number of RoIs for instance segmenation or the\n        number of images for semantic segmenation.\n    R: number of ROIs, combined over all images, in the minibatch\n    P: number of points\n\"\"\"\n\n\ndef point_sample(input, point_coords, **kwargs):\n    \"\"\"\n    A wrapper around :function:`torch.nn.functional.grid_sample` to support 3D point_coords tensors.\n    Unlike :function:`torch.nn.functional.grid_sample` it assumes `point_coords` to lie inside\n    [0, 1] x [0, 1] square.\n\n    Args:\n        input (Tensor): A tensor of shape (N, C, H, W) that contains features map on a H x W grid.\n        point_coords (Tensor): A tensor of shape (N, P, 2) or (N, Hgrid, Wgrid, 2) that contains\n        [0, 1] x [0, 1] normalized point coordinates.\n\n    Returns:\n        output (Tensor): A tensor of shape (N, C, P) or (N, C, Hgrid, Wgrid) that contains\n            features for points in `point_coords`. The features are obtained via bilinear\n            interplation from `input` the same way as :function:`torch.nn.functional.grid_sample`.\n    \"\"\"\n    add_dim = False\n    if point_coords.dim() == 3:\n        add_dim = True\n        point_coords = point_coords.unsqueeze(2)\n    output = F.grid_sample(input, 2.0 * point_coords - 1.0, **kwargs)\n    if add_dim:\n        output = output.squeeze(3)\n    return output\n\n\ndef generate_regular_grid_point_coords(R, side_size, device):\n    \"\"\"\n    Generate regular square grid of points in [0, 1] x [0, 1] coordinate space.\n\n    Args:\n        R (int): The number of grids to sample, one for each region.\n        side_size (int): The side size of the regular grid.\n        device (torch.device): Desired device of returned tensor.\n\n    Returns:\n        (Tensor): A tensor of shape (R, side_size^2, 2) that contains coordinates\n            for the regular grids.\n    \"\"\"\n    aff = torch.tensor([[[0.5, 0, 0.5], [0, 0.5, 0.5]]], device=device)\n    r = F.affine_grid(aff, torch.Size((1, 1, side_size, side_size)), align_corners=False)\n    return r.view(1, -1, 2).expand(R, -1, -1)\n\n\ndef get_uncertain_point_coords_with_randomness(\n    coarse_logits, uncertainty_func, num_points, oversample_ratio, importance_sample_ratio\n):\n    \"\"\"\n    Sample points in [0, 1] x [0, 1] coordinate space based on their uncertainty. 
The unceratinties\n        are calculated for each point using 'uncertainty_func' function that takes point's logit\n        prediction as input.\n    See PointRend paper for details.\n\n    Args:\n        coarse_logits (Tensor): A tensor of shape (N, C, Hmask, Wmask) or (N, 1, Hmask, Wmask) for\n            class-specific or class-agnostic prediction.\n        uncertainty_func: A function that takes a Tensor of shape (N, C, P) or (N, 1, P) that\n            contains logit predictions for P points and returns their uncertainties as a Tensor of\n            shape (N, 1, P).\n        num_points (int): The number of points P to sample.\n        oversample_ratio (int): Oversampling parameter.\n        importance_sample_ratio (float): Ratio of points that are sampled via importnace sampling.\n\n    Returns:\n        point_coords (Tensor): A tensor of shape (N, P, 2) that contains the coordinates of P\n            sampled points.\n    \"\"\"\n    assert oversample_ratio >= 1\n    assert importance_sample_ratio <= 1 and importance_sample_ratio >= 0\n    num_boxes = coarse_logits.shape[0]\n    num_sampled = int(num_points * oversample_ratio)\n    point_coords = torch.rand(num_boxes, num_sampled, 2, device=coarse_logits.device, dtype=coarse_logits.dtype)\n    point_logits = point_sample(coarse_logits, point_coords, align_corners=False)\n    # It is crucial to calculate uncertainty based on the sampled prediction value for the points.\n    # Calculating uncertainties of the coarse predictions first and sampling them for points leads\n    # to incorrect results.\n    # To illustrate this: assume uncertainty_func(logits)=-abs(logits), a sampled point between\n    # two coarse predictions with -1 and 1 logits has 0 logits, and therefore 0 uncertainty value.\n    # However, if we calculate uncertainties for the coarse predictions first,\n    # both will have -1 uncertainty, and the sampled point will get -1 uncertainty.\n    point_uncertainties = uncertainty_func(point_logits)\n    num_uncertain_points = int(importance_sample_ratio * num_points)\n    num_random_points = num_points - num_uncertain_points\n    idx = torch.topk(point_uncertainties[:, 0, :], k=num_uncertain_points, dim=1)[1]\n    shift = num_sampled * torch.arange(num_boxes, dtype=torch.long, device=coarse_logits.device)\n    idx += shift[:, None]\n    point_coords = point_coords.view(-1, 2)[idx.view(-1), :].view(\n        num_boxes, num_uncertain_points, 2\n    )\n    if num_random_points > 0:\n        point_coords = cat(\n            [\n                point_coords,\n                torch.rand(num_boxes, num_random_points, 2, device=coarse_logits.device),\n            ],\n            dim=1,\n        )\n    return point_coords\n\n\ndef get_uncertain_point_coords_on_grid(uncertainty_map, num_points):\n    \"\"\"\n    Find `num_points` most uncertain points from `uncertainty_map` grid.\n\n    Args:\n        uncertainty_map (Tensor): A tensor of shape (N, 1, H, W) that contains uncertainty\n            values for a set of points on a regular H x W grid.\n        num_points (int): The number of points P to select.\n\n    Returns:\n        point_indices (Tensor): A tensor of shape (N, P) that contains indices from\n            [0, H x W) of the most uncertain points.\n        point_coords (Tensor): A tensor of shape (N, P, 2) that contains [0, 1] x [0, 1] normalized\n            coordinates of the most uncertain points from the H x W grid.\n    \"\"\"\n    R, _, H, W = uncertainty_map.shape\n    h_step = 1.0 / float(H)\n    w_step = 1.0 / 
float(W)\n\n    num_points = min(H * W, num_points)\n    point_indices = torch.topk(uncertainty_map.view(R, H * W), k=num_points, dim=1)[1]\n    point_coords = torch.zeros(R, num_points, 2, dtype=torch.float, device=uncertainty_map.device)\n    point_coords[:, :, 0] = w_step / 2.0 + (point_indices % W).to(torch.float) * w_step\n    point_coords[:, :, 1] = h_step / 2.0 + (point_indices // W).to(torch.float) * h_step\n    return point_indices, point_coords\n\n\ndef point_sample_fine_grained_features(features_list, feature_scales, boxes, point_coords):\n    \"\"\"\n    Get features from feature maps in `features_list` that correspond to specific point coordinates\n        inside each bounding box from `boxes`.\n\n    Args:\n        features_list (list[Tensor]): A list of feature map tensors to get features from.\n        feature_scales (list[float]): A list of scales for tensors in `features_list`.\n        boxes (list[Boxes]): A list of I Boxes  objects that contain R_1 + ... + R_I = R boxes all\n            together.\n        point_coords (Tensor): A tensor of shape (R, P, 2) that contains\n            [0, 1] x [0, 1] box-normalized coordinates of the P sampled points.\n\n    Returns:\n        point_features (Tensor): A tensor of shape (R, C, P) that contains features sampled\n            from all features maps in feature_list for P sampled points for all R boxes in `boxes`.\n        point_coords_wrt_image (Tensor): A tensor of shape (R, P, 2) that contains image-level\n            coordinates of P points.\n    \"\"\"\n    cat_boxes = Boxes.cat(boxes)\n    num_boxes = [b.tensor.size(0) for b in boxes]\n\n    point_coords_wrt_image = get_point_coords_wrt_image(cat_boxes.tensor, point_coords)\n    split_point_coords_wrt_image = torch.split(point_coords_wrt_image, num_boxes)\n\n    point_features = []\n    for idx_img, point_coords_wrt_image_per_image in enumerate(split_point_coords_wrt_image):\n        point_features_per_image = []\n        for idx_feature, feature_map in enumerate(features_list):\n            h, w = feature_map.shape[-2:]\n            scale = shapes_to_tensor([w, h]) / feature_scales[idx_feature]\n            point_coords_scaled = point_coords_wrt_image_per_image / scale.to(feature_map.device)\n            point_features_per_image.append(\n                point_sample(\n                    feature_map[idx_img].unsqueeze(0),\n                    point_coords_scaled.unsqueeze(0),\n                    align_corners=False,\n                )\n                .squeeze(0)\n                .transpose(1, 0)\n            )\n        point_features.append(cat(point_features_per_image, dim=1))\n\n    return cat(point_features, dim=0), point_coords_wrt_image\n\n\ndef get_point_coords_wrt_image(boxes_coords, point_coords):\n    \"\"\"\n    Convert box-normalized [0, 1] x [0, 1] point cooordinates to image-level coordinates.\n\n    Args:\n        boxes_coords (Tensor): A tensor of shape (R, 4) that contains bounding boxes.\n            coordinates.\n        point_coords (Tensor): A tensor of shape (R, P, 2) that contains\n            [0, 1] x [0, 1] box-normalized coordinates of the P sampled points.\n\n    Returns:\n        point_coords_wrt_image (Tensor): A tensor of shape (R, P, 2) that contains\n            image-normalized coordinates of P sampled points.\n    \"\"\"\n    with torch.no_grad():\n        point_coords_wrt_image = point_coords.clone()\n        point_coords_wrt_image[:, :, 0] = point_coords_wrt_image[:, :, 0] * (\n            boxes_coords[:, None, 2] - boxes_coords[:, 
None, 0]\n        )\n        point_coords_wrt_image[:, :, 1] = point_coords_wrt_image[:, :, 1] * (\n            boxes_coords[:, None, 3] - boxes_coords[:, None, 1]\n        )\n        point_coords_wrt_image[:, :, 0] += boxes_coords[:, None, 0]\n        point_coords_wrt_image[:, :, 1] += boxes_coords[:, None, 1]\n    return point_coords_wrt_image\n\n\ndef sample_point_labels(instances, point_coords):\n    \"\"\"\n    Sample point labels from ground truth mask given point_coords.\n\n    Args:\n        instances (list[Instances]): A list of N Instances, where N is the number of images\n            in the batch. So, i_th elememt of the list contains R_i objects and R_1 + ... + R_N is\n            equal to R. The ground-truth gt_masks in each instance will be used to compute labels.\n        points_coords (Tensor): A tensor of shape (R, P, 2), where R is the total number of\n            instances and P is the number of points for each instance. The coordinates are in\n            the absolute image pixel coordinate space, i.e. [0, H] x [0, W].\n\n    Returns:\n        Tensor: A tensor of shape (R, P) that contains the labels of P sampled points.\n    \"\"\"\n    with torch.no_grad():\n        gt_mask_logits = []\n        point_coords_splits = torch.split(\n            point_coords, [len(instances_per_image) for instances_per_image in instances]\n        )\n        for i, instances_per_image in enumerate(instances):\n            if len(instances_per_image) == 0:\n                continue\n            assert isinstance(\n                instances_per_image.gt_masks, BitMasks\n            ), \"Point head works with GT in 'bitmask' format. Set INPUT.MASK_FORMAT to 'bitmask'.\"\n\n            gt_bit_masks = instances_per_image.gt_masks.tensor\n            h, w = instances_per_image.gt_masks.image_size\n            scale = torch.tensor([w, h], dtype=torch.float, device=gt_bit_masks.device)\n            points_coord_grid_sample_format = point_coords_splits[i] / scale\n            gt_mask_logits.append(\n                point_sample(\n                    gt_bit_masks.to(torch.float32).unsqueeze(1),\n                    points_coord_grid_sample_format,\n                    align_corners=False,\n                ).squeeze(1)\n            )\n\n    point_labels = cat(gt_mask_logits)\n    return point_labels\n"
  },
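  {
    "path": "thirdparty/GLEE/glee/modules/point_features_example.py",
    "content": "# Illustrative sketch, NOT part of the original GLEE release: this file and its\n# location are hypothetical. It calls point_sample and\n# get_uncertain_point_coords_with_randomness from point_features.py on a dummy\n# class-agnostic mask logit map, using uncertainty = -|logit| as in PointRend.\n# Assumes thirdparty/GLEE is on PYTHONPATH and detectron2 is installed\n# (point_features.py imports detectron2 at module level).\nimport torch\n\nfrom glee.modules.point_features import (\n    get_uncertain_point_coords_with_randomness,\n    point_sample,\n)\n\ncoarse_logits = torch.randn(4, 1, 28, 28)  # (N, 1, Hmask, Wmask)\n\npoint_coords = get_uncertain_point_coords_with_randomness(\n    coarse_logits,\n    uncertainty_func=lambda logits: -torch.abs(logits),  # least confident near the 0-logit boundary\n    num_points=112,\n    oversample_ratio=3,\n    importance_sample_ratio=0.75,\n)\nprint(point_coords.shape)  # torch.Size([4, 112, 2]), normalized to [0, 1] x [0, 1]\n\npoint_logits = point_sample(coarse_logits, point_coords, align_corners=False)\nprint(point_logits.shape)  # torch.Size([4, 1, 112])\n"
  },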
  {
    "path": "thirdparty/GLEE/glee/modules/position_encoding.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n## Modified by Bowen Cheng from: https://github.com/facebookresearch/detr/blob/master/models/position_encoding.py\n\"\"\"\nVarious positional encodings for the transformer.\n\"\"\"\nimport math\n\nimport torch\nfrom torch import nn\n\n\nclass PositionEmbeddingSine(nn.Module):\n    \"\"\"\n    This is a more standard version of the position embedding, very similar to the one\n    used by the Attention is all you need paper, generalized to work on images.\n    \"\"\"\n\n    def __init__(self, num_pos_feats=64, temperature=10000, normalize=False, scale=None):\n        super().__init__()\n        self.num_pos_feats = num_pos_feats\n        self.temperature = temperature\n        self.normalize = normalize\n        if scale is not None and normalize is False:\n            raise ValueError(\"normalize should be True if scale is passed\")\n        if scale is None:\n            scale = 2 * math.pi\n        self.scale = scale\n\n    def forward(self, x, mask=None):\n        if mask is None:\n            mask = torch.zeros((x.size(0), x.size(2), x.size(3)), device=x.device, dtype=torch.bool)\n        not_mask = ~mask\n        y_embed = not_mask.cumsum(1, dtype=x.dtype)\n        x_embed = not_mask.cumsum(2, dtype=x.dtype)\n        if self.normalize:\n            eps = 1e-6\n            y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale\n            x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale\n\n        dim_t = torch.arange(self.num_pos_feats, dtype=x.dtype, device=x.device)\n        dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)\n\n        pos_x = x_embed[:, :, :, None] / dim_t\n        pos_y = y_embed[:, :, :, None] / dim_t\n        pos_x = torch.stack(\n            (pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4\n        ).flatten(3)\n        pos_y = torch.stack(\n            (pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4\n        ).flatten(3)\n        pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)\n        return pos\n    \n    def __repr__(self, _repr_indent=4):\n        head = \"Positional encoding \" + self.__class__.__name__\n        body = [\n            \"num_pos_feats: {}\".format(self.num_pos_feats),\n            \"temperature: {}\".format(self.temperature),\n            \"normalize: {}\".format(self.normalize),\n            \"scale: {}\".format(self.scale),\n        ]\n        # _repr_indent = 4\n        lines = [head] + [\" \" * _repr_indent + line for line in body]\n        return \"\\n\".join(lines)\n"
  },
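  {
    "path": "thirdparty/GLEE/glee/modules/position_encoding_example.py",
    "content": "# Illustrative sketch, NOT part of the original GLEE release: this file and its\n# location are hypothetical. It builds the sine positional encoding from\n# position_encoding.py for a dummy feature map; with num_pos_feats=128 the\n# output concatenates 128 y-channels and 128 x-channels into a 256-dim encoding.\n# Assumes thirdparty/GLEE is on PYTHONPATH so that 'glee' can be imported.\nimport torch\n\nfrom glee.modules.position_encoding import PositionEmbeddingSine\n\npos_enc = PositionEmbeddingSine(num_pos_feats=128, normalize=True)\n\nfeatures = torch.randn(2, 256, 32, 48)           # (N, C, H, W) backbone features\nmask = torch.zeros(2, 32, 48, dtype=torch.bool)  # False = valid pixel, True = padding\n\npos = pos_enc(features, mask)\nprint(pos.shape)  # torch.Size([2, 256, 32, 48])\nprint(pos_enc)    # __repr__ lists num_pos_feats, temperature, normalize and scale\n"
  },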
  {
    "path": "thirdparty/GLEE/glee/modules/postprocessing.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\nimport torch\nfrom torch.nn import functional as F\n\nfrom detectron2.structures import Instances, ROIMasks\n\n\n# perhaps should rename to \"resize_instance\"\ndef detector_postprocess(\n    results: Instances, output_height: int, output_width: int, mask_threshold: float = 0.5\n):\n    \"\"\"\n    Resize the output instances.\n    The input images are often resized when entering an object detector.\n    As a result, we often need the outputs of the detector in a different\n    resolution from its inputs.\n\n    This function will resize the raw outputs of an R-CNN detector\n    to produce outputs according to the desired output resolution.\n\n    Args:\n        results (Instances): the raw outputs from the detector.\n            `results.image_size` contains the input image resolution the detector sees.\n            This object might be modified in-place.\n        output_height, output_width: the desired output resolution.\n\n    Returns:\n        Instances: the resized output from the model, based on the output resolution\n    \"\"\"\n    if isinstance(output_width, torch.Tensor):\n        # This shape might (but not necessarily) be tensors during tracing.\n        # Converts integer tensors to float temporaries to ensure true\n        # division is performed when computing scale_x and scale_y.\n        output_width_tmp = output_width.float()\n        output_height_tmp = output_height.float()\n        new_size = torch.stack([output_height, output_width])\n    else:\n        new_size = (output_height, output_width)\n        output_width_tmp = output_width\n        output_height_tmp = output_height\n\n    scale_x, scale_y = (\n        output_width_tmp / results.image_size[1],\n        output_height_tmp / results.image_size[0],\n    )\n    results = Instances(new_size, **results.get_fields())\n\n    if results.has(\"pred_boxes\"):\n        output_boxes = results.pred_boxes\n    elif results.has(\"proposal_boxes\"):\n        output_boxes = results.proposal_boxes\n    else:\n        output_boxes = None\n    assert output_boxes is not None, \"Predictions must contain boxes!\"\n\n    output_boxes.scale(scale_x, scale_y)\n    output_boxes.clip(results.image_size)\n\n    results = results[output_boxes.nonempty()]\n\n    if results.has(\"pred_masks\"):\n        if isinstance(results.pred_masks, ROIMasks):\n            roi_masks = results.pred_masks\n        else:\n            # pred_masks is a tensor of shape (N, 1, M, M)\n            roi_masks = ROIMasks(results.pred_masks[:, 0, :, :])\n        results.pred_masks = roi_masks.to_bitmasks(\n            results.pred_boxes, output_height, output_width, mask_threshold\n        ).tensor  # TODO return ROIMasks/BitMask object in the future\n\n    if results.has(\"pred_keypoints\"):\n        results.pred_keypoints[:, :, 0] *= scale_x\n        results.pred_keypoints[:, :, 1] *= scale_y\n\n    return results\n\ndef bbox_postprocess(result, input_size, img_size, output_height, output_width):\n    \"\"\"\n    result: [xc,yc,w,h] range [0,1] to [x1,y1,x2,y2] range [0,w], [0,h]\n    \"\"\"\n    if result is None:\n        return None\n    \n    scale = torch.tensor([input_size[1], input_size[0], input_size[1], input_size[0]])[None,:].to(result.device)\n    result = result.sigmoid() * scale\n    x1,y1,x2,y2 = result[:,0] - result[:,2]/2, result[:,1] - result[:,3]/2, result[:,0] + result[:,2]/2, result[:,1] + result[:,3]/2\n    h,w = img_size\n\n    x1 = x1.clamp(min=0, max=w)\n    y1 = 
y1.clamp(min=0, max=h)\n    x2 = x2.clamp(min=0, max=w)\n    y2 = y2.clamp(min=0, max=h)\n\n    box = torch.stack([x1,y1,x2,y2]).permute(1,0)\n    scale = torch.tensor([output_width/w, output_height/h, output_width/w, output_height/h])[None,:].to(result.device)\n    box = box*scale\n    return box\n\ndef sem_seg_postprocess(result, img_size, output_height, output_width):\n    \"\"\"\n    Return semantic segmentation predictions in the original resolution.\n\n    The input images are often resized when entering the semantic segmentor. Moreover, in some\n    cases, they are also padded inside the segmentor to be divisible by the maximum network stride.\n    As a result, we often need the predictions of the segmentor in a different\n    resolution from its inputs.\n\n    Args:\n        result (Tensor): semantic segmentation prediction logits. A tensor of shape (C, H, W),\n            where C is the number of classes, and H, W are the height and width of the prediction.\n        img_size (tuple): image size that the segmentor takes as input.\n        output_height, output_width: the desired output resolution.\n\n    Returns:\n        semantic segmentation prediction (Tensor): A tensor of shape\n            (C, output_height, output_width) that contains per-pixel soft predictions.\n    \"\"\"\n    result = result[:, : img_size[0], : img_size[1]].expand(1, -1, -1, -1)\n    result = F.interpolate(\n        result, size=(output_height, output_width), mode=\"bilinear\", align_corners=False\n    )[0]\n    return result\n"
  },
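  {
    "path": "thirdparty/GLEE/glee/modules/postprocessing_example.py",
    "content": "# Illustrative sketch, NOT part of the original GLEE release: this file and its\n# location are hypothetical. It uses sem_seg_postprocess from postprocessing.py\n# to crop away padding and resize semantic-segmentation logits back to the\n# original image resolution. Assumes thirdparty/GLEE is on PYTHONPATH and\n# detectron2 is installed (postprocessing.py imports detectron2 at module level).\nimport torch\n\nfrom glee.modules.postprocessing import sem_seg_postprocess\n\nnum_classes = 19\npadded_logits = torch.randn(num_classes, 512, 512)  # (C, H, W) logits on the padded input\n\n# the segmentor saw a 480x512 image inside a 512x512 padded canvas,\n# while the original image was 960x1024\nresult = sem_seg_postprocess(\n    padded_logits, img_size=(480, 512), output_height=960, output_width=1024\n)\nprint(result.shape)  # torch.Size([19, 960, 1024])\n"
  },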
  {
    "path": "thirdparty/GLEE/glee/utils/__init__.py",
    "content": "from .config import *\nfrom .misc import *\nfrom .box_ops import *\nfrom .it_contrastive import *\n"
  },
  {
    "path": "thirdparty/GLEE/glee/utils/box_ops.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved\n\"\"\"\nUtilities for bounding box manipulation and GIoU.\n\"\"\"\nimport torch\nfrom torchvision.ops.boxes import box_area\n\n\ndef box_cxcywh_to_xyxy(x):\n    x_c, y_c, w, h = x.unbind(-1)\n    b = [(x_c - 0.5 * w), (y_c - 0.5 * h),\n         (x_c + 0.5 * w), (y_c + 0.5 * h)]\n    return torch.stack(b, dim=-1)\n\n\ndef box_xyxy_to_cxcywh(x):\n    x0, y0, x1, y1 = x.unbind(-1)\n    b = [(x0 + x1) / 2, (y0 + y1) / 2,\n         (x1 - x0), (y1 - y0)]\n    return torch.stack(b, dim=-1)\n\ndef box_xywh_to_xyxy(x):\n    x0, y0, x1, y1 = x.unbind(-1)\n    b = [x0, y0, (x0 + x1), (y0 + y1)]\n    return torch.stack(b, dim=-1)\n\n\n# modified from torchvision to also return the union\ndef box_iou(boxes1, boxes2):\n    area1 = box_area(boxes1)\n    area2 = box_area(boxes2)\n\n    lt = torch.max(boxes1[:, None, :2], boxes2[:, :2])  # [N,M,2]\n    rb = torch.min(boxes1[:, None, 2:], boxes2[:, 2:])  # [N,M,2]\n\n    wh = (rb - lt).clamp(min=0)  # [N,M,2]\n    inter = wh[:, :, 0] * wh[:, :, 1]  # [N,M]\n\n    union = area1[:, None] + area2 - inter\n\n    iou = inter / union\n    return iou, union\n\n\ndef generalized_box_iou(boxes1, boxes2):\n    \"\"\"\n    Generalized IoU from https://giou.stanford.edu/\n\n    The boxes should be in [x0, y0, x1, y1] format\n\n    Returns a [N, M] pairwise matrix, where N = len(boxes1)\n    and M = len(boxes2)\n    \"\"\"\n    # degenerate boxes gives inf / nan results\n    # so do an early check\n    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()\n    assert (boxes2[:, 2:] >= boxes2[:, :2]).all()\n    iou, union = box_iou(boxes1, boxes2)\n\n    lt = torch.min(boxes1[:, None, :2], boxes2[:, :2])\n    rb = torch.max(boxes1[:, None, 2:], boxes2[:, 2:])\n\n    wh = (rb - lt).clamp(min=0)  # [N,M,2]\n    area = wh[:, :, 0] * wh[:, :, 1]\n\n    return iou - (area - union) / area\n\n\ndef masks_to_boxes(masks):\n    \"\"\"Compute the bounding boxes around the provided masks\n\n    The masks should be in format [N, H, W] where N is the number of masks, (H, W) are the spatial dimensions.\n\n    Returns a [N, 4] tensors, with the boxes in xyxy format\n    \"\"\"\n    if masks.numel() == 0:\n        return torch.zeros((0, 4), device=masks.device)\n\n    h, w = masks.shape[-2:]\n\n    y = torch.arange(0, h, dtype=torch.float)\n    x = torch.arange(0, w, dtype=torch.float)\n    y, x = torch.meshgrid(y, x)\n\n    x_mask = (masks * x.unsqueeze(0))\n    x_max = x_mask.flatten(1).max(-1)[0]\n    x_min = x_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1)[0]\n\n    y_mask = (masks * y.unsqueeze(0))\n    y_max = y_mask.flatten(1).max(-1)[0]\n    y_min = y_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1)[0]\n\n    return torch.stack([x_min, y_min, x_max, y_max], 1)"
  },
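A minimal usage sketch of the box utilities above (not part of the repo). It assumes `thirdparty/GLEE` is on `PYTHONPATH` so that `glee.utils.box_ops` is importable; the box values are illustrative.
```python
# Hypothetical usage of glee.utils.box_ops; import path assumes thirdparty/GLEE is on PYTHONPATH.
import torch
from glee.utils.box_ops import box_cxcywh_to_xyxy, box_iou, generalized_box_iou

# DETR-style normalized boxes in (cx, cy, w, h) format.
pred = torch.tensor([[0.5, 0.5, 0.4, 0.4],
                     [0.2, 0.2, 0.2, 0.2]])
gt = torch.tensor([[0.5, 0.5, 0.5, 0.5]])

pred_xyxy = box_cxcywh_to_xyxy(pred)    # the IoU helpers expect (x0, y0, x1, y1)
gt_xyxy = box_cxcywh_to_xyxy(gt)

iou, union = box_iou(pred_xyxy, gt_xyxy)        # [N, M] pairwise IoU and union areas
giou = generalized_box_iou(pred_xyxy, gt_xyxy)  # [N, M] pairwise GIoU in [-1, 1]
print(iou.shape, giou.shape)  # torch.Size([2, 1]) torch.Size([2, 1])
```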
  {
    "path": "thirdparty/GLEE/glee/utils/config.py",
    "content": "# -*- coding: utf-8 -*-\n# Copyright (c) Facebook, Inc. and its affiliates.\n\nimport functools\nimport inspect\n\ndef configurable(init_func=None, *, from_config=None):\n    \"\"\"\n    Decorate a function or a class's __init__ method so that it can be called\n    with a :class:`CfgNode` object using a :func:`from_config` function that translates\n    :class:`CfgNode` to arguments.\n\n    Examples:\n    ::\n        # Usage 1: Decorator on __init__:\n        class A:\n            @configurable\n            def __init__(self, a, b=2, c=3):\n                pass\n\n            @classmethod\n            def from_config(cls, cfg):   # 'cfg' must be the first argument\n                # Returns kwargs to be passed to __init__\n                return {\"a\": cfg.A, \"b\": cfg.B}\n\n        a1 = A(a=1, b=2)  # regular construction\n        a2 = A(cfg)       # construct with a cfg\n        a3 = A(cfg, b=3, c=4)  # construct with extra overwrite\n\n        # Usage 2: Decorator on any function. Needs an extra from_config argument:\n        @configurable(from_config=lambda cfg: {\"a: cfg.A, \"b\": cfg.B})\n        def a_func(a, b=2, c=3):\n            pass\n\n        a1 = a_func(a=1, b=2)  # regular call\n        a2 = a_func(cfg)       # call with a cfg\n        a3 = a_func(cfg, b=3, c=4)  # call with extra overwrite\n\n    Args:\n        init_func (callable): a class's ``__init__`` method in usage 1. The\n            class must have a ``from_config`` classmethod which takes `cfg` as\n            the first argument.\n        from_config (callable): the from_config function in usage 2. It must take `cfg`\n            as its first argument.\n    \"\"\"\n\n    if init_func is not None:\n        assert (\n            inspect.isfunction(init_func)\n            and from_config is None\n            and init_func.__name__ == \"__init__\"\n        ), \"Incorrect use of @configurable. 
Check API documentation for examples.\"\n\n        @functools.wraps(init_func)\n        def wrapped(self, *args, **kwargs):\n            try:\n                from_config_func = type(self).from_config\n            except AttributeError as e:\n                raise AttributeError(\n                    \"Class with @configurable must have a 'from_config' classmethod.\"\n                ) from e\n            if not inspect.ismethod(from_config_func):\n                raise TypeError(\"Class with @configurable must have a 'from_config' classmethod.\")\n\n            if _called_with_cfg(*args, **kwargs):\n                explicit_args = _get_args_from_config(from_config_func, *args, **kwargs)\n                init_func(self, **explicit_args)\n            else:\n                init_func(self, *args, **kwargs)\n\n        return wrapped\n\n    else:\n        if from_config is None:\n            return configurable  # @configurable() is made equivalent to @configurable\n        assert inspect.isfunction(\n            from_config\n        ), \"from_config argument of configurable must be a function!\"\n\n        def wrapper(orig_func):\n            @functools.wraps(orig_func)\n            def wrapped(*args, **kwargs):\n                if _called_with_cfg(*args, **kwargs):\n                    explicit_args = _get_args_from_config(from_config, *args, **kwargs)\n                    return orig_func(**explicit_args)\n                else:\n                    return orig_func(*args, **kwargs)\n\n            wrapped.from_config = from_config\n            return wrapped\n\n        return wrapper\n\ndef _called_with_cfg(*args, **kwargs):\n    \"\"\"\n    Returns:\n        bool: whether the arguments contain CfgNode and should be considered\n            forwarded to from_config.\n    \"\"\"\n    from omegaconf import DictConfig\n\n    if len(args) and isinstance(args[0], (dict)):\n        return True\n    if isinstance(kwargs.pop(\"cfg\", None), (dict)):\n        return True\n    # `from_config`'s first argument is forced to be \"cfg\".\n    # So the above check covers all cases.\n    return False\n\ndef _get_args_from_config(from_config_func, *args, **kwargs):\n    \"\"\"\n    Use `from_config` to obtain explicit arguments.\n\n    Returns:\n        dict: arguments to be used for cls.__init__\n    \"\"\"\n    signature = inspect.signature(from_config_func)\n    if list(signature.parameters.keys())[0] != \"cfg\":\n        if inspect.isfunction(from_config_func):\n            name = from_config_func.__name__\n        else:\n            name = f\"{from_config_func.__self__}.from_config\"\n        raise TypeError(f\"{name} must take 'cfg' as the first argument!\")\n    support_var_arg = any(\n        param.kind in [param.VAR_POSITIONAL, param.VAR_KEYWORD]\n        for param in signature.parameters.values()\n    )\n    if support_var_arg:  # forward all arguments to from_config, if from_config accepts them\n        ret = from_config_func(*args, **kwargs)\n    else:\n        # forward supported arguments to from_config\n        supported_arg_names = set(signature.parameters.keys())\n        extra_kwargs = {}\n        for name in list(kwargs.keys()):\n            if name not in supported_arg_names:\n                extra_kwargs[name] = kwargs.pop(name)\n        ret = from_config_func(*args, **kwargs)\n        # forward the other arguments to __init__\n        ret.update(extra_kwargs)\n    return ret"
  },
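A hedged sketch of how the `@configurable` decorator above is typically used. The class name and config keys below are invented for illustration; it assumes `thirdparty/GLEE` is on `PYTHONPATH`, and `omegaconf` must be installed because `_called_with_cfg` imports it.
```python
# Illustrative only: 'Head', IN_DIM and OUT_DIM are made-up names, not part of GLEE.
from glee.utils.config import configurable

class Head:
    @configurable
    def __init__(self, in_dim, out_dim=10):
        self.in_dim = in_dim
        self.out_dim = out_dim

    @classmethod
    def from_config(cls, cfg):  # 'cfg' must be the first argument
        # Translate a config object into __init__ kwargs.
        return {"in_dim": cfg["IN_DIM"], "out_dim": cfg["OUT_DIM"]}

h1 = Head(in_dim=256, out_dim=5)           # regular construction
h2 = Head({"IN_DIM": 256, "OUT_DIM": 20})  # construction from a config dict via from_config
print(h2.in_dim, h2.out_dim)               # 256 20
```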
  {
    "path": "thirdparty/GLEE/glee/utils/it_contrastive.py",
    "content": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\ndef is_dist_initialized():\n    return torch.distributed.is_initialized()\n\ndef get_world_size():\n    if is_dist_initialized():\n        return torch.distributed.get_world_size()\n    return 1\n\ndef all_gather_grad(x):\n    if get_world_size() > 1:\n        all_x = [torch.zeros_like(x) for _ in range(get_world_size())]\n        torch.distributed.all_gather(all_x, x)\n        all_x[torch.distributed.get_rank()] = x\n        x = torch.cat(all_x, dim=0)\n    return x\n\n@torch.no_grad()\ndef all_gather_nograd(tensor):\n    # from albef\n    \"\"\"\n    Performs all_gather operation on the provided tensors.\n    *** Warning ***: torch.distributed.all_gather has no gradient.\n    \"\"\"\n    if get_world_size() > 1:\n        tensors_gather = [torch.ones_like(tensor)\n            for _ in range(torch.distributed.get_world_size())]\n        torch.distributed.all_gather(tensors_gather, tensor, async_op=False)\n\n        tensor = torch.cat(tensors_gather, dim=0)\n    return tensor\n\ndef image_text_contrastive_loss(image_feat, text_feat, temperature, image_id=None, text_id=None):\n    # add the following 4 lines\n    image_feat = all_gather_grad(image_feat)\n    text_feat = all_gather_grad(text_feat)\n    \n    logits = torch.matmul(image_feat, text_feat.t())\n    logits /= temperature\n    \n    if image_id is None and text_id is None:\n        gt = torch.arange(logits.shape[0], device=logits.device)\n        loss1 = F.cross_entropy(logits, gt)\n        loss2 = F.cross_entropy(logits.t(), gt)        \n    else:\n        image_id = all_gather_grad(image_id)\n        text_id = all_gather_grad(text_id)\n\n        gt_image = image_id.reshape((-1, 1)) == image_id.reshape((1, -1))\n        gt_text = text_id.reshape((-1, 1)) == text_id.reshape((1, -1))\n        gt = torch.logical_or(gt_image, gt_text)\n\n        loss1 = -torch.sum(gt * F.log_softmax(logits, dim=1)) / gt.sum()\n        loss2 = -torch.sum(gt.t() * F.log_softmax(logits.t(), dim=1)) / gt.sum()\n\n    return (loss1 + loss2) / 2 * get_world_size() # scale it up by the number of GPUs\n"
  },
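Below is a minimal single-process sketch of the contrastive loss above (assuming `thirdparty/GLEE` is on `PYTHONPATH`). Without an initialized process group the all-gather helpers are no-ops, so the call reduces to a symmetric InfoNCE loss over the local batch; batch size, feature dimension, and temperature are illustrative.
```python
# Single-GPU sketch; batch size, feature dim and temperature are illustrative values.
import torch
import torch.nn.functional as F
from glee.utils.it_contrastive import image_text_contrastive_loss

batch, dim = 8, 256
image_feat = F.normalize(torch.randn(batch, dim), dim=-1)  # unit-norm image embeddings
text_feat = F.normalize(torch.randn(batch, dim), dim=-1)   # unit-norm text embeddings

loss = image_text_contrastive_loss(image_feat, text_feat, temperature=0.07)
print(loss.item())  # scalar loss; with world size 1 no distributed scaling is applied
```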
  {
    "path": "thirdparty/GLEE/glee/utils/misc.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/facebookresearch/detr/blob/master/util/misc.py\n# Modified by Xueyan Zou\n\"\"\"\nMisc functions, including distributed helpers.\n\nMostly copy-paste from torchvision references.\n\"\"\"\nfrom typing import List, Optional\n\nimport torch\nimport torch.distributed as dist\nimport torchvision\nfrom torch import Tensor\n\ndef _max_by_axis(the_list):\n    # type: (List[List[int]]) -> List[int]\n    maxes = the_list[0]\n    for sublist in the_list[1:]:\n        for index, item in enumerate(sublist):\n            maxes[index] = max(maxes[index], item)\n    return maxes\n\nclass NestedTensor(object):\n    def __init__(self, tensors, mask: Optional[Tensor]):\n        self.tensors = tensors\n        self.mask = mask\n\n    def to(self, device):\n        # type: (Device) -> NestedTensor # noqa\n        cast_tensor = self.tensors.to(device)\n        mask = self.mask\n        if mask is not None:\n            assert mask is not None\n            cast_mask = mask.to(device)\n        else:\n            cast_mask = None\n        return NestedTensor(cast_tensor, cast_mask)\n\n    def decompose(self):\n        return self.tensors, self.mask\n\n    def __repr__(self):\n        return str(self.tensors)\n\ndef nested_tensor_from_tensor_list(tensor_list: List[Tensor]):\n    # TODO make this more general\n    if tensor_list[0].ndim == 3:\n        if torchvision._is_tracing():\n            # nested_tensor_from_tensor_list() does not export well to ONNX\n            # call _onnx_nested_tensor_from_tensor_list() instead\n            return _onnx_nested_tensor_from_tensor_list(tensor_list)\n\n        # TODO make it support different-sized images\n        max_size = _max_by_axis([list(img.shape) for img in tensor_list])\n        # min_size = tuple(min(s) for s in zip(*[img.shape for img in tensor_list]))\n        batch_shape = [len(tensor_list)] + max_size\n        b, c, h, w = batch_shape\n        dtype = tensor_list[0].dtype\n        device = tensor_list[0].device\n        tensor = torch.zeros(batch_shape, dtype=dtype, device=device)\n        mask = torch.ones((b, h, w), dtype=torch.bool, device=device)\n        for img, pad_img, m in zip(tensor_list, tensor, mask):\n            pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)\n            m[: img.shape[1], : img.shape[2]] = False\n    elif tensor_list[0].ndim == 2:\n        if torchvision._is_tracing():\n            # nested_tensor_from_tensor_list() does not export well to ONNX\n            # call _onnx_nested_tensor_from_tensor_list() instead\n            return _onnx_nested_tensor_from_tensor_list(tensor_list)\n\n        # TODO make it support different-sized images\n        max_size = _max_by_axis([list(txt.shape) for txt in tensor_list])\n        # min_size = tuple(min(s) for s in zip(*[img.shape for img in tensor_list]))\n        batch_shape = [len(tensor_list)] + max_size\n        b, c, l = batch_shape\n        dtype = tensor_list[0].dtype\n        device = tensor_list[0].device\n        tensor = torch.zeros(batch_shape, dtype=dtype, device=device)\n        mask = torch.ones((b, l), dtype=torch.bool, device=device)\n        for txt, pad_txt, m in zip(tensor_list, tensor, mask):\n            pad_txt[: txt.shape[0], : txt.shape[1]] = txt\n            m[: txt.shape[1]] = False\n    else:\n        raise ValueError(\"not supported\")\n    return NestedTensor(tensor, mask)\n\ndef _collate_and_pad_divisibility(tensor_list: list, 
div=32):\n    max_size = []\n    for i in range(tensor_list[0].dim()):\n        max_size_i = torch.max(\n            torch.tensor([img.shape[i] for img in tensor_list]).to(torch.float32)\n        ).to(torch.int64)\n        max_size.append(max_size_i)\n    max_size = tuple(max_size)\n\n    c,h,w = max_size\n    pad_h = (div - h % div) if h % div != 0 else 0\n    pad_w = (div - w % div) if w % div != 0 else 0\n    max_size = (c,h+pad_h,w+pad_w)\n    \n    # work around for\n    # pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)\n    # m[: img.shape[1], :img.shape[2]] = False\n    # which is not yet supported in onnx\n    padded_imgs = []\n    padded_masks = []\n    for img in tensor_list:\n        padding = [(s1 - s2) for s1, s2 in zip(max_size, tuple(img.shape))]\n        padded_img = torch.nn.functional.pad(img, (0, padding[2], 0, padding[1], 0, padding[0]))\n        padded_imgs.append(padded_img)\n\n        m = torch.zeros_like(img[0], dtype=torch.int, device=img.device)\n        padded_mask = torch.nn.functional.pad(m, (0, padding[2], 0, padding[1]), \"constant\", 1)\n        padded_masks.append(padded_mask.to(torch.bool))\n    \n    return padded_imgs\n\n# _onnx_nested_tensor_from_tensor_list() is an implementation of\n# nested_tensor_from_tensor_list() that is supported by ONNX tracing.\n@torch.jit.unused\ndef _onnx_nested_tensor_from_tensor_list(tensor_list: List[Tensor]) -> NestedTensor:\n    max_size = []\n    for i in range(tensor_list[0].dim()):\n        max_size_i = torch.max(\n            torch.stack([img.shape[i] for img in tensor_list]).to(torch.float32)\n        ).to(torch.int64)\n        max_size.append(max_size_i)\n    max_size = tuple(max_size)\n\n    # work around for\n    # pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)\n    # m[: img.shape[1], :img.shape[2]] = False\n    # which is not yet supported in onnx\n    padded_imgs = []\n    padded_masks = []\n    for img in tensor_list:\n        padding = [(s1 - s2) for s1, s2 in zip(max_size, tuple(img.shape))]\n        padded_img = torch.nn.functional.pad(img, (0, padding[2], 0, padding[1], 0, padding[0]))\n        padded_imgs.append(padded_img)\n\n        m = torch.zeros_like(img[0], dtype=torch.int, device=img.device)\n        padded_mask = torch.nn.functional.pad(m, (0, padding[2], 0, padding[1]), \"constant\", 1)\n        padded_masks.append(padded_mask.to(torch.bool))\n\n    tensor = torch.stack(padded_imgs)\n    mask = torch.stack(padded_masks)\n\n    return NestedTensor(tensor, mask=mask)\n\ndef is_dist_avail_and_initialized():\n    if not dist.is_available():\n        return False\n    if not dist.is_initialized():\n        return False\n    return True\n\ndef get_iou(gt_masks, pred_masks, ignore_label=-1):\n    rev_ignore_mask = ~(gt_masks == ignore_label)\n    gt_masks = gt_masks.bool()\n    n,h,w = gt_masks.shape\n    intersection = ((gt_masks & pred_masks) & rev_ignore_mask).reshape(n,h*w).sum(dim=-1)\n    union = ((gt_masks | pred_masks) & rev_ignore_mask).reshape(n,h*w).sum(dim=-1)\n    ious = (intersection / union)\n    return ious"
  },
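A short sketch of batching variable-sized images with the helpers above (assuming `thirdparty/GLEE` is on `PYTHONPATH`). Images are zero-padded to the largest height and width in the list, and the boolean mask marks the padded pixels; the image sizes are illustrative.
```python
# Illustrative image sizes; the mask is True exactly where a pixel is padding.
import torch
from glee.utils.misc import nested_tensor_from_tensor_list

imgs = [torch.randn(3, 480, 640), torch.randn(3, 512, 512)]
batch = nested_tensor_from_tensor_list(imgs)
tensors, mask = batch.decompose()
print(tensors.shape)  # torch.Size([2, 3, 512, 640]) -- padded to the per-axis maxima
print(mask.shape)     # torch.Size([2, 512, 640])
```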
  {
    "path": "thirdparty/GLEE/glee/utils/utils.py",
    "content": "import torch\nimport copy\nfrom torch import nn, Tensor\nimport os\n\nimport math\nimport torch.nn.functional as F\nfrom torch import nn\n\n\nclass MLP(nn.Module):\n    \"\"\" Very simple multi-layer perceptron (also called FFN)\"\"\"\n\n    def __init__(self, input_dim, hidden_dim, output_dim, num_layers):\n        super().__init__()\n        self.num_layers = num_layers\n        h = [hidden_dim] * (num_layers - 1)\n        self.layers = nn.ModuleList(nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim]))\n\n    def forward(self, x):\n        for i, layer in enumerate(self.layers):\n            x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x)\n        return x\n\n\ndef inverse_sigmoid(x, eps=1e-5):\n    x = x.clamp(min=0, max=1)\n    x1 = x.clamp(min=eps)\n    x2 = (1 - x).clamp(min=eps)\n    return torch.log(x1/x2)\n\n\ndef gen_encoder_output_proposals(memory:Tensor, memory_padding_mask:Tensor, spatial_shapes:Tensor):\n    \"\"\"\n    Input:\n        - memory: bs, \\sum{hw}, d_model\n        - memory_padding_mask: bs, \\sum{hw}\n        - spatial_shapes: nlevel, 2\n    Output:\n        - output_memory: bs, \\sum{hw}, d_model\n        - output_proposals: bs, \\sum{hw}, 4\n    \"\"\"\n    N_, S_, C_ = memory.shape\n    base_scale = 4.0\n    proposals = []\n    _cur = 0\n    for lvl, (H_, W_) in enumerate(spatial_shapes):\n        mask_flatten_ = memory_padding_mask[:, _cur:(_cur + H_ * W_)].view(N_, H_, W_, 1)\n        valid_H = torch.sum(~mask_flatten_[:, :, 0, 0], 1)\n        valid_W = torch.sum(~mask_flatten_[:, 0, :, 0], 1)\n\n        grid_y, grid_x = torch.meshgrid(torch.linspace(0, H_ - 1, H_, dtype=torch.float32, device=memory.device),\n                                        torch.linspace(0, W_ - 1, W_, dtype=torch.float32, device=memory.device))\n        grid = torch.cat([grid_x.unsqueeze(-1), grid_y.unsqueeze(-1)], -1)\n\n        scale = torch.cat([valid_W.unsqueeze(-1), valid_H.unsqueeze(-1)], 1).view(N_, 1, 1, 2)\n        grid = (grid.unsqueeze(0).expand(N_, -1, -1, -1) + 0.5) / scale\n        wh = torch.ones_like(grid) * 0.05 * (2.0 ** lvl)\n        proposal = torch.cat((grid, wh), -1).view(N_, -1, 4)\n        proposals.append(proposal)\n        _cur += (H_ * W_)\n    output_proposals = torch.cat(proposals, 1)\n    output_proposals_valid = ((output_proposals > 0.01) & (output_proposals < 0.99)).all(-1, keepdim=True)\n    output_proposals = torch.log(output_proposals / (1 - output_proposals))\n    output_proposals = output_proposals.masked_fill(memory_padding_mask.unsqueeze(-1), float('inf'))\n    output_proposals = output_proposals.masked_fill(~output_proposals_valid, float('inf'))\n\n    output_memory = memory\n    output_memory = output_memory.masked_fill(memory_padding_mask.unsqueeze(-1), float(0))\n    output_memory = output_memory.masked_fill(~output_proposals_valid, float(0))\n    return output_memory, output_proposals\n\n\ndef gen_sineembed_for_position(pos_tensor):\n    # n_query, bs, _ = pos_tensor.size()\n    # sineembed_tensor = torch.zeros(n_query, bs, 256)\n    scale = 2 * math.pi\n    dim_t = torch.arange(128, dtype=torch.float32, device=pos_tensor.device)\n    dim_t = 10000 ** (2 * (dim_t // 2) / 128)\n    x_embed = pos_tensor[:, :, 0] * scale\n    y_embed = pos_tensor[:, :, 1] * scale\n    pos_x = x_embed[:, :, None] / dim_t\n    pos_y = y_embed[:, :, None] / dim_t\n    pos_x = torch.stack((pos_x[:, :, 0::2].sin(), pos_x[:, :, 1::2].cos()), dim=3).flatten(2)\n    pos_y = torch.stack((pos_y[:, :, 0::2].sin(), pos_y[:, 
:, 1::2].cos()), dim=3).flatten(2)\n    if pos_tensor.size(-1) == 2:\n        pos = torch.cat((pos_y, pos_x), dim=2)\n    elif pos_tensor.size(-1) == 4:\n        w_embed = pos_tensor[:, :, 2] * scale\n        pos_w = w_embed[:, :, None] / dim_t\n        pos_w = torch.stack((pos_w[:, :, 0::2].sin(), pos_w[:, :, 1::2].cos()), dim=3).flatten(2)\n\n        h_embed = pos_tensor[:, :, 3] * scale\n        pos_h = h_embed[:, :, None] / dim_t\n        pos_h = torch.stack((pos_h[:, :, 0::2].sin(), pos_h[:, :, 1::2].cos()), dim=3).flatten(2)\n\n        pos = torch.cat((pos_y, pos_x, pos_w, pos_h), dim=2)\n    else:\n        raise ValueError(\"Unknown pos_tensor shape(-1):{}\".format(pos_tensor.size(-1)))\n    return pos\n\n\ndef _get_activation_fn(activation):\n    \"\"\"Return an activation function given a string\"\"\"\n    if activation == \"relu\":\n        return F.relu\n    if activation == \"gelu\":\n        return F.gelu\n    if activation == \"glu\":\n        return F.glu\n    if activation == \"prelu\":\n        return nn.PReLU()\n    if activation == \"selu\":\n        return F.selu\n    raise RuntimeError(F\"activation should be relu/gelu, not {activation}.\")\n\n\ndef _get_clones(module, N, layer_share=False):\n\n    if layer_share:\n        return nn.ModuleList([module for i in range(N)])\n    else:\n        return nn.ModuleList([copy.deepcopy(module) for i in range(N)])\n\ndef _get_clones_advanced(module, N, N_valid):\n    assert N_valid <= N\n    layers = []\n    for i in range(N):\n        if i < N_valid:\n            layers.append(copy.deepcopy(module))\n        else:\n            layers.append(nn.Identity())\n    return nn.ModuleList(layers)"
  }
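Finally, a hedged example of the `MLP` and `inverse_sigmoid` helpers above in the style of a DETR-like box refinement head (assuming `thirdparty/GLEE` is on `PYTHONPATH`); the dimensions and the `bbox_head` name are illustrative, not part of GLEE.
```python
# Illustrative DETR-style refinement: apply an MLP delta to reference boxes in logit space.
import torch
from glee.utils.utils import MLP, inverse_sigmoid

num_queries, hidden_dim = 100, 256
bbox_head = MLP(input_dim=hidden_dim, hidden_dim=hidden_dim, output_dim=4, num_layers=3)

queries = torch.randn(num_queries, hidden_dim)  # decoder query embeddings (made up)
reference = torch.rand(num_queries, 4)          # reference boxes in [0, 1], (cx, cy, w, h)

delta = bbox_head(queries)                                # per-query box refinement
refined = (inverse_sigmoid(reference) + delta).sigmoid()  # back to [0, 1] after sigmoid
print(refined.shape)  # torch.Size([100, 4])
```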
]