[
  {
    "path": ".gitignore",
    "content": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# Distribution / packaging\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\npip-wheel-metadata/\nshare/python-wheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n*.so\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.nox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n.hypothesis/\n.pytest_cache/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\nlocal_settings.py\ndb.sqlite3\ndb.sqlite3-journal\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/_build/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# IPython\nprofile_default/\nipython_config.py\n\n# pyenv\n.python-version\n\n# pipenv\n#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.\n#   However, in case of collaboration, if having platform-specific dependencies or dependencies\n#   having no cross-platform support, pipenv may install dependencies that don’t work, or not\n#   install at all, so you might want to exclude Pipfile.lock from version control.\nPipfile.lock\n\n# PEP 582; used by e.g. github.com/David-OConnor/pyflow\n__pypackages__/\n\n# Celery stuff\ncelerybeat-schedule\ncelerybeat.pid\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# VS Code\n.vscode/\n"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2024 Horizon Robotics\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "# BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence\n\n<div align=\"center\" class=\"authors\">\n    <a href=\"https://scholar.google.com/citations?user=pfXQwcQAAAAJ&hl=en\" target=\"_blank\">Xuewu Lin</a>,\n    <a href=\"https://wzmsltw.github.io/\" target=\"_blank\">Tianwei Lin</a>,\n    <a href=\"https://scholar.google.com/citations?user=F2e_jZMAAAAJ&hl=en\" target=\"_blank\">Lichao Huang</a>,\n    <a href=\"https://openreview.net/profile?id=~HONGYU_XIE2\" target=\"_blank\">Hongyu Xie</a>,\n    <a href=\"https://scholar.google.com/citations?user=HQfc8TEAAAAJ&hl=en\" target=\"_blank\">Zhizhong Su</a>\n</div>\n\n\n<div align=\"center\" style=\"line-height: 3;\">\n  <a href=\"https://github.com/HorizonRobotics/BIP3D\" target=\"_blank\" style=\"margin: 2px;\">\n    <img alt=\"Code\" src=\"https://img.shields.io/badge/Code-Github-bule\" style=\"display: inline-block; vertical-align: middle;\"/>\n  </a>\n  <a href=\"https://linxuewu.github.io/BIP3D-page/\" target=\"_blank\" style=\"margin: 2px;\">\n    <img alt=\"Homepage\" src=\"https://img.shields.io/badge/Homepage-BIP3D-green\" style=\"display: inline-block; vertical-align: middle;\"/>\n  </a>\n  <a href=\"https://huggingface.co/HorizonRobotics/BIP3D\" target=\"_blank\" style=\"margin: 2px;\">\n    <img alt=\"Hugging Face\" src=\"https://img.shields.io/badge/Models-Hugging%20Face-yellow\" style=\"display: inline-block; vertical-align: middle;\"/>\n  </a>\n  <a href=\"https://arxiv.org/abs/2411.14869\" target=\"_blank\" style=\"margin: 2px;\">\n    <img alt=\"Paper\" src=\"https://img.shields.io/badge/Paper-Arxiv-red\" style=\"display: inline-block; vertical-align: middle;\"/>\n  </a>\n</div>\n\n\n## :rocket: News\n**01/Jun/2025**: We have refactored and integrated the BIP3D code into [robo_orchard_lab](https://github.com/HorizonRobotics/robo_orchard_lab/tree/master/projects/bip3d_grounding), removing the dependency on MM​​ series. The environment is now easier to set up, and the performance are improved. Welcome to try it out!\n\n**14/Mar/2025**: Our code has been released.\n\n**27/Feb/2025**: Our paper has been accepted by CVPR 2025.\n\n**22/Nov/2024**: We release our paper to [Arxiv](https://arxiv.org/abs/2411.14869).\n\n## :open_book: Quick Start\n[Quick Start](docs/quick_start.md)\n\n## :link: Framework\n<div align=\"center\">\n  <img src=\"https://github.com/HorizonRobotics/BIP3D/raw/main/resources/bip3d_structure.png\" width=\"90%\" alt=\"BIP3D\" />\n  <p style=\"font-size:0.8em; color:#555;\">The Architecture Diagram of BIP3D, where the red stars indicate the parts that have been modified or added compared to the base model, GroundingDINO, and dashed lines indicate optional elements.</p>\n</div>\n\n## :trophy: Results on EmbodiedScan Benchmark\nWe made several improvements based on the original paper, achieving better 3D perception results. The main improvements include the following two points:\n1. **New Fusion Operation**: We enhanced the decoder by replacing the deformable aggregation (DAG) with a 3D deformable attention mechanism (DAT). Specifically, we improved the feature sampling process by transitioning from bilinear interpolation to trilinear interpolation, which leverages depth distribution for more accurate feature extraction.\n2. **Mixed Data Training**: To optimize the grounding model's performance, we adopted a mixed-data training strategy by integrating detection data with grounding data during the grounding finetuning process.\n\n### 1. 
### 1. Results on Multi-view 3D Detection Validation Dataset\nIn the `Op` column, DAG denotes deformable aggregation and DAT denotes 3D deformable attention. Set [`with_depth=True`](configs/bip3d_det.py#L175) to activate DAT.\n\nThe metric in the table is `AP@0.25`. For more metrics, please refer to the logs.\n|Model | Inputs | Op | Overall | Head | Common | Tail | Small | Medium | Large | ScanNet | 3RScan | MP3D | ckpt | log |\n|  :----  | :---: |  :---: | :---: |:---: | :---: | :---: | :---:| :---:|:---:|:---: | :---: | :----: | :----: | :---: |\n|[BIP3D](configs/bip3d_det_rgb.py) | RGB | DAG | 16.57|23.29|13.84|12.29|2.67|17.85|12.89|19.71|26.76|8.50   | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/det_rgb_dag/model_checkpoint.pth) | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/det_rgb_dag/job.log) |\n|[BIP3D](configs/bip3d_det_rgb.py) | RGB | DAT | 16.67|22.41|14.19|13.18|3.32|17.25|14.89|20.80|24.18|9.91  | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/det_rgb_dat/model_checkpoint.pth) | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/det_rgb_dat/job.log) |\n|[BIP3D](configs/bip3d_det.py) |RGB-D | DAG | 22.53|28.89|20.51|17.83|6.95|24.21|15.46|24.77|35.29|10.34  | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/det_rgbd_dag/model_checkpoint.pth) | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/det_rgbd_dag/job.log) |\n|[BIP3D](configs/bip3d_det.py) |RGB-D | DAT | 23.24|31.51|20.20|17.62|7.31|24.09|15.82|26.35|36.29|11.44   | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/det_rgbd_dat/model_checkpoint.pth) | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/det_rgbd_dat/job.log) |\n\n### 2. Results on Multi-view 3D Grounding Mini Dataset\nTo train and validate on the mini dataset, set [`data_version=\"v1-mini\"`](configs/bip3d_grounding.py#L333).\n|Model | Inputs | Op | Overall | Easy | Hard | View-dep | View-indep | ScanNet | 3RScan | MP3D | ckpt | log |\n|  :----  | :---: | :---: | :---: | :---: | :---:| :---:|:---:|:---: | :---: | :----: |:---: | :----: |\n|[BIP3D](configs/bip3d_grounding_rgb.py) | RGB | DAG | 44.00|44.39|39.56|46.05|42.92|48.62|42.47|36.40  | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/grounding_mini_rgb_dag/model_checkpoint.pth) | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/grounding_mini_rgb_dag/job.log) |\n|[BIP3D](configs/bip3d_grounding_rgb.py) | RGB | DAT | 44.43|44.74|41.02|45.17|44.04|49.70|41.81|37.28  | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/grounding_mini_rgb_dat/model_checkpoint.pth) | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/grounding_mini_rgb_dat/job.log) |\n|[BIP3D](configs/bip3d_grounding.py) | RGB-D | DAG | 45.79|46.22|40.91|45.93|45.71|48.94|46.61|37.36  | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/grounding_mini_rgbd_dag/model_checkpoint.pth) | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/grounding_mini_rgbd_dag/job.log) |\n|[BIP3D](configs/bip3d_grounding.py) | RGB-D | DAT | 58.47|59.02|52.23|60.20|57.56|66.63|54.79|46.72  | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/grounding_mini_rgbd_dat/model_checkpoint.pth) | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/grounding_mini_rgbd_dat/job.log) |\n\n\n
### 3. Results on Multi-view 3D Grounding Validation Dataset\n|Model | Inputs | Op | Mixed Data | Overall | Easy | Hard | View-dep | View-indep | ScanNet | 3RScan | MP3D | ckpt | log |\n|  :----  | :---: | :---: | :---: |:---: | :---: | :---:| :---:|:---:|:---: | :---: | :----: |:---: | :----: |\n|[BIP3D](configs/bip3d_grounding_rgb.py) | RGB | DAG |No| 45.81|46.21|41.34|47.07|45.09|50.40|47.53|32.97   | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/grounding_rgb_dag/model_checkpoint.pth) | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/grounding_rgb_dag/job.log) |\n|[BIP3D](configs/bip3d_grounding_rgb.py) | RGB | DAT |No| 47.29|47.82|41.42|48.58|46.56|52.74|47.85|34.60   | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/grounding_rgb_dat/model_checkpoint.pth) | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/grounding_rgb_dat/job.log) |\n|[BIP3D](configs/bip3d_grounding.py) | RGB-D | DAG |No| 53.75|53.87|52.43|55.21|52.93|60.05|54.92|38.20   | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/grounding_rgbd_dag/model_checkpoint.pth) | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/grounding_rgbd_dag/job.log) |\n|[BIP3D](configs/bip3d_grounding.py) | RGB-D | DAT |No|61.36|61.88|55.58|62.43|60.76|66.96|62.75|46.92   | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/grounding_rgbd_dat/model_checkpoint.pth) | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/grounding_rgbd_dat/job.log) |\n|[BIP3D](configs/bip3d_det_grounding.py) | RGB-D | DAT |Yes|66.58|66.99|62.07|67.95|65.81|72.43|68.26|51.14   | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/grounding_rgbd_dat_mixdata/model_checkpoint.pth) | [link](https://huggingface.co/HorizonRobotics/BIP3D/blob/main/grounding_rgbd_dat_mixdata/job.log) |\n\n\n### 4. [Results on Multi-view 3D Grounding Test Dataset](https://huggingface.co/spaces/AGC2024/visual-grounding-2024)\n\n|Model | Overall | Easy | Hard | View-dep | View-indep | ckpt | log |\n|  :----  | :---: | :---: | :---: | :---: | :---:| :---:|:---:|\n|[EmbodiedScan](https://github.com/OpenRobotLab/EmbodiedScan) | 39.67 | 40.52 | 30.24 | 39.05 | 39.94 | - | - |\n|[SAG3D*](https://opendrivelab.github.io/Challenge%202024/multiview_Mi-Robot.pdf) | 46.92 | 47.72 | 38.03 | 46.31 | 47.18 | - | - |\n|[DenseG*](https://opendrivelab.github.io/Challenge%202024/multiview_THU-LenovoAI.pdf) | 59.59 | 60.39 | 50.81 | 60.50 | 59.20 |  - | - |\n|[BIP3D](configs/bip3d_det_grounding.py) | 67.38 | 68.12 | 59.08 | 67.88 | 67.16 |  - | - |\n|BIP3D-B | 70.53 | 71.22 | 62.91 | 70.69 | 70.47 | - | - |\n\n`*` denotes model ensemble; note that our BIP3D does not use the ensemble trick. These results differ from those reported in the paper and show significant improvements.
\n\nOur best model, BIP3D-B, is based on GroundingDINO-base and is trained with the ARKitScenes dataset added.\n\n\n## :page_facing_up: Citation\n```\n@article{lin2024bip3d,\n  title={BIP3D: Bridging 2D Images and 3D Perception for Embodied Intelligence},\n  author={Lin, Xuewu and Lin, Tianwei and Huang, Lichao and Xie, Hongyu and Su, Zhizhong},\n  journal={arXiv preprint arXiv:2411.14869},\n  year={2024}\n}\n```\n\n\n## :handshake: Acknowledgement\n[EmbodiedScan](https://github.com/OpenRobotLab/EmbodiedScan)\n\n[Sparse4D](https://github.com/HorizonRobotics/Sparse4D)\n\n[3D-deformable-attention](https://github.com/IDEA-Research/3D-deformable-attention)\n\n[mmdet-GroundingDINO](https://github.com/open-mmlab/mmdetection/tree/main/configs/grounding_dino)\n"
  },
  {
    "path": "bip3d/__init__.py",
    "content": "from .registry import *\n"
  },
  {
    "path": "bip3d/converter/extract_occupancy_ann.py",
    "content": "import os\nimport shutil\nfrom argparse import ArgumentParser\n\nfrom tqdm import tqdm\n\n\ndef extract_occupancy(dataset, src, dst):\n    \"\"\"Extract occupancy annotations of a single dataset to dataset root.\"\"\"\n    print('Processing dataset', dataset)\n    scenes = os.listdir(os.path.join(src, dataset))\n    dst_dataset = os.path.join(dst, dataset)\n    if not os.path.exists(dst_dataset):\n        print('Missing dataset:', dataset)\n        return\n    for scene in tqdm(scenes):\n        if dataset == 'scannet':\n            dst_scene = os.path.join(dst_dataset, 'scans', scene)\n        else:\n            dst_scene = os.path.join(dst_dataset, scene)\n\n        if not os.path.exists(dst_scene):\n            print(f'Missing scene {scene} in dataset {dataset}')\n            continue\n        dst_occ = os.path.join(dst_scene, 'occupancy')\n        if not os.path.exists(dst_occ):\n            shutil.copytree(os.path.join(src, dataset, scene), dst_occ)\n        else:\n            files = os.listdir(os.path.join(src, dataset, scene))\n            for file in files:\n                if not os.path.exists(os.path.join(dst_occ, file)):\n                    shutil.copyfile(os.path.join(src, dataset, scene, file),\n                                    os.path.join(dst_occ, file))\n\n\nif __name__ == '__main__':\n    parser = ArgumentParser()\n    parser.add_argument('--src',\n                        required=True,\n                        help='folder of the occupancy annotations')\n    parser.add_argument('--dst',\n                        required=True,\n                        help='folder of the raw datasets')\n    args = parser.parse_args()\n    datasets = os.listdir(args.src)\n    for dataset in datasets:\n        extract_occupancy(dataset, args.src, args.dst)\n"
  },
  {
    "path": "bip3d/converter/generate_image_3rscan.py",
    "content": "import os\nimport zipfile\nfrom argparse import ArgumentParser\nfrom functools import partial\n\nimport mmengine\n\n\ndef process_scene(path, scene_name):\n    \"\"\"Process single 3Rscan scene.\"\"\"\n    with zipfile.ZipFile(os.path.join(path, scene_name, 'sequence.zip'),\n                         'r') as zip_ref:\n        zip_ref.extractall(os.path.join(path, scene_name, 'sequence'))\n\n\nif __name__ == '__main__':\n    parser = ArgumentParser()\n    parser.add_argument('--dataset_folder',\n                        required=True,\n                        help='folder of the dataset.')\n    parser.add_argument('--nproc', type=int, default=8)\n    args = parser.parse_args()\n\n    mmengine.track_parallel_progress(func=partial(process_scene,\n                                                  args.dataset_folder),\n                                     tasks=os.listdir(args.dataset_folder),\n                                     nproc=args.nproc)\n"
  },
  {
    "path": "bip3d/converter/generate_image_matterport3d.py",
    "content": "import os\nimport zipfile\nfrom argparse import ArgumentParser\nfrom functools import partial\n\nimport mmengine\n\n\ndef process_scene(path, output_folder, scene_name):\n    \"\"\"Process single 3Rscan scene.\"\"\"\n    files = list(os.listdir(os.path.join(path, scene_name)))\n    for file in files:\n        if not file.endswith(\".zip\"):\n            continue\n        if file != \"sens.zip\":\n            continue\n        with zipfile.ZipFile(os.path.join(path, scene_name, file),\n                             'r') as zip_ref:\n            if file == \"sens.zip\":\n                zip_ref.extractall(os.path.join(output_folder, scene_name, file[:-4]))\n            else:\n                zip_ref.extractall(output_folder)\n\n\nif __name__ == '__main__':\n    parser = ArgumentParser()\n    parser.add_argument('--dataset_folder',\n                        required=True,\n                        help='folder of the dataset.')\n    parser.add_argument('--output_folder',\n                        required=True,\n                        help='output folder of the dataset.')\n    parser.add_argument('--nproc', type=int, default=8)\n    args = parser.parse_args()\n\n    if not os.path.exists(args.output_folder):\n        os.makedirs(args.output_folder, exist_ok=True)\n\n    mmengine.track_parallel_progress(func=partial(process_scene,\n                                                  args.dataset_folder, args.output_folder),\n                                              tasks=os.listdir(args.dataset_folder),\n                                     nproc=args.nproc)\n"
  },
  {
    "path": "bip3d/converter/generate_image_scannet.py",
    "content": "# Modified from https://github.com/ScanNet/ScanNet/blob/master/SensReader/python/SensorData.py # noqa\nimport os\nimport struct\nimport zlib\nfrom argparse import ArgumentParser\nfrom functools import partial\n\nimport imageio\nimport mmengine\nimport numpy as np\n\nCOMPRESSION_TYPE_COLOR = {-1: 'unknown', 0: 'raw', 1: 'png', 2: 'jpeg'}\n\nCOMPRESSION_TYPE_DEPTH = {\n    -1: 'unknown',\n    0: 'raw_ushort',\n    1: 'zlib_ushort',\n    2: 'occi_ushort'\n}\n\n\nclass RGBDFrame:\n    \"\"\"Class for single ScanNet RGB-D image processing.\"\"\"\n\n    def load(self, file_handle):\n        \"\"\"Load basic information of a given RGBD frame.\"\"\"\n        self.camera_to_world = np.asarray(struct.unpack(\n            'f' * 16, file_handle.read(16 * 4)),\n                                          dtype=np.float32).reshape(4, 4)\n        self.timestamp_color = struct.unpack('Q', file_handle.read(8))[0]\n        self.timestamp_depth = struct.unpack('Q', file_handle.read(8))[0]\n        self.color_size_bytes = struct.unpack('Q', file_handle.read(8))[0]\n        self.depth_size_bytes = struct.unpack('Q', file_handle.read(8))[0]\n        self.color_data = b''.join(\n            struct.unpack('c' * self.color_size_bytes,\n                          file_handle.read(self.color_size_bytes)))\n        self.depth_data = b''.join(\n            struct.unpack('c' * self.depth_size_bytes,\n                          file_handle.read(self.depth_size_bytes)))\n\n    def decompress_depth(self, compression_type):\n        \"\"\"Decompress the depth data.\"\"\"\n        assert compression_type == 'zlib_ushort'\n        return zlib.decompress(self.depth_data)\n\n    def decompress_color(self, compression_type):\n        \"\"\"Decompress the RGB image data.\"\"\"\n        assert compression_type == 'jpeg'\n        return imageio.imread(self.color_data)\n\n\nclass SensorData:\n    \"\"\"Class for single ScanNet scene processing.\n\n    Single scene file contains multiple RGB-D images.\n    \"\"\"\n\n    def __init__(self, filename, fast=False):\n        self.version = 4\n        self.load(filename, fast)\n\n    def load(self, filename, fast):\n        \"\"\"Load a single scene data with multiple RGBD frames.\"\"\"\n        with open(filename, 'rb') as f:\n            version = struct.unpack('I', f.read(4))[0]\n            assert self.version == version\n            strlen = struct.unpack('Q', f.read(8))[0]\n            self.sensor_name = b''.join(\n                struct.unpack('c' * strlen, f.read(strlen)))\n            self.intrinsic_color = np.asarray(struct.unpack(\n                'f' * 16, f.read(16 * 4)),\n                                              dtype=np.float32).reshape(4, 4)\n            self.extrinsic_color = np.asarray(struct.unpack(\n                'f' * 16, f.read(16 * 4)),\n                                              dtype=np.float32).reshape(4, 4)\n            self.intrinsic_depth = np.asarray(struct.unpack(\n                'f' * 16, f.read(16 * 4)),\n                                              dtype=np.float32).reshape(4, 4)\n            self.extrinsic_depth = np.asarray(struct.unpack(\n                'f' * 16, f.read(16 * 4)),\n                                              dtype=np.float32).reshape(4, 4)\n            self.color_compression_type = COMPRESSION_TYPE_COLOR[struct.unpack(\n                'i', f.read(4))[0]]\n            self.depth_compression_type = COMPRESSION_TYPE_DEPTH[struct.unpack(\n                'i', f.read(4))[0]]\n            self.color_width = 
struct.unpack('I', f.read(4))[0]\n            self.color_height = struct.unpack('I', f.read(4))[0]\n            self.depth_width = struct.unpack('I', f.read(4))[0]\n            self.depth_height = struct.unpack('I', f.read(4))[0]\n            self.depth_shift = struct.unpack('f', f.read(4))[0]\n            num_frames = struct.unpack('Q', f.read(8))[0]\n            self.num_frames = num_frames\n            self.frames = []\n            if fast:\n                index = list(range(num_frames))[::10]\n            else:\n                index = list(range(num_frames))\n            self.index = index\n            for i in range(num_frames):\n                frame = RGBDFrame()\n                frame.load(f)\n                if i in index:\n                    self.frames.append(frame)\n\n    def export_depth_images(self, output_path):\n        \"\"\"Export depth images to the output path.\"\"\"\n        if not os.path.exists(output_path):\n            os.makedirs(output_path)\n        for f in range(len(self.frames)):\n            depth_data = self.frames[f].decompress_depth(\n                self.depth_compression_type)\n            depth = np.frombuffer(depth_data, dtype=np.uint16).reshape(\n                self.depth_height, self.depth_width)\n            imageio.imwrite(\n                os.path.join(output_path,\n                             self.index_to_str(self.index[f]) + '.png'), depth)\n\n    def export_color_images(self, output_path):\n        \"\"\"Export RGB images to the output path.\"\"\"\n        if not os.path.exists(output_path):\n            os.makedirs(output_path)\n        for f in range(len(self.frames)):\n            color = self.frames[f].decompress_color(\n                self.color_compression_type)\n            imageio.imwrite(\n                os.path.join(output_path,\n                             self.index_to_str(self.index[f]) + '.jpg'), color)\n\n    @staticmethod\n    def index_to_str(index):\n        \"\"\"Convert the sample index to string.\"\"\"\n        return str(index).zfill(5)\n\n    @staticmethod\n    def save_mat_to_file(matrix, filename):\n        \"\"\"Save a matrix to file.\"\"\"\n        with open(filename, 'w') as f:\n            for line in matrix:\n                np.savetxt(f, line[np.newaxis], fmt='%f')\n\n    def export_poses(self, output_path):\n        \"\"\"Export camera poses to the output path.\"\"\"\n        if not os.path.exists(output_path):\n            os.makedirs(output_path)\n        for f in range(len(self.frames)):\n            self.save_mat_to_file(\n                self.frames[f].camera_to_world,\n                os.path.join(output_path,\n                             self.index_to_str(self.index[f]) + '.txt'))\n\n    def export_intrinsics(self, output_path):\n        \"\"\"Export the intrinsic matrix to the output path.\"\"\"\n        if not os.path.exists(output_path):\n            os.makedirs(output_path)\n        self.save_mat_to_file(self.intrinsic_color,\n                              os.path.join(output_path, 'intrinsic.txt'))\n\n    def export_depth_intrinsics(self, output_path):\n        \"\"\"Export the depth intrinsic matrix to the output path.\"\"\"\n        if not os.path.exists(output_path):\n            os.makedirs(output_path)\n        self.save_mat_to_file(self.intrinsic_depth,\n                              os.path.join(output_path, 'depth_intrinsic.txt'))\n\n\ndef process_scene(path, fast, idx):\n    \"\"\"Process a single ScanNet scene.\n\n    Extract RGB images, depth images, poses and camera intrinsics.\n    \"\"\"
\n    data = SensorData(os.path.join(path, idx, f'{idx}.sens'), fast)\n    # export relative to the dataset folder (the script chdirs into it),\n    # following the posed_images layout used by the EmbodiedScan converters\n    output_path = os.path.join('posed_images', idx)\n    data.export_color_images(output_path)\n    data.export_intrinsics(output_path)\n    data.export_poses(output_path)\n    data.export_depth_images(output_path)\n    data.export_depth_intrinsics(output_path)\n\n\ndef process_directory(path, fast, nproc):\n    \"\"\"Process the files in a directory with parallel support.\"\"\"\n    mmengine.track_parallel_progress(func=partial(process_scene, path, fast),\n                                     tasks=os.listdir(path),\n                                     nproc=nproc)\n\n\nif __name__ == '__main__':\n    parser = ArgumentParser()\n    parser.add_argument('--dataset_folder',\n                        default=None,\n                        help='folder of the dataset.')\n    parser.add_argument('--nproc', type=int, default=8)\n    parser.add_argument('--fast', action='store_true')\n    args = parser.parse_args()\n\n    if args.dataset_folder is not None:\n        os.chdir(args.dataset_folder)\n\n    # process train and val scenes\n    if os.path.exists('scans'):\n        process_directory('scans', args.fast, args.nproc)\n"
  },
  {
    "path": "bip3d/datasets/__init__.py",
    "content": "from .embodiedscan_det_grounding_dataset import EmbodiedScanDetGroundingDataset\nfrom .transforms import *  # noqa: F401,F403\n\n__all__ = [\n    'EmbodiedScanDetGroundingDataset'\n]\n"
  },
  {
    "path": "bip3d/datasets/embodiedscan_det_grounding_dataset.py",
    "content": "import math\nimport pickle\nimport copy\nimport tqdm\nimport os\nimport warnings\nfrom typing import Callable, List, Optional, Union\n\nimport mmengine\nimport numpy as np\nfrom mmengine.dataset import BaseDataset, force_full_init\nfrom mmengine.fileio import load\nfrom mmengine.logging import print_log\n\nfrom bip3d.registry import DATASETS\nfrom bip3d.structures import get_box_type\nfrom .utils import sample\n\nclass_names = (\n    'adhesive tape', 'air conditioner', 'alarm', 'album', 'arch', 'backpack',\n    'bag', 'balcony', 'ball', 'banister', 'bar', 'barricade', 'baseboard',\n    'basin', 'basket', 'bathtub', 'beam', 'beanbag', 'bed', 'bench', 'bicycle',\n    'bidet', 'bin', 'blackboard', 'blanket', 'blinds', 'board', 'body loofah',\n    'book', 'boots', 'bottle', 'bowl', 'box', 'bread', 'broom', 'brush',\n    'bucket', 'cabinet', 'calendar', 'camera', 'can', 'candle', 'candlestick',\n    'cap', 'car', 'carpet', 'cart', 'case', 'chair', 'chandelier', 'cleanser',\n    'clock', 'clothes', 'clothes dryer', 'coat hanger', 'coffee maker', 'coil',\n    'column', 'commode', 'computer', 'conducting wire', 'container', 'control',\n    'copier', 'cosmetics', 'couch', 'counter', 'countertop', 'crate', 'crib',\n    'cube', 'cup', 'curtain', 'cushion', 'decoration', 'desk', 'detergent',\n    'device', 'dish rack', 'dishwasher', 'dispenser', 'divider', 'door',\n    'door knob', 'doorframe', 'doorway', 'drawer', 'dress', 'dresser', 'drum',\n    'duct', 'dumbbell', 'dustpan', 'dvd', 'eraser', 'excercise equipment',\n    'fan', 'faucet', 'fence', 'file', 'fire extinguisher', 'fireplace',\n    'flowerpot', 'flush', 'folder', 'food', 'footstool', 'frame', 'fruit',\n    'furniture', 'garage door', 'garbage', 'glass', 'globe', 'glove',\n    'grab bar', 'grass', 'guitar', 'hair dryer', 'hamper', 'handle', 'hanger',\n    'hat', 'headboard', 'headphones', 'heater', 'helmets', 'holder', 'hook',\n    'humidifier', 'ironware', 'jacket', 'jalousie', 'jar', 'kettle',\n    'keyboard', 'kitchen island', 'kitchenware', 'knife', 'label', 'ladder',\n    'lamp', 'laptop', 'ledge', 'letter', 'light', 'luggage', 'machine',\n    'magazine', 'mailbox', 'map', 'mask', 'mat', 'mattress', 'menu',\n    'microwave', 'mirror', 'molding', 'monitor', 'mop', 'mouse', 'napkins',\n    'notebook', 'ottoman', 'oven', 'pack', 'package', 'pad', 'pan', 'panel',\n    'paper', 'paper cutter', 'partition', 'pedestal', 'pen', 'person', 'piano',\n    'picture', 'pillar', 'pillow', 'pipe', 'pitcher', 'plant', 'plate',\n    'player', 'plug', 'plunger', 'pool', 'pool table', 'poster', 'pot',\n    'price tag', 'printer', 'projector', 'purse', 'rack', 'radiator', 'radio',\n    'rail', 'range hood', 'refrigerator', 'remote control', 'ridge', 'rod',\n    'roll', 'roof', 'rope', 'sack', 'salt', 'scale', 'scissors', 'screen',\n    'seasoning', 'shampoo', 'sheet', 'shelf', 'shirt', 'shoe', 'shovel',\n    'shower', 'sign', 'sink', 'soap', 'soap dish', 'soap dispenser', 'socket',\n    'speaker', 'sponge', 'spoon', 'stairs', 'stall', 'stand', 'stapler',\n    'statue', 'steps', 'stick', 'stool', 'stopcock', 'stove', 'structure',\n    'sunglasses', 'support', 'switch', 'table', 'tablet', 'teapot',\n    'telephone', 'thermostat', 'tissue', 'tissue box', 'toaster', 'toilet',\n    'toilet paper', 'toiletry', 'tool', 'toothbrush', 'toothpaste', 'towel',\n    'toy', 'tray', 'treadmill', 'trophy', 'tube', 'tv', 'umbrella', 'urn',\n    'utensil', 'vacuum cleaner', 'vanity', 'vase', 'vent', 'ventilation',\n    'wardrobe', 'washbasin', 'washing 
machine', 'water cooler', 'water heater',\n    'window', 'window frame', 'windowsill', 'wine', 'wire', 'wood', 'wrap')\nhead_labels = [\n    48, 177, 82, 179, 37, 243, 28, 277, 32, 84, 215, 145, 182, 170, 22, 72, 30,\n    141, 65, 257, 221, 225, 52, 75, 231, 158, 236, 156, 47, 74, 6, 18, 71, 242,\n    217, 251, 66, 263, 5, 45, 14, 73, 278, 198, 24, 23, 196, 252, 19, 135, 26,\n    229, 183, 200, 107, 272, 246, 269, 125, 59, 279, 15, 163, 258, 57, 195, 51,\n    88, 97, 58, 102, 36, 137, 31, 80, 160, 155, 61, 238, 96, 190, 25, 219, 152,\n    142, 201, 274, 249, 178, 192\n]\ncommon_labels = [\n    189, 164, 101, 205, 273, 233, 131, 180, 86, 220, 67, 268, 224, 270, 53,\n    203, 237, 226, 10, 133, 248, 41, 55, 16, 199, 134, 99, 185, 2, 20, 234,\n    194, 253, 35, 174, 8, 223, 13, 91, 262, 230, 121, 49, 63, 119, 162, 79,\n    168, 245, 267, 122, 104, 100, 1, 176, 280, 140, 209, 259, 143, 165, 147,\n    117, 85, 105, 95, 109, 207, 68, 175, 106, 60, 4, 46, 171, 204, 111, 211,\n    108, 120, 157, 222, 17, 264, 151, 98, 38, 261, 123, 78, 118, 127, 240, 124\n]\ntail_labels = [\n    76, 149, 173, 250, 275, 255, 34, 77, 266, 283, 112, 115, 186, 136, 256, 40,\n    254, 172, 9, 212, 213, 181, 154, 94, 191, 193, 3, 130, 146, 70, 128, 167,\n    126, 81, 7, 11, 148, 228, 239, 247, 21, 42, 89, 153, 161, 244, 110, 0, 29,\n    114, 132, 159, 218, 232, 260, 56, 92, 116, 282, 33, 113, 138, 12, 188, 44,\n    150, 197, 271, 169, 206, 90, 235, 103, 281, 184, 208, 216, 202, 214, 241,\n    129, 210, 276, 64, 27, 87, 139, 227, 187, 62, 43, 50, 69, 93, 144, 166,\n    265, 54, 83, 39\n]\n\n\n@DATASETS.register_module()\nclass EmbodiedScanDetGroundingDataset(BaseDataset):\n    def __init__(\n        self,\n        data_root: str,\n        ann_file: str,\n        vg_file=None,\n        metainfo: Optional[dict] = None,\n        pipeline: List[Union[dict, Callable]] = [],\n        test_mode: bool = False,\n        load_eval_anns: bool = True,\n        filter_empty_gt: bool = True,\n        remove_dontcare: bool = False,\n        box_type_3d: str = \"Euler-Depth\",\n        dataset_length=None,\n        mode=\"detection\",\n        max_n_images=50,\n        n_images_per_sample=1,\n        drop_last_per_scene=False,\n        part=None,\n        temporal=False,\n        num_text=1,\n        tokens_positive_rebuild=True,\n        sep_token=\"[SEP]\",\n        **kwargs,\n    ):\n        self.box_type_3d, self.box_mode_3d = get_box_type(box_type_3d)\n        self.filter_empty_gt = filter_empty_gt\n        self.remove_dontcare = remove_dontcare\n        self.load_eval_anns = load_eval_anns\n        self.dataset_length = dataset_length\n        self.part = part\n        self.mode = mode\n        assert self.mode in [\"detection\", \"continuous\", \"grounding\"]\n        super().__init__(\n            ann_file=ann_file,\n            metainfo=metainfo,\n            data_root=data_root,\n            pipeline=pipeline,\n            test_mode=test_mode,\n            serialize_data=self.mode == \"detection\",\n            **kwargs,\n        )\n        if self.mode == \"continuous\":\n            self.max_n_images = max_n_images\n            self.n_images_per_sample = n_images_per_sample\n            self.drop_last_per_scene = drop_last_per_scene\n            self.convert_to_continuous()\n        elif self.mode == \"grounding\":\n            self.vg_file = vg_file\n            self.num_text = num_text\n            self.tokens_positive_rebuild = tokens_positive_rebuild\n            self.sep_token = sep_token\n            
self.load_language_data()\n            self.data_bytes, self.data_address = self._serialize_data()\n            self.serialize_data = True\n        print_log(f\"dataset length : {self.__len__()}\")\n\n    def process_metainfo(self):\n        assert \"categories\" in self._metainfo\n\n        if \"classes\" not in self._metainfo:\n            self._metainfo.setdefault(\n                \"classes\", list(self._metainfo[\"categories\"].keys())\n            )\n\n        self.label_mapping = np.full(\n            max(list(self._metainfo[\"categories\"].values())) + 1, -1, dtype=int\n        )\n        for key, value in self._metainfo[\"categories\"].items():\n            if key in self._metainfo[\"classes\"]:\n                self.label_mapping[value] = self._metainfo[\"classes\"].index(\n                    key\n                )\n\n    def parse_data_info(self, info: dict):\n        info[\"box_type_3d\"] = self.box_type_3d\n        info[\"axis_align_matrix\"] = self._get_axis_align_matrix(info)\n        info[\"scan_id\"] = info[\"sample_idx\"]\n        ann_dataset = info[\"sample_idx\"].split(\"/\")[0]\n        if ann_dataset == \"matterport3d\":\n            info[\"depth_shift\"] = 4000.0\n        else:\n            info[\"depth_shift\"] = 1000.0\n        # Because multi-view settings are different from original designs\n        # we temporarily follow the ori design in ImVoxelNet\n        info[\"img_path\"] = []\n        info[\"depth_img_path\"] = []\n        if \"cam2img\" in info:\n            cam2img = info[\"cam2img\"].astype(np.float32)\n        else:\n            cam2img = []\n\n        extrinsics = []\n        for i in range(len(info[\"images\"])):\n            img_path = os.path.join(\n                self.data_prefix.get(\"img_path\", \"\"),\n                info[\"images\"][i][\"img_path\"],\n            )\n            depth_img_path = os.path.join(\n                self.data_prefix.get(\"img_path\", \"\"),\n                info[\"images\"][i][\"depth_path\"],\n            )\n\n            info[\"img_path\"].append(img_path)\n            info[\"depth_img_path\"].append(depth_img_path)\n            align_global2cam = np.linalg.inv(\n                info[\"axis_align_matrix\"] @ info[\"images\"][i][\"cam2global\"]\n            )\n            extrinsics.append(align_global2cam.astype(np.float32))\n            if \"cam2img\" not in info:\n                cam2img.append(info[\"images\"][i][\"cam2img\"].astype(np.float32))\n\n        info[\"depth2img\"] = dict(\n            extrinsic=extrinsics,\n            intrinsic=cam2img,\n            origin=np.array([0.0, 0.0, 0.5]).astype(np.float32),\n        )\n\n        if \"depth_cam2img\" not in info:\n            info[\"depth_cam2img\"] = cam2img\n\n        if not self.test_mode:\n            info[\"ann_info\"] = self.parse_ann_info(info)\n\n        if self.test_mode and self.load_eval_anns:\n            info[\"ann_info\"] = self.parse_ann_info(info)\n            info[\"eval_ann_info\"] = self._remove_dontcare(info[\"ann_info\"])\n\n        return info\n\n    def parse_ann_info(self, info: dict):\n        \"\"\"Process the `instances` in data info to `ann_info`.\n\n        Args:\n            info (dict): Info dict.\n\n        Returns:\n            dict: Processed `ann_info`.\n        \"\"\"\n\n        ann_info = None\n        if \"instances\" in info and len(info[\"instances\"]) > 0:\n            ann_info = dict(\n                gt_bboxes_3d=np.zeros(\n                    (len(info[\"instances\"]), 9), dtype=np.float32\n                
),\n                gt_labels_3d=np.zeros(\n                    (len(info[\"instances\"]),), dtype=np.int64\n                ),\n                gt_names=[],\n                bbox_id=np.zeros((len(info[\"instances\"]),), dtype=np.int64) - 1,\n            )\n            for idx, instance in enumerate(info[\"instances\"]):\n                ann_info[\"gt_bboxes_3d\"][idx] = instance[\"bbox_3d\"]\n                ann_info[\"gt_labels_3d\"][idx] = self.label_mapping[\n                    instance[\"bbox_label_3d\"]\n                ]\n                ann_info[\"gt_names\"].append(\n                    self._metainfo[\"classes\"][ann_info[\"gt_labels_3d\"][idx]]\n                    if ann_info[\"gt_labels_3d\"][idx] >= 0\n                    else \"others\"\n                )\n                ann_info[\"bbox_id\"][idx] = instance[\"bbox_id\"]\n\n        # pack ann_info for return\n        if ann_info is None:\n            ann_info = dict()\n            ann_info[\"gt_bboxes_3d\"] = np.zeros((0, 9), dtype=np.float32)\n            ann_info[\"gt_labels_3d\"] = np.zeros((0,), dtype=np.int64)\n            ann_info[\"bbox_id\"] = np.zeros((0,), dtype=np.int64) - 1\n            ann_info[\"gt_names\"] = []\n\n        # post-processing/filtering ann_info if not empty gt\n        if \"visible_instance_ids\" in info[\"images\"][0]:\n            ids = []\n            for i in range(len(info[\"images\"])):\n                ids.append(info[\"images\"][i][\"visible_instance_ids\"])\n            mask_length = ann_info[\"gt_labels_3d\"].shape[0]\n            ann_info[\"visible_instance_masks\"] = self._ids2masks(\n                ids, mask_length\n            )\n\n        if self.remove_dontcare:\n            ann_info = self._remove_dontcare(ann_info)\n\n        ann_dataset = info[\"sample_idx\"].split(\"/\")[0]\n        ann_info[\"gt_bboxes_3d\"] = self.box_type_3d(\n            ann_info[\"gt_bboxes_3d\"],\n            box_dim=ann_info[\"gt_bboxes_3d\"].shape[-1],\n            with_yaw=True,\n            origin=(0.5, 0.5, 0.5),\n        )\n        return ann_info\n\n    @staticmethod\n    def _get_axis_align_matrix(info: dict):\n        \"\"\"Get axis_align_matrix from info. If not exist, return identity mat.\n\n        Args:\n            info (dict): Info of a single sample data.\n\n        Returns:\n            np.ndarray: 4x4 transformation matrix.\n        \"\"\"\n        if \"axis_align_matrix\" in info:\n            return np.array(info[\"axis_align_matrix\"])\n        else:\n            warnings.warn(\n                \"axis_align_matrix is not found in ScanNet data info, please \"\n                \"use new pre-process scripts to re-generate ScanNet data\"\n            )\n            return np.eye(4).astype(np.float32)\n\n    def _ids2masks(self, ids, mask_length):\n        \"\"\"Change visible_instance_ids to visible_instance_masks.\"\"\"\n        masks = []\n        for idx in range(len(ids)):\n            mask = np.zeros((mask_length,), dtype=bool)\n            mask[ids[idx]] = 1\n            masks.append(mask)\n        return masks\n\n    def _remove_dontcare(self, ann_info: dict):\n        \"\"\"Remove annotations that do not need to be cared.\n\n        -1 indicates dontcare in MMDet3d.\n\n        Args:\n            ann_info (dict): Dict of annotation infos. 
The\n                instance with label `-1` will be removed.\n\n        Returns:\n            dict: Annotations after filtering.\n        \"\"\"\n        img_filtered_annotations = {}\n        filter_mask = ann_info[\"gt_labels_3d\"] > -1\n        for key in ann_info.keys():\n            if key == \"instances\":\n                img_filtered_annotations[key] = ann_info[key]\n            elif key == \"visible_instance_masks\":\n                img_filtered_annotations[key] = []\n                for idx in range(len(ann_info[key])):\n                    img_filtered_annotations[key].append(\n                        ann_info[key][idx][filter_mask]\n                    )\n            elif key == \"gt_names\":\n                img_filtered_annotations[key] = [\n                    x for i, x in enumerate(ann_info[key]) if filter_mask[i]\n                ]\n            else:\n                img_filtered_annotations[key] = ann_info[key][filter_mask]\n        return img_filtered_annotations\n\n    def load_data_list(self):\n        annotations = load(self.ann_file)\n        if not isinstance(annotations, dict):\n            raise TypeError(\n                f\"The annotations loaded from annotation file \"\n                f\"should be a dict, but got {type(annotations)}!\"\n            )\n        if \"data_list\" not in annotations or \"metainfo\" not in annotations:\n            raise ValueError(\n                \"Annotation must have data_list and metainfo keys\"\n            )\n        metainfo = annotations[\"metainfo\"]\n        raw_data_list = annotations[\"data_list\"]\n\n        # Meta information loaded from the annotation file will not override\n        # the existing meta information from `BaseDataset.METAINFO` and the\n        # `metainfo` argument of the constructor.\n        for k, v in metainfo.items():\n            self._metainfo.setdefault(k, v)\n\n        self.process_metainfo()\n\n        # load and parse data_infos.\n        data_list = []\n        for raw_data_info in tqdm.tqdm(\n            raw_data_list,\n            mininterval=10,\n            desc=f\"Loading {'Test' if self.test_mode else 'Train'} dataset\",\n        ):\n            if self.part is not None:\n                valid = False\n                for x in self.part:\n                    if x in raw_data_info[\"sample_idx\"]:\n                        valid = True\n                        break\n                if not valid:\n                    continue\n\n            data_info = self.parse_data_info(raw_data_info)\n            if data_info is None:\n                continue\n            assert isinstance(data_info, dict)\n            data_list.append(data_info)\n\n            if (\n                self.dataset_length is not None\n                and len(data_list) >= self.dataset_length\n            ):\n                break\n        return data_list\n\n    @staticmethod\n    def _is_view_dep(text):\n        \"\"\"Check whether to augment based on sr3d utterance.\"\"\"\n        rels = [\n            'front', 'behind', 'back', 'left', 'right', 'facing', 'leftmost',\n            'rightmost', 'looking',
'across'\n        ]\n        words = set(text.split())\n        return any(rel in words for rel in rels)\n\n    def convert_info_to_scan(self):\n        self.scans = dict()\n        for data in self.data_list:\n            scan_id = data['scan_id']\n            self.scans[scan_id] = data\n        self.scan_ids = list(self.scans.keys())\n\n    def load_language_data(self):\n        self.convert_info_to_scan()\n        if isinstance(self.vg_file, str):\n            language_annotations = load(os.path.join(self.data_root, self.vg_file))\n        else:\n            language_annotations = []\n            for x in self.vg_file:\n                language_annotations.extend(load(os.path.join(self.data_root, x)))\n        if self.dataset_length is not None:\n            # evenly subsample the grounding annotations to dataset_length\n            interval = len(language_annotations) / self.dataset_length\n            output = []\n            for i in range(self.dataset_length):\n                output.append(language_annotations[int(interval * i)])\n            language_annotations = output\n        self.data_list = language_annotations\n        self.scan_id_to_data_idx = {}\n        for scan_id in self.scan_ids:\n            self.scan_id_to_data_idx[scan_id] = []\n        for i, d in enumerate(self.data_list):\n            self.scan_id_to_data_idx[d[\"scan_id\"]].append(i)\n\n    def convert_to_continuous(self):\n        self.convert_info_to_scan()\n        data_list = []\n        self.flag = []\n        self.image_id_dict = {}\n        for flag, scan_id in enumerate(self.scan_ids):\n            total_n = len(self.scans[scan_id][\"images\"])\n            sample_n = min(self.max_n_images, total_n)\n            ids = sample(total_n, sample_n, True).tolist()\n            self.image_id_dict[scan_id] = ids\n            if self.n_images_per_sample > 1:\n                if self.drop_last_per_scene:\n                    sample_n = math.floor(sample_n / self.n_images_per_sample)\n                else:\n                    sample_n = math.ceil(sample_n / self.n_images_per_sample)\n\n                ids = [\n                    ids[i*self.n_images_per_sample : (i+1)*self.n_images_per_sample]\n                    for i in range(sample_n)\n                ]\n            data_list.extend(\n                [dict(scan_id=scan_id, image_id=i) for i in ids]\n            )\n            self.flag.extend([flag] * len(ids))\n        self.data_list = data_list\n        self.flag = np.array(self.flag)\n\n    def get_data_info_continuous(self, data_info):\n        scan_id = data_info[\"scan_id\"]\n        data = copy.deepcopy(self.scans[scan_id])\n        img_idx = data_info[\"image_id\"]\n        if isinstance(img_idx, int):\n            img_idx = [img_idx]\n        for k in [\"images\", \"img_path\", \"depth_img_path\"]:\n            data[k] = index(data[k], img_idx)\n\n        seen_image_id = self.image_id_dict[scan_id]\n        seen_image_id = seen_image_id[:seen_image_id.index(img_idx[0])]\n\n        if len(seen_image_id) != 0:\n            last_visible_mask = index(\n                data[\"ann_info\"][\"visible_instance_masks\"], seen_image_id\n            )\n            last_visible_mask = np.stack(last_visible_mask).any(axis=0)\n        else:\n            last_visible_mask = np.zeros_like(\n                data[\"ann_info\"][\"visible_instance_masks\"][0]\n            )\n\n        visible_instance_masks = index(\n            data[\"ann_info\"][\"visible_instance_masks\"], img_idx\n        )\n\n        current_visible_mask = np.stack(visible_instance_masks).any(axis=0)\n        ignore_mask = np.logical_and(\n            last_visible_mask,
~current_visible_mask\n        )\n        all_visible_mask = np.logical_or(\n            last_visible_mask, current_visible_mask,\n        )\n        data[\"visible_instance_masks\"] = visible_instance_masks\n        data[\"all_visible_mask\"] = all_visible_mask\n        data[\"ignore_mask\"] = ignore_mask\n\n        data[\"depth2img\"][\"extrinsic\"] = index(\n            data[\"depth2img\"][\"extrinsic\"], img_idx\n        )\n        if isinstance(data[\"depth2img\"][\"intrinsic\"], list):\n            data[\"depth2img\"][\"intrinsic\"] = index(\n                data[\"depth2img\"][\"intrinsic\"], img_idx\n            )\n        return data\n\n    def merge_grounding_data(self, data_infos):\n        output = dict(\n            text=\"\",\n        )\n        for key in [\"target_id\", \"distractor_ids\", \"target\", \"anchors\", \"anchor_ids\", \"tokens_positive\"]:\n            if key in data_infos[0]:\n                output[key] = []\n        for data_info in data_infos:\n            if \"target_id\" in data_info and data_info[\"target_id\"] in output[\"target_id\"]:\n                continue\n\n            if self.tokens_positive_rebuild and \"target\" in data_info:\n                start_idx = data_info[\"text\"].find(data_info[\"target\"])\n                end_idx = start_idx + len(data_info[\"target\"])\n                tokens_positive = [[start_idx, end_idx]]\n            elif \"tokens_positive\" in data_info:\n                tokens_positive = data_info[\"tokens_positive\"]\n            else:\n                tokens_positive = None\n\n            if len(output[\"text\"]) == 0:\n                output[\"text\"] = data_info[\"text\"]\n            else:\n                if tokens_positive is not None:\n                    tokens_positive = np.array(tokens_positive)\n                    tokens_positive += len(output[\"text\"]) + len(self.sep_token)\n                    tokens_positive = tokens_positive.tolist()\n                output[\"text\"] += self.sep_token + data_info[\"text\"]\n            if tokens_positive is not None:\n                output[\"tokens_positive\"].append(tokens_positive)\n            for k in [\"target_id\", \"distractor_ids\", \"target\", \"anchors\", \"anchor_ids\"]:\n                if k not in data_info:\n                    continue\n                output[k].append(data_info[k])\n        output[\"scan_id\"] = data_infos[0][\"scan_id\"]\n        return output\n\n    def get_data_info_grounding(self, data_info):\n\n        flags = {}\n        if \"distractor_ids\" in data_info:\n            flags[\"is_unique\"] = len(data_info[\"distractor_ids\"]) == 0\n            flags[\"is_hard\"] = len(data_info[\"distractor_ids\"]) > 3\n        if \"text\" in data_info:\n            flags[\"is_view_dep\"] = self._is_view_dep(data_info[\"text\"])\n\n        scan_id = data_info[\"scan_id\"]\n        scan_data = copy.deepcopy(self.scans[scan_id])\n        data_info = [data_info]\n        if self.num_text > 1:\n            data_idx = self.scan_id_to_data_idx[scan_id]\n            sample_idx = sample(\n                len(data_idx),\n                max(min(int(np.random.rand()*self.num_text), len(data_idx))-1, 1),\n                fix_interval=False\n            )\n            for i in sample_idx:\n                data_info.append(super().get_data_info(data_idx[i]))\n        data_info = self.merge_grounding_data(data_info)\n        scan_data[\"text\"] = data_info[\"text\"]\n\n        if \"ann_info\" in scan_data and \"target\" in data_info:\n            tokens_positive = 
[]\n            obj_idx = []\n            for i, (target_name, id) in enumerate(\n                zip(data_info[\"target\"], data_info[\"target_id\"])\n            ):\n                mask = np.logical_and(\n                    scan_data[\"ann_info\"][\"bbox_id\"] == id,\n                    np.array(scan_data[\"ann_info\"][\"gt_names\"]) == target_name\n                )\n                if np.sum(mask) != 1:\n                    continue\n                obj_idx.append(np.where(mask)[0][0])\n                tokens_positive.append(data_info[\"tokens_positive\"][i])\n            obj_idx = np.array(obj_idx, dtype=np.int32)\n            scan_data[\"ann_info\"][\"gt_bboxes_3d\"] = scan_data[\"ann_info\"][\"gt_bboxes_3d\"][obj_idx]\n            scan_data[\"ann_info\"][\"gt_labels_3d\"] = scan_data[\"ann_info\"][\"gt_labels_3d\"][obj_idx]\n            scan_data[\"ann_info\"][\"gt_names\"] = [\n                scan_data[\"ann_info\"][\"gt_names\"][i] for i in obj_idx\n            ]\n            if \"visible_instance_masks\" in scan_data[\"ann_info\"]:\n                scan_data[\"ann_info\"][\"visible_instance_masks\"] = [\n                    visible_instance_mask[obj_idx]\n                    for visible_instance_mask in scan_data[\"ann_info\"][\"visible_instance_masks\"]\n                ]\n            scan_data[\"tokens_positive\"] = tokens_positive\n            scan_data[\"eval_ann_info\"] = scan_data[\"ann_info\"]\n            scan_data[\"eval_ann_info\"].update(flags)\n        elif \"tokens_positive\" in data_info:\n            scan_data[\"tokens_positive\"] = data_info.get(\"tokens_positive\")\n        return scan_data\n\n    @force_full_init\n    def get_data_info(self, idx):\n        data_info = super().get_data_info(idx)\n        if self.mode == \"detection\":\n            return data_info\n        elif self.mode == \"continuous\":\n            return self.get_data_info_continuous(data_info)\n        elif self.mode == \"grounding\":\n            return self.get_data_info_grounding(data_info)\n\n\ndef index(input, idx):\n    if isinstance(idx, int):\n        idx = [idx]\n    output = []\n    for i in idx:\n        output.append(input[i])\n    return output\n"
  },
  {
    "path": "bip3d/datasets/transforms/__init__.py",
    "content": "from .augmentation import (\n    GlobalRotScaleTrans,\n    RandomFlip3D,\n    ResizeCropFlipImage,\n)\nfrom .formatting import Pack3DDetInputs\nfrom .loading import LoadAnnotations3D, LoadDepthFromFile\nfrom .multiview import MultiViewPipeline\n\nfrom .transform import (\n    CategoryGroundingDataPrepare,\n    CamIntrisicStandardization,\n    CustomResize,\n    DepthProbLabelGenerator,\n)\n"
  },
  {
    "path": "bip3d/datasets/transforms/augmentation.py",
    "content": "from typing import List, Union\n\nimport cv2\nimport numpy as np\nfrom mmcv.transforms import BaseTransform\nfrom mmdet.datasets.transforms import RandomFlip\n\nfrom bip3d.registry import TRANSFORMS\n\n\n@TRANSFORMS.register_module()\nclass RandomFlip3D(RandomFlip):\n    \"\"\"Flip the points & bbox.\n\n    If the input dict contains the key \"flip\", then the flag will be used,\n    otherwise it will be randomly decided by a ratio specified in the init\n    method.\n\n    Required Keys:\n\n    - points (np.float32)\n    - gt_bboxes_3d (np.float32)\n\n    Modified Keys:\n\n    - points (np.float32)\n    - gt_bboxes_3d (np.float32)\n\n    Added Keys:\n\n    - points (np.float32)\n    - pcd_trans (np.float32)\n    - pcd_rotation (np.float32)\n    - pcd_rotation_angle (np.float32)\n    - pcd_scale_factor (np.float32)\n\n    Args:\n        sync_2d (bool): Whether to apply flip according to the 2D\n            images. If True, it will apply the same flip as that to 2D images.\n            If False, it will decide whether to flip randomly and independently\n            to that of 2D images. Defaults to True.\n        flip_2d (bool): Whether to apply flip for the img data.\n            If True, it will adopt the flip augmentation for the img.\n            False occurs on bev augmentation for bev-based image 3d det.\n            Defaults to True.\n        flip_3d (bool): Whether to apply flip for the 3d point cloud data.\n            If True, it will adopt the flip augmentation for the point cloud.\n            Defaults to True.\n        flip_ratio_bev_horizontal (float): The flipping probability\n            in horizontal direction. Defaults to 0.0.\n        flip_ratio_bev_vertical (float): The flipping probability\n            in vertical direction. Defaults to 0.0.\n        flip_box3d (bool): Whether to flip bounding box. In most of the case,\n            the box should be fliped. In cam-based bev detection, this is set\n            to False, since the flip of 2D images does not influence the 3D\n            box. 
\n    \"\"\"\n\n    def __init__(self,\n                 sync_2d: bool = True,\n                 flip_2d: bool = True,\n                 flip_3d: bool = True,\n                 flip_ratio_bev_horizontal: float = 0.0,\n                 flip_ratio_bev_vertical: float = 0.0,\n                 flip_box3d: bool = True,\n                 update_lidar2cam: bool = False,\n                 **kwargs) -> None:\n        # `flip_ratio_bev_horizontal` is used as the flip probability of the\n        # 2D image when `sync_2d` is True\n        super(RandomFlip3D, self).__init__(prob=flip_ratio_bev_horizontal,\n                                           direction='horizontal',\n                                           **kwargs)\n        self.sync_2d = sync_2d\n        self.flip_2d = flip_2d\n        self.flip_3d = flip_3d\n        self.flip_ratio_bev_horizontal = flip_ratio_bev_horizontal\n        self.flip_ratio_bev_vertical = flip_ratio_bev_vertical\n        self.flip_box3d = flip_box3d\n        self.update_lidar2cam = update_lidar2cam\n        if flip_ratio_bev_horizontal is not None:\n            assert isinstance(flip_ratio_bev_horizontal, (int, float)) \\\n                    and 0 <= flip_ratio_bev_horizontal <= 1\n        if flip_ratio_bev_vertical is not None:\n            assert isinstance(flip_ratio_bev_vertical, (int, float)) \\\n                    and 0 <= flip_ratio_bev_vertical <= 1\n\n    def transform(self, input_dict: dict) -> dict:\n        \"\"\"Call function to flip points, values in the ``bbox3d_fields`` and\n        also flip 2D image and its annotations.\n\n        Args:\n            input_dict (dict): Result dict from loading pipeline.\n\n        Returns:\n            dict: Flipped results, 'flip', 'flip_direction',\n            'pcd_horizontal_flip' and 'pcd_vertical_flip' keys are added\n            into result dict.\n        \"\"\"\n        # flip 2D image and its annotations\n        if self.flip_2d:\n            # only handle the 2D image\n            if 'img' in input_dict:\n                super(RandomFlip3D, self).transform(input_dict)\n            flip = input_dict.get('flip', False)\n            if flip:\n                input_dict = self.random_flip_data_2d(input_dict)\n\n        if self.flip_3d:\n            # only handle the 3D points\n            if self.sync_2d and 'img' in input_dict:\n                # TODO check if this is necessary in FCOS3D\n                input_dict['pcd_horizontal_flip'] = input_dict['flip']\n                input_dict['pcd_vertical_flip'] = False\n            else:\n                if 'pcd_horizontal_flip' not in input_dict:\n                    input_dict['pcd_horizontal_flip'] = (\n                        np.random.rand() < self.flip_ratio_bev_horizontal)\n                if 'pcd_vertical_flip' not in input_dict:\n                    input_dict['pcd_vertical_flip'] = (\n                        np.random.rand() < self.flip_ratio_bev_vertical)\n\n            if 'transformation_3d_flow' not in input_dict:\n                input_dict['transformation_3d_flow'] = []\n\n            if input_dict['pcd_horizontal_flip']:\n                self.random_flip_data_3d(input_dict, 'horizontal')\n                input_dict['transformation_3d_flow'].extend(['HF'])
  if input_dict['pcd_vertical_flip']:\n                self.random_flip_data_3d(input_dict, 'vertical')\n                input_dict['transformation_3d_flow'].extend(['VF'])\n            if self.update_lidar2cam:\n                self._transform_lidar2cam(input_dict)\n        return input_dict\n\n    def random_flip_data_3d(self,\n                            input_dict: dict,\n                            direction: str = 'horizontal') -> None:\n        \"\"\"Flip 3D data randomly.\n\n        `random_flip_data_3d` should take these situations into consideration:\n\n        - 1. LIDAR-based 3d detection\n        - 2. LIDAR-based 3d segmentation\n        - 3. vision-only detection\n        - 4. multi-modality 3d detection.\n\n        Args:\n            input_dict (dict): Result dict from loading pipeline.\n            direction (str): Flip direction. Defaults to 'horizontal'.\n\n        Returns:\n            dict: Flipped results, 'points', 'bbox3d_fields' keys are\n            updated in the result dict.\n        \"\"\"\n        assert direction in ['horizontal', 'vertical']\n        if self.flip_box3d:\n            if 'gt_bboxes_3d' in input_dict:\n                if 'points' in input_dict:\n                    input_dict['points'] = input_dict['gt_bboxes_3d'].flip(\n                        direction, points=input_dict['points'])\n                else:\n                    # vision-only detection\n                    input_dict['gt_bboxes_3d'].flip(direction)\n            else:\n                input_dict['points'].flip(direction)\n\n    def random_flip_data_2d(self,\n                            input_dict: dict,\n                            direction: str = 'horizontal') -> dict:\n        if 'centers_2d' in input_dict:\n            assert self.sync_2d is True and direction == 'horizontal', \\\n                'Only support sync_2d=True and horizontal flip with images'\n            w = input_dict['img_shape'][1]\n            input_dict['centers_2d'][..., 0] = \\\n                w - input_dict['centers_2d'][..., 0]\n            # need to modify the horizontal position of camera center\n            # along u-axis in the image (flip like centers2d)\n            # ['cam2img'][0][2] = c_u\n            # see more details and examples at\n            # https://github.com/open-mmlab/mmdetection3d/pull/744\n            input_dict['cam2img'][0][2] = w - input_dict['cam2img'][0][2]\n\n        if 'fov_ori2aug' not in input_dict:\n            fov_ori2aug = np.eye(4, 4)\n        else:\n            fov_ori2aug = input_dict['fov_ori2aug']\n        # get the value of w\n        w = input_dict['img_shape'][1]\n        # flip_matrix[0,0] = -1\n        # flip_matrix[0,3] = w\n        # fov_ori2aug = np.matmul(fov_ori2aug, flip_matrix)\n        fov_ori2aug[0] *= -1\n        fov_ori2aug[0, 3] += w\n        input_dict['fov_ori2aug'] = fov_ori2aug\n        return input_dict\n\n    def _flip_on_direction(self, results: dict) -> None:\n        \"\"\"Function to flip images, bounding boxes, semantic segmentation map\n        and keypoints.\n\n        Add the override feature that if 'flip' is already in results, use it\n        to do the augmentation.\n        \"\"\"\n        if 'flip' not in results:\n            cur_dir = self._choose_direction()\n        else:\n            # `flip_direction` works only when `flip` is True.\n            # For example, in `MultiScaleFlipAug3D`, `flip_direction` is\n            # 'horizontal' but `flip` is False.\n            if results['flip']:\n                assert 
'flip_direction' in results, (\n                    'flip and flip_direction must exist simultaneously')\n                cur_dir = results['flip_direction']\n            else:\n                cur_dir = None\n        if cur_dir is None:\n            results['flip'] = False\n            results['flip_direction'] = None\n        else:\n            results['flip'] = True\n            results['flip_direction'] = cur_dir\n            self._flip(results)\n\n    def _transform_lidar2cam(self, results: dict) -> None:\n        \"\"\"Update the lidar2cam extrinsics to compensate for BEV flips.\"\"\"\n        aug_matrix = np.eye(4)\n        if results.get('pcd_horizontal_flip', False):\n            aug_matrix[1, 1] *= -1\n        if results.get('pcd_vertical_flip', False):\n            aug_matrix[0, 0] *= -1\n        lidar2cam_list = []\n        for lidar2cam in results['lidar2cam']:\n            lidar2cam = np.array(lidar2cam)\n            lidar2cam = np.matmul(lidar2cam, aug_matrix)\n            lidar2cam_list.append(lidar2cam.tolist())\n        results['lidar2cam'] = lidar2cam_list\n\n    def __repr__(self) -> str:\n        \"\"\"str: Return a string that describes the module.\"\"\"\n        repr_str = self.__class__.__name__\n        repr_str += f'(sync_2d={self.sync_2d},'\n        repr_str += f' flip_ratio_bev_vertical={self.flip_ratio_bev_vertical})'\n        return repr_str\n\n\n@TRANSFORMS.register_module()\nclass GlobalRotScaleTrans(BaseTransform):\n    \"\"\"Apply global rotation, scaling and translation to a 3D scene.\n\n    Required Keys:\n\n    - points (np.float32)\n    - gt_bboxes_3d (np.float32)\n\n    Modified Keys:\n\n    - points (np.float32)\n    - gt_bboxes_3d (np.float32)\n\n    Added Keys:\n\n    - pcd_trans (np.float32)\n    - pcd_rotation (np.float32)\n    - pcd_rotation_angle (np.float32)\n    - pcd_scale_factor (np.float32)\n\n    Args:\n        rot_range (list[float]): Range of rotation angle.\n            Defaults to [-0.78539816, 0.78539816] (close to [-pi/4, pi/4]).\n        rot_dof (int): DoF of rotation noise. Defaults to 1.\n        scale_ratio_range (list[float]): Range of scale ratio.\n            Defaults to [0.95, 1.05].\n        translation_std (list[float]): The standard deviation of the\n            translation noise applied to a scene, sampled from a Gaussian\n            distribution with standard deviation ``translation_std``. 
Defaults to [0, 0, 0].\n        shift_height (bool): Whether to shift height\n            (the fourth dimension of indoor points) when scaling.\n            Defaults to False.\n        update_lidar2cam (bool): Whether to update the camera extrinsics\n            to compensate for the augmentation. Defaults to False.\n    \"\"\"\n\n    def __init__(self,\n                 rot_range: Union[List[float], int,\n                                  float] = [-0.78539816, 0.78539816],\n                 rot_dof: int = 1,\n                 scale_ratio_range: List[float] = [0.95, 1.05],\n                 translation_std: List[float] = [0, 0, 0],\n                 shift_height: bool = False,\n                 update_lidar2cam: bool = False) -> None:\n        seq_types = (list, tuple, np.ndarray)\n        if not isinstance(rot_range, seq_types):\n            assert isinstance(rot_range, (int, float)), \\\n                f'unsupported rot_range type {type(rot_range)}'\n            rot_range = [-rot_range, rot_range]\n        self.rot_range = rot_range\n        self.rot_dof = rot_dof\n        self.update_lidar2cam = update_lidar2cam\n\n        assert isinstance(scale_ratio_range, seq_types), \\\n            f'unsupported scale_ratio_range type {type(scale_ratio_range)}'\n\n        self.scale_ratio_range = scale_ratio_range\n\n        if not isinstance(translation_std, seq_types):\n            assert isinstance(translation_std, (int, float)), \\\n                f'unsupported translation_std type {type(translation_std)}'\n            translation_std = [\n                translation_std, translation_std, translation_std\n            ]\n        assert all([std >= 0 for std in translation_std]), \\\n            'translation_std should be non-negative'\n        self.translation_std = translation_std\n        self.shift_height = shift_height\n\n    def transform(self, input_dict: dict) -> dict:\n        \"\"\"Rotate, scale and translate bounding boxes and points.\n\n        Args:\n            input_dict (dict): Result dict from loading pipeline.\n\n        Returns:\n            dict: Results after scaling, 'points', 'pcd_rotation',\n            'pcd_scale_factor', 'pcd_trans' and `gt_bboxes_3d` are updated\n            in the result dict.\n        \"\"\"\n        if 'transformation_3d_flow' not in input_dict:\n            input_dict['transformation_3d_flow'] = []\n\n        self._rot_bbox_points(input_dict)\n\n        if 'pcd_scale_factor' not in input_dict:\n            self._random_scale(input_dict)\n        self._scale_bbox_points(input_dict)\n\n        self._trans_bbox_points(input_dict)\n\n        input_dict['transformation_3d_flow'].extend(['R', 'S', 'T'])\n        if self.update_lidar2cam:\n            self._transform_lidar2cam(input_dict)\n        return input_dict\n\n    def _trans_bbox_points(self, input_dict: dict) -> None:\n        \"\"\"Private function to translate bounding boxes and points.\n\n        Args:\n            input_dict (dict): Result dict from loading pipeline.\n\n        Returns:\n            dict: Results after translation, 'points', 'pcd_trans'\n            and `gt_bboxes_3d` are updated in the result dict.\n        \"\"\"\n        translation_std = np.array(self.translation_std, dtype=np.float32)\n        trans_factor = np.random.normal(scale=translation_std, size=3).T\n\n        if 'points' in input_dict:\n            input_dict['points'].translate(trans_factor)\n        input_dict['pcd_trans'] = trans_factor\n        if 'gt_bboxes_3d' in input_dict:\n            input_dict['gt_bboxes_3d'].translate(trans_factor)\n\n    def _rot_bbox_points(self, input_dict: dict) -> None:\n        \"\"\"Private 
function to rotate bounding boxes and points.\n\n        Args:\n            input_dict (dict): Result dict from loading pipeline.\n\n        Returns:\n            dict: Results after rotation, 'points', 'pcd_rotation'\n            and `gt_bboxes_3d` are updated in the result dict.\n        \"\"\"\n        rotation = self.rot_range\n        if self.rot_dof == 1:\n            noise_rotation = np.random.uniform(rotation[0], rotation[1])\n            noise_rotation *= -1\n        elif self.rot_dof > 1:\n            noise_rotation = np.array([\n                -np.random.uniform(rotation[0], rotation[1]),\n                -np.random.uniform(rotation[0], rotation[1]),\n                -np.random.uniform(rotation[0], rotation[1])\n            ])\n        else:\n            raise NotImplementedError\n        # TODO: delete this. The -1 is to align the rotation with\n        # the convention of version 0.17.\n        # if 'gt_bboxes_3d' in input_dict and \\\n        #         len(input_dict['gt_bboxes_3d'].tensor) != 0:\n        if 'gt_bboxes_3d' in input_dict:\n            # rotate points with bboxes\n            if 'points' in input_dict:\n                points, rot_mat_T = input_dict['gt_bboxes_3d'].rotate(\n                    noise_rotation, input_dict['points'])\n                input_dict['points'] = points\n            else:\n                rot_mat_T = input_dict['gt_bboxes_3d'].rotate(noise_rotation)\n        elif 'points' in input_dict:\n            # if no bbox in input_dict, only rotate points\n            rot_mat_T = input_dict['points'].rotate(noise_rotation)\n\n        input_dict['pcd_rotation'] = rot_mat_T\n        input_dict['pcd_rotation_angle'] = noise_rotation\n\n    def _scale_bbox_points(self, input_dict: dict) -> None:\n        \"\"\"Private function to scale bounding boxes and points.\n\n        Args:\n            input_dict (dict): Result dict from loading pipeline.\n\n        Returns:\n            dict: Results after scaling, 'points' and\n            `gt_bboxes_3d` are updated in the result dict.\n        \"\"\"\n        scale = input_dict['pcd_scale_factor']\n        if 'points' in input_dict:\n            points = input_dict['points']\n            points.scale(scale)\n            if self.shift_height:\n                assert 'height' in points.attribute_dims.keys(), \\\n                    'setting shift_height=True ' \\\n                    'but points have no height attribute'\n                points.tensor[:, points.attribute_dims['height']] *= scale\n            input_dict['points'] = points\n\n        if 'gt_bboxes_3d' in input_dict and \\\n                len(input_dict['gt_bboxes_3d'].tensor) != 0:\n            input_dict['gt_bboxes_3d'].scale(scale)\n\n    def _random_scale(self, input_dict: dict) -> None:\n        \"\"\"Private function to randomly set the scale factor.\n\n        Args:\n            input_dict (dict): Result dict from loading pipeline.\n\n        Returns:\n            dict: Results after scaling, 'pcd_scale_factor'\n            is updated in the result dict.\n        \"\"\"\n        scale_factor = np.random.uniform(self.scale_ratio_range[0],\n                                         self.scale_ratio_range[1])\n        input_dict['pcd_scale_factor'] = scale_factor\n\n    def _transform_lidar2cam(self, input_dict: dict) -> None:\n        aug_matrix = np.eye(4)\n\n        if 'pcd_rotation' in input_dict:\n            aug_matrix[:3, :3] = input_dict['pcd_rotation'].T.numpy(\n            ) * input_dict['pcd_scale_factor']\n        else:\n            aug_matrix[:3, 
:3] = np.eye(3) * input_dict['pcd_scale_factor']\n        aug_matrix[:3, -1] = input_dict['pcd_trans'].reshape(1, 3)\n        aug_matrix[-1, -1] = 1.0\n        aug_matrix = np.linalg.inv(aug_matrix)\n        lidar2cam_list = []\n        for lidar2cam in input_dict[\"depth2img\"][\"extrinsic\"]:\n        # for lidar2cam in input_dict['lidar2cam']:\n            lidar2cam = np.array(lidar2cam)\n            lidar2cam = np.matmul(lidar2cam, aug_matrix)\n            lidar2cam_list.append(lidar2cam.tolist())\n        input_dict[\"depth2img\"][\"extrinsic\"] = lidar2cam_list\n        # input_dict['lidar2cam'] = lidar2cam_list\n\n    def __repr__(self) -> str:\n        \"\"\"str: Return a string that describes the module.\"\"\"\n        repr_str = self.__class__.__name__\n        repr_str += f'(rot_range={self.rot_range},'\n        repr_str += f' scale_ratio_range={self.scale_ratio_range},'\n        repr_str += f' translation_std={self.translation_std},'\n        repr_str += f' shift_height={self.shift_height})'\n        return repr_str\n\n\n@TRANSFORMS.register_module()\nclass ResizeCropFlipImage(object):\n    \"\"\"Resize, crop, flip and rotate the input images and record the\n    overall 2D transform matrix.\"\"\"\n\n    def __init__(self, data_aug_conf, test_mode=False):\n        self.data_aug_conf = data_aug_conf\n        self.test_mode = test_mode\n\n    def get_augmentation(self):\n        if self.data_aug_conf is None:\n            return None\n        H, W = self.data_aug_conf[\"H\"], self.data_aug_conf[\"W\"]\n        fH, fW = self.data_aug_conf[\"final_dim\"]\n        if not self.test_mode:\n            resize = np.random.uniform(*self.data_aug_conf[\"resize_lim\"])\n            resize_dims = (int(W * resize), int(H * resize))\n            newW, newH = resize_dims\n            crop_h = (\n                int(\n                    (1 - np.random.uniform(*self.data_aug_conf[\"bot_pct_lim\"]))\n                    * newH\n                )\n                - fH\n            )\n            crop_w = int(np.random.uniform(0, max(0, newW - fW)))\n            crop = (crop_w, crop_h, crop_w + fW, crop_h + fH)\n            flip = False\n            if self.data_aug_conf[\"rand_flip\"] and np.random.choice([0, 1]):\n                flip = True\n            rotate = np.random.uniform(*self.data_aug_conf[\"rot_lim\"])\n            # rotate_3d = np.random.uniform(*self.data_aug_conf[\"rot3d_range\"])\n            rotate_3d = 0\n        else:\n            resize = max(fH / H, fW / W)\n            resize_dims = (int(W * resize), int(H * resize))\n            newW, newH = resize_dims\n            crop_h = (\n                int((1 - np.mean(self.data_aug_conf[\"bot_pct_lim\"])) * newH)\n                - fH\n            )\n            crop_w = int(max(0, newW - fW) / 2)\n            crop = (crop_w, crop_h, crop_w + fW, crop_h + fH)\n            flip = False\n            rotate = 0\n            rotate_3d = 0\n        aug_config = {\n            \"resize\": resize,\n            \"resize_dims\": resize_dims,\n            \"crop\": crop,\n            \"flip\": flip,\n            \"rotate\": rotate,\n            \"rotate_3d\": rotate_3d,\n        }\n        return aug_config\n\n    def __call__(self, results):\n        img = results[\"img\"]\n        aug_configs = self.get_augmentation()\n        results[\"img\"], extend_matrix = self._transform(img, aug_configs)\n        results[\"trans_mat\"] = extend_matrix\n        if \"depth_img\" in results:\n            results[\"depth_img\"], _ = self._transform(results[\"depth_img\"], aug_configs)\n  
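      # the depth map is warped with the same affine matrix as the RGB image\n  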
      return results\n\n    def _transform(self, img, aug_configs):\n        H, W = img.shape[:2]\n        resize = aug_configs.get(\"resize\", 1)\n        resize_dims = (int(W * resize), int(H * resize))\n        crop = aug_configs.get(\"crop\", [0, 0, *resize_dims])\n        flip = aug_configs.get(\"flip\", False)\n        rotate = aug_configs.get(\"rotate\", 0)\n\n        transform_matrix = np.eye(3)\n        transform_matrix[:2, :2] *= resize\n        transform_matrix[:2, 2] -= np.array(crop[:2])\n        if flip:\n            flip_matrix = np.array(\n                [[-1, 0, crop[2] - crop[0]], [0, 1, 0], [0, 0, 1]]\n            )\n            transform_matrix = flip_matrix @ transform_matrix\n        rotate = rotate / 180 * np.pi\n        rot_matrix = np.array(\n            [\n                [np.cos(rotate), np.sin(rotate), 0],\n                [-np.sin(rotate), np.cos(rotate), 0],\n                [0, 0, 1],\n            ]\n        )\n        rot_center = np.array([crop[2] - crop[0], crop[3] - crop[1]]) / 2\n        rot_matrix[:2, 2] = -rot_matrix[:2, :2] @ rot_center + rot_center\n        transform_matrix = rot_matrix @ transform_matrix\n        extend_matrix = np.eye(4)\n        extend_matrix[:3, :3] = transform_matrix\n\n        img = cv2.warpAffine(img, extend_matrix[:2, :3], img.shape[:2][::-1])\n        return img, extend_matrix\n"
  },
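  {
    "path": "examples/resize_crop_flip_matrix_demo.py",
    "content": "# Hypothetical usage sketch, not part of the original repo: it re-derives the\n# 2D warp matrix composed in `ResizeCropFlipImage._transform` with plain numpy\n# to show where a source-image pixel lands after resize + crop + flip (the\n# rotation step is omitted). The file path and all values below are assumptions.\nimport numpy as np\n\nresize = 0.5\ncrop = (10, 20, 330, 260)  # (x1, y1, x2, y2) in the resized image\nflip = True\n\n# scale first, then shift by the crop offset (same order as `_transform`)\nmatrix = np.eye(3)\nmatrix[:2, :2] *= resize\nmatrix[:2, 2] -= np.array(crop[:2])\nif flip:\n    # mirror about the vertical axis of the cropped image\n    width = crop[2] - crop[0]\n    matrix = np.array([[-1, 0, width], [0, 1, 0], [0, 0, 1]]) @ matrix\n\npixel = np.array([100.0, 80.0, 1.0])  # homogeneous pixel in the source image\nprint(matrix @ pixel)  # -> [280. 20. 1.], its location in the augmented image\n"
  },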
  {
    "path": "bip3d/datasets/transforms/formatting.py",
    "content": "from typing import List, Sequence, Union\n\nimport mmengine\nimport numpy as np\nimport torch\nfrom mmcv.transforms import BaseTransform\nfrom mmengine.structures import InstanceData, PixelData\n\nfrom bip3d.registry import TRANSFORMS\nfrom bip3d.structures.bbox_3d import BaseInstance3DBoxes\nfrom bip3d.structures.points import BasePoints\nfrom bip3d.utils.typing_config import Det3DDataElement, PointData\n\n\ndef to_tensor(\n    data: Union[torch.Tensor, np.ndarray, Sequence, int,\n                float]) -> torch.Tensor:\n    \"\"\"Convert objects of various python types to :obj:`torch.Tensor`.\n\n    Supported types are: :class:`numpy.ndarray`, :class:`torch.Tensor`,\n    :class:`Sequence`, :class:`int` and :class:`float`.\n\n    Args:\n        data (torch.Tensor | numpy.ndarray | Sequence | int | float): Data to\n            be converted.\n\n    Returns:\n        torch.Tensor: the converted data.\n    \"\"\"\n\n    if isinstance(data, torch.Tensor):\n        return data\n    elif isinstance(data, np.ndarray):\n        if data.dtype is np.dtype('float64'):\n            data = data.astype(np.float32)\n        return torch.from_numpy(data)\n    elif isinstance(data, Sequence) and not mmengine.is_str(data):\n        return torch.tensor(data)\n    elif isinstance(data, int):\n        return torch.LongTensor([data])\n    elif isinstance(data, float):\n        return torch.FloatTensor([data])\n    else:\n        raise TypeError(f'type {type(data)} cannot be converted to tensor.')\n\n\n@TRANSFORMS.register_module()\nclass Pack3DDetInputs(BaseTransform):\n    INPUTS_KEYS = ['points', 'img', \"depth_img\"]\n    # to be compatible with depths in bevdepth\n    INSTANCEDATA_3D_KEYS = [\n        'gt_bboxes_3d', 'gt_labels_3d', 'attr_labels', 'depths', 'centers_2d'\n    ]\n    INSTANCEDATA_2D_KEYS = [\n        'gt_bboxes',\n        'gt_bboxes_labels',\n    ]\n\n    SEG_KEYS = [\n        'gt_seg_map', 'pts_instance_mask', 'pts_semantic_mask',\n        'gt_semantic_seg'\n    ]\n\n    def __init__(\n        self,\n        keys: dict,\n        meta_keys: dict = (\n            'img_path', 'ori_shape', 'img_shape', 'lidar2img', 'depth2img',\n            'cam2img', 'pad_shape', 'depth_map_path', 'scale_factor', 'flip',\n            'pcd_horizontal_flip', 'pcd_vertical_flip', 'box_mode_3d',\n            'box_type_3d', 'img_norm_cfg', 'num_pts_feats', 'pcd_trans',\n            'sample_idx', 'pcd_scale_factor', 'pcd_rotation',\n            'pcd_rotation_angle', 'lidar_path', 'transformation_3d_flow',\n            'trans_mat', 'affine_aug', 'sweep_img_metas', 'ori_cam2img',\n            'cam2global', 'crop_offset', 'img_crop_offset', 'resize_img_shape',\n            'lidar2cam', 'ori_lidar2img', 'num_ref_frames', 'num_views',\n            'ego2global', 'fov_ori2aug', 'ego2cam', 'axis_align_matrix',\n            'text', 'tokens_positive', 'scan_id', \"distractor_bboxes\", \"anchor_bboxes\",\n            \"trans_mat\", \"ignore_mask\",\n        )):\n        self.keys = keys\n        self.meta_keys = meta_keys\n\n    def _remove_prefix(self, key: str) -> str:\n        if key.startswith('gt_'):\n            key = key[3:]\n        return key\n\n    def transform(self, results: Union[dict,\n                                       List[dict]]):\n        \"\"\"Method to pack the input data. 
When the value in this dict is a\n        list, it usually comes from test-time augmentation.\n\n        Args:\n            results (dict | list[dict]): Result dict from the data pipeline.\n\n        Returns:\n            dict | List[dict]:\n\n            - 'inputs' (dict): The forward data of models. It usually contains\n              following keys:\n\n                - points\n                - img\n\n            - 'data_samples' (:obj:`Det3DDataSample`): The annotation info of\n              the sample.\n        \"\"\"\n        # test-time augmentation\n        if isinstance(results, list):\n            if len(results) == 1:\n                # simple test\n                return self.pack_single_results(results[0])\n            pack_results = []\n            for single_result in results:\n                pack_results.append(self.pack_single_results(single_result))\n            return pack_results\n        # normal training and simple testing\n        elif isinstance(results, dict):\n            return self.pack_single_results(results)\n        else:\n            raise NotImplementedError\n\n    def pack_single_results(self, results: dict) -> dict:\n        \"\"\"Method to pack a single input sample. When the value in this dict\n        is a list, it usually comes from test-time augmentation.\n\n        Args:\n            results (dict): Result dict from the data pipeline.\n\n        Returns:\n            dict: A dict contains\n\n            - 'inputs' (dict): The forward data of models. It usually contains\n              following keys:\n\n                - points\n                - img\n\n            - 'data_samples' (:obj:`Det3DDataSample`): The annotation info\n              of the sample.\n        \"\"\"\n\n        if 'points' in results:\n            if isinstance(results['points'], BasePoints):\n                results['points'] = results['points'].tensor\n            # multi-sweep points\n            elif isinstance(results['points'], list):\n                if isinstance(results['points'][0], BasePoints):\n                    for idx in range(len(results['points'])):\n                        results['points'][idx] = results['points'][idx].tensor\n\n        for key in [\"img\", \"depth_img\"]:\n            if key not in results:\n                continue\n            if isinstance(results[key], list):\n                # process multiple imgs in single frame\n                imgs = np.stack(results[key], axis=0)\n                if len(imgs.shape) <= 3:\n                    imgs = imgs[..., None]\n                if imgs.flags.c_contiguous:\n                    imgs = to_tensor(imgs).permute(0, 3, 1, 2).contiguous()\n                else:\n                    imgs = to_tensor(\n                        np.ascontiguousarray(imgs.transpose(0, 3, 1, 2)))\n                results[key] = imgs\n            else:\n                img = results[key]\n                if len(img.shape) < 3:\n                    img = np.expand_dims(img, -1)\n                # To improve the computational speed by 3-5 times, apply:\n                # `torch.permute()` rather than `np.transpose()`.\n                # Refer to https://github.com/open-mmlab/mmdetection/pull/9533\n                # for more details\n                if img.flags.c_contiguous:\n                    img = to_tensor(img).permute(2, 0, 1).contiguous()\n                else:\n                    img = to_tensor(\n                        np.ascontiguousarray(img.transpose(2, 0, 1)))\n                results[key] = img\n\n        for key in [\n                
'proposals', 'gt_bboxes', 'gt_bboxes_ignore', 'gt_labels',\n                'gt_bboxes_labels', 'attr_labels', 'pts_instance_mask',\n                'pts_semantic_mask', 'centers_2d', 'depths', 'gt_labels_3d'\n        ]:\n            if key not in results:\n                continue\n            if isinstance(results[key], list):\n                results[key] = [to_tensor(res) for res in results[key]]\n            else:\n                results[key] = to_tensor(results[key])\n        if 'gt_bboxes_3d' in results:\n            # multi-sweep version\n            if isinstance(results['gt_bboxes_3d'], list):\n                if not isinstance(results['gt_bboxes_3d'][0],\n                                  BaseInstance3DBoxes):\n                    for idx in range(len(results['gt_bboxes_3d'])):\n                        results['gt_bboxes_3d'][idx] = to_tensor(\n                            results['gt_bboxes_3d'][idx])\n            elif not isinstance(results['gt_bboxes_3d'], BaseInstance3DBoxes):\n                results['gt_bboxes_3d'] = to_tensor(results['gt_bboxes_3d'])\n\n        if 'gt_semantic_seg' in results:\n            results['gt_semantic_seg'] = to_tensor(\n                results['gt_semantic_seg'][None])\n        if 'gt_seg_map' in results:\n            results['gt_seg_map'] = results['gt_seg_map'][None, ...]\n        if 'depth_map' in results:\n            results['depth_map'] = to_tensor(results['depth_map'])\n\n        data_sample = Det3DDataElement()\n        gt_instances_3d = InstanceData()\n        gt_instances = InstanceData()\n        gt_pts_seg = PointData()\n        gt_depth_map = PixelData()\n\n        data_metas = {}\n        for key in self.meta_keys:\n            if key in results:\n                data_metas[key] = results[key]\n            # TODO: unify ScanNet multi-view info with nuScenes and Waymo\n            elif 'images' in results and isinstance(results['images'], dict):\n                if len(results['images'].keys()) == 1:\n                    cam_type = list(results['images'].keys())[0]\n                    # single-view image\n                    if key in results['images'][cam_type]:\n                        data_metas[key] = results['images'][cam_type][key]\n                else:\n                    # multi-view image\n                    img_metas = []\n                    cam_types = list(results['images'].keys())\n                    for cam_type in cam_types:\n                        if key in results['images'][cam_type]:\n                            img_metas.append(results['images'][cam_type][key])\n                    if len(img_metas) > 0:\n                        data_metas[key] = img_metas\n            elif 'lidar_points' in results and isinstance(\n                    results['lidar_points'], dict):\n                if key in results['lidar_points']:\n                    data_metas[key] = results['lidar_points'][key]\n        data_sample.set_metainfo(data_metas)\n\n        inputs = {}\n        for key in self.keys:\n            if key in results:\n                if key in self.INPUTS_KEYS:\n                    inputs[key] = results[key]\n                elif key in self.INSTANCEDATA_3D_KEYS:\n                    gt_instances_3d[self._remove_prefix(key)] = results[key]\n                elif key in self.INSTANCEDATA_2D_KEYS:\n                    if key == 'gt_bboxes_labels':\n                        gt_instances['labels'] = results[key]\n                    else:\n                        gt_instances[self._remove_prefix(key)] = 
results[key]\n                elif key in self.SEG_KEYS:\n                    gt_pts_seg[self._remove_prefix(key)] = results[key]\n                elif key == 'depth_map':\n                    gt_depth_map.set_data(dict(data=results[key]))\n                elif key == 'gt_occupancy':\n                    data_sample.gt_occupancy = to_tensor(\n                        results['gt_occupancy'])\n                    if isinstance(results['gt_occupancy_masks'], list):\n                        data_sample.gt_occupancy_masks = [\n                            to_tensor(mask)\n                            for mask in results['gt_occupancy_masks']\n                        ]\n                    else:\n                        data_sample.gt_occupancy_masks = to_tensor(\n                            results['gt_occupancy_masks'])\n                else:\n                    raise NotImplementedError(f'Please modify '\n                                              f'`Pack3DDetInputs` '\n                                              f'to put {key} to the '\n                                              f'corresponding field')\n\n        data_sample.gt_instances_3d = gt_instances_3d\n        data_sample.gt_instances = gt_instances\n        data_sample.gt_pts_seg = gt_pts_seg\n        data_sample.gt_depth_map = gt_depth_map\n\n        if 'eval_ann_info' in results:\n            data_sample.eval_ann_info = results['eval_ann_info']\n        else:\n            data_sample.eval_ann_info = None\n\n        packed_results = dict()\n        packed_results['data_samples'] = data_sample\n        packed_results['inputs'] = inputs\n        return packed_results\n\n    def __repr__(self) -> str:\n        \"\"\"str: Return a string that describes the module.\"\"\"\n        repr_str = self.__class__.__name__\n        repr_str += f'(keys={self.keys}, '\n        repr_str += f'meta_keys={self.meta_keys})'\n        return repr_str\n"
  },
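  {
    "path": "examples/to_tensor_demo.py",
    "content": "# Hypothetical usage sketch, not part of the original repo: it exercises the\n# dtype conventions of `to_tensor` from the formatting module above. Running\n# it only assumes the bip3d package (and its torch/numpy deps) is importable.\nimport numpy as np\n\nfrom bip3d.datasets.transforms.formatting import to_tensor\n\n# float64 arrays are downcast to float32 before conversion\nprint(to_tensor(np.zeros(3, dtype=np.float64)).dtype)  # torch.float32\n# python ints become 1-element LongTensors, floats 1-element FloatTensors\nprint(to_tensor(5))    # tensor([5])\nprint(to_tensor(0.5))  # tensor([0.5000])\n# sequences go through torch.tensor\nprint(to_tensor([[1, 2], [3, 4]]).shape)  # torch.Size([2, 2])\n"
  },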
  {
    "path": "bip3d/datasets/transforms/loading.py",
    "content": "from typing import Optional\n\nimport mmcv\nimport mmengine\nimport numpy as np\nfrom mmcv.transforms import BaseTransform\nfrom mmdet.datasets.transforms import LoadAnnotations\n\nfrom bip3d.registry import TRANSFORMS\n\n\n@TRANSFORMS.register_module()\nclass LoadDepthFromFile(BaseTransform):\n    \"\"\"Load a depth image from file.\n\n    Required Keys:\n\n    - depth_img_path\n\n    Modified Keys:\n\n    - depth_img\n    - depth_img_shape\n\n    Args:\n        imdecode_backend (str): The image decoding backend type. The backend\n            argument for :func:`mmcv.imfrombytes`.\n            See :func:`mmcv.imfrombytes` for details.\n            Defaults to 'cv2'.\n        ignore_empty (bool): Whether to allow loading empty image or file path\n            not existent. Defaults to False.\n        backend_args (dict, optional): Instantiates the corresponding file\n            backend. It may contain `backend` key to specify the file\n            backend. If it contains, the file backend corresponding to this\n            value will be used and initialized with the remaining values,\n            otherwise the corresponding file backend will be selected\n            based on the prefix of the file path. Defaults to None.\n            New in version 2.0.0rc4.\n    \"\"\"\n\n    def __init__(self,\n                 imdecode_backend: str = 'cv2',\n                 ignore_empty: bool = False,\n                 *,\n                 backend_args: Optional[dict] = None):\n        self.ignore_empty = ignore_empty\n        self.imdecode_backend = imdecode_backend\n\n        self.backend_args = None\n        if backend_args is not None:\n            self.backend_args = backend_args.copy()\n\n    def transform(self, results: dict):\n        \"\"\"Functions to load image.\n\n        Args:\n            results (dict): Result dict from\n                :class:`mmengine.dataset.BaseDataset`.\n\n        Returns:\n            dict: The dict contains loaded image and meta information.\n        \"\"\"\n\n        filename = results['depth_img_path']\n        depth_shift = results['depth_shift']\n\n        try:\n            depth_img_bytes = mmengine.fileio.get(\n                filename, backend_args=self.backend_args)\n            depth_img = mmcv.imfrombytes(depth_img_bytes,\n                                         flag='unchanged',\n                                         backend=self.imdecode_backend).astype(\n                                             np.float32) / depth_shift\n        except Exception as e:\n            if self.ignore_empty:\n                return None\n            else:\n                raise e\n        results['depth_img'] = depth_img\n        return results\n\n    def __repr__(self):\n        repr_str = (f'{self.__class__.__name__}('\n                    f'ignore_empty={self.ignore_empty}, '\n                    f\"imdecode_backend='{self.imdecode_backend}', \")\n\n        if self.backend_args is not None:\n            repr_str += f'backend_args={self.backend_args})'\n        else:\n            repr_str += f'backend_args={self.backend_args})'\n\n        return repr_str\n\n\n# TODO : refine\n@TRANSFORMS.register_module()\nclass LoadAnnotations3D(LoadAnnotations):\n    \"\"\"Load Annotations3D.\n\n    Load instance mask and semantic mask of points and\n    encapsulate the items into related fields.\n\n    Required Keys:\n\n    - ann_info (dict)\n\n        - gt_bboxes_3d (:obj:`LiDARInstance3DBoxes` |\n          :obj:`DepthInstance3DBoxes` | 
:obj:`CameraInstance3DBoxes`):\n          3D ground truth bboxes. Only when `with_bbox_3d` is True\n        - gt_labels_3d (np.int64): Labels of ground truths.\n          Only when `with_label_3d` is True.\n        - gt_bboxes (np.float32): 2D ground truth bboxes.\n          Only when `with_bbox` is True.\n        - gt_labels (np.ndarray): Labels of ground truths.\n          Only when `with_label` is True.\n        - depths (np.ndarray): Only when\n          `with_bbox_depth` is True.\n        - centers_2d (np.ndarray): Only when\n          `with_bbox_depth` is True.\n        - attr_labels (np.ndarray): Attribute labels of instances.\n          Only when `with_attr_label` is True.\n\n    - pts_instance_mask_path (str): Path of instance mask file.\n      Only when `with_mask_3d` is True.\n    - pts_semantic_mask_path (str): Path of semantic mask file.\n      Only when `with_seg_3d` is True.\n    - pts_panoptic_mask_path (str): Path of panoptic mask file.\n      Only when both `with_panoptic_3d` is True.\n\n    Added Keys:\n\n    - gt_bboxes_3d (:obj:`LiDARInstance3DBoxes` |\n      :obj:`DepthInstance3DBoxes` | :obj:`CameraInstance3DBoxes`):\n      3D ground truth bboxes. Only when `with_bbox_3d` is True\n    - gt_labels_3d (np.int64): Labels of ground truths.\n      Only when `with_label_3d` is True.\n    - gt_bboxes (np.float32): 2D ground truth bboxes.\n      Only when `with_bbox` is True.\n    - gt_labels (np.int64): Labels of ground truths.\n      Only when `with_label` is True.\n    - depths (np.float32): Only when\n      `with_bbox_depth` is True.\n    - centers_2d (np.ndarray): Only when\n      `with_bbox_depth` is True.\n    - attr_labels (np.int64): Attribute labels of instances.\n      Only when `with_attr_label` is True.\n    - pts_instance_mask (np.int64): Instance mask of each point.\n      Only when `with_mask_3d` is True.\n    - pts_semantic_mask (np.int64): Semantic mask of each point.\n      Only when `with_seg_3d` is True.\n\n    Args:\n        with_bbox_3d (bool): Whether to load 3D boxes. Defaults to True.\n        with_label_3d (bool): Whether to load 3D labels. Defaults to True.\n        with_attr_label (bool): Whether to load attribute label.\n            Defaults to False.\n        with_mask_3d (bool): Whether to load 3D instance masks for points.\n            Defaults to False.\n        with_seg_3d (bool): Whether to load 3D semantic masks for points.\n            Defaults to False.\n        with_bbox (bool): Whether to load 2D boxes. Defaults to False.\n        with_label (bool): Whether to load 2D labels. Defaults to False.\n        with_mask (bool): Whether to load 2D instance masks. Defaults to False.\n        with_seg (bool): Whether to load 2D semantic masks. Defaults to False.\n        with_bbox_depth (bool): Whether to load 2.5D boxes. Defaults to False.\n        with_panoptic_3d (bool): Whether to load 3D panoptic masks for points.\n            Defaults to False.\n        poly2mask (bool): Whether to convert polygon annotations to bitmasks.\n            Defaults to True.\n        seg_3d_dtype (str): String of dtype of 3D semantic masks.\n            Defaults to 'np.int64'.\n        seg_offset (int): The offset to split semantic and instance labels from\n            panoptic labels. Defaults to None.\n        dataset_type (str): Type of dataset used for splitting semantic and\n            instance labels. Defaults to None.\n        backend_args (dict, optional): Arguments to instantiate the\n            corresponding backend. 
Defaults to None.\n        with_depth_map (bool): Whether to load a depth map generated from\n            projected points. Defaults to False.\n        with_visible_instance_masks (bool): Whether to load the per-view\n            visible instance masks. Defaults to False.\n        with_occupancy (bool): Whether to load occupancy annotations.\n            Defaults to False.\n        with_visible_occupancy_masks (bool): Whether to load the per-view\n            visible occupancy masks. Defaults to False.\n    \"\"\"\n\n    def __init__(self,\n                 with_bbox_3d: bool = True,\n                 with_label_3d: bool = True,\n                 with_depth_map: bool = False,\n                 with_attr_label: bool = False,\n                 with_mask_3d: bool = False,\n                 with_seg_3d: bool = False,\n                 with_bbox: bool = False,\n                 with_label: bool = False,\n                 with_mask: bool = False,\n                 with_seg: bool = False,\n                 with_bbox_depth: bool = False,\n                 with_panoptic_3d: bool = False,\n                 with_visible_instance_masks: bool = False,\n                 with_occupancy: bool = False,\n                 with_visible_occupancy_masks: bool = False,\n                 poly2mask: bool = True,\n                 seg_3d_dtype: str = 'np.int64',\n                 seg_offset: Optional[int] = None,\n                 dataset_type: Optional[str] = None,\n                 backend_args: Optional[dict] = None):\n        super().__init__(with_bbox=with_bbox,\n                         with_label=with_label,\n                         with_mask=with_mask,\n                         with_seg=with_seg,\n                         poly2mask=poly2mask,\n                         backend_args=backend_args)\n        self.with_bbox_3d = with_bbox_3d\n        self.with_bbox_depth = with_bbox_depth\n        self.with_label_3d = with_label_3d\n        self.with_depth_map = with_depth_map\n        self.with_attr_label = with_attr_label\n        self.with_mask_3d = with_mask_3d\n        self.with_seg_3d = with_seg_3d\n        self.with_panoptic_3d = with_panoptic_3d\n        self.with_visible_instance_masks = with_visible_instance_masks\n        self.with_occupancy = with_occupancy\n        self.with_visible_occupancy_masks = with_visible_occupancy_masks\n        self.seg_3d_dtype = eval(seg_3d_dtype)\n        self.seg_offset = seg_offset\n        self.dataset_type = dataset_type\n\n    def _load_bboxes_3d(self, results: dict):\n        \"\"\"Private function to move the 3D bounding box annotation from\n        `ann_info` field to the root of `results`.\n\n        Args:\n            results (dict): Result dict from :obj:`mmdet3d.CustomDataset`.\n\n        Returns:\n            dict: The dict containing loaded 3D bounding box annotations.\n        \"\"\"\n\n        results['gt_bboxes_3d'] = results['ann_info']['gt_bboxes_3d']\n        return results\n\n    def _load_bboxes_depth(self, results: dict):\n        \"\"\"Private function to load 2.5D bounding box annotations.\n\n        Args:\n            results (dict): Result dict from :obj:`mmdet3d.CustomDataset`.\n\n        Returns:\n            dict: The dict containing loaded 2.5D bounding box annotations.\n        \"\"\"\n\n        results['depths'] = results['ann_info']['depths']\n        results['centers_2d'] = results['ann_info']['centers_2d']\n        return results\n\n    def _load_labels_3d(self, results: dict):\n        \"\"\"Private function to load label annotations.\n\n        Args:\n            results (dict): Result dict from :obj:`mmdet3d.CustomDataset`.\n\n        Returns:\n            dict: The dict containing loaded label annotations.\n        \"\"\"\n\n        results['gt_labels_3d'] = results['ann_info']['gt_labels_3d']\n        return results\n\n    def _load_attr_labels(self, results: dict):\n        \"\"\"Private function to load label annotations.\n\n        Args:\n            results (dict): Result dict from :obj:`mmdet3d.CustomDataset`.\n\n        
Returns:\n            dict: The dict containing loaded label annotations.\n        \"\"\"\n        results['attr_labels'] = results['ann_info']['attr_labels']\n        return results\n\n    def _load_depth_map(self, results: dict):\n        \"\"\"Private function to load a sparse depth map by splatting saved\n        (u, v, depth) points onto the image plane.\"\"\"\n        img_filename = results['img_path']\n        pts_filename = img_filename.replace('samples', 'depth_points') + '.bin'\n        results['depth_map_path'] = pts_filename\n        if getattr(self, 'file_client', None) is None:\n            self.file_client = mmengine.FileClient(\n                **(self.backend_args or {}))\n        try:\n            pts_bytes = self.file_client.get(pts_filename)\n            points = np.frombuffer(pts_bytes, dtype=np.float32)\n        except ConnectionError:\n            mmengine.check_file_exist(pts_filename)\n            if pts_filename.endswith('.npy'):\n                points = np.load(pts_filename)\n            else:\n                points = np.fromfile(pts_filename, dtype=np.float32)\n        pts_img = points.reshape(-1, 3)\n        img_shape = results['ori_shape']\n        depth_img = np.zeros(img_shape, dtype=np.float32)\n        iy = np.round(pts_img[:, 1]).astype(np.int64)\n        ix = np.round(pts_img[:, 0]).astype(np.int64)\n        depth_img[iy, ix] = pts_img[:, 2]\n        results['depth_map'] = depth_img\n\n        return results\n\n    def _load_masks_3d(self, results: dict):\n        \"\"\"Private function to load 3D mask annotations.\n\n        Args:\n            results (dict): Result dict from :obj:`mmdet3d.CustomDataset`.\n\n        Returns:\n            dict: The dict containing loaded 3D mask annotations.\n        \"\"\"\n        pts_instance_mask_path = results['pts_instance_mask_path']\n\n        try:\n            mask_bytes = mmengine.fileio.get(pts_instance_mask_path,\n                                             backend_args=self.backend_args)\n            pts_instance_mask = np.frombuffer(mask_bytes, dtype=np.int64)\n        except ConnectionError:\n            mmengine.check_file_exist(pts_instance_mask_path)\n            pts_instance_mask = np.fromfile(pts_instance_mask_path,\n                                            dtype=np.int64)\n\n        results['pts_instance_mask'] = pts_instance_mask\n        # 'eval_ann_info' will be passed to evaluator\n        if 'eval_ann_info' in results:\n            results['eval_ann_info']['pts_instance_mask'] = pts_instance_mask\n        return results\n\n    def _load_semantic_seg_3d(self, results: dict):\n        \"\"\"Private function to load 3D semantic segmentation annotations.\n\n        Args:\n            results (dict): Result dict from :obj:`mmdet3d.CustomDataset`.\n\n        Returns:\n            dict: The dict containing the semantic segmentation annotations.\n        \"\"\"\n        pts_semantic_mask_path = results['pts_semantic_mask_path']\n\n        try:\n            mask_bytes = mmengine.fileio.get(pts_semantic_mask_path,\n                                             backend_args=self.backend_args)\n            # add .copy() to fix read-only bug\n            pts_semantic_mask = np.frombuffer(mask_bytes,\n                                              dtype=self.seg_3d_dtype).copy()\n        except ConnectionError:\n            mmengine.check_file_exist(pts_semantic_mask_path)\n            pts_semantic_mask = np.fromfile(pts_semantic_mask_path,\n                                            dtype=np.int64)\n\n        if self.dataset_type == 'semantickitti':\n            pts_semantic_mask = pts_semantic_mask.astype(np.int64)\n            pts_semantic_mask = pts_semantic_mask % 
self.seg_offset\n        # nuScenes loads semantic and panoptic labels from different files.\n\n        results['pts_semantic_mask'] = pts_semantic_mask\n\n        # 'eval_ann_info' will be passed to evaluator\n        if 'eval_ann_info' in results:\n            results['eval_ann_info']['pts_semantic_mask'] = pts_semantic_mask\n        return results\n\n    def _load_panoptic_3d(self, results: dict):\n        \"\"\"Private function to load 3D panoptic segmentation annotations.\n\n        Args:\n            results (dict): Result dict from :obj:`mmdet3d.CustomDataset`.\n\n        Returns:\n            dict: The dict containing the panoptic segmentation annotations.\n        \"\"\"\n        pts_panoptic_mask_path = results['pts_panoptic_mask_path']\n\n        try:\n            mask_bytes = mmengine.fileio.get(pts_panoptic_mask_path,\n                                             backend_args=self.backend_args)\n            # add .copy() to fix read-only bug\n            pts_panoptic_mask = np.frombuffer(mask_bytes,\n                                              dtype=self.seg_3d_dtype).copy()\n        except ConnectionError:\n            mmengine.check_file_exist(pts_panoptic_mask_path)\n            pts_panoptic_mask = np.fromfile(pts_panoptic_mask_path,\n                                            dtype=np.int64)\n\n        if self.dataset_type == 'semantickitti':\n            pts_semantic_mask = pts_panoptic_mask.astype(np.int64)\n            pts_semantic_mask = pts_semantic_mask % self.seg_offset\n        elif self.dataset_type == 'nuscenes':\n            pts_semantic_mask = pts_panoptic_mask // self.seg_offset\n\n        results['pts_semantic_mask'] = pts_semantic_mask\n\n        # We can directly take panoptic labels as instance ids.\n        pts_instance_mask = pts_panoptic_mask.astype(np.int64)\n        results['pts_instance_mask'] = pts_instance_mask\n\n        # 'eval_ann_info' will be passed to evaluator\n        if 'eval_ann_info' in results:\n            results['eval_ann_info']['pts_semantic_mask'] = pts_semantic_mask\n            results['eval_ann_info']['pts_instance_mask'] = pts_instance_mask\n        return results\n\n    def _load_visible_instance_masks(self, results: dict):\n        \"\"\"Private function to move the visible instance masks from the\n        `ann_info` field to the root of `results`.\n\n        Args:\n            results (dict): Result dict from :obj:`mmdet3d.CustomDataset`.\n\n        Returns:\n            dict: The dict containing the loaded visible instance masks.\n        \"\"\"\n\n        results['visible_instance_masks'] = results['ann_info'][\n            'visible_instance_masks']\n        return results\n\n    def _load_occupancy(self, results: dict):\n        \"\"\"Private function to move the occupancy annotation from the\n        `ann_info` field to the root of `results`.\n\n        Args:\n            results (dict): Result dict from :obj:`mmdet3d.CustomDataset`.\n\n        Returns:\n            dict: The dict containing the loaded occupancy annotation.\n        \"\"\"\n\n        results['gt_occupancy'] = results['ann_info']['gt_occupancy']\n        return results\n\n    def _load_visible_occupancy_masks(self, results: dict):\n        \"\"\"Private function to move the visible occupancy masks from the\n        `ann_info` field to the root of `results`.\n\n        Args:\n            results (dict): Result dict from :obj:`mmdet3d.CustomDataset`.\n\n        Returns:\n            dict: The dict containing the loaded visible occupancy masks.\n     
   \"\"\"\n\n        results['visible_occupancy_masks'] = results['ann_info'][\n            'visible_occupancy_masks']\n        return results\n\n    def _load_bboxes(self, results: dict):\n        \"\"\"Private function to load bounding box annotations.\n\n        The only difference is it remove the proceess for\n        `ignore_flag`\n\n        Args:\n            results (dict): Result dict from :obj:`mmcv.BaseDataset`.\n\n        Returns:\n            dict: The dict contains loaded bounding box annotations.\n        \"\"\"\n\n        results['gt_bboxes'] = results['ann_info']['gt_bboxes']\n\n    def _load_labels(self, results: dict):\n        \"\"\"Private function to load label annotations.\n\n        Args:\n            results (dict): Result dict from :obj :obj:`mmcv.BaseDataset`.\n\n        Returns:\n            dict: The dict contains loaded label annotations.\n        \"\"\"\n        results['gt_bboxes_labels'] = results['ann_info']['gt_bboxes_labels']\n\n    def transform(self, results: dict):\n        \"\"\"Function to load multiple types annotations.\n\n        Args:\n            results (dict): Result dict from :obj:`mmdet3d.CustomDataset`.\n\n        Returns:\n            dict: The dict containing loaded 3D bounding box, label, mask and\n            semantic segmentation annotations.\n        \"\"\"\n        results = super().transform(results)\n        if self.with_bbox_3d:\n            results = self._load_bboxes_3d(results)\n        if self.with_bbox_depth:\n            results = self._load_bboxes_depth(results)\n        if self.with_label_3d:\n            results = self._load_labels_3d(results)\n        if self.with_depth_map:\n            results = self._load_depth_map(results)\n        if self.with_attr_label:\n            results = self._load_attr_labels(results)\n        if self.with_panoptic_3d:\n            results = self._load_panoptic_3d(results)\n        if self.with_mask_3d:\n            results = self._load_masks_3d(results)\n        if self.with_seg_3d:\n            results = self._load_semantic_seg_3d(results)\n        if self.with_visible_instance_masks:\n            results = self._load_visible_instance_masks(results)\n        if self.with_occupancy:\n            results = self._load_occupancy(results)\n        if self.with_visible_occupancy_masks:\n            results = self._load_visible_occupancy_masks(results)\n        return results\n\n    def __repr__(self):\n        \"\"\"str: Return a string that describes the module.\"\"\"\n        indent_str = '    '\n        repr_str = self.__class__.__name__ + '(\\n'\n        repr_str += f'{indent_str}with_bbox_3d={self.with_bbox_3d}, '\n        repr_str += f'{indent_str}with_label_3d={self.with_label_3d}, '\n        repr_str += f'{indent_str}with_attr_label={self.with_attr_label}, '\n        repr_str += f'{indent_str}with_mask_3d={self.with_mask_3d}, '\n        repr_str += f'{indent_str}with_seg_3d={self.with_seg_3d}, '\n        repr_str += f'{indent_str}with_panoptic_3d={self.with_panoptic_3d}, '\n        repr_str += f'{indent_str}with_bbox={self.with_bbox}, '\n        repr_str += f'{indent_str}with_label={self.with_label}, '\n        repr_str += f'{indent_str}with_mask={self.with_mask}, '\n        repr_str += f'{indent_str}with_seg={self.with_seg}, '\n        repr_str += f'{indent_str}with_bbox_depth={self.with_bbox_depth}, '\n        repr_str += f'{indent_str}poly2mask={self.poly2mask}), '\n        repr_str += f'{indent_str}seg_offset={self.seg_offset}), '\n        repr_str += 
f'{indent_str}with_visible_instance_masks='\n        repr_str += f'{self.with_visible_instance_masks}, '\n        repr_str += f'{indent_str}with_occupancy={self.with_occupancy}, '\n        repr_str += f'{indent_str}with_visible_occupancy_masks='\n        repr_str += f'{self.with_visible_occupancy_masks})'\n\n        return repr_str\n"
  },
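  {
    "path": "examples/load_depth_demo.py",
    "content": "# Hypothetical usage sketch, not part of the original repo: it runs\n# `LoadDepthFromFile` on a synthetic uint16 depth PNG. The file name and\n# `depth_shift` value are assumptions; mmcv/mmengine must be installed.\nimport cv2\nimport numpy as np\n\nfrom bip3d.datasets.transforms.loading import LoadDepthFromFile\n\n# fake 4x4 depth image storing millimetres as uint16\ncv2.imwrite('depth.png', np.full((4, 4), 1500, dtype=np.uint16))\n\nresults = dict(depth_img_path='depth.png', depth_shift=1000.0)\nresults = LoadDepthFromFile().transform(results)\n# raw values are divided by depth_shift, giving float32 metres\nprint(results['depth_img'].dtype, results['depth_img'][0, 0])  # float32 1.5\n"
  },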
  {
    "path": "bip3d/datasets/transforms/multiview.py",
    "content": "import copy\nimport random\n\nimport numpy as np\nimport torch\nfrom mmcv.transforms import BaseTransform, Compose\n\nfrom bip3d.registry import TRANSFORMS\nfrom bip3d.structures.points import get_points_type\nfrom ..utils import sample\n\n\n@TRANSFORMS.register_module()\nclass MultiViewPipeline(BaseTransform):\n    \"\"\"HARD CODE\"\"\"\n    def __init__(\n        self,\n        transforms,\n        n_images,\n        max_n_images=None,\n        ordered=False,\n        rotate_3rscan=False,\n    ):\n        super().__init__()\n        self.transforms = Compose(transforms)\n        self.n_images = n_images\n        self.max_n_images = (\n            max_n_images if max_n_images is not None else n_images\n        )\n        self.ordered = ordered\n        self.rotate_3rscan = rotate_3rscan\n\n    def transform(self, results: dict):\n        \"\"\"Transform function.\n\n        Args:\n            results (dict): Result dict from loading pipeline.\n\n        Returns:\n            dict: output dict after transformation.\n        \"\"\"\n        imgs = []\n        img_paths = []\n        points = []\n        intrinsics = []\n        extrinsics = []\n        depth_imgs = []\n        trans_mat = []\n        \n        total_n = len(results[\"img_path\"])\n        sample_n = min(max(self.n_images, total_n), self.max_n_images)\n        ids = sample(total_n, sample_n, self.ordered)\n        for i in ids.tolist():\n            _results = dict()\n            _results[\"img_path\"] = results[\"img_path\"][i]\n            if \"depth_img_path\" in results:\n                _results[\"depth_img_path\"] = results[\"depth_img_path\"][i]\n                if isinstance(results[\"depth_cam2img\"], list):\n                    _results[\"depth_cam2img\"] = results[\"depth_cam2img\"][i]\n                    _results[\"cam2img\"] = results[\"depth2img\"][\"intrinsic\"][i]\n                else:\n                    _results[\"depth_cam2img\"] = results[\"depth_cam2img\"]\n                    _results[\"cam2img\"] = results[\"cam2img\"]\n                _results[\"depth_shift\"] = results[\"depth_shift\"]\n            _results = self.transforms(_results)\n            if \"depth_shift\" in _results:\n                _results.pop(\"depth_shift\")\n            if \"img\" in _results:\n                imgs.append(_results[\"img\"])\n                img_paths.append(_results[\"img_path\"])\n            if \"depth_img\" in _results:\n                depth_imgs.append(_results[\"depth_img\"])\n            if \"points\" in _results:\n                points.append(_results[\"points\"])\n            if _results.get(\"modify_cam2img\"):\n                intrinsics.append(_results[\"cam2img\"])\n            elif isinstance(results[\"depth2img\"][\"intrinsic\"], list):\n                intrinsics.append(results[\"depth2img\"][\"intrinsic\"][i])\n            else:\n                intrinsics.append(results[\"depth2img\"][\"intrinsic\"])\n            extrinsics.append(results[\"depth2img\"][\"extrinsic\"][i])\n            if \"trans_mat\" in _results:\n                trans_mat.append(_results[\"trans_mat\"])\n        for key in _results.keys():\n            if key not in [\"img\", \"points\", \"img_path\"]:\n                results[key] = _results[key]\n        if len(imgs):\n            if self.rotate_3rscan and \"3rscan\" in img_paths[0]:\n                imgs = [np.transpose(x, (1, 0, 2)) for x in imgs]\n                rot_mat = np.array(\n                    [\n                        [0, 1, 0, 0],\n           
             [1, 0, 0, 0],\n                        [0, 0, 1, 0],\n                        [0, 0, 0, 1],\n                    ]\n                )\n                rot_mat = [np.linalg.inv(x) @ rot_mat @ x for x in intrinsics]\n                extrinsics = [x @ y for x, y in zip(rot_mat, extrinsics)]\n                results[\"scale_factor\"] = results[\"scale_factor\"][::-1]\n                results[\"ori_shape\"] = results[\"ori_shape\"][::-1]\n            results[\"img\"] = imgs\n            results[\"img_path\"] = img_paths\n        if len(depth_imgs):\n            if self.rotate_3rscan and \"3rscan\" in img_paths[0]:\n                depth_imgs = [np.transpose(x, (1, 0)) for x in depth_imgs]\n            results[\"depth_img\"] = depth_imgs\n\n        if len(points):\n            results[\"points\"] = points\n        if (\n            \"ann_info\" in results\n            and \"visible_instance_masks\" in results[\"ann_info\"]\n        ):\n            results[\"visible_instance_masks\"] = [\n                results[\"ann_info\"][\"visible_instance_masks\"][i] for i in ids\n            ]\n            results[\"ann_info\"][\"visible_instance_masks\"] = results[\n                \"visible_instance_masks\"\n            ]\n        elif \"visible_instance_masks\" in results:\n            results[\"visible_instance_masks\"] = [\n                results[\"visible_instance_masks\"][i] for i in ids\n            ]\n        results[\"depth2img\"][\"intrinsic\"] = intrinsics\n        results[\"depth2img\"][\"extrinsic\"] = extrinsics\n        if len(trans_mat) != 0:\n            results[\"depth2img\"][\"trans_mat\"] = trans_mat\n        return results\n"
  },
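  {
    "path": "examples/multi_view_pipeline_demo.py",
    "content": "\"\"\"Illustrative sketch, not part of the original release: exercising\n``MultiViewPipeline`` with an empty inner pipeline so that only the view\nsampling and calibration bookkeeping run. The import path is an assumption\n(the pipeline is assumed to live in ``bip3d/datasets/transforms/loading.py``).\"\"\"\nimport numpy as np\n\nfrom bip3d.datasets.transforms.loading import MultiViewPipeline\n\n# Four fake views with identity calibrations; no images are loaded because\n# the inner transform list is empty.\nresults = dict(\n    img_path=[f\"view_{i}.jpg\" for i in range(4)],\n    depth2img=dict(\n        intrinsic=[np.eye(4) for _ in range(4)],\n        extrinsic=[np.eye(4) for _ in range(4)],\n    ),\n)\npipeline = MultiViewPipeline(transforms=[], n_images=2, ordered=True)\nout = pipeline.transform(results)\n# Two views are kept at a fixed interval and their calibrations follow.\nassert len(out[\"depth2img\"][\"intrinsic\"]) == 2\nassert len(out[\"depth2img\"][\"extrinsic\"]) == 2\n"
  },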
  {
    "path": "bip3d/datasets/transforms/transform.py",
    "content": "from typing import List, Optional, Tuple, Union\nimport cv2\nimport copy\n\nimport numpy as np\nfrom numpy import random\nimport mmcv\nimport torch\nfrom PIL import Image\nfrom mmcv.transforms import BaseTransform, Resize\n\nfrom bip3d.registry import TRANSFORMS\nfrom bip3d.structures.bbox_3d import points_cam2img, points_img2cam\nfrom bip3d.structures.points import BasePoints, get_points_type\n\n\n@TRANSFORMS.register_module()\nclass CategoryGroundingDataPrepare(BaseTransform):\n    def __init__(\n        self,\n        classes,\n        training,\n        max_class=None,\n        sep_token=\"[SEP]\",\n        filter_others=True,\n        z_range=None,\n        filter_invisible=False,\n        instance_mask_key=\"visible_instance_masks\",\n    ):\n        self.classes = list(classes)\n        self.training = training\n        self.max_class = max_class\n        self.sep_token = sep_token\n        self.filter_others = filter_others\n        self.z_range = z_range\n        self.filter_invisible = filter_invisible\n        self.instance_mask_key = instance_mask_key\n\n    def transform(self, results):\n        visible_instance_masks = results.get(self.instance_mask_key)\n        if isinstance(visible_instance_masks, (list, tuple)):\n            visible_instance_masks = np.stack(visible_instance_masks).any(axis=0)\n            \n        ignore_mask = results.get(\"ignore_mask\")\n        gt_names = results[\"ann_info\"][\"gt_names\"]\n        if self.filter_others and \"gt_labels_3d\" in results:\n            mask = results[\"gt_labels_3d\"] >= 0\n            results[\"gt_labels_3d\"] = results[\"gt_labels_3d\"][mask]\n            results[\"gt_bboxes_3d\"] = results[\"gt_bboxes_3d\"][mask]\n            visible_instance_masks = visible_instance_masks[mask]\n            gt_names = [x for i, x in enumerate(gt_names) if mask[i]]\n            if ignore_mask is not None:\n                ignore_mask = ignore_mask[mask]\n\n        if (\n            self.z_range is not None\n            and \"gt_labels_3d\" in results\n            and self.training\n        ):\n            mask = torch.logical_and(\n                results[\"gt_bboxes_3d\"].tensor[..., 2] >= self.z_range[0],\n                results[\"gt_bboxes_3d\"].tensor[..., 2] <= self.z_range[1],\n            ).numpy()\n            results[\"gt_labels_3d\"] = results[\"gt_labels_3d\"][mask]\n            results[\"gt_bboxes_3d\"] = results[\"gt_bboxes_3d\"][mask]\n            visible_instance_masks = visible_instance_masks[mask]\n            gt_names = [x for i, x in enumerate(gt_names) if mask[i]]\n            if ignore_mask is not None:\n                ignore_mask = ignore_mask[mask]\n\n        if self.training or self.filter_invisible:\n            results[\"gt_labels_3d\"] = results[\"gt_labels_3d\"][\n                visible_instance_masks\n            ]\n            results[\"gt_bboxes_3d\"] = results[\"gt_bboxes_3d\"][\n                visible_instance_masks\n            ]\n            gt_names = [\n                x for i, x in enumerate(gt_names) if visible_instance_masks[i]\n            ]\n            if ignore_mask is not None:\n                ignore_mask = ignore_mask[visible_instance_masks]\n                results[\"ignore_mask\"] = ignore_mask\n\n        if self.training:\n            if (\n                self.max_class is not None\n                and len(self.classes) > self.max_class\n            ):\n                classes = copy.deepcopy(gt_names)\n                random.shuffle(self.classes)\n               
 for c in self.classes:\n                    if c in classes:\n                        continue\n                    classes.append(c)\n                    if len(classes) >= self.max_class:\n                        break\n                random.shuffle(classes)\n            else:\n                classes = copy.deepcopy(self.classes)\n        else:\n            classes = copy.deepcopy(self.classes)\n            gt_names = classes\n\n        results[\"text\"] = self.sep_token.join(classes)\n        tokens_positive = []\n        for name in gt_names:\n            start = results[\"text\"].find(\n                self.sep_token + name + self.sep_token\n            )\n            if start == -1:\n                if results[\"text\"].startswith(name + self.sep_token):\n                    start = 0\n                else:\n                    start = results[\"text\"].find(self.sep_token + name) + len(\n                        self.sep_token\n                    )\n            else:\n                start += len(self.sep_token)\n            end = start + len(name)\n            tokens_positive.append([[start, end]])\n        results[\"tokens_positive\"] = tokens_positive\n        return results\n\n\n@TRANSFORMS.register_module()\nclass CamIntrisicStandardization(BaseTransform):\n    def __init__(self, dst_intrinsic, dst_wh):\n        if not isinstance(dst_intrinsic, np.ndarray):\n            dst_intrinsic = np.array(dst_intrinsic)\n        if dst_intrinsic.shape[0] == 3:\n            tmp = np.eye(4)\n            tmp[:3, :3] = dst_intrinsic\n            dst_intrinsic = tmp\n        self.dst_intrinsic = dst_intrinsic\n        self.dst_wh = dst_wh\n        u, v = np.arange(dst_wh[0]), np.arange(dst_wh[1])\n        u = np.repeat(u[None], dst_wh[1], 0)\n        v = np.repeat(v[:, None], dst_wh[0], 1)\n        uv = np.stack([u, v, np.ones_like(u)], axis=-1)\n        self.dst_pts = uv @ np.linalg.inv(self.dst_intrinsic[:3, :3]).T\n\n    def transform(self, results):\n        src_intrinsic = results[\"cam2img\"][:3, :3]\n        src_uv = self.dst_pts @ src_intrinsic.T\n        src_uv = src_uv.astype(np.float32)\n        if \"depth_img\" in results and results[\"img\"].shape[:2] != results[\"depth_img\"].shape[:2]:\n            results[\"depth_img\"] = cv2.resize(\n                results[\"depth_img\"], results[\"img\"].shape[:2][::-1],\n                interpolation=cv2.INTER_LINEAR,\n            )\n        for key in [\"img\", \"depth_img\"]:\n            if key not in results:\n                continue\n            results[key] = cv2.remap(\n                results[key],\n                src_uv[..., 0],\n                src_uv[..., 1],\n                cv2.INTER_NEAREST,\n            )\n            # warp_mat = self.dst_intrinsic[:3, :3] @ np.linalg.inv(src_intrinsic)\n            # cv2.warpAffine(results[key], warp_mat[:2, :3], self.dst_wh)\n        results[\"cam2img\"] = copy.deepcopy(self.dst_intrinsic)\n        results[\"depth_cam2img\"] = copy.deepcopy(self.dst_intrinsic)\n        results[\"scale\"] = self.dst_wh\n        results['img_shape'] = results[\"img\"].shape[:2]\n        results['scale_factor'] = (1, 1)\n        results['keep_ratio'] = False\n        results[\"modify_cam2img\"] = True\n        return results\n\n\n@TRANSFORMS.register_module()\nclass CustomResize(Resize):\n    def _resize_img(self, results: dict, key=\"img\"):\n        \"\"\"Resize images with ``results['scale']``.\"\"\"\n\n        if results.get(key, None) is not None:\n            if self.keep_ratio:\n                
img, scale_factor = mmcv.imrescale(\n                    results[key],\n                    results['scale'],\n                    interpolation=self.interpolation,\n                    return_scale=True,\n                    backend=self.backend)\n                # the w_scale and h_scale has minor difference\n                # a real fix should be done in the mmcv.imrescale in the future\n                new_h, new_w = img.shape[:2]\n                h, w = results[key].shape[:2]\n                w_scale = new_w / w\n                h_scale = new_h / h\n            else:\n                img, w_scale, h_scale = mmcv.imresize(\n                    results[key],\n                    results['scale'],\n                    interpolation=self.interpolation,\n                    return_scale=True,\n                    backend=self.backend)\n            results[key] = img\n            if key == \"img\":\n                results['img_shape'] = img.shape[:2]\n                results['scale_factor'] = (w_scale, h_scale)\n                results['keep_ratio'] = self.keep_ratio\n\n    def transform(self, results: dict):\n        if self.scale:\n            results['scale'] = self.scale\n        else:\n            img_shape = results['img'].shape[:2]\n            results['scale'] = _scale_size(img_shape[::-1],\n                                           self.scale_factor)  # type: ignore\n        self._resize_img(results)\n        self._resize_bboxes(results)\n        self._resize_seg(results)\n        self._resize_keypoints(results)\n        if \"depth_img\" in results:\n            self._resize_img(results, key=\"depth_img\")\n        return results\n\n\n@TRANSFORMS.register_module()\nclass DepthProbLabelGenerator(BaseTransform):\n    def __init__(\n        self,\n        max_depth=10,\n        min_depth=0.25,\n        num_depth=64,\n        stride=[8, 16, 32, 64],\n        origin_stride=1,\n    ):\n        self.max_depth = max_depth\n        self.min_depth = min_depth\n        self.num_depth = num_depth\n        self.stride = [x//origin_stride for x in stride]\n        self.origin_stride = origin_stride\n\n    def transform(self, input_dict):\n        depth = input_dict[\"inputs\"][\"depth_img\"].cpu().numpy()\n        if self.origin_stride != 1:\n            H, W = depth.shape[-2:]\n            depth = np.transpose(depth, (0, 2, 3, 1))\n            depth = [mmcv.imresize(\n                x, (W//self.origin_stride, H//self.origin_stride),\n                interpolation=\"nearest\",\n            ) for x in depth]\n            depth = np.stack(depth)[:,None]\n        depth = np.clip(\n            depth,\n            a_min=self.min_depth,\n            a_max=self.max_depth,\n        )\n        depth_anchor = np.linspace(\n            self.min_depth, self.max_depth, self.num_depth)[:, None, None]\n        distance = np.abs(depth - depth_anchor)\n        mask = distance < (depth_anchor[1] - depth_anchor[0])\n        depth_gt = np.where(mask, depth_anchor, 0)\n        y = depth_gt.sum(axis=1, keepdims=True) - depth_gt\n        depth_valid_mask = depth > 0\n        depth_prob_gt = np.where(\n            (depth_gt != 0) & depth_valid_mask,\n            (depth - y) / (depth_gt - y),\n            0,\n        )\n        views, _, H, W = depth.shape\n        gt = []\n        gt_map = []\n        for s in self.stride:\n            gt_tmp = np.reshape(\n                depth_prob_gt, (views, self.num_depth, H//s, s, W//s, s))\n            gt_tmp = gt_tmp.sum(axis=-1).sum(axis=3)\n            mask_tmp = 
depth_valid_mask.reshape(views, 1, H//s, s, W//s, s)\n            mask_tmp = mask_tmp.sum(axis=-1).sum(axis=3)\n            gt_tmp /= np.clip(mask_tmp, a_min=1, a_max=None)\n            # gt_map.append(np.transpose(gt_tmp, (0, 2, 3, 1)))\n            gt_tmp = gt_tmp.reshape(views, self.num_depth, -1)\n            gt_tmp = np.transpose(gt_tmp, (0, 2, 1))\n            gt.append(gt_tmp)\n        gt = np.concatenate(gt, axis=1)\n        gt = np.clip(gt, a_min=0.0, a_max=1.0)\n        input_dict[\"inputs\"][\"depth_prob_gt\"] = torch.from_numpy(gt).to(\n            input_dict[\"inputs\"][\"depth_img\"])\n        return input_dict\n"
  },
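  {
    "path": "examples/tokens_positive_demo.py",
    "content": "\"\"\"Illustrative sketch, not part of the original release: how\n``CategoryGroundingDataPrepare`` turns class names into a text prompt and\n``tokens_positive`` character spans. The span search below mirrors the\ntransform's logic on a toy vocabulary.\"\"\"\nsep_token = \"[SEP]\"\nclasses = [\"chair\", \"table\", \"lamp\"]\ngt_names = [\"table\", \"chair\"]\n\ntext = sep_token.join(classes)  # \"chair[SEP]table[SEP]lamp\"\n\ntokens_positive = []\nfor name in gt_names:\n    # Prefer a name enclosed by separators; fall back to the prompt's\n    # start or tail for the first/last class in the list.\n    start = text.find(sep_token + name + sep_token)\n    if start == -1:\n        if text.startswith(name + sep_token):\n            start = 0\n        else:\n            start = text.find(sep_token + name) + len(sep_token)\n    else:\n        start += len(sep_token)\n    tokens_positive.append([[start, start + len(name)]])\n\n# Each span indexes its class name inside the prompt.\nassert text[tokens_positive[0][0][0]:tokens_positive[0][0][1]] == \"table\"\nassert text[tokens_positive[1][0][0]:tokens_positive[1][0][1]] == \"chair\"\n"
  },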
  {
    "path": "bip3d/datasets/utils.py",
    "content": "import numpy as np\nfrom scipy.spatial.transform import Rotation\n\n\ndef sample(total_n, sample_n, fix_interval=True):\n    ids = np.arange(total_n)\n    if sample_n == total_n:\n        return ids\n    elif sample_n > total_n:\n        return np.concatenate([\n            ids, sample(total_n, sample_n - total_n, fix_interval)\n        ])\n    elif fix_interval:\n        interval = total_n / sample_n\n        output = []\n        for i in range(sample_n):\n            output.append(ids[int(interval * i)])\n        return np.array(output)\n    return np.random.choice(ids, sample_n, replace=False)\n\n\ndef xyzrpy_to_matrix(input):\n    output = np.eye(4)\n    output[:3, :3] = Rotation.from_euler(\"xyz\", np.array(input[3:])).as_matrix()\n    output[:3, 3] = input[:3]\n    return output\n"
  },
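  {
    "path": "examples/dataset_utils_demo.py",
    "content": "\"\"\"Illustrative sketch, not part of the original release: the two helpers\nin ``bip3d/datasets/utils.py``.\"\"\"\nimport numpy as np\n\nfrom bip3d.datasets.utils import sample, xyzrpy_to_matrix\n\n# Fixed-interval sampling keeps view order: 8 views -> indices [0 2 4 6].\nprint(sample(8, 4, fix_interval=True))\n\n# Oversampling wraps around and repeats from the start: [0 1 2 0 1].\nprint(sample(3, 5, fix_interval=True))\n\n# Pose (x, y, z, roll, pitch, yaw) -> homogeneous 4x4 transform.\nT = xyzrpy_to_matrix([1.0, 2.0, 0.5, 0.0, 0.0, np.pi / 2])\nprint(np.round(T, 3))\n"
  },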
  {
    "path": "bip3d/eval/__init__.py",
    "content": "from .metrics import GroundingMetric, IndoorDetMetric\n\n__all__ = ['IndoorDetMetric', 'GroundingMetric']\n"
  },
  {
    "path": "bip3d/eval/indoor_eval.py",
    "content": "# Copyright (c) OpenRobotLab. All rights reserved.\nimport numpy as np\nimport torch\nfrom mmengine.logging import print_log\nfrom terminaltables import AsciiTable\n# from bip3d.visualization.default_color_map import DEFAULT_COLOR_MAP\nimport matplotlib.pyplot as plt\n\n\ndef average_precision(recalls, precisions, mode='area'):\n    \"\"\"Calculate average precision (for single or multiple scales).\n\n    Args:\n        recalls (np.ndarray): Recalls with shape of (num_scales, num_dets)\n            or (num_dets, ).\n        precisions (np.ndarray): Precisions with shape of\n            (num_scales, num_dets) or (num_dets, ).\n        mode (str): 'area' or '11points', 'area' means calculating the area\n            under precision-recall curve, '11points' means calculating\n            the average precision of recalls at [0, 0.1, ..., 1]\n\n    Returns:\n        float or np.ndarray: Calculated average precision.\n    \"\"\"\n    if recalls.ndim == 1:\n        recalls = recalls[np.newaxis, :]\n        precisions = precisions[np.newaxis, :]\n\n    assert recalls.shape == precisions.shape\n    assert recalls.ndim == 2\n\n    num_scales = recalls.shape[0]\n    ap = np.zeros(num_scales, dtype=np.float32)\n    if mode == 'area':\n        zeros = np.zeros((num_scales, 1), dtype=recalls.dtype)\n        ones = np.ones((num_scales, 1), dtype=recalls.dtype)\n        mrec = np.hstack((zeros, recalls, ones))\n        mpre = np.hstack((zeros, precisions, zeros))\n        for i in range(mpre.shape[1] - 1, 0, -1):\n            mpre[:, i - 1] = np.maximum(mpre[:, i - 1], mpre[:, i])\n        for i in range(num_scales):\n            ind = np.where(mrec[i, 1:] != mrec[i, :-1])[0]\n            ap[i] = np.sum(\n                (mrec[i, ind + 1] - mrec[i, ind]) * mpre[i, ind + 1])\n    elif mode == '11points':\n        for i in range(num_scales):\n            for thr in np.arange(0, 1 + 1e-3, 0.1):\n                precs = precisions[i, recalls[i, :] >= thr]\n                prec = precs.max() if precs.size > 0 else 0\n                ap[i] += prec\n            ap /= 11\n    else:\n        raise ValueError(\n            'Unrecognized mode, only \"area\" and \"11points\" are supported')\n    return ap\n\n\ndef eval_det_cls(pred, gt, iou_thr=None, area_range=None):\n    \"\"\"Generic functions to compute precision/recall for object detection for a\n    single class.\n\n    Args:\n        pred (dict): Predictions mapping from image id to bounding boxes\n            and scores.\n        gt (dict): Ground truths mapping from image id to bounding boxes.\n        iou_thr (list[float]): A list of iou thresholds.\n\n    Return:\n        tuple (np.ndarray, np.ndarray, float): Recalls, precisions and\n            average precision.\n    \"\"\"\n\n    # {img_id: {'bbox': box structure, 'det': matched list}}\n    class_recs = {}\n    npos = 0\n    # figure out the bbox code size first\n    gt_bbox_code_size = 9\n    pred_bbox_code_size = 9\n    for img_id in gt.keys():\n        if len(gt[img_id]) != 0:\n            gt_bbox_code_size = gt[img_id][0].tensor.shape[1]\n            break\n    for img_id in pred.keys():\n        if len(pred[img_id][0]) != 0:\n            pred_bbox_code_size = pred[img_id][0][0].tensor.shape[1]\n            break\n    assert gt_bbox_code_size == pred_bbox_code_size\n    for img_id in gt.keys():\n        cur_gt_num = len(gt[img_id])\n        if cur_gt_num != 0:\n            gt_cur = torch.zeros([cur_gt_num, gt_bbox_code_size],\n                                 dtype=torch.float32)\n   
         for i in range(cur_gt_num):\n                gt_cur[i] = gt[img_id][i].tensor\n            bbox = gt[img_id][0].new_box(gt_cur)\n        else:\n            bbox = gt[img_id]\n        det = [[False] * len(bbox) for i in iou_thr]\n        npos += len(bbox)\n        class_recs[img_id] = {'bbox': bbox, 'det': det}\n        if area_range is not None and len(bbox) > 0:\n            area = bbox.tensor[:, 3:6].cumprod(dim=-1)[..., -1]\n            ignore = torch.logical_or(area < area_range[0], area > area_range[1])\n            npos -= ignore.sum().item()\n            class_recs[img_id][\"ignore\"] = ignore\n\n    # construct dets\n    image_ids = []\n    confidence = []\n    ious = []\n    for img_id in pred.keys():\n        cur_num = len(pred[img_id])\n        if cur_num == 0:\n            continue\n        pred_cur = torch.zeros((cur_num, pred_bbox_code_size),\n                               dtype=torch.float32)\n        box_idx = 0\n        for box, score in pred[img_id]:\n            image_ids.append(img_id)\n            confidence.append(score)\n            # handle outlier (too thin) predicted boxes\n            w, l, h = box.tensor[0, 3:6]\n            faces = [w * l, w * h, h * l]\n            if torch.any(box.tensor.new_tensor(faces) < 2e-4):\n                # print('Find small predicted boxes,',\n                #       'and clamp short edges to 2e-2 meters.')\n                box.tensor[:, 3:6] = torch.clamp(box.tensor[:, 3:6], min=2e-2)\n            pred_cur[box_idx] = box.tensor\n            box_idx += 1\n        pred_cur = box.new_box(pred_cur)\n        gt_cur = class_recs[img_id]['bbox']\n        if len(gt_cur) > 0:\n            # calculate iou in each image\n            iou_cur = pred_cur.overlaps(pred_cur, gt_cur)\n            for i in range(cur_num):\n                ious.append(iou_cur[i])\n        else:\n            for i in range(cur_num):\n                ious.append(np.zeros(1))\n\n    confidence = np.array(confidence)\n\n    # sort by confidence\n    sorted_ind = np.argsort(-confidence)\n    image_ids = [image_ids[x] for x in sorted_ind]\n    ious = [ious[x] for x in sorted_ind]\n\n    # go down dets and mark TPs and FPs\n    num_images = len(image_ids)\n    tp_thr = [np.zeros(num_images) for i in iou_thr]\n    fp_thr = [np.zeros(num_images) for i in iou_thr]\n    for d in range(num_images):\n        R = class_recs[image_ids[d]]\n        iou_max = -np.inf\n        BBGT = R['bbox']\n        cur_iou = ious[d]\n\n        if len(BBGT) > 0:\n            # compute overlaps\n            for j in range(len(BBGT)):\n                # iou = get_iou_main(get_iou_func, (bb, BBGT[j,...]))\n                iou = cur_iou[j]\n                if iou > iou_max:\n                    iou_max = iou\n                    jmax = j\n\n        for iou_idx, thresh in enumerate(iou_thr):\n            if iou_max > thresh:\n                if \"ignore\" in R and R[\"ignore\"][jmax]:\n                    continue\n                if not R['det'][iou_idx][jmax]:\n                    tp_thr[iou_idx][d] = 1.\n                    R['det'][iou_idx][jmax] = 1\n                else:\n                    fp_thr[iou_idx][d] = 1.\n            else:\n                fp_thr[iou_idx][d] = 1.\n\n    ret = []\n    for iou_idx, thresh in enumerate(iou_thr):\n        # compute precision recall\n        fp = np.cumsum(fp_thr[iou_idx])\n        tp = np.cumsum(tp_thr[iou_idx])\n        recall = tp / float(npos)\n        # avoid divide by zero in case the first detection matches a difficult\n        # ground 
truth\n        precision = tp / np.maximum(tp + fp, np.finfo(np.float64).eps)\n        ap = average_precision(recall, precision)\n        ret.append((recall, precision, ap))\n\n    return ret\n\n\ndef eval_map_recall(pred, gt, ovthresh=None, area_range=None):\n    \"\"\"Evaluate mAP and recall.\n\n    Generic functions to compute precision/recall for object detection\n        for multiple classes.\n\n    Args:\n        pred (dict): Information of detection results,\n            which maps class_id and predictions.\n        gt (dict): Information of ground truths, which maps class_id and\n            ground truths.\n        ovthresh (list[float], optional): iou threshold. Default: None.\n\n    Return:\n        tuple[dict]: dict results of recall, AP, and precision for all classes.\n    \"\"\"\n\n    ret_values = {}\n    for classname in gt.keys():\n        if classname in pred:\n            ret_values[classname] = eval_det_cls(pred[classname],\n                                                 gt[classname], ovthresh, area_range)\n    recall = [{} for i in ovthresh]\n    precision = [{} for i in ovthresh]\n    ap = [{} for i in ovthresh]\n\n    for label in gt.keys():\n        for iou_idx, thresh in enumerate(ovthresh):\n            if label in pred:\n                recall[iou_idx][label], precision[iou_idx][label], ap[iou_idx][\n                    label] = ret_values[label][iou_idx]\n            else:\n                recall[iou_idx][label] = np.zeros(1)\n                precision[iou_idx][label] = np.zeros(1)\n                ap[iou_idx][label] = np.zeros(1)\n\n    return recall, precision, ap\n\n\ndef save_pr_curve(recs, precs, metrics):\n    color = list(DEFAULT_COLOR_MAP.values())\n    for rec, prec, metric in zip(recs, precs, metrics):\n        plt.figure(figsize=(20, 18))\n        fontsize = 50\n        font = 'Times New Roman'\n        for k in rec.keys():\n            if np.isnan(rec[k]).any():\n                continue\n            plt.plot(rec[k], prec[k])\n        plt.xlabel('Recall', fontsize=fontsize)\n        plt.ylabel('Precision', fontsize=fontsize)\n        plt.savefig(f'./pr_curver_{metric}.png',dpi=500.)\n\n\ndef indoor_eval(gt_annos,\n                dt_annos,\n                metric,\n                label2cat,\n                logger=None,\n                box_mode_3d=None,\n                classes_split=None,\n                part=\"Overall\"):\n    \"\"\"Indoor Evaluation.\n\n    Evaluate the result of the detection.\n\n    Args:\n        gt_annos (list[dict]): Ground truth annotations.\n        dt_annos (list[dict]): Detection annotations. the dict\n            includes the following keys\n\n            - labels_3d (torch.Tensor): Labels of boxes.\n            - bboxes_3d (:obj:`BaseInstance3DBoxes`):\n                3D bounding boxes in Depth coordinate.\n            - scores_3d (torch.Tensor): Scores of boxes.\n        metric (list[float]): IoU thresholds for computing average precisions.\n        label2cat (tuple): Map from label to category.\n        logger (logging.Logger | str, optional): The way to print the mAP\n            summary. See `mmdet.utils.print_log()` for details. 
Default: None.\n\n    Return:\n        dict[str, float]: Dict of results.\n    \"\"\"\n    assert len(dt_annos) == len(gt_annos)\n    pred = {}  # map {class_id: pred}\n    gt = {}  # map {class_id: gt}\n    for img_id in range(len(dt_annos)):\n        # parse detected annotations\n        det_anno = dt_annos[img_id]\n        for i in range(len(det_anno['labels_3d'])):\n            label = det_anno['labels_3d'].numpy()[i]\n            bbox = det_anno['bboxes_3d'].convert_to(box_mode_3d)[i]\n            score = det_anno['scores_3d'].numpy()[i]\n            if label not in pred:\n                pred[int(label)] = {}\n            if img_id not in pred[label]:\n                pred[int(label)][img_id] = []\n            if label not in gt:\n                gt[int(label)] = {}\n            if img_id not in gt[label]:\n                gt[int(label)][img_id] = []\n            pred[int(label)][img_id].append((bbox, score))\n\n        # parse gt annotations\n        gt_anno = gt_annos[img_id]\n\n        gt_boxes = gt_anno['gt_bboxes_3d']\n        labels_3d = gt_anno['gt_labels_3d']\n\n        for i in range(len(labels_3d)):\n            label = labels_3d[i]\n            bbox = gt_boxes[i]\n            if label not in gt:\n                gt[label] = {}\n            if img_id not in gt[label]:\n                gt[label][img_id] = []\n            gt[label][img_id].append(bbox)\n\n    rec, prec, ap = eval_map_recall(pred, gt, metric)\n\n    # filter nan results\n    ori_keys = list(ap[0].keys())\n    for key in ori_keys:\n        if np.isnan(ap[0][key][0]):\n            for r in rec:\n                del r[key]\n            for p in prec:\n                del p[key]\n            for a in ap:\n                del a[key]\n\n    ret_dict = dict()\n    header = ['classes']\n    table_columns = [[label2cat[label]\n                      for label in ap[0].keys()] + [part]]\n\n    for i, iou_thresh in enumerate(metric):\n        header.append(f'AP_{iou_thresh:.2f}')\n        header.append(f'AR_{iou_thresh:.2f}')\n        rec_list = []\n        for label in ap[i].keys():\n            ret_dict[f'{label2cat[label]}_AP_{iou_thresh:.2f}'] = float(\n                ap[i][label][0])\n        ret_dict[f'mAP_{iou_thresh:.2f}'] = float(np.mean(list(\n            ap[i].values())))\n\n        table_columns.append(list(map(float, list(ap[i].values()))))\n        table_columns[-1] += [ret_dict[f'mAP_{iou_thresh:.2f}']]\n        table_columns[-1] = [f'{x:.4f}' for x in table_columns[-1]]\n\n        for label in rec[i].keys():\n            ret_dict[f'{label2cat[label]}_rec_{iou_thresh:.2f}'] = float(\n                rec[i][label][-1])\n            rec_list.append(rec[i][label][-1])\n        ret_dict[f'mAR_{iou_thresh:.2f}'] = float(np.mean(rec_list))\n\n        table_columns.append(list(map(float, rec_list)))\n        table_columns[-1] += [ret_dict[f'mAR_{iou_thresh:.2f}']]\n        table_columns[-1] = [f'{x:.4f}' for x in table_columns[-1]]\n\n    table_data = [header]\n    table_rows = list(zip(*table_columns))\n    table_data += table_rows\n    table = AsciiTable(table_data)\n    table.inner_footing_row_border = True\n    print_log('\\n' + table.table, logger=logger)\n\n    if classes_split is not None:\n        splits = ['head', 'common', 'tail']\n        for idx in range(len(splits)):\n            header = [f'{splits[idx]}_classes']\n            # init the category list/column\n            cat_list = []\n            for label in classes_split[idx]:\n                if label in ap[0]:\n                    
cat_list.append(label2cat[label])\n            table_columns = [cat_list + [part]]\n\n            for i, iou_thresh in enumerate(metric):\n                header.append(f'AP_{iou_thresh:.2f}')\n                header.append(f'AR_{iou_thresh:.2f}')\n                ap_list = []\n                for label in classes_split[idx]:\n                    if label in ap[i]:\n                        ap_list.append(float(ap[i][label][0]))\n                mean_ap = float(np.mean(ap_list))\n\n                table_columns.append(list(map(float, ap_list)))\n                table_columns[-1] += [mean_ap]\n                table_columns[-1] = [f'{x:.4f}' for x in table_columns[-1]]\n\n                rec_list = []\n                for label in classes_split[idx]:\n                    if label in rec[i]:\n                        rec_list.append(rec[i][label][-1])\n                mean_rec = float(np.mean(rec_list))\n\n                table_columns.append(list(map(float, rec_list)))\n                table_columns[-1] += [mean_rec]\n                table_columns[-1] = [f'{x:.4f}' for x in table_columns[-1]]\n\n            table_data = [header]\n            table_rows = list(zip(*table_columns))\n            table_data += table_rows\n            table = AsciiTable(table_data)\n            table.inner_footing_row_border = True\n            print_log('\\n' + table.table, logger=logger)\n\n    area_split = {\n        \"small\": [0, 0.2**3],\n        \"medium\": [0.2**3, 1.0**3],\n        \"large\": [1.0**3, float(\"inf\")],\n    }\n    if area_split is not None:\n        header = [\"area\"]\n        for i, iou_thresh in enumerate(metric):\n            header.append(f'AP_{iou_thresh:.2f}')\n            header.append(f'AR_{iou_thresh:.2f}')\n        table_rows = []\n        area_ret_dict = dict()\n        for idx, (area, area_range) in enumerate(area_split.items()):\n            # header = [area]\n            rec, prec, ap = eval_map_recall(pred, gt, metric, area_range)\n            ori_keys = list(ap[0].keys())\n            for key in ori_keys:\n                if np.isnan(ap[0][key][0]):\n                    for r in rec:\n                        del r[key]\n                    for p in prec:\n                        del p[key]\n                    for a in ap:\n                        del a[key]\n\n            # table_columns = [[f\"{area}_{part}\"]]\n            table_rows.append([f\"{area}_{part}\"])\n            for i, iou_thresh in enumerate(metric):\n                # header.append(f'AP_{iou_thresh:.2f}')\n                # header.append(f'AR_{iou_thresh:.2f}')\n                rec_list = []\n                for label in ap[i].keys():\n                    area_ret_dict[f'{label2cat[label]}_AP_{iou_thresh:.2f}'] = float(\n                        ap[i][label][0])\n                area_ret_dict[f'mAP_{iou_thresh:.2f}'] = float(np.mean(list(\n                    ap[i].values())))\n\n                table_rows[-1].append(f\"{area_ret_dict[f'mAP_{iou_thresh:.2f}']:.4f}\")\n                # table_columns.append([f\"{area_ret_dict[f'mAP_{iou_thresh:.2f}']:.4f}\"])\n\n                for label in rec[i].keys():\n                    area_ret_dict[f'{label2cat[label]}_rec_{iou_thresh:.2f}'] = float(\n                        rec[i][label][-1])\n                    rec_list.append(rec[i][label][-1])\n                area_ret_dict[f'mAR_{iou_thresh:.2f}'] = float(np.mean(rec_list))\n\n                table_rows[-1].append(f\"{area_ret_dict[f'mAR_{iou_thresh:.2f}']:.4f}\")\n                # 
table_columns.append([f\"{area_ret_dict[f'mAR_{iou_thresh:.2f}']:.4f}\"])\n\n            # table_data = [header]\n            # table_rows = list(zip(*table_columns))\n            # table_data += table_rows\n            # table = AsciiTable(table_data)\n            # table.inner_footing_row_border = True\n            # print_log('\\n' + table.table, logger=logger)\n        table_data = [header]\n        table_data += table_rows\n        table = AsciiTable(table_data)\n        table.inner_footing_row_border = False\n        print_log('\\n' + table.table, logger=logger)\n    return ret_dict\n"
  },
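  {
    "path": "examples/average_precision_demo.py",
    "content": "\"\"\"Illustrative sketch, not part of the original release: computing AP\nwith ``average_precision`` from a toy precision/recall curve in both\nsupported modes.\"\"\"\nimport numpy as np\n\nfrom bip3d.eval.indoor_eval import average_precision\n\n# Three detections sorted by confidence; the second one is a false positive.\nrecalls = np.array([0.5, 0.5, 1.0])\nprecisions = np.array([1.0, 0.5, 2.0 / 3.0])\n\n# Area under the smoothed PR curve: 0.5 * 1.0 + 0.5 * 2/3 ~= 0.8333.\nprint(average_precision(recalls, precisions, mode='area'))\n# 11-point interpolated variant used by older benchmarks.\nprint(average_precision(recalls, precisions, mode='11points'))\n"
  },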
  {
    "path": "bip3d/eval/metrics/__init__.py",
    "content": "from .det_metric import IndoorDetMetric\nfrom .grounding_metric import GroundingMetric\n# from .occupancy_metric import OccupancyMetric\n\n__all__ = ['IndoorDetMetric', 'GroundingMetric']\n"
  },
  {
    "path": "bip3d/eval/metrics/det_metric.py",
    "content": "# Copyright (c) OpenRobotLab. All rights reserved.\nimport logging\nfrom collections import OrderedDict\nfrom typing import Dict, List, Optional, Sequence, Union\n\nimport numpy as np\nfrom terminaltables import AsciiTable\nfrom mmdet.evaluation import eval_map\nfrom mmengine.dist import (broadcast_object_list, collect_results,\n                           is_main_process, get_rank)\nfrom mmengine.evaluator import BaseMetric\nfrom mmengine.evaluator.metric import _to_cpu\nfrom mmengine.logging import MMLogger, print_log\n\nfrom bip3d.registry import METRICS\nfrom bip3d.structures import get_box_type\n\nfrom ..indoor_eval import indoor_eval\n\n\n@METRICS.register_module()\nclass IndoorDetMetric(BaseMetric):\n    \"\"\"Indoor scene evaluation metric.\n\n    Args:\n        iou_thr (float or List[float]): List of iou threshold when calculate\n            the metric. Defaults to [0.25, 0.5].\n        collect_device (str): Device name used for collecting results from\n            different ranks during distributed training. Must be 'cpu' or\n            'gpu'. Defaults to 'cpu'.\n        prefix (str, optional): The prefix that will be added in the metric\n            names to disambiguate homonymous metrics of different evaluators.\n            If prefix is not provided in the argument, self.default_prefix will\n            be used instead. Defaults to None.\n    \"\"\"\n\n    def __init__(self,\n                 iou_thr: List[float] = [0.25, 0.5],\n                 collect_device: str = 'cpu',\n                 prefix: Optional[str] = None,\n                 batchwise_anns: bool = False,\n                 save_result_path=None,\n                 eval_part=None,\n                 **kwargs):\n        super(IndoorDetMetric, self).__init__(prefix=prefix,\n                                              collect_device=collect_device,\n                                              **kwargs)\n        self.iou_thr = [iou_thr] if isinstance(iou_thr, float) else iou_thr\n        self.batchwise_anns = batchwise_anns\n        self.save_result_path = save_result_path\n        if eval_part is None:\n            eval_part = [\"scannet\", \"3rscan\", \"matterport3d\", \"arkit\"]\n        self.eval_part = eval_part\n\n    def process(self, data_batch: dict, data_samples: Sequence[dict]):\n        \"\"\"Process one batch of data samples and predictions.\n\n        The processed results should be stored in ``self.results``, which will\n        be used to compute the metrics when all batches have been processed.\n\n        Args:\n            data_batch (dict): A batch of data from the dataloader.\n            data_samples (Sequence[dict]): A batch of outputs from the model.\n        \"\"\"\n        for data_sample in data_samples:\n            pred_3d = data_sample['pred_instances_3d']\n            eval_ann_info = data_sample['eval_ann_info']\n            eval_ann_info[\"scan_id\"] = data_sample[\"scan_id\"]\n            cpu_pred_3d = dict()\n            for k, v in pred_3d.items():\n                if hasattr(v, 'to'):\n                    cpu_pred_3d[k] = v.to('cpu')\n                else:\n                    cpu_pred_3d[k] = v\n            self.results.append((eval_ann_info, cpu_pred_3d))\n\n    def compute_metrics(self, results: list, part=\"Overall\", split=True):\n        \"\"\"Compute the metrics from processed results.\n\n        Args:\n            results (list): The processed results of each batch.\n\n        Returns:\n            Dict[str, float]: The computed metrics. 
The keys are the names of\n            the metrics, and the values are corresponding results.\n        \"\"\"\n        logger: MMLogger = MMLogger.get_current_instance()\n        ann_infos = []\n        pred_results = []\n\n        for eval_ann, sinlge_pred_results in results:\n            ann_infos.append(eval_ann)\n            pred_results.append(sinlge_pred_results)\n\n        # some checkpoints may not record the key \"box_type_3d\"\n        box_type_3d, box_mode_3d = get_box_type(\n            self.dataset_meta.get('box_type_3d', 'depth'))\n\n        print_log(f\"eval : {len(ann_infos)}, {len(pred_results)}, rank: {get_rank()}\")\n        ret_dict = indoor_eval(ann_infos,\n                               pred_results,\n                               self.iou_thr,\n                               self.dataset_meta['classes'],\n                               logger=logger,\n                               box_mode_3d=box_mode_3d,\n                               classes_split=self.dataset_meta.get(\n                                   'classes_split', None) if split else None,\n                               part=part)\n\n        return ret_dict\n\n    def evaluate(self, size: int):\n        \"\"\"Evaluate the model performance of the whole dataset after processing\n        all batches.\n\n        Args:\n            size (int): Length of the entire validation dataset. When batch\n                size > 1, the dataloader may pad some data samples to make\n                sure all ranks have the same length of dataset slice. The\n                ``collect_results`` function will drop the padded data based on\n                this size.\n\n        Returns:\n            dict: Evaluation metrics dict on the val dataset. The keys are the\n            names of the metrics, and the values are corresponding results.\n        \"\"\"\n        if len(self.results) == 0:\n            print_log(\n                f'{self.__class__.__name__} got empty `self.results`. 
Please '\n                'ensure that the processed results are properly added into '\n                '`self.results` in `process` method.',\n                logger='current',\n                level=logging.WARNING)\n        print_log(\n            f\"number of results: \"\n            f\"{len(self.results)}, size: {size}, \"\n            f\"batchwise_anns: {self.batchwise_anns}, rank: {get_rank()}, \"\n            f\"collect_dir: {self.collect_dir}\"\n        )\n        if self.batchwise_anns:\n            # the actual dataset length/size is the len(self.results)\n            if self.collect_device == 'cpu':\n                results = collect_results(self.results,\n                                          len(self.results),\n                                          self.collect_device,\n                                          tmpdir=self.collect_dir)\n            else:\n                results = collect_results(self.results, len(self.results),\n                                          self.collect_device)\n        else:\n            if self.collect_device == 'cpu':\n                results = collect_results(self.results,\n                                          size,\n                                          self.collect_device,\n                                          tmpdir=self.collect_dir)\n            else:\n                results = collect_results(self.results, size,\n                                          self.collect_device)\n\n        print_log(\n            f\"number of collected results: \"\n            f\"{len(results) if results is not None else 'None'},\"\n            f\"rank: {get_rank()}\"\n        )\n        if is_main_process():\n            # cast all tensors in results list to cpu\n            results = _to_cpu(results)\n            # import pdb; pdb.set_trace()\n            if self.save_result_path is not None:\n                import pickle\n                import copy\n                _ret = []\n                for x in results:\n                    x = list(copy.deepcopy(x))\n                    x[0][\"gt_bboxes_3d\"] = x[0][\"gt_bboxes_3d\"].cpu().numpy()\n                    x[1][\"bboxes_3d\"] = x[1][\"bboxes_3d\"].cpu().numpy()\n                    x[1][\"scores_3d\"] = x[1][\"scores_3d\"].cpu().numpy()\n                    x[1][\"labels_3d\"] = x[1][\"labels_3d\"].cpu().numpy()\n                    x[1][\"target_scores_3d\"] = x[1][\"target_scores_3d\"].cpu().numpy()\n                    _ret.append(x)\n                pickle.dump(_ret, open(os.path.join(self.save_result_path, \"results.pkl\"), \"wb\"))\n\n            _metrics = self.compute_metrics(results)  # type: ignore\n            for part in self.eval_part:\n            # for part in []:\n                print(f\"=================== {part} =========================\")\n                part_ret = [x for x in results if part in x[0][\"scan_id\"]]\n                if len(part_ret) > 0:\n                    _metrics_part = self.compute_metrics(part_ret, part=part)\n                    _metrics_part = {f\"{part}--{k}\": v for k, v in _metrics_part.items()}\n                    _metrics.update(_metrics_part)\n            # Add prefix to metric names\n            if self.prefix:\n                _metrics = {\n                    '/'.join((self.prefix, k)): v\n                    for k, v in _metrics.items()\n                }\n            metrics = [_metrics]\n        else:\n            metrics = [None]  # type: ignore\n\n        broadcast_object_list(metrics)\n\n        if is_main_process():\n 
           table = []\n            header = [\"Summary\"]\n            for i, iou_thresh in enumerate(self.iou_thr):\n                header.append(f'AP_{iou_thresh:.2f}')\n                header.append(f'AR_{iou_thresh:.2f}')\n            table.append(header)\n\n            for part in self.eval_part + [\"Overall\"]:\n                table_data = [part]\n                for i, iou_thresh in enumerate(self.iou_thr):\n                    key = f\"{part+'--' if part!='Overall' else ''}mAP_{iou_thresh:.2f}\"\n                    if key in _metrics:\n                        table_data.append(f\"{_metrics[key]:.4f}\")\n                    else:\n                        table_data.append(f\"0.xxxx\")\n                    key = f\"{part+'--' if part!='Overall' else ''}mAR_{iou_thresh:.2f}\"\n                    if key in _metrics:\n                        table_data.append(f\"{_metrics[key]:.4f}\")\n                    else:\n                        table_data.append(f\"0.xxxx\")\n                table.append(table_data)\n            table = AsciiTable(table)\n            table.inner_footing_row_border = True\n            logger: MMLogger = MMLogger.get_current_instance()\n            print_log('\\n' + table.table, logger=logger)\n\n        # reset the results list\n        self.results.clear()\n        return metrics[0]\n\n\n@METRICS.register_module()\nclass Indoor2DMetric(BaseMetric):\n    \"\"\"indoor 2d predictions evaluation metric.\n\n    Args:\n        iou_thr (float or List[float]): List of iou threshold when calculate\n            the metric. Defaults to [0.5].\n        collect_device (str): Device name used for collecting results from\n            different ranks during distributed training. Must be 'cpu' or\n            'gpu'. Defaults to 'cpu'.\n        prefix (str, optional): The prefix that will be added in the metric\n            names to disambiguate homonymous metrics of different evaluators.\n            If prefix is not provided in the argument, self.default_prefix will\n            be used instead. 
Defaults to None.\n    \"\"\"\n\n    def __init__(self,\n                 iou_thr: Union[float, List[float]] = [0.5],\n                 collect_device: str = 'cpu',\n                 prefix: Optional[str] = None):\n        super(Indoor2DMetric, self).__init__(prefix=prefix,\n                                             collect_device=collect_device)\n        self.iou_thr = [iou_thr] if isinstance(iou_thr, float) else iou_thr\n\n    def process(self, data_batch: dict, data_samples: Sequence[dict]) -> None:\n        \"\"\"Process one batch of data samples and predictions.\n\n        The processed results should be stored in ``self.results``, which will\n        be used to compute the metrics when all batches have been processed.\n\n        Args:\n            data_batch (dict): A batch of data from the dataloader.\n            data_samples (Sequence[dict]): A batch of outputs from the model.\n        \"\"\"\n        for data_sample in data_samples:\n            pred = data_sample['pred_instances']\n            eval_ann_info = data_sample['eval_ann_info']\n            ann = dict(labels=eval_ann_info['gt_bboxes_labels'],\n                       bboxes=eval_ann_info['gt_bboxes'])\n\n            pred_bboxes = pred['bboxes'].cpu().numpy()\n            pred_scores = pred['scores'].cpu().numpy()\n            pred_labels = pred['labels'].cpu().numpy()\n\n            dets = []\n            for label in range(len(self.dataset_meta['classes'])):\n                index = np.where(pred_labels == label)[0]\n                pred_bbox_scores = np.hstack(\n                    [pred_bboxes[index], pred_scores[index].reshape((-1, 1))])\n                dets.append(pred_bbox_scores)\n\n            self.results.append((ann, dets))\n\n    def compute_metrics(self, results: list) -> Dict[str, float]:\n        \"\"\"Compute the metrics from processed results.\n\n        Args:\n            results (list): The processed results of each batch.\n\n        Returns:\n            Dict[str, float]: The computed metrics. The keys are the names of\n            the metrics, and the values are corresponding results.\n        \"\"\"\n        logger: MMLogger = MMLogger.get_current_instance()\n        annotations, preds = zip(*results)\n        eval_results = OrderedDict()\n        for iou_thr_2d_single in self.iou_thr:\n            mean_ap, _ = eval_map(preds,\n                                  annotations,\n                                  scale_ranges=None,\n                                  iou_thr=iou_thr_2d_single,\n                                  dataset=self.dataset_meta['classes'],\n                                  logger=logger)\n            eval_results['mAP_' + str(iou_thr_2d_single)] = mean_ap\n        return eval_results\n"
  },
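  {
    "path": "examples/det_metric_process_demo.py",
    "content": "\"\"\"Illustrative sketch, not part of the original release: the per-sample\nschema that ``IndoorDetMetric.process`` expects. Real runs carry box\nstructures (e.g. ``EulerDepthInstance3DBoxes``) instead of raw tensors;\nplain tensors are enough to show the CPU offloading and queueing.\"\"\"\nimport torch\n\nfrom bip3d.eval.metrics.det_metric import IndoorDetMetric\n\nmetric = IndoorDetMetric(iou_thr=[0.25, 0.5])\n\ndata_sample = dict(\n    scan_id=\"scannet/scene0000_00\",\n    eval_ann_info=dict(),\n    pred_instances_3d=dict(\n        scores_3d=torch.rand(5),\n        labels_3d=torch.zeros(5, dtype=torch.long),\n    ),\n)\nmetric.process(data_batch={}, data_samples=[data_sample])\n# Each sample is queued as an (annotation, prediction) pair on the CPU\n# until ``evaluate`` collects results across ranks.\nassert len(metric.results) == 1\n"
  },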
  {
    "path": "bip3d/eval/metrics/grounding_metric.py",
    "content": "# Copyright (c) OpenRobotLab. All rights reserved.\nimport os\nfrom typing import Dict, List, Optional, Sequence\nimport pickle\n\nimport mmengine\nfrom mmengine.evaluator import BaseMetric\nfrom mmengine.logging import MMLogger, print_log\nfrom terminaltables import AsciiTable\n\nfrom bip3d.registry import METRICS\nfrom bip3d.structures import EulerDepthInstance3DBoxes\n\n\n@METRICS.register_module()\nclass GroundingMetric(BaseMetric):\n    \"\"\"Lanuage grounding evaluation metric. We calculate the grounding\n    performance based on the alignment score of each bbox with the input\n    prompt.\n\n    Args:\n        iou_thr (float or List[float]): List of iou threshold when calculate\n            the metric. Defaults to [0.25, 0.5].\n        collect_device (str): Device name used for collecting results from\n            different ranks during distributed training. Must be 'cpu' or\n            'gpu'. Defaults to 'cpu'.\n        prefix (str, optional): The prefix that will be added in the metric\n            names to disambiguate homonymous metrics of different evaluators.\n            If prefix is not provided in the argument, self.default_prefix will\n            be used instead. Defaults to None.\n        format_only (bool): Whether to only inference the predictions without\n            evaluation. Defaults to False.\n        result_dir (str): Dir to save results, e.g., if result_dir = './',\n            the result file will be './test_results.json'. Defaults to ''.\n    \"\"\"\n\n    def __init__(self,\n                 iou_thr: List[float] = [0.25, 0.5],\n                 collect_device: str = 'cpu',\n                 prefix: Optional[str] = None,\n                 format_only=False,\n                 submit_info=None,\n                 result_dir='', **kwargs) -> None:\n        super(GroundingMetric, self).__init__(prefix=prefix,\n                                              collect_device=collect_device,\n                                              **kwargs)\n        self.iou_thr = [iou_thr] if isinstance(iou_thr, float) else iou_thr\n        self.prefix = prefix\n        self.format_only = format_only\n        self.result_dir = result_dir\n        self.submit_info = submit_info if submit_info is not None else {}\n\n    def process(self, data_batch: dict, data_samples: Sequence[dict]) -> None:\n        \"\"\"Process one batch of data samples and predictions.\n\n        The processed results should be stored in ``self.results``, which will\n        be used to compute the metrics when all batches have been processed.\n\n        Args:\n            data_batch (dict): A batch of data from the dataloader.\n            data_samples (Sequence[dict]): A batch of outputs from the model.\n        \"\"\"\n        for data_sample in data_samples:\n            pred_3d = data_sample['pred_instances_3d']\n            eval_ann_info = data_sample['eval_ann_info']\n            eval_ann_info[\"scan_id\"] = data_sample[\"scan_id\"]\n            cpu_pred_3d = dict()\n            for k, v in pred_3d.items():\n                if hasattr(v, 'to'):\n                    cpu_pred_3d[k] = v.to('cpu')\n                else:\n                    cpu_pred_3d[k] = v\n            self.results.append((eval_ann_info, cpu_pred_3d))\n\n    def ground_eval(self, gt_annos, det_annos, logger=None, part=\"Overall\"):\n\n        assert len(det_annos) == len(gt_annos)\n\n        pred = {}\n        gt = {}\n\n        object_types = [\n            'Easy', 'Hard', 'View-Dep', 'View-Indep', 'Unique', 'Multi',\n 
           'Overall'\n        ]\n\n        for t in self.iou_thr:\n            for object_type in object_types:\n                pred.update({object_type + '@' + str(t): 0})\n                gt.update({object_type + '@' + str(t): 1e-14})\n\n        for sample_id in range(len(det_annos)):\n            det_anno = det_annos[sample_id]\n            gt_anno = gt_annos[sample_id]\n            target_scores = det_anno['target_scores_3d']  # (num_query, )\n\n            bboxes = det_anno['bboxes_3d']\n            gt_bboxes = gt_anno['gt_bboxes_3d']\n            bboxes = EulerDepthInstance3DBoxes(bboxes.tensor,\n                                               origin=(0.5, 0.5, 0.5))\n            gt_bboxes = EulerDepthInstance3DBoxes(gt_bboxes.tensor,\n                                                  origin=(0.5, 0.5, 0.5))\n\n            view_dep = gt_anno['is_view_dep']\n            hard = gt_anno['is_hard']\n            unique = gt_anno['is_unique']\n\n            box_index = target_scores.argsort(dim=-1, descending=True)[:10]\n            top_bbox = bboxes[box_index]\n\n            iou = top_bbox.overlaps(top_bbox, gt_bboxes)  # (num_query, 1)\n\n            for t in self.iou_thr:\n                threshold = iou > t\n                found = int(threshold.any())\n                if view_dep:\n                    gt['View-Dep@' + str(t)] += 1\n                    pred['View-Dep@' + str(t)] += found\n                else:\n                    gt['View-Indep@' + str(t)] += 1\n                    pred['View-Indep@' + str(t)] += found\n                if hard:\n                    gt['Hard@' + str(t)] += 1\n                    pred['Hard@' + str(t)] += found\n                else:\n                    gt['Easy@' + str(t)] += 1\n                    pred['Easy@' + str(t)] += found\n                if unique:\n                    gt['Unique@' + str(t)] += 1\n                    pred['Unique@' + str(t)] += found\n                else:\n                    gt['Multi@' + str(t)] += 1\n                    pred['Multi@' + str(t)] += found\n\n                gt['Overall@' + str(t)] += 1\n                pred['Overall@' + str(t)] += found\n\n        header = ['Type']\n        header.extend(object_types)\n        ret_dict = {}\n\n        for t in self.iou_thr:\n            if part == \"Overall\":\n                table_columns = [[f'AP@{t:.2f}']]\n            else:\n                table_columns = [[f'{part}_AP@{t:.2f}']]\n\n            for object_type in object_types:\n                metric = object_type + '@' + str(t)\n                value = pred[metric] / max(gt[metric], 1)\n                ret_dict[metric] = value\n                table_columns.append([f'{value:.4f}'])\n\n            table_data = [header]\n            table_rows = list(zip(*table_columns))\n            table_data += table_rows\n            table = AsciiTable(table_data)\n            table.inner_footing_row_border = True\n            print_log('\\n' + table.table, logger=logger)\n\n        return ret_dict\n\n    def compute_metrics(self, results: list) -> Dict[str, float]:\n        \"\"\"Compute the metrics from processed results after all batches have\n        been processed.\n\n        Args:\n            results (list): The processed results of each batch.\n\n        Returns:\n            Dict[str, float]: The computed metrics. 
The keys are the names of\n            the metrics, and the values are corresponding results.\n        \"\"\"\n        logger: MMLogger = MMLogger.get_current_instance()  # noqa\n        annotations, preds = zip(*results)\n        ret_dict = {}\n        if self.format_only:\n            # preds is a list of dict\n            results = []\n            for pred in preds:\n                result = dict()\n                # convert the Euler boxes to the numpy array to save\n                bboxes_3d = pred['bboxes_3d'].tensor\n                scores_3d = pred['scores_3d']\n                # Note: hard-code save top-20 predictions\n                # eval top-10 predictions during the test phase by default\n                box_index = scores_3d.argsort(dim=-1, descending=True)[:20]\n                top_bboxes_3d = bboxes_3d[box_index]\n                top_scores_3d = scores_3d[box_index]\n                result['bboxes_3d'] = top_bboxes_3d.numpy()\n                result['scores_3d'] = top_scores_3d.numpy()\n                results.append(result)\n            mmengine.dump(results,\n                          os.path.join(self.result_dir, 'test_results.json'))\n            for x in results:\n                x[\"bboxes_3d\"] = x[\"bboxes_3d\"].tolist()\n                x[\"scores_3d\"] = x[\"scores_3d\"].tolist()\n            submission = {\n                'method': 'xxx',\n                'team': 'xxx',\n                'authors': ['xxx'],\n                'e-mail': 'xxx',\n                'institution': 'xxx',\n                'country': 'xxx',\n                'results': results,\n            }\n            submission.update(self.submit_info)\n            pickle.dump(submission, open(os.path.join(self.result_dir, 'submission.pkl'), \"wb\"))\n            return ret_dict\n\n        ret_dict = self.ground_eval(annotations, preds)\n        for part in [\"scannet\", \"3rscan\", \"matterport3d\", \"arkit\"]:\n            ann = [x for x in annotations if part in x[\"scan_id\"]]\n            pred = [y for x,y in zip(annotations, preds) if part in x[\"scan_id\"]]\n            if len(pred) > 0:\n                part_ret_dict= self.ground_eval(ann, pred, part=part)\n                part_ret_dict = {f\"{part}--{k}\": v for k, v in part_ret_dict.items()}\n                ret_dict.update(part_ret_dict)\n\n        return ret_dict\n"
  },
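  {
    "path": "examples/grounding_score_demo.py",
    "content": "\"\"\"Illustrative sketch, not part of the original release: the scoring rule\ninside ``GroundingMetric.ground_eval``. A sample counts as correct at IoU\nthreshold ``t`` when any of the top-10 boxes (ranked by\n``target_scores_3d``) overlaps the ground truth with IoU above ``t``.\nPlain tensors stand in for the Euler box structures here.\"\"\"\nimport torch\n\ntarget_scores = torch.tensor([0.1, 0.9, 0.3])\n# Fake IoU of each predicted box against the single ground-truth box.\nious = torch.tensor([[0.1], [0.6], [0.2]])\n\ntop = target_scores.argsort(dim=-1, descending=True)[:10]\niou_top = ious[top]\n\nfor t in [0.25, 0.5]:\n    found = int((iou_top > t).any())\n    print(f\"Acc@{t}: {found}\")  # box 1 (IoU 0.6) clears both thresholds\n"
  },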
  {
    "path": "bip3d/grid_mask.py",
    "content": "import torch\nimport torch.nn as nn\nimport numpy as np\nfrom PIL import Image\n\n\nclass Grid(object):\n    def __init__(\n        self, use_h, use_w, rotate=1, offset=False, ratio=0.5, mode=0, prob=1.0\n    ):\n        self.use_h = use_h\n        self.use_w = use_w\n        self.rotate = rotate\n        self.offset = offset\n        self.ratio = ratio\n        self.mode = mode\n        self.st_prob = prob\n        self.prob = prob\n\n    def set_prob(self, epoch, max_epoch):\n        self.prob = self.st_prob * epoch / max_epoch\n\n    def __call__(self, img, label):\n        if np.random.rand() > self.prob:\n            return img, label\n        h = img.size(1)\n        w = img.size(2)\n        self.d1 = 2\n        self.d2 = min(h, w)\n        hh = int(1.5 * h)\n        ww = int(1.5 * w)\n        d = np.random.randint(self.d1, self.d2)\n        if self.ratio == 1:\n            self.l = np.random.randint(1, d)\n        else:\n            self.l = min(max(int(d * self.ratio + 0.5), 1), d - 1)\n        mask = np.ones((hh, ww), np.float32)\n        st_h = np.random.randint(d)\n        st_w = np.random.randint(d)\n        if self.use_h:\n            for i in range(hh // d):\n                s = d * i + st_h\n                t = min(s + self.l, hh)\n                mask[s:t, :] *= 0\n        if self.use_w:\n            for i in range(ww // d):\n                s = d * i + st_w\n                t = min(s + self.l, ww)\n                mask[:, s:t] *= 0\n\n        r = np.random.randint(self.rotate)\n        mask = Image.fromarray(np.uint8(mask))\n        mask = mask.rotate(r)\n        mask = np.asarray(mask)\n        mask = mask[\n            (hh - h) // 2 : (hh - h) // 2 + h,\n            (ww - w) // 2 : (ww - w) // 2 + w,\n        ]\n\n        mask = torch.from_numpy(mask).float()\n        if self.mode == 1:\n            mask = 1 - mask\n\n        mask = mask.expand_as(img)\n        if self.offset:\n            offset = torch.from_numpy(2 * (np.random.rand(h, w) - 0.5)).float()\n            offset = (1 - mask) * offset\n            img = img * mask + offset\n        else:\n            img = img * mask\n\n        return img, label\n\n\nclass GridMask(object):\n    def __init__(\n        self, use_h, use_w, rotate=1, offset=False, ratio=0.5, mode=0, prob=1.0\n    ):\n        super(GridMask, self).__init__()\n        self.use_h = use_h\n        self.use_w = use_w\n        self.rotate = rotate\n        self.offset = offset\n        self.ratio = ratio\n        self.mode = mode\n        self.st_prob = prob\n        self.prob = prob\n\n    def set_prob(self, epoch, max_epoch):\n        self.prob = self.st_prob * epoch / max_epoch  # + 1.#0.5\n\n    def __call__(self, x, offset=None):\n        if np.random.rand() > self.prob:\n            return x\n        n, c, h, w = x.size()\n        x = x.view(-1, h, w)\n        hh = int(1.5 * h)\n        ww = int(1.5 * w)\n        d = np.random.randint(2, h)\n        self.l = min(max(int(d * self.ratio + 0.5), 1), d - 1)\n        mask = np.ones((hh, ww), np.float32)\n        st_h = np.random.randint(d)\n        st_w = np.random.randint(d)\n        if self.use_h:\n            for i in range(hh // d):\n                s = d * i + st_h\n                t = min(s + self.l, hh)\n                mask[s:t, :] *= 0\n        if self.use_w:\n            for i in range(ww // d):\n                s = d * i + st_w\n                t = min(s + self.l, ww)\n                mask[:, s:t] *= 0\n\n        r = np.random.randint(self.rotate)\n        mask = 
Image.fromarray(np.uint8(mask))\n        mask = mask.rotate(r)\n        mask = np.asarray(mask)\n        mask = mask[\n            (hh - h) // 2 : (hh - h) // 2 + h,\n            (ww - w) // 2 : (ww - w) // 2 + w,\n        ]\n\n        mask = torch.tensor(mask).float().cuda()\n        if self.mode == 1:\n            mask = 1 - mask\n        mask = mask.expand_as(x)\n        if offset is not None:\n            x = x.view(n, c, h, w)\n            mask = mask.view(n, c, h, w)\n            x = x * mask + offset * (1 - mask)\n            return x\n        elif self.offset:\n            offset = (\n                torch.from_numpy(2 * (np.random.rand(h, w) - 0.5))\n                .float()\n                .cuda()\n            )\n            x = x * mask + offset * (1 - mask)\n        else:\n            x = x * mask\n\n        return x.view(n, c, h, w)\n"
  },
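  {
    "path": "examples/grid_mask_demo.py",
    "content": "\"\"\"Minimal usage sketch for `bip3d.grid_mask.GridMask`.\n\nIllustrative only: the file path, tensor shapes, and hyper-parameters are\nassumptions, and the augmentation is stochastic, so outputs vary run to run.\n\"\"\"\nimport torch\n\nfrom bip3d.grid_mask import GridMask\n\nif __name__ == \"__main__\":\n    # A toy batch of images in (N, C, H, W) layout on CPU.\n    images = torch.rand(2, 3, 224, 224)\n\n    # Mask rows and columns, keeping roughly half of every grid period;\n    # prob=1.0 means the mask is always applied (before any ramping).\n    grid_mask = GridMask(\n        use_h=True, use_w=True, rotate=1, ratio=0.5, mode=0, prob=1.0\n    )\n\n    # set_prob linearly ramps the apply-probability over training:\n    # here prob becomes 1.0 * 5 / 10 = 0.5.\n    grid_mask.set_prob(epoch=5, max_epoch=10)\n\n    masked = grid_mask(images)\n    print(masked.shape)  # torch.Size([2, 3, 224, 224])\n"
  },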
  {
    "path": "bip3d/models/__init__.py",
    "content": "from .structure import *\nfrom .feature_enhancer import *\nfrom .spatial_enhancer import *\nfrom .bbox3d_decoder import *\nfrom .instance_bank import *\nfrom .target import *\nfrom .data_preprocessors import *\nfrom .deformable_aggregation import *\nfrom .bert import *\n"
  },
  {
    "path": "bip3d/models/base_target.py",
    "content": "from torch import nn\nfrom abc import ABC, abstractmethod\n\n\n__all__ = [\"BaseTargetWithDenoising\"]\n\n\nclass BaseTargetWithDenoising(ABC, nn.Module):\n    def __init__(self, num_dn=0, num_temp_dn_groups=0):\n        super(BaseTargetWithDenoising, self).__init__()\n        self.num_dn = num_dn\n        self.num_temp_dn_groups = num_temp_dn_groups\n        self.dn_metas = None\n\n    @abstractmethod\n    def sample(self, cls_pred, box_pred, cls_target, box_target):\n        \"\"\"\n        Perform Hungarian matching between predictions and ground truth,\n        returning the matched ground truth corresponding to the predictions\n        along with the corresponding regression weights.\n        \"\"\"\n\n    def get_dn_anchors(self, cls_target, box_target, *args, **kwargs):\n        \"\"\"\n        Generate noisy instances for the current frame, with a total of\n        'self.num_dn_groups' groups.\n        \"\"\"\n        return None\n\n    def update_dn(self, instance_feature, anchor, *args, **kwargs):\n        \"\"\"\n        Insert the previously saved 'self.dn_metas' into the noisy instances\n        of the current frame.\n        \"\"\"\n\n    def cache_dn(\n        self,\n        dn_instance_feature,\n        dn_anchor,\n        dn_cls_target,\n        valid_mask,\n        dn_id_target,\n    ):\n        \"\"\"\n        Randomly save information for 'self.num_temp_dn_groups' groups of\n        temporal noisy instances to 'self.dn_metas'.\n        \"\"\"\n        if self.num_temp_dn_groups < 0:\n            return\n        self.dn_metas = dict(dn_anchor=dn_anchor[:, : self.num_temp_dn_groups])\n"
  },
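  {
    "path": "examples/base_target_sketch.py",
    "content": "\"\"\"Sketch of subclassing `BaseTargetWithDenoising` from\n`bip3d.models.base_target`.\n\nThe nearest-center assignment below is a toy stand-in for the Hungarian\nmatching that `sample` is documented to perform; the file path, shapes, and\nthe 4-tuple return layout are illustrative assumptions.\n\"\"\"\nimport torch\n\nfrom bip3d.models.base_target import BaseTargetWithDenoising\n\n\nclass NearestCenterTarget(BaseTargetWithDenoising):\n    \"\"\"Toy sampler: match each prediction to the closest GT box center.\"\"\"\n\n    def sample(self, cls_pred, box_pred, cls_target, box_target):\n        # box_pred: (num_pred, 9), box_target: (num_gt, 9) 9-DoF boxes.\n        dist = torch.cdist(box_pred[:, :3], box_target[:, :3])\n        match = dist.argmin(dim=1)\n        reg_weights = torch.ones_like(box_target[match])\n        # The trailing None mirrors the ignore-mask slot that the decoder's\n        # loss expects from the concrete samplers.\n        return cls_target[match], box_target[match], reg_weights, None\n\n\nif __name__ == \"__main__\":\n    target = NearestCenterTarget(num_dn=0, num_temp_dn_groups=0)\n    box_pred, box_gt = torch.rand(5, 9), torch.rand(2, 9)\n    labels = torch.tensor([1, 3])\n    cls_t, box_t, weights, _ = target.sample(None, box_pred, labels, box_gt)\n    print(cls_t.shape, box_t.shape, weights.shape)  # (5,) (5, 9) (5, 9)\n"
  },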
  {
    "path": "bip3d/models/bbox3d_decoder.py",
    "content": "import math\nfrom typing import List, Optional, Tuple, Union\n\nimport numpy as np\nimport torch\nfrom torch import nn\nfrom torch.cuda.amp.autocast_mode import autocast\n\nfrom mmcv.cnn import Linear, Scale\nfrom mmcv.ops import nms3d_normal, nms3d\nfrom mmengine.model import BaseModel\nfrom mmengine.structures import InstanceData\nfrom mmdet.models.layers import SinePositionalEncoding\nfrom mmdet.models.layers.transformer.deformable_detr_layers import (\n    DeformableDetrTransformerEncoder as DDTE,\n)\nfrom mmdet.models.layers.transformer.utils import MLP\n\nfrom mmdet.utils import reduce_mean\nfrom mmdet.models.detectors.glip import create_positive_map_label_to_token\nfrom mmdet.models.dense_heads.atss_vlfusion_head import (\n    convert_grounding_to_cls_scores,\n)\n\nfrom bip3d.registry import MODELS, TASK_UTILS\nfrom bip3d.structures.bbox_3d.utils import rotation_3d_in_euler\nfrom .utils import (\n    deformable_format,\n    wasserstein_distance,\n    permutation_corner_distance,\n    center_distance,\n    get_positive_map,\n    get_entities,\n    linear_act_ln,\n)\n\n__all__ = [\"BBox3DDecoder\"]\n\n\nX, Y, Z, W, L, H, ALPHA, BETA, GAMMA = range(9)\n\n\ndef decode_box(box, min_size=None, max_size=None):\n    size = box[..., 3:6].exp()\n    # size = box[..., 3:6]\n    if min_size is not None or max_size is not None:\n        size = size.clamp(min=min_size, max=max_size)\n    box = torch.cat(\n        [box[..., :3], size, box[..., 6:]],\n        dim=-1,\n    )\n    return box\n\n\n@MODELS.register_module()\nclass BBox3DDecoder(BaseModel):\n    def __init__(\n        self,\n        instance_bank: dict,\n        anchor_encoder: dict,\n        graph_model: dict,\n        norm_layer: dict,\n        ffn: dict,\n        deformable_model: dict,\n        refine_layer: dict,\n        num_decoder: int = 6,\n        num_single_frame_decoder: int = -1,\n        temp_graph_model: dict = None,\n        text_cross_attn: dict = None,\n        loss_cls: dict = None,\n        loss_reg: dict = None,\n        post_processor: dict = None,\n        sampler: dict = None,\n        gt_cls_key: str = \"gt_labels_3d\",\n        gt_reg_key: str = \"gt_bboxes_3d\",\n        gt_id_key: str = \"instance_id\",\n        with_instance_id: bool = True,\n        task_prefix: str = \"det\",\n        reg_weights: List = None,\n        operation_order: Optional[List[str]] = None,\n        cls_threshold_to_reg: float = -1,\n        look_forward_twice: bool = False,\n        init_cfg: dict = None,\n        **kwargs,\n    ):\n        super().__init__(init_cfg)\n        self.num_decoder = num_decoder\n        self.num_single_frame_decoder = num_single_frame_decoder\n        self.gt_cls_key = gt_cls_key\n        self.gt_reg_key = gt_reg_key\n        self.gt_id_key = gt_id_key\n        self.with_instance_id = with_instance_id\n        self.task_prefix = task_prefix\n        self.cls_threshold_to_reg = cls_threshold_to_reg\n        self.look_forward_twice = look_forward_twice\n\n        if reg_weights is None:\n            self.reg_weights = [1.0] * 9\n        else:\n            self.reg_weights = reg_weights\n\n        if operation_order is None:\n            operation_order = [\n                \"gnn\",\n                \"norm\",\n                \"text_cross_attn\",\n                \"norm\",\n                \"deformable\",\n                \"norm\",\n                \"ffn\",\n                \"norm\",\n                \"refine\",\n            ] * num_decoder\n        self.operation_order = operation_order\n\n 
        # =========== build modules ===========\n        def build(cfg, registry=MODELS):\n            if cfg is None:\n                return None\n            return registry.build(cfg)\n\n        self.instance_bank = build(instance_bank)\n        self.anchor_encoder = build(anchor_encoder)\n        self.sampler = build(sampler, TASK_UTILS)\n        self.post_processor = build(post_processor, TASK_UTILS)\n        self.loss_cls = build(loss_cls)\n        self.loss_reg = build(loss_reg)\n        self.op_config_map = {\n            \"temp_gnn\": temp_graph_model,\n            \"gnn\": graph_model,\n            \"norm\": norm_layer,\n            \"ffn\": ffn,\n            \"deformable\": deformable_model,\n            \"text_cross_attn\": text_cross_attn,\n            \"refine\": refine_layer,\n        }\n        self.layers = nn.ModuleList(\n            [\n                build(self.op_config_map.get(op, None))\n                for op in self.operation_order\n            ]\n        )\n        self.embed_dims = self.instance_bank.embed_dims\n        self.norm = nn.LayerNorm(self.embed_dims)\n\n    def init_weights(self):\n        from mmengine.model import constant_init\n        for p in self.parameters():\n            if p.dim() > 1:\n                nn.init.xavier_uniform_(p)\n        for i, op in enumerate(self.operation_order):\n            if op == \"refine\":\n                m = self.layers[i]\n                constant_init(m.layers[-2], 0, bias=0)\n                constant_init(m.layers[-1], 1)\n                nn.init.constant_(m.layers[-2].bias.data[2:], 0.0)\n\n    def forward(\n        self,\n        feature_maps,\n        text_dict=None,\n        batch_inputs=None,\n        depth_prob=None,\n        **kwargs,\n    ):\n        batch_size = feature_maps[0].shape[0]\n        feature_maps = list(deformable_format(feature_maps))\n\n        # ========= get instance info ============\n        if (\n            self.sampler.dn_metas is not None\n            and self.sampler.dn_metas[\"dn_anchor\"].shape[0] != batch_size\n        ):\n            self.sampler.dn_metas = None\n        (\n            instance_feature,\n            anchor,\n            temp_instance_feature,\n            temp_anchor,\n            time_interval,\n        ) = self.instance_bank.get(\n            batch_size, batch_inputs, dn_metas=self.sampler.dn_metas\n        )\n\n        # ========= prepare for denoising training ============\n        # 1. get dn metas: noisy-anchors and corresponding GT\n        # 2. concat learnable instances and noisy instances\n        # 3. get attention mask\n        attn_mask = None\n        dn_metas = None\n        temp_dn_reg_target = None\n        if self.training and hasattr(self.sampler, \"get_dn_anchors\"):\n            dn_metas = self.sampler.get_dn_anchors(\n                batch_inputs[self.gt_cls_key],\n                batch_inputs[self.gt_reg_key],\n                text_dict=text_dict,\n                label=batch_inputs[\"gt_labels_3d\"],\n            )\n        if dn_metas is not None:\n            (\n                dn_anchor,\n                dn_reg_target,\n                dn_cls_target,\n                dn_attn_mask,\n                valid_mask,\n                dn_query,\n            ) = dn_metas\n            num_dn_anchor = dn_anchor.shape[1]\n            if dn_anchor.shape[-1] != anchor.shape[-1]:\n                remain_state_dims = anchor.shape[-1] - dn_anchor.shape[-1]\n                dn_anchor = torch.cat(\n                    [\n                        dn_anchor,\n                        dn_anchor.new_zeros(\n                            batch_size, num_dn_anchor, remain_state_dims\n                        ),\n                    ],\n                    dim=-1,\n                )\n            anchor = torch.cat([anchor, dn_anchor], dim=1)\n            if dn_query is None:\n                dn_query = instance_feature.new_zeros(\n                    batch_size, num_dn_anchor, instance_feature.shape[-1]\n                )\n            instance_feature = torch.cat(\n                [instance_feature, dn_query], dim=1,\n            )\n            num_instance = instance_feature.shape[1]\n            num_free_instance = num_instance - num_dn_anchor\n            attn_mask = anchor.new_ones(\n                (num_instance, num_instance), dtype=torch.bool\n            )\n            attn_mask[:num_free_instance, :num_free_instance] = False\n            attn_mask[num_free_instance:, num_free_instance:] = dn_attn_mask\n        else:\n            num_dn_anchor = None\n            num_free_instance = None\n\n        anchor_embed = self.anchor_encoder(anchor)\n        if temp_anchor is not None:\n            temp_anchor_embed = self.anchor_encoder(temp_anchor)\n        else:\n            temp_anchor_embed = None\n\n        # =================== forward the layers ====================\n        prediction = []\n        classification = []\n        quality = []\n        _anchor = None\n        text_feature = None\n        for i, (op, layer) in enumerate(\n            zip(self.operation_order, self.layers)\n        ):\n            if self.layers[i] is None:\n                continue\n            elif op == \"temp_gnn\":\n                instance_feature = layer(\n                    query=instance_feature,\n                    key=temp_instance_feature,\n                    value=temp_instance_feature,\n                    query_pos=anchor_embed,\n                    key_pos=temp_anchor_embed,\n                    attn_mask=(\n                        attn_mask if temp_instance_feature is None else None\n                    ),\n                )\n            elif op == \"gnn\":\n                instance_feature = layer(\n                    query=instance_feature,\n                    key=instance_feature,\n                    value=instance_feature,\n                    query_pos=anchor_embed,\n                    key_pos=anchor_embed,\n                    attn_mask=attn_mask,\n                )\n            elif op == \"norm\" or op == \"ffn\":\n                instance_feature = layer(instance_feature)\n            elif op == 
\"deformable\":\n                instance_feature = layer(\n                    instance_feature,\n                    anchor,\n                    anchor_embed,\n                    feature_maps,\n                    batch_inputs,\n                    depth_prob=depth_prob,\n                )\n            elif op == \"text_cross_attn\":\n                text_feature = text_dict[\"embedded\"]\n                instance_feature = layer(\n                    query=instance_feature,\n                    key=text_feature,\n                    value=text_feature,\n                    query_pos=anchor_embed,\n                    key_padding_mask=~text_dict[\"text_token_mask\"],\n                    key_pos=0,\n                )\n            elif op == \"refine\":\n                _instance_feature = self.norm(instance_feature)\n                if self.look_forward_twice:\n                    if _anchor is None:\n                        _anchor = anchor.clone()\n                    _anchor, cls, qt = layer(\n                        _instance_feature,\n                        _anchor,\n                        anchor_embed,\n                        time_interval=time_interval,\n                        text_feature=text_feature,\n                        text_token_mask=text_dict[\"text_token_mask\"],\n                    )\n                    prediction.append(_anchor)\n                    anchor = layer(\n                        instance_feature,\n                        anchor,\n                        anchor_embed,\n                        time_interval=time_interval,\n                    )[0]\n                    anchor_embed = self.anchor_encoder(anchor)\n                    _anchor = anchor\n                    anchor = anchor.detach()\n                else:\n                    anchor, cls, qt = layer(\n                        _instance_feature,\n                        anchor,\n                        anchor_embed,\n                        time_interval=time_interval,\n                        text_feature=text_feature,\n                        text_token_mask=text_dict[\"text_token_mask\"],\n                    )\n                    anchor_embed = self.anchor_encoder(anchor)\n                    prediction.append(anchor)\n                classification.append(cls)\n                quality.append(qt)\n\n                if len(prediction) == self.num_single_frame_decoder:\n                    instance_feature, anchor = self.instance_bank.update(\n                        instance_feature, anchor if _anchor is None else _anchor,\n                        cls, num_dn_anchor\n                    )\n                    anchor_embed = self.anchor_encoder(anchor)\n                    if self.look_forward_twice:\n                        _anchor = anchor\n                        anchor = anchor.detach()\n                    if dn_metas is not None:\n                        num_instance = instance_feature.shape[1]\n                        attn_mask = anchor.new_ones(\n                            (num_instance, num_instance), dtype=torch.bool\n                        )\n                        attn_mask[:-num_dn_anchor, :-num_dn_anchor] = False\n                        attn_mask[-num_dn_anchor:, -num_dn_anchor:] = dn_attn_mask\n\n                if (\n                    len(prediction) > self.num_single_frame_decoder\n                    and temp_anchor_embed is not None\n                ):\n                    temp_anchor_embed = anchor_embed[\n                        :, : 
self.instance_bank.num_temp_instances\n                    ]\n            else:\n                raise NotImplementedError(f\"{op} is not supported.\")\n\n        output = {}\n\n        # split predictions of learnable instances and noisy instances\n        if dn_metas is not None:\n            dn_classification = [\n                x[:, -num_dn_anchor:] for x in classification\n            ]\n            classification = [x[:, :-num_dn_anchor] for x in classification]\n            dn_prediction = [x[:, -num_dn_anchor:] for x in prediction]\n            prediction = [x[:, :-num_dn_anchor] for x in prediction]\n            quality = [\n                x[:, :-num_dn_anchor] if x is not None else None\n                for x in quality\n            ]\n            output.update(\n                {\n                    \"dn_prediction\": dn_prediction,\n                    \"dn_classification\": dn_classification,\n                    \"dn_reg_target\": dn_reg_target,\n                    \"dn_cls_target\": dn_cls_target,\n                    \"dn_valid_mask\": valid_mask,\n                }\n            )\n            if temp_dn_reg_target is not None:\n                output.update(\n                    {\n                        \"temp_dn_reg_target\": temp_dn_reg_target,\n                        \"temp_dn_cls_target\": temp_dn_cls_target,\n                        \"temp_dn_valid_mask\": temp_valid_mask,\n                    }\n                )\n                dn_cls_target = temp_dn_cls_target\n                valid_mask = temp_valid_mask\n            dn_instance_feature = instance_feature[:, -num_dn_anchor:]\n            dn_anchor = anchor[:, -num_dn_anchor:]\n            instance_feature = instance_feature[:, :-num_dn_anchor]\n            anchor_embed = anchor_embed[:, :-num_dn_anchor]\n            anchor = anchor[:, :-num_dn_anchor]\n            cls = cls[:, :-num_dn_anchor]\n\n        output.update(\n            {\n                \"classification\": classification,\n                \"prediction\": prediction,\n                \"quality\": quality,\n                \"instance_feature\": instance_feature,\n                \"anchor_embed\": anchor_embed,\n            }\n        )\n\n        # cache current instances for temporal modeling\n        self.instance_bank.cache(\n            instance_feature, anchor, cls, batch_inputs, feature_maps\n        )\n        if self.with_instance_id:\n            instance_id = self.instance_bank.get_instance_id(\n                cls, anchor, self.post_processor.score_threshold\n            )\n            output[\"instance_id\"] = instance_id\n        return output\n\n    def loss(self, model_outs, data, text_dict=None):\n        # ===================== prediction losses ======================\n        cls_scores = model_outs[\"classification\"]\n        reg_preds = model_outs[\"prediction\"]\n        quality = model_outs[\"quality\"]\n        output = {}\n        for decoder_idx, (cls, reg, qt) in enumerate(\n            zip(cls_scores, reg_preds, quality)\n        ):\n            reg = reg[..., : len(self.reg_weights)]\n            reg = decode_box(reg)\n            cls_target, reg_target, reg_weights, ignore_mask = self.sampler.sample(\n                cls,\n                reg,\n                data[self.gt_cls_key],\n                data[self.gt_reg_key],\n                text_dict=text_dict,\n                ignore_mask=data.get(\"ignore_mask\"),\n            )\n            reg_target = reg_target[..., : len(self.reg_weights)]\n            mask = torch.logical_not(torch.all(reg_target == 0, dim=-1))\n            mask = mask.reshape(-1)\n            if ignore_mask is not None:\n                ignore_mask = ~ignore_mask.reshape(-1)\n                mask = torch.logical_and(mask, ignore_mask)\n                # expand the per-instance mask to per-token for the cls loss\n                ignore_mask = ignore_mask[:, None].tile(1, cls.shape[-1])\n\n            num_pos = max(\n                reduce_mean(torch.sum(mask).to(dtype=reg.dtype)), 1.0\n            )\n\n            cls = cls.flatten(end_dim=1)\n            cls_target = cls_target.flatten(end_dim=1)\n            token_mask = torch.logical_not(cls.isinf())\n            cls = cls[token_mask]\n            cls_target = cls_target[token_mask]\n\n            if ignore_mask is None:\n                cls_loss = self.loss_cls(cls, cls_target, avg_factor=num_pos)\n            else:\n                ignore_mask = ignore_mask[token_mask]\n                cls_loss = self.loss_cls(\n                    cls[ignore_mask],\n                    cls_target[ignore_mask],\n                    avg_factor=num_pos,\n                )\n\n            reg_weights = reg_weights * reg.new_tensor(self.reg_weights)\n            reg_target = reg_target.flatten(end_dim=1)[mask]\n            reg = reg.flatten(end_dim=1)[mask]\n            reg_weights = reg_weights.flatten(end_dim=1)[mask]\n            reg_target = torch.where(\n                reg_target.isnan(), reg.new_tensor(0.0), reg_target\n            )\n            if qt is not None:\n                qt = qt.flatten(end_dim=1)[mask]\n\n            reg_loss = self.loss_reg(\n                reg,\n                reg_target,\n                weight=reg_weights,\n                avg_factor=num_pos,\n                prefix=f\"{self.task_prefix}_\",\n                suffix=f\"_{decoder_idx}\",\n                quality=qt,\n            )\n\n            output[f\"{self.task_prefix}_loss_cls_{decoder_idx}\"] = cls_loss\n            output.update(reg_loss)\n\n        if \"dn_prediction\" not in model_outs:\n            return output\n\n        # ===================== denoising losses ======================\n        dn_cls_scores = model_outs[\"dn_classification\"]\n        dn_reg_preds = model_outs[\"dn_prediction\"]\n\n        (\n            dn_valid_mask,\n            dn_cls_target,\n            dn_reg_target,\n            dn_pos_mask,\n            reg_weights,\n            num_dn_pos,\n        ) = self.prepare_for_dn_loss(model_outs)\n\n        for decoder_idx, (cls, reg) in enumerate(\n            zip(dn_cls_scores, dn_reg_preds)\n        ):\n            if (\n                \"temp_dn_valid_mask\" in model_outs\n                and decoder_idx == self.num_single_frame_decoder\n            ):\n                (\n                    dn_valid_mask,\n                    dn_cls_target,\n                    dn_reg_target,\n                    dn_pos_mask,\n                    reg_weights,\n                    num_dn_pos,\n                ) = self.prepare_for_dn_loss(model_outs, prefix=\"temp_\")\n\n            cls = cls.flatten(end_dim=1)[dn_valid_mask]\n            mask = torch.logical_not(cls.isinf())\n            cls_loss = self.loss_cls(\n                cls[mask],\n                dn_cls_target[mask],\n                avg_factor=num_dn_pos,\n            )\n\n            reg = reg.flatten(end_dim=1)[dn_valid_mask][dn_pos_mask][\n                ..., : len(self.reg_weights)\n            ]\n            reg = decode_box(reg)\n            reg_loss = self.loss_reg(\n                reg,\n                dn_reg_target,\n                avg_factor=num_dn_pos,\n                weight=reg_weights,\n                prefix=f\"{self.task_prefix}_\",\n                suffix=f\"_dn_{decoder_idx}\",\n            )\n            output[f\"{self.task_prefix}_loss_cls_dn_{decoder_idx}\"] = cls_loss\n            output.update(reg_loss)\n        return output\n\n    def prepare_for_dn_loss(self, model_outs, prefix=\"\"):\n        dn_valid_mask = model_outs[f\"{prefix}dn_valid_mask\"].flatten(end_dim=1)\n        dn_cls_target = model_outs[f\"{prefix}dn_cls_target\"].flatten(\n            end_dim=1\n        )[dn_valid_mask]\n        dn_reg_target = model_outs[f\"{prefix}dn_reg_target\"].flatten(\n            end_dim=1\n        )[dn_valid_mask][..., : len(self.reg_weights)]\n        dn_pos_mask = dn_cls_target.sum(dim=-1) > 0\n        dn_reg_target = dn_reg_target[dn_pos_mask]\n        reg_weights = dn_reg_target.new_tensor(self.reg_weights)[None].tile(\n            dn_reg_target.shape[0], 1\n        )\n        num_dn_pos = max(\n            reduce_mean(torch.sum(dn_valid_mask).to(dtype=reg_weights.dtype)),\n            1.0,\n        )\n        return (\n            dn_valid_mask,\n            dn_cls_target,\n            dn_reg_target,\n            dn_pos_mask,\n            reg_weights,\n            num_dn_pos,\n        )\n\n    def post_process(\n        self,\n        model_outs,\n        text_dict,\n        batch_inputs,\n        batch_data_samples,\n        results_in_data_samples=True,\n    ):\n        results = self.post_processor(\n            model_outs[\"classification\"],\n            model_outs[\"prediction\"],\n            model_outs.get(\"instance_id\"),\n            model_outs.get(\"quality\"),\n            text_dict=text_dict,\n            batch_inputs=batch_inputs,\n        )\n        if results_in_data_samples:\n            for i, ret in enumerate(results):\n                instances = InstanceData()\n                for k, v in ret.items():\n                    if k == \"bboxes_3d\":\n                        box_type = batch_data_samples[i].metainfo[\"box_type_3d\"]\n                        v = box_type(\n                            v,\n                            box_dim=v.shape[1],\n                            origin=(0.5, 0.5, 0.5),\n                        )\n                    instances.__setattr__(k, v)\n                batch_data_samples[i].pred_instances_3d = instances\n            return batch_data_samples\n        return results\n\n\n@MODELS.register_module()\nclass GroundingRefineClsHead(BaseModel):\n    def __init__(\n        self,\n        embed_dims=256,\n        output_dim=9,\n        scale=None,\n        cls_layers=False,\n        cls_bias=True,\n    ):\n        super().__init__()\n        self.embed_dims = embed_dims\n        self.output_dim = output_dim\n        self.refine_state = list(range(output_dim))\n        self.scale = scale\n        self.layers = nn.Sequential(\n            *linear_act_ln(embed_dims, 2, 2),\n            Linear(self.embed_dims, self.output_dim),\n            Scale([1.0] * self.output_dim),\n        )\n        if cls_layers:\n            
self.cls_layers = nn.Sequential(\n                MLP(embed_dims, embed_dims, embed_dims, 2),\n                nn.LayerNorm(self.embed_dims),\n            )\n        else:\n            self.cls_layers = nn.Identity()\n        if cls_bias:\n            bias_value = -math.log((1 - 0.01) / 0.01)\n            self.bias = nn.Parameter(\n                torch.Tensor([bias_value]), requires_grad=True\n            )\n        else:\n            self.bias = None\n\n    def forward(\n        self,\n        instance_feature: torch.Tensor,\n        anchor: torch.Tensor = None,\n        anchor_embed: torch.Tensor = None,\n        time_interval: torch.Tensor = 1.0,\n        text_feature=None,\n        text_token_mask=None,\n        **kwargs,\n    ):\n        if anchor_embed is not None:\n            feature = instance_feature + anchor_embed\n        else:\n            feature = instance_feature\n        output = self.layers(feature)\n        if self.scale is not None:\n            output = output * output.new_tensor(self.scale)\n        if anchor is not None:\n            output = output + anchor\n\n        if text_feature is not None:\n            cls = self.cls_layers(\n                instance_feature) @ text_feature.transpose(-1, -2)\n            cls = cls / math.sqrt(instance_feature.shape[-1])\n            if self.bias is not None:\n                cls = cls + self.bias\n            if text_token_mask is not None:\n                cls.masked_fill_(~text_token_mask[:, None, :], float(\"-inf\"))\n        else:\n            cls = None\n        return output, cls, None\n\n\n@MODELS.register_module()\nclass DoF9BoxLoss(nn.Module):\n    def __init__(\n        self,\n        loss_weight_wd=1.0,\n        loss_weight_pcd=0.0,\n        loss_weight_cd=0.8,\n        decode_pred=False,\n    ):\n        super().__init__()\n        self.loss_weight_wd = loss_weight_wd\n        self.loss_weight_pcd = loss_weight_pcd\n        self.loss_weight_cd = loss_weight_cd\n        self.decode_pred = decode_pred\n\n    def forward(\n        self,\n        box,\n        box_target,\n        weight=None,\n        avg_factor=None,\n        prefix=\"\",\n        suffix=\"\",\n        **kwargs,\n    ):\n        if box_target.shape[0] == 0:\n            loss = box.sum() * 0\n            return {f\"{prefix}loss_box{suffix}\": loss}\n        if self.decode_pred:\n            box = decode_box(box)\n        loss = 0\n        if self.loss_weight_wd > 0:\n            loss += self.loss_weight_wd * wasserstein_distance(box, box_target)\n        if self.loss_weight_pcd > 0:\n            loss += self.loss_weight_pcd * permutation_corner_distance(\n                box, box_target\n            )\n        if self.loss_weight_cd > 0:\n            loss += self.loss_weight_cd * center_distance(box, box_target)\n\n        if avg_factor is None:\n            loss = loss.mean()\n        else:\n            loss = loss.sum() / avg_factor\n        output = {f\"{prefix}loss_box{suffix}\": loss}\n        return output\n\n\n@TASK_UTILS.register_module()\nclass GroundingBox3DPostProcess:\n    def __init__(\n        self,\n        num_output: int = 300,\n        score_threshold: Optional[float] = None,\n        sorted: bool = True,\n    ):\n        super(GroundingBox3DPostProcess, self).__init__()\n        self.num_output = num_output\n        self.score_threshold = score_threshold\n        self.sorted = sorted\n\n    def __call__(\n        self,\n        cls_scores,\n        box_preds,\n        instance_id=None,\n        quality=None,\n        
output_idx=-1,\n        text_dict=None,\n        batch_inputs=None,\n    ):\n        cls_scores = cls_scores[output_idx].sigmoid()\n        if \"tokens_positive\" in batch_inputs:\n            tokens_positive_maps = get_positive_map(\n                batch_inputs[\"tokens_positive\"],\n                text_dict,\n            )\n            label_to_token = [\n                create_positive_map_label_to_token(x, plus=1)\n                for x in tokens_positive_maps\n            ]\n            cls_scores = convert_grounding_to_cls_scores(\n                cls_scores, label_to_token\n            )\n            entities = get_entities(\n                batch_inputs[\"text\"],\n                batch_inputs[\"tokens_positive\"],\n            )\n        else:\n            cls_scores, _ = cls_scores.max(dim=-1, keepdim=True)\n            entities = batch_inputs[\"text\"]\n\n        box_preds = box_preds[output_idx]\n        bs, num_pred, num_cls = cls_scores.shape\n        num_output = min(self.num_output, num_pred * num_cls)\n        cls_scores, indices = cls_scores.flatten(start_dim=1).topk(\n            num_output, dim=1, sorted=self.sorted\n        )\n        cls_ids = indices % num_cls\n        if self.score_threshold is not None:\n            mask = cls_scores >= self.score_threshold\n\n        if quality[output_idx] is None:\n            quality = None\n        if quality is not None:\n            centerness = quality[output_idx][..., CNS]\n            centerness = torch.gather(centerness, 1, indices // num_cls)\n            cls_scores_origin = cls_scores.clone()\n            cls_scores *= centerness.sigmoid()\n            cls_scores, idx = torch.sort(cls_scores, dim=1, descending=True)\n            cls_ids = torch.gather(cls_ids, 1, idx)\n            if self.score_threshold is not None:\n                mask = torch.gather(mask, 1, idx)\n            indices = torch.gather(indices, 1, idx)\n\n        output = []\n        for i in range(bs):\n            category_ids = cls_ids[i]\n            scores = cls_scores[i]\n            box = box_preds[i, indices[i] // num_cls]\n            if self.score_threshold is not None:\n                category_ids = category_ids[mask[i]]\n                scores = scores[mask[i]]\n                box = box[mask[i]]\n\n            if quality is not None:\n                scores_origin = cls_scores_origin[i]\n                if self.score_threshold is not None:\n                    scores_origin = scores_origin[mask[i]]\n\n            box = decode_box(box, 0.1, 20)\n            category_ids = category_ids.cpu()\n\n            label_names = []\n            for label_id in category_ids.tolist():\n                if isinstance(entities[i], (tuple, list)):\n                    label_names.append(entities[i][label_id])\n                else:\n                    label_names.append(entities[i])\n\n            output.append(\n                {\n                    \"bboxes_3d\": box.cpu(),\n                    
\"scores_3d\": scores.cpu(),\n                    \"labels_3d\": category_ids,\n                    \"target_scores_3d\": scores.cpu(),\n                    \"label_names\": label_names,\n                }\n            )\n            if quality is not None:\n                output[-1][\"cls_scores\"] = scores_origin.cpu()\n            if instance_id is not None:\n                ids = instance_id[i, indices[i]]\n                if self.score_threshold is not None:\n                    ids = ids[mask[i]]\n                output[-1][\"instance_ids\"] = ids\n        return output\n\n\n@MODELS.register_module()\nclass DoF9BoxEncoder(nn.Module):\n    def __init__(\n        self,\n        embed_dims,\n        rot_dims=3,\n        output_fc=True,\n        in_loops=1,\n        out_loops=2,\n    ):\n        super().__init__()\n        self.embed_dims = embed_dims\n\n        def embedding_layer(input_dims, output_dims):\n            return nn.Sequential(\n                *linear_act_ln(output_dims, in_loops, out_loops, input_dims)\n            )\n\n        if not isinstance(embed_dims, (list, tuple)):\n            embed_dims = [embed_dims] * 5\n        self.pos_fc = embedding_layer(3, embed_dims[0])\n        self.size_fc = embedding_layer(3, embed_dims[1])\n        self.yaw_fc = embedding_layer(rot_dims, embed_dims[2])\n        self.rot_dims = rot_dims\n        if output_fc:\n            self.output_fc = embedding_layer(embed_dims[-1], embed_dims[-1])\n        else:\n            self.output_fc = None\n\n    def forward(self, box_3d: torch.Tensor):\n        pos_feat = self.pos_fc(box_3d[..., [X, Y, Z]])\n        if box_3d.shape[-1] == 3:\n            return pos_feat\n        size_feat = self.size_fc(box_3d[..., [W, L, H]])\n        yaw_feat = self.yaw_fc(box_3d[..., ALPHA : ALPHA + self.rot_dims])\n        output = pos_feat + size_feat + yaw_feat\n        if self.output_fc is not None:\n            output = self.output_fc(output)\n        return output\n\n\n@MODELS.register_module()\nclass SparseBox3DKeyPointsGenerator(nn.Module):\n    def __init__(\n        self,\n        embed_dims=256,\n        num_learnable_pts=0,\n        fix_scale=None,\n    ):\n        super(SparseBox3DKeyPointsGenerator, self).__init__()\n        self.embed_dims = embed_dims\n        self.num_learnable_pts = num_learnable_pts\n        if fix_scale is None:\n            fix_scale = ((0.0, 0.0, 0.0),)\n        self.fix_scale = torch.tensor(fix_scale)\n        self.num_pts = len(self.fix_scale) + num_learnable_pts\n        if num_learnable_pts > 0:\n            self.learnable_fc = Linear(self.embed_dims, num_learnable_pts * 3)\n\n    def forward(\n        self,\n        anchor,\n        instance_feature=None,\n        T_cur2temp_list=None,\n        cur_timestamp=None,\n        temp_timestamps=None,\n    ):\n        bs, num_anchor = anchor.shape[:2]\n        size = anchor[..., None, [W, L, H]].exp()\n        key_points = self.fix_scale.to(anchor) * size\n        if self.num_learnable_pts > 0 and instance_feature is not None:\n            learnable_scale = (\n                self.learnable_fc(instance_feature)\n                .reshape(bs, num_anchor, self.num_learnable_pts, 3)\n                .sigmoid()\n                - 0.5\n            )\n            key_points = torch.cat(\n                [key_points, learnable_scale * size], dim=-2\n            )\n\n        key_points = rotation_3d_in_euler(\n            key_points.flatten(0, 1),\n            anchor[..., [ALPHA, BETA, GAMMA]].flatten(0, 1),\n        ).unflatten(0, (bs, 
num_anchor))\n        key_points = key_points + anchor[..., None, [X, Y, Z]]\n\n        if (\n            cur_timestamp is None\n            or temp_timestamps is None\n            or len(temp_timestamps) == 0\n        ) and T_cur2temp_list is None:\n            return key_points\n\n        temp_key_points_list = []\n        velocity = anchor[..., VX:]\n        for i, t_time in enumerate(temp_timestamps):\n            time_interval = cur_timestamp - t_time\n            translation = (\n                velocity\n                * time_interval.to(dtype=velocity.dtype)[:, None, None]\n            )\n            temp_key_points = key_points - translation[:, :, None]\n            if T_cur2temp_list is not None:\n                T_cur2temp = T_cur2temp_list[i].to(dtype=key_points.dtype)\n                temp_key_points = T_cur2temp[:, None, None, :3] @ torch.cat(\n                    [\n                        temp_key_points,\n                        torch.ones_like(temp_key_points[..., :1]),\n                    ],\n                    dim=-1,\n                ).unsqueeze(-1)\n            temp_key_points = temp_key_points.squeeze(-1)\n            temp_key_points_list.append(temp_key_points)\n        return key_points, temp_key_points_list\n\n    @staticmethod\n    def anchor_projection(\n        anchor,\n        T_src2dst_list,\n        src_timestamp=None,\n        dst_timestamps=None,\n        time_intervals=None,\n    ):\n        dst_anchors = []\n        for i in range(len(T_src2dst_list)):\n            vel = anchor[..., VX:]\n            vel_dim = vel.shape[-1]\n            T_src2dst = torch.unsqueeze(\n                T_src2dst_list[i].to(dtype=anchor.dtype), dim=1\n            )\n\n            center = anchor[..., [X, Y, Z]]\n            if time_intervals is not None:\n                time_interval = time_intervals[i]\n            elif src_timestamp is not None and dst_timestamps is not None:\n                time_interval = (src_timestamp - dst_timestamps[i]).to(\n                    dtype=vel.dtype\n                )\n            else:\n                time_interval = None\n            if time_interval is not None:\n                translation = vel.transpose(0, -1) * time_interval\n                translation = translation.transpose(0, -1)\n                center = center - translation\n            center = (\n                torch.matmul(\n                    T_src2dst[..., :3, :3], center[..., None]\n                ).squeeze(dim=-1)\n                + T_src2dst[..., :3, 3]\n            )\n            size = anchor[..., [W, L, H]]\n            yaw = torch.matmul(\n                T_src2dst[..., :2, :2],\n                anchor[..., [COS_YAW, SIN_YAW], None],\n            ).squeeze(-1)\n            yaw = yaw[..., [1, 0]]\n            vel = torch.matmul(\n                T_src2dst[..., :vel_dim, :vel_dim], vel[..., None]\n            ).squeeze(-1)\n            dst_anchor = torch.cat([center, size, yaw, vel], dim=-1)\n            dst_anchors.append(dst_anchor)\n        return dst_anchors\n\n    @staticmethod\n    def distance(anchor):\n        return torch.norm(anchor[..., :2], p=2, dim=-1)\n"
  },
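  {
    "path": "examples/decode_box_demo.py",
    "content": "\"\"\"Worked example of `decode_box` from `bip3d.models.bbox3d_decoder`.\n\nShows the size convention the decoder regresses: dims 3:6 of the 9-DoF state\nare log-sizes, so decoding exponentiates and optionally clamps them. The\nnumbers and file path are made up for illustration.\n\"\"\"\nimport torch\n\nfrom bip3d.models.bbox3d_decoder import decode_box\n\nif __name__ == \"__main__\":\n    # One box: (x, y, z, log_w, log_l, log_h, alpha, beta, gamma).\n    raw = torch.tensor([[1.0, 2.0, 0.5, 0.0, 0.7, -0.2, 0.0, 0.0, 0.1]])\n    box = decode_box(raw, min_size=0.1, max_size=20)\n    # Sizes become exp([0.0, 0.7, -0.2]) ~= [1.00, 2.01, 0.82], clamped into\n    # [0.1, 20]; every other dimension passes through unchanged.\n    print(box[0, 3:6])\n"
  },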
  {
    "path": "bip3d/models/bert.py",
    "content": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom collections import OrderedDict\nfrom typing import Sequence\n\nimport torch\nfrom mmengine.model import BaseModel\nfrom torch import nn\n\ntry:\n    from transformers import AutoTokenizer, BertConfig\n    from transformers import BertModel as HFBertModel\nexcept ImportError:\n    AutoTokenizer = None\n    HFBertModel = None\n\nfrom bip3d.registry import MODELS\n\n\ndef generate_masks_with_special_tokens_and_transfer_map(\n    tokenized, special_tokens_list\n):\n    \"\"\"Generate attention mask between each pair of special tokens.\n\n    Only token pairs in between two special tokens are attended to\n    and thus the attention mask for these pairs is positive.\n\n    Args:\n        input_ids (torch.Tensor): input ids. Shape: [bs, num_token]\n        special_tokens_mask (list): special tokens mask.\n\n    Returns:\n        Tuple(Tensor, Tensor):\n        - attention_mask is the attention mask between each tokens.\n          Only token pairs in between two special tokens are positive.\n          Shape: [bs, num_token, num_token].\n        - position_ids is the position id of tokens within each valid sentence.\n          The id starts from 0 whenenver a special token is encountered.\n          Shape: [bs, num_token]\n    \"\"\"\n    input_ids = tokenized[\"input_ids\"]\n    bs, num_token = input_ids.shape\n    # special_tokens_mask:\n    # bs, num_token. 1 for special tokens. 0 for normal tokens\n    special_tokens_mask = torch.zeros(\n        (bs, num_token), device=input_ids.device\n    ).bool()\n\n    for special_token in special_tokens_list:\n        special_tokens_mask |= input_ids == special_token\n\n    # idxs: each row is a list of indices of special tokens\n    idxs = torch.nonzero(special_tokens_mask)\n\n    # generate attention mask and positional ids\n    attention_mask = (\n        torch.eye(num_token, device=input_ids.device)\n        .bool()\n        .unsqueeze(0)\n        .repeat(bs, 1, 1)\n    )\n    position_ids = torch.zeros((bs, num_token), device=input_ids.device)\n    previous_col = 0\n    for i in range(idxs.shape[0]):\n        row, col = idxs[i]\n        if col == 0:\n            # if (col == 0) or (col == num_token - 1):\n            attention_mask[row, col, col] = True\n            position_ids[row, col] = 0\n        else:\n            # attention_mask[row, previous_col + 1:col,\n            #                previous_col + 1:col] = True\n            # position_ids[row, previous_col + 1:col] = torch.arange(\n            #     0, col - previous_col - 1, device=input_ids.device)\n            attention_mask[\n                row, previous_col + 1 : col + 1, previous_col + 1 : col + 1\n            ] = True\n            position_ids[row, previous_col + 1 : col + 1] = torch.arange(\n                0, col - previous_col, device=input_ids.device\n            )\n        previous_col = col\n\n    return attention_mask, position_ids.to(torch.long)\n\n\n@MODELS.register_module()\nclass BertModel(BaseModel):\n    \"\"\"BERT model for language embedding only encoder.\n\n    Args:\n        name (str, optional): name of the pretrained BERT model from\n            HuggingFace. Defaults to bert-base-uncased.\n        max_tokens (int, optional): maximum number of tokens to be\n            used for BERT. 
Defaults to 256.\n        pad_to_max (bool, optional): whether to pad the tokens to max_tokens.\n             Defaults to True.\n        use_sub_sentence_represent (bool, optional): whether to use sub\n            sentence represent introduced in `Grounding DINO\n            <https://arxiv.org/abs/2303.05499>`. Defaults to False.\n        special_tokens_list (list, optional): special tokens used to split\n            subsentence. It cannot be None when `use_sub_sentence_represent`\n            is True. Defaults to None.\n        add_pooling_layer (bool, optional): whether to adding pooling\n            layer in bert encoder. Defaults to False.\n        num_layers_of_embedded (int, optional): number of layers of\n            the embedded model. Defaults to 1.\n        use_checkpoint (bool, optional): whether to use gradient checkpointing.\n             Defaults to False.\n    \"\"\"\n\n    def __init__(\n        self,\n        name: str = \"bert-base-uncased\",\n        max_tokens: int = 256,\n        pad_to_max: bool = True,\n        use_sub_sentence_represent: bool = False,\n        special_tokens_list: list = None,\n        add_pooling_layer: bool = False,\n        num_layers_of_embedded: int = 1,\n        use_checkpoint: bool = False,\n        return_tokenized: bool = False,\n        **kwargs\n    ) -> None:\n\n        super().__init__(**kwargs)\n        self.max_tokens = max_tokens\n        self.pad_to_max = pad_to_max\n        self.return_tokenized = return_tokenized\n\n        if AutoTokenizer is None:\n            raise RuntimeError(\n                \"transformers is not installed, please install it by: \"\n                \"pip install transformers.\"\n            )\n\n        self.tokenizer = AutoTokenizer.from_pretrained(name)\n        self.language_backbone = nn.Sequential(\n            OrderedDict(\n                [\n                    (\n                        \"body\",\n                        BertEncoder(\n                            name,\n                            add_pooling_layer=add_pooling_layer,\n                            num_layers_of_embedded=num_layers_of_embedded,\n                            use_checkpoint=use_checkpoint,\n                        ),\n                    )\n                ]\n            )\n        )\n\n        self.use_sub_sentence_represent = use_sub_sentence_represent\n        if self.use_sub_sentence_represent:\n            assert (\n                special_tokens_list is not None\n            ), \"special_tokens should not be None \\\n                    if use_sub_sentence_represent is True\"\n\n            self.special_tokens = self.tokenizer.convert_tokens_to_ids(\n                special_tokens_list\n            )\n\n    def forward(self, captions: Sequence[str], **kwargs) -> dict:\n        \"\"\"Forward function.\"\"\"\n        device = next(self.language_backbone.parameters()).device\n        tokenized = self.tokenizer.batch_encode_plus(\n            captions,\n            max_length=self.max_tokens,\n            padding=\"max_length\" if self.pad_to_max else \"longest\",\n            return_special_tokens_mask=True,\n            return_tensors=\"pt\",\n            truncation=True,\n        ).to(device)\n        input_ids = tokenized.input_ids\n        if self.use_sub_sentence_represent:\n            attention_mask, position_ids = (\n                generate_masks_with_special_tokens_and_transfer_map(\n                    tokenized, self.special_tokens\n                )\n            )\n            token_type_ids = 
tokenized[\"token_type_ids\"]\n\n        else:\n            attention_mask = tokenized.attention_mask\n            position_ids = None\n            token_type_ids = None\n\n        tokenizer_input = {\n            \"input_ids\": input_ids,\n            \"attention_mask\": attention_mask,\n            \"position_ids\": position_ids,\n            \"token_type_ids\": token_type_ids,\n        }\n        language_dict_features = self.language_backbone(tokenizer_input)\n        if self.use_sub_sentence_represent:\n            language_dict_features[\"position_ids\"] = position_ids\n            language_dict_features[\"text_token_mask\"] = (\n                tokenized.attention_mask.bool()\n            )\n        if self.return_tokenized:\n            language_dict_features[\"tokenized\"] = tokenized\n        return language_dict_features\n\n\nclass BertEncoder(nn.Module):\n    \"\"\"BERT encoder for language embedding.\n\n    Args:\n        name (str): name of the pretrained BERT model from HuggingFace.\n                Defaults to bert-base-uncased.\n        add_pooling_layer (bool): whether to add a pooling layer.\n        num_layers_of_embedded (int): number of layers of the embedded model.\n                Defaults to 1.\n        use_checkpoint (bool): whether to use gradient checkpointing.\n                Defaults to False.\n    \"\"\"\n\n    def __init__(\n        self,\n        name: str,\n        add_pooling_layer: bool = False,\n        num_layers_of_embedded: int = 1,\n        use_checkpoint: bool = False,\n    ):\n        super().__init__()\n        if BertConfig is None:\n            raise RuntimeError(\n                \"transformers is not installed, please install it by: \"\n                \"pip install transformers.\"\n            )\n        config = BertConfig.from_pretrained(name)\n        config.gradient_checkpointing = use_checkpoint\n        # only encoder\n        self.model = HFBertModel.from_pretrained(\n            name, add_pooling_layer=add_pooling_layer, config=config\n        )\n        self.language_dim = config.hidden_size\n        self.num_layers_of_embedded = num_layers_of_embedded\n\n    def forward(self, x) -> dict:\n        mask = x[\"attention_mask\"]\n\n        outputs = self.model(\n            input_ids=x[\"input_ids\"],\n            attention_mask=mask,\n            position_ids=x[\"position_ids\"],\n            token_type_ids=x[\"token_type_ids\"],\n            output_hidden_states=True,\n        )\n\n        # outputs has 13 layers, 1 input layer and 12 hidden layers\n        encoded_layers = outputs.hidden_states[1:]\n        features = torch.stack(\n            encoded_layers[-self.num_layers_of_embedded :], 1\n        ).mean(1)\n        # language embedding has shape [len(phrase), seq_len, language_dim]\n        features = features / self.num_layers_of_embedded\n        if mask.dim() == 2:\n            embedded = features * mask.unsqueeze(-1).float()\n        else:\n            embedded = features\n\n        results = {\n            \"embedded\": embedded,\n            \"masks\": mask,\n            \"hidden\": encoded_layers[-1],\n        }\n        return results\n"
  },
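  {
    "path": "examples/bert_special_token_masks_demo.py",
    "content": "\"\"\"Toy run of `generate_masks_with_special_tokens_and_transfer_map`\nfrom `bip3d.models.bert`.\n\nHand-built token ids are used instead of a real tokenizer so no transformers\ndownload is needed; ids 101/1012/102 mimic BERT's [CLS]/'.'/[SEP] tokens and\nthe phrase ids are arbitrary.\n\"\"\"\nimport torch\n\nfrom bip3d.models.bert import (\n    generate_masks_with_special_tokens_and_transfer_map,\n)\n\nif __name__ == \"__main__\":\n    # \"[CLS] a chair . a table . [SEP]\" as fake ids.\n    input_ids = torch.tensor([[101, 2001, 3001, 1012, 2002, 3002, 1012, 102]])\n    attention_mask, position_ids = (\n        generate_masks_with_special_tokens_and_transfer_map(\n            {\"input_ids\": input_ids}, [101, 1012, 102]\n        )\n    )\n    # Tokens between two special tokens attend to each other, and position\n    # ids restart after every special token:\n    print(attention_mask.shape)  # torch.Size([1, 8, 8])\n    print(position_ids)  # tensor([[0, 0, 1, 2, 0, 1, 2, 0]])\n"
  },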
  {
    "path": "bip3d/models/data_preprocessors/__init__.py",
    "content": "from .custom_data_preprocessor import CustomDet3DDataPreprocessor\n\n__all__ = [\"CustomDet3DDataPreprocessor\"]\n"
  },
  {
    "path": "bip3d/models/data_preprocessors/custom_data_preprocessor.py",
    "content": "import math\nfrom numbers import Number\nfrom typing import Dict, List, Optional, Sequence, Tuple, Union\n\nimport numpy as np\nimport torch\nfrom mmdet.models import DetDataPreprocessor\nfrom mmdet.models.utils.misc import samplelist_boxtype2tensor\nfrom mmengine.model import stack_batch\nfrom mmengine.structures import InstanceData\nfrom mmengine.utils import is_seq_of\nfrom torch import Tensor\nfrom torch.nn import functional as F\n\nfrom bip3d.registry import MODELS\nfrom bip3d.utils.typing_config import ConfigType, SampleList\nfrom bip3d.structures.bbox_3d import get_proj_mat_by_coord_type\n\nfrom .utils import multiview_img_stack_batch\n\n\n@MODELS.register_module()\nclass CustomDet3DDataPreprocessor(DetDataPreprocessor):\n    \"\"\"Points / Image pre-processor for point clouds / vision-only / multi-\n    modality 3D detection tasks.\n\n    It provides the data pre-processing as follows\n\n    - Collate and move image and point cloud data to the target device.\n\n    - 1) For image data:\n\n      - Pad images in inputs to the maximum size of current batch with defined\n        ``pad_value``. The padding size can be divisible by a defined\n        ``pad_size_divisor``.\n      - Stack images in inputs to batch_imgs.\n      - Convert images in inputs from bgr to rgb if the shape of input is\n        (3, H, W).\n      - Normalize images in inputs with defined std and mean.\n      - Do batch augmentations during training.\n\n    - 2) For point cloud data:\n\n      - If no voxelization, directly return list of point cloud data.\n      - If voxelization is applied, voxelize point cloud according to\n        ``voxel_type`` and obtain ``voxels``.\n\n    Args:\n        voxel (bool): Whether to apply voxelization to point cloud.\n            Defaults to False.\n        voxel_type (str): Voxelization type. Two voxelization types are\n            provided: 'hard' and 'dynamic', respectively for hard voxelization\n            and dynamic voxelization. Defaults to 'hard'.\n        voxel_layer (dict or :obj:`ConfigDict`, optional): Voxelization layer\n            config. Defaults to None.\n        batch_first (bool): Whether to put the batch dimension to the first\n            dimension when getting voxel coordinates. Defaults to True.\n        max_voxels (int, optional): Maximum number of voxels in each voxel\n            grid. Defaults to None.\n        mean (Sequence[Number], optional): The pixel mean of R, G, B channels.\n            Defaults to None.\n        std (Sequence[Number], optional): The pixel standard deviation of\n            R, G, B channels. Defaults to None.\n        pad_size_divisor (int): The size of padded image should be divisible by\n            ``pad_size_divisor``. Defaults to 1.\n        pad_value (float or int): The padded pixel value. Defaults to 0.\n        pad_mask (bool): Whether to pad instance masks. Defaults to False.\n        mask_pad_value (int): The padded pixel value for instance masks.\n            Defaults to 0.\n        pad_seg (bool): Whether to pad semantic segmentation maps.\n            Defaults to False.\n        seg_pad_value (int): The padded pixel value for semantic segmentation\n            maps. Defaults to 255.\n        bgr_to_rgb (bool): Whether to convert image from BGR to RGB.\n            Defaults to False.\n        rgb_to_bgr (bool): Whether to convert image from RGB to BGR.\n            Defaults to False.\n        boxtype2tensor (bool): Whether to convert the ``BaseBoxes`` type of\n            bboxes data to ``Tensor`` type. 
Defaults to True.\n        non_blocking (bool): Whether to use non-blocking transfer when moving\n            data to the device. Defaults to False.\n        batch_augments (List[dict], optional): Batch-level augmentations.\n            Defaults to None.\n        batchwise_inputs (bool): Pack the input as a batch of samples\n            with 1-N frames for the continuous 3D perception setting.\n            Defaults to False.\n    \"\"\"\n\n    def __init__(\n        self,\n        voxel: bool = False,\n        voxel_type: str = \"hard\",\n        voxel_layer: Optional[ConfigType] = None,\n        batch_first: bool = True,\n        max_voxels: Optional[int] = None,\n        mean: Sequence[Number] = None,\n        std: Sequence[Number] = None,\n        pad_size_divisor: int = 1,\n        pad_value: Union[float, int] = 0,\n        pad_mask: bool = False,\n        mask_pad_value: int = 0,\n        pad_seg: bool = False,\n        seg_pad_value: int = 255,\n        bgr_to_rgb: bool = False,\n        rgb_to_bgr: bool = False,\n        boxtype2tensor: bool = True,\n        non_blocking: bool = False,\n        batch_augments: Optional[List[dict]] = None,\n        batchwise_inputs: bool = False,\n    ):\n        super().__init__(\n            mean=mean,\n            std=std,\n            pad_size_divisor=pad_size_divisor,\n            pad_value=pad_value,\n            pad_mask=pad_mask,\n            mask_pad_value=mask_pad_value,\n            pad_seg=pad_seg,\n            seg_pad_value=seg_pad_value,\n            bgr_to_rgb=bgr_to_rgb,\n            rgb_to_bgr=rgb_to_bgr,\n            boxtype2tensor=boxtype2tensor,\n            non_blocking=non_blocking,\n            batch_augments=batch_augments,\n        )\n        self.voxel = voxel\n        self.voxel_type = voxel_type\n        self.batch_first = batch_first\n        self.max_voxels = max_voxels\n        self.batchwise_inputs = batchwise_inputs\n        if voxel:\n            # Voxelization is optional; import lazily so the module does not\n            # hard-depend on mmdet3d (assumed to provide this layer).\n            from mmdet3d.models.data_preprocessors.voxelize import (\n                VoxelizationByGridShape,\n            )\n\n            self.voxel_layer = VoxelizationByGridShape(**voxel_layer)\n\n    def forward(self, data: Union[dict, List[dict]], training: bool = False):\n        \"\"\"Perform normalization, padding and bgr2rgb conversion based on\n        ``BaseDataPreprocessor``.\n\n        Args:\n            data (dict or List[dict]): Data from dataloader. 
The dict contains\n                the whole batch data, when it is a list[dict], the list\n                indicates test time augmentation.\n            training (bool): Whether to enable training time augmentation.\n                Defaults to False.\n\n        Returns:\n            dict or List[dict]: Data in the same format as the model input.\n        \"\"\"\n        if isinstance(data, list):\n            num_augs = len(data)\n            aug_batch_data = []\n            for aug_id in range(num_augs):\n                single_aug_batch_data = self.simple_process(\n                    data[aug_id], training\n                )\n                aug_batch_data.append(single_aug_batch_data)\n            return aug_batch_data\n        else:\n            return self.simple_process(data, training)\n\n    def simple_process(self, data: dict, training: bool = False):\n        \"\"\"Perform normalization, padding and bgr2rgb conversion for img data\n        based on ``BaseDataPreprocessor``, and voxelize point cloud if `voxel`\n        is set to be True.\n\n        Args:\n            data (dict): Data sampled from dataloader.\n            training (bool): Whether to enable training time augmentation.\n                Defaults to False.\n\n        Returns:\n            dict: Data in the same format as the model input.\n        \"\"\"\n        if \"img\" in data[\"inputs\"]:\n            batch_pad_shape = self._get_pad_shape(data)\n\n        if self.batchwise_inputs:\n            data_samples = data[\"data_samples\"]\n            batchwise_data_samples = []\n            if \"bboxes_3d\" in data_samples[0].gt_instances_3d:\n                assert isinstance(\n                    data_samples[0].gt_instances_3d.labels_3d, list\n                )\n                bboxes_3d = data_samples[0].gt_instances_3d.bboxes_3d\n                labels_3d = data_samples[0].gt_instances_3d.labels_3d\n            if \"gt_occupancy_masks\" in data_samples[0]:\n                gt_occupancy_masks = [\n                    mask.clone() for mask in data_samples[0].gt_occupancy_masks\n                ]\n            if (\n                \"eval_ann_info\" in data_samples[0]\n                and data_samples[0].eval_ann_info is not None\n            ):\n                eval_ann_info = data_samples[0].eval_ann_info\n            for idx in range(len(labels_3d)):\n                data_sample = data_samples[0].clone()\n                if \"bboxes_3d\" in data_sample.gt_instances_3d:\n                    data_sample.gt_instances_3d = InstanceData()\n                    data_sample.gt_instances_3d.bboxes_3d = bboxes_3d[idx]\n                    data_sample.gt_instances_3d.labels_3d = labels_3d[idx]\n                if \"gt_occupancy_masks\" in data_sample:\n                    data_sample.gt_occupancy_masks = gt_occupancy_masks[idx]\n                if \"eval_ann_info\" in data_sample:\n                    if data_sample.eval_ann_info is not None:\n                        data_sample.eval_ann_info = dict()\n                        data_sample.eval_ann_info[\"gt_bboxes_3d\"] = (\n                            eval_ann_info[\"gt_bboxes_3d\"][idx]\n                        )\n                        data_sample.eval_ann_info[\"gt_labels_3d\"] = (\n                            eval_ann_info[\"gt_labels_3d\"][idx]\n                        )\n                batchwise_data_samples.append(data_sample)\n            data[\"data_samples\"] = batchwise_data_samples\n\n        data = self.collate_data(data)\n        inputs, data_samples = 
data[\"inputs\"], data[\"data_samples\"]\n        batch_inputs = dict()\n        batch_inputs.update(self.process_camera_params(data_samples))\n\n        for key in [\"depth_img\", \"depth_prob_gt\"]:\n            if key not in inputs:\n                continue\n            batch_inputs[key] = torch.stack(inputs[key])\n\n        for key in [\"text\", \"scan_id\", \"tokens_positive\", \"ignore_mask\"]:\n            if not hasattr(data_samples[0], key):\n                continue\n            batch_inputs[key] = [getattr(x, key) for x in data_samples]\n        if hasattr(data_samples[0], \"gt_instances_3d\"):\n            batch_inputs[\"gt_bboxes_3d\"] = [\n                x.gt_instances_3d.bboxes_3d for x in data_samples\n            ]\n            batch_inputs[\"gt_labels_3d\"] = [\n                x.gt_instances_3d.labels_3d for x in data_samples\n            ]\n\n        if \"imgs\" in inputs:\n            imgs = inputs[\"imgs\"]\n\n            if data_samples is not None:\n                # NOTE the batched image size information may be useful, e.g.\n                # in DETR, this is needed for the construction of masks, which\n                # is then used for the transformer_head.\n                batch_input_shape = tuple(imgs[0].size()[-2:])\n                for data_sample, pad_shape in zip(\n                    data_samples, batch_pad_shape\n                ):\n                    data_sample.set_metainfo(\n                        {\n                            \"batch_input_shape\": batch_input_shape,\n                            \"pad_shape\": pad_shape,\n                        }\n                    )\n\n                if self.boxtype2tensor:\n                    samplelist_boxtype2tensor(data_samples)\n                if self.pad_mask:\n                    self.pad_gt_masks(data_samples)\n                if self.pad_seg:\n                    self.pad_gt_sem_seg(data_samples)\n\n            if training and self.batch_augments is not None:\n                for batch_aug in self.batch_augments:\n                    imgs, data_samples = batch_aug(imgs, data_samples)\n            batch_inputs[\"imgs\"] = imgs\n\n        return {\"inputs\": batch_inputs, \"data_samples\": data_samples}\n\n    def process_camera_params(self, data_samples):\n        projection_mat = []\n        extrinsic_list = []\n        intrinsic_list = []\n        image_wh = []\n        for data_sample in data_samples:\n            proj_mat = get_proj_mat_by_coord_type(\n                data_sample.metainfo, \"DEPTH\"\n            )\n\n            img_scale_factor = data_sample.metainfo.get(\"scale_factor\", [1, 1])\n            img_flip = data_sample.metainfo.get(\"flip\", False)\n            img_crop_offset = data_sample.metainfo.get(\n                \"img_crop_offset\", [0, 0]\n            )\n            trans_mat = np.eye(4)\n            trans_mat[0, 0] = img_scale_factor[0]\n            trans_mat[1, 1] = img_scale_factor[1]\n            trans_mat[0, 2] = -img_crop_offset[0]\n            trans_mat[1, 2] = -img_crop_offset[1]\n            if img_flip:\n                assert False\n            if \"trans_mat\" in proj_mat:\n                trans_mat = np.stack(proj_mat[\"trans_mat\"]) @ trans_mat\n\n            if isinstance(proj_mat, dict):\n                extrinsic = np.stack(proj_mat[\"extrinsic\"])\n                intrinsic = np.stack(proj_mat[\"intrinsic\"])\n                proj_mat = intrinsic @ extrinsic\n                extrinsic_list.append(extrinsic)\n                
intrinsic_list.append(trans_mat @ intrinsic)\n            else:\n                extrinsic_list.append(\n                    np.tile(np.eye(4)[None], (proj_mat.shape[0], 1, 1))\n                )\n                intrinsic_list.append(\n                    np.tile(np.eye(4)[None], (proj_mat.shape[0], 1, 1))\n                )\n            proj_mat = trans_mat @ proj_mat\n            projection_mat.append(proj_mat)\n            image_wh.append(data_sample.metainfo[\"img_shape\"][:2])\n\n        to_tensor = lambda x: torch.from_numpy(x).cuda().to(torch.float32)\n        projection_mat = to_tensor(np.stack(projection_mat))\n        image_wh = to_tensor(np.array(image_wh))\n        image_wh = image_wh[:, None].tile(1, projection_mat.shape[1], 1)\n        extrinsic = to_tensor(np.stack(extrinsic_list))\n        intrinsic = to_tensor(np.stack(intrinsic_list))\n        return {\n            \"projection_mat\": projection_mat,\n            \"image_wh\": image_wh,\n            \"extrinsic\": extrinsic,\n            \"intrinsic\": intrinsic,\n        }\n\n    def preprocess_img(self, _batch_img: Tensor):\n        # channel transform\n        if self._channel_conversion:\n            _batch_img = _batch_img[[2, 1, 0], ...]\n        # Convert to float after channel conversion to ensure\n        # efficiency\n        _batch_img = _batch_img.float()\n        # Normalization.\n        if self._enable_normalize:\n            if self.mean.shape[0] == 3:\n                assert _batch_img.dim() == 3 and _batch_img.shape[0] == 3, (\n                    \"If the mean has 3 values, the input tensor \"\n                    \"should in shape of (3, H, W), but got the \"\n                    f\"tensor with shape {_batch_img.shape}\"\n                )\n            _batch_img = (_batch_img - self.mean) / self.std\n        return _batch_img\n\n    def collate_data(self, data: dict):\n        \"\"\"Copy data to the target device and perform normalization, padding\n        and bgr2rgb conversion and stack based on ``BaseDataPreprocessor``.\n\n        Collates the data sampled from dataloader into a list of dict and list\n        of labels, and then copies tensor to the target device.\n\n        Args:\n            data (dict): Data sampled from dataloader.\n\n        Returns:\n            dict: Data in the same format as the model input.\n        \"\"\"\n        data = self.cast_data(data)  # type: ignore\n\n        if \"img\" in data[\"inputs\"]:\n            _batch_imgs = data[\"inputs\"][\"img\"]\n            # Process data with `pseudo_collate`.\n            if is_seq_of(_batch_imgs, torch.Tensor):\n                batch_imgs = []\n                img_dim = _batch_imgs[0].dim()\n                for _batch_img in _batch_imgs:\n                    if img_dim == 3:  # standard img\n                        _batch_img = self.preprocess_img(_batch_img)\n                    elif img_dim == 4:\n                        _batch_img = [\n                            self.preprocess_img(_img) for _img in _batch_img\n                        ]\n\n                        _batch_img = torch.stack(_batch_img, dim=0)\n\n                    batch_imgs.append(_batch_img)\n\n                # Pad and stack Tensor.\n                if img_dim == 3:\n                    batch_imgs = stack_batch(\n                        batch_imgs, self.pad_size_divisor, self.pad_value\n                    )\n                elif img_dim == 4:\n                    batch_imgs = multiview_img_stack_batch(\n                        batch_imgs, 
self.pad_size_divisor, self.pad_value\n                    )\n\n            # Process data with `default_collate`.\n            elif isinstance(_batch_imgs, torch.Tensor):\n                assert _batch_imgs.dim() == 4, (\n                    \"The input of `ImgDataPreprocessor` should be a NCHW \"\n                    \"tensor or a list of tensor, but got a tensor with \"\n                    f\"shape: {_batch_imgs.shape}\"\n                )\n                if self._channel_conversion:\n                    _batch_imgs = _batch_imgs[:, [2, 1, 0], ...]\n                # Convert to float after channel conversion to ensure\n                # efficiency\n                _batch_imgs = _batch_imgs.float()\n                if self._enable_normalize:\n                    _batch_imgs = (_batch_imgs - self.mean) / self.std\n                h, w = _batch_imgs.shape[2:]\n                target_h = (\n                    math.ceil(h / self.pad_size_divisor)\n                    * self.pad_size_divisor\n                )\n                target_w = (\n                    math.ceil(w / self.pad_size_divisor)\n                    * self.pad_size_divisor\n                )\n                pad_h = target_h - h\n                pad_w = target_w - w\n                batch_imgs = F.pad(\n                    _batch_imgs,\n                    (0, pad_w, 0, pad_h),\n                    \"constant\",\n                    self.pad_value,\n                )\n            else:\n                raise TypeError(\n                    \"Output of `cast_data` should be a list of dict \"\n                    \"or a tuple with inputs and data_samples, but got \"\n                    f\"{type(data)}: {data}\"\n                )\n\n            data[\"inputs\"][\"imgs\"] = batch_imgs\n\n        data.setdefault(\"data_samples\", None)\n\n        return data\n\n    def _get_pad_shape(self, data: dict):\n        \"\"\"Get the pad_shape of each image based on data and\n        pad_size_divisor.\"\"\"\n        # rewrite `_get_pad_shape` for obtaining image inputs.\n        _batch_inputs = data[\"inputs\"][\"img\"]\n        # Process data with `pseudo_collate`.\n        if is_seq_of(_batch_inputs, torch.Tensor):\n            batch_pad_shape = []\n            for ori_input in _batch_inputs:\n                if ori_input.dim() == 4:\n                    # mean multiview input, select one of the\n                    # image to calculate the pad shape\n                    ori_input = ori_input[0]\n                pad_h = (\n                    int(np.ceil(ori_input.shape[1] / self.pad_size_divisor))\n                    * self.pad_size_divisor\n                )\n                pad_w = (\n                    int(np.ceil(ori_input.shape[2] / self.pad_size_divisor))\n                    * self.pad_size_divisor\n                )\n                batch_pad_shape.append((pad_h, pad_w))\n        # Process data with `default_collate`.\n        elif isinstance(_batch_inputs, torch.Tensor):\n            assert _batch_inputs.dim() == 4, (\n                \"The input of `ImgDataPreprocessor` should be a NCHW tensor \"\n                \"or a list of tensor, but got a tensor with shape: \"\n                f\"{_batch_inputs.shape}\"\n            )\n            pad_h = (\n                int(np.ceil(_batch_inputs.shape[1] / self.pad_size_divisor))\n                * self.pad_size_divisor\n            )\n            pad_w = (\n                int(np.ceil(_batch_inputs.shape[2] / self.pad_size_divisor))\n                * 
self.pad_size_divisor\n            )\n            batch_pad_shape = [(pad_h, pad_w)] * _batch_inputs.shape[0]\n        else:\n            raise TypeError(\n                \"Output of `cast_data` should be a list of dict \"\n                \"or a tuple with inputs and data_samples, but got \"\n                f\"{type(data)}: {data}\"\n            )\n        return batch_pad_shape\n"
  },
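  {
    "path": "examples/pad_to_divisor_demo.py",
    "content": "\"\"\"Illustrative sketch (not part of the original repo) of the padding\nperformed by the data preprocessor above.\n\nAn NCHW batch is padded on the right/bottom so H and W become divisible\nby ``pad_size_divisor``; the real class additionally normalizes,\nconverts channel order and stacks multiview inputs.\n\"\"\"\n\nimport math\n\nimport torch\nimport torch.nn.functional as F\n\n\ndef pad_to_divisor(imgs, pad_size_divisor=32, pad_value=0.0):\n    assert imgs.dim() == 4, \"expect an NCHW tensor\"\n    h, w = imgs.shape[2:]\n    target_h = math.ceil(h / pad_size_divisor) * pad_size_divisor\n    target_w = math.ceil(w / pad_size_divisor) * pad_size_divisor\n    # F.pad pads the last dim first: (left, right, top, bottom).\n    return F.pad(\n        imgs, (0, target_w - w, 0, target_h - h), \"constant\", pad_value\n    )\n\n\nif __name__ == \"__main__\":\n    x = torch.rand(2, 3, 270, 480)\n    print(pad_to_divisor(x).shape)  # torch.Size([2, 3, 288, 480])\n"
  },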
  {
    "path": "bip3d/models/data_preprocessors/utils.py",
    "content": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom typing import List, Union\n\nimport torch\nimport torch.nn.functional as F\nfrom torch import Tensor\n\n\ndef multiview_img_stack_batch(\n    tensor_list: List[Tensor],\n    pad_size_divisor: int = 1,\n    pad_value: Union[int, float] = 0,\n) -> Tensor:\n    \"\"\"Compared to the ``stack_batch`` in `mmengine.model.utils`,\n    multiview_img_stack_batch further handle the multiview images.\n\n    See diff of padded_sizes[:, :-2] = 0 vs padded_sizes[:, 0] = 0 in line 47.\n\n    Stack multiple tensors to form a batch and pad the tensor to the max shape\n    use the right bottom padding mode in these images. If\n    ``pad_size_divisor > 0``, add padding to ensure the shape of each dim is\n    divisible by ``pad_size_divisor``.\n\n    Args:\n        tensor_list (List[Tensor]): A list of tensors with the same dim.\n        pad_size_divisor (int): If ``pad_size_divisor > 0``, add padding to\n            ensure the shape of each dim is divisible by ``pad_size_divisor``.\n            This depends on the model, and many models need to be divisible by\n            32. Defaults to 1.\n        pad_value (int or float): The padding value. Defaults to 0.\n\n    Returns:\n        Tensor: The n dim tensor.\n    \"\"\"\n    assert isinstance(\n        tensor_list, list\n    ), f\"Expected input type to be list, but got {type(tensor_list)}\"\n    assert tensor_list, \"`tensor_list` could not be an empty list\"\n    assert len({tensor.ndim for tensor in tensor_list}) == 1, (\n        \"Expected the dimensions of all tensors must be the same, \"\n        f\"but got {[tensor.ndim for tensor in tensor_list]}\"\n    )\n\n    dim = tensor_list[0].dim()\n    num_img = len(tensor_list)\n    all_sizes: torch.Tensor = torch.Tensor(\n        [tensor.shape for tensor in tensor_list]\n    )\n    max_sizes = (\n        torch.ceil(torch.max(all_sizes, dim=0)[0] / pad_size_divisor)\n        * pad_size_divisor\n    )\n    padded_sizes = max_sizes - all_sizes\n    # The first dim normally means channel, which should not be padded.\n    padded_sizes[:, :-2] = 0\n    if padded_sizes.sum() == 0:\n        return torch.stack(tensor_list)\n    # `pad` is the second arguments of `F.pad`. If pad is (1, 2, 3, 4),\n    # it means that padding the last dim with 1(left) 2(right), padding the\n    # penultimate dim to 3(top) 4(bottom). The order of `pad` is opposite of\n    # the `padded_sizes`. Therefore, the `padded_sizes` needs to be reversed,\n    # and only odd index of pad should be assigned to keep padding \"right\" and\n    # \"bottom\".\n    pad = torch.zeros(num_img, 2 * dim, dtype=torch.int)\n    pad[:, 1::2] = padded_sizes[:, range(dim - 1, -1, -1)]\n    batch_tensor = []\n    for idx, tensor in enumerate(tensor_list):\n        batch_tensor.append(\n            F.pad(tensor, tuple(pad[idx].tolist()), value=pad_value)\n        )\n    return torch.stack(batch_tensor)\n"
  },
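  {
    "path": "examples/multiview_stack_batch_demo.py",
    "content": "\"\"\"Usage sketch for ``multiview_img_stack_batch`` (illustrative,\nassumes the bip3d package is importable).\n\nTwo samples with the same number of views but different image sizes are\npadded on the right/bottom to a shared shape and stacked into a single\n(B, N, C, H, W) tensor.\n\"\"\"\n\nimport torch\n\nfrom bip3d.models.data_preprocessors.utils import multiview_img_stack_batch\n\nif __name__ == \"__main__\":\n    sample_a = torch.rand(4, 3, 224, 224)  # 4 views\n    sample_b = torch.rand(4, 3, 200, 256)\n    batch = multiview_img_stack_batch(\n        [sample_a, sample_b], pad_size_divisor=32\n    )\n    # Max H/W rounded up to multiples of 32: 224 and 256.\n    print(batch.shape)  # torch.Size([2, 4, 3, 224, 256])\n"
  },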
  {
    "path": "bip3d/models/deformable_aggregation.py",
    "content": "from typing import List, Optional, Tuple\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nfrom torch.cuda.amp.autocast_mode import autocast\n\nfrom mmcv.cnn import build_norm_layer\nfrom mmcv.cnn import Linear\nfrom mmengine.model import (\n    Sequential,\n    BaseModule,\n    xavier_init,\n    constant_init,\n    ModuleList,\n)\nfrom mmcv.cnn.bricks.transformer import FFN\n\nfrom ..ops import deformable_aggregation_func\nfrom .utils import linear_act_ln\nfrom bip3d.registry import MODELS\n\n__all__ = [\n    \"DeformableFeatureAggregation\",\n]\n\n\n@MODELS.register_module()\nclass DeformableFeatureAggregation(BaseModule):\n    def __init__(\n        self,\n        embed_dims: int = 256,\n        num_groups: int = 8,\n        num_levels: int = 4,\n        num_cams: int = 6,\n        proj_drop: float = 0.0,\n        attn_drop: float = 0.0,\n        kps_generator: dict = None,\n        temporal_fusion_module=None,\n        use_temporal_anchor_embed=True,\n        use_deformable_func=False,\n        use_camera_embed=False,\n        residual_mode=\"add\",\n        batch_first=True,\n        ffn_cfg=None,\n        with_value_proj=False,\n        filter_outlier=False,\n        with_depth=False,\n        min_depth=None,\n        max_depth=None,\n    ):\n        super(DeformableFeatureAggregation, self).__init__()\n        assert batch_first\n        if embed_dims % num_groups != 0:\n            raise ValueError(\n                f\"embed_dims must be divisible by num_groups, \"\n                f\"but got {embed_dims} and {num_groups}\"\n            )\n        self.group_dims = int(embed_dims / num_groups)\n        self.embed_dims = embed_dims\n        self.num_levels = num_levels\n        self.num_groups = num_groups\n        self.num_cams = num_cams\n        self.use_temporal_anchor_embed = use_temporal_anchor_embed\n        self.use_deformable_func = use_deformable_func\n        self.attn_drop = attn_drop\n        self.residual_mode = residual_mode\n        self.proj_drop = nn.Dropout(proj_drop)\n        if kps_generator is not None:\n            kps_generator[\"embed_dims\"] = embed_dims\n            self.kps_generator = MODELS.build(kps_generator)\n            self.num_pts = self.kps_generator.num_pts\n        else:\n            self.kps_generator = None\n            self.num_pts = 1\n        if temporal_fusion_module is not None:\n            if \"embed_dims\" not in temporal_fusion_module:\n                temporal_fusion_module[\"embed_dims\"] = embed_dims\n            self.temp_module = MODELS.build(temporal_fusion_module)\n        else:\n            self.temp_module = None\n        self.output_proj = Linear(embed_dims, embed_dims)\n\n        if use_camera_embed:\n            self.camera_encoder = Sequential(\n                *linear_act_ln(embed_dims, 1, 2, 12)\n            )\n            self.weights_fc = Linear(\n                embed_dims, num_groups * num_levels * self.num_pts\n            )\n        else:\n            self.camera_encoder = None\n            self.weights_fc = Linear(\n                embed_dims, num_groups * num_cams * num_levels * self.num_pts\n            )\n        if ffn_cfg is not None:\n            ffn_cfg[\"embed_dims\"] = embed_dims\n            self.ffn = FFN(**ffn_cfg)\n            self.norms = ModuleList(\n                build_norm_layer(dict(type=\"LN\"), embed_dims)[1]\n                for _ in range(2)\n            )\n        else:\n            self.ffn = None\n        self.with_value_proj = with_value_proj\n        
self.filter_outlier = filter_outlier\n        if self.with_value_proj:\n            self.value_proj = Linear(embed_dims, embed_dims)\n        self.with_depth = with_depth\n        if self.with_depth:\n            assert min_depth is not None and max_depth is not None\n            self.min_depth = min_depth\n            self.max_depth = max_depth\n\n    def init_weight(self):\n        constant_init(self.weights_fc, val=0.0, bias=0.0)\n        xavier_init(self.output_proj, distribution=\"uniform\", bias=0.0)\n\n    def get_spatial_shape_3D(self, spatial_shape, depth_dim):\n        spatial_shape_depth = spatial_shape.new_ones(*spatial_shape.shape[:-1], 1) * depth_dim\n        spatial_shape_3D = torch.cat([spatial_shape, spatial_shape_depth], dim=-1)\n        return spatial_shape_3D.contiguous()\n\n    def forward(\n        self,\n        instance_feature: torch.Tensor,\n        anchor: torch.Tensor,\n        anchor_embed: torch.Tensor,\n        feature_maps: List[torch.Tensor],\n        metas: dict,\n        depth_prob=None,\n        **kwargs: dict,\n    ):\n        bs, num_anchor = instance_feature.shape[:2]\n        if self.kps_generator is not None:\n            key_points = self.kps_generator(anchor, instance_feature)\n        else:\n            key_points = anchor[:, :, None]\n\n        points_2d, depth, mask = self.project_points(\n            key_points,\n            metas[\"projection_mat\"],\n            metas.get(\"image_wh\"),\n        )\n        weights = self._get_weights(\n            instance_feature, anchor_embed, metas, mask\n        )\n\n        if self.use_deformable_func:\n            if self.with_value_proj:\n                feature_maps[0] = self.value_proj(feature_maps[0])\n\n            points_2d = points_2d.permute(0, 2, 3, 1, 4).reshape(\n                bs, num_anchor * self.num_pts, -1, 2\n            )\n            weights = (\n                weights.permute(0, 1, 4, 2, 3, 5)\n                .contiguous()\n                .reshape(\n                    bs,\n                    num_anchor * self.num_pts,\n                    -1,\n                    self.num_levels,\n                    self.num_groups,\n                )\n            )\n            if self.with_depth:\n                depth = depth.permute(0, 2, 3, 1).reshape(\n                    bs, num_anchor * self.num_pts, -1, 1\n                )\n                # normalize depth to [0, depth_prob.shape[-1]-1]\n                depth = (depth - self.min_depth) / (self.max_depth - self.min_depth)\n                depth = depth * (depth_prob.shape[-1] - 1)\n                features = deformable_aggregation_func(\n                    *feature_maps, points_2d, weights, depth_prob, depth\n                )\n            else:\n                features = deformable_aggregation_func(*feature_maps, points_2d, weights)\n            features = features.reshape(bs, num_anchor, self.num_pts, self.embed_dims)\n            features = features.sum(dim=2)\n        else:\n            assert False\n            assert not self.with_value_proj\n            features = self.feature_sampling(\n                feature_maps,\n                key_points,\n                metas[\"projection_mat\"],\n                metas.get(\"image_wh\"),\n            )\n            features = self.multi_view_level_fusion(features, weights)\n            features = features.sum(dim=2)  # fuse multi-point features\n        output = self.proj_drop(self.output_proj(features))\n        if self.residual_mode == \"add\":\n            output = output + 
instance_feature\n        elif self.residual_mode == \"cat\":\n            output = torch.cat([output, instance_feature], dim=-1)\n        if self.ffn is not None:\n            output = self.norms[0](output)\n            output = self.ffn(output)\n            output = self.norms[1](output)\n        return output\n\n    def _get_weights(\n        self, instance_feature, anchor_embed, metas=None, mask=None\n    ):\n        bs, num_anchor = instance_feature.shape[:2]\n        feature = instance_feature + anchor_embed\n        if self.camera_encoder is not None:\n            num_cams = metas[\"projection_mat\"].shape[1]\n            camera_embed = self.camera_encoder(\n                metas[\"projection_mat\"][:, :, :3].reshape(bs, num_cams, -1)\n            )\n            feature = feature[:, :, None] + camera_embed[:, None]\n\n        weights = self.weights_fc(feature)\n        if mask is not None and self.filter_outlier:\n            num_cams = weights.shape[2]\n            mask = mask.permute(0, 2, 1, 3)[..., None, :, None]\n            weights = weights.reshape(\n                bs,\n                num_anchor,\n                num_cams,\n                self.num_levels,\n                self.num_pts,\n                self.num_groups,\n            )\n            weights = weights.masked_fill(\n                torch.logical_and(~mask, mask.sum(dim=2, keepdim=True) != 0),\n                float(\"-inf\"),\n            )\n        weights = (\n            weights.reshape(bs, num_anchor, -1, self.num_groups)\n            .softmax(dim=-2)\n            .reshape(\n                bs,\n                num_anchor,\n                -1,\n                self.num_levels,\n                self.num_pts,\n                self.num_groups,\n            )\n        )\n        if self.training and self.attn_drop > 0:\n            # Inverted dropout on the fusion weights; dim 2 of the\n            # reshaped weights is the per-view axis.\n            mask = torch.rand(\n                bs, num_anchor, weights.shape[2], 1, self.num_pts, 1\n            )\n            mask = mask.to(device=weights.device, dtype=weights.dtype)\n            weights = ((mask > self.attn_drop) * weights) / (\n                1 - self.attn_drop\n            )\n        return weights\n\n    @staticmethod\n    def project_points(key_points, projection_mat, image_wh=None):\n        bs, num_anchor, num_pts = key_points.shape[:3]\n\n        pts_extend = torch.cat(\n            [key_points, torch.ones_like(key_points[..., :1])], dim=-1\n        )\n        points_2d = torch.matmul(\n            projection_mat[:, :, None, None], pts_extend[:, None, ..., None]\n        ).squeeze(-1)\n        depth = points_2d[..., 2]\n        mask = depth > 1e-5\n        points_2d = points_2d[..., :2] / torch.clamp(\n            points_2d[..., 2:3], min=1e-5\n        )\n        mask = mask & (points_2d[..., 0] > 0) & (points_2d[..., 1] > 0)\n        if image_wh is not None:\n            points_2d = points_2d / image_wh[:, :, None, None]\n            mask = mask & (points_2d[..., 0] < 1) & (points_2d[..., 1] < 1)\n        return points_2d, depth, mask\n\n    @staticmethod\n    def feature_sampling(\n        feature_maps: List[torch.Tensor],\n        key_points: torch.Tensor,\n        projection_mat: torch.Tensor,\n        image_wh: Optional[torch.Tensor] = None,\n    ) -> torch.Tensor:\n        num_levels = len(feature_maps)\n        num_cams = feature_maps[0].shape[1]\n        bs, num_anchor, num_pts = key_points.shape[:3]\n\n        # project_points returns (points_2d, depth, mask); only the\n        # normalized 2D coordinates are needed for grid sampling.\n        points_2d = DeformableFeatureAggregation.project_points(\n            key_points, projection_mat, image_wh\n        )[0]\n        points_2d = points_2d * 2 - 1\n        points_2d = points_2d.flatten(end_dim=1)\n\n        features = []\n        for fm in feature_maps:\n            features.append(\n                torch.nn.functional.grid_sample(\n                    fm.flatten(end_dim=1), points_2d\n                )\n            )\n        features = torch.stack(features, dim=1)\n        features = features.reshape(\n            bs, num_cams, num_levels, -1, num_anchor, num_pts\n        ).permute(\n            0, 4, 1, 2, 5, 3\n        )  # bs, num_anchor, num_cams, num_levels, num_pts, embed_dims\n\n        return features\n\n    def multi_view_level_fusion(\n        self,\n        features: torch.Tensor,\n        weights: torch.Tensor,\n    ):\n        bs, num_anchor = weights.shape[:2]\n        features = weights[..., None] * features.reshape(\n            features.shape[:-1] + (self.num_groups, self.group_dims)\n        )\n        features = features.sum(dim=2).sum(dim=2)\n        features = features.reshape(\n            bs, num_anchor, self.num_pts, self.embed_dims\n        )\n        return features\n"
  },
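  {
    "path": "examples/project_points_demo.py",
    "content": "\"\"\"Worked example (illustrative, standalone) of the point projection in\n``DeformableFeatureAggregation.project_points``.\n\nA 3D key point is extended to homogeneous coordinates, multiplied by a\n4x4 projection matrix (intrinsic @ extrinsic), divided by its depth and\nnormalized by the image size, so points visible in a view fall in\n(0, 1). This re-implements the same math with plain torch rather than\nimporting the repo function.\n\"\"\"\n\nimport torch\n\n\ndef project(key_points, projection_mat, image_wh):\n    # key_points: (bs, num_anchor, num_pts, 3)\n    # projection_mat: (bs, num_cams, 4, 4); image_wh: (bs, num_cams, 2)\n    pts = torch.cat(\n        [key_points, torch.ones_like(key_points[..., :1])], dim=-1\n    )\n    pts_2d = torch.matmul(\n        projection_mat[:, :, None, None], pts[:, None, ..., None]\n    ).squeeze(-1)\n    depth = pts_2d[..., 2]\n    pts_2d = pts_2d[..., :2] / torch.clamp(pts_2d[..., 2:3], min=1e-5)\n    pts_2d = pts_2d / image_wh[:, :, None, None]\n    mask = (depth > 1e-5) & (pts_2d > 0).all(-1) & (pts_2d < 1).all(-1)\n    return pts_2d, depth, mask\n\n\nif __name__ == \"__main__\":\n    # Identity extrinsic; focal length 100, principal point (320, 240).\n    proj = torch.eye(4)\n    proj[0, 0] = proj[1, 1] = 100.0\n    proj[0, 2], proj[1, 2] = 320.0, 240.0\n    proj = proj[None, None]  # bs=1, num_cams=1\n    wh = torch.tensor([[[640.0, 480.0]]])\n    point = torch.tensor([[[[1.0, 0.5, 5.0]]]])  # 5 m in front\n    uv, depth, mask = project(point, proj, wh)\n    print(uv, depth, mask)  # uv ~ (0.531, 0.521), depth 5.0, mask True\n"
  },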
  {
    "path": "bip3d/models/feature_enhancer.py",
    "content": "import torch\nfrom torch import nn\n\nfrom mmengine.model import BaseModel\nfrom mmdet.models.layers.transformer.utils import get_text_sine_pos_embed\nfrom mmdet.models.layers import SinePositionalEncoding\nfrom mmdet.models.layers.transformer.deformable_detr_layers import (\n    DeformableDetrTransformerEncoder as DDTE,\n)\nfrom mmdet.models.layers.transformer.deformable_detr_layers import (\n    DeformableDetrTransformerEncoderLayer,\n)\nfrom mmdet.models.layers.transformer.detr_layers import (\n    DetrTransformerEncoderLayer,\n)\nfrom mmdet.models.utils.vlfuse_helper import SingleScaleBiAttentionBlock\nfrom bip3d.registry import MODELS\n\nfrom .utils import deformable_format\n\n\n@MODELS.register_module()\nclass TextImageDeformable2DEnhancer(BaseModel):\n    def __init__(\n        self,\n        num_layers,\n        text_img_attn_block,\n        img_attn_block,\n        text_attn_block=None,\n        embed_dims=256,\n        num_feature_levels=4,\n        positional_encoding=None,\n        **kwargs,\n    ):\n        super().__init__(**kwargs)\n        self.num_layers = num_layers\n        self.num_feature_levels = num_feature_levels\n        self.embed_dims = embed_dims\n        self.positional_encoding = positional_encoding\n        self.text_img_attn_blocks = nn.ModuleList()\n        self.img_attn_blocks = nn.ModuleList()\n        self.text_attn_blocks = nn.ModuleList()\n        for _ in range(self.num_layers):\n            self.text_img_attn_blocks.append(\n                SingleScaleBiAttentionBlock(**text_img_attn_block)\n            )\n            self.img_attn_blocks.append(\n                DeformableDetrTransformerEncoderLayer(**img_attn_block)\n            )\n            self.text_attn_blocks.append(\n                DetrTransformerEncoderLayer(**text_attn_block)\n            )\n        self.positional_encoding = SinePositionalEncoding(\n            **self.positional_encoding\n        )\n        self.level_embed = nn.Parameter(\n            torch.Tensor(self.num_feature_levels, self.embed_dims)\n        )\n\n    def forward(\n        self,\n        feature_maps,\n        text_dict=None,\n        **kwargs,\n    ):\n        with_cams = feature_maps[0].dim() == 5\n        if with_cams:\n            bs, num_cams = feature_maps[0].shape[:2]\n            feature_maps = [x.flatten(0, 1) for x in feature_maps]\n        else:\n            bs = feature_maps[0].shape[0]\n            num_cams = 1\n        pos_2d = self.get_2d_position_embed(feature_maps)\n        feature_2d, spatial_shapes, level_start_index = deformable_format(\n            feature_maps\n        )\n\n        reference_points = DDTE.get_encoder_reference_points(\n            spatial_shapes,\n            valid_ratios=feature_2d.new_ones(\n                [bs * num_cams, self.num_feature_levels, 2]\n            ),\n            device=feature_2d.device,\n        )\n\n        text_feature = text_dict[\"embedded\"]\n        pos_text = get_text_sine_pos_embed(\n            text_dict[\"position_ids\"][..., None],\n            num_pos_feats=self.embed_dims,\n            exchange_xy=False,\n        )\n\n        for layer_id in range(self.num_layers):\n            feature_2d_fused = feature_2d[:, level_start_index[-1] :]\n            if with_cams:\n                feature_2d_fused = feature_2d_fused.unflatten(\n                    0, (bs, num_cams)\n                )\n                feature_2d_fused = feature_2d_fused.flatten(1, 2)\n            feature_2d_fused, text_feature = self.text_img_attn_blocks[\n          
      layer_id\n            ](\n                feature_2d_fused,\n                text_feature,\n                attention_mask_l=text_dict[\"text_token_mask\"],\n            )\n            if with_cams:\n                feature_2d_fused = feature_2d_fused.unflatten(\n                    1, (num_cams, -1)\n                )\n                feature_2d_fused = feature_2d_fused.flatten(0, 1)\n            feature_2d = torch.cat(\n                [feature_2d[:, : level_start_index[-1]], feature_2d_fused],\n                dim=1,\n            )\n\n            feature_2d = self.img_attn_blocks[layer_id](\n                query=feature_2d,\n                query_pos=pos_2d,\n                reference_points=reference_points,\n                spatial_shapes=spatial_shapes,\n                level_start_index=level_start_index,\n                key_padding_mask=None,\n            )\n\n            text_attn_mask = text_dict.get(\"masks\")\n            if text_attn_mask is not None:\n                text_num_heads = self.text_attn_blocks[\n                    layer_id\n                ].self_attn_cfg.num_heads\n                text_attn_mask = ~text_attn_mask.repeat(text_num_heads, 1, 1)\n            text_feature = self.text_attn_blocks[layer_id](\n                query=text_feature,\n                query_pos=pos_text,\n                attn_mask=text_attn_mask,\n                key_padding_mask=None,\n            )\n        feature_2d = deformable_format(\n            feature_2d, spatial_shapes, batch_size=bs if with_cams else None\n        )\n        return feature_2d, text_feature\n\n    def get_2d_position_embed(self, feature_maps):\n        pos_2d = []\n        for lvl, feat in enumerate(feature_maps):\n            batch_size, c, h, w = feat.shape\n            pos = self.positional_encoding(None, feat)\n            pos = pos.view(batch_size, c, h * w).permute(0, 2, 1)\n            pos = pos + self.level_embed[lvl]\n            pos_2d.append(pos)\n        pos_2d = torch.cat(pos_2d, 1)\n        return pos_2d\n"
  },
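  {
    "path": "examples/flatten_levels_demo.py",
    "content": "\"\"\"Sketch (illustrative; ``deformable_format`` itself is defined in\n``bip3d/models/utils.py`` and not reproduced here) of the multi-level\nflattening that the feature enhancer consumes.\n\nBased on how its outputs are used, the helper is assumed to turn a list\nof (B, C, H_l, W_l) maps into a (B, sum(H_l*W_l), C) token tensor plus\nper-level ``spatial_shapes`` and ``level_start_index``, the standard\nDeformable-DETR encoder inputs.\n\"\"\"\n\nimport torch\n\n\ndef flatten_levels(feature_maps):\n    spatial_shapes = torch.tensor([fm.shape[-2:] for fm in feature_maps])\n    # (B, sum(H*W), C)\n    tokens = torch.cat(\n        [fm.flatten(2).transpose(1, 2) for fm in feature_maps], dim=1\n    )\n    level_start_index = torch.cat(\n        [spatial_shapes.new_zeros(1), spatial_shapes.prod(1).cumsum(0)[:-1]]\n    )\n    return tokens, spatial_shapes, level_start_index\n\n\nif __name__ == \"__main__\":\n    maps = [torch.rand(2, 256, 32, 32), torch.rand(2, 256, 16, 16)]\n    tokens, shapes, starts = flatten_levels(maps)\n    print(tokens.shape, starts.tolist())  # (2, 1280, 256), [0, 1024]\n"
  },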
  {
    "path": "bip3d/models/instance_bank.py",
    "content": "import torch\nfrom torch import nn\nimport torch.nn.functional as F\nimport numpy as np\n\nfrom bip3d.registry import MODELS\n\n__all__ = [\"InstanceBank\"]\n\n\ndef topk(confidence, k, *inputs):\n    bs, N = confidence.shape[:2]\n    confidence, indices = torch.topk(confidence, k, dim=1)\n    indices = (\n        indices + torch.arange(bs, device=indices.device)[:, None] * N\n    ).reshape(-1)\n    outputs = []\n    for input in inputs:\n        outputs.append(input.flatten(end_dim=1)[indices].reshape(bs, k, -1))\n    return confidence, outputs\n\n\n@MODELS.register_module()\nclass InstanceBank(nn.Module):\n    def __init__(\n        self,\n        num_anchor,\n        embed_dims,\n        anchor,\n        anchor_handler=None,\n        num_current_instance=None,\n        num_temp_instances=0,\n        default_time_interval=0.5,\n        confidence_decay=0.6,\n        anchor_grad=True,\n        feat_grad=True,\n        max_time_interval=2,\n        anchor_in_camera=True,\n    ):\n        super(InstanceBank, self).__init__()\n        self.embed_dims = embed_dims\n        self.num_current_instance = num_current_instance\n        self.num_temp_instances = num_temp_instances\n        self.default_time_interval = default_time_interval\n        self.confidence_decay = confidence_decay\n        self.max_time_interval = max_time_interval\n\n        if anchor_handler is not None:\n            anchor_handler = MODELS.build(anchor_handler)\n            assert hasattr(anchor_handler, \"anchor_projection\")\n        self.anchor_handler = anchor_handler\n        if isinstance(anchor, str):\n            anchor = np.load(anchor)\n        elif isinstance(anchor, (list, tuple)):\n            anchor = np.array(anchor)\n        if len(anchor.shape) == 3:  # for map\n            anchor = anchor.reshape(anchor.shape[0], -1)\n        self.num_anchor = min(len(anchor), num_anchor)\n        self.anchor = anchor[:num_anchor]\n        # self.anchor = nn.Parameter(\n        #     torch.tensor(anchor, dtype=torch.float32),\n        #     requires_grad=anchor_grad,\n        # )\n        self.anchor_init = anchor\n        self.instance_feature = nn.Parameter(\n            torch.zeros([1, self.embed_dims]),\n            # torch.zeros([self.anchor.shape[0], self.embed_dims]),\n            requires_grad=feat_grad,\n        )\n        self.anchor_in_camera = anchor_in_camera\n        self.reset()\n\n    def init_weight(self):\n        self.anchor.data = self.anchor.data.new_tensor(self.anchor_init)\n        if self.instance_feature.requires_grad:\n            torch.nn.init.xavier_uniform_(self.instance_feature.data, gain=1)\n\n    def reset(self):\n        self.cached_feature = None\n        self.cached_anchor = None\n        self.metas = None\n        self.mask = None\n        self.confidence = None\n        self.temp_confidence = None\n        self.instance_id = None\n        self.prev_id = 0\n\n    def bbox_transform(self, bbox, matrix):\n        # bbox: bs, n, 9\n        # matrix: bs, cam, 4, 4\n        # output: bs, n*cam, 9\n        bbox = bbox.unsqueeze(dim=2)\n        matrix = matrix.unsqueeze(dim=1)\n        points = bbox[..., :3]\n        points_extend = torch.concat(\n            [points, torch.ones_like(points[..., :1])], dim=-1\n        )\n        points_trans = torch.matmul(matrix, points_extend[..., None])[\n            ..., :3, 0\n        ]\n\n        size = bbox[..., 3:6].tile(1, 1, points_trans.shape[2], 1)\n        angle = bbox[..., 6:].tile(1, 1, points_trans.shape[2], 1)\n\n        bbox 
= torch.cat([points_trans, size, angle], dim=-1)\n        bbox = bbox.flatten(1, 2)\n        return bbox\n\n    def get(self, batch_size, metas=None, dn_metas=None):\n        instance_feature = torch.tile(\n            self.instance_feature[None], (batch_size, self.anchor.shape[0], 1)\n        )\n        anchor = torch.tile(\n            instance_feature.new_tensor(self.anchor)[None], (batch_size, 1, 1)\n        )\n\n        if self.anchor_in_camera:\n            cam2global = np.linalg.inv(metas[\"extrinsic\"].cpu().numpy())\n            cam2global = torch.from_numpy(cam2global).to(anchor)\n            anchor = self.bbox_transform(anchor, cam2global)\n            instance_feature = instance_feature.tile(1, cam2global.shape[1], 1)\n\n        if (\n            self.cached_anchor is not None\n            and batch_size == self.cached_anchor.shape[0]\n        ):\n            # assert False, \"TODO: linxuewu\"\n            # history_time = self.metas[\"timestamp\"]\n            # time_interval = metas[\"timestamp\"] - history_time\n            # time_interval = time_interval.to(dtype=instance_feature.dtype)\n            # self.mask = torch.abs(time_interval) <= self.max_time_interval\n\n            last_scan_id = self.metas[\"scan_id\"]\n            current_scan_id = metas[\"scan_id\"]\n            self.mask = torch.tensor(\n                [x==y for x,y in zip(last_scan_id, current_scan_id)],\n                device=anchor.device,\n            )\n            assert self.mask.shape[0] == 1\n            if not self.mask:\n                self.reset()\n            time_interval = instance_feature.new_tensor(\n                [self.default_time_interval] * batch_size\n            )\n        else:\n            self.reset()\n            time_interval = instance_feature.new_tensor(\n                [self.default_time_interval] * batch_size\n            )\n\n        return (\n            instance_feature,\n            anchor,\n            self.cached_feature,\n            self.cached_anchor,\n            time_interval,\n        )\n\n    def update(self, instance_feature, anchor, confidence, num_dn=None):\n        if self.cached_feature is None:\n            return instance_feature, anchor\n\n        if num_dn is not None and num_dn > 0:\n            dn_instance_feature = instance_feature[:, -num_dn:]\n            dn_anchor = anchor[:, -num_dn:]\n            instance_feature = instance_feature[:, : -num_dn]\n            anchor = anchor[:, : -num_dn]\n            confidence = confidence[:, : -num_dn]\n\n        N = self.num_current_instance\n        if N is not None and N < confidence.shape[1]:\n            confidence = confidence.max(dim=-1).values\n            _, (selected_feature, selected_anchor) = topk(\n                confidence, N, instance_feature, anchor\n            )\n        else:\n            selected_feature, selected_anchor = instance_feature, anchor\n        instance_feature = torch.cat(\n            [self.cached_feature, selected_feature], dim=1\n        )\n        anchor = torch.cat(\n            [self.cached_anchor, selected_anchor], dim=1\n        )\n\n        if num_dn is not None and num_dn > 0:\n            instance_feature = torch.cat(\n                [instance_feature, dn_instance_feature], dim=1\n            )\n            anchor = torch.cat([anchor, dn_anchor], dim=1)\n        return instance_feature, anchor\n\n    def cache(\n        self,\n        instance_feature,\n        anchor,\n        confidence,\n        metas=None,\n        feature_maps=None,\n    ):\n        if 
self.num_temp_instances <= 0:\n            return\n        instance_feature = instance_feature.detach()\n        anchor = anchor.detach()\n        confidence = confidence.detach()\n\n        self.metas = metas\n        confidence = confidence.max(dim=-1).values.sigmoid()\n        if self.confidence is not None:\n            N = self.confidence.shape[1]\n            confidence[:, : N] = torch.maximum(\n                self.confidence * self.confidence_decay,\n                confidence[:, : N],\n            )\n        self.temp_confidence = confidence\n\n        if self.num_temp_instances < confidence.shape[1]:\n            (\n                self.confidence,\n                (self.cached_feature, self.cached_anchor),\n            ) = topk(\n                confidence, self.num_temp_instances, instance_feature, anchor\n            )\n        else:\n            self.confidence, self.cached_feature, self.cached_anchor = (\n                confidence, instance_feature, anchor\n            )\n\n    def get_instance_id(self, confidence, anchor=None, threshold=None):\n        confidence = confidence.max(dim=-1).values.sigmoid()\n        instance_id = confidence.new_full(confidence.shape, -1).long()\n\n        if (\n            self.instance_id is not None\n            and self.instance_id.shape[0] == instance_id.shape[0]\n        ):\n            instance_id[:, : self.instance_id.shape[1]] = self.instance_id\n\n        mask = instance_id < 0\n        if threshold is not None:\n            mask = mask & (confidence >= threshold)\n        num_new_instance = mask.sum()\n        new_ids = torch.arange(num_new_instance).to(instance_id) + self.prev_id\n        instance_id[torch.where(mask)] = new_ids\n        self.prev_id += num_new_instance\n        self.update_instance_id(instance_id, confidence)\n        return instance_id\n\n    def update_instance_id(self, instance_id=None, confidence=None):\n        if self.temp_confidence is None:\n            if confidence.dim() == 3:  # bs, num_anchor, num_cls\n                temp_conf = confidence.max(dim=-1).values\n            else:  # bs, num_anchor\n                temp_conf = confidence\n        else:\n            temp_conf = self.temp_confidence\n        instance_id = topk(temp_conf, self.num_temp_instances, instance_id)[1][\n            0\n        ]\n        instance_id = instance_id.squeeze(dim=-1)\n        self.instance_id = F.pad(\n            instance_id,\n            (0, self.num_anchor - self.num_temp_instances),\n            value=-1,\n        )\n"
  },
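  {
    "path": "examples/instance_bank_topk_demo.py",
    "content": "\"\"\"Usage sketch for the ``topk`` helper in\n``bip3d/models/instance_bank.py`` (illustrative, assumes the bip3d\npackage is importable).\n\nIt selects the k highest-confidence anchors per sample and gathers the\nsame rows from any number of companion tensors (features, anchors, ...).\n\"\"\"\n\nimport torch\n\nfrom bip3d.models.instance_bank import topk\n\nif __name__ == \"__main__\":\n    confidence = torch.tensor([[0.1, 0.9, 0.4], [0.7, 0.2, 0.3]])\n    feats = torch.arange(6.0).reshape(2, 3, 1)\n    conf_top, (feats_top,) = topk(confidence, 2, feats)\n    print(conf_top)   # tensor([[0.9, 0.4], [0.7, 0.3]])\n    print(feats_top)  # rows 1, 2 of sample 0 and rows 0, 2 of sample 1\n"
  },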
  {
    "path": "bip3d/models/spatial_enhancer.py",
    "content": "import torch\nfrom torch import nn\n\nfrom mmengine.model import BaseModel\nfrom mmdet.models.layers.transformer.utils import MLP\nfrom mmcv.cnn.bricks.transformer import FFN\n\nfrom .utils import deformable_format\nfrom bip3d.registry import MODELS\n\n\n@MODELS.register_module()\nclass DepthFusionSpatialEnhancer(BaseModel):\n    def __init__(\n        self,\n        embed_dims=256,\n        feature_3d_dim=32,\n        num_depth_layers=2,\n        min_depth=0.25,\n        max_depth=10,\n        num_depth=64,\n        with_feature_3d=True,\n        loss_depth_weight=-1,\n        **kwargs,\n    ):\n        super().__init__(**kwargs)\n        self.embed_dims = embed_dims\n        self.feature_3d_dim = feature_3d_dim\n        self.num_depth_layers = num_depth_layers\n        self.min_depth = min_depth\n        self.max_depth = max_depth\n        self.num_depth = num_depth\n        self.with_feature_3d = with_feature_3d\n        self.loss_depth_weight = loss_depth_weight\n\n        fusion_dim = self.embed_dims + self.feature_3d_dim\n        if self.with_feature_3d:\n            self.pts_prob_pre_fc = nn.Linear(\n                self.embed_dims, self.feature_3d_dim\n            )\n            dim = self.feature_3d_dim * 2\n            fusion_dim += self.feature_3d_dim\n        else:\n            dim = self.embed_dims\n        self.pts_prob_fc = MLP(\n            dim,\n            dim,\n            self.num_depth,\n            self.num_depth_layers,\n        )\n        self.pts_fc = nn.Linear(3, self.feature_3d_dim)\n        self.fusion_fc = nn.Sequential(\n            FFN(embed_dims=fusion_dim, feedforward_channels=1024),\n            nn.Linear(fusion_dim, self.embed_dims),\n        )\n        self.fusion_norm = nn.LayerNorm(self.embed_dims)\n\n    def forward(\n        self,\n        feature_maps,\n        feature_3d=None,\n        batch_inputs=None,\n        **kwargs,\n    ):\n        with_cams = feature_maps[0].dim() == 5\n        if with_cams:\n            bs, num_cams = feature_maps[0].shape[:2]\n        else:\n            bs = feature_maps[0].shape[0]\n            num_cams = 1\n\n        feature_2d, spatial_shapes, _ = deformable_format(feature_maps)\n        pts = self.get_pts(\n            spatial_shapes,\n            batch_inputs[\"image_wh\"][0, 0],\n            batch_inputs[\"projection_mat\"],\n            feature_2d.device,\n            feature_2d.dtype,\n        )\n\n        if self.with_feature_3d:\n            feature_3d = deformable_format(feature_3d)[0]\n            depth_prob_feat = self.pts_prob_pre_fc(feature_2d)\n            depth_prob_feat = torch.cat([depth_prob_feat, feature_3d], dim=-1)\n            depth_prob = self.pts_prob_fc(depth_prob_feat).softmax(dim=-1)\n            feature_fused = [feature_2d, feature_3d]\n        else:\n            depth_prob = self.pts_prob_fc(feature_2d).softmax(dim=-1)\n            feature_fused = [feature_2d]\n\n        pts_feature = self.pts_fc(pts)\n        pts_feature = (depth_prob.unsqueeze(dim=-1) * pts_feature).sum(dim=-2)\n        feature_fused.append(pts_feature)\n        feature_fused = torch.cat(feature_fused, dim=-1)\n        feature_fused = self.fusion_fc(feature_fused) + feature_2d\n        feature_fused = self.fusion_norm(feature_fused)\n        feature_fused = deformable_format(feature_fused, spatial_shapes)\n        if self.loss_depth_weight > 0 and self.training:\n            loss_depth = self.depth_prob_loss(depth_prob, batch_inputs)\n        else:\n            loss_depth = None\n        return 
feature_fused, depth_prob, loss_depth\n\n    def get_pts(self, spatial_shapes, image_wh, projection_mat, device, dtype):\n        pixels = []\n        for i, shape in enumerate(spatial_shapes):\n            stride = image_wh[0] / shape[1]\n            u = torch.linspace(\n                0, image_wh[0] - stride, shape[1], device=device, dtype=dtype\n            )\n            v = torch.linspace(\n                0, image_wh[1] - stride, shape[0], device=device, dtype=dtype\n            )\n            u = u[None].tile(shape[0], 1)\n            v = v[:, None].tile(1, shape[1])\n            uv = torch.stack([u, v], dim=-1).flatten(0, 1)\n            pixels.append(uv)\n        pixels = torch.cat(pixels, dim=0)[:, None]\n        depths = torch.linspace(\n            self.min_depth,\n            self.max_depth,\n            self.num_depth,\n            device=device,\n            dtype=dtype,\n        )\n        depths = depths[None, :, None]\n        pts = pixels * depths\n        depths = depths.tile(pixels.shape[0], 1, 1)\n        pts = torch.cat([pts, depths, torch.ones_like(depths)], dim=-1)\n\n        pts = torch.linalg.solve(\n            projection_mat.mT.unsqueeze(dim=2), pts, left=False\n        )[\n            ..., :3\n        ]  # b,cam,N,3\n        return pts\n\n    def depth_prob_loss(self, depth_prob, batch_inputs):\n        mask = batch_inputs[\"depth_prob_gt\"][..., 0] != 1\n        loss_depth = (\n            torch.nn.functional.binary_cross_entropy(\n                depth_prob[mask], batch_inputs[\"depth_prob_gt\"][mask]\n            )\n            * self.loss_depth_weight\n        )\n        return loss_depth\n"
  },
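  {
    "path": "examples/depth_bins_demo.py",
    "content": "\"\"\"Numeric sketch (illustrative, standalone) of the depth-weighted\nlifting in ``DepthFusionSpatialEnhancer``.\n\nEach pixel gets a softmax distribution over ``num_depth`` bins spanning\n[min_depth, max_depth]; per-bin 3D point features are averaged under\nthat distribution. With a scalar feature per bin this reduces to the\nexpected depth, which is what the example prints.\n\"\"\"\n\nimport torch\n\nif __name__ == \"__main__\":\n    min_depth, max_depth, num_depth = 0.25, 10.0, 64\n    bins = torch.linspace(min_depth, max_depth, num_depth)\n\n    # A distribution peaked around 5 m, standing in for the output of\n    # pts_prob_fc(...).softmax(-1) at one pixel.\n    prob = (-((bins - 5.0) ** 2)).softmax(dim=-1)\n\n    # The module computes (depth_prob.unsqueeze(-1) * pts_feature).sum(-2);\n    # with scalar per-bin features this is the expected depth.\n    print((prob * bins).sum())  # ~5.0\n"
  },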
  {
    "path": "bip3d/models/structure.py",
    "content": "from torch import nn\n\nfrom mmdet.models.detectors import BaseDetector\nfrom mmdet.models.detectors.deformable_detr import (\n    MultiScaleDeformableAttention,\n)\n\nfrom bip3d.registry import MODELS\n\n\n@MODELS.register_module()\nclass BIP3D(BaseDetector):\n    def __init__(\n        self,\n        backbone,\n        decoder,\n        neck=None,\n        text_encoder=None,\n        feature_enhancer=None,\n        spatial_enhancer=None,\n        data_preprocessor=None,\n        backbone_3d=None,\n        neck_3d=None,\n        init_cfg=None,\n        input_2d=\"imgs\",\n        input_3d=\"depth_img\",\n        embed_dims=256,\n        use_img_grid_mask=False,\n        use_depth_grid_mask=False,\n    ):\n        super().__init__(data_preprocessor, init_cfg)\n\n        def build(cfg):\n            if cfg is None:\n                return None\n            return MODELS.build(cfg)\n\n        self.backbone = build(backbone)\n        self.decoder = build(decoder)\n        self.neck = build(neck)\n        self.text_encoder = build(text_encoder)\n        self.feature_enhancer = build(feature_enhancer)\n        self.spatial_enhancer = build(spatial_enhancer)\n        self.backbone_3d = build(backbone_3d)\n        self.neck_3d = build(neck_3d)\n        self.input_2d = input_2d\n        self.input_3d = input_3d\n        self.embed_dims = embed_dims\n        if text_encoder is not None:\n            self.text_feat_map = nn.Linear(\n                self.text_encoder.language_backbone.body.language_dim,\n                self.embed_dims,\n                bias=True,\n            )\n\n        self.use_img_grid_mask = use_img_grid_mask\n        self.use_depth_grid_mask = use_depth_grid_mask\n        if use_depth_grid_mask or use_img_grid_mask:\n            from ..grid_mask import GridMask\n\n            self.grid_mask = GridMask(\n                True, True, rotate=1, offset=False, ratio=0.5, mode=1, prob=0.7\n            )\n\n    def init_weights(self):\n        \"\"\"Initialize weights for Transformer and other components.\"\"\"\n        for p in self.feature_enhancer.parameters():\n            if p.dim() > 1:\n                nn.init.xavier_uniform_(p)\n        self.decoder.init_weights()\n        for m in self.modules():\n            if isinstance(m, MultiScaleDeformableAttention):\n                m.init_weights()\n        nn.init.normal_(self.feature_enhancer.level_embed)\n        nn.init.constant_(self.text_feat_map.bias.data, 0)\n        nn.init.xavier_uniform_(self.text_feat_map.weight.data)\n\n    def extract_feat(self, batch_inputs_dict, batch_data_samples):\n        imgs = batch_inputs_dict.get(self.input_2d)\n        if imgs.dim() == 5:\n            bs, num_cams = imgs.shape[:2]\n            imgs = imgs.flatten(end_dim=1)\n        else:\n            bs = imgs.shape[0]\n            num_cams = 1\n\n        if self.use_img_grid_mask and self.training:\n            img = self.grid_mask(\n                img,\n                offset=-self.data_preprocessor.mean\n                / self.data_preprocessor.std,\n            )\n        feature_maps = self.backbone(imgs)\n        if self.neck is not None:\n            feature_maps = self.neck(feature_maps)\n        feature_maps = [x.unflatten(0, (bs, num_cams)) for x in feature_maps]\n\n        input_3d = batch_inputs_dict.get(self.input_3d)\n        if self.backbone_3d is not None and input_3d is not None:\n            if self.input_3d == \"depth_img\" and input_3d.dim() == 5:\n                assert input_3d.shape[1] == num_cams\n     
           input_3d = input_3d.flatten(end_dim=1)\n            if self.use_depth_grid_mask and self.training:\n                input_3d = self.grid_mask(input_3d)\n            feature_3d = self.backbone_3d(input_3d)\n            if self.neck_3d is not None:\n                feature_3d = self.neck_3d(feature_3d)\n            feature_3d = [x.unflatten(0, (bs, num_cams)) for x in feature_3d]\n        else:\n            feature_3d = None\n        return feature_maps, feature_3d\n\n    def extract_text_feature(self, batch_inputs_dict):\n        if self.text_encoder is not None:\n            text_dict = self.text_encoder(batch_inputs_dict[\"text\"])\n            text_dict[\"embedded\"] = self.text_feat_map(text_dict[\"embedded\"])\n        else:\n            text_dict = None\n        return text_dict\n\n    def loss(self, batch_inputs, batch_data_samples):\n        model_outs, text_dict, loss_depth = self._forward(\n            batch_inputs, batch_data_samples\n        )\n        loss = self.decoder.loss(model_outs, batch_inputs, text_dict=text_dict)\n        if loss_depth is not None:\n            loss[\"loss_depth\"] = loss_depth\n        return loss\n\n    def predict(self, batch_inputs, batch_data_samples):\n        model_outs, text_dict = self._forward(batch_inputs, batch_data_samples)\n        results = self.decoder.post_process(\n            model_outs, text_dict, batch_inputs, batch_data_samples\n        )\n        return results\n\n    def _forward(self, batch_inputs, batch_data_samples):\n        feature_maps, feature_3d = self.extract_feat(\n            batch_inputs, batch_data_samples\n        )\n        text_dict = self.extract_text_feature(batch_inputs)\n        if self.feature_enhancer is not None:\n            feature_maps, text_feature = self.feature_enhancer(\n                feature_maps=feature_maps,\n                feature_3d=feature_3d,\n                text_dict=text_dict,\n                batch_inputs=batch_inputs,\n                batch_data_samples=batch_data_samples,\n            )\n            text_dict[\"embedded\"] = text_feature\n        if self.spatial_enhancer is not None:\n            feature_maps, depth_prob, loss_depth = self.spatial_enhancer(\n                feature_maps=feature_maps,\n                feature_3d=feature_3d,\n                text_dict=text_dict,\n                batch_inputs=batch_inputs,\n                batch_data_samples=batch_data_samples,\n            )\n        else:\n            loss_depth = depth_prob = None\n        model_outs = self.decoder(\n            feature_maps=feature_maps,\n            feature_3d=feature_3d,\n            text_dict=text_dict,\n            batch_inputs=batch_inputs,\n            batch_data_samples=batch_data_samples,\n            depth_prob=depth_prob,\n        )\n        if self.training:\n            return model_outs, text_dict, loss_depth\n        return model_outs, text_dict\n"
  },
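  {
    "path": "examples/multiview_backbone_demo.py",
    "content": "\"\"\"Sketch (illustrative, standalone) of the multi-camera batching used\nby ``BIP3D.extract_feat``.\n\nMulti-view input (bs, num_cams, C, H, W) is flattened into a plain\nimage batch before the 2D backbone and unflattened afterwards, so any\nstandard per-image backbone can be reused across views.\n\"\"\"\n\nimport torch\nfrom torch import nn\n\nif __name__ == \"__main__\":\n    backbone = nn.Conv2d(3, 8, 3, stride=2, padding=1)  # stand-in backbone\n    imgs = torch.rand(2, 6, 3, 64, 64)  # bs=2, num_cams=6\n    bs, num_cams = imgs.shape[:2]\n    feats = backbone(imgs.flatten(end_dim=1))  # (bs*num_cams, 8, 32, 32)\n    feats = feats.unflatten(0, (bs, num_cams))\n    print(feats.shape)  # torch.Size([2, 6, 8, 32, 32])\n"
  },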
  {
    "path": "bip3d/models/target.py",
    "content": "import torch\nfrom torch import nn\nimport numpy as np\nimport torch.nn.functional as F\nfrom scipy.optimize import linear_sum_assignment\n\nfrom .base_target import BaseTargetWithDenoising\nfrom .utils import (\n    wasserstein_distance,\n    permutation_corner_distance,\n    center_distance,\n    get_positive_map,\n)\nfrom bip3d.registry import TASK_UTILS\n\n\n__all__ = [\"Grounding3DTarget\"]\n\n\nX, Y, Z, W, L, H, ALPHA, BETA, GAMMA = range(9)\n\n\n@TASK_UTILS.register_module()\nclass Grounding3DTarget(BaseTargetWithDenoising):\n    def __init__(\n        self,\n        cls_weight=1.0,\n        alpha=0.25,\n        gamma=2,\n        eps=1e-12,\n        box_weight=1.0,\n        cls_wise_reg_weights=None,\n        num_dn=0,\n        dn_noise_scale=0.5,\n        add_neg_dn=True,\n        num_temp_dn_groups=0,\n        with_dn_query=False,\n        num_classes=None,\n        embed_dims=256,\n        label_noise_scale=0.5,\n        cost_weight_wd=1.0,\n        cost_weight_pcd=0.0,\n        cost_weight_cd=0.8,\n        use_ignore_mask=False,\n    ):\n        super(Grounding3DTarget, self).__init__(num_dn, num_temp_dn_groups)\n        self.cls_weight = cls_weight\n        self.box_weight = box_weight\n        self.alpha = alpha\n        self.gamma = gamma\n        self.eps = eps\n        self.cls_wise_reg_weights = cls_wise_reg_weights\n        self.dn_noise_scale = dn_noise_scale\n        self.add_neg_dn = add_neg_dn\n        self.with_dn_query = with_dn_query\n        self.cost_weight_wd = cost_weight_wd\n        self.cost_weight_pcd = cost_weight_pcd\n        self.cost_weight_cd = cost_weight_cd\n        self.use_ignore_mask = use_ignore_mask\n        if self.with_dn_query:\n            self.num_classes = num_classes\n            self.embed_dims = embed_dims\n            self.label_noise_scale = label_noise_scale\n            self.label_embedding = nn.Embedding(self.num_classes, self.embed_dims)\n\n    def encode_reg_target(self, box_target, device=None):\n        outputs = []\n        for box in box_target:\n            if not isinstance(box, torch.Tensor):\n                box = torch.cat(\n                    [box.gravity_center, box.tensor[..., 3:]], dim=-1\n                )\n            output = torch.cat(\n                [box[..., [X, Y, Z]], box[..., [W, L, H]], box[..., ALPHA:]],\n                dim=-1,\n            )\n            if device is not None:\n                output = output.to(device=device)\n            outputs.append(output)\n        return outputs\n\n    @torch.no_grad()\n    def sample(\n        self,\n        cls_pred,\n        box_pred,\n        char_positives,\n        box_target,\n        text_dict,\n        ignore_mask=None,\n    ):\n        bs, num_pred, num_cls = cls_pred.shape\n\n        token_positive_maps = get_positive_map(char_positives, text_dict)\n        token_positive_maps = [\n            x.to(cls_pred).bool().float() for x in token_positive_maps\n        ]\n        cls_cost = self._cls_cost(cls_pred, token_positive_maps, text_dict[\"text_token_mask\"])\n        box_target = self.encode_reg_target(box_target, box_pred.device)\n        box_cost = self._box_cost(box_pred, box_target)\n\n        indices = []\n        for i in range(bs):\n            if cls_cost[i] is not None and box_cost[i] is not None:\n                cost = (cls_cost[i] + box_cost[i]).detach().cpu().numpy()\n                cost = np.where(np.isneginf(cost) | np.isnan(cost), 1e8, cost)\n                assign = linear_sum_assignment(cost)\n                
indices.append(\n                    [cls_pred.new_tensor(x, dtype=torch.int64) for x in assign]\n                )\n            else:\n                indices.append([None, None])\n\n        output_cls_target = torch.zeros_like(cls_pred)\n        output_box_target = torch.zeros_like(box_pred)\n        output_reg_weights = torch.ones_like(box_pred)\n        if self.use_ignore_mask:\n            output_ignore_mask = torch.zeros_like(cls_pred[..., 0], dtype=torch.bool)\n            ignore_mask = [output_ignore_mask.new_tensor(x) for x in ignore_mask]\n        else:\n            output_ignore_mask = None\n        for i, (pred_idx, target_idx) in enumerate(indices):\n            if len(box_target[i]) == 0:\n                continue\n            output_cls_target[i, pred_idx] = token_positive_maps[i][target_idx]\n            output_box_target[i, pred_idx] = box_target[i][target_idx]\n            if self.use_ignore_mask:\n                output_ignore_mask[i, pred_idx] = ignore_mask[i][target_idx]\n        self.indices = indices\n        return output_cls_target, output_box_target, output_reg_weights, output_ignore_mask\n\n    def _cls_cost(self, cls_pred, token_positive_maps, text_token_mask):\n        bs = cls_pred.shape[0]\n        cls_pred = cls_pred.sigmoid()\n        cost = []\n        for i in range(bs):\n            if len(token_positive_maps[i]) > 0:\n                pred = cls_pred[i][:, text_token_mask[i]]\n                neg_cost = (\n                    -(1 - pred + self.eps).log()\n                    * (1 - self.alpha)\n                    * pred.pow(self.gamma)\n                )\n                pos_cost = (\n                    -(pred + self.eps).log()\n                    * self.alpha\n                    * (1 - pred).pow(self.gamma)\n                )\n                gt = token_positive_maps[i][:, text_token_mask[i]]\n                cls_cost = torch.einsum(\n                    \"nc,mc->nm\", pos_cost, gt\n                ) + torch.einsum(\"nc,mc->nm\", neg_cost, (1 - gt))\n                cost.append(cls_cost)\n            else:\n                cost.append(None)\n        return cost\n\n    def _box_cost(self, box_pred, box_target):\n        bs = box_pred.shape[0]\n        cost = []\n        for i in range(bs):\n            if len(box_target[i]) > 0:\n                pred = box_pred[i].unsqueeze(dim=-2)\n                gt = box_target[i].unsqueeze(dim=-3)\n                _cost = 0\n                if self.cost_weight_wd > 0:\n                    _cost += self.cost_weight_wd * wasserstein_distance(\n                        pred, gt\n                    )\n                if self.cost_weight_pcd > 0:\n                    _cost += (\n                        self.cost_weight_pcd\n                        * permutation_corner_distance(pred, gt)\n                    )\n                if self.cost_weight_cd > 0:\n                    _cost += self.cost_weight_cd * center_distance(pred, gt)\n                _cost *= self.box_weight\n                cost.append(_cost)\n            else:\n                cost.append(None)\n        return cost\n\n    def get_dn_anchors(self, char_positives, box_target, text_dict, label=None, ignore_mask=None):\n        if self.num_dn <= 0:\n            return None\n        if self.num_temp_dn_groups <= 0:\n            gt_instance_id = None\n\n        char_positives = [x[: self.num_dn] for x in char_positives]\n        box_target = [x[: self.num_dn] for x in box_target]\n\n        max_dn_gt = max([len(x) for x in char_positives] + [1])\n        
token_positive_maps = get_positive_map(char_positives, text_dict)\n        token_positive_maps = torch.stack(\n            [\n                F.pad(x, (0, 0, 0, max_dn_gt - x.shape[0]), value=-1)\n                for x in token_positive_maps\n            ]\n        )\n        box_target = self.encode_reg_target(box_target, box_target[0].device)\n        box_target = torch.stack(\n            [F.pad(x, (0, 0, 0, max_dn_gt - x.shape[0])) for x in box_target]\n        )\n        token_positive_maps = token_positive_maps.to(box_target)\n        box_target = torch.where(\n            (token_positive_maps == -1).all(dim=-1, keepdim=True),\n            box_target.new_tensor(0),\n            box_target,\n        )\n\n        bs, num_gt, state_dims = box_target.shape\n        num_dn_groups = self.num_dn // max(num_gt, 1)\n\n        if num_dn_groups > 1:\n            token_positive_maps = token_positive_maps.tile(num_dn_groups, 1, 1)\n            box_target = box_target.tile(num_dn_groups, 1, 1)\n\n        noise = torch.rand_like(box_target) * 2 - 1\n        noise *= box_target.new_tensor(self.dn_noise_scale)\n        noise[..., :3] *= box_target[..., 3:6]\n        noise[..., 3:6] *= box_target[..., 3:6]\n        dn_anchor = box_target + noise\n        if self.add_neg_dn:\n            noise_neg = torch.rand_like(box_target) + 1\n            flag = torch.where(\n                torch.rand_like(box_target) > 0.5,\n                noise_neg.new_tensor(1),\n                noise_neg.new_tensor(-1),\n            )\n            noise_neg *= flag\n            noise_neg *= box_target.new_tensor(self.dn_noise_scale)\n            noise_neg[..., :3] *= box_target[..., 3:6]\n            noise_neg[..., 3:6] *= box_target[..., 3:6]\n            dn_anchor = torch.cat([dn_anchor, box_target + noise_neg], dim=1)\n            num_gt *= 2\n\n        box_cost = self._box_cost(dn_anchor, box_target)\n        dn_box_target = torch.zeros_like(dn_anchor)\n        dn_token_positive_maps = -torch.ones_like(token_positive_maps) * 3\n        if self.add_neg_dn:\n            dn_token_positive_maps = torch.cat(\n                [dn_token_positive_maps, dn_token_positive_maps], dim=1\n            )\n\n        for i in range(dn_anchor.shape[0]):\n            if box_cost[i] is None:\n                continue\n            cost = box_cost[i].cpu().numpy()\n            anchor_idx, gt_idx = linear_sum_assignment(cost)\n            anchor_idx = dn_anchor.new_tensor(anchor_idx, dtype=torch.int64)\n            gt_idx = dn_anchor.new_tensor(gt_idx, dtype=torch.int64)\n            dn_box_target[i, anchor_idx] = box_target[i, gt_idx]\n            dn_token_positive_maps[i, anchor_idx] = token_positive_maps[\n                i, gt_idx\n            ]\n        dn_anchor = (\n            dn_anchor.reshape(num_dn_groups, bs, num_gt, state_dims)\n            .permute(1, 0, 2, 3)\n            .flatten(1, 2)\n        )\n        dn_box_target = (\n            dn_box_target.reshape(num_dn_groups, bs, num_gt, state_dims)\n            .permute(1, 0, 2, 3)\n            .flatten(1, 2)\n        )\n        text_length = dn_token_positive_maps.shape[-1]\n        dn_token_positive_maps = (\n            dn_token_positive_maps.reshape(\n                num_dn_groups, bs, num_gt, text_length\n            )\n            .permute(1, 0, 2, 3)\n            .flatten(1, 2)\n        )\n\n        valid_mask = (dn_token_positive_maps >= 0).all(dim=-1)\n        if self.add_neg_dn:\n            token_positive_maps = (\n                torch.cat([token_positive_maps, 
token_positive_maps], dim=1)\n                .reshape(num_dn_groups, bs, num_gt, text_length)\n                .permute(1, 0, 2, 3)\n                .flatten(1, 2)\n            )\n            valid_mask = torch.logical_or(\n                valid_mask,\n                (\n                    (token_positive_maps >= 0).all(dim=-1)\n                    & (dn_token_positive_maps == -3).all(dim=-1)\n                ),\n            )  # valid marks entries built from real GT rather than padding.\n\n        attn_mask = dn_box_target.new_ones(\n            num_gt * num_dn_groups, num_gt * num_dn_groups\n        )\n        dn_token_positive_maps = torch.clamp(dn_token_positive_maps, min=0)\n        for i in range(num_dn_groups):\n            start = num_gt * i\n            end = start + num_gt\n            attn_mask[start:end, start:end] = 0\n        attn_mask = attn_mask == 1\n        dn_anchor[..., 3:6] = torch.clamp(dn_anchor[..., 3:6], min=0.01).log()\n\n        if label is not None and self.with_dn_query:\n            label = torch.stack(\n                [F.pad(x, (0, max_dn_gt - x.shape[0]), value=-1) for x in label]\n            )\n            label = label.tile(num_dn_groups, 1)\n            if self.add_neg_dn:\n                label = torch.cat([label, label], dim=1)\n            label = (\n                label.reshape(num_dn_groups, bs, num_gt)\n                .permute(1, 0, 2)\n                .flatten(1, 2)\n            )\n\n            label_noise_mask = torch.logical_or(\n                torch.rand_like(label.float()) < self.label_noise_scale * 0.5,\n                label == -1,\n            )\n            label = torch.where(\n                label_noise_mask,\n                torch.randint_like(label, 0, self.num_classes),\n                label,\n            )\n            dn_query = self.label_embedding(label)\n        else:\n            dn_query = None\n\n        return (\n            dn_anchor,\n            dn_box_target,\n            dn_token_positive_maps,\n            attn_mask,\n            valid_mask,\n            dn_query,\n        )\n"
  },
  {
    "path": "bip3d/models/utils.py",
    "content": "import torch\nfrom torch import nn\nfrom pytorch3d.transforms import euler_angles_to_matrix\n\n\ndef deformable_format(\n    feature_maps,\n    spatial_shapes=None,\n    level_start_index=None,\n    flat_batch=False,\n    batch_size=None,\n):\n    if spatial_shapes is None:\n        if flat_batch and feature_maps[0].dim() > 4:\n            feature_maps = [x.flatten(end_dim=-4) for x in feature_maps]\n        feat_flatten = []\n        spatial_shapes = []\n        for lvl, feat in enumerate(feature_maps):\n            spatial_shape = torch._shape_as_tensor(feat)[-2:].to(feat.device)\n            feat = feat.flatten(start_dim=-2).transpose(-1, -2)\n            feat_flatten.append(feat)\n            spatial_shapes.append(spatial_shape)\n\n        # (bs, num_feat_points, dim)\n        feat_flatten = torch.cat(feat_flatten, -2)\n        spatial_shapes = torch.cat(spatial_shapes).view(-1, 2)\n        level_start_index = torch.cat(\n            (\n                spatial_shapes.new_zeros((1,)),  # (num_level)\n                spatial_shapes.prod(1).cumsum(0)[:-1],\n            )\n        )\n        return feat_flatten, spatial_shapes, level_start_index\n    else:\n        split_size = (spatial_shapes[:, 0] * spatial_shapes[:, 1]).tolist()\n        feature_maps = feature_maps.transpose(-1, -2)\n        feature_maps = list(torch.split(feature_maps, split_size, dim=-1))\n        for i, feat in enumerate(feature_maps):\n            feature_maps[i] = feature_maps[i].unflatten(\n                -1, (spatial_shapes[i, 0], spatial_shapes[i, 1])\n            )\n            if batch_size is not None:\n                if isinstance(batch_size, int):\n                    feature_maps[i] = feature_maps[i].unflatten(\n                        0, (batch_size, -1)\n                    )\n                else:\n                    feature_maps[i] = feature_maps[i].unflatten(\n                        0, batch_size + (-1,)\n                    )\n        return feature_maps\n\n\ndef bbox_to_corners(bbox, permutation=False):\n    assert (\n        len(bbox.shape) == 2\n    ), \"bbox must be 2D tensor of shape (N, 6) or (N, 7) or (N, 9)\"\n    if bbox.shape[-1] == 6:\n        rot_mat = (\n            torch.eye(3, device=bbox.device)\n            .unsqueeze(0)\n            .repeat(bbox.shape[0], 1, 1)\n        )\n    elif bbox.shape[-1] == 7:\n        angles = bbox[:, 6:]\n        fake_angles = torch.zeros_like(angles).repeat(1, 2)\n        angles = torch.cat((angles, fake_angles), dim=1)\n        rot_mat = euler_angles_to_matrix(angles, \"ZXY\")\n    elif bbox.shape[-1] == 9:\n        rot_mat = euler_angles_to_matrix(bbox[:, 6:], \"ZXY\")\n    else:\n        raise NotImplementedError\n    centers = bbox[:, :3].unsqueeze(1).repeat(1, 8, 1)  # shape (N, 8, 3)\n    half_sizes = (\n        bbox[:, 3:6].unsqueeze(1).repeat(1, 8, 1) / 2\n    )  # shape (N, 8, 3)\n    eight_corners_x = (\n        torch.tensor([1, 1, 1, 1, -1, -1, -1, -1], device=bbox.device)\n        .unsqueeze(0)\n        .repeat(bbox.shape[0], 1)\n    )  # shape (N, 8)\n    eight_corners_y = (\n        torch.tensor([1, 1, -1, -1, 1, 1, -1, -1], device=bbox.device)\n        .unsqueeze(0)\n        .repeat(bbox.shape[0], 1)\n    )  # shape (N, 8)\n    eight_corners_z = (\n        torch.tensor([1, -1, 1, -1, 1, -1, 1, -1], device=bbox.device)\n        .unsqueeze(0)\n        .repeat(bbox.shape[0], 1)\n    )  # shape (N, 8)\n    eight_corners = torch.stack(\n        (eight_corners_x, eight_corners_y, eight_corners_z), dim=-1\n    )  # shape (N, 
8, 3)\n    eight_corners = eight_corners * half_sizes  # shape (N, 8, 3)\n    # rot_mat: (N, 3, 3), eight_corners: (N, 8, 3)\n    rotated_corners = torch.matmul(\n        eight_corners, rot_mat.transpose(1, 2)\n    )  # shape (N, 8, 3)\n    corners = rotated_corners + centers\n\n    if permutation:\n        corners = corners[:, PERMUTE_INDEX.to(corners.device)]\n    return corners\n\n\n# NOTE: PERMUTE_INDEX is used by bbox_to_corners(permutation=True) above but\n# was not defined in this file. It is reconstructed here on the assumption\n# that it enumerates the 48 corner re-orderings induced by the signed axis\n# permutations of a cube, matching the (N, 48, 8, 3) shape noted in\n# permutation_corner_distance below.\ndef _build_permute_index():\n    from itertools import permutations, product\n\n    # sign pattern of each corner, in the order used by bbox_to_corners\n    signs = torch.tensor(\n        [\n            [1, 1, 1], [1, 1, -1], [1, -1, 1], [1, -1, -1],\n            [-1, 1, 1], [-1, 1, -1], [-1, -1, 1], [-1, -1, -1],\n        ]\n    )\n    index = []\n    for perm in permutations(range(3)):\n        for flip in product((1, -1), repeat=3):\n            mapped = signs[:, list(perm)] * torch.tensor(flip)\n            # locate the corner id matching each transformed sign pattern\n            index.append(\n                [int((signs == m).all(dim=-1).nonzero()[0]) for m in mapped]\n            )\n    return torch.tensor(index)  # shape (48, 8)\n\n\nPERMUTE_INDEX = _build_permute_index()\n\n\ndef wasserstein_distance(source, target):\n    rot_mat_src = euler_angles_to_matrix(source[..., 6:9], \"ZXY\")\n    sqrt_sigma_src = rot_mat_src @ (\n        source[..., 3:6, None] * rot_mat_src.transpose(-2, -1)\n    )\n\n    rot_mat_tgt = euler_angles_to_matrix(target[..., 6:9], \"ZXY\")\n    sqrt_sigma_tgt = rot_mat_tgt @ (\n        target[..., 3:6, None] * rot_mat_tgt.transpose(-2, -1)\n    )\n\n    sigma_distance = sqrt_sigma_src - sqrt_sigma_tgt\n    sigma_distance = sigma_distance.pow(2).sum(dim=-1).sum(dim=-1)\n    center_distance = ((source[..., :3] - target[..., :3]) ** 2).sum(dim=-1)\n    distance = sigma_distance + center_distance\n    distance = distance.clamp(1e-7).sqrt()\n    return distance\n\n\ndef permutation_corner_distance(source, target):\n    source_corners = bbox_to_corners(source)  # N, 8, 3\n    target_corners = bbox_to_corners(target, True)  # N, 48, 8, 3\n    distance = torch.norm(\n        source_corners.unsqueeze(dim=-2) - target_corners, p=2, dim=-1\n    )\n    distance = distance.mean(dim=-1).min(dim=-1).values\n    return distance\n\n\ndef center_distance(source, target):\n    return torch.norm(source[..., :3] - target[..., :3], p=2, dim=-1)\n\n\ndef get_positive_map(char_positive, text_dict):\n    bs, text_length = text_dict[\"embedded\"].shape[:2]\n    tokenized = text_dict[\"tokenized\"]\n    positive_maps = []\n    for i in range(bs):\n        num_target = len(char_positive[i])\n        positive_map = torch.zeros(\n            (num_target, text_length), dtype=torch.float\n        )\n        for j, tok_list in enumerate(char_positive[i]):\n            for beg, end in tok_list:\n                try:\n                    beg_pos = tokenized.char_to_token(i, beg)\n                    end_pos = tokenized.char_to_token(i, end - 1)\n                except Exception as e:\n                    print(\"beg:\", beg, \"end:\", end)\n                    print(\"token_positive:\", char_positive[i])\n                    raise e\n                if beg_pos is None:\n                    try:\n                        beg_pos = tokenized.char_to_token(i, beg + 1)\n                        if beg_pos is None:\n                            beg_pos = tokenized.char_to_token(i, beg + 2)\n                    except Exception:\n                        beg_pos = None\n                if end_pos is None:\n                    try:\n                        end_pos = tokenized.char_to_token(i, end - 2)\n                        if end_pos is None:\n                            end_pos = tokenized.char_to_token(i, end - 3)\n                    except Exception:\n                        end_pos = None\n                if beg_pos is None or end_pos is None:\n                    continue\n                positive_map[j, beg_pos : end_pos + 1].fill_(1)\n        positive_map /= (positive_map.sum(-1)[:, None] + 1e-6)\n        positive_maps.append(positive_map)\n    return positive_maps\n\n\ndef get_entities(text, char_positive, sep_token=\"[SEP]\"):\n    batch_entities = []\n    for bs_idx in range(len(char_positive)):\n        entities = []\n        for obj_idx 
in range(len(char_positive[bs_idx])):\n            entity = \"\"\n            for beg, end in char_positive[bs_idx][obj_idx]:\n                if len(entity) == 0:\n                    entity = text[bs_idx][beg:end]\n                else:\n                    entity += sep_token + text[bs_idx][beg:end]\n            entities.append(entity)\n        batch_entities.append(entities)\n    return batch_entities\n\n\ndef linear_act_ln(\n    embed_dims, in_loops, out_loops, input_dims=None,\n    act_cfg=None,\n):\n    if act_cfg is None:\n        act_cfg = dict(type='ReLU', inplace=True)\n    from mmcv.cnn import build_activation_layer\n    if input_dims is None:\n        input_dims = embed_dims\n    layers = []\n    for _ in range(out_loops):\n        for _ in range(in_loops):\n            layers.append(nn.Linear(input_dims, embed_dims))\n            layers.append(build_activation_layer(act_cfg))\n            input_dims = embed_dims\n        layers.append(nn.LayerNorm(embed_dims))\n    return layers\n"
  },
  {
    "path": "bip3d/ops/__init__.py",
    "content": "from .deformable_aggregation import (\n    deformable_aggregation_func,\n    feature_maps_format\n)\n"
  },
  {
    "path": "bip3d/ops/deformable_aggregation.py",
    "content": "import torch\nfrom torch.autograd.function import Function, once_differentiable\nfrom torch.cuda.amp import custom_fwd, custom_bwd\n\nfrom . import deformable_aggregation_ext as da\nfrom . import deformable_aggregation_with_depth_ext as dad\n\n\nclass DeformableAggregationFunction(Function):\n    @staticmethod\n    @custom_fwd\n    def forward(\n        ctx,\n        mc_ms_feat,\n        spatial_shape,\n        scale_start_index,\n        sampling_location,\n        weights,\n    ):\n        # output: [bs, num_pts, num_embeds]\n        mc_ms_feat = mc_ms_feat.contiguous().float()\n        spatial_shape = spatial_shape.contiguous().int()\n        scale_start_index = scale_start_index.contiguous().int()\n        sampling_location = sampling_location.contiguous().float()\n        weights = weights.contiguous().float()\n        output = da.deformable_aggregation_forward(\n            mc_ms_feat,\n            spatial_shape,\n            scale_start_index,\n            sampling_location,\n            weights,\n        )\n        ctx.save_for_backward(\n            mc_ms_feat,\n            spatial_shape,\n            scale_start_index,\n            sampling_location,\n            weights,\n        )\n        return output\n\n    @staticmethod\n    @once_differentiable\n    @custom_bwd\n    def backward(ctx, grad_output):\n        (\n            mc_ms_feat,\n            spatial_shape,\n            scale_start_index,\n            sampling_location,\n            weights,\n        ) = ctx.saved_tensors\n        mc_ms_feat = mc_ms_feat.contiguous().float()\n        spatial_shape = spatial_shape.contiguous().int()\n        scale_start_index = scale_start_index.contiguous().int()\n        sampling_location = sampling_location.contiguous().float()\n        weights = weights.contiguous().float()\n\n        grad_mc_ms_feat = torch.zeros_like(mc_ms_feat)\n        grad_sampling_location = torch.zeros_like(sampling_location)\n        grad_weights = torch.zeros_like(weights)\n        da.deformable_aggregation_backward(\n            mc_ms_feat,\n            spatial_shape,\n            scale_start_index,\n            sampling_location,\n            weights,\n            grad_output.contiguous(),\n            grad_mc_ms_feat,\n            grad_sampling_location,\n            grad_weights,\n        )\n        # print(grad_mc_ms_feat.abs().mean(), grad_sampling_location.abs().mean(), grad_weights.abs().mean())\n        return (\n            grad_mc_ms_feat,\n            None,\n            None,\n            grad_sampling_location,\n            grad_weights,\n        )\n\nclass DeformableAggregationWithDepthFunction(Function):\n    @staticmethod\n    @custom_fwd\n    def forward(\n        ctx,\n        mc_ms_feat,\n        spatial_shape,\n        scale_start_index,\n        sampling_location,\n        weights,\n        num_depths,\n    ):\n        # output: [bs, num_pts, num_embeds]\n        mc_ms_feat = mc_ms_feat.contiguous().float()\n        spatial_shape = spatial_shape.contiguous().int()\n        scale_start_index = scale_start_index.contiguous().int()\n        sampling_location = sampling_location.contiguous().float()\n        weights = weights.contiguous().float()\n        output = dad.deformable_aggregation_with_depth_forward(\n            mc_ms_feat,\n            spatial_shape,\n            scale_start_index,\n            sampling_location,\n            weights,\n            num_depths,\n        )\n        ctx.save_for_backward(\n            mc_ms_feat,\n            spatial_shape,\n          
  scale_start_index,\n            sampling_location,\n            weights,\n        )\n        ctx._num_depths = num_depths\n        return output\n\n    @staticmethod\n    @once_differentiable\n    @custom_bwd\n    def backward(ctx, grad_output):\n        (\n            mc_ms_feat,\n            spatial_shape,\n            scale_start_index,\n            sampling_location,\n            weights,\n        ) = ctx.saved_tensors\n        num_depths = ctx._num_depths\n        mc_ms_feat = mc_ms_feat.contiguous().float()\n        spatial_shape = spatial_shape.contiguous().int()\n        scale_start_index = scale_start_index.contiguous().int()\n        sampling_location = sampling_location.contiguous().float()\n        weights = weights.contiguous().float()\n\n        grad_mc_ms_feat = torch.zeros_like(mc_ms_feat)\n        grad_sampling_location = torch.zeros_like(sampling_location)\n        grad_weights = torch.zeros_like(weights)\n        dad.deformable_aggregation_with_depth_backward(\n            mc_ms_feat,\n            spatial_shape,\n            scale_start_index,\n            sampling_location,\n            weights,\n            num_depths,\n            grad_output.contiguous(),\n            grad_mc_ms_feat,\n            grad_sampling_location,\n            grad_weights,\n        )\n        return (\n            grad_mc_ms_feat,\n            None,\n            None,\n            grad_sampling_location,\n            grad_weights,\n            None,\n        )\n\n\ndef deformable_aggregation_func(\n    mc_ms_feat,\n    spatial_shape,\n    scale_start_index,\n    sampling_location,\n    weights,\n    depth_prob=None,\n    depth=None,\n):\n    if depth_prob is not None and depth is not None:\n        mc_ms_feat = torch.cat([mc_ms_feat, depth_prob], dim=-1)\n        sampling_location = torch.cat([sampling_location, depth], dim=-1)\n        return DeformableAggregationWithDepthFunction.apply(\n            mc_ms_feat,\n            spatial_shape,\n            scale_start_index,\n            sampling_location,\n            weights,\n            depth_prob.shape[-1],\n        )\n    else:\n        return DeformableAggregationFunction.apply(\n            mc_ms_feat,\n            spatial_shape,\n            scale_start_index,\n            sampling_location,\n            weights,\n        )\n\n\ndef feature_maps_format(feature_maps, inverse=False):\n    bs, num_cams = feature_maps[0].shape[:2]\n    if not inverse:\n        spatial_shape = []\n        scale_start_index = [0]\n\n        col_feats = []\n        for i, feat in enumerate(feature_maps):\n            spatial_shape.append(feat.shape[-2:])\n            scale_start_index.append(\n                feat.shape[-1] * feat.shape[-2] + scale_start_index[-1]\n            )\n            col_feats.append(torch.reshape(\n                feat, (bs, num_cams, feat.shape[2], -1)\n            ))\n        scale_start_index.pop()\n        col_feats = torch.cat(col_feats, dim=-1).permute(0, 1, 3, 2)\n        feature_maps = [\n            col_feats,\n            torch.tensor(\n                spatial_shape,\n                dtype=torch.int64,\n                device=col_feats.device,\n            ),\n            torch.tensor(\n                scale_start_index,\n                dtype=torch.int64,\n                
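# int64 here; the autograd Functions cast these to int32 before the kernel\n                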
device=col_feats.device,\n            ),\n        ]\n    else:\n        spatial_shape = feature_maps[1].int()\n        split_size = (spatial_shape[:, 0] * spatial_shape[:, 1]).tolist()\n        feature_maps = feature_maps[0].permute(0, 1, 3, 2)\n        feature_maps = list(torch.split(feature_maps, split_size, dim=-1))\n        for i, feat in enumerate(feature_maps):\n            feature_maps[i] = feat.reshape(\n                feat.shape[:3] + (spatial_shape[i, 0], spatial_shape[i, 1])\n            )\n    return feature_maps\n"
  },
  {
    "path": "bip3d/ops/gcc.sh",
    "content": "export CC=/horizon-bucket/robot_lab/users/xuewu.lin/devtoolset-9/root/usr/bin/gcc\nexport CXX=/horizon-bucket/robot_lab/users/xuewu.lin/devtoolset-9/root/usr/bin/g++\nexport LD_LIBRARY_PATH=/horizon-bucket/robot_lab/users/xuewu.lin/devtoolset-9/root/usr/lib:$LD_LIBRARY_PATH\nexport LD_LIBRARY_PATH=/horizon-bucket/robot_lab/users/xuewu.lin/devtoolset-9/root/usr/lib/dyninst:$LD_LIBRARY_PATH\nexport LD_LIBRARY_PATH=/horizon-bucket/robot_lab/users/xuewu.lin/devtoolset-9/root/usr/lib64:$LD_LIBRARY_PATH\nexport LD_LIBRARY_PATH=/horizon-bucket/robot_lab/users/xuewu.lin/devtoolset-9/root/usr/lib64/dyninst:$LD_LIBRARY_PATH\n"
  },
  {
    "path": "bip3d/ops/setup.py",
    "content": "import os\n\nimport torch\nfrom setuptools import setup\nfrom torch.utils.cpp_extension import (\n    BuildExtension,\n    CppExtension,\n    CUDAExtension,\n)\n\n\ndef make_cuda_ext(\n    name,\n    module,\n    sources,\n    sources_cuda=[],\n    extra_args=[],\n    extra_include_path=[],\n):\n\n    define_macros = []\n    extra_compile_args = {\"cxx\": [] + extra_args}\n\n    if torch.cuda.is_available() or os.getenv(\"FORCE_CUDA\", \"0\") == \"1\":\n        define_macros += [(\"WITH_CUDA\", None)]\n        extension = CUDAExtension\n        extra_compile_args[\"nvcc\"] = extra_args + [\n            \"-D__CUDA_NO_HALF_OPERATORS__\",\n            \"-D__CUDA_NO_HALF_CONVERSIONS__\",\n            \"-D__CUDA_NO_HALF2_OPERATORS__\",\n        ]\n        sources += sources_cuda\n    else:\n        print(\"Compiling {} without CUDA\".format(name))\n        extension = CppExtension\n\n    return extension(\n        name=\"{}.{}\".format(module, name),\n        sources=[os.path.join(*module.split(\".\"), p) for p in sources],\n        include_dirs=extra_include_path,\n        define_macros=define_macros,\n        extra_compile_args=extra_compile_args,\n    )\n\n\nif __name__ == \"__main__\":\n    setup(\n        name=\"deformable_aggregation_with_depth_ext\",\n        ext_modules=[\n            make_cuda_ext(\n                \"deformable_aggregation_with_depth_ext\",\n                module=\".\",\n                sources=[\n                    f\"src/deformable_aggregation_with_depth.cpp\",\n                    f\"src/deformable_aggregation_with_depth_cuda.cu\",\n                ],\n            ),\n        ],\n        cmdclass={\"build_ext\": BuildExtension},\n    )\n    setup(\n        name=\"deformable_aggregation_ext\",\n        ext_modules=[\n            make_cuda_ext(\n                \"deformable_aggregation_ext\",\n                module=\".\",\n                sources=[\n                    f\"src/deformable_aggregation.cpp\",\n                    f\"src/deformable_aggregation_cuda.cu\",\n                ],\n            ),\n        ],\n        cmdclass={\"build_ext\": BuildExtension},\n    )\n"
  },
  {
    "path": "bip3d/ops/src/deformable_aggregation.cpp",
    "content": "#include <torch/extension.h>\n#include <c10/cuda/CUDAGuard.h>\n\nvoid deformable_aggregation(\n  float* output,\n  const float* mc_ms_feat,\n  const int* spatial_shape,\n  const int* scale_start_index,\n  const float* sample_location,\n  const float* weights,\n  int batch_size,\n  int num_cams,\n  int num_feat,\n  int num_embeds,\n  int num_scale,\n  int num_pts,\n  int num_groups\n);\n  \n\nvoid deformable_aggregation_grad(\n  const float* mc_ms_feat,\n  const int* spatial_shape,\n  const int* scale_start_index,\n  const float* sample_location,\n  const float* weights,\n  const float* grad_output,\n  float* grad_mc_ms_feat,\n  float* grad_sampling_location,\n  float* grad_weights,\n  int batch_size,\n  int num_cams,\n  int num_feat,\n  int num_embeds,\n  int num_scale,\n  int num_pts,\n  int num_groups\n);\n\n/* _mc_ms_feat: b, cam, feat, C */\n/* _spatial_shape: scale, 2 */\n/* _scale_start_index: scale */\n/* _sampling_location: bs, pts, */ \n\n\nat::Tensor deformable_aggregation_forward(\n  const at::Tensor &_mc_ms_feat,\n  const at::Tensor &_spatial_shape,\n  const at::Tensor &_scale_start_index,\n  const at::Tensor &_sampling_location,\n  const at::Tensor &_weights\n) {\n  at::DeviceGuard guard(_mc_ms_feat.device());\n  const at::cuda::OptionalCUDAGuard device_guard(device_of(_mc_ms_feat));\n  int batch_size = _mc_ms_feat.size(0);\n  int num_cams = _mc_ms_feat.size(1);\n  int num_feat = _mc_ms_feat.size(2);\n  int num_embeds = _mc_ms_feat.size(3);\n  int num_scale = _spatial_shape.size(0);\n  int num_pts = _sampling_location.size(1);\n  int num_groups = _weights.size(4);\n\n  const float* mc_ms_feat = _mc_ms_feat.data_ptr<float>();\n  const int* spatial_shape = _spatial_shape.data_ptr<int>();\n  const int* scale_start_index = _scale_start_index.data_ptr<int>();\n  const float* sampling_location = _sampling_location.data_ptr<float>();\n  const float* weights = _weights.data_ptr<float>();\n\n  auto output = at::zeros({batch_size, num_pts, num_embeds}, _mc_ms_feat.options());\n  deformable_aggregation(\n    output.data_ptr<float>(),\n    mc_ms_feat, spatial_shape, scale_start_index, sampling_location, weights,\n    batch_size, num_cams, num_feat, num_embeds, num_scale, num_pts, num_groups\n  );\n  return output;\n}\n\nvoid deformable_aggregation_backward(\n  const at::Tensor &_mc_ms_feat,\n  const at::Tensor &_spatial_shape,\n  const at::Tensor &_scale_start_index,\n  const at::Tensor &_sampling_location,\n  const at::Tensor &_weights,\n  const at::Tensor &_grad_output,\n  at::Tensor &_grad_mc_ms_feat,\n  at::Tensor &_grad_sampling_location,\n  at::Tensor &_grad_weights\n) {\n  at::DeviceGuard guard(_mc_ms_feat.device());\n  const at::cuda::OptionalCUDAGuard device_guard(device_of(_mc_ms_feat));\n  int batch_size = _mc_ms_feat.size(0);\n  int num_cams = _mc_ms_feat.size(1);\n  int num_feat = _mc_ms_feat.size(2);\n  int num_embeds = _mc_ms_feat.size(3);\n  int num_scale = _spatial_shape.size(0);\n  int num_pts = _sampling_location.size(1);\n  int num_groups = _weights.size(4);\n\n  const float* mc_ms_feat = _mc_ms_feat.data_ptr<float>();\n  const int* spatial_shape = _spatial_shape.data_ptr<int>();\n  const int* scale_start_index = _scale_start_index.data_ptr<int>();\n  const float* sampling_location = _sampling_location.data_ptr<float>();\n  const float* weights = _weights.data_ptr<float>();\n  const float* grad_output = _grad_output.data_ptr<float>();\n\n  float* grad_mc_ms_feat = _grad_mc_ms_feat.data_ptr<float>();\n  float* grad_sampling_location = 
_grad_sampling_location.data_ptr<float>();\n  float* grad_weights = _grad_weights.data_ptr<float>();\n\n  deformable_aggregation_grad(\n    mc_ms_feat, spatial_shape, scale_start_index, sampling_location, weights,\n    grad_output, grad_mc_ms_feat, grad_sampling_location, grad_weights,\n    batch_size, num_cams, num_feat, num_embeds, num_scale, num_pts, num_groups\n  );\n}\n\n\nPYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {\n  m.def(\n    \"deformable_aggregation_forward\",\n    &deformable_aggregation_forward,\n    \"deformable_aggregation_forward\"\n  );\n  m.def(\n    \"deformable_aggregation_backward\",\n    &deformable_aggregation_backward,\n    \"deformable_aggregation_backward\"\n  );\n}\n"
  },
  {
    "path": "bip3d/ops/src/deformable_aggregation_cuda.cu",
    "content": "#include <ATen/ATen.h>\n#include <ATen/cuda/CUDAContext.h>\n#include <cuda.h>\n#include <cuda_runtime.h>\n\n#include <THC/THCAtomics.cuh>\n\n\n__device__ float bilinear_sampling(\n    const float *&bottom_data, const int &height, const int &width,\n    const int &num_embeds, const float &h_im, const float &w_im,\n    const int &base_ptr\n) {\n  const int h_low = floorf(h_im);\n  const int w_low = floorf(w_im);\n  const int h_high = h_low + 1;\n  const int w_high = w_low + 1;\n\n  const float lh = h_im - h_low;\n  const float lw = w_im - w_low;\n  const float hh = 1 - lh, hw = 1 - lw;\n\n  const int w_stride = num_embeds;\n  const int h_stride = width * w_stride;\n  const int h_low_ptr_offset = h_low * h_stride;\n  const int h_high_ptr_offset = h_low_ptr_offset + h_stride;\n  const int w_low_ptr_offset = w_low * w_stride;\n  const int w_high_ptr_offset = w_low_ptr_offset + w_stride;\n\n  float v1 = 0;\n  if (h_low >= 0 && w_low >= 0) {\n    const int ptr1 = h_low_ptr_offset + w_low_ptr_offset + base_ptr;\n    v1 = bottom_data[ptr1];\n  }\n  float v2 = 0;\n  if (h_low >= 0 && w_high <= width - 1) {\n    const int ptr2 = h_low_ptr_offset + w_high_ptr_offset + base_ptr;\n    v2 = bottom_data[ptr2];\n  }\n  float v3 = 0;\n  if (h_high <= height - 1 && w_low >= 0) {\n    const int ptr3 = h_high_ptr_offset + w_low_ptr_offset + base_ptr;\n    v3 = bottom_data[ptr3];\n  }\n  float v4 = 0;\n  if (h_high <= height - 1 && w_high <= width - 1) {\n    const int ptr4 = h_high_ptr_offset + w_high_ptr_offset + base_ptr;\n    v4 = bottom_data[ptr4];\n  }\n\n  const float w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;\n\n  const float val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);\n  return val;\n}\n\n\n__device__ void bilinear_sampling_grad(\n    const float *&bottom_data, const float &weight,\n    const int &height, const int &width,\n    const int &num_embeds, const float &h_im, const float &w_im,\n    const int &base_ptr,\n    const float &grad_output,\n    float *&grad_mc_ms_feat, float *grad_sampling_location, float *grad_weights) {\n  const int h_low = floorf(h_im);\n  const int w_low = floorf(w_im);\n  const int h_high = h_low + 1;\n  const int w_high = w_low + 1;\n\n  const float lh = h_im - h_low;\n  const float lw = w_im - w_low;\n  const float hh = 1 - lh, hw = 1 - lw;\n\n  const int w_stride = num_embeds;\n  const int h_stride = width * w_stride;\n  const int h_low_ptr_offset = h_low * h_stride;\n  const int h_high_ptr_offset = h_low_ptr_offset + h_stride;\n  const int w_low_ptr_offset = w_low * w_stride;\n  const int w_high_ptr_offset = w_low_ptr_offset + w_stride;\n\n  const float w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;\n  const float top_grad_mc_ms_feat = grad_output * weight;\n  float grad_h_weight = 0, grad_w_weight = 0;\n\n  float v1 = 0;\n  if (h_low >= 0 && w_low >= 0) {\n    const int ptr1 = h_low_ptr_offset + w_low_ptr_offset + base_ptr;\n    v1 = bottom_data[ptr1];\n    grad_h_weight -= hw * v1;\n    grad_w_weight -= hh * v1;\n    atomicAdd(grad_mc_ms_feat + ptr1, w1 * top_grad_mc_ms_feat);\n  }\n  float v2 = 0;\n  if (h_low >= 0 && w_high <= width - 1) {\n    const int ptr2 = h_low_ptr_offset + w_high_ptr_offset + base_ptr;\n    v2 = bottom_data[ptr2];\n    grad_h_weight -= lw * v2;\n    grad_w_weight += hh * v2;\n    atomicAdd(grad_mc_ms_feat + ptr2, w2 * top_grad_mc_ms_feat);\n  }\n  float v3 = 0;\n  if (h_high <= height - 1 && w_low >= 0) {\n    const int ptr3 = h_high_ptr_offset + w_low_ptr_offset + base_ptr;\n    v3 = bottom_data[ptr3];\n 
   grad_h_weight += hw * v3;\n    grad_w_weight -= lh * v3;\n    atomicAdd(grad_mc_ms_feat + ptr3, w3 * top_grad_mc_ms_feat);\n  }\n  float v4 = 0;\n  if (h_high <= height - 1 && w_high <= width - 1) {\n    const int ptr4 = h_high_ptr_offset + w_high_ptr_offset + base_ptr;\n    v4 = bottom_data[ptr4];\n    grad_h_weight += lw * v4;\n    grad_w_weight += lh * v4;\n    atomicAdd(grad_mc_ms_feat + ptr4, w4 * top_grad_mc_ms_feat);\n  }\n\n  const float val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);\n  atomicAdd(grad_weights, grad_output * val);\n  atomicAdd(grad_sampling_location, width * grad_w_weight * top_grad_mc_ms_feat);\n  atomicAdd(grad_sampling_location + 1, height * grad_h_weight * top_grad_mc_ms_feat);\n}\n\n\n__global__ void deformable_aggregation_kernel(\n    const int num_kernels,\n    float* output,\n    const float* mc_ms_feat,\n    const int* spatial_shape,\n    const int* scale_start_index,\n    const float* sample_location,\n    const float* weights,\n    int batch_size,\n    int num_cams,\n    int num_feat,\n    int num_embeds,\n    int num_scale,\n    int num_pts,\n    int num_groups\n) {\n    int idx = blockIdx.x * blockDim.x + threadIdx.x;\n    if (idx >= num_kernels) return;\n\n    float *output_ptr = output + idx;\n    const int channel_index = idx % num_embeds;\n    const int groups_index = channel_index / (num_embeds / num_groups);\n    idx /= num_embeds;\n    const int pts_index = idx % num_pts;\n    idx /= num_pts;\n    const int batch_index = idx;\n\n    const int value_cam_stride = num_feat * num_embeds;\n    const int weight_cam_stride = num_scale * num_groups;\n    int loc_offset = (batch_index * num_pts + pts_index) * num_cams << 1;\n    int value_offset = batch_index * num_cams * value_cam_stride + channel_index;\n    int weight_offset = (\n        (batch_index * num_pts + pts_index) * num_cams * weight_cam_stride\n        + groups_index\n    );\n\n    float result = 0;\n    for (int cam_index = 0; cam_index < num_cams; ++cam_index, loc_offset += 2) {\n        const float loc_w = sample_location[loc_offset];\n        const float loc_h = sample_location[loc_offset + 1];\n        \n        if (loc_w > 0 && loc_w < 1 && loc_h > 0 && loc_h < 1) {\n            for (int scale_index = 0; scale_index < num_scale; ++scale_index) {\n                const int scale_offset = scale_start_index[scale_index] * num_embeds;\n\n                const int spatial_shape_ptr = scale_index << 1;\n                const int h = spatial_shape[spatial_shape_ptr];\n                const int w = spatial_shape[spatial_shape_ptr + 1];\n\n                const float h_im = loc_h * h - 0.5;\n                const float w_im = loc_w * w - 0.5;\n\n                const int value_ptr = value_offset + scale_offset + value_cam_stride * cam_index;\n                const float *weights_ptr = (\n                    weights + weight_offset + scale_index * num_groups\n                    + weight_cam_stride * cam_index\n                );\n                result += bilinear_sampling(mc_ms_feat, h, w, num_embeds, h_im, w_im, value_ptr) * *weights_ptr;\n            }\n        }\n    }\n    *output_ptr = result;\n}\n\n\n__global__ void deformable_aggregation_grad_kernel(\n    const int num_kernels,\n    const float* mc_ms_feat,\n    const int* spatial_shape,\n    const int* scale_start_index,\n    const float* sample_location,\n    const float* weights,\n    const float* grad_output,\n    float* grad_mc_ms_feat,\n    float* grad_sampling_location,\n    float* grad_weights,\n    int batch_size,\n    int 
num_cams,\n    int num_feat,\n    int num_embeds,\n    int num_scale,\n    int num_pts,\n    int num_groups\n) {\n    int idx = blockIdx.x * blockDim.x + threadIdx.x;\n    if (idx >= num_kernels) return;\n    const float grad = grad_output[idx];\n\n    const int channel_index = idx % num_embeds;\n    const int groups_index = channel_index / (num_embeds / num_groups);\n    idx /= num_embeds;\n    const int pts_index = idx % num_pts;\n    idx /= num_pts;\n    const int batch_index = idx;\n\n    const int value_cam_stride = num_feat * num_embeds;\n    const int weight_cam_stride = num_scale * num_groups;\n    int loc_offset = (batch_index * num_pts + pts_index) * num_cams << 1;\n    int value_offset = batch_index * num_cams * value_cam_stride + channel_index;\n    int weight_offset = (\n        (batch_index * num_pts + pts_index) * num_cams * weight_cam_stride\n        + groups_index\n    );\n\n    for (int cam_index = 0; cam_index < num_cams; ++cam_index, loc_offset += 2) {\n        const float loc_w = sample_location[loc_offset];\n        const float loc_h = sample_location[loc_offset + 1];\n        \n        if (loc_w > 0 && loc_w < 1 && loc_h > 0 && loc_h < 1) {\n            for (int scale_index = 0; scale_index < num_scale; ++scale_index) {\n                const int scale_offset = scale_start_index[scale_index] * num_embeds;\n\n                const int spatial_shape_ptr = scale_index << 1;\n                const int h = spatial_shape[spatial_shape_ptr];\n                const int w = spatial_shape[spatial_shape_ptr + 1];\n\n                const float h_im = loc_h * h - 0.5;\n                const float w_im = loc_w * w - 0.5;\n\n                const int value_ptr = value_offset + scale_offset + value_cam_stride * cam_index;\n                const int weights_ptr = weight_offset + scale_index * num_groups + weight_cam_stride * cam_index;\n                const float weight = weights[weights_ptr];\n\n                float *grad_location_ptr = grad_sampling_location + loc_offset;\n                float *grad_weights_ptr = grad_weights + weights_ptr;\n                bilinear_sampling_grad(\n                    mc_ms_feat, weight, h, w, num_embeds, h_im, w_im,\n                    value_ptr,\n                    grad,\n                    grad_mc_ms_feat, grad_location_ptr, grad_weights_ptr\n                );\n            }\n        }\n    }\n}\n\n\nvoid deformable_aggregation(\n    float* output,\n    const float* mc_ms_feat,\n    const int* spatial_shape,\n    const int* scale_start_index,\n    const float* sample_location,\n    const float* weights,\n    int batch_size,\n    int num_cams,\n    int num_feat,\n    int num_embeds,\n    int num_scale,\n    int num_pts,\n    int num_groups\n) {\n    const int num_kernels = batch_size * num_pts * num_embeds;\n    deformable_aggregation_kernel\n        <<<(int)ceil(((double)num_kernels/512)), 512>>>(\n        num_kernels, output,\n        mc_ms_feat, spatial_shape, scale_start_index, sample_location, weights,\n        batch_size, num_cams, num_feat, num_embeds, num_scale, num_pts, num_groups\n    );\n}\n\n\nvoid deformable_aggregation_grad(\n  const float* mc_ms_feat,\n  const int* spatial_shape,\n  const int* scale_start_index,\n  const float* sample_location,\n  const float* weights,\n  const float* grad_output,\n  float* grad_mc_ms_feat,\n  float* grad_sampling_location,\n  float* grad_weights,\n  int batch_size,\n  int num_cams,\n  int num_feat,\n  int num_embeds,\n  int num_scale,\n  int num_pts,\n  int num_groups\n) {\n    const int 
num_kernels = batch_size * num_pts * num_embeds;\n    deformable_aggregation_grad_kernel\n        <<<(int)ceil(((double)num_kernels/512)), 512>>>(\n        num_kernels,\n        mc_ms_feat, spatial_shape, scale_start_index, sample_location, weights,\n        grad_output, grad_mc_ms_feat, grad_sampling_location, grad_weights,\n        batch_size, num_cams, num_feat, num_embeds, num_scale, num_pts, num_groups\n    );\n}\n"
  },
  {
    "path": "bip3d/ops/src/deformable_aggregation_with_depth.cpp",
    "content": "#include <torch/extension.h>\n#include <c10/cuda/CUDAGuard.h>\n\nvoid deformable_aggregation_with_depth(\n  float* output,\n  const float* mc_ms_feat,\n  const int* spatial_shape,\n  const int* scale_start_index,\n  const float* sample_location,\n  const float* weights,\n  int batch_size,\n  int num_cams,\n  int num_feat,\n  int num_embeds,\n  int num_scale,\n  int num_pts,\n  int num_groups,\n  int num_depths\n);\n  \n\nvoid deformable_aggregation_with_depth_grad(\n  const float* mc_ms_feat,\n  const int* spatial_shape,\n  const int* scale_start_index,\n  const float* sample_location,\n  const float* weights,\n  const float* grad_output,\n  float* grad_mc_ms_feat,\n  float* grad_sampling_location,\n  float* grad_weights,\n  int batch_size,\n  int num_cams,\n  int num_feat,\n  int num_embeds,\n  int num_scale,\n  int num_pts,\n  int num_groups,\n  int num_depths\n);\n\n/* _mc_ms_feat: b, cam, feat, C+D */\n/* _spatial_shape: scale, 2 */\n/* _scale_start_index: scale */\n/* _sampling_location: bs, pts, */ \n\n\nat::Tensor deformable_aggregation_with_depth_forward(\n  const at::Tensor &_mc_ms_feat,\n  const at::Tensor &_spatial_shape,\n  const at::Tensor &_scale_start_index,\n  const at::Tensor &_sampling_location,\n  const at::Tensor &_weights,\n  const int num_depths\n) {\n  at::DeviceGuard guard(_mc_ms_feat.device());\n  const at::cuda::OptionalCUDAGuard device_guard(device_of(_mc_ms_feat));\n  int batch_size = _mc_ms_feat.size(0);\n  int num_cams = _mc_ms_feat.size(1);\n  int num_feat = _mc_ms_feat.size(2);\n  int num_embeds = _mc_ms_feat.size(3) - num_depths;\n  int num_scale = _spatial_shape.size(0);\n  int num_pts = _sampling_location.size(1);\n  int num_groups = _weights.size(4);\n\n  const float* mc_ms_feat = _mc_ms_feat.data_ptr<float>();\n  const int* spatial_shape = _spatial_shape.data_ptr<int>();\n  const int* scale_start_index = _scale_start_index.data_ptr<int>();\n  const float* sampling_location = _sampling_location.data_ptr<float>();\n  const float* weights = _weights.data_ptr<float>();\n\n  auto output = at::zeros({batch_size, num_pts, num_embeds}, _mc_ms_feat.options());\n  deformable_aggregation_with_depth(\n    output.data_ptr<float>(),\n    mc_ms_feat, spatial_shape, scale_start_index, sampling_location, weights,\n    batch_size, num_cams, num_feat, num_embeds, num_scale, num_pts, num_groups,\n    num_depths\n  );\n  return output;\n}\n\nvoid deformable_aggregation_with_depth_backward(\n  const at::Tensor &_mc_ms_feat,\n  const at::Tensor &_spatial_shape,\n  const at::Tensor &_scale_start_index,\n  const at::Tensor &_sampling_location,\n  const at::Tensor &_weights,\n  const int num_depths,\n  const at::Tensor &_grad_output,\n  at::Tensor &_grad_mc_ms_feat,\n  at::Tensor &_grad_sampling_location,\n  at::Tensor &_grad_weights\n) {\n  at::DeviceGuard guard(_mc_ms_feat.device());\n  const at::cuda::OptionalCUDAGuard device_guard(device_of(_mc_ms_feat));\n  int batch_size = _mc_ms_feat.size(0);\n  int num_cams = _mc_ms_feat.size(1);\n  int num_feat = _mc_ms_feat.size(2);\n  int num_embeds = _mc_ms_feat.size(3) - num_depths;\n  int num_scale = _spatial_shape.size(0);\n  int num_pts = _sampling_location.size(1);\n  int num_groups = _weights.size(4);\n\n  const float* mc_ms_feat = _mc_ms_feat.data_ptr<float>();\n  const int* spatial_shape = _spatial_shape.data_ptr<int>();\n  const int* scale_start_index = _scale_start_index.data_ptr<int>();\n  const float* sampling_location = _sampling_location.data_ptr<float>();\n  const float* weights = 
_weights.data_ptr<float>();\n  const float* grad_output = _grad_output.data_ptr<float>();\n\n  float* grad_mc_ms_feat = _grad_mc_ms_feat.data_ptr<float>();\n  float* grad_sampling_location = _grad_sampling_location.data_ptr<float>();\n  float* grad_weights = _grad_weights.data_ptr<float>();\n\n  deformable_aggregation_with_depth_grad(\n    mc_ms_feat, spatial_shape, scale_start_index, sampling_location, weights,\n    grad_output, grad_mc_ms_feat, grad_sampling_location, grad_weights,\n    batch_size, num_cams, num_feat, num_embeds, num_scale, num_pts, num_groups,\n    num_depths\n  );\n}\n\n\nPYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {\n  m.def(\n    \"deformable_aggregation_with_depth_forward\",\n    &deformable_aggregation_with_depth_forward,\n    \"deformable_aggregation_with_depth_forward\"\n  );\n  m.def(\n    \"deformable_aggregation_with_depth_backward\",\n    &deformable_aggregation_with_depth_backward,\n    \"deformable_aggregation_with_depth_backward\"\n  );\n}\n"
  },
  {
    "path": "bip3d/ops/src/deformable_aggregation_with_depth_cuda.cu",
    "content": "#include <ATen/ATen.h>\n#include <ATen/cuda/CUDAContext.h>\n#include <cuda.h>\n#include <cuda_runtime.h>\n\n#include <THC/THCAtomics.cuh>\n\n__device__ float bilinear_sampling(const float *&bottom_data, const int &height,\n                                   const int &width,\n                                   const int &num_embeds_with_depth,\n                                   const int &num_depths, const float &h_im,\n                                   const float &w_im, const float &loc_d,\n                                   const int &base_ptr, const int &depth_ptr) {\n  const int h_low = floorf(h_im);\n  const int w_low = floorf(w_im);\n  const int h_high = h_low + 1;\n  const int w_high = w_low + 1;\n  const int d_low = floorf(loc_d);\n  const int d_high = d_low + 1;\n\n  const float lh = h_im - h_low;\n  const float lw = w_im - w_low;\n  const float hh = 1 - lh, hw = 1 - lw;\n\n  const float ld = loc_d - d_low;\n  const float hd = 1 - ld;\n\n  const int w_stride = num_embeds_with_depth;\n  const int h_stride = width * w_stride;\n  const int h_low_ptr_offset = h_low * h_stride;\n  const int h_high_ptr_offset = h_low_ptr_offset + h_stride;\n  const int w_low_ptr_offset = w_low * w_stride;\n  const int w_high_ptr_offset = w_low_ptr_offset + w_stride;\n\n  float v1 = 0;\n  float dp_low1 = 0;\n  float dp_high1 = 0;\n  const bool flag_d_low = d_low >= 0 && d_low < num_depths;\n  const bool flag_d_high = d_high >= 0 && d_high < num_depths;\n  if (h_low >= 0 && w_low >= 0) {\n    const int ptr1 = h_low_ptr_offset + w_low_ptr_offset;\n    v1 = bottom_data[ptr1 + base_ptr];\n    const int ptr_d1 = ptr1 + depth_ptr + d_low;\n    if (flag_d_low) {\n      dp_low1 = bottom_data[ptr_d1];\n    }\n    if (flag_d_high) {\n      dp_high1 = bottom_data[ptr_d1 + 1];\n    }\n  }\n\n  float v2 = 0;\n  float dp_low2 = 0;\n  float dp_high2 = 0;\n  if (h_low >= 0 && w_high <= width - 1) {\n    const int ptr2 = h_low_ptr_offset + w_high_ptr_offset;\n    v2 = bottom_data[ptr2 + base_ptr];\n    const int ptr_d2 = ptr2 + depth_ptr + d_low;\n    if (flag_d_low) {\n      dp_low2 = bottom_data[ptr_d2];\n    }\n    if (flag_d_high) {\n      dp_high2 = bottom_data[ptr_d2 + 1];\n    }\n  }\n\n  float v3 = 0;\n  float dp_low3 = 0;\n  float dp_high3 = 0;\n  if (h_high <= height - 1 && w_low >= 0) {\n    const int ptr3 = h_high_ptr_offset + w_low_ptr_offset;\n    v3 = bottom_data[ptr3 + base_ptr];\n    const int ptr_d3 = ptr3 + depth_ptr + d_low;\n    if (flag_d_low) {\n      dp_low3 = bottom_data[ptr_d3];\n    }\n    if (flag_d_high) {\n      dp_high3 = bottom_data[ptr_d3 + 1];\n    }\n  }\n\n  float v4 = 0;\n  float dp_low4 = 0;\n  float dp_high4 = 0;\n  if (h_high <= height - 1 && w_high <= width - 1) {\n    const int ptr4 = h_high_ptr_offset + w_high_ptr_offset;\n    v4 = bottom_data[ptr4 + base_ptr];\n    const int ptr_d4 = ptr4 + depth_ptr + d_low;\n    if (flag_d_low) {\n      dp_low4 = bottom_data[ptr_d4];\n    }\n    if (flag_d_high) {\n      dp_high4 = bottom_data[ptr_d4 + 1];\n    }\n  }\n\n  const float w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;\n  const float val = hd * (w1 * v1 * dp_low1 + w2 * v2 * dp_low2 +\n                          w3 * v3 * dp_low3 + w4 * v4 * dp_low4) +\n                    ld * (w1 * v1 * dp_high1 + w2 * v2 * dp_high2 +\n                          w3 * v3 * dp_high3 + w4 * v4 * dp_high4);\n  return val;\n}\n\n__device__ void\nbilinear_sampling_grad(const float *&bottom_data, const float &weight,\n                       const int &height, const int 
&width,\n                       const int &num_embeds_with_depth, const int &num_depths,\n                       const float &h_im, const float &w_im, const float &loc_d,\n                       const int &base_ptr, const int &depth_ptr,\n                       const float &grad_output, float *&grad_mc_ms_feat,\n                       float *grad_sampling_location, float *grad_weights) {\n  const int h_low = floorf(h_im);\n  const int w_low = floorf(w_im);\n  const int h_high = h_low + 1;\n  const int w_high = w_low + 1;\n  const int d_low = floorf(loc_d);\n  const int d_high = d_low + 1;\n\n  const float lh = h_im - h_low;\n  const float lw = w_im - w_low;\n  const float hh = 1 - lh, hw = 1 - lw;\n  const float ld = loc_d - d_low;\n  const float hd = 1 - ld;\n\n  const int w_stride = num_embeds_with_depth;\n  const int h_stride = width * w_stride;\n  const int h_low_ptr_offset = h_low * h_stride;\n  const int h_high_ptr_offset = h_low_ptr_offset + h_stride;\n  const int w_low_ptr_offset = w_low * w_stride;\n  const int w_high_ptr_offset = w_low_ptr_offset + w_stride;\n\n  const float w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;\n  const float top_grad_mc_ms_feat = grad_output * weight;\n\n  const bool flag_d_low = d_low >= 0 && d_low < num_depths;\n  const bool flag_d_high = d_high >= 0 && d_high < num_depths;\n\n  float v1 = 0;\n  float dp_low1 = 0;\n  float dp_high1 = 0;\n  if (h_low >= 0 && w_low >= 0) {\n    const int ptr1 = h_low_ptr_offset + w_low_ptr_offset;\n    v1 = bottom_data[ptr1 + base_ptr];\n    const int ptr_d1 = ptr1 + depth_ptr + d_low;\n    if (flag_d_low) {\n      dp_low1 = bottom_data[ptr_d1];\n      atomicAdd(grad_mc_ms_feat + ptr_d1, w1 * top_grad_mc_ms_feat * v1 * hd);\n    }\n    if (flag_d_high) {\n      dp_high1 = bottom_data[ptr_d1 + 1];\n      atomicAdd(grad_mc_ms_feat + ptr_d1 + 1,\n                w1 * top_grad_mc_ms_feat * v1 * ld);\n    }\n    atomicAdd(grad_mc_ms_feat + ptr1 + base_ptr,\n              w1 * top_grad_mc_ms_feat * (dp_low1 * hd + dp_high1 * ld));\n  }\n  float v2 = 0;\n  float dp_low2 = 0;\n  float dp_high2 = 0;\n  if (h_low >= 0 && w_high <= width - 1) {\n    const int ptr2 = h_low_ptr_offset + w_high_ptr_offset;\n    v2 = bottom_data[ptr2 + base_ptr];\n    const int ptr_d2 = ptr2 + depth_ptr + d_low;\n    if (flag_d_low) {\n      dp_low2 = bottom_data[ptr_d2];\n      atomicAdd(grad_mc_ms_feat + ptr_d2, w2 * top_grad_mc_ms_feat * v2 * hd);\n    }\n    if (flag_d_high) {\n      dp_high2 = bottom_data[ptr_d2 + 1];\n      atomicAdd(grad_mc_ms_feat + ptr_d2 + 1,\n                w2 * top_grad_mc_ms_feat * v2 * ld);\n    }\n\n    atomicAdd(grad_mc_ms_feat + ptr2 + base_ptr,\n              w2 * top_grad_mc_ms_feat * (dp_low2 * hd + dp_high2 * ld));\n  }\n  float v3 = 0;\n  float dp_low3 = 0;\n  float dp_high3 = 0;\n  if (h_high <= height - 1 && w_low >= 0) {\n    const int ptr3 = h_high_ptr_offset + w_low_ptr_offset;\n    v3 = bottom_data[ptr3 + base_ptr];\n    const int ptr_d3 = ptr3 + depth_ptr + d_low;\n    if (flag_d_low) {\n      dp_low3 = bottom_data[ptr_d3];\n      atomicAdd(grad_mc_ms_feat + ptr_d3, w3 * top_grad_mc_ms_feat * v3 * hd);\n    }\n    if (flag_d_high) {\n      dp_high3 = bottom_data[ptr_d3 + 1];\n      atomicAdd(grad_mc_ms_feat + ptr_d3 + 1,\n                w3 * top_grad_mc_ms_feat * v3 * ld);\n    }\n\n    atomicAdd(grad_mc_ms_feat + ptr3 + base_ptr,\n              w3 * top_grad_mc_ms_feat * (dp_low3 * hd + dp_high3 * ld));\n  }\n  float v4 = 0;\n  float dp_low4 = 0;\n  float dp_high4 = 0;\n  if (h_high <= height 
- 1 && w_high <= width - 1) {\n    const int ptr4 = h_high_ptr_offset + w_high_ptr_offset;\n    v4 = bottom_data[ptr4 + base_ptr];\n    const int ptr_d4 = ptr4 + depth_ptr + d_low;\n    if (flag_d_low) {\n      dp_low4 = bottom_data[ptr_d4];\n      atomicAdd(grad_mc_ms_feat + ptr_d4, w4 * top_grad_mc_ms_feat * v4 * hd);\n    }\n    if (flag_d_high) {\n      dp_high4 = bottom_data[ptr_d4 + 1];\n      atomicAdd(grad_mc_ms_feat + ptr_d4 + 1,\n                w4 * top_grad_mc_ms_feat * v4 * ld);\n    }\n    atomicAdd(grad_mc_ms_feat + ptr4 + base_ptr,\n              w4 * top_grad_mc_ms_feat * (dp_low4 * hd + dp_high4 * ld));\n  }\n\n  const float val1 = w1 * v1 * dp_low1 + w2 * v2 * dp_low2 + w3 * v3 * dp_low3 +\n                     w4 * v4 * dp_low4;\n  const float val2 = w1 * v1 * dp_high1 + w2 * v2 * dp_high2 +\n                     w3 * v3 * dp_high3 + w4 * v4 * dp_high4;\n\n  const float val = hd * val1 + ld * val2;\n\n  atomicAdd(grad_weights, grad_output * val);\n\n  const float grad_w_weight = hd * (-hh * v1 * dp_low1 + hh * v2 * dp_low2 -\n                                    lh * v3 * dp_low3 + lh * v4 * dp_low4) +\n                              ld * (-hh * v1 * dp_high1 + hh * v2 * dp_high2 -\n                                    lh * v3 * dp_high3 + lh * v4 * dp_high4);\n  const float grad_h_weight = hd * (-hw * v1 * dp_low1 - lw * v2 * dp_low2 +\n                                    hw * v3 * dp_low3 + lw * v4 * dp_low4) +\n                              ld * (-hw * v1 * dp_high1 - lw * v2 * dp_high2 +\n                                    hw * v3 * dp_high3 + lw * v4 * dp_high4);\n\n  /* const float grad_d_weight = -1 * (w1 * v1 * dp_low1 + w2 * v2 * dp_low2 +\n   * w3 * v3 * dp_low3 + w4 * v4 * dp_low4) */\n  /*     + (w1 * v1 * dp_high1 + w2 * v2 * dp_high2 + w3 * v3 * dp_high3 + w4 *\n   * v4 * dp_high4); */\n  const float grad_d_weight = val2 - val1;\n  /* const float grad_d_weight = w1 * v1 * (dp_high1 - dp_low1) */\n  /*     + w2 * v2 * (dp_high2 - dp_low2) */\n  /*     + w3 * v3 * (dp_high3 - dp_low3) */\n  /*     + w4 * v4 * (dp_high4 - dp_low4); */\n\n  atomicAdd(grad_sampling_location,\n            width * top_grad_mc_ms_feat * grad_w_weight);\n  atomicAdd(grad_sampling_location + 1,\n            height * top_grad_mc_ms_feat * grad_h_weight);\n  atomicAdd(grad_sampling_location + 2, top_grad_mc_ms_feat * grad_d_weight);\n}\n\n__global__ void deformable_aggregation_with_depth_kernel(\n    const int num_kernels, float *output, const float *mc_ms_feat,\n    const int *spatial_shape, const int *scale_start_index,\n    const float *sample_location, const float *weights, int batch_size,\n    int num_cams, int num_feat, int num_embeds, int num_scale, int num_pts,\n    int num_groups, int num_depths) {\n  long int idx = blockIdx.x * blockDim.x + threadIdx.x;\n  if (idx >= num_kernels)\n    return;\n\n  float *output_ptr = output + idx;\n  const int channel_index = idx % num_embeds;\n  const int groups_index = channel_index / (num_embeds / num_groups);\n  idx /= num_embeds;\n  const int pts_index = idx % num_pts;\n  idx /= num_pts;\n  const int batch_index = idx;\n\n  const int num_embeds_with_depth = num_embeds + num_depths;\n  const int value_cam_stride = num_feat * num_embeds_with_depth;\n  const int weight_cam_stride = num_scale * num_groups;\n  int loc_offset = (batch_index * num_pts + pts_index) * num_cams * 3;\n  int value_offset = batch_index * num_cams * value_cam_stride;\n  int depth_offset = value_offset + num_embeds;\n  value_offset = value_offset + channel_index;\n  int 
weight_offset =\n      ((batch_index * num_pts + pts_index) * num_cams * weight_cam_stride +\n       groups_index);\n\n  float result = 0;\n  for (int cam_index = 0; cam_index < num_cams; ++cam_index, loc_offset += 3) {\n    const float loc_w = sample_location[loc_offset];\n    const float loc_h = sample_location[loc_offset + 1];\n    const float loc_d = sample_location[loc_offset + 2];\n\n    if (loc_w > 0 && loc_w < 1 && loc_h > 0 && loc_h < 1 && loc_d > -1 &&\n        loc_d < num_depths) {\n      for (int scale_index = 0; scale_index < num_scale; ++scale_index) {\n        const int scale_offset =\n            scale_start_index[scale_index] * num_embeds_with_depth;\n\n        const int spatial_shape_ptr = scale_index << 1;\n        const int h = spatial_shape[spatial_shape_ptr];\n        const int w = spatial_shape[spatial_shape_ptr + 1];\n\n        const float h_im = loc_h * h - 0.5;\n        const float w_im = loc_w * w - 0.5;\n\n        const int value_ptr =\n            value_offset + scale_offset + value_cam_stride * cam_index;\n        const int depth_ptr =\n            depth_offset + scale_offset + value_cam_stride * cam_index;\n\n        const float *weights_ptr =\n            (weights + weight_offset + scale_index * num_groups +\n             weight_cam_stride * cam_index);\n        result += bilinear_sampling(mc_ms_feat, h, w, num_embeds_with_depth,\n                                    num_depths, h_im, w_im, loc_d, value_ptr,\n                                    depth_ptr) *\n                  *weights_ptr;\n      }\n    }\n  }\n  *output_ptr = result;\n}\n\n__global__ void deformable_aggregation_with_depth_grad_kernel(\n    const int num_kernels, const float *mc_ms_feat, const int *spatial_shape,\n    const int *scale_start_index, const float *sample_location,\n    const float *weights, const float *grad_output, float *grad_mc_ms_feat,\n    float *grad_sampling_location, float *grad_weights, int batch_size,\n    int num_cams, int num_feat, int num_embeds, int num_scale, int num_pts,\n    int num_groups, int num_depths) {\n  long int idx = blockIdx.x * blockDim.x + threadIdx.x;\n  if (idx >= num_kernels)\n    return;\n  const float grad = grad_output[idx];\n\n  const int channel_index = idx % num_embeds;\n  const int groups_index = channel_index / (num_embeds / num_groups);\n  idx /= num_embeds;\n  const int pts_index = idx % num_pts;\n  idx /= num_pts;\n  const int batch_index = idx;\n\n  const int num_embeds_with_depth = num_embeds + num_depths;\n  const int value_cam_stride = num_feat * num_embeds_with_depth;\n  const int weight_cam_stride = num_scale * num_groups;\n  int loc_offset = (batch_index * num_pts + pts_index) * num_cams * 3;\n  int value_offset = batch_index * num_cams * value_cam_stride;\n  int depth_offset = value_offset + num_embeds;\n  value_offset = value_offset + channel_index;\n  int weight_offset =\n      ((batch_index * num_pts + pts_index) * num_cams * weight_cam_stride +\n       groups_index);\n\n  for (int cam_index = 0; cam_index < num_cams; ++cam_index, loc_offset += 3) {\n    const float loc_w = sample_location[loc_offset];\n    const float loc_h = sample_location[loc_offset + 1];\n    const float loc_d = sample_location[loc_offset + 2];\n\n    if (loc_w > 0 && loc_w < 1 && loc_h > 0 && loc_h < 1 && loc_d > -1 &&\n        loc_d < num_depths) {\n      for (int scale_index = 0; scale_index < num_scale; ++scale_index) {\n        const int scale_offset =\n            scale_start_index[scale_index] * num_embeds_with_depth;\n\n        const int 
spatial_shape_ptr = scale_index << 1;\n        const int h = spatial_shape[spatial_shape_ptr];\n        const int w = spatial_shape[spatial_shape_ptr + 1];\n\n        const float h_im = loc_h * h - 0.5;\n        const float w_im = loc_w * w - 0.5;\n\n        const int value_ptr =\n            value_offset + scale_offset + value_cam_stride * cam_index;\n        const int depth_ptr =\n            depth_offset + scale_offset + value_cam_stride * cam_index;\n        const int weights_ptr = weight_offset + scale_index * num_groups +\n                                weight_cam_stride * cam_index;\n        const float weight = weights[weights_ptr];\n\n        float *grad_location_ptr = grad_sampling_location + loc_offset;\n        float *grad_weights_ptr = grad_weights + weights_ptr;\n        bilinear_sampling_grad(mc_ms_feat, weight, h, w, num_embeds_with_depth,\n                               num_depths, h_im, w_im, loc_d, value_ptr,\n                               depth_ptr, grad, grad_mc_ms_feat,\n                               grad_location_ptr, grad_weights_ptr);\n      }\n    }\n  }\n}\n\nvoid deformable_aggregation_with_depth(\n    float *output, const float *mc_ms_feat, const int *spatial_shape,\n    const int *scale_start_index, const float *sample_location,\n    const float *weights, int batch_size, int num_cams, int num_feat,\n    int num_embeds, int num_scale, int num_pts, int num_groups,\n    int num_depths) {\n  const long int num_kernels = batch_size * num_pts * num_embeds;\n  deformable_aggregation_with_depth_kernel<<<\n      (int)ceil(((double)num_kernels / 512)), 512>>>(\n      num_kernels, output, mc_ms_feat, spatial_shape, scale_start_index,\n      sample_location, weights, batch_size, num_cams, num_feat, num_embeds,\n      num_scale, num_pts, num_groups, num_depths);\n}\n\nvoid deformable_aggregation_with_depth_grad(\n    const float *mc_ms_feat, const int *spatial_shape,\n    const int *scale_start_index, const float *sample_location,\n    const float *weights, const float *grad_output, float *grad_mc_ms_feat,\n    float *grad_sampling_location, float *grad_weights, int batch_size,\n    int num_cams, int num_feat, int num_embeds, int num_scale, int num_pts,\n    int num_groups, int num_depths) {\n  const long int num_kernels = batch_size * num_pts * num_embeds;\n  deformable_aggregation_with_depth_grad_kernel<<<\n      (int)ceil(((double)num_kernels / 512)), 512>>>(\n      num_kernels, mc_ms_feat, spatial_shape, scale_start_index,\n      sample_location, weights, grad_output, grad_mc_ms_feat,\n      grad_sampling_location, grad_weights, batch_size, num_cams, num_feat,\n      num_embeds, num_scale, num_pts, num_groups, num_depths);\n}\n"
  },
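  {
    "path": "examples/depth_aggregation_reference.py",
    "content": "# Hypothetical pure-PyTorch reference (added for illustration; not a file in\n# the original repo). It sketches what the CUDA kernel\n# deformable_aggregation_with_depth computes for a single camera and a single\n# scale, assuming all sample locations are in range: bilinear sampling in the\n# (h, w) plane plus linear interpolation over depth bins, with the sampled\n# feature weighted by the interpolated depth probability. The kernel stores\n# feature channels and depth logits together per pixel (num_embeds +\n# num_depths); here they are passed as separate tensors for clarity.\nimport torch\nimport torch.nn.functional as F\n\n\ndef aggregate_one_scale(feat, depth_prob, loc, weight):\n    # feat: (C, H, W) features; depth_prob: (D, H, W) depth distribution;\n    # loc: (P, 3) normalized (w, h, d) sample locations; weight: (P,) weights.\n    num_pts = loc.shape[0]\n    num_depths = depth_prob.shape[0]\n    # grid_sample with align_corners=False maps x = 2 * loc_w - 1 to the\n    # pixel coordinate loc_w * W - 0.5, matching the kernel's w_im/h_im.\n    grid = (loc[:, :2] * 2 - 1).view(1, num_pts, 1, 2)\n    v = F.grid_sample(feat[None], grid, align_corners=False)[0, :, :, 0]\n    dp = F.grid_sample(depth_prob[None], grid, align_corners=False)[0, :, :, 0]\n    # linear interpolation between adjacent depth bins (the kernel's hd/ld)\n    d = loc[:, 2].clamp(0, num_depths - 1)\n    d_low = d.floor().long().clamp(max=num_depths - 2)\n    ld = d - d_low.float()\n    idx = torch.arange(num_pts)\n    p = dp[d_low, idx] * (1 - ld) + dp[d_low + 1, idx] * ld\n    # depth-probability-weighted features, reduced with the per-point weights\n    return (v * p[None] * weight[None]).sum(dim=1)  # (C,)\n"
  },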
  {
    "path": "bip3d/registry.py",
    "content": "from mmengine import DATASETS as MMENGINE_DATASETS\nfrom mmengine import METRICS as MMENGINE_METRICS\nfrom mmengine import MODELS as MMENGINE_MODELS\nfrom mmengine import TASK_UTILS as MMENGINE_TASK_UTILS\nfrom mmengine import TRANSFORMS as MMENGINE_TRANSFORMS\nfrom mmengine import VISBACKENDS as MMENGINE_VISBACKENDS\nfrom mmengine import VISUALIZERS as MMENGINE_VISUALIZERS\nfrom mmengine import Registry\n\nMODELS = Registry('model',\n                  parent=MMENGINE_MODELS,\n                  locations=['bip3d.models'])\nDATASETS = Registry('dataset',\n                    parent=MMENGINE_DATASETS,\n                    locations=['bip3d.datasets'])\nTRANSFORMS = Registry('transform',\n                      parent=MMENGINE_TRANSFORMS,\n                      locations=['bip3d.datasets.transforms'])\nMETRICS = Registry('metric',\n                   parent=MMENGINE_METRICS,\n                   locations=['bip3d.eval'])\nTASK_UTILS = Registry('task util',\n                      parent=MMENGINE_TASK_UTILS,\n                      locations=['bip3d.models'])\n"
  },
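  {
    "path": "examples/registry_usage.py",
    "content": "# Hypothetical usage sketch (added for illustration; not a file in the\n# original repo). Shows how the registries declared in bip3d/registry.py are\n# typically used with mmengine; `MyHead` and its config are made-up examples.\nfrom bip3d.registry import MODELS\n\n\n@MODELS.register_module()\nclass MyHead:\n    def __init__(self, embed_dims=256):\n        self.embed_dims = embed_dims\n\n\n# Because MODELS has mmengine's root MODELS registry as its parent,\n# registered components can be built from plain config dicts, as is done\n# throughout the project configs.\nhead = MODELS.build(dict(type='MyHead', embed_dims=128))\nassert head.embed_dims == 128\n"
  },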
  {
    "path": "bip3d/structures/__init__.py",
    "content": "# Copyright (c) OpenRobotLab. All rights reserved.\nfrom .bbox_3d import (BaseInstance3DBoxes, Box3DMode, Coord3DMode,\n                      EulerDepthInstance3DBoxes, EulerInstance3DBoxes,\n                      get_box_type, get_proj_mat_by_coord_type, limit_period,\n                      mono_cam_box2vis, points_cam2img, points_img2cam,\n                      rotation_3d_in_axis, rotation_3d_in_euler, xywhr2xyxyr)\n\n__all__ = [\n    'BaseInstance3DBoxes', 'Box3DMode', 'Coord3DMode', 'EulerInstance3DBoxes',\n    'EulerDepthInstance3DBoxes', 'get_box_type', 'get_proj_mat_by_coord_type',\n    'limit_period', 'mono_cam_box2vis', 'points_cam2img', 'points_img2cam',\n    'rotation_3d_in_axis', 'rotation_3d_in_euler', 'xywhr2xyxyr'\n]\n"
  },
  {
    "path": "bip3d/structures/bbox_3d/__init__.py",
    "content": "# Copyright (c) OpenRobotLab. All rights reserved.\nfrom .base_box3d import BaseInstance3DBoxes\nfrom .box_3d_mode import Box3DMode\nfrom .coord_3d_mode import Coord3DMode\nfrom .euler_box3d import EulerInstance3DBoxes\nfrom .euler_depth_box3d import EulerDepthInstance3DBoxes\nfrom .utils import (batch_points_cam2img, get_box_type,\n                    get_proj_mat_by_coord_type, limit_period, mono_cam_box2vis,\n                    points_cam2img, points_img2cam, rotation_3d_in_axis,\n                    rotation_3d_in_euler, xywhr2xyxyr)\n\n__all__ = [\n    'Box3DMode', 'BaseInstance3DBoxes', 'EulerInstance3DBoxes',\n    'EulerDepthInstance3DBoxes', 'xywhr2xyxyr', 'get_box_type',\n    'rotation_3d_in_axis', 'rotation_3d_in_euler', 'limit_period',\n    'points_cam2img', 'points_img2cam', 'Coord3DMode', 'mono_cam_box2vis',\n    'batch_points_cam2img', 'get_proj_mat_by_coord_type'\n]\n"
  },
  {
    "path": "bip3d/structures/bbox_3d/base_box3d.py",
    "content": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom abc import abstractmethod\nfrom typing import Iterator, Optional, Sequence, Tuple, Union\n\nimport numpy as np\nimport torch\nfrom torch import Tensor\n\nfrom bip3d.structures.points.base_points import BasePoints\n\nfrom .utils import limit_period\n\n\nclass BaseInstance3DBoxes:\n    \"\"\"Base class for 3D Boxes.\n\n    Note:\n        The box is bottom centered, i.e. the relative position of origin in the\n        box is (0.5, 0.5, 0).\n\n    Args:\n        tensor (Tensor or np.ndarray or Sequence[Sequence[float]]): The boxes\n            data with shape (N, box_dim).\n        box_dim (int): Number of the dimension of a box. Each row is\n            (x, y, z, x_size, y_size, z_size, yaw). Defaults to 7.\n        with_yaw (bool): Whether the box is with yaw rotation. If False, the\n            value of yaw will be set to 0 as minmax boxes. Defaults to True.\n        origin (Tuple[float]): Relative position of the box origin.\n            Defaults to (0.5, 0.5, 0). This will guide the box be converted to\n            (0.5, 0.5, 0) mode.\n\n    Attributes:\n        tensor (Tensor): Float matrix with shape (N, box_dim).\n        box_dim (int): Integer indicating the dimension of a box. Each row is\n            (x, y, z, x_size, y_size, z_size, yaw, ...).\n        with_yaw (bool): If True, the value of yaw will be set to 0 as minmax\n            boxes.\n    \"\"\"\n\n    YAW_AXIS: int = 0\n\n    def __init__(\n        self,\n        tensor: Union[Tensor, np.ndarray, Sequence[Sequence[float]]],\n        box_dim: int = 7,\n        with_yaw: bool = True,\n        origin: Tuple[float, float, float] = (0.5, 0.5, 0)\n    ) -> None:\n        if isinstance(tensor, Tensor):\n            device = tensor.device\n        else:\n            device = torch.device('cpu')\n        tensor = torch.as_tensor(tensor, dtype=torch.float32, device=device)\n        if tensor.numel() == 0:\n            # Use reshape, so we don't end up creating a new tensor that does\n            # not depend on the inputs (and consequently confuses jit)\n            tensor = tensor.reshape((-1, box_dim))\n        assert tensor.dim() == 2 and tensor.size(-1) == box_dim, \\\n            ('The box dimension must be 2 and the length of the last '\n             f'dimension must be {box_dim}, but got boxes with shape '\n             f'{tensor.shape}.')\n\n        if tensor.shape[-1] == 6:\n            # If the dimension of boxes is 6, we expand box_dim by padding 0 as\n            # a fake yaw and set with_yaw to False\n            assert box_dim == 6\n            fake_rot = tensor.new_zeros(tensor.shape[0], 1)\n            tensor = torch.cat((tensor, fake_rot), dim=-1)\n            self.box_dim = box_dim + 1\n            self.with_yaw = False\n        else:\n            self.box_dim = box_dim\n            self.with_yaw = with_yaw\n        self.tensor = tensor.clone()\n\n        if origin != (0.5, 0.5, 0):\n            dst = self.tensor.new_tensor((0.5, 0.5, 0))\n            src = self.tensor.new_tensor(origin)\n            self.tensor[:, :3] += self.tensor[:, 3:6] * (dst - src)\n\n    @property\n    def shape(self) -> torch.Size:\n        \"\"\"torch.Size: Shape of boxes.\"\"\"\n        return self.tensor.shape\n\n    @property\n    def volume(self) -> Tensor:\n        \"\"\"Tensor: A vector with volume of each box in shape (N, ).\"\"\"\n        return self.tensor[:, 3] * self.tensor[:, 4] * self.tensor[:, 5]\n\n    @property\n    def dims(self) -> Tensor:\n        
\"\"\"Tensor: Size dimensions of each box in shape (N, 3).\"\"\"\n        return self.tensor[:, 3:6]\n\n    @property\n    def yaw(self) -> Tensor:\n        \"\"\"Tensor: A vector with yaw of each box in shape (N, ).\"\"\"\n        return self.tensor[:, 6]\n\n    @property\n    def height(self) -> Tensor:\n        \"\"\"Tensor: A vector with height of each box in shape (N, ).\"\"\"\n        return self.tensor[:, 5]\n\n    @property\n    def top_height(self) -> Tensor:\n        \"\"\"Tensor: A vector with top height of each box in shape (N, ).\"\"\"\n        return self.bottom_height + self.height\n\n    @property\n    def bottom_height(self) -> Tensor:\n        \"\"\"Tensor: A vector with bottom height of each box in shape (N, ).\"\"\"\n        return self.tensor[:, 2]\n\n    @property\n    def center(self) -> Tensor:\n        \"\"\"Calculate the center of all the boxes.\n\n        Note:\n            In MMDetection3D's convention, the bottom center is usually taken\n            as the default center.\n\n            The relative position of the centers in different kinds of boxes\n            are different, e.g., the relative center of a boxes is\n            (0.5, 1.0, 0.5) in camera and (0.5, 0.5, 0) in lidar. It is\n            recommended to use ``bottom_center`` or ``gravity_center`` for\n            clearer usage.\n\n        Returns:\n            Tensor: A tensor with center of each box in shape (N, 3).\n        \"\"\"\n        return self.bottom_center\n\n    @property\n    def bottom_center(self) -> Tensor:\n        \"\"\"Tensor: A tensor with center of each box in shape (N, 3).\"\"\"\n        return self.tensor[:, :3]\n\n    @property\n    def gravity_center(self) -> Tensor:\n        \"\"\"Tensor: A tensor with center of each box in shape (N, 3).\"\"\"\n        bottom_center = self.bottom_center\n        gravity_center = torch.zeros_like(bottom_center)\n        gravity_center[:, :2] = bottom_center[:, :2]\n        gravity_center[:, 2] = bottom_center[:, 2] + self.tensor[:, 5] * 0.5\n        return gravity_center\n\n    @property\n    def corners(self) -> Tensor:\n        \"\"\"Tensor: A tensor with 8 corners of each box in shape (N, 8, 3).\"\"\"\n        pass\n\n    @property\n    def bev(self) -> Tensor:\n        \"\"\"Tensor: 2D BEV box of each box with rotation in XYWHR format, in\n        shape (N, 5).\"\"\"\n        return self.tensor[:, [0, 1, 3, 4, 6]]\n\n    @property\n    def nearest_bev(self) -> Tensor:\n        \"\"\"Tensor: A tensor of 2D BEV box of each box without rotation.\"\"\"\n        # Obtain BEV boxes with rotation in XYWHR format\n        bev_rotated_boxes = self.bev\n        # convert the rotation to a valid range\n        rotations = bev_rotated_boxes[:, -1]\n        normed_rotations = torch.abs(limit_period(rotations, 0.5, np.pi))\n\n        # find the center of boxes\n        conditions = (normed_rotations > np.pi / 4)[..., None]\n        bboxes_xywh = torch.where(conditions, bev_rotated_boxes[:,\n                                                                [0, 1, 3, 2]],\n                                  bev_rotated_boxes[:, :4])\n\n        centers = bboxes_xywh[:, :2]\n        dims = bboxes_xywh[:, 2:]\n        bev_boxes = torch.cat([centers - dims / 2, centers + dims / 2], dim=-1)\n        return bev_boxes\n\n    def in_range_bev(\n            self, box_range: Union[Tensor, np.ndarray,\n                                   Sequence[float]]) -> Tensor:\n        \"\"\"Check whether the boxes are in the given range.\n\n        Args:\n            box_range 
(Tensor or np.ndarray or Sequence[float]): The range of\n                box in order of (x_min, y_min, x_max, y_max).\n\n        Note:\n            The original implementation of SECOND checks whether boxes in a\n            range by checking whether the points are in a convex polygon, we\n            reduce the burden for simpler cases.\n\n        Returns:\n            Tensor: A binary vector indicating whether each box is inside the\n            reference range.\n        \"\"\"\n        in_range_flags = ((self.bev[:, 0] > box_range[0])\n                          & (self.bev[:, 1] > box_range[1])\n                          & (self.bev[:, 0] < box_range[2])\n                          & (self.bev[:, 1] < box_range[3]))\n        return in_range_flags\n\n    @abstractmethod\n    def rotate(\n        self,\n        angle: Union[Tensor, np.ndarray, float],\n        points: Optional[Union[Tensor, np.ndarray, BasePoints]] = None\n    ) -> Union[Tuple[Tensor, Tensor], Tuple[np.ndarray, np.ndarray], Tuple[\n            BasePoints, Tensor], None]:\n        \"\"\"Rotate boxes with points (optional) with the given angle or rotation\n        matrix.\n\n        Args:\n            angle (Tensor or np.ndarray or float): Rotation angle or rotation\n                matrix.\n            points (Tensor or np.ndarray or :obj:`BasePoints`, optional):\n                Points to rotate. Defaults to None.\n\n        Returns:\n            tuple or None: When ``points`` is None, the function returns None,\n            otherwise it returns the rotated points and the rotation matrix\n            ``rot_mat_T``.\n        \"\"\"\n        pass\n\n    @abstractmethod\n    def flip(\n        self,\n        bev_direction: str = 'horizontal',\n        points: Optional[Union[Tensor, np.ndarray, BasePoints]] = None\n    ) -> Union[Tensor, np.ndarray, BasePoints, None]:\n        \"\"\"Flip the boxes in BEV along given BEV direction.\n\n        Args:\n            bev_direction (str): Direction by which to flip. Can be chosen from\n                'horizontal' and 'vertical'. Defaults to 'horizontal'.\n            points (Tensor or np.ndarray or :obj:`BasePoints`, optional):\n                Points to flip. 
Defaults to None.\n\n        Returns:\n            Tensor or np.ndarray or :obj:`BasePoints` or None: When ``points``\n            is None, the function returns None, otherwise it returns the\n            flipped points.\n        \"\"\"\n        pass\n\n    def translate(self, trans_vector: Union[Tensor, np.ndarray]) -> None:\n        \"\"\"Translate boxes with the given translation vector.\n\n        Args:\n            trans_vector (Tensor or np.ndarray): Translation vector of size\n                1x3.\n        \"\"\"\n        if not isinstance(trans_vector, Tensor):\n            trans_vector = self.tensor.new_tensor(trans_vector)\n        self.tensor[:, :3] += trans_vector\n\n    def in_range_3d(\n            self, box_range: Union[Tensor, np.ndarray,\n                                   Sequence[float]]) -> Tensor:\n        \"\"\"Check whether the boxes are in the given range.\n\n        Args:\n            box_range (Tensor or np.ndarray or Sequence[float]): The range of\n                box (x_min, y_min, z_min, x_max, y_max, z_max).\n\n        Note:\n            In the original implementation of SECOND, checking whether a box in\n            the range checks whether the points are in a convex polygon, we try\n            to reduce the burden for simpler cases.\n\n        Returns:\n            Tensor: A binary vector indicating whether each point is inside the\n            reference range.\n        \"\"\"\n        in_range_flags = ((self.tensor[:, 0] > box_range[0])\n                          & (self.tensor[:, 1] > box_range[1])\n                          & (self.tensor[:, 2] > box_range[2])\n                          & (self.tensor[:, 0] < box_range[3])\n                          & (self.tensor[:, 1] < box_range[4])\n                          & (self.tensor[:, 2] < box_range[5]))\n        return in_range_flags\n\n    @abstractmethod\n    def convert_to(self,\n                   dst: int,\n                   rt_mat: Optional[Union[Tensor, np.ndarray]] = None,\n                   correct_yaw: bool = False) -> 'BaseInstance3DBoxes':\n        \"\"\"Convert self to ``dst`` mode.\n\n        Args:\n            dst (int): The target Box mode.\n            rt_mat (Tensor or np.ndarray, optional): The rotation and\n                translation matrix between different coordinates.\n                Defaults to None. The conversion from ``src`` coordinates to\n                ``dst`` coordinates usually comes along the change of sensors,\n                e.g., from camera to LiDAR. This requires a transformation\n                matrix.\n            correct_yaw (bool): Whether to convert the yaw angle to the target\n                coordinate. Defaults to False.\n\n        Returns:\n            :obj:`BaseInstance3DBoxes`: The converted box of the same type in\n            the ``dst`` mode.\n        \"\"\"\n        pass\n\n    def scale(self, scale_factor: float) -> None:\n        \"\"\"Scale the box with horizontal and vertical scaling factors.\n\n        Args:\n            scale_factors (float): Scale factors to scale the boxes.\n        \"\"\"\n        self.tensor[:, :6] *= scale_factor\n        self.tensor[:, 7:] *= scale_factor  # velocity\n\n    def limit_yaw(self, offset: float = 0.5, period: float = np.pi) -> None:\n        \"\"\"Limit the yaw to a given period and offset.\n\n        Args:\n            offset (float): The offset of the yaw. Defaults to 0.5.\n            period (float): The expected period. 
Defaults to np.pi.\n        \"\"\"\n        self.tensor[:, 6] = limit_period(self.tensor[:, 6], offset, period)\n\n    def nonempty(self, threshold: float = 0.0) -> Tensor:\n        \"\"\"Find boxes that are non-empty.\n\n        A box is considered empty if either of its side is no larger than\n        threshold.\n\n        Args:\n            threshold (float): The threshold of minimal sizes. Defaults to 0.0.\n\n        Returns:\n            Tensor: A binary vector which represents whether each box is empty\n            (False) or non-empty (True).\n        \"\"\"\n        box = self.tensor\n        size_x = box[..., 3]\n        size_y = box[..., 4]\n        size_z = box[..., 5]\n        keep = ((size_x > threshold)\n                & (size_y > threshold) & (size_z > threshold))\n        return keep\n\n    def __getitem__(\n            self, item: Union[int, slice, np.ndarray,\n                              Tensor]) -> 'BaseInstance3DBoxes':\n        \"\"\"\n        Args:\n            item (int or slice or np.ndarray or Tensor): Index of boxes.\n\n        Note:\n            The following usage are allowed:\n\n            1. `new_boxes = boxes[3]`: Return a `Boxes` that contains only one\n               box.\n            2. `new_boxes = boxes[2:10]`: Return a slice of boxes.\n            3. `new_boxes = boxes[vector]`: Where vector is a\n               torch.BoolTensor with `length = len(boxes)`. Nonzero elements in\n               the vector will be selected.\n\n            Note that the returned Boxes might share storage with this Boxes,\n            subject to PyTorch's indexing semantics.\n\n        Returns:\n            :obj:`BaseInstance3DBoxes`: A new object of\n            :class:`BaseInstance3DBoxes` after indexing.\n        \"\"\"\n        original_type = type(self)\n        if isinstance(item, int):\n            return original_type(self.tensor[item].view(1, -1),\n                                 box_dim=self.box_dim,\n                                 with_yaw=self.with_yaw)\n        b = self.tensor[item]\n        assert b.dim() == 2, \\\n            f'Indexing on Boxes with {item} failed to return a matrix!'\n        return original_type(b, box_dim=self.box_dim, with_yaw=self.with_yaw)\n\n    def __len__(self) -> int:\n        \"\"\"int: Number of boxes in the current object.\"\"\"\n        return self.tensor.shape[0]\n\n    def __repr__(self) -> str:\n        \"\"\"str: Return a string that describes the object.\"\"\"\n        return self.__class__.__name__ + '(\\n    ' + str(self.tensor) + ')'\n\n    @classmethod\n    def cat(cls, boxes_list: Sequence['BaseInstance3DBoxes']\n            ) -> 'BaseInstance3DBoxes':\n        \"\"\"Concatenate a list of Boxes into a single Boxes.\n\n        Args:\n            boxes_list (Sequence[:obj:`BaseInstance3DBoxes`]): List of boxes.\n\n        Returns:\n            :obj:`BaseInstance3DBoxes`: The concatenated boxes.\n        \"\"\"\n        assert isinstance(boxes_list, (list, tuple))\n        if len(boxes_list) == 0:\n            return cls(torch.empty(0))\n        assert all(isinstance(box, cls) for box in boxes_list)\n\n        # use torch.cat (v.s. 
layers.cat)\n        # so the returned boxes never share storage with input\n        cat_boxes = cls(torch.cat([b.tensor for b in boxes_list], dim=0),\n                        box_dim=boxes_list[0].box_dim,\n                        with_yaw=boxes_list[0].with_yaw)\n        return cat_boxes\n\n    def numpy(self) -> np.ndarray:\n        \"\"\"Reload ``numpy`` from self.tensor.\"\"\"\n        return self.tensor.numpy()\n\n    def to(self, device: Union[str, torch.device], *args,\n           **kwargs) -> 'BaseInstance3DBoxes':\n        \"\"\"Convert current boxes to a specific device.\n\n        Args:\n            device (str or :obj:`torch.device`): The name of the device.\n\n        Returns:\n            :obj:`BaseInstance3DBoxes`: A new boxes object on the specific\n            device.\n        \"\"\"\n        original_type = type(self)\n        return original_type(self.tensor.to(device, *args, **kwargs),\n                             box_dim=self.box_dim,\n                             with_yaw=self.with_yaw)\n\n    def cpu(self) -> 'BaseInstance3DBoxes':\n        \"\"\"Convert current boxes to cpu device.\n\n        Returns:\n            :obj:`BaseInstance3DBoxes`: A new boxes object on the cpu device.\n        \"\"\"\n        original_type = type(self)\n        return original_type(self.tensor.cpu(),\n                             box_dim=self.box_dim,\n                             with_yaw=self.with_yaw)\n\n    def cuda(self, *args, **kwargs) -> 'BaseInstance3DBoxes':\n        \"\"\"Convert current boxes to cuda device.\n\n        Returns:\n            :obj:`BaseInstance3DBoxes`: A new boxes object on the cuda device.\n        \"\"\"\n        original_type = type(self)\n        return original_type(self.tensor.cuda(*args, **kwargs),\n                             box_dim=self.box_dim,\n                             with_yaw=self.with_yaw)\n\n    def clone(self) -> 'BaseInstance3DBoxes':\n        \"\"\"Clone the boxes.\n\n        Returns:\n            :obj:`BaseInstance3DBoxes`: Box object with the same properties as\n            self.\n        \"\"\"\n        original_type = type(self)\n        return original_type(self.tensor.clone(),\n                             box_dim=self.box_dim,\n                             with_yaw=self.with_yaw)\n\n    def detach(self) -> 'BaseInstance3DBoxes':\n        \"\"\"Detach the boxes.\n\n        Returns:\n            :obj:`BaseInstance3DBoxes`: Box object with the same properties as\n            self.\n        \"\"\"\n        original_type = type(self)\n        return original_type(self.tensor.detach(),\n                             box_dim=self.box_dim,\n                             with_yaw=self.with_yaw)\n\n    @property\n    def device(self) -> torch.device:\n        \"\"\"torch.device: The device of the boxes are on.\"\"\"\n        return self.tensor.device\n\n    def __iter__(self) -> Iterator[Tensor]:\n        \"\"\"Yield a box as a Tensor at a time.\n\n        Returns:\n            Iterator[Tensor]: A box of shape (box_dim, ).\n        \"\"\"\n        yield from self.tensor\n\n    @classmethod\n    def height_overlaps(cls, boxes1: 'BaseInstance3DBoxes',\n                        boxes2: 'BaseInstance3DBoxes') -> Tensor:\n        \"\"\"Calculate height overlaps of two boxes.\n\n        Note:\n            This function calculates the height overlaps between ``boxes1`` and\n            ``boxes2``, ``boxes1`` and ``boxes2`` should be in the same type.\n\n        Args:\n            boxes1 (:obj:`BaseInstance3DBoxes`): Boxes 1 contain N boxes.\n       
     boxes2 (:obj:`BaseInstance3DBoxes`): Boxes 2 contain M boxes.\n\n        Returns:\n            Tensor: Calculated height overlap of the boxes.\n        \"\"\"\n        assert isinstance(boxes1, BaseInstance3DBoxes)\n        assert isinstance(boxes2, BaseInstance3DBoxes)\n        assert type(boxes1) == type(boxes2), \\\n            '\"boxes1\" and \"boxes2\" should be in the same type, ' \\\n            f'but got {type(boxes1)} and {type(boxes2)}.'\n\n        boxes1_top_height = boxes1.top_height.view(-1, 1)\n        boxes1_bottom_height = boxes1.bottom_height.view(-1, 1)\n        boxes2_top_height = boxes2.top_height.view(1, -1)\n        boxes2_bottom_height = boxes2.bottom_height.view(1, -1)\n\n        heighest_of_bottom = torch.max(boxes1_bottom_height,\n                                       boxes2_bottom_height)\n        lowest_of_top = torch.min(boxes1_top_height, boxes2_top_height)\n        overlaps_h = torch.clamp(lowest_of_top - heighest_of_bottom, min=0)\n        return overlaps_h\n\n    def new_box(\n        self, data: Union[Tensor, np.ndarray, Sequence[Sequence[float]]]\n    ) -> 'BaseInstance3DBoxes':\n        \"\"\"Create a new box object with data.\n\n        The new box and its tensor has the similar properties as self and\n        self.tensor, respectively.\n\n        Args:\n            data (Tensor or np.ndarray or Sequence[Sequence[float]]): Data to\n                be copied.\n\n        Returns:\n            :obj:`BaseInstance3DBoxes`: A new bbox object with ``data``, the\n            object's other properties are similar to ``self``.\n        \"\"\"\n        new_tensor = self.tensor.new_tensor(data) \\\n            if not isinstance(data, Tensor) else data.to(self.device)\n        original_type = type(self)\n        return original_type(new_tensor,\n                             box_dim=self.box_dim,\n                             with_yaw=self.with_yaw)\n"
  },
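  {
    "path": "examples/base_box3d_usage.py",
    "content": "# Hypothetical usage sketch (added for illustration; not a file in the\n# original repo). Exercises the tensor-backed box API from base_box3d.py:\n# rows are (x, y, z, x_size, y_size, z_size, yaw) with a bottom-centered\n# origin, so gravity_center sits half a height above bottom_center.\nimport torch\n\nfrom bip3d.structures.bbox_3d import BaseInstance3DBoxes\n\nboxes = BaseInstance3DBoxes(\n    torch.tensor([[0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0],\n                  [5.0, 5.0, 1.0, 1.0, 1.0, 4.0, 0.5]]))\nprint(boxes.volume)          # tensor([8., 4.])\nprint(boxes.gravity_center)  # z = bottom z + half the box height\nprint(boxes.bev.shape)       # (N, 5): x, y, dx, dy, yaw\nkeep = boxes.in_range_3d([-10, -10, -10, 10, 10, 10])\nsubset = boxes[keep]         # boolean indexing returns a new boxes object\nprint(len(subset))\n"
  },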
  {
    "path": "bip3d/structures/bbox_3d/box_3d_mode.py",
    "content": "# Copyright (c) OpenRobotLab. All rights reserved.\nfrom enum import IntEnum, unique\nfrom typing import Optional, Sequence, Union\n\nimport numpy as np\nimport torch\nfrom torch import Tensor\n\nfrom .base_box3d import BaseInstance3DBoxes\nfrom .utils import limit_period\n\n\n@unique\nclass Box3DMode(IntEnum):\n    \"\"\"Enum of different ways to represent a box.\n\n    Coordinates in LiDAR:\n\n    .. code-block:: none\n\n                    up z\n                       ^   x front\n                       |  /\n                       | /\n        left y <------ 0\n\n    The relative coordinate of bottom center in a LiDAR box is (0.5, 0.5, 0),\n    and the yaw is around the z axis, thus the rotation axis=2.\n\n    Coordinates in Camera:\n\n    .. code-block:: none\n\n                z front\n               /\n              /\n             0 ------> x right\n             |\n             |\n             v\n        down y\n\n    The relative coordinate of bottom center in a CAM box is (0.5, 1.0, 0.5),\n    and the yaw is around the y axis, thus the rotation axis=1.\n\n    Coordinates in Depth:\n\n    .. code-block:: none\n\n        up z\n           ^   y front\n           |  /\n           | /\n           0 ------> x right\n\n    The relative coordinate of bottom center in a DEPTH box is (0.5, 0.5, 0),\n    and the yaw is around the z axis, thus the rotation axis=2.\n    \"\"\"\n\n    LIDAR = 0\n    CAM = 1\n    DEPTH = 2\n    EULER_CAM = 3\n    EULER_DEPTH = 4\n\n    @staticmethod\n    def convert(\n        box: Union[Sequence[float], np.ndarray, Tensor, BaseInstance3DBoxes],\n        src: 'Box3DMode',\n        dst: 'Box3DMode',\n        rt_mat: Optional[Union[np.ndarray, Tensor]] = None,\n        with_yaw: bool = True,\n        correct_yaw: bool = False\n    ) -> Union[Sequence[float], np.ndarray, Tensor, BaseInstance3DBoxes]:\n        \"\"\"Convert boxes from ``src`` mode to ``dst`` mode.\n\n        Args:\n            box (Sequence[float] or np.ndarray or Tensor or\n                :obj:`BaseInstance3DBoxes`): Can be a k-tuple, k-list or an Nxk\n                array/tensor.\n            src (:obj:`Box3DMode`): The source box mode.\n            dst (:obj:`Box3DMode`): The target box mode.\n            rt_mat (np.ndarray or Tensor, optional): The rotation and\n                translation matrix between different coordinates.\n                Defaults to None. The conversion from ``src`` coordinates to\n                ``dst`` coordinates usually comes along the change of sensors,\n                e.g., from camera to LiDAR. 
This requires a transformation\n                matrix.\n            with_yaw (bool): If ``box`` is an instance of\n                :obj:`BaseInstance3DBoxes`, whether or not it has a yaw angle.\n                Defaults to True.\n            correct_yaw (bool): If the yaw is rotated by rt_mat.\n                Defaults to False.\n\n        Returns:\n            Sequence[float] or np.ndarray or Tensor or\n            :obj:`BaseInstance3DBoxes`: The converted box of the same type.\n        \"\"\"\n        if src == dst:\n            return box\n\n        is_numpy = isinstance(box, np.ndarray)\n        is_Instance3DBoxes = isinstance(box, BaseInstance3DBoxes)\n        single_box = isinstance(box, (list, tuple))\n        if single_box:\n            assert len(box) >= 7, (\n                'Box3DMode.convert takes either a k-tuple/list or '\n                'an Nxk array/tensor, where k >= 7')\n            arr = torch.tensor(box)[None, :]\n        else:\n            # avoid modifying the input box\n            if is_numpy:\n                arr = torch.from_numpy(np.asarray(box)).clone()\n            elif is_Instance3DBoxes:\n                arr = box.tensor.clone()\n            else:\n                arr = box.clone()\n\n        if is_Instance3DBoxes:\n            with_yaw = box.with_yaw\n\n        # convert box from `src` mode to `dst` mode.\n        x_size, y_size, z_size = arr[..., 3:4], arr[..., 4:5], arr[..., 5:6]\n        if with_yaw:\n            yaw = arr[..., 6:7]\n        if src == Box3DMode.LIDAR and dst == Box3DMode.CAM:\n            if rt_mat is None:\n                rt_mat = arr.new_tensor([[0, -1, 0], [0, 0, -1], [1, 0, 0]])\n            xyz_size = torch.cat([x_size, z_size, y_size], dim=-1)\n            if with_yaw:\n                if correct_yaw:\n                    yaw_vector = torch.cat([\n                        torch.cos(yaw),\n                        torch.sin(yaw),\n                        torch.zeros_like(yaw)\n                    ],\n                                           dim=1)\n                else:\n                    yaw = -yaw - np.pi / 2\n                    yaw = limit_period(yaw, period=np.pi * 2)\n        elif src == Box3DMode.CAM and dst == Box3DMode.LIDAR:\n            if rt_mat is None:\n                rt_mat = arr.new_tensor([[0, 0, 1], [-1, 0, 0], [0, -1, 0]])\n            xyz_size = torch.cat([x_size, z_size, y_size], dim=-1)\n            if with_yaw:\n                if correct_yaw:\n                    yaw_vector = torch.cat([\n                        torch.cos(-yaw),\n                        torch.zeros_like(yaw),\n                        torch.sin(-yaw)\n                    ],\n                                           dim=1)\n                else:\n                    yaw = -yaw - np.pi / 2\n                    yaw = limit_period(yaw, period=np.pi * 2)\n        elif src == Box3DMode.DEPTH and dst == Box3DMode.CAM:\n            if rt_mat is None:\n                rt_mat = arr.new_tensor([[1, 0, 0], [0, 0, -1], [0, 1, 0]])\n            xyz_size = torch.cat([x_size, z_size, y_size], dim=-1)\n            if with_yaw:\n                if correct_yaw:\n                    yaw_vector = torch.cat([\n                        torch.cos(yaw),\n                        torch.sin(yaw),\n                        torch.zeros_like(yaw)\n                    ],\n                                           dim=1)\n                else:\n                    yaw = -yaw\n        elif src == Box3DMode.CAM and dst == Box3DMode.DEPTH:\n            if rt_mat is None:\n 
               rt_mat = arr.new_tensor([[1, 0, 0], [0, 0, 1], [0, -1, 0]])\n            xyz_size = torch.cat([x_size, z_size, y_size], dim=-1)\n            if with_yaw:\n                if correct_yaw:\n                    yaw_vector = torch.cat([\n                        torch.cos(-yaw),\n                        torch.zeros_like(yaw),\n                        torch.sin(-yaw)\n                    ],\n                                           dim=1)\n                else:\n                    yaw = -yaw\n        elif src == Box3DMode.LIDAR and dst == Box3DMode.DEPTH:\n            if rt_mat is None:\n                rt_mat = arr.new_tensor([[0, -1, 0], [1, 0, 0], [0, 0, 1]])\n            xyz_size = torch.cat([x_size, y_size, z_size], dim=-1)\n            if with_yaw:\n                if correct_yaw:\n                    yaw_vector = torch.cat([\n                        torch.cos(yaw),\n                        torch.sin(yaw),\n                        torch.zeros_like(yaw)\n                    ],\n                                           dim=1)\n                else:\n                    yaw = yaw + np.pi / 2\n                    yaw = limit_period(yaw, period=np.pi * 2)\n        elif src == Box3DMode.DEPTH and dst == Box3DMode.LIDAR:\n            if rt_mat is None:\n                rt_mat = arr.new_tensor([[0, 1, 0], [-1, 0, 0], [0, 0, 1]])\n            xyz_size = torch.cat([x_size, y_size, z_size], dim=-1)\n            if with_yaw:\n                if correct_yaw:\n                    yaw_vector = torch.cat([\n                        torch.cos(yaw),\n                        torch.sin(yaw),\n                        torch.zeros_like(yaw)\n                    ],\n                                           dim=1)\n                else:\n                    yaw = yaw - np.pi / 2\n                    yaw = limit_period(yaw, period=np.pi * 2)\n        else:  # TODO: add transformation between euler boxes\n            raise NotImplementedError(\n                f'Conversion from Box3DMode {src} to {dst} '\n                'is not supported yet')\n\n        if not isinstance(rt_mat, Tensor):\n            rt_mat = arr.new_tensor(rt_mat)\n        if rt_mat.size(1) == 4:\n            extended_xyz = torch.cat(\n                [arr[..., :3], arr.new_ones(arr.size(0), 1)], dim=-1)\n            xyz = extended_xyz @ rt_mat.t()\n        else:\n            xyz = arr[..., :3] @ rt_mat.t()\n\n        # Note: we only use rotation in rt_mat\n        # so don't need to extend yaw_vector\n        if with_yaw and correct_yaw:\n            rot_yaw_vector = yaw_vector @ rt_mat[:3, :3].t()\n            if dst == Box3DMode.CAM:\n                yaw = torch.atan2(-rot_yaw_vector[:, [2]], rot_yaw_vector[:,\n                                                                          [0]])\n            elif dst in [Box3DMode.LIDAR, Box3DMode.DEPTH]:\n                yaw = torch.atan2(rot_yaw_vector[:, [1]], rot_yaw_vector[:,\n                                                                         [0]])\n            yaw = limit_period(yaw, period=np.pi * 2)\n\n        if with_yaw:\n            remains = arr[..., 7:]\n            arr = torch.cat([xyz[..., :3], xyz_size, yaw, remains], dim=-1)\n        else:\n            remains = arr[..., 6:]\n            arr = torch.cat([xyz[..., :3], xyz_size, remains], dim=-1)\n\n        # convert arr to the original type\n        original_type = type(box)\n        if single_box:\n            return original_type(arr.flatten().tolist())\n        if is_numpy:\n            return 
arr.numpy()\n        elif is_Instance3DBoxes:\n            raise NotImplementedError(\n                f'Conversion to {dst} through {original_type} '\n                'is not supported yet')\n        else:\n            return arr\n"
  },
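  {
    "path": "examples/box_3d_mode_usage.py",
    "content": "# Hypothetical usage sketch (added for illustration; not a file in the\n# original repo). Converts LiDAR-frame boxes to the camera frame with\n# Box3DMode.convert; with rt_mat=None the default axis-permutation matrix\n# from box_3d_mode.py is used and the yaw is remapped to the camera\n# convention (yaw = -yaw - pi / 2, wrapped by limit_period).\nimport torch\n\nfrom bip3d.structures.bbox_3d import Box3DMode\n\nlidar_boxes = torch.tensor([[1.0, 2.0, 0.0, 4.0, 2.0, 1.5, 0.3]])\ncam_boxes = Box3DMode.convert(lidar_boxes, Box3DMode.LIDAR, Box3DMode.CAM)\n# For a plain tensor input a plain tensor comes back, still (N, 7):\n# (x, y, z, x_size, y_size, z_size, yaw), now in camera axes.\nprint(cam_boxes)\n"
  },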
  {
    "path": "bip3d/structures/bbox_3d/coord_3d_mode.py",
    "content": "# Copyright (c) OpenRobotLab. All rights reserved.\nfrom enum import IntEnum, unique\nfrom typing import Optional, Sequence, Union\n\nimport numpy as np\nimport torch\nfrom torch import Tensor\n\nfrom bip3d.structures.points import (BasePoints, CameraPoints,\n                                            DepthPoints, LiDARPoints)\n\nfrom .base_box3d import BaseInstance3DBoxes\nfrom .box_3d_mode import Box3DMode\n\n\n@unique\nclass Coord3DMode(IntEnum):\n    \"\"\"Enum of different ways to represent a box and point cloud.\n\n    Coordinates in LiDAR:\n\n    .. code-block:: none\n\n                    up z\n                       ^   x front\n                       |  /\n                       | /\n        left y <------ 0\n\n    The relative coordinate of bottom center in a LiDAR box is (0.5, 0.5, 0),\n    and the yaw is around the z axis, thus the rotation axis=2.\n\n    Coordinates in Camera:\n\n    .. code-block:: none\n\n                z front\n               /\n              /\n             0 ------> x right\n             |\n             |\n             v\n        down y\n\n    The relative coordinate of bottom center in a CAM box is (0.5, 1.0, 0.5),\n    and the yaw is around the y axis, thus the rotation axis=1.\n\n    Coordinates in Depth:\n\n    .. code-block:: none\n\n        up z\n           ^   y front\n           |  /\n           | /\n           0 ------> x right\n\n    The relative coordinate of bottom center in a DEPTH box is (0.5, 0.5, 0),\n    and the yaw is around the z axis, thus the rotation axis=2.\n    \"\"\"\n\n    LIDAR = 0\n    CAM = 1\n    DEPTH = 2\n\n    @staticmethod\n    def convert(input: Union[Sequence[float], np.ndarray, Tensor,\n                             BaseInstance3DBoxes, BasePoints],\n                src: Union[Box3DMode, 'Coord3DMode'],\n                dst: Union[Box3DMode, 'Coord3DMode'],\n                rt_mat: Optional[Union[np.ndarray, Tensor]] = None,\n                with_yaw: bool = True,\n                correct_yaw: bool = False,\n                is_point: bool = True):\n        \"\"\"Convert boxes or points from ``src`` mode to ``dst`` mode.\n\n        Args:\n            input (Sequence[float] or np.ndarray or Tensor or\n                :obj:`BaseInstance3DBoxes` or :obj:`BasePoints`): Can be a\n                k-tuple, k-list or an Nxk array/tensor.\n            src (:obj:`Box3DMode` or :obj:`Coord3DMode`): The source mode.\n            dst (:obj:`Box3DMode` or :obj:`Coord3DMode`): The target mode.\n            rt_mat (np.ndarray or Tensor, optional): The rotation and\n                translation matrix between different coordinates.\n                Defaults to None. The conversion from ``src`` coordinates to\n                ``dst`` coordinates usually comes along the change of sensors,\n                e.g., from camera to LiDAR. 
This requires a transformation\n                matrix.\n            with_yaw (bool): If ``box`` is an instance of\n                :obj:`BaseInstance3DBoxes`, whether or not it has a yaw angle.\n                Defaults to True.\n            correct_yaw (bool): If the yaw is rotated by rt_mat.\n                Defaults to False.\n            is_point (bool): If ``input`` is neither an instance of\n                :obj:`BaseInstance3DBoxes` nor an instance of\n                :obj:`BasePoints`, whether or not it is point data.\n                Defaults to True.\n\n        Returns:\n            Sequence[float] or np.ndarray or Tensor or\n            :obj:`BaseInstance3DBoxes` or :obj:`BasePoints`: The converted box\n            or points of the same type.\n        \"\"\"\n        if isinstance(input, BaseInstance3DBoxes):\n            return Coord3DMode.convert_box(input,\n                                           src,\n                                           dst,\n                                           rt_mat=rt_mat,\n                                           with_yaw=with_yaw,\n                                           correct_yaw=correct_yaw)\n        elif isinstance(input, BasePoints):\n            return Coord3DMode.convert_point(input, src, dst, rt_mat=rt_mat)\n        elif isinstance(input, (tuple, list, np.ndarray, Tensor)):\n            if is_point:\n                return Coord3DMode.convert_point(input,\n                                                 src,\n                                                 dst,\n                                                 rt_mat=rt_mat)\n            else:\n                return Coord3DMode.convert_box(input,\n                                               src,\n                                               dst,\n                                               rt_mat=rt_mat,\n                                               with_yaw=with_yaw,\n                                               correct_yaw=correct_yaw)\n        else:\n            raise NotImplementedError\n\n    @staticmethod\n    def convert_box(\n        box: Union[Sequence[float], np.ndarray, Tensor, BaseInstance3DBoxes],\n        src: Box3DMode,\n        dst: Box3DMode,\n        rt_mat: Optional[Union[np.ndarray, Tensor]] = None,\n        with_yaw: bool = True,\n        correct_yaw: bool = False\n    ) -> Union[Sequence[float], np.ndarray, Tensor, BaseInstance3DBoxes]:\n        \"\"\"Convert boxes from ``src`` mode to ``dst`` mode.\n\n        Args:\n            box (Sequence[float] or np.ndarray or Tensor or\n                :obj:`BaseInstance3DBoxes`): Can be a k-tuple, k-list or an Nxk\n                array/tensor.\n            src (:obj:`Box3DMode`): The source box mode.\n            dst (:obj:`Box3DMode`): The target box mode.\n            rt_mat (np.ndarray or Tensor, optional): The rotation and\n                translation matrix between different coordinates.\n                Defaults to None. The conversion from ``src`` coordinates to\n                ``dst`` coordinates usually comes along the change of sensors,\n                e.g., from camera to LiDAR. 
This requires a transformation\n                matrix.\n            with_yaw (bool): If ``box`` is an instance of\n                :obj:`BaseInstance3DBoxes`, whether or not it has a yaw angle.\n                Defaults to True.\n            correct_yaw (bool): If the yaw is rotated by rt_mat.\n                Defaults to False.\n\n        Returns:\n            Sequence[float] or np.ndarray or Tensor or\n            :obj:`BaseInstance3DBoxes`: The converted box of the same type.\n        \"\"\"\n        return Box3DMode.convert(box,\n                                 src,\n                                 dst,\n                                 rt_mat=rt_mat,\n                                 with_yaw=with_yaw,\n                                 correct_yaw=correct_yaw)\n\n    @staticmethod\n    def convert_point(\n        point: Union[Sequence[float], np.ndarray, Tensor, BasePoints],\n        src: 'Coord3DMode',\n        dst: 'Coord3DMode',\n        rt_mat: Optional[Union[np.ndarray, Tensor]] = None,\n    ) -> Union[Sequence[float], np.ndarray, Tensor, BasePoints]:\n        \"\"\"Convert points from ``src`` mode to ``dst`` mode.\n\n        Args:\n            point (Sequence[float] or np.ndarray or Tensor or\n                :obj:`BasePoints`): Can be a k-tuple, k-list or an Nxk\n                array/tensor.\n            src (:obj:`Coord3DMode`): The source point mode.\n            dst (:obj:`Coord3DMode`): The target point mode.\n            rt_mat (np.ndarray or Tensor, optional): The rotation and\n                translation matrix between different coordinates.\n                Defaults to None. The conversion from ``src`` coordinates to\n                ``dst`` coordinates usually comes along the change of sensors,\n                e.g., from camera to LiDAR. This requires a transformation\n                matrix.\n\n        Returns:\n            Sequence[float] or np.ndarray or Tensor or :obj:`BasePoints`: The\n            converted point of the same type.\n        \"\"\"\n        if src == dst:\n            return point\n\n        is_numpy = isinstance(point, np.ndarray)\n        is_InstancePoints = isinstance(point, BasePoints)\n        single_point = isinstance(point, (list, tuple))\n        if single_point:\n            assert len(point) >= 3, (\n                'Coord3DMode.convert takes either a k-tuple/list or '\n                'an Nxk array/tensor, where k >= 3')\n            arr = torch.tensor(point)[None, :]\n        else:\n            # avoid modifying the input point\n            if is_numpy:\n                arr = torch.from_numpy(np.asarray(point)).clone()\n            elif is_InstancePoints:\n                arr = point.tensor.clone()\n            else:\n                arr = point.clone()\n\n        # convert point from `src` mode to `dst` mode.\n        if src == Coord3DMode.LIDAR and dst == Coord3DMode.CAM:\n            if rt_mat is None:\n                rt_mat = arr.new_tensor([[0, -1, 0], [0, 0, -1], [1, 0, 0]])\n        elif src == Coord3DMode.CAM and dst == Coord3DMode.LIDAR:\n            if rt_mat is None:\n                rt_mat = arr.new_tensor([[0, 0, 1], [-1, 0, 0], [0, -1, 0]])\n        elif src == Coord3DMode.DEPTH and dst == Coord3DMode.CAM:\n            if rt_mat is None:\n                rt_mat = arr.new_tensor([[1, 0, 0], [0, 0, -1], [0, 1, 0]])\n        elif src == Coord3DMode.CAM and dst == Coord3DMode.DEPTH:\n            if rt_mat is None:\n                rt_mat = arr.new_tensor([[1, 0, 0], [0, 0, 1], [0, -1, 0]])\n        elif src == Coord3DMode.LIDAR and dst == Coord3DMode.DEPTH:\n            if rt_mat is None:\n                rt_mat = arr.new_tensor([[0, -1, 0], [1, 0, 0], [0, 0, 1]])\n        elif src == Coord3DMode.DEPTH and dst == Coord3DMode.LIDAR:\n            if rt_mat is None:\n                rt_mat = arr.new_tensor([[0, 1, 0], [-1, 0, 0], [0, 0, 1]])\n        else:\n            raise NotImplementedError(\n                f'Conversion from Coord3DMode {src} to {dst} '\n                'is not supported yet')\n\n        if not isinstance(rt_mat, Tensor):\n            rt_mat = arr.new_tensor(rt_mat)\n        if rt_mat.size(1) == 4:\n            extended_xyz = torch.cat(\n                [arr[..., :3], arr.new_ones(arr.size(0), 1)], dim=-1)\n            xyz = extended_xyz @ rt_mat.t()\n        else:\n            xyz = arr[..., :3] @ rt_mat.t()\n\n        remains = arr[..., 3:]\n        arr = torch.cat([xyz[..., :3], remains], dim=-1)\n\n        # convert arr to the original type\n        original_type = type(point)\n        if single_point:\n            return original_type(arr.flatten().tolist())\n        if is_numpy:\n            return arr.numpy()\n        elif is_InstancePoints:\n            if dst == Coord3DMode.CAM:\n                target_type = CameraPoints\n            elif dst == Coord3DMode.LIDAR:\n                target_type = LiDARPoints\n            elif dst == Coord3DMode.DEPTH:\n                target_type = DepthPoints\n            else:\n                raise NotImplementedError(\n                    f'Conversion to {dst} through {original_type} '\n                    'is not supported yet')\n            return target_type(arr,\n                               points_dim=arr.size(-1),\n                               attribute_dims=point.attribute_dims)\n        else:\n            return arr\n"
  },
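  {
    "path": "examples/coord_3d_mode_usage.py",
    "content": "# Hypothetical usage sketch (added for illustration; not a file in the\n# original repo). Converts raw depth-frame points to the camera frame; for a\n# plain array/tensor input, is_point=True (the default) routes the call\n# through Coord3DMode.convert_point.\nimport torch\n\nfrom bip3d.structures.bbox_3d import Coord3DMode\n\npoints = torch.tensor([[1.0, 2.0, 3.0],\n                       [0.0, 1.0, 0.5]])\ncam_points = Coord3DMode.convert(points, Coord3DMode.DEPTH, Coord3DMode.CAM)\n# DEPTH (z up, y front) -> CAM (y down, z front):\n# x_cam = x_depth, y_cam = -z_depth, z_cam = y_depth\nprint(cam_points)\n"
  },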
  {
    "path": "bip3d/structures/bbox_3d/euler_box3d.py",
    "content": "# Copyright (c) OpenRobotLab. All rights reserved.\nimport numpy as np\nimport torch\nfrom pytorch3d.ops import box3d_overlap\nfrom pytorch3d.transforms import euler_angles_to_matrix, matrix_to_euler_angles\n\nfrom ..points.base_points import BasePoints\nfrom .base_box3d import BaseInstance3DBoxes\nfrom .utils import rotation_3d_in_euler\n\n\nclass EulerInstance3DBoxes(BaseInstance3DBoxes):\n    \"\"\"3D boxes with 1-D orientation represented by three Euler angles.\n\n    See https://en.wikipedia.org/wiki/Euler_angles for\n        regarding the definition of Euler angles.\n\n    Attributes:\n        tensor (torch.Tensor): Float matrix of N x box_dim.\n        box_dim (int): Integer indicates the dimension of a box\n            Each row is (x, y, z, x_size, y_size, z_size, alpha, beta, gamma).\n    \"\"\"\n\n    def __init__(self, tensor, box_dim=9, origin=(0.5, 0.5, 0.5)):\n        if isinstance(tensor, torch.Tensor):\n            device = tensor.device\n        else:\n            device = torch.device('cpu')\n        tensor = torch.as_tensor(tensor, dtype=torch.float32, device=device)\n        if tensor.numel() == 0:\n            # Use reshape, so we don't end up creating a new tensor that\n            # does not depend on the inputs (and consequently confuses jit)\n            tensor = tensor.reshape((0, box_dim)).to(dtype=torch.float32,\n                                                     device=device)\n        assert tensor.dim() == 2 and tensor.size(-1) == box_dim, tensor.size()\n\n        if tensor.shape[-1] == 6:\n            # If the dimension of boxes is 6, we expand box_dim by padding\n            # (0, 0, 0) as a fake euler angle.\n            assert box_dim == 6\n            fake_rot = tensor.new_zeros(tensor.shape[0], 3)\n            tensor = torch.cat((tensor, fake_rot), dim=-1)\n            self.box_dim = box_dim + 3\n        elif tensor.shape[-1] == 7:\n            assert box_dim == 7\n            fake_euler = tensor.new_zeros(tensor.shape[0], 2)\n            tensor = torch.cat((tensor, fake_euler), dim=-1)\n            self.box_dim = box_dim + 2\n        else:\n            assert tensor.shape[-1] == 9\n            self.box_dim = box_dim\n        self.tensor = tensor.clone()\n\n        self.origin = origin\n        if origin != (0.5, 0.5, 0.5):\n            dst = self.tensor.new_tensor((0.5, 0.5, 0.5))\n            src = self.tensor.new_tensor(origin)\n            self.tensor[:, :3] += self.tensor[:, 3:6] * (dst - src)\n\n    def get_corners(self, tensor1):\n        \"\"\"torch.Tensor: Coordinates of corners of all the boxes\n        in shape (N, 8, 3).\n\n        Convert the boxes to corners in clockwise order, in form of\n        ``(x0y0z0, x0y0z1, x0y1z1, x0y1z0, x1y0z0, x1y0z1, x1y1z1, x1y1z0)``\n\n        .. code-block:: none\n\n                                           up z\n                            front y           ^\n                                 /            |\n                                /             |\n                  (x0, y1, z1) + -----------  + (x1, y1, z1)\n                              /|            / |\n                             / |           /  |\n               (x0, y0, z1) + ----------- +   + (x1, y1, z0)\n                            |  /      .   
|  /\n                            | / origin    | /\n               (x0, y0, z0) + ----------- + --------> right x\n                                          (x1, y0, z0)\n        \"\"\"\n        if tensor1.numel() == 0:\n            return torch.empty([0, 8, 3], device=tensor1.device)\n\n        dims = tensor1[:, 3:6]\n        corners_norm = torch.from_numpy(\n            np.stack(np.unravel_index(np.arange(8), [2] * 3),\n                     axis=1)).to(device=dims.device, dtype=dims.dtype)\n\n        corners_norm = corners_norm[[0, 1, 3, 2, 4, 5, 7, 6]]\n        # use relative origin\n        assert self.origin == (0.5, 0.5, 0.5), \\\n            'self.origin != (0.5, 0.5, 0.5) needs to be checked!'\n        corners_norm = corners_norm - dims.new_tensor(self.origin)\n        corners = dims.view([-1, 1, 3]) * corners_norm.reshape([1, 8, 3])\n\n        # rotate\n        corners = rotation_3d_in_euler(corners, tensor1[:, 6:])\n\n        corners += tensor1[:, :3].view(-1, 1, 3)\n        return corners\n\n    @classmethod\n    def overlaps(cls, boxes1, boxes2, mode='iou', eps=1e-4):\n        \"\"\"Calculate 3D overlaps of two boxes.\n\n        Note:\n            This function calculates the overlaps between ``boxes1`` and\n            ``boxes2``, ``boxes1`` and ``boxes2`` should be in the same type.\n\n        Args:\n            boxes1 (:obj:`EulerInstance3DBoxes`): Boxes 1 contain N boxes.\n            boxes2 (:obj:`EulerInstance3DBoxes`): Boxes 2 contain M boxes.\n            mode (str): Mode of iou calculation. Defaults to 'iou'.\n            eps (bool): Epsilon. Defaults to 1e-4.\n\n        Returns:\n            torch.Tensor: Calculated 3D overlaps of the boxes.\n        \"\"\"\n        assert isinstance(boxes1, EulerInstance3DBoxes)\n        assert isinstance(boxes2, EulerInstance3DBoxes)\n        assert type(boxes1) == type(boxes2), '\"boxes1\" and \"boxes2\" should' \\\n            f'be in the same type, got {type(boxes1)} and {type(boxes2)}.'\n\n        assert mode in ['iou']\n\n        rows = len(boxes1)\n        cols = len(boxes2)\n        if rows * cols == 0:\n            return boxes1.tensor.new(rows, cols)\n\n        corners1 = boxes1.corners\n        corners2 = boxes2.corners\n        _, iou3d = box3d_overlap(corners1, corners2, eps=eps)\n        return iou3d\n\n    @property\n    def gravity_center(self):\n        \"\"\"torch.Tensor: A tensor with center of each box in shape (N, 3).\"\"\"\n        return self.tensor[:, :3]\n\n    @property\n    def corners(self):\n        \"\"\"torch.Tensor: Coordinates of corners of all the boxes\n        in shape (N, 8, 3).\n\n        Convert the boxes to corners in clockwise order, in form of\n        ``(x0y0z0, x0y0z1, x0y1z1, x0y1z0, x1y0z0, x1y0z1, x1y1z1, x1y1z0)``\n\n        .. code-block:: none\n\n                                           up z\n                            front y           ^\n                                 /            |\n                                /             |\n                  (x0, y1, z1) + -----------  + (x1, y1, z1)\n                              /|            / |\n                             / |           /  |\n               (x0, y0, z1) + ----------- +   + (x1, y1, z0)\n                            |  /      .   
|  /\n                            | / origin    | /\n               (x0, y0, z0) + ----------- + --------> right x\n                                          (x1, y0, z0)\n        \"\"\"\n        if self.tensor.numel() == 0:\n            return torch.empty([0, 8, 3], device=self.tensor.device)\n\n        dims = self.dims\n        corners_norm = torch.from_numpy(\n            np.stack(np.unravel_index(np.arange(8), [2] * 3),\n                     axis=1)).to(device=dims.device, dtype=dims.dtype)\n\n        corners_norm = corners_norm[[0, 1, 3, 2, 4, 5, 7, 6]]\n        # use relative origin\n        assert self.origin == (0.5, 0.5, 0.5), \\\n            'self.origin != (0.5, 0.5, 0.5) needs to be checked!'\n        corners_norm = corners_norm - dims.new_tensor(self.origin)\n        corners = dims.view([-1, 1, 3]) * corners_norm.reshape([1, 8, 3])\n\n        # rotate\n        corners = rotation_3d_in_euler(corners, self.tensor[:, 6:])\n\n        corners += self.tensor[:, :3].view(-1, 1, 3)\n        return corners\n\n    def transform(self, matrix):\n        if self.tensor.shape[0] == 0:\n            return\n        if not isinstance(matrix, torch.Tensor):\n            matrix = self.tensor.new_tensor(matrix)\n        points = self.tensor[:, :3]\n        constant = points.new_ones(points.shape[0], 1)\n        points_extend = torch.concat([points, constant], dim=-1)\n        points_trans = torch.matmul(points_extend, matrix.transpose(-2,\n                                                                    -1))[:, :3]\n\n        size = self.tensor[:, 3:6]\n\n        # angle_delta = matrix_to_euler_angles(matrix[:3,:3], 'ZXY')\n        # angle = self.tensor[:,6:] + angle_delta\n        ori_matrix = euler_angles_to_matrix(self.tensor[:, 6:], 'ZXY')\n        rot_matrix = matrix[:3, :3].expand_as(ori_matrix)\n        final = torch.bmm(rot_matrix, ori_matrix)\n        angle = matrix_to_euler_angles(final, 'ZXY')\n\n        self.tensor = torch.cat([points_trans, size, angle], dim=-1)\n\n    def scale(self, scale_factor: float) -> None:\n        \"\"\"Scale the box with horizontal and vertical scaling factors.\n\n        Args:\n            scale_factors (float): Scale factors to scale the boxes.\n        \"\"\"\n        self.tensor[:, :6] *= scale_factor\n\n    def rotate(self, angle, points=None):\n        \"\"\"Rotate boxes with points (optional) with the given angle or rotation\n        matrix.\n\n        Args:\n            angle (float | torch.Tensor | np.ndarray):\n                Rotation angle or rotation matrix.\n            points (torch.Tensor | np.ndarray | :obj:`BasePoints`, optional):\n                Points to rotate. 
Defaults to None.\n\n        Returns:\n            tuple or torch.Tensor: When ``points`` is None, the function\n                returns the (transposed) rotation matrix ``rot_mat_T``;\n                otherwise it returns the rotated points together with\n                ``rot_mat_T``.\n        \"\"\"\n        if not isinstance(angle, torch.Tensor):\n            angle = self.tensor.new_tensor(angle)\n\n        if angle.numel() == 1:  # only given yaw angle for rotation\n            angle = self.tensor.new_tensor([float(angle), 0., 0.])\n            rot_matrix = euler_angles_to_matrix(angle, 'ZXY')\n        elif angle.numel() == 3:\n            rot_matrix = euler_angles_to_matrix(angle, 'ZXY')\n        elif angle.shape == torch.Size([3, 3]):\n            rot_matrix = angle\n        else:\n            raise NotImplementedError(\n                f'Unsupported angle shape: {angle.shape}')\n\n        rot_mat_T = rot_matrix.T\n        # keep the transform on the same device/dtype as the boxes\n        transform_matrix = torch.eye(4,\n                                     dtype=self.tensor.dtype,\n                                     device=self.tensor.device)\n        transform_matrix[:3, :3] = rot_matrix\n        self.transform(transform_matrix)\n\n        if points is not None:\n            if isinstance(points, torch.Tensor):\n                points[:, :3] = points[:, :3] @ rot_mat_T\n            elif isinstance(points, np.ndarray):\n                rot_mat_T = rot_mat_T.cpu().numpy()\n                points[:, :3] = np.dot(points[:, :3], rot_mat_T)\n            elif isinstance(points, BasePoints):\n                points.rotate(rot_mat_T)\n            else:\n                raise ValueError(f'Unsupported points type: {type(points)}')\n            return points, rot_mat_T\n        else:\n            return rot_mat_T\n\n    def flip(self, direction='X'):\n        \"\"\"Flip the boxes along the corresponding axis.\n\n        Args:\n            direction (str, optional): Flip axis. Defaults to 'X'.\n        \"\"\"\n        assert direction in ['X', 'Y', 'Z']\n        if direction == 'X':\n            self.tensor[:, 0] = -self.tensor[:, 0]\n            self.tensor[:, 6] = -self.tensor[:, 6] + np.pi\n            self.tensor[:, 8] = -self.tensor[:, 8]\n        elif direction == 'Y':\n            self.tensor[:, 1] = -self.tensor[:, 1]\n            self.tensor[:, 6] = -self.tensor[:, 6]\n            self.tensor[:, 7] = -self.tensor[:, 7] + np.pi\n        elif direction == 'Z':\n            self.tensor[:, 2] = -self.tensor[:, 2]\n            self.tensor[:, 7] = -self.tensor[:, 7]\n            self.tensor[:, 8] = -self.tensor[:, 8] + np.pi\n"
  },
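A minimal usage sketch for the `EulerInstance3DBoxes` API above, assuming the constructor defaults to `box_dim=9` and a `(0.5, 0.5, 0.5)` origin with rows laid out as (x, y, z, x_size, y_size, z_size, alpha, beta, gamma); the box values are illustrative, and `overlaps` additionally needs the pytorch3d dependency that provides `box3d_overlap`:

```python
import torch

from bip3d.structures.bbox_3d.euler_box3d import EulerInstance3DBoxes

# Two illustrative 9-DoF boxes sharing a center; the second is yawed ~90 deg.
boxes_a = EulerInstance3DBoxes(
    torch.tensor([[0.0, 0.0, 0.0, 2.0, 1.0, 1.0, 0.0, 0.0, 0.0]]))
boxes_b = EulerInstance3DBoxes(
    torch.tensor([[0.0, 0.0, 0.0, 2.0, 1.0, 1.0, 1.5708, 0.0, 0.0]]))

corners = boxes_a.corners                              # (1, 8, 3) corners
iou = EulerInstance3DBoxes.overlaps(boxes_a, boxes_b)  # (1, 1) IoU matrix
print(corners.shape, iou.shape)
```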
  {
    "path": "bip3d/structures/bbox_3d/euler_depth_box3d.py",
    "content": "# Copyright (c) OpenRobotLab. All rights reserved.\nimport numpy as np\nimport torch\nfrom mmcv.ops import points_in_boxes_all, points_in_boxes_part\n\nfrom ..points.base_points import BasePoints\nfrom .euler_box3d import EulerInstance3DBoxes\n\n\nclass EulerDepthInstance3DBoxes(EulerInstance3DBoxes):\n    \"\"\"3D boxes of instances in Depth coordinates.\n\n    We keep the \"Depth\" coordinate system definition in MMDet3D just for\n    clarification of the points coordinates and the flipping augmentation.\n\n    Coordinates in Depth:\n\n    .. code-block:: none\n\n                    up z    y front (alpha=0.5*pi)\n                       ^   ^\n                       |  /\n                       | /\n                       0 ------> x right (alpha=0)\n\n    The relative coordinate of bottom center in a Depth box is (0.5, 0.5, 0),\n    and the yaw is around the z axis, thus the rotation axis=2.\n    The yaw is 0 at the positive direction of x axis, and decreases from\n    the positive direction of x to the positive direction of y.\n    Also note that rotation of DepthInstance3DBoxes is counterclockwise,\n    which is reverse to the definition of the yaw angle (clockwise).\n\n    Attributes:\n        tensor (torch.Tensor): Float matrix of N x box_dim.\n        box_dim (int): Integer indicates the dimension of a box\n            Each row is (x, y, z, x_size, y_size, z_size, alpha, beta, gamma).\n        with_yaw (bool): If True, the value of yaw will be set to 0 as minmax\n            boxes.\n    \"\"\"\n\n    def __init__(self,\n                 tensor,\n                 box_dim=9,\n                 with_yaw=True,\n                 origin=(0.5, 0.5, 0.5)):\n        super().__init__(tensor, box_dim, origin)\n        self.with_yaw = with_yaw\n\n    def flip(self, bev_direction='horizontal', points=None):\n        \"\"\"Flip the boxes in BEV along given BEV direction.\n\n        In Depth coordinates, it flips x (horizontal) or y (vertical) axis.\n\n        Args:\n            bev_direction (str, optional): Flip direction\n                (horizontal or vertical). Defaults to 'horizontal'.\n            points (torch.Tensor | np.ndarray | :obj:`BasePoints`, optional):\n                Points to flip. 
Defaults to None.\n\n        Returns:\n            torch.Tensor, numpy.ndarray or None: Flipped points.\n        \"\"\"\n        assert bev_direction in ('horizontal', 'vertical')\n        if bev_direction == 'horizontal':\n            super().flip(direction='X')\n        elif bev_direction == 'vertical':\n            super().flip(direction='Y')\n\n        if points is not None:\n            assert isinstance(points, (torch.Tensor, np.ndarray, BasePoints))\n            if isinstance(points, (torch.Tensor, np.ndarray)):\n                if bev_direction == 'horizontal':\n                    points[:, 0] = -points[:, 0]\n                elif bev_direction == 'vertical':\n                    points[:, 1] = -points[:, 1]\n            elif isinstance(points, BasePoints):\n                points.flip(bev_direction)\n            return points\n\n    def convert_to(self, dst, rt_mat=None):\n        \"\"\"Convert self to ``dst`` mode.\n\n        Args:\n            dst (:obj:`Box3DMode`): The target Box mode.\n            rt_mat (np.ndarray | torch.Tensor, optional): The rotation and\n                translation matrix between different coordinates.\n                Defaults to None.\n                The conversion from ``src`` coordinates to ``dst`` coordinates\n                usually comes along the change of sensors, e.g., from camera\n                to LiDAR. This requires a transformation matrix.\n\n        Returns:\n            :obj:`DepthInstance3DBoxes`:\n                The converted box of the same type in the ``dst`` mode.\n        \"\"\"\n        from .box_3d_mode import Box3DMode\n        assert dst == Box3DMode.EULER_DEPTH\n        return self\n\n    def points_in_boxes_part(self, points, boxes_override=None):\n        \"\"\"Find the box in which each point is.\n\n        Args:\n            points (torch.Tensor): Points in shape (1, M, 3) or (M, 3),\n                3 dimensions are (x, y, z) in LiDAR or depth coordinate.\n            boxes_override (torch.Tensor, optional): Boxes to override\n                `self.tensor`. Defaults to None.\n\n        Returns:\n            torch.Tensor: The index of the first box that each point\n                is in, in shape (M, ). Default value is -1\n                (if the point is not enclosed by any box).\n\n        Note:\n            If a point is enclosed by multiple boxes, the index of the\n            first box will be returned.\n        \"\"\"\n        if boxes_override is not None:\n            boxes = boxes_override\n        else:\n            boxes = self.tensor\n        if points.dim() == 2:\n            points = points.unsqueeze(0)\n        # TODO: take euler angles into consideration\n        aligned_boxes = boxes[..., :7].clone()\n        aligned_boxes[..., 6] = 0\n        box_idx = points_in_boxes_part(\n            points,\n            aligned_boxes.unsqueeze(0).to(points.device)).squeeze(0)\n        return box_idx\n\n    def points_in_boxes_all(self, points, boxes_override=None):\n        \"\"\"Find all boxes in which each point is.\n\n        Args:\n            points (torch.Tensor): Points in shape (1, M, 3) or (M, 3),\n                3 dimensions are (x, y, z) in LiDAR or depth coordinate.\n            boxes_override (torch.Tensor, optional): Boxes to override\n                `self.tensor`. Defaults to None.\n\n        Returns:\n            torch.Tensor: A tensor indicating whether a point is in a box,\n                in shape (M, T). T is the number of boxes. 
Denote this\n                tensor as A, if the m^th point is in the t^th box, then\n                `A[m, t] == 1`, otherwise `A[m, t] == 0`.\n        \"\"\"\n        if boxes_override is not None:\n            boxes = boxes_override\n        else:\n            boxes = self.tensor\n\n        points_clone = points.clone()[..., :3]\n        if points_clone.dim() == 2:\n            points_clone = points_clone.unsqueeze(0)\n        else:\n            assert points_clone.dim() == 3 and points_clone.shape[0] == 1\n\n        boxes = boxes.to(points_clone.device).unsqueeze(0)\n        # TODO: take euler angles into consideration\n        aligned_boxes = boxes[..., :7].clone()\n        aligned_boxes[..., 6] = 0\n        box_idxs_of_pts = points_in_boxes_all(points_clone, aligned_boxes)\n\n        return box_idxs_of_pts.squeeze(0)\n"
  },
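A hedged sketch of the point-in-box queries above. `points_in_boxes_part`/`points_in_boxes_all` come from `mmcv.ops` and generally require a CUDA-enabled mmcv build, and (per the TODOs in the code) the euler angles are zeroed before the test, so membership is checked against the axis-aligned box; the box and point values are illustrative:

```python
import torch

from bip3d.structures.bbox_3d.euler_depth_box3d import \
    EulerDepthInstance3DBoxes

# One illustrative 2 m cube at the origin with zero euler angles.
boxes = EulerDepthInstance3DBoxes(
    torch.tensor([[0.0, 0.0, 0.0, 2.0, 2.0, 2.0, 0.0, 0.0, 0.0]]).cuda())
points = torch.tensor([[0.1, 0.2, 0.3],          # expected inside
                       [5.0, 5.0, 5.0]]).cuda()  # expected outside

mask = boxes.points_in_boxes_all(points)   # (M, T) membership matrix
idx = boxes.points_in_boxes_part(points)   # (M,) first-hit index, -1 if none
```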
  {
    "path": "bip3d/structures/bbox_3d/utils.py",
    "content": "# Copyright (c) OpenRobotLab. All rights reserved.\nfrom logging import warning\nfrom typing import Tuple, Union\n\nimport numpy as np\nimport torch\nfrom pytorch3d.transforms import euler_angles_to_matrix, quaternion_to_matrix\nfrom torch import Tensor\n\nfrom bip3d.utils.array_converter import array_converter\n\n\n@array_converter(apply_to=('val', ))\ndef limit_period(val: Union[np.ndarray, Tensor],\n                 offset: float = 0.5,\n                 period: float = np.pi) -> Union[np.ndarray, Tensor]:\n    \"\"\"Limit the value into a period for periodic function.\n\n    Args:\n        val (np.ndarray or Tensor): The value to be converted.\n        offset (float): Offset to set the value range. Defaults to 0.5.\n        period (float): Period of the value. Defaults to np.pi.\n\n    Returns:\n        np.ndarray or Tensor: Value in the range of\n        [-offset * period, (1-offset) * period].\n    \"\"\"\n    limited_val = val - torch.floor(val / period + offset) * period\n    return limited_val\n\n\n@array_converter(apply_to=('points', 'angles'))\ndef rotation_3d_in_euler(points, angles, return_mat=False, clockwise=False):\n    \"\"\"Rotate points by angles according to axis.\n\n    Args:\n        points (np.ndarray | torch.Tensor | list | tuple ):\n            Points of shape (N, M, 3).\n        angles (np.ndarray | torch.Tensor | list | tuple):\n            Vector of angles in shape (N, 3)\n        return_mat: Whether or not return the rotation matrix (transposed).\n            Defaults to False.\n        clockwise: Whether the rotation is clockwise. Defaults to False.\n\n    Raises:\n        ValueError: when the axis is not in range [0, 1, 2], it will\n            raise value error.\n\n    Returns:\n        (torch.Tensor | np.ndarray): Rotated points in shape (N, M, 3).\n    \"\"\"\n    batch_free = len(points.shape) == 2\n    if batch_free:\n        points = points[None]\n\n    if len(angles.shape) == 1:\n        angles = angles.expand(points.shape[:1] + (3, ))\n        # angles = torch.full(points.shape[:1], angles)\n\n    assert len(points.shape) == 3 and len(angles.shape) == 2 \\\n        and points.shape[0] == angles.shape[0], f'Incorrect shape of points ' \\\n        f'angles: {points.shape}, {angles.shape}'\n\n    assert points.shape[-1] in [2, 3], \\\n        f'Points size should be 2 or 3 instead of {points.shape[-1]}'\n\n    if angles.shape[1] == 3:\n        rot_mat_T = euler_angles_to_matrix(angles, 'ZXY')  # N, 3,3\n    else:\n        rot_mat_T = quaternion_to_matrix(angles)  # N, 3,3\n    rot_mat_T = rot_mat_T.transpose(-2, -1)\n\n    if clockwise:\n        raise NotImplementedError('clockwise')\n\n    if points.shape[0] == 0:\n        points_new = points\n    else:\n        points_new = torch.bmm(points, rot_mat_T)\n\n    if batch_free:\n        points_new = points_new.squeeze(0)\n\n    if return_mat:\n        if batch_free:\n            rot_mat_T = rot_mat_T.squeeze(0)\n        return points_new, rot_mat_T\n    else:\n        return points_new\n\n\n@array_converter(apply_to=('points', 'angles'))\ndef rotation_3d_in_axis(\n    points: Union[np.ndarray, Tensor],\n    angles: Union[np.ndarray, Tensor, float],\n    axis: int = 0,\n    return_mat: bool = False,\n    clockwise: bool = False\n) -> Union[Tuple[np.ndarray, np.ndarray], Tuple[Tensor, Tensor], np.ndarray,\n           Tensor]:\n    \"\"\"Rotate points by angles according to axis.\n\n    Args:\n        points (np.ndarray or Tensor): Points with shape (N, M, 3).\n        angles (np.ndarray or 
Tensor or float): Vector of angles with shape\n            (N, ).\n        axis (int): The axis to be rotated. Defaults to 0.\n        return_mat (bool): Whether or not to return the rotation matrix\n            (transposed). Defaults to False.\n        clockwise (bool): Whether the rotation is clockwise. Defaults to False.\n\n    Raises:\n        ValueError: When the axis is not in range [-3, -2, -1, 0, 1, 2], it\n            will raise ValueError.\n\n    Returns:\n        Tuple[np.ndarray, np.ndarray] or Tuple[Tensor, Tensor] or np.ndarray or\n        Tensor: Rotated points with shape (N, M, 3) and rotation matrix with\n        shape (N, 3, 3).\n    \"\"\"\n    batch_free = len(points.shape) == 2\n    if batch_free:\n        points = points[None]\n\n    if isinstance(angles, float) or len(angles.shape) == 0:\n        angles = torch.full(points.shape[:1], angles)\n\n    assert len(points.shape) == 3 and len(angles.shape) == 1 and \\\n        points.shape[0] == angles.shape[0], 'Incorrect shape of points ' \\\n        f'angles: {points.shape}, {angles.shape}'\n\n    assert points.shape[-1] in [2, 3], \\\n        f'Points size should be 2 or 3 instead of {points.shape[-1]}'\n\n    rot_sin = torch.sin(angles)\n    rot_cos = torch.cos(angles)\n    ones = torch.ones_like(rot_cos)\n    zeros = torch.zeros_like(rot_cos)\n\n    if points.shape[-1] == 3:\n        if axis == 1 or axis == -2:\n            rot_mat_T = torch.stack([\n                torch.stack([rot_cos, zeros, -rot_sin]),\n                torch.stack([zeros, ones, zeros]),\n                torch.stack([rot_sin, zeros, rot_cos])\n            ])\n        elif axis == 2 or axis == -1:\n            rot_mat_T = torch.stack([\n                torch.stack([rot_cos, rot_sin, zeros]),\n                torch.stack([-rot_sin, rot_cos, zeros]),\n                torch.stack([zeros, zeros, ones])\n            ])\n        elif axis == 0 or axis == -3:\n            rot_mat_T = torch.stack([\n                torch.stack([ones, zeros, zeros]),\n                torch.stack([zeros, rot_cos, rot_sin]),\n                torch.stack([zeros, -rot_sin, rot_cos])\n            ])\n        else:\n            raise ValueError(\n                f'axis should in range [-3, -2, -1, 0, 1, 2], got {axis}')\n    else:\n        rot_mat_T = torch.stack([\n            torch.stack([rot_cos, rot_sin]),\n            torch.stack([-rot_sin, rot_cos])\n        ])\n\n    if clockwise:\n        rot_mat_T = rot_mat_T.transpose(0, 1)\n\n    if points.shape[0] == 0:\n        points_new = points\n    else:\n        points_new = torch.einsum('aij,jka->aik', points, rot_mat_T)\n\n    if batch_free:\n        points_new = points_new.squeeze(0)\n\n    if return_mat:\n        rot_mat_T = torch.einsum('jka->ajk', rot_mat_T)\n        if batch_free:\n            rot_mat_T = rot_mat_T.squeeze(0)\n        return points_new, rot_mat_T\n    else:\n        return points_new\n\n\n@array_converter(apply_to=('boxes_xywhr', ))\ndef xywhr2xyxyr(\n        boxes_xywhr: Union[Tensor, np.ndarray]) -> Union[Tensor, np.ndarray]:\n    \"\"\"Convert a rotated boxes in XYWHR format to XYXYR format.\n\n    Args:\n        boxes_xywhr (Tensor or np.ndarray): Rotated boxes in XYWHR format.\n\n    Returns:\n        Tensor or np.ndarray: Converted boxes in XYXYR format.\n    \"\"\"\n    boxes = torch.zeros_like(boxes_xywhr)\n    half_w = boxes_xywhr[..., 2] / 2\n    half_h = boxes_xywhr[..., 3] / 2\n\n    boxes[..., 0] = boxes_xywhr[..., 0] - half_w\n    boxes[..., 1] = boxes_xywhr[..., 1] - half_h\n    boxes[..., 
2] = boxes_xywhr[..., 0] + half_w\n    boxes[..., 3] = boxes_xywhr[..., 1] + half_h\n    boxes[..., 4] = boxes_xywhr[..., 4]\n    return boxes\n\n\ndef get_box_type(box_type: str) -> Tuple[type, int]:\n    \"\"\"Get the type and mode of box structure.\n\n    We temporarily only support EulerDepthInstance3DBoxes to\n    support 9-DoF box operations\n    and will consider refactoring this class with further experience.\n\n    Args:\n        box_type (str): The type of box structure. Currently only\n            \"euler-depth\" is supported.\n\n    Raises:\n        ValueError: A ValueError is raised when ``box_type`` is not a\n            supported type.\n\n    Returns:\n        tuple: Box type and box mode.\n    \"\"\"\n    from .box_3d_mode import Box3DMode\n    from .euler_depth_box3d import EulerDepthInstance3DBoxes\n    box_type_lower = box_type.lower()\n    if box_type_lower == 'euler-depth':\n        box_type_3d = EulerDepthInstance3DBoxes\n        box_mode_3d = Box3DMode.EULER_DEPTH\n    # elif box_type_lower == 'euler-camera':\n    #     box_type_3d = EulerCameraInstance3DBoxes\n    #     box_mode_3d = Box3DMode.EULER_CAM\n    else:\n        raise ValueError(\n            'Only \"box_type\" of \"euler-depth\" is supported,'\n            f' got {box_type}')\n\n    return box_type_3d, box_mode_3d\n\n\n@array_converter(apply_to=('points_3d', 'proj_mat'))\ndef points_cam2img(points_3d: Union[Tensor, np.ndarray],\n                   proj_mat: Union[Tensor, np.ndarray],\n                   with_depth: bool = False) -> Union[Tensor, np.ndarray]:\n    \"\"\"Project points in camera coordinates to image coordinates.\n\n    Args:\n        points_3d (Tensor or np.ndarray): Points in shape (N, 3).\n        proj_mat (Tensor or np.ndarray): Transformation matrix between\n            coordinates.\n        with_depth (bool): Whether to keep depth in the output.\n            Defaults to False.\n\n    Returns:\n        Tensor or np.ndarray: Points in image coordinates with shape [N, 2] if\n        ``with_depth=False``, else [N, 3].\n    \"\"\"\n    points_shape = list(points_3d.shape)\n    points_shape[-1] = 1\n\n    assert len(proj_mat.shape) == 2, \\\n        'The dimension of the projection matrix should be 2 ' \\\n        f'instead of {len(proj_mat.shape)}.'\n    d1, d2 = proj_mat.shape[:2]\n    assert (d1 == 3 and d2 == 3) or (d1 == 3 and d2 == 4) or \\\n        (d1 == 4 and d2 == 4), 'The shape of the projection matrix ' \\\n        f'({d1}*{d2}) is not supported.'\n    if d1 == 3:\n        proj_mat_expanded = torch.eye(4,\n                                      device=proj_mat.device,\n                                      dtype=proj_mat.dtype)\n        proj_mat_expanded[:d1, :d2] = proj_mat\n        proj_mat = proj_mat_expanded\n\n    # the previous implementation used new_zeros; new_ones yields better\n    # results\n    points_4 = torch.cat([points_3d, points_3d.new_ones(points_shape)], dim=-1)\n\n    point_2d = points_4 @ proj_mat.T\n    point_2d_res = point_2d[..., :2] / point_2d[..., 2:3]\n\n    if with_depth:\n        point_2d_res = torch.cat([point_2d_res, point_2d[..., 2:3]], dim=-1)\n\n    return point_2d_res\n\n\n@array_converter(apply_to=('points_3d', 'proj_mat'))\ndef batch_points_cam2img(points_3d, proj_mat, with_depth=False):\n    \"\"\"Project points in camera coordinates to image coordinates.\n\n    Args:\n        points_3d (torch.Tensor | np.ndarray): Points in shape (N, D, 3)\n        proj_mat (torch.Tensor | np.ndarray):\n            
Transformation matrix between coordinates.\n        with_depth (bool, optional): Whether to keep depth in the output.\n            Defaults to False.\n\n    Returns:\n        (torch.Tensor | np.ndarray): Points in image coordinates,\n            with shape [N, D, 2] if `with_depth=False`, else [N, D, 3].\n    \"\"\"\n    points_shape = list(points_3d.shape)\n    points_shape[-1] = 1\n\n    assert len(proj_mat.shape) == 3, 'The dimension of the projection'\\\n        f' matrix should be 3 instead of {len(proj_mat.shape)}.'\n    d0, d1, d2 = proj_mat.shape[:3]\n    assert (d1 == 3 and d2 == 3) or (d1 == 3 and d2 == 4) or (\n        d1 == 4 and d2 == 4), 'The shape of the projection matrix'\\\n        f' ({d1}*{d2}) is not supported.'\n    if d1 == 3:\n        proj_mat_expanded = torch.eye(4,\n                                      device=proj_mat.device,\n                                      dtype=proj_mat.dtype)\n        # use repeat rather than expand: expand returns a view whose\n        # elements share memory, so the in-place write below would fail\n        proj_mat_expanded = proj_mat_expanded[None, :, :].repeat(d0, 1, 1)\n        proj_mat_expanded[:, :d1, :d2] = proj_mat\n        proj_mat = proj_mat_expanded\n\n    # the previous implementation used new_zeros; new_ones yields better\n    # results\n    points_4 = torch.cat([points_3d, points_3d.new_ones(points_shape)], dim=-1)\n    # do the batch wise operation\n    point_2d = torch.bmm(points_4, proj_mat.permute(0, 2, 1))\n\n    point_2d_res = point_2d[..., :2] / point_2d[..., 2:3].clamp(min=1e-3)\n\n    if with_depth:\n        point_2d_res = torch.cat([point_2d_res, point_2d[..., 2:3]], dim=-1)\n\n    return point_2d_res\n\n\n@array_converter(apply_to=('points', 'cam2img'))\ndef points_img2cam(\n        points: Union[Tensor, np.ndarray],\n        cam2img: Union[Tensor, np.ndarray]) -> Union[Tensor, np.ndarray]:\n    \"\"\"Project points in image coordinates to camera coordinates.\n\n    Args:\n        points (Tensor or np.ndarray): 2.5D points in 2D images with shape\n            [N, 3], 3 corresponds with x, y in the image and depth.\n        cam2img (Tensor or np.ndarray): Camera intrinsic matrix. The shape can\n            be [3, 3], [3, 4] or [4, 4].\n\n    Returns:\n        Tensor or np.ndarray: Points in 3D space with shape [N, 3], 3\n        corresponds with x, y, z in 3D space.\n    \"\"\"\n    assert cam2img.shape[0] <= 4\n    assert cam2img.shape[1] <= 4\n    assert points.shape[1] == 3\n\n    xys = points[:, :2]\n    depths = points[:, 2].view(-1, 1)\n    unnormed_xys = torch.cat([xys * depths, depths], dim=1)\n\n    pad_cam2img = torch.eye(4, dtype=xys.dtype, device=xys.device)\n    pad_cam2img[:cam2img.shape[0], :cam2img.shape[1]] = cam2img\n    inv_pad_cam2img = torch.inverse(pad_cam2img).transpose(0, 1)\n\n    # Do operation in homogeneous coordinates.\n    num_points = unnormed_xys.shape[0]\n    homo_xys = torch.cat([unnormed_xys, xys.new_ones((num_points, 1))], dim=1)\n    points3D = torch.mm(homo_xys, inv_pad_cam2img)[:, :3]\n\n    return points3D\n\n\ndef mono_cam_box2vis(cam_box):\n    \"\"\"This is a post-processing function on the bboxes from Mono-3D task. If\n    we want to perform projection visualization, we need to:\n\n        1. rotate the box along x-axis for np.pi / 2 (roll)\n        2. change orientation from local yaw to global yaw\n        3. convert yaw by (np.pi / 2 - yaw)\n\n    After applying this function, we can project and draw it on 2D images.\n\n    Args:\n        cam_box (:obj:`CameraInstance3DBoxes`): 3D bbox in camera coordinate\n            system before conversion. 
Could be gt bbox loaded from dataset or\n            network prediction output.\n\n    Returns:\n        :obj:`CameraInstance3DBoxes`: Box after conversion.\n    \"\"\"\n    warning('DeprecationWarning: The hack of yaw and dimension in the '\n            'monocular 3D detection on nuScenes has been removed. The '\n            'function mono_cam_box2vis will be deprecated.')\n    from .cam_box3d import CameraInstance3DBoxes\n    assert isinstance(cam_box, CameraInstance3DBoxes), \\\n        'input bbox should be CameraInstance3DBoxes!'\n    loc = cam_box.gravity_center\n    dim = cam_box.dims\n    yaw = cam_box.yaw\n    feats = cam_box.tensor[:, 7:]\n    # rotate along x-axis for np.pi / 2\n    # see also here: https://github.com/open-mmlab/mmdetection3d/blob/master/mmdet3d/datasets/nuscenes_mono_dataset.py#L557  # noqa\n    dim[:, [1, 2]] = dim[:, [2, 1]]\n    # change local yaw to global yaw for visualization\n    # refer to https://github.com/open-mmlab/mmdetection3d/blob/master/mmdet3d/datasets/nuscenes_mono_dataset.py#L164-L166  # noqa\n    yaw += torch.atan2(loc[:, 0], loc[:, 2])\n    # convert yaw by (-yaw - np.pi / 2)\n    # this is because mono 3D box class such as `NuScenesBox` has different\n    # definition of rotation with our `CameraInstance3DBoxes`\n    yaw = -yaw - np.pi / 2\n    cam_box = torch.cat([loc, dim, yaw[:, None], feats], dim=1)\n    cam_box = CameraInstance3DBoxes(cam_box,\n                                    box_dim=cam_box.shape[-1],\n                                    origin=(0.5, 0.5, 0.5))\n\n    return cam_box\n\n\ndef get_proj_mat_by_coord_type(img_meta: dict, coord_type: str) -> Tensor:\n    \"\"\"Obtain the projection matrix according to the coordinate type.\n\n    Args:\n        img_meta (dict): Meta information.\n        coord_type (str): 'DEPTH' or 'CAMERA' or 'LIDAR'. 
Can be case-\n            insensitive.\n\n    Returns:\n        Tensor: Transformation matrix.\n    \"\"\"\n    coord_type = coord_type.upper()\n    mapping = {'LIDAR': 'lidar2img', 'DEPTH': 'depth2img', 'CAMERA': 'cam2img'}\n    assert coord_type in mapping.keys()\n    return img_meta[mapping[coord_type]]\n\n\ndef yaw2local(yaw: Tensor, loc: Tensor) -> Tensor:\n    \"\"\"Transform global yaw to local yaw (alpha in kitti) in camera\n    coordinates, ranges from -pi to pi.\n\n    Args:\n        yaw (Tensor): A vector with global yaw of each box in shape (N, ).\n        loc (Tensor): Gravity center of each box in shape (N, 3).\n\n    Returns:\n        Tensor: Local yaw (alpha in kitti).\n    \"\"\"\n    local_yaw = yaw - torch.atan2(loc[:, 0], loc[:, 2])\n    larger_idx = (local_yaw > np.pi).nonzero(as_tuple=False)\n    small_idx = (local_yaw < -np.pi).nonzero(as_tuple=False)\n    if len(larger_idx) != 0:\n        local_yaw[larger_idx] -= 2 * np.pi\n    if len(small_idx) != 0:\n        local_yaw[small_idx] += 2 * np.pi\n\n    return local_yaw\n\n\ndef get_lidar2img(cam2img: Tensor, lidar2cam: Tensor) -> Tensor:\n    \"\"\"Get the projection matrix of lidar2img.\n\n    Args:\n        cam2img (torch.Tensor): A 3x3 or 4x4 projection matrix.\n        lidar2cam (torch.Tensor): A 3x3 or 4x4 lidar-to-camera extrinsic\n            matrix.\n\n    Returns:\n        Tensor: Transformation matrix with shape 4x4.\n    \"\"\"\n    if cam2img.shape == (3, 3):\n        temp = cam2img.new_zeros(4, 4)\n        temp[:3, :3] = cam2img\n        temp[3, 3] = 1\n        cam2img = temp\n\n    if lidar2cam.shape == (3, 3):\n        temp = lidar2cam.new_zeros(4, 4)\n        temp[:3, :3] = lidar2cam\n        temp[3, 3] = 1\n        lidar2cam = temp\n    return torch.matmul(cam2img, lidar2cam)\n"
  },
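For a pinhole intrinsic matrix, `points_cam2img(..., with_depth=True)` and `points_img2cam` are exact inverses, which the following sketch (with illustrative intrinsics and point values) checks as a round trip:

```python
import torch

from bip3d.structures.bbox_3d.utils import points_cam2img, points_img2cam

# Illustrative 3x3 pinhole intrinsics: fx = fy = 500, principal point (320, 240).
cam2img = torch.tensor([[500.0, 0.0, 320.0],
                        [0.0, 500.0, 240.0],
                        [0.0, 0.0, 1.0]])

pts_cam = torch.tensor([[0.5, -0.2, 4.0]])               # (N, 3) camera coords
uvd = points_cam2img(pts_cam, cam2img, with_depth=True)  # (N, 3): u, v, depth
pts_back = points_img2cam(uvd, cam2img)                  # back to camera coords

assert torch.allclose(pts_cam, pts_back, atol=1e-5)
```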
  {
    "path": "bip3d/structures/ops/__init__.py",
    "content": "# Copyright (c) OpenRobotLab. All rights reserved.\n# yapf:disable\nfrom .box_np_ops import (box2d_to_corner_jit, box3d_to_bbox,\n                         box_camera_to_lidar, boxes3d_to_corners3d_lidar,\n                         camera_to_lidar, center_to_corner_box2d,\n                         center_to_corner_box3d, center_to_minmax_2d,\n                         corner_to_standup_nd_jit, corner_to_surfaces_3d,\n                         corner_to_surfaces_3d_jit, corners_nd,\n                         create_anchors_3d_range, depth_to_lidar_points,\n                         depth_to_points, get_frustum, iou_jit,\n                         minmax_to_corner_2d, points_in_convex_polygon_3d_jit,\n                         points_in_convex_polygon_jit, points_in_rbbox,\n                         projection_matrix_to_CRT_kitti, rbbox2d_to_near_bbox,\n                         remove_outside_points, rotation_points_single_angle,\n                         surface_equ_3d)\n# yapf:enable\nfrom .iou3d_calculator import (AxisAlignedBboxOverlaps3D, BboxOverlaps3D,\n                               BboxOverlapsNearest3D,\n                               axis_aligned_bbox_overlaps_3d, bbox_overlaps_3d,\n                               bbox_overlaps_nearest_3d)\nfrom .transforms import bbox3d2result, bbox3d2roi, bbox3d_mapping_back\n\n__all__ = [\n    'box2d_to_corner_jit', 'box3d_to_bbox', 'box_camera_to_lidar',\n    'boxes3d_to_corners3d_lidar', 'camera_to_lidar', 'center_to_corner_box2d',\n    'center_to_corner_box3d', 'center_to_minmax_2d',\n    'corner_to_standup_nd_jit', 'corner_to_surfaces_3d',\n    'corner_to_surfaces_3d_jit', 'corners_nd', 'create_anchors_3d_range',\n    'depth_to_lidar_points', 'depth_to_points', 'get_frustum', 'iou_jit',\n    'minmax_to_corner_2d', 'points_in_convex_polygon_3d_jit',\n    'points_in_convex_polygon_jit', 'points_in_rbbox',\n    'projection_matrix_to_CRT_kitti', 'rbbox2d_to_near_bbox',\n    'remove_outside_points', 'rotation_points_single_angle', 'surface_equ_3d',\n    'BboxOverlapsNearest3D', 'BboxOverlaps3D', 'bbox_overlaps_nearest_3d',\n    'bbox_overlaps_3d', 'AxisAlignedBboxOverlaps3D',\n    'axis_aligned_bbox_overlaps_3d', 'bbox3d_mapping_back', 'bbox3d2roi',\n    'bbox3d2result'\n]\n"
  },
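Since everything above is re-exported at the package level, downstream code can pull the numpy box helpers and IoU utilities from one flat namespace; a tiny sketch with an illustrative box:

```python
import numpy as np

from bip3d.structures.ops import iou_jit

# One axis-aligned 2D box in (xmin, ymin, xmax, ymax) form, compared to itself.
boxes = np.array([[0.0, 0.0, 2.0, 2.0]], dtype=np.float32)
print(iou_jit(boxes, boxes))  # perfect overlap -> [[1.]]
```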
  {
    "path": "bip3d/structures/ops/box_np_ops.py",
    "content": "# Copyright (c) OpenRobotLab. All rights reserved.\n# TODO: clean the functions in this file and move the APIs into box bbox_3d\n# in the future\n# NOTICE: All functions in this file are valid for LiDAR or depth boxes only\n# if we use default parameters.\n\nimport numba\nimport numpy as np\n\nfrom bip3d.structures.bbox_3d import (limit_period, points_cam2img,\n                                             rotation_3d_in_axis)\n\n\ndef camera_to_lidar(points, r_rect, velo2cam):\n    \"\"\"Convert points in camera coordinate to lidar coordinate.\n\n    Note:\n        This function is for KITTI only.\n\n    Args:\n        points (np.ndarray, shape=[N, 3]): Points in camera coordinate.\n        r_rect (np.ndarray, shape=[4, 4]): Matrix to project points in\n            specific camera coordinate (e.g. CAM2) to CAM0.\n        velo2cam (np.ndarray, shape=[4, 4]): Matrix to project points in\n            camera coordinate to lidar coordinate.\n\n    Returns:\n        np.ndarray, shape=[N, 3]: Points in lidar coordinate.\n    \"\"\"\n    points_shape = list(points.shape[0:-1])\n    if points.shape[-1] == 3:\n        points = np.concatenate([points, np.ones(points_shape + [1])], axis=-1)\n    lidar_points = points @ np.linalg.inv((r_rect @ velo2cam).T)\n    return lidar_points[..., :3]\n\n\ndef box_camera_to_lidar(data, r_rect, velo2cam):\n    \"\"\"Convert boxes in camera coordinate to lidar coordinate.\n\n    Note:\n        This function is for KITTI only.\n\n    Args:\n        data (np.ndarray, shape=[N, 7]): Boxes in camera coordinate.\n        r_rect (np.ndarray, shape=[4, 4]): Matrix to project points in\n            specific camera coordinate (e.g. CAM2) to CAM0.\n        velo2cam (np.ndarray, shape=[4, 4]): Matrix to project points in\n            camera coordinate to lidar coordinate.\n\n    Returns:\n        np.ndarray, shape=[N, 3]: Boxes in lidar coordinate.\n    \"\"\"\n    xyz = data[:, 0:3]\n    x_size, y_size, z_size = data[:, 3:4], data[:, 4:5], data[:, 5:6]\n    r = data[:, 6:7]\n    xyz_lidar = camera_to_lidar(xyz, r_rect, velo2cam)\n    # yaw and dims also needs to be converted\n    r_new = -r - np.pi / 2\n    r_new = limit_period(r_new, period=np.pi * 2)\n    return np.concatenate([xyz_lidar, x_size, z_size, y_size, r_new], axis=1)\n\n\ndef corners_nd(dims, origin=0.5):\n    \"\"\"Generate relative box corners based on length per dim and origin point.\n\n    Args:\n        dims (np.ndarray, shape=[N, ndim]): Array of length per dim\n        origin (list or array or float, optional): origin point relate to\n            smallest point. 
Defaults to 0.5\n\n    Returns:\n        np.ndarray, shape=[N, 2 ** ndim, ndim]: Returned corners.\n        point layout example: (2d) x0y0, x0y1, x1y0, x1y1;\n            (3d) x0y0z0, x0y0z1, x0y1z0, x0y1z1, x1y0z0, x1y0z1, x1y1z0, x1y1z1\n            where x0 < x1, y0 < y1, z0 < z1.\n    \"\"\"\n    ndim = int(dims.shape[1])\n    corners_norm = np.stack(np.unravel_index(np.arange(2**ndim), [2] * ndim),\n                            axis=1).astype(dims.dtype)\n    # now corners_norm has format: (2d) x0y0, x0y1, x1y0, x1y1\n    # (3d) x0y0z0, x0y0z1, x0y1z0, x0y1z1, x1y0z0, x1y0z1, x1y1z0, x1y1z1\n    # so need to convert to a format which is convenient to do other computing.\n    # for 2d boxes, format is clockwise start with minimum point\n    # for 3d boxes, please draw lines by your hand.\n    if ndim == 2:\n        # generate clockwise box corners\n        corners_norm = corners_norm[[0, 1, 3, 2]]\n    elif ndim == 3:\n        corners_norm = corners_norm[[0, 1, 3, 2, 4, 5, 7, 6]]\n    corners_norm = corners_norm - np.array(origin, dtype=dims.dtype)\n    corners = dims.reshape([-1, 1, ndim]) * corners_norm.reshape(\n        [1, 2**ndim, ndim])\n    return corners\n\n\ndef center_to_corner_box2d(centers, dims, angles=None, origin=0.5):\n    \"\"\"Convert kitti locations, dimensions and angles to corners.\n    format: center(xy), dims(xy), angles(counterclockwise when positive)\n\n    Args:\n        centers (np.ndarray): Locations in kitti label file with shape (N, 2).\n        dims (np.ndarray): Dimensions in kitti label file with shape (N, 2).\n        angles (np.ndarray, optional): Rotation_y in kitti label file with\n            shape (N). Defaults to None.\n        origin (list or array or float, optional): origin point relate to\n            smallest point. 
Defaults to 0.5.\n\n    Returns:\n        np.ndarray: Corners with the shape of (N, 4, 2).\n    \"\"\"\n    # 'length' in kitti format is in x axis.\n    # xyz(hwl)(kitti label file)<->xyz(lhw)(camera)<->z(-x)(-y)(wlh)(lidar)\n    # center in kitti format is [0.5, 1.0, 0.5] in xyz.\n    corners = corners_nd(dims, origin=origin)\n    # corners: [N, 4, 2]\n    if angles is not None:\n        corners = rotation_3d_in_axis(corners, angles)\n    corners += centers.reshape([-1, 1, 2])\n    return corners\n\n\n@numba.jit(nopython=True)\ndef depth_to_points(depth, trunc_pixel):\n    \"\"\"Convert depth map to points.\n\n    Args:\n        depth (np.array, shape=[H, W]): Depth map which\n            the row of [0~`trunc_pixel`] are truncated.\n        trunc_pixel (int): The number of truncated row.\n\n    Returns:\n        np.ndarray: Points in camera coordinates.\n    \"\"\"\n    num_pts = np.sum(depth[trunc_pixel:, ] > 0.1)\n    points = np.zeros((num_pts, 3), dtype=depth.dtype)\n    x = np.array([0, 0, 1], dtype=depth.dtype)\n    k = 0\n    for i in range(trunc_pixel, depth.shape[0]):\n        for j in range(depth.shape[1]):\n            if depth[i, j] > 0.1:\n                x = np.array([j, i, 1], dtype=depth.dtype)\n                points[k] = x * depth[i, j]\n                k += 1\n    return points\n\n\ndef depth_to_lidar_points(depth, trunc_pixel, P2, r_rect, velo2cam):\n    \"\"\"Convert depth map to points in lidar coordinate.\n\n    Args:\n        depth (np.array, shape=[H, W]): Depth map which\n            the row of [0~`trunc_pixel`] are truncated.\n        trunc_pixel (int): The number of truncated row.\n        P2 (p.array, shape=[4, 4]): Intrinsics of Camera2.\n        r_rect (np.ndarray, shape=[4, 4]): Matrix to project points in\n            specific camera coordinate (e.g. CAM2) to CAM0.\n        velo2cam (np.ndarray, shape=[4, 4]): Matrix to project points in\n            camera coordinate to lidar coordinate.\n\n    Returns:\n        np.ndarray: Points in lidar coordinates.\n    \"\"\"\n    pts = depth_to_points(depth, trunc_pixel)\n    points_shape = list(pts.shape[0:-1])\n    points = np.concatenate([pts, np.ones(points_shape + [1])], axis=-1)\n    points = points @ np.linalg.inv(P2.T)\n    lidar_points = camera_to_lidar(points, r_rect, velo2cam)\n    return lidar_points\n\n\ndef center_to_corner_box3d(centers,\n                           dims,\n                           angles=None,\n                           origin=(0.5, 1.0, 0.5),\n                           axis=1):\n    \"\"\"Convert kitti locations, dimensions and angles to corners.\n\n    Args:\n        centers (np.ndarray): Locations in kitti label file with shape (N, 3).\n        dims (np.ndarray): Dimensions in kitti label file with shape (N, 3).\n        angles (np.ndarray, optional): Rotation_y in kitti label file with\n            shape (N). Defaults to None.\n        origin (list or array or float, optional): Origin point relate to\n            smallest point. Use (0.5, 1.0, 0.5) in camera and (0.5, 0.5, 0)\n            in lidar. Defaults to (0.5, 1.0, 0.5).\n        axis (int, optional): Rotation axis. 
1 for camera and 2 for lidar.\n            Defaults to 1.\n\n    Returns:\n        np.ndarray: Corners with the shape of (N, 8, 3).\n    \"\"\"\n    # 'length' in kitti format is in x axis.\n    # yzx(hwl)(kitti label file)<->xyz(lhw)(camera)<->z(-x)(-y)(lwh)(lidar)\n    # center in kitti format is [0.5, 1.0, 0.5] in xyz.\n    corners = corners_nd(dims, origin=origin)\n    # corners: [N, 8, 3]\n    if angles is not None:\n        corners = rotation_3d_in_axis(corners, angles, axis=axis)\n    corners += centers.reshape([-1, 1, 3])\n    return corners\n\n\n@numba.jit(nopython=True)\ndef box2d_to_corner_jit(boxes):\n    \"\"\"Convert box2d to corner.\n\n    Args:\n        boxes (np.ndarray, shape=[N, 5]): Boxes2d with rotation.\n\n    Returns:\n        box_corners (np.ndarray, shape=[N, 4, 2]): Box corners.\n    \"\"\"\n    num_box = boxes.shape[0]\n    corners_norm = np.zeros((4, 2), dtype=boxes.dtype)\n    corners_norm[1, 1] = 1.0\n    corners_norm[2] = 1.0\n    corners_norm[3, 0] = 1.0\n    corners_norm -= np.array([0.5, 0.5], dtype=boxes.dtype)\n    corners = boxes.reshape(num_box, 1, 5)[:, :, 2:4] * corners_norm.reshape(\n        1, 4, 2)\n    rot_mat_T = np.zeros((2, 2), dtype=boxes.dtype)\n    box_corners = np.zeros((num_box, 4, 2), dtype=boxes.dtype)\n    for i in range(num_box):\n        rot_sin = np.sin(boxes[i, -1])\n        rot_cos = np.cos(boxes[i, -1])\n        rot_mat_T[0, 0] = rot_cos\n        rot_mat_T[0, 1] = rot_sin\n        rot_mat_T[1, 0] = -rot_sin\n        rot_mat_T[1, 1] = rot_cos\n        box_corners[i] = corners[i] @ rot_mat_T + boxes[i, :2]\n    return box_corners\n\n\n@numba.njit\ndef corner_to_standup_nd_jit(boxes_corner):\n    \"\"\"Convert boxes_corner to aligned (min-max) boxes.\n\n    Args:\n        boxes_corner (np.ndarray, shape=[N, 2**dim, dim]): Boxes corners.\n\n    Returns:\n        np.ndarray, shape=[N, dim*2]: Aligned (min-max) boxes.\n    \"\"\"\n    num_boxes = boxes_corner.shape[0]\n    ndim = boxes_corner.shape[-1]\n    result = np.zeros((num_boxes, ndim * 2), dtype=boxes_corner.dtype)\n    for i in range(num_boxes):\n        for j in range(ndim):\n            result[i, j] = np.min(boxes_corner[i, :, j])\n        for j in range(ndim):\n            result[i, j + ndim] = np.max(boxes_corner[i, :, j])\n    return result\n\n\n@numba.jit(nopython=True)\ndef corner_to_surfaces_3d_jit(corners):\n    \"\"\"Convert 3d box corners from corner function above to surfaces that\n    normal vectors all direct to internal.\n\n    Args:\n        corners (np.ndarray): 3d box corners with the shape of (N, 8, 3).\n\n    Returns:\n        np.ndarray: Surfaces with the shape of (N, 6, 4, 3).\n    \"\"\"\n    # box_corners: [N, 8, 3], must from corner functions in this module\n    num_boxes = corners.shape[0]\n    surfaces = np.zeros((num_boxes, 6, 4, 3), dtype=corners.dtype)\n    corner_idxes = np.array([\n        0, 1, 2, 3, 7, 6, 5, 4, 0, 3, 7, 4, 1, 5, 6, 2, 0, 4, 5, 1, 3, 2, 6, 7\n    ]).reshape(6, 4)\n    for i in range(num_boxes):\n        for j in range(6):\n            for k in range(4):\n                surfaces[i, j, k] = corners[i, corner_idxes[j, k]]\n    return surfaces\n\n\ndef rotation_points_single_angle(points, angle, axis=0):\n    \"\"\"Rotate points with a single angle.\n\n    Args:\n        points (np.ndarray, shape=[N, 3]]):\n        angle (np.ndarray, shape=[1]]):\n        axis (int, optional): Axis to rotate at. 
Defaults to 0.\n\n    Returns:\n        np.ndarray: Rotated points.\n    \"\"\"\n    # points: [N, 3]\n    rot_sin = np.sin(angle)\n    rot_cos = np.cos(angle)\n    if axis == 1:\n        rot_mat_T = np.array(\n            [[rot_cos, 0, rot_sin], [0, 1, 0], [-rot_sin, 0, rot_cos]],\n            dtype=points.dtype)\n    elif axis == 2 or axis == -1:\n        rot_mat_T = np.array(\n            [[rot_cos, rot_sin, 0], [-rot_sin, rot_cos, 0], [0, 0, 1]],\n            dtype=points.dtype)\n    elif axis == 0:\n        rot_mat_T = np.array(\n            [[1, 0, 0], [0, rot_cos, rot_sin], [0, -rot_sin, rot_cos]],\n            dtype=points.dtype)\n    else:\n        raise ValueError('axis should in range')\n\n    return points @ rot_mat_T, rot_mat_T\n\n\ndef box3d_to_bbox(box3d, P2):\n    \"\"\"Convert box3d in camera coordinates to bbox in image coordinates.\n\n    Args:\n        box3d (np.ndarray, shape=[N, 7]): Boxes in camera coordinate.\n        P2 (np.array, shape=[4, 4]): Intrinsics of Camera2.\n\n    Returns:\n        np.ndarray, shape=[N, 4]: Boxes 2d in image coordinates.\n    \"\"\"\n    box_corners = center_to_corner_box3d(box3d[:, :3],\n                                         box3d[:, 3:6],\n                                         box3d[:, 6], [0.5, 1.0, 0.5],\n                                         axis=1)\n    box_corners_in_image = points_cam2img(box_corners, P2)\n    # box_corners_in_image: [N, 8, 2]\n    minxy = np.min(box_corners_in_image, axis=1)\n    maxxy = np.max(box_corners_in_image, axis=1)\n    bbox = np.concatenate([minxy, maxxy], axis=1)\n    return bbox\n\n\ndef corner_to_surfaces_3d(corners):\n    \"\"\"convert 3d box corners from corner function above to surfaces that\n    normal vectors all direct to internal.\n\n    Args:\n        corners (np.ndarray): 3D box corners with shape of (N, 8, 3).\n\n    Returns:\n        np.ndarray: Surfaces with the shape of (N, 6, 4, 3).\n    \"\"\"\n    # box_corners: [N, 8, 3], must from corner functions in this module\n    surfaces = np.array([\n        [corners[:, 0], corners[:, 1], corners[:, 2], corners[:, 3]],\n        [corners[:, 7], corners[:, 6], corners[:, 5], corners[:, 4]],\n        [corners[:, 0], corners[:, 3], corners[:, 7], corners[:, 4]],\n        [corners[:, 1], corners[:, 5], corners[:, 6], corners[:, 2]],\n        [corners[:, 0], corners[:, 4], corners[:, 5], corners[:, 1]],\n        [corners[:, 3], corners[:, 2], corners[:, 6], corners[:, 7]],\n    ]).transpose([2, 0, 1, 3])\n    return surfaces\n\n\ndef points_in_rbbox(points, rbbox, z_axis=2, origin=(0.5, 0.5, 0)):\n    \"\"\"Check points in rotated bbox and return indices.\n\n    Note:\n        This function is for counterclockwise boxes.\n\n    Args:\n        points (np.ndarray, shape=[N, 3+dim]): Points to query.\n        rbbox (np.ndarray, shape=[M, 7]): Boxes3d with rotation.\n        z_axis (int, optional): Indicate which axis is height.\n            Defaults to 2.\n        origin (tuple[int], optional): Indicate the position of\n            box center. 
Defaults to (0.5, 0.5, 0).\n\n    Returns:\n        np.ndarray, shape=[N, M]: Indices of points in each box.\n    \"\"\"\n    # TODO: this function is different from PointCloud3D, be careful\n    # when start to use nuscene, check the input\n    rbbox_corners = center_to_corner_box3d(rbbox[:, :3],\n                                           rbbox[:, 3:6],\n                                           rbbox[:, 6],\n                                           origin=origin,\n                                           axis=z_axis)\n    surfaces = corner_to_surfaces_3d(rbbox_corners)\n    indices = points_in_convex_polygon_3d_jit(points[:, :3], surfaces)\n    return indices\n\n\ndef minmax_to_corner_2d(minmax_box):\n    \"\"\"Convert minmax box to corners2d.\n\n    Args:\n        minmax_box (np.ndarray, shape=[N, dims]): minmax boxes.\n\n    Returns:\n        np.ndarray: 2d corners of boxes\n    \"\"\"\n    ndim = minmax_box.shape[-1] // 2\n    center = minmax_box[..., :ndim]\n    dims = minmax_box[..., ndim:] - center\n    return center_to_corner_box2d(center, dims, origin=0.0)\n\n\ndef create_anchors_3d_range(feature_size,\n                            anchor_range,\n                            sizes=((3.9, 1.6, 1.56), ),\n                            rotations=(0, np.pi / 2),\n                            dtype=np.float32):\n    \"\"\"Create anchors 3d by range.\n\n    Args:\n        feature_size (list[float] | tuple[float]): Feature map size. It is\n            either a list of a tuple of [D, H, W](in order of z, y, and x).\n        anchor_range (torch.Tensor | list[float]): Range of anchors with\n            shape [6]. The order is consistent with that of anchors, i.e.,\n            (x_min, y_min, z_min, x_max, y_max, z_max).\n        sizes (list[list] | np.ndarray | torch.Tensor, optional):\n            Anchor size with shape [N, 3], in order of x, y, z.\n            Defaults to ((3.9, 1.6, 1.56), ).\n        rotations (list[float] | np.ndarray | torch.Tensor, optional):\n            Rotations of anchors in a single feature grid.\n            Defaults to (0, np.pi / 2).\n        dtype (type, optional): Data type. 
Defaults to np.float32.\n\n    Returns:\n        np.ndarray: Range based anchors with shape of\n            (*feature_size, num_sizes, num_rots, 7).\n    \"\"\"\n    anchor_range = np.array(anchor_range, dtype)\n    z_centers = np.linspace(anchor_range[2],\n                            anchor_range[5],\n                            feature_size[0],\n                            dtype=dtype)\n    y_centers = np.linspace(anchor_range[1],\n                            anchor_range[4],\n                            feature_size[1],\n                            dtype=dtype)\n    x_centers = np.linspace(anchor_range[0],\n                            anchor_range[3],\n                            feature_size[2],\n                            dtype=dtype)\n    sizes = np.reshape(np.array(sizes, dtype=dtype), [-1, 3])\n    rotations = np.array(rotations, dtype=dtype)\n    rets = np.meshgrid(x_centers,\n                       y_centers,\n                       z_centers,\n                       rotations,\n                       indexing='ij')\n    tile_shape = [1] * 5\n    tile_shape[-2] = int(sizes.shape[0])\n    for i in range(len(rets)):\n        rets[i] = np.tile(rets[i][..., np.newaxis, :], tile_shape)\n        rets[i] = rets[i][..., np.newaxis]  # for concat\n    sizes = np.reshape(sizes, [1, 1, 1, -1, 1, 3])\n    tile_size_shape = list(rets[0].shape)\n    tile_size_shape[3] = 1\n    sizes = np.tile(sizes, tile_size_shape)\n    rets.insert(3, sizes)\n    ret = np.concatenate(rets, axis=-1)\n    return np.transpose(ret, [2, 1, 0, 3, 4, 5])\n\n\ndef center_to_minmax_2d(centers, dims, origin=0.5):\n    \"\"\"Center to minmax.\n\n    Args:\n        centers (np.ndarray): Center points.\n        dims (np.ndarray): Dimensions.\n        origin (list or array or float, optional): Origin point relate\n            to smallest point. Defaults to 0.5.\n\n    Returns:\n        np.ndarray: Minmax points.\n    \"\"\"\n    if origin == 0.5:\n        return np.concatenate([centers - dims / 2, centers + dims / 2],\n                              axis=-1)\n    corners = center_to_corner_box2d(centers, dims, origin=origin)\n    return corners[:, [0, 2]].reshape([-1, 4])\n\n\ndef rbbox2d_to_near_bbox(rbboxes):\n    \"\"\"convert rotated bbox to nearest 'standing' or 'lying' bbox.\n\n    Args:\n        rbboxes (np.ndarray): Rotated bboxes with shape of\n            (N, 5(x, y, xdim, ydim, rad)).\n\n    Returns:\n        np.ndarray: Bounding boxes with the shape of\n            (N, 4(xmin, ymin, xmax, ymax)).\n    \"\"\"\n    rots = rbboxes[..., -1]\n    rots_0_pi_div_2 = np.abs(limit_period(rots, 0.5, np.pi))\n    cond = (rots_0_pi_div_2 > np.pi / 4)[..., np.newaxis]\n    bboxes_center = np.where(cond, rbboxes[:, [0, 1, 3, 2]], rbboxes[:, :4])\n    bboxes = center_to_minmax_2d(bboxes_center[:, :2], bboxes_center[:, 2:])\n    return bboxes\n\n\n@numba.jit(nopython=True)\ndef iou_jit(boxes, query_boxes, mode='iou', eps=0.0):\n    \"\"\"Calculate box iou. Note that jit version runs ~10x faster than the\n    box_overlaps function in mmdet3d.core.evaluation.\n\n    Note:\n        This function is for counterclockwise boxes.\n\n    Args:\n        boxes (np.ndarray): Input bounding boxes with shape of (N, 4).\n        query_boxes (np.ndarray): Query boxes with shape of (K, 4).\n        mode (str, optional): IoU mode. Defaults to 'iou'.\n        eps (float, optional): Value added to denominator. 
Defaults to 0.\n\n    Returns:\n        np.ndarray: Overlap between boxes and query_boxes\n            with the shape of [N, K].\n    \"\"\"\n    N = boxes.shape[0]\n    K = query_boxes.shape[0]\n    overlaps = np.zeros((N, K), dtype=boxes.dtype)\n    for k in range(K):\n        box_area = ((query_boxes[k, 2] - query_boxes[k, 0] + eps) *\n                    (query_boxes[k, 3] - query_boxes[k, 1] + eps))\n        for n in range(N):\n            iw = (min(boxes[n, 2], query_boxes[k, 2]) -\n                  max(boxes[n, 0], query_boxes[k, 0]) + eps)\n            if iw > 0:\n                ih = (min(boxes[n, 3], query_boxes[k, 3]) -\n                      max(boxes[n, 1], query_boxes[k, 1]) + eps)\n                if ih > 0:\n                    if mode == 'iou':\n                        ua = ((boxes[n, 2] - boxes[n, 0] + eps) *\n                              (boxes[n, 3] - boxes[n, 1] + eps) + box_area -\n                              iw * ih)\n                    else:\n                        ua = ((boxes[n, 2] - boxes[n, 0] + eps) *\n                              (boxes[n, 3] - boxes[n, 1] + eps))\n                    overlaps[n, k] = iw * ih / ua\n    return overlaps\n\n\ndef projection_matrix_to_CRT_kitti(proj):\n    \"\"\"Split projection matrix of KITTI.\n\n    Note:\n        This function is for KITTI only.\n\n    P = C @ [R|T]\n    C is upper triangular matrix, so we need to inverse CR and use QR\n    stable for all kitti camera projection matrix.\n\n    Args:\n        proj (p.array, shape=[4, 4]): Intrinsics of camera.\n\n    Returns:\n        tuple[np.ndarray]: Splited matrix of C, R and T.\n    \"\"\"\n\n    CR = proj[0:3, 0:3]\n    CT = proj[0:3, 3]\n    RinvCinv = np.linalg.inv(CR)\n    Rinv, Cinv = np.linalg.qr(RinvCinv)\n    C = np.linalg.inv(Cinv)\n    R = np.linalg.inv(Rinv)\n    T = Cinv @ CT\n    return C, R, T\n\n\ndef remove_outside_points(points, rect, Trv2c, P2, image_shape):\n    \"\"\"Remove points which are outside of image.\n\n    Note:\n        This function is for KITTI only.\n\n    Args:\n        points (np.ndarray, shape=[N, 3+dims]): Total points.\n        rect (np.ndarray, shape=[4, 4]): Matrix to project points in\n            specific camera coordinate (e.g. 
CAM2) to CAM0.\n        Trv2c (np.ndarray, shape=[4, 4]): Matrix to project points in\n            camera coordinate to lidar coordinate.\n        P2 (p.array, shape=[4, 4]): Intrinsics of Camera2.\n        image_shape (list[int]): Shape of image.\n\n    Returns:\n        np.ndarray, shape=[N, 3+dims]: Filtered points.\n    \"\"\"\n    # 5x faster than remove_outside_points_v1(2ms vs 10ms)\n    C, R, T = projection_matrix_to_CRT_kitti(P2)\n    image_bbox = [0, 0, image_shape[1], image_shape[0]]\n    frustum = get_frustum(image_bbox, C)\n    frustum -= T\n    frustum = np.linalg.inv(R) @ frustum.T\n    frustum = camera_to_lidar(frustum.T, rect, Trv2c)\n    frustum_surfaces = corner_to_surfaces_3d_jit(frustum[np.newaxis, ...])\n    indices = points_in_convex_polygon_3d_jit(points[:, :3], frustum_surfaces)\n    points = points[indices.reshape([-1])]\n    return points\n\n\ndef get_frustum(bbox_image, C, near_clip=0.001, far_clip=100):\n    \"\"\"Get frustum corners in camera coordinates.\n\n    Args:\n        bbox_image (list[int]): box in image coordinates.\n        C (np.ndarray): Intrinsics.\n        near_clip (float, optional): Nearest distance of frustum.\n            Defaults to 0.001.\n        far_clip (float, optional): Farthest distance of frustum.\n            Defaults to 100.\n\n    Returns:\n        np.ndarray, shape=[8, 3]: coordinates of frustum corners.\n    \"\"\"\n    fku = C[0, 0]\n    fkv = -C[1, 1]\n    u0v0 = C[0:2, 2]\n    z_points = np.array([near_clip] * 4 + [far_clip] * 4,\n                        dtype=C.dtype)[:, np.newaxis]\n    b = bbox_image\n    box_corners = np.array(\n        [[b[0], b[1]], [b[0], b[3]], [b[2], b[3]], [b[2], b[1]]],\n        dtype=C.dtype)\n    near_box_corners = (box_corners - u0v0) / np.array(\n        [fku / near_clip, -fkv / near_clip], dtype=C.dtype)\n    far_box_corners = (box_corners - u0v0) / np.array(\n        [fku / far_clip, -fkv / far_clip], dtype=C.dtype)\n    ret_xy = np.concatenate([near_box_corners, far_box_corners],\n                            axis=0)  # [8, 2]\n    ret_xyz = np.concatenate([ret_xy, z_points], axis=1)\n    return ret_xyz\n\n\ndef surface_equ_3d(polygon_surfaces):\n    \"\"\"\n\n    Args:\n        polygon_surfaces (np.ndarray): Polygon surfaces with shape of\n            [num_polygon, max_num_surfaces, max_num_points_of_surface, 3].\n            All surfaces' normal vector must direct to internal.\n            Max_num_points_of_surface must at least 3.\n\n    Returns:\n        tuple: normal vector and its direction.\n    \"\"\"\n    # return [a, b, c], d in ax+by+cz+d=0\n    # polygon_surfaces: [num_polygon, num_surfaces, num_points_of_polygon, 3]\n    surface_vec = polygon_surfaces[:, :, :2, :] - \\\n        polygon_surfaces[:, :, 1:3, :]\n    # normal_vec: [..., 3]\n    normal_vec = np.cross(surface_vec[:, :, 0, :], surface_vec[:, :, 1, :])\n    # print(normal_vec.shape, points[..., 0, :].shape)\n    # d = -np.inner(normal_vec, points[..., 0, :])\n    d = np.einsum('aij, aij->ai', normal_vec, polygon_surfaces[:, :, 0, :])\n    return normal_vec, -d\n\n\n@numba.njit\ndef _points_in_convex_polygon_3d_jit(points, polygon_surfaces, normal_vec, d,\n                                     num_surfaces):\n    \"\"\"\n    Args:\n        points (np.ndarray): Input points with shape of (num_points, 3).\n        polygon_surfaces (np.ndarray): Polygon surfaces with shape of\n            (num_polygon, max_num_surfaces, max_num_points_of_surface, 3).\n            All surfaces' normal vector must direct to internal.\n         
   Max_num_points_of_surface must at least 3.\n        normal_vec (np.ndarray): Normal vector of polygon_surfaces.\n        d (int): Directions of normal vector.\n        num_surfaces (np.ndarray): Number of surfaces a polygon contains\n            shape of (num_polygon).\n\n    Returns:\n        np.ndarray: Result matrix with the shape of [num_points, num_polygon].\n    \"\"\"\n    max_num_surfaces, max_num_points_of_surface = polygon_surfaces.shape[1:3]\n    num_points = points.shape[0]\n    num_polygons = polygon_surfaces.shape[0]\n    ret = np.ones((num_points, num_polygons), dtype=np.bool_)\n    sign = 0.0\n    for i in range(num_points):\n        for j in range(num_polygons):\n            for k in range(max_num_surfaces):\n                if k > num_surfaces[j]:\n                    break\n                sign = (points[i, 0] * normal_vec[j, k, 0] +\n                        points[i, 1] * normal_vec[j, k, 1] +\n                        points[i, 2] * normal_vec[j, k, 2] + d[j, k])\n                if sign >= 0:\n                    ret[i, j] = False\n                    break\n    return ret\n\n\ndef points_in_convex_polygon_3d_jit(points,\n                                    polygon_surfaces,\n                                    num_surfaces=None):\n    \"\"\"Check points is in 3d convex polygons.\n\n    Args:\n        points (np.ndarray): Input points with shape of (num_points, 3).\n        polygon_surfaces (np.ndarray): Polygon surfaces with shape of\n            (num_polygon, max_num_surfaces, max_num_points_of_surface, 3).\n            All surfaces' normal vector must direct to internal.\n            Max_num_points_of_surface must at least 3.\n        num_surfaces (np.ndarray, optional): Number of surfaces a polygon\n            contains shape of (num_polygon). Defaults to None.\n\n    Returns:\n        np.ndarray: Result matrix with the shape of [num_points, num_polygon].\n    \"\"\"\n    max_num_surfaces, max_num_points_of_surface = polygon_surfaces.shape[1:3]\n    # num_points = points.shape[0]\n    num_polygons = polygon_surfaces.shape[0]\n    if num_surfaces is None:\n        num_surfaces = np.full((num_polygons, ), 9999999, dtype=np.int64)\n    normal_vec, d = surface_equ_3d(polygon_surfaces[:, :, :3, :])\n    # normal_vec: [num_polygon, max_num_surfaces, 3]\n    # d: [num_polygon, max_num_surfaces]\n    return _points_in_convex_polygon_3d_jit(points, polygon_surfaces,\n                                            normal_vec, d, num_surfaces)\n\n\n@numba.njit\ndef points_in_convex_polygon_jit(points, polygon, clockwise=False):\n    \"\"\"Check points is in 2d convex polygons. True when point in polygon.\n\n    Args:\n        points (np.ndarray): Input points with the shape of [num_points, 2].\n        polygon (np.ndarray): Input polygon with the shape of\n            [num_polygon, num_points_of_polygon, 2].\n        clockwise (bool, optional): Indicate polygon is clockwise. 
\n\n@numba.njit\ndef points_in_convex_polygon_jit(points, polygon, clockwise=False):\n    \"\"\"Check whether points are inside 2D convex polygons.\n\n    Args:\n        points (np.ndarray): Input points with the shape of [num_points, 2].\n        polygon (np.ndarray): Input polygon with the shape of\n            [num_polygon, num_points_of_polygon, 2].\n        clockwise (bool, optional): Indicate whether the polygon is clockwise.\n            Defaults to False.\n\n    Returns:\n        np.ndarray: Result matrix with the shape of [num_points, num_polygon];\n            ``True`` where a point lies inside the polygon.\n    \"\"\"\n    # first convert polygon to directed lines\n    num_points_of_polygon = polygon.shape[1]\n    num_points = points.shape[0]\n    num_polygons = polygon.shape[0]\n    # vec for all the polygons\n    if clockwise:\n        vec1 = polygon - polygon[:,\n                                 np.array([num_points_of_polygon - 1] +\n                                          list(range(num_points_of_polygon -\n                                                     1))), :]\n    else:\n        vec1 = polygon[:,\n                       np.array([num_points_of_polygon - 1] +\n                                list(range(num_points_of_polygon -\n                                           1))), :] - polygon\n    ret = np.zeros((num_points, num_polygons), dtype=np.bool_)\n    success = True\n    cross = 0.0\n    for i in range(num_points):\n        for j in range(num_polygons):\n            success = True\n            for k in range(num_points_of_polygon):\n                vec = vec1[j, k]\n                cross = vec[1] * (polygon[j, k, 0] - points[i, 0])\n                cross -= vec[0] * (polygon[j, k, 1] - points[i, 1])\n                if cross >= 0:\n                    success = False\n                    break\n            ret[i, j] = success\n    return ret\n\n\ndef boxes3d_to_corners3d_lidar(boxes3d, bottom_center=True):\n    \"\"\"Convert KITTI center boxes to corners.\n\n        7 -------- 4\n       /|         /|\n      6 -------- 5 .\n      | |        | |\n      . 3 -------- 0\n      |/         |/\n      2 -------- 1\n\n    Note:\n        This function is for LiDAR boxes only.\n\n    Args:\n        boxes3d (np.ndarray): Boxes with shape of (N, 7)\n            [x, y, z, x_size, y_size, z_size, ry] in LiDAR coords,\n            see the definition of ry in KITTI dataset.\n        bottom_center (bool, optional): Whether z is on the bottom center\n            of object.\n
            Defaults to True.\n\n    Returns:\n        np.ndarray: Box corners with the shape of [N, 8, 3].\n    \"\"\"\n    boxes_num = boxes3d.shape[0]\n    x_size, y_size, z_size = boxes3d[:, 3], boxes3d[:, 4], boxes3d[:, 5]\n    x_corners = np.array([\n        x_size / 2., -x_size / 2., -x_size / 2., x_size / 2., x_size / 2.,\n        -x_size / 2., -x_size / 2., x_size / 2.\n    ],\n                         dtype=np.float32).T\n    y_corners = np.array([\n        -y_size / 2., -y_size / 2., y_size / 2., y_size / 2., -y_size / 2.,\n        -y_size / 2., y_size / 2., y_size / 2.\n    ],\n                         dtype=np.float32).T\n    if bottom_center:\n        z_corners = np.zeros((boxes_num, 8), dtype=np.float32)\n        z_corners[:, 4:8] = z_size.reshape(boxes_num,\n                                           1).repeat(4, axis=1)  # (N, 4)\n    else:\n        z_corners = np.array([\n            -z_size / 2., -z_size / 2., -z_size / 2., -z_size / 2.,\n            z_size / 2., z_size / 2., z_size / 2., z_size / 2.\n        ],\n                             dtype=np.float32).T\n\n    ry = boxes3d[:, 6]\n    zeros, ones = np.zeros(ry.size,\n                           dtype=np.float32), np.ones(ry.size,\n                                                      dtype=np.float32)\n    rot_list = np.array([[np.cos(ry), np.sin(ry), zeros],\n                         [-np.sin(ry), np.cos(ry), zeros],\n                         [zeros, zeros, ones]])  # (3, 3, N)\n    R_list = np.transpose(rot_list, (2, 0, 1))  # (N, 3, 3)\n\n    temp_corners = np.concatenate((x_corners.reshape(\n        -1, 8, 1), y_corners.reshape(-1, 8, 1), z_corners.reshape(-1, 8, 1)),\n                                  axis=2)  # (N, 8, 3)\n    rotated_corners = np.matmul(temp_corners, R_list)  # (N, 8, 3)\n    x_corners = rotated_corners[:, :, 0]\n    y_corners = rotated_corners[:, :, 1]\n    z_corners = rotated_corners[:, :, 2]\n\n    x_loc, y_loc, z_loc = boxes3d[:, 0], boxes3d[:, 1], boxes3d[:, 2]\n\n    x = x_loc.reshape(-1, 1) + x_corners.reshape(-1, 8)\n    y = y_loc.reshape(-1, 1) + y_corners.reshape(-1, 8)\n    z = z_loc.reshape(-1, 1) + z_corners.reshape(-1, 8)\n\n    corners = np.concatenate(\n        (x.reshape(-1, 8, 1), y.reshape(-1, 8, 1), z.reshape(-1, 8, 1)),\n        axis=2)\n\n    return corners.astype(np.float32)\n"
  },
  {
    "path": "bip3d/structures/ops/iou3d_calculator.py",
    "content": "# Copyright (c) OpenRobotLab. All rights reserved.\nimport torch\nfrom mmdet.structures.bbox import bbox_overlaps\n\nfrom bip3d.registry import TASK_UTILS\nfrom bip3d.structures.bbox_3d import get_box_type\n\n\n@TASK_UTILS.register_module()\nclass BboxOverlapsNearest3D(object):\n    \"\"\"Nearest 3D IoU Calculator.\n\n    Note:\n        This IoU calculator first finds the nearest 2D boxes in bird eye view\n        (BEV), and then calculates the 2D IoU using :meth:`bbox_overlaps`.\n\n    Args:\n        coordinate (str): 'camera', 'lidar', or 'depth' coordinate system.\n    \"\"\"\n\n    def __init__(self, coordinate='lidar'):\n        assert coordinate in ['camera', 'lidar', 'depth']\n        self.coordinate = coordinate\n\n    def __call__(self, bboxes1, bboxes2, mode='iou', is_aligned=False):\n        \"\"\"Calculate nearest 3D IoU.\n\n        Note:\n            If ``is_aligned`` is ``False``, then it calculates the ious between\n            each bbox of bboxes1 and bboxes2, otherwise it calculates the ious\n            between each aligned pair of bboxes1 and bboxes2.\n\n        Args:\n            bboxes1 (torch.Tensor): shape (N, 7+N)\n                [x, y, z, x_size, y_size, z_size, ry, v].\n            bboxes2 (torch.Tensor): shape (M, 7+N)\n                [x, y, z, x_size, y_size, z_size, ry, v].\n            mode (str): \"iou\" (intersection over union) or iof\n                (intersection over foreground).\n            is_aligned (bool): Whether the calculation is aligned.\n\n        Return:\n            torch.Tensor: If ``is_aligned`` is ``True``, return ious between\n                bboxes1 and bboxes2 with shape (M, N). If ``is_aligned`` is\n                ``False``, return shape is M.\n        \"\"\"\n        return bbox_overlaps_nearest_3d(bboxes1, bboxes2, mode, is_aligned,\n                                        self.coordinate)\n\n    def __repr__(self):\n        \"\"\"str: Return a string that describes the module.\"\"\"\n        repr_str = self.__class__.__name__\n        repr_str += f'(coordinate={self.coordinate}'\n        return repr_str\n\n\n@TASK_UTILS.register_module()\nclass BboxOverlaps3D(object):\n    \"\"\"3D IoU Calculator.\n\n    Args:\n        coordinate (str): The coordinate system, valid options are\n            'camera', 'lidar', and 'depth'.\n    \"\"\"\n\n    def __init__(self, coordinate):\n        assert coordinate in ['camera', 'lidar', 'depth']\n        self.coordinate = coordinate\n\n    def __call__(self, bboxes1, bboxes2, mode='iou'):\n        \"\"\"Calculate 3D IoU using cuda implementation.\n\n        Note:\n            This function calculate the IoU of 3D boxes based on their volumes.\n            IoU calculator ``:class:BboxOverlaps3D`` uses this function to\n            calculate the actual 3D IoUs of boxes.\n\n        Args:\n            bboxes1 (torch.Tensor): with shape (N, 7+C),\n                (x, y, z, x_size, y_size, z_size, ry, v*).\n            bboxes2 (torch.Tensor): with shape (M, 7+C),\n                (x, y, z, x_size, y_size, z_size, ry, v*).\n            mode (str): \"iou\" (intersection over union) or\n                iof (intersection over foreground).\n\n        Return:\n            torch.Tensor: Bbox overlaps results of bboxes1 and bboxes2\n                with shape (M, N) (aligned mode is not supported currently).\n        \"\"\"\n        return bbox_overlaps_3d(bboxes1, bboxes2, mode, self.coordinate)\n\n    def __repr__(self):\n        \"\"\"str: return a string that describes the module\"\"\"\n    
        repr_str = self.__class__.__name__\n        repr_str += f'(coordinate={self.coordinate})'\n        return repr_str\n\n\ndef bbox_overlaps_nearest_3d(bboxes1,\n                             bboxes2,\n                             mode='iou',\n                             is_aligned=False,\n                             coordinate='lidar'):\n    \"\"\"Calculate nearest 3D IoU.\n\n    Note:\n        This function first finds the nearest 2D boxes in bird eye view\n        (BEV), and then calculates the 2D IoU using :meth:`bbox_overlaps`.\n        This IoU calculator :class:`BboxOverlapsNearest3D` uses this\n        function to calculate IoUs of boxes.\n\n        If ``is_aligned`` is ``False``, then it calculates the ious between\n        each bbox of bboxes1 and bboxes2, otherwise the ious between each\n        aligned pair of bboxes1 and bboxes2.\n\n    Args:\n        bboxes1 (torch.Tensor): with shape (N, 7+C),\n            (x, y, z, x_size, y_size, z_size, ry, v*).\n        bboxes2 (torch.Tensor): with shape (M, 7+C),\n            (x, y, z, x_size, y_size, z_size, ry, v*).\n        mode (str): \"iou\" (intersection over union) or \"iof\"\n            (intersection over foreground).\n        is_aligned (bool): Whether the calculation is aligned.\n\n    Return:\n        torch.Tensor: If ``is_aligned`` is ``False``, return the IoUs between\n            each bbox of bboxes1 and bboxes2 with shape (N, M). If\n            ``is_aligned`` is ``True``, return the IoUs between the aligned\n            pairs with shape (N, ).\n    \"\"\"\n    assert bboxes1.size(-1) == bboxes2.size(-1) >= 7\n\n    box_type, _ = get_box_type(coordinate)\n\n    bboxes1 = box_type(bboxes1, box_dim=bboxes1.shape[-1])\n    bboxes2 = box_type(bboxes2, box_dim=bboxes2.shape[-1])\n\n    # Change the bboxes to bev\n    # box conversion and iou calculation in torch version on CUDA\n    # is 10x faster than that in numpy version\n    bboxes1_bev = bboxes1.nearest_bev\n    bboxes2_bev = bboxes2.nearest_bev\n\n    ret = bbox_overlaps(bboxes1_bev,\n                        bboxes2_bev,\n                        mode=mode,\n                        is_aligned=is_aligned)\n    return ret\n\n\ndef bbox_overlaps_3d(bboxes1, bboxes2, mode='iou', coordinate='camera'):\n    \"\"\"Calculate 3D IoU using cuda implementation.\n\n    Note:\n        This function calculates the IoU of 3D boxes based on their volumes.\n        IoU calculator :class:`BboxOverlaps3D` uses this function to\n        calculate the actual IoUs of boxes.\n\n    Args:\n        bboxes1 (torch.Tensor): with shape (N, 7+C),\n            (x, y, z, x_size, y_size, z_size, ry, v*).\n        bboxes2 (torch.Tensor): with shape (M, 7+C),\n            (x, y, z, x_size, y_size, z_size, ry, v*).\n        mode (str): \"iou\" (intersection over union) or\n            \"iof\" (intersection over foreground).\n        coordinate (str): 'camera', 'lidar', or 'depth' coordinate system.\n\n    Return:\n        torch.Tensor: Bbox overlaps results of bboxes1 and bboxes2\n            with shape (N, M) (aligned mode is not supported currently).\n    \"\"\"\n    assert bboxes1.size(-1) == bboxes2.size(-1) >= 7\n\n    box_type, _ = get_box_type(coordinate)\n\n    bboxes1 = box_type(bboxes1, box_dim=bboxes1.shape[-1])\n    bboxes2 = box_type(bboxes2, box_dim=bboxes2.shape[-1])\n\n    return bboxes1.overlaps(bboxes1, bboxes2, mode=mode)\n
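\n\n# Illustrative usage of the two helpers above (a sketch with random,\n# hypothetical boxes in lidar coordinates, only meant to show the shapes):\n#\n# >>> boxes_a = torch.rand(4, 7)  # (x, y, z, x_size, y_size, z_size, ry)\n# >>> boxes_b = torch.rand(2, 7)\n# >>> iou_bev = bbox_overlaps_nearest_3d(boxes_a, boxes_b)  # shape (4, 2)\n# >>> iou_3d = bbox_overlaps_3d(boxes_a, boxes_b, coordinate='lidar')\n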
\n\n@TASK_UTILS.register_module()\nclass AxisAlignedBboxOverlaps3D(object):\n    \"\"\"Axis-aligned 3D Overlaps (IoU) Calculator.\"\"\"\n\n    def __call__(self, bboxes1, bboxes2, mode='iou', is_aligned=False):\n        \"\"\"Calculate IoU between axis-aligned 3D bboxes.\n\n        Args:\n            bboxes1 (Tensor): shape (B, m, 6) in <x1, y1, z1, x2, y2, z2>\n                format or empty.\n            bboxes2 (Tensor): shape (B, n, 6) in <x1, y1, z1, x2, y2, z2>\n                format or empty.\n                B indicates the batch dim, in shape (B1, B2, ..., Bn).\n                If ``is_aligned`` is ``True``, then m and n must be equal.\n            mode (str): \"iou\" (intersection over union) or \"giou\" (generalized\n                intersection over union).\n            is_aligned (bool, optional): If True, then m and n must be equal.\n                Defaults to False.\n        Returns:\n            Tensor: shape (m, n) if ``is_aligned`` is False else shape (m,)\n        \"\"\"\n        assert bboxes1.size(-1) == bboxes2.size(-1) == 6\n        return axis_aligned_bbox_overlaps_3d(bboxes1, bboxes2, mode,\n                                             is_aligned)\n\n    def __repr__(self):\n        \"\"\"str: Return a string that describes the module.\"\"\"\n        repr_str = self.__class__.__name__ + '()'\n        return repr_str\n\n\ndef axis_aligned_bbox_overlaps_3d(bboxes1,\n                                  bboxes2,\n                                  mode='iou',\n                                  is_aligned=False,\n                                  eps=1e-6):\n    \"\"\"Calculate overlap between two sets of axis-aligned 3D bboxes. If\n    ``is_aligned`` is ``False``, then calculate the overlaps between each bbox\n    of bboxes1 and bboxes2, otherwise the overlaps between each aligned pair of\n    bboxes1 and bboxes2.\n\n    Args:\n        bboxes1 (Tensor): shape (B, m, 6) in <x1, y1, z1, x2, y2, z2>\n            format or empty.\n        bboxes2 (Tensor): shape (B, n, 6) in <x1, y1, z1, x2, y2, z2>\n            format or empty.\n            B indicates the batch dim, in shape (B1, B2, ..., Bn).\n            If ``is_aligned`` is ``True``, then m and n must be equal.\n        mode (str): \"iou\" (intersection over union) or \"giou\" (generalized\n            intersection over union).\n        is_aligned (bool, optional): If True, then m and n must be equal.\n            Defaults to False.\n        eps (float, optional): A value added to the denominator for numerical\n            stability.\n
            Defaults to 1e-6.\n\n    Returns:\n        Tensor: shape (m, n) if ``is_aligned`` is False else shape (m,)\n\n    Example:\n        >>> bboxes1 = torch.FloatTensor([\n        >>>     [0, 0, 0, 10, 10, 10],\n        >>>     [10, 10, 10, 20, 20, 20],\n        >>>     [32, 32, 32, 38, 40, 42],\n        >>> ])\n        >>> bboxes2 = torch.FloatTensor([\n        >>>     [0, 0, 0, 10, 20, 20],\n        >>>     [0, 10, 10, 10, 19, 20],\n        >>>     [10, 10, 10, 20, 20, 20],\n        >>> ])\n        >>> overlaps = axis_aligned_bbox_overlaps_3d(bboxes1, bboxes2)\n        >>> assert overlaps.shape == (3, 3)\n        >>> overlaps = axis_aligned_bbox_overlaps_3d(\n        >>>     bboxes1, bboxes2, is_aligned=True)\n        >>> assert overlaps.shape == (3, )\n    Example:\n        >>> empty = torch.empty(0, 6)\n        >>> nonempty = torch.FloatTensor([[0, 0, 0, 10, 9, 10]])\n        >>> assert tuple(\n        >>>     axis_aligned_bbox_overlaps_3d(empty, nonempty).shape) == (0, 1)\n        >>> assert tuple(\n        >>>     axis_aligned_bbox_overlaps_3d(nonempty, empty).shape) == (1, 0)\n        >>> assert tuple(\n        >>>     axis_aligned_bbox_overlaps_3d(empty, empty).shape) == (0, 0)\n    \"\"\"\n\n    assert mode in ['iou', 'giou'], f'Unsupported mode {mode}'\n    # Either the boxes are empty or the length of the boxes' last dim is 6\n    assert (bboxes1.size(-1) == 6 or bboxes1.size(0) == 0)\n    assert (bboxes2.size(-1) == 6 or bboxes2.size(0) == 0)\n\n    # Batch dim must be the same\n    # Batch dim: (B1, B2, ... Bn)\n    assert bboxes1.shape[:-2] == bboxes2.shape[:-2]\n    batch_shape = bboxes1.shape[:-2]\n\n    rows = bboxes1.size(-2)\n    cols = bboxes2.size(-2)\n    if is_aligned:\n        assert rows == cols\n\n    if rows * cols == 0:\n        if is_aligned:\n            return bboxes1.new(batch_shape + (rows, ))\n        else:\n            return bboxes1.new(batch_shape + (rows, cols))\n\n    area1 = (bboxes1[..., 3] - bboxes1[..., 0]) * (\n        bboxes1[..., 4] - bboxes1[..., 1]) * (bboxes1[..., 5] -\n                                              bboxes1[..., 2])\n    area2 = (bboxes2[..., 3] - bboxes2[..., 0]) * (\n        bboxes2[..., 4] - bboxes2[..., 1]) * (bboxes2[..., 5] -\n                                              bboxes2[..., 2])\n\n    if is_aligned:\n        lt = torch.max(bboxes1[..., :3], bboxes2[..., :3])  # [B, rows, 3]\n        rb = torch.min(bboxes1[..., 3:], bboxes2[..., 3:])  # [B, rows, 3]\n\n        wh = (rb - lt).clamp(min=0)  # [B, rows, 3]\n        overlap = wh[..., 0] * wh[..., 1] * wh[..., 2]\n\n        if mode in ['iou', 'giou']:\n            union = area1 + area2 - overlap\n        else:\n            union = area1\n        if mode == 'giou':\n            enclosed_lt = torch.min(bboxes1[..., :3], bboxes2[..., :3])\n            enclosed_rb = torch.max(bboxes1[..., 3:], bboxes2[..., 3:])\n    else:\n        lt = torch.max(bboxes1[..., :, None, :3],\n                       bboxes2[..., None, :, :3])  # [B, rows, cols, 3]\n        rb = torch.min(bboxes1[..., :, None, 3:],\n                       bboxes2[..., None, :, 3:])  # [B, rows, cols, 3]\n\n        wh = (rb - lt).clamp(min=0)  # [B, rows, cols, 3]\n        overlap = wh[..., 0] * wh[..., 1] * wh[..., 2]\n\n        if mode in ['iou', 'giou']:\n            union = area1[..., None] + area2[..., None, :] - overlap\n        if mode == 'giou':\n            enclosed_lt = torch.min(bboxes1[..., :, None, :3],\n                                    bboxes2[..., None, :, :3])\n            enclosed_rb = torch.max(bboxes1[..., :, None, 3:],\n                                    bboxes2[..., None, :, 3:])\n\n    eps = union.new_tensor([eps])\n
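    # Clamp the union volume (and, for GIoU, the enclosing volume below) by\n    # eps so that non-overlapping or degenerate boxes cannot divide by zero.\n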
    union = torch.max(union, eps)\n    ious = overlap / union\n    if mode == 'iou':\n        return ious\n    # calculate gious\n    enclose_wh = (enclosed_rb - enclosed_lt).clamp(min=0)\n    enclose_area = enclose_wh[..., 0] * enclose_wh[..., 1] * enclose_wh[..., 2]\n    enclose_area = torch.max(enclose_area, eps)\n    gious = ious - (enclose_area - union) / enclose_area\n    return gious\n"
  },
  {
    "path": "bip3d/structures/ops/transforms.py",
    "content": "# Copyright (c) OpenRobotLab. All rights reserved.\nimport torch\n\n\ndef bbox3d_mapping_back(bboxes, scale_factor, flip_horizontal, flip_vertical):\n    \"\"\"Map bboxes from testing scale to original image scale.\n\n    Args:\n        bboxes (:obj:`BaseInstance3DBoxes`): Boxes to be mapped back.\n        scale_factor (float): Scale factor.\n        flip_horizontal (bool): Whether to flip horizontally.\n        flip_vertical (bool): Whether to flip vertically.\n\n    Returns:\n        :obj:`BaseInstance3DBoxes`: Boxes mapped back.\n    \"\"\"\n    new_bboxes = bboxes.clone()\n    if flip_horizontal:\n        new_bboxes.flip('horizontal')\n    if flip_vertical:\n        new_bboxes.flip('vertical')\n    new_bboxes.scale(1 / scale_factor)\n\n    return new_bboxes\n\n\ndef bbox3d2roi(bbox_list):\n    \"\"\"Convert a list of bounding boxes to roi format.\n\n    Args:\n        bbox_list (list[torch.Tensor]): A list of bounding boxes\n            corresponding to a batch of images.\n\n    Returns:\n        torch.Tensor: Region of interests in shape (n, c), where\n            the channels are in order of [batch_ind, x, y ...].\n    \"\"\"\n    rois_list = []\n    for img_id, bboxes in enumerate(bbox_list):\n        if bboxes.size(0) > 0:\n            img_inds = bboxes.new_full((bboxes.size(0), 1), img_id)\n            rois = torch.cat([img_inds, bboxes], dim=-1)\n        else:\n            rois = torch.zeros_like(bboxes)\n        rois_list.append(rois)\n    rois = torch.cat(rois_list, 0)\n    return rois\n\n\n# TODO delete this\ndef bbox3d2result(bboxes, scores, labels, attrs=None):\n    \"\"\"Convert detection results to a list of numpy arrays.\n\n    Args:\n        bboxes (torch.Tensor): Bounding boxes with shape (N, 5).\n        labels (torch.Tensor): Labels with shape (N, ).\n        scores (torch.Tensor): Scores with shape (N, ).\n        attrs (torch.Tensor, optional): Attributes with shape (N, ).\n            Defaults to None.\n\n    Returns:\n        dict[str, torch.Tensor]: Bounding box results in cpu mode.\n\n            - boxes_3d (torch.Tensor): 3D boxes.\n            - scores (torch.Tensor): Prediction scores.\n            - labels_3d (torch.Tensor): Box labels.\n            - attrs_3d (torch.Tensor, optional): Box attributes.\n    \"\"\"\n    result_dict = dict(bboxes_3d=bboxes.to('cpu'),\n                       scores_3d=scores.cpu(),\n                       labels_3d=labels.cpu())\n\n    if attrs is not None:\n        result_dict['attr_labels'] = attrs.cpu()\n\n    return result_dict\n"
  },
  {
    "path": "bip3d/structures/points/__init__.py",
    "content": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom .base_points import BasePoints\nfrom .cam_points import CameraPoints\nfrom .depth_points import DepthPoints\nfrom .lidar_points import LiDARPoints\n\n__all__ = ['BasePoints', 'CameraPoints', 'DepthPoints', 'LiDARPoints']\n\n\ndef get_points_type(points_type: str) -> type:\n    \"\"\"Get the class of points according to coordinate type.\n\n    Args:\n        points_type (str): The type of points coordinate. The valid value are\n            \"CAMERA\", \"LIDAR\" and \"DEPTH\".\n\n    Returns:\n        type: Points type.\n    \"\"\"\n    points_type_upper = points_type.upper()\n    if points_type_upper == 'CAMERA':\n        points_cls = CameraPoints\n    elif points_type_upper == 'LIDAR':\n        points_cls = LiDARPoints\n    elif points_type_upper == 'DEPTH':\n        points_cls = DepthPoints\n    else:\n        raise ValueError('Only \"points_type\" of \"CAMERA\", \"LIDAR\" and \"DEPTH\" '\n                         f'are supported, got {points_type}')\n\n    return points_cls\n"
  },
  {
    "path": "bip3d/structures/points/base_points.py",
    "content": "# Copyright (c) OpenMMLab. All rights reserved.\nimport warnings\nfrom abc import abstractmethod\nfrom typing import Iterator, Optional, Sequence, Union\n\nimport numpy as np\nimport torch\nfrom torch import Tensor\n\nfrom bip3d.structures.bbox_3d.utils import (rotation_3d_in_axis,\n                                                   rotation_3d_in_euler)\n\n\nclass BasePoints:\n    \"\"\"Base class for Points.\n\n    Args:\n        tensor (Tensor or np.ndarray or Sequence[Sequence[float]]): The points\n            data with shape (N, points_dim).\n        points_dim (int): Integer indicating the dimension of a point. Each row\n            is (x, y, z, ...). Defaults to 3.\n        attribute_dims (dict, optional): Dictionary to indicate the meaning of\n            extra dimension. Defaults to None.\n\n    Attributes:\n        tensor (Tensor): Float matrix with shape (N, points_dim).\n        points_dim (int): Integer indicating the dimension of a point. Each row\n            is (x, y, z, ...).\n        attribute_dims (dict, optional): Dictionary to indicate the meaning of\n            extra dimension. Defaults to None.\n        rotation_axis (int): Default rotation axis for points rotation.\n    \"\"\"\n\n    def __init__(self,\n                 tensor: Union[Tensor, np.ndarray, Sequence[Sequence[float]]],\n                 points_dim: int = 3,\n                 attribute_dims: Optional[dict] = None) -> None:\n        if isinstance(tensor, Tensor):\n            device = tensor.device\n        else:\n            device = torch.device('cpu')\n        tensor = torch.as_tensor(tensor, dtype=torch.float32, device=device)\n        if tensor.numel() == 0:\n            # Use reshape, so we don't end up creating a new tensor that does\n            # not depend on the inputs (and consequently confuses jit)\n            tensor = tensor.reshape((-1, points_dim))\n        assert tensor.dim() == 2 and tensor.size(-1) == points_dim, \\\n            ('The points dimension must be 2 and the length of the last '\n             f'dimension must be {points_dim}, but got points with shape '\n             f'{tensor.shape}.')\n\n        self.tensor = tensor.clone()\n        self.points_dim = points_dim\n        self.attribute_dims = attribute_dims\n        self.rotation_axis = 0\n\n    @property\n    def coord(self) -> Tensor:\n        \"\"\"Tensor: Coordinates of each point in shape (N, 3).\"\"\"\n        return self.tensor[:, :3]\n\n    @coord.setter\n    def coord(self, tensor: Union[Tensor, np.ndarray]) -> None:\n        \"\"\"Set the coordinates of each point.\n\n        Args:\n            tensor (Tensor or np.ndarray): Coordinates of each point with shape\n                (N, 3).\n        \"\"\"\n        try:\n            tensor = tensor.reshape(self.shape[0], 3)\n        except (RuntimeError, ValueError):  # for torch.Tensor and np.ndarray\n            raise ValueError(f'got unexpected shape {tensor.shape}')\n        if not isinstance(tensor, Tensor):\n            tensor = self.tensor.new_tensor(tensor)\n        self.tensor[:, :3] = tensor\n\n    @property\n    def height(self) -> Union[Tensor, None]:\n        \"\"\"Tensor or None: Returns a vector with height of each point in shape\n        (N, ).\"\"\"\n        if self.attribute_dims is not None and \\\n                'height' in self.attribute_dims.keys():\n            return self.tensor[:, self.attribute_dims['height']]\n        else:\n            return None\n\n    @height.setter\n    def height(self, tensor: Union[Tensor, np.ndarray]) 
-> None:\n        \"\"\"Set the height of each point.\n\n        Args:\n            tensor (Tensor or np.ndarray): Height of each point with shape\n                (N, ).\n        \"\"\"\n        try:\n            tensor = tensor.reshape(self.shape[0])\n        except (RuntimeError, ValueError):  # for torch.Tensor and np.ndarray\n            raise ValueError(f'got unexpected shape {tensor.shape}')\n        if not isinstance(tensor, Tensor):\n            tensor = self.tensor.new_tensor(tensor)\n        if self.attribute_dims is not None and \\\n                'height' in self.attribute_dims.keys():\n            self.tensor[:, self.attribute_dims['height']] = tensor\n        else:\n            # add height attribute\n            if self.attribute_dims is None:\n                self.attribute_dims = dict()\n            attr_dim = self.shape[1]\n            self.tensor = torch.cat([self.tensor, tensor.unsqueeze(1)], dim=1)\n            self.attribute_dims.update(dict(height=attr_dim))\n            self.points_dim += 1\n\n    @property\n    def color(self) -> Union[Tensor, None]:\n        \"\"\"Tensor or None: Returns a vector with color of each point in shape\n        (N, 3).\"\"\"\n        if self.attribute_dims is not None and \\\n                'color' in self.attribute_dims.keys():\n            return self.tensor[:, self.attribute_dims['color']]\n        else:\n            return None\n\n    @color.setter\n    def color(self, tensor: Union[Tensor, np.ndarray]) -> None:\n        \"\"\"Set the color of each point.\n\n        Args:\n            tensor (Tensor or np.ndarray): Color of each point with shape\n                (N, 3).\n        \"\"\"\n        try:\n            tensor = tensor.reshape(self.shape[0], 3)\n        except (RuntimeError, ValueError):  # for torch.Tensor and np.ndarray\n            raise ValueError(f'got unexpected shape {tensor.shape}')\n        if tensor.max() >= 256 or tensor.min() < 0:\n            warnings.warn('point got color value beyond [0, 255]')\n        if not isinstance(tensor, Tensor):\n            tensor = self.tensor.new_tensor(tensor)\n        if self.attribute_dims is not None and \\\n                'color' in self.attribute_dims.keys():\n            self.tensor[:, self.attribute_dims['color']] = tensor\n        else:\n            # add color attribute\n            if self.attribute_dims is None:\n                self.attribute_dims = dict()\n            attr_dim = self.shape[1]\n            self.tensor = torch.cat([self.tensor, tensor], dim=1)\n            self.attribute_dims.update(\n                dict(color=[attr_dim, attr_dim + 1, attr_dim + 2]))\n            self.points_dim += 3\n\n    @property\n    def shape(self) -> torch.Size:\n        \"\"\"torch.Size: Shape of points.\"\"\"\n        return self.tensor.shape\n\n    def shuffle(self) -> Tensor:\n        \"\"\"Shuffle the points.\n\n        Returns:\n            Tensor: The shuffled index.\n        \"\"\"\n        idx = torch.randperm(self.__len__(), device=self.tensor.device)\n        self.tensor = self.tensor[idx]\n        return idx\n\n    def rotate(self,\n               rotation: Union[Tensor, np.ndarray, float],\n               axis: Optional[int] = None) -> Tensor:\n        \"\"\"Rotate points with the given rotation matrix or angle.\n\n        Args:\n            rotation (Tensor or np.ndarray or float): Rotation matrix or angle.\n            axis (int, optional): Axis to rotate at. 
Defaults to None.\n\n        Returns:\n            Tensor: Rotation matrix.\n        \"\"\"\n        if not isinstance(rotation, Tensor):\n            rotation = self.tensor.new_tensor(rotation)\n        assert rotation.shape == torch.Size([3, 3]) or rotation.numel() == 1, \\\n            f'invalid rotation shape {rotation.shape}'\n\n        if axis is None:\n            axis = self.rotation_axis\n\n        if rotation.numel() == 1:\n            rotated_points, rot_mat_T = rotation_3d_in_axis(\n                self.tensor[:, :3][None], rotation, axis=axis, return_mat=True)\n            self.tensor[:, :3] = rotated_points.squeeze(0)\n            rot_mat_T = rot_mat_T.squeeze(0)\n        elif rotation.numel() == 3:\n            rotated_points, rot_mat_T = rotation_3d_in_euler(\n                self.tensor[:, :3][None], rotation, return_mat=True)\n            self.tensor[:, :3] = rotated_points.squeeze(0)\n            rot_mat_T = rot_mat_T.squeeze(0)\n        else:\n            # rotation.numel() == 9\n            self.tensor[:, :3] = self.tensor[:, :3] @ rotation\n            rot_mat_T = rotation\n\n        return rot_mat_T\n\n    @abstractmethod\n    def flip(self, bev_direction: str = 'horizontal') -> None:\n        \"\"\"Flip the points along given BEV direction.\n\n        Args:\n            bev_direction (str): Flip direction (horizontal or vertical).\n                Defaults to 'horizontal'.\n        \"\"\"\n        pass\n\n    def translate(self, trans_vector: Union[Tensor, np.ndarray]) -> None:\n        \"\"\"Translate points with the given translation vector.\n\n        Args:\n            trans_vector (Tensor or np.ndarray): Translation vector of size 3\n                or nx3.\n        \"\"\"\n        if not isinstance(trans_vector, Tensor):\n            trans_vector = self.tensor.new_tensor(trans_vector)\n        trans_vector = trans_vector.squeeze(0)\n        if trans_vector.dim() == 1:\n            assert trans_vector.shape[0] == 3\n        elif trans_vector.dim() == 2:\n            assert trans_vector.shape[0] == self.tensor.shape[0] and \\\n                trans_vector.shape[1] == 3\n        else:\n            raise NotImplementedError(\n                f'Unsupported translation vector of shape {trans_vector.shape}'\n            )\n        self.tensor[:, :3] += trans_vector\n\n    def in_range_3d(\n            self, point_range: Union[Tensor, np.ndarray,\n                                     Sequence[float]]) -> Tensor:\n        \"\"\"Check whether the points are in the given range.\n\n        Args:\n            point_range (Tensor or np.ndarray or Sequence[float]): The range of\n                point (x_min, y_min, z_min, x_max, y_max, z_max).\n\n        Note:\n            In the original implementation of SECOND, the in-range check is\n            done with a convex-polygon test; here we reduce it to simple\n            axis-aligned comparisons for this common case.\n\n        Returns:\n            Tensor: A binary vector indicating whether each point is inside the\n            reference range.\n        \"\"\"\n        in_range_flags = ((self.tensor[:, 0] > point_range[0])\n                          & (self.tensor[:, 1] > point_range[1])\n                          & (self.tensor[:, 2] > point_range[2])\n                          & (self.tensor[:, 0] < point_range[3])\n                          & (self.tensor[:, 1] < point_range[4])\n                          & (self.tensor[:, 2] < point_range[5]))\n        return in_range_flags\n
\n    @property\n    def bev(self) -> Tensor:\n        \"\"\"Tensor: BEV of the points in shape (N, 2).\"\"\"\n        return self.tensor[:, [0, 1]]\n\n    def in_range_bev(\n            self, point_range: Union[Tensor, np.ndarray,\n                                     Sequence[float]]) -> Tensor:\n        \"\"\"Check whether the points are in the given range.\n\n        Args:\n            point_range (Tensor or np.ndarray or Sequence[float]): The range of\n                point in order of (x_min, y_min, x_max, y_max).\n\n        Returns:\n            Tensor: A binary vector indicating whether each point is inside the\n            reference range.\n        \"\"\"\n        in_range_flags = ((self.bev[:, 0] > point_range[0])\n                          & (self.bev[:, 1] > point_range[1])\n                          & (self.bev[:, 0] < point_range[2])\n                          & (self.bev[:, 1] < point_range[3]))\n        return in_range_flags\n\n    @abstractmethod\n    def convert_to(self,\n                   dst: int,\n                   rt_mat: Optional[Union[Tensor,\n                                          np.ndarray]] = None) -> 'BasePoints':\n        \"\"\"Convert self to ``dst`` mode.\n\n        Args:\n            dst (int): The target Point mode.\n            rt_mat (Tensor or np.ndarray, optional): The rotation and\n                translation matrix between different coordinates.\n                Defaults to None. The conversion from ``src`` coordinates to\n                ``dst`` coordinates usually comes along the change of sensors,\n                e.g., from camera to LiDAR. This requires a transformation\n                matrix.\n\n        Returns:\n            :obj:`BasePoints`: The converted point of the same type in the\n            ``dst`` mode.\n        \"\"\"\n        pass\n\n    def scale(self, scale_factor: float) -> None:\n        \"\"\"Scale the points with horizontal and vertical scaling factors.\n\n        Args:\n            scale_factor (float): Scale factor to scale the points.\n        \"\"\"\n        self.tensor[:, :3] *= scale_factor\n\n    def __getitem__(\n            self, item: Union[int, tuple, slice, np.ndarray,\n                              Tensor]) -> 'BasePoints':\n        \"\"\"\n        Args:\n            item (int or tuple or slice or np.ndarray or Tensor): Index of\n                points.\n\n        Note:\n            The following usages are allowed:\n\n            1. `new_points = points[3]`: Return a `Points` that contains only\n               one point.\n            2. `new_points = points[2:10]`: Return a slice of points.\n            3. `new_points = points[vector]`: Where vector is a\n               torch.BoolTensor with `length = len(points)`. Nonzero elements\n               in the vector will be selected.\n            4. `new_points = points[3:11, vector]`: Return a slice of points\n               and attribute dims.\n            5. 
`new_points = points[4:12, 2]`: Return a slice of points with\n               single attribute.\n\n            Note that the returned Points might share storage with this Points,\n            subject to PyTorch's indexing semantics.\n\n        Returns:\n            :obj:`BasePoints`: A new object of :class:`BasePoints` after\n            indexing.\n        \"\"\"\n        original_type = type(self)\n        if isinstance(item, int):\n            return original_type(self.tensor[item].view(1, -1),\n                                 points_dim=self.points_dim,\n                                 attribute_dims=self.attribute_dims)\n        elif isinstance(item, tuple) and len(item) == 2:\n            if isinstance(item[1], slice):\n                start = 0 if item[1].start is None else item[1].start\n                stop = self.tensor.shape[1] \\\n                    if item[1].stop is None else item[1].stop\n                step = 1 if item[1].step is None else item[1].step\n                item = list(item)\n                item[1] = list(range(start, stop, step))\n                item = tuple(item)\n            elif isinstance(item[1], int):\n                item = list(item)\n                item[1] = [item[1]]\n                item = tuple(item)\n            p = self.tensor[item[0], item[1]]\n\n            keep_dims = list(\n                set(item[1]).intersection(set(range(3, self.tensor.shape[1]))))\n            if self.attribute_dims is not None:\n                attribute_dims = self.attribute_dims.copy()\n                for key in self.attribute_dims.keys():\n                    cur_attribute_dims = attribute_dims[key]\n                    if isinstance(cur_attribute_dims, int):\n                        cur_attribute_dims = [cur_attribute_dims]\n                    intersect_attr = list(\n                        set(cur_attribute_dims).intersection(set(keep_dims)))\n                    if len(intersect_attr) == 1:\n                        attribute_dims[key] = intersect_attr[0]\n                    elif len(intersect_attr) > 1:\n                        attribute_dims[key] = intersect_attr\n                    else:\n                        attribute_dims.pop(key)\n            else:\n                attribute_dims = None\n        elif isinstance(item, (slice, np.ndarray, Tensor)):\n            p = self.tensor[item]\n            attribute_dims = self.attribute_dims\n        else:\n            raise NotImplementedError(f'Invalid slice {item}!')\n\n        assert p.dim() == 2, \\\n            f'Indexing on Points with {item} failed to return a matrix!'\n        return original_type(p,\n                             points_dim=p.shape[1],\n                             attribute_dims=attribute_dims)\n\n    def __len__(self) -> int:\n        \"\"\"int: Number of points in the current object.\"\"\"\n        return self.tensor.shape[0]\n\n    def __repr__(self) -> str:\n        \"\"\"str: Return a string that describes the object.\"\"\"\n        return self.__class__.__name__ + '(\\n    ' + str(self.tensor) + ')'\n\n    @classmethod\n    def cat(cls, points_list: Sequence['BasePoints']) -> 'BasePoints':\n        \"\"\"Concatenate a list of Points into a single Points.\n\n        Args:\n            points_list (Sequence[:obj:`BasePoints`]): List of points.\n\n        Returns:\n            :obj:`BasePoints`: The concatenated points.\n        \"\"\"\n        assert isinstance(points_list, (list, tuple))\n        if len(points_list) == 0:\n            return cls(torch.empty(0))\n        assert 
all(isinstance(points, cls) for points in points_list)\n\n        # use torch.cat (v.s. layers.cat)\n        # so the returned points never share storage with input\n        cat_points = cls(torch.cat([p.tensor for p in points_list], dim=0),\n                         points_dim=points_list[0].points_dim,\n                         attribute_dims=points_list[0].attribute_dims)\n        return cat_points\n\n    def numpy(self) -> np.ndarray:\n        \"\"\"Reload ``numpy`` from self.tensor.\"\"\"\n        return self.tensor.numpy()\n\n    def to(self, device: Union[str, torch.device], *args,\n           **kwargs) -> 'BasePoints':\n        \"\"\"Convert current points to a specific device.\n\n        Args:\n            device (str or :obj:`torch.device`): The name of the device.\n\n        Returns:\n            :obj:`BasePoints`: A new points object on the specific device.\n        \"\"\"\n        original_type = type(self)\n        return original_type(self.tensor.to(device, *args, **kwargs),\n                             points_dim=self.points_dim,\n                             attribute_dims=self.attribute_dims)\n\n    def cpu(self) -> 'BasePoints':\n        \"\"\"Convert current points to cpu device.\n\n        Returns:\n            :obj:`BasePoints`: A new points object on the cpu device.\n        \"\"\"\n        original_type = type(self)\n        return original_type(self.tensor.cpu(),\n                             points_dim=self.points_dim,\n                             attribute_dims=self.attribute_dims)\n\n    def cuda(self, *args, **kwargs) -> 'BasePoints':\n        \"\"\"Convert current points to cuda device.\n\n        Returns:\n            :obj:`BasePoints`: A new points object on the cuda device.\n        \"\"\"\n        original_type = type(self)\n        return original_type(self.tensor.cuda(*args, **kwargs),\n                             points_dim=self.points_dim,\n                             attribute_dims=self.attribute_dims)\n\n    def clone(self) -> 'BasePoints':\n        \"\"\"Clone the points.\n\n        Returns:\n            :obj:`BasePoints`: Point object with the same properties as self.\n        \"\"\"\n        original_type = type(self)\n        return original_type(self.tensor.clone(),\n                             points_dim=self.points_dim,\n                             attribute_dims=self.attribute_dims)\n\n    def detach(self) -> 'BasePoints':\n        \"\"\"Detach the points.\n\n        Returns:\n            :obj:`BasePoints`: Point object with the same properties as self.\n        \"\"\"\n        original_type = type(self)\n        return original_type(self.tensor.detach(),\n                             points_dim=self.points_dim,\n                             attribute_dims=self.attribute_dims)\n\n    @property\n    def device(self) -> torch.device:\n        \"\"\"torch.device: The device of the points are on.\"\"\"\n        return self.tensor.device\n\n    def __iter__(self) -> Iterator[Tensor]:\n        \"\"\"Yield a point as a Tensor at a time.\n\n        Returns:\n            Iterator[Tensor]: A point of shape (points_dim, ).\n        \"\"\"\n        yield from self.tensor\n\n    def new_point(\n        self, data: Union[Tensor, np.ndarray, Sequence[Sequence[float]]]\n    ) -> 'BasePoints':\n        \"\"\"Create a new point object with data.\n\n        The new point and its tensor has the similar properties as self and\n        self.tensor, respectively.\n\n        Args:\n            data (Tensor or np.ndarray or Sequence[Sequence[float]]): Data to\n 
               be copied.\n\n        Returns:\n            :obj:`BasePoints`: A new point object with ``data``, the object's\n            other properties are similar to ``self``.\n        \"\"\"\n        new_tensor = self.tensor.new_tensor(data) \\\n            if not isinstance(data, Tensor) else data.to(self.device)\n        original_type = type(self)\n        return original_type(new_tensor,\n                             points_dim=self.points_dim,\n                             attribute_dims=self.attribute_dims)\n"
  },
  {
    "path": "bip3d/structures/points/cam_points.py",
    "content": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom typing import Optional, Sequence, Union\n\nimport numpy as np\nfrom torch import Tensor\n\nfrom .base_points import BasePoints\n\n\nclass CameraPoints(BasePoints):\n    \"\"\"Points of instances in CAM coordinates.\n\n    Args:\n        tensor (Tensor or np.ndarray or Sequence[Sequence[float]]): The points\n            data with shape (N, points_dim).\n        points_dim (int): Integer indicating the dimension of a point. Each row\n            is (x, y, z, ...). Defaults to 3.\n        attribute_dims (dict, optional): Dictionary to indicate the meaning of\n            extra dimension. Defaults to None.\n\n    Attributes:\n        tensor (Tensor): Float matrix with shape (N, points_dim).\n        points_dim (int): Integer indicating the dimension of a point. Each row\n            is (x, y, z, ...).\n        attribute_dims (dict, optional): Dictionary to indicate the meaning of\n            extra dimension. Defaults to None.\n        rotation_axis (int): Default rotation axis for points rotation.\n    \"\"\"\n\n    def __init__(self,\n                 tensor: Union[Tensor, np.ndarray, Sequence[Sequence[float]]],\n                 points_dim: int = 3,\n                 attribute_dims: Optional[dict] = None) -> None:\n        super(CameraPoints, self).__init__(tensor,\n                                           points_dim=points_dim,\n                                           attribute_dims=attribute_dims)\n        self.rotation_axis = 1\n\n    def flip(self, bev_direction: str = 'horizontal') -> None:\n        \"\"\"Flip the points along given BEV direction.\n\n        Args:\n            bev_direction (str): Flip direction (horizontal or vertical).\n                Defaults to 'horizontal'.\n        \"\"\"\n        assert bev_direction in ('horizontal', 'vertical')\n        if bev_direction == 'horizontal':\n            self.tensor[:, 0] = -self.tensor[:, 0]\n        elif bev_direction == 'vertical':\n            self.tensor[:, 2] = -self.tensor[:, 2]\n\n    @property\n    def bev(self) -> Tensor:\n        \"\"\"Tensor: BEV of the points in shape (N, 2).\"\"\"\n        return self.tensor[:, [0, 2]]\n\n    def convert_to(self,\n                   dst: int,\n                   rt_mat: Optional[Union[Tensor,\n                                          np.ndarray]] = None) -> 'BasePoints':\n        \"\"\"Convert self to ``dst`` mode.\n\n        Args:\n            dst (int): The target Point mode.\n            rt_mat (Tensor or np.ndarray, optional): The rotation and\n                translation matrix between different coordinates.\n                Defaults to None. The conversion from ``src`` coordinates to\n                ``dst`` coordinates usually comes along the change of sensors,\n                e.g., from camera to LiDAR. This requires a transformation\n                matrix.\n\n        Returns:\n            :obj:`BasePoints`: The converted point of the same type in the\n            ``dst`` mode.\n        \"\"\"\n        from embodiedscan.structures.bbox_3d import Coord3DMode\n        return Coord3DMode.convert_point(point=self,\n                                         src=Coord3DMode.CAM,\n                                         dst=dst,\n                                         rt_mat=rt_mat)\n"
  },
  {
    "path": "bip3d/structures/points/depth_points.py",
    "content": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom typing import Optional, Sequence, Union\n\nimport numpy as np\nfrom torch import Tensor\n\nfrom .base_points import BasePoints\n\n\nclass DepthPoints(BasePoints):\n    \"\"\"Points of instances in DEPTH coordinates.\n\n    Args:\n        tensor (Tensor or np.ndarray or Sequence[Sequence[float]]): The points\n            data with shape (N, points_dim).\n        points_dim (int): Integer indicating the dimension of a point. Each row\n            is (x, y, z, ...). Defaults to 3.\n        attribute_dims (dict, optional): Dictionary to indicate the meaning of\n            extra dimension. Defaults to None.\n\n    Attributes:\n        tensor (Tensor): Float matrix with shape (N, points_dim).\n        points_dim (int): Integer indicating the dimension of a point. Each row\n            is (x, y, z, ...).\n        attribute_dims (dict, optional): Dictionary to indicate the meaning of\n            extra dimension. Defaults to None.\n        rotation_axis (int): Default rotation axis for points rotation.\n    \"\"\"\n\n    def __init__(self,\n                 tensor: Union[Tensor, np.ndarray, Sequence[Sequence[float]]],\n                 points_dim: int = 3,\n                 attribute_dims: Optional[dict] = None) -> None:\n        super(DepthPoints, self).__init__(tensor,\n                                          points_dim=points_dim,\n                                          attribute_dims=attribute_dims)\n        self.rotation_axis = 2\n\n    def flip(self, bev_direction: str = 'horizontal') -> None:\n        \"\"\"Flip the points along given BEV direction.\n\n        Args:\n            bev_direction (str): Flip direction (horizontal or vertical).\n                Defaults to 'horizontal'.\n        \"\"\"\n        assert bev_direction in ('horizontal', 'vertical')\n        if bev_direction == 'horizontal':\n            self.tensor[:, 0] = -self.tensor[:, 0]\n        elif bev_direction == 'vertical':\n            self.tensor[:, 1] = -self.tensor[:, 1]\n\n    def convert_to(self,\n                   dst: int,\n                   rt_mat: Optional[Union[Tensor,\n                                          np.ndarray]] = None) -> 'BasePoints':\n        \"\"\"Convert self to ``dst`` mode.\n\n        Args:\n            dst (int): The target Point mode.\n            rt_mat (Tensor or np.ndarray, optional): The rotation and\n                translation matrix between different coordinates.\n                Defaults to None. The conversion from ``src`` coordinates to\n                ``dst`` coordinates usually comes along the change of sensors,\n                e.g., from camera to LiDAR. This requires a transformation\n                matrix.\n\n        Returns:\n            :obj:`BasePoints`: The converted point of the same type in the\n            ``dst`` mode.\n        \"\"\"\n        from embodiedscan.structures.bbox_3d import Coord3DMode\n        return Coord3DMode.convert_point(point=self,\n                                         src=Coord3DMode.DEPTH,\n                                         dst=dst,\n                                         rt_mat=rt_mat)\n"
  },
  {
    "path": "bip3d/structures/points/lidar_points.py",
    "content": "# Copyright (c) OpenMMLab. All rights reserved.\nfrom typing import Optional, Sequence, Union\n\nimport numpy as np\nfrom torch import Tensor\n\nfrom .base_points import BasePoints\n\n\nclass LiDARPoints(BasePoints):\n    \"\"\"Points of instances in LIDAR coordinates.\n\n    Args:\n        tensor (Tensor or np.ndarray or Sequence[Sequence[float]]): The points\n            data with shape (N, points_dim).\n        points_dim (int): Integer indicating the dimension of a point. Each row\n            is (x, y, z, ...). Defaults to 3.\n        attribute_dims (dict, optional): Dictionary to indicate the meaning of\n            extra dimension. Defaults to None.\n\n    Attributes:\n        tensor (Tensor): Float matrix with shape (N, points_dim).\n        points_dim (int): Integer indicating the dimension of a point. Each row\n            is (x, y, z, ...).\n        attribute_dims (dict, optional): Dictionary to indicate the meaning of\n            extra dimension. Defaults to None.\n        rotation_axis (int): Default rotation axis for points rotation.\n    \"\"\"\n\n    def __init__(self,\n                 tensor: Union[Tensor, np.ndarray, Sequence[Sequence[float]]],\n                 points_dim: int = 3,\n                 attribute_dims: Optional[dict] = None) -> None:\n        super(LiDARPoints, self).__init__(tensor,\n                                          points_dim=points_dim,\n                                          attribute_dims=attribute_dims)\n        self.rotation_axis = 2\n\n    def flip(self, bev_direction: str = 'horizontal') -> None:\n        \"\"\"Flip the points along given BEV direction.\n\n        Args:\n            bev_direction (str): Flip direction (horizontal or vertical).\n                Defaults to 'horizontal'.\n        \"\"\"\n        assert bev_direction in ('horizontal', 'vertical')\n        if bev_direction == 'horizontal':\n            self.tensor[:, 1] = -self.tensor[:, 1]\n        elif bev_direction == 'vertical':\n            self.tensor[:, 0] = -self.tensor[:, 0]\n\n    def convert_to(self,\n                   dst: int,\n                   rt_mat: Optional[Union[Tensor,\n                                          np.ndarray]] = None) -> 'BasePoints':\n        \"\"\"Convert self to ``dst`` mode.\n\n        Args:\n            dst (int): The target Point mode.\n            rt_mat (Tensor or np.ndarray, optional): The rotation and\n                translation matrix between different coordinates.\n                Defaults to None. The conversion from ``src`` coordinates to\n                ``dst`` coordinates usually comes along the change of sensors,\n                e.g., from camera to LiDAR. This requires a transformation\n                matrix.\n\n        Returns:\n            :obj:`BasePoints`: The converted point of the same type in the\n            ``dst`` mode.\n        \"\"\"\n        from embodiedscan.structures.bbox_3d import Coord3DMode\n        return Coord3DMode.convert_point(point=self,\n                                         src=Coord3DMode.LIDAR,\n                                         dst=dst,\n                                         rt_mat=rt_mat)\n"
  },
  {
    "path": "bip3d/utils/__init__.py",
    "content": "from .array_converter import ArrayConverter, array_converter\nfrom .typing_config import ConfigType, OptConfigType, OptMultiConfig\n\n__all__ = [\n    'ConfigType', 'OptConfigType', 'OptMultiConfig', 'ArrayConverter',\n    'array_converter'\n]\n"
  },
  {
    "path": "bip3d/utils/array_converter.py",
    "content": "# Copyright (c) OpenMMLab. All rights reserved.\nimport functools\nfrom inspect import getfullargspec\nfrom typing import Callable, Optional, Tuple, Type, Union\n\nimport numpy as np\nimport torch\n\nTemplateArrayType = Union[np.ndarray, torch.Tensor, list, tuple, int, float]\n\n\ndef array_converter(to_torch: bool = True,\n                    apply_to: Tuple[str, ...] = tuple(),\n                    template_arg_name_: Optional[str] = None,\n                    recover: bool = True) -> Callable:\n    \"\"\"Wrapper function for data-type agnostic processing.\n\n    First converts input arrays to PyTorch tensors or NumPy arrays for middle\n    calculation, then convert output to original data-type if `recover=True`.\n\n    Args:\n        to_torch (bool): Whether to convert to PyTorch tensors for middle\n            calculation. Defaults to True.\n        apply_to (Tuple[str]): The arguments to which we apply data-type\n            conversion. Defaults to an empty tuple.\n        template_arg_name_ (str, optional): Argument serving as the template\n            (return arrays should have the same dtype and device as the\n            template). Defaults to None. If None, we will use the first\n            argument in `apply_to` as the template argument.\n        recover (bool): Whether or not to recover the wrapped function outputs\n            to the `template_arg_name_` type. Defaults to True.\n\n    Raises:\n        ValueError: When template_arg_name_ is not among all args, or when\n            apply_to contains an arg which is not among all args, a ValueError\n            will be raised. When the template argument or an argument to\n            convert is a list or tuple, and cannot be converted to a NumPy\n            array, a ValueError will be raised.\n        TypeError: When the type of the template argument or an argument to\n            convert does not belong to the above range, or the contents of such\n            an list-or-tuple-type argument do not share the same data type, a\n            TypeError will be raised.\n\n    Returns:\n        Callable: Wrapped function.\n\n    Examples:\n        >>> import torch\n        >>> import numpy as np\n        >>>\n        >>> # Use torch addition for a + b,\n        >>> # and convert return values to the type of a\n        >>> @array_converter(apply_to=('a', 'b'))\n        >>> def simple_add(a, b):\n        >>>     return a + b\n        >>>\n        >>> a = np.array([1.1])\n        >>> b = np.array([2.2])\n        >>> simple_add(a, b)\n        >>>\n        >>> # Use numpy addition for a + b,\n        >>> # and convert return values to the type of b\n        >>> @array_converter(to_torch=False, apply_to=('a', 'b'),\n        >>>                  template_arg_name_='b')\n        >>> def simple_add(a, b):\n        >>>     return a + b\n        >>>\n        >>> simple_add(a, b)\n        >>>\n        >>> # Use torch funcs for floor(a) if flag=True else ceil(a),\n        >>> # and return the torch tensor\n        >>> @array_converter(apply_to=('a',), recover=False)\n        >>> def floor_or_ceil(a, flag=True):\n        >>>     return torch.floor(a) if flag else torch.ceil(a)\n        >>>\n        >>> floor_or_ceil(a, flag=False)\n    \"\"\"\n\n    def array_converter_wrapper(func):\n        \"\"\"Outer wrapper for the function.\"\"\"\n\n        @functools.wraps(func)\n        def new_func(*args, **kwargs):\n            \"\"\"Inner wrapper for the arguments.\"\"\"\n            if len(apply_to) == 0:\n                return 
func(*args, **kwargs)\n\n            func_name = func.__name__\n\n            arg_spec = getfullargspec(func)\n\n            arg_names = arg_spec.args\n            arg_num = len(arg_names)\n            default_arg_values = arg_spec.defaults\n            if default_arg_values is None:\n                default_arg_values = []\n            no_default_arg_num = len(arg_names) - len(default_arg_values)\n\n            kwonly_arg_names = arg_spec.kwonlyargs\n            kwonly_default_arg_values = arg_spec.kwonlydefaults\n            if kwonly_default_arg_values is None:\n                kwonly_default_arg_values = {}\n\n            all_arg_names = arg_names + kwonly_arg_names\n\n            # in case there are args in the form of *args\n            if len(args) > arg_num:\n                named_args = args[:arg_num]\n                nameless_args = args[arg_num:]\n            else:\n                named_args = args\n                nameless_args = []\n\n            # template argument data type is used for all array-like arguments\n            if template_arg_name_ is None:\n                template_arg_name = apply_to[0]\n            else:\n                template_arg_name = template_arg_name_\n\n            if template_arg_name not in all_arg_names:\n                raise ValueError(f'{template_arg_name} is not among the '\n                                 f'argument list of function {func_name}')\n\n            # inspect apply_to\n            for arg_to_apply in apply_to:\n                if arg_to_apply not in all_arg_names:\n                    raise ValueError(\n                        f'{arg_to_apply} is not an argument of {func_name}')\n\n            new_args = []\n            new_kwargs = {}\n\n            converter = ArrayConverter()\n            target_type = torch.Tensor if to_torch else np.ndarray\n\n            # non-keyword arguments\n            for i, arg_value in enumerate(named_args):\n                if arg_names[i] in apply_to:\n                    new_args.append(\n                        converter.convert(input_array=arg_value,\n                                          target_type=target_type))\n                else:\n                    new_args.append(arg_value)\n\n                if arg_names[i] == template_arg_name:\n                    template_arg_value = arg_value\n\n            kwonly_default_arg_values.update(kwargs)\n            kwargs = kwonly_default_arg_values\n\n            # keyword arguments and non-keyword arguments using default value\n            for i in range(len(named_args), len(all_arg_names)):\n                arg_name = all_arg_names[i]\n                if arg_name in kwargs:\n                    if arg_name in apply_to:\n                        new_kwargs[arg_name] = converter.convert(\n                            input_array=kwargs[arg_name],\n                            target_type=target_type)\n                    else:\n                        new_kwargs[arg_name] = kwargs[arg_name]\n                else:\n                    default_value = default_arg_values[i - no_default_arg_num]\n                    if arg_name in apply_to:\n                        new_kwargs[arg_name] = converter.convert(\n                            input_array=default_value, target_type=target_type)\n                    else:\n                        new_kwargs[arg_name] = default_value\n                if arg_name == template_arg_name:\n                    template_arg_value = kwargs[arg_name]\n\n            # add nameless args provided by *args (if exists)\n       
     new_args += nameless_args\n\n            return_values = func(*new_args, **new_kwargs)\n            converter.set_template(template_arg_value)\n\n            def recursive_recover(input_data):\n                if isinstance(input_data, (tuple, list)):\n                    new_data = []\n                    for item in input_data:\n                        new_data.append(recursive_recover(item))\n                    return tuple(new_data) if isinstance(input_data,\n                                                         tuple) else new_data\n                elif isinstance(input_data, dict):\n                    new_data = {}\n                    for k, v in input_data.items():\n                        new_data[k] = recursive_recover(v)\n                    return new_data\n                elif isinstance(input_data, (torch.Tensor, np.ndarray)):\n                    return converter.recover(input_data)\n                else:\n                    return input_data\n\n            if recover:\n                return recursive_recover(return_values)\n            else:\n                return return_values\n\n        return new_func\n\n    return array_converter_wrapper\n\n\nclass ArrayConverter:\n    \"\"\"Utility class for data-type agnostic processing.\n\n    Args:\n        template_array (np.ndarray or torch.Tensor or list or tuple or int or\n            float, optional): Template array. Defaults to None.\n    \"\"\"\n    SUPPORTED_NON_ARRAY_TYPES = (int, float, np.int8, np.int16, np.int32,\n                                 np.int64, np.uint8, np.uint16, np.uint32,\n                                 np.uint64, np.float16, np.float32, np.float64)\n\n    def __init__(self,\n                 template_array: Optional[TemplateArrayType] = None) -> None:\n        if template_array is not None:\n            self.set_template(template_array)\n\n    def set_template(self, array: TemplateArrayType) -> None:\n        \"\"\"Set template array.\n\n        Args:\n            array (np.ndarray or torch.Tensor or list or tuple or int or\n                float): Template array.\n\n        Raises:\n            ValueError: If input is list or tuple and cannot be converted to a\n                NumPy array, a ValueError is raised.\n            TypeError: If input type does not belong to the above range, or the\n                contents of a list or tuple do not share the same data type, a\n                TypeError is raised.\n        \"\"\"\n        self.array_type = type(array)\n        self.is_num = False\n        self.device = 'cpu'\n\n        if isinstance(array, np.ndarray):\n            self.dtype = array.dtype\n        elif isinstance(array, torch.Tensor):\n            self.dtype = array.dtype\n            self.device = array.device\n        elif isinstance(array, (list, tuple)):\n            try:\n                array = np.array(array)\n                if array.dtype not in self.SUPPORTED_NON_ARRAY_TYPES:\n                    raise TypeError\n                self.dtype = array.dtype\n            except (ValueError, TypeError):\n                print('The following list cannot be converted to a numpy '\n                      f'array of supported dtype:\\n{array}')\n                raise\n        elif isinstance(array, (int, float)):\n            self.array_type = np.ndarray\n            self.is_num = True\n            self.dtype = np.dtype(type(array))\n        else:\n            raise TypeError(\n                f'Template type {self.array_type} is not supported.')\n\n    def convert(\n        
self,\n        input_array: TemplateArrayType,\n        target_type: Optional[Type] = None,\n        target_array: Optional[Union[np.ndarray, torch.Tensor]] = None\n    ) -> Union[np.ndarray, torch.Tensor]:\n        \"\"\"Convert input array to target data type.\n\n        Args:\n            input_array (np.ndarray or torch.Tensor or list or tuple or int or\n                float): Input array.\n            target_type (Type, optional): Type to which input array is\n                converted. It should be `np.ndarray` or `torch.Tensor`.\n                Defaults to None.\n            target_array (np.ndarray or torch.Tensor, optional): Template array\n                to which input array is converted. Defaults to None.\n\n        Raises:\n            ValueError: If input is list or tuple and cannot be converted to a\n                NumPy array, a ValueError is raised.\n            TypeError: If input type does not belong to the above range, or the\n                contents of a list or tuple do not share the same data type, a\n                TypeError is raised.\n\n        Returns:\n            np.ndarray or torch.Tensor: The converted array.\n        \"\"\"\n        if isinstance(input_array, (list, tuple)):\n            try:\n                input_array = np.array(input_array)\n                if input_array.dtype not in self.SUPPORTED_NON_ARRAY_TYPES:\n                    raise TypeError\n            except (ValueError, TypeError):\n                print('The input cannot be converted to a single-type numpy '\n                      f'array:\\n{input_array}')\n                raise\n        elif isinstance(input_array, self.SUPPORTED_NON_ARRAY_TYPES):\n            input_array = np.array(input_array)\n        array_type = type(input_array)\n        assert target_type is not None or target_array is not None, \\\n            'must specify a target'\n        if target_type is not None:\n            assert target_type in (np.ndarray, torch.Tensor), \\\n                'invalid target type'\n            if target_type == array_type:\n                return input_array\n            elif target_type == np.ndarray:\n                # default dtype is float32\n                converted_array = input_array.cpu().numpy().astype(np.float32)\n            else:\n                # default dtype is float32, device is 'cpu'\n                converted_array = torch.tensor(input_array,\n                                               dtype=torch.float32)\n        else:\n            assert isinstance(target_array, (np.ndarray, torch.Tensor)), \\\n                'invalid target array type'\n            if isinstance(target_array, array_type):\n                return input_array\n            elif isinstance(target_array, np.ndarray):\n                converted_array = input_array.cpu().numpy().astype(\n                    target_array.dtype)\n            else:\n                converted_array = target_array.new_tensor(input_array)\n        return converted_array\n\n    def recover(\n        self, input_array: Union[np.ndarray, torch.Tensor]\n    ) -> Union[np.ndarray, torch.Tensor, int, float]:\n        \"\"\"Recover input type to original array type.\n\n        Args:\n            input_array (np.ndarray or torch.Tensor): Input array.\n\n        Returns:\n            np.ndarray or torch.Tensor or int or float: Converted array.\n        \"\"\"\n        assert isinstance(input_array, (np.ndarray, torch.Tensor)), \\\n            'invalid input array type'\n        if isinstance(input_array, self.array_type):\n  
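          # Input already matches the template's array type; return as-is.\n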
            return input_array\n        elif isinstance(input_array, torch.Tensor):\n            converted_array = input_array.cpu().numpy().astype(self.dtype)\n        else:\n            converted_array = torch.tensor(input_array,\n                                           dtype=self.dtype,\n                                           device=self.device)\n        if self.is_num:\n            converted_array = converted_array.item()\n        return converted_array\n
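\n\n# Usage sketch (illustrative; ``limit_period`` is a hypothetical example\n# function, and ``recover=True`` is assumed to be the decorator default):\n#\n#   @array_converter(to_torch=True, apply_to=('val',))\n#   def limit_period(val, offset=0.5, period=np.pi):\n#       # `val` arrives here as a torch.Tensor regardless of the input type.\n#       return val - torch.floor(val / period + offset) * period\n#\n#   limit_period([0.1, 7.0])  # list input comes back as np.ndarray\n"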
  },
  {
    "path": "bip3d/utils/default_color_map.py",
    "content": "DEFAULT_COLOR_MAP = {\n    'floor': [255, 193, 193],\n    'wall': [137, 54, 74],\n    'chair': [153, 69, 1],\n    'cabinet': [134, 199, 156],\n    'door': [92, 136, 89],\n    'table': [255.0, 187.0, 120.0],\n    'couch': [3, 95, 161],\n    'shelf': [255, 160, 98],\n    'window': [183, 121, 142],\n    'bed': [123, 104, 238],\n    'curtain': [210, 170, 100],\n    'plant': [163, 255, 0],\n    'stairs': [104, 84, 109],\n    'pillow': [76, 91, 113],\n    'counter': [146, 112, 198],\n    'bench': [250, 0, 30],\n    'rail': [230, 150, 140],\n    'sink': [135, 206, 250],\n    'mirror': [154, 208, 0],\n    'toilet': [0, 165, 120],\n    'refrigerator': [59, 105, 106],\n    'book': [142, 108, 45],\n    'tv': [183, 130, 88],\n    'blanket': [147, 211, 203],\n    'rack': [255, 208, 186],\n    'towel': [225, 199, 255],\n    'backpack': [255, 179, 240],\n    'roof': [92, 82, 55],\n    'bag': [209, 0, 151],\n    'board': [133, 129, 255],\n    'bicycle': [119, 11, 32],\n    'oven': [178, 90, 62],\n    'microwave': [79, 210, 114],\n    'desk': [109, 63, 54],\n    'doorframe': [199, 100, 0],\n    'wardrobe': [7, 246, 231],\n    'picture': [171, 134, 1],\n    'bathtub': [92, 0, 73],\n    'box': [188, 208, 182],\n    'stand': [146, 139, 141],\n    'clothes': [96, 96, 96],\n    'lamp': [107, 255, 200],\n    'dresser': [206, 186, 171],\n    'stool': [73, 77, 174],\n    'fireplace': [255, 77, 255],\n    'commode': [102, 102, 156],\n    'washing machine': [152, 251, 152],\n    'monitor': [208, 195, 210],\n    'window frame': [227, 255, 205],\n    'radiator': [191, 162, 208],\n    'mat': [250, 141, 255],\n    'shower': [154, 255, 154],\n    'ottoman': [95, 32, 0],\n    'column': [60, 143, 255],\n    'blinds': [134, 134, 103],\n    'stove': [128, 64, 128],\n    'bar': [72, 0, 118],\n    'pillar': [220, 20, 60],\n    'bin': [187, 255, 255],\n    'heater': [209, 226, 140],\n    'clothes dryer': [100, 170, 30],\n    'blackboard': [0, 82, 0],\n    'decoration': [107, 142, 35],\n    'steps': [120, 166, 157],\n    'windowsill': [9, 80, 61],\n    'cushion': [0, 228, 0],\n    'carpet': [175, 116, 175],\n    'copier': [241, 129, 0],\n    'countertop': [207, 138, 255],\n    'basket': [0, 0, 70],\n    'mailbox': [150, 100, 100],\n    'kitchen island': [220, 220, 0],\n    'washbasin': [0, 80, 100],\n    'drawer': [0, 220, 176],\n    'piano': [78, 180, 255],\n    'exercise equipment': [151, 0, 95],\n    'beam': [255, 255, 128],\n    'partition': [168, 171, 172],\n    'printer': [179, 0, 194],\n    'frame': [255, 180, 195],\n    'object': [0, 0, 0],\n    'adhesive tape': [0, 220, 176],\n    'air conditioner': [109, 63, 54],\n    'alarm': [0, 114, 143],\n    'album': [147, 186, 208],\n    'arch': [135, 158, 223],\n    'balcony': [70, 70, 70],\n    'ball': [96, 96, 96],\n    'banister': [196, 172, 0],\n    'barricade': [45, 89, 255],\n    'baseboard': [153, 69, 1],\n    'basin': [255.0, 187.0, 120.0],\n    'beanbag': [190, 153, 153],\n    'bidet': [123, 104, 238],\n    'body loofah': [196, 172, 0],\n    'boots': [134, 199, 156],\n    'bottle': [241, 129, 0],\n    'bowl': [92, 136, 89],\n    'bread': [119, 11, 32],\n    'broom': [0, 226, 252],\n    'brush': [255, 255, 128],\n    'bucket': [255, 73, 97],\n    'calendar': [76, 91, 113],\n    'camera': [72, 0, 118],\n    'can': [109, 63, 54],\n    'candle': [78, 180, 255],\n    'candlestick': [104, 84, 109],\n    'cap': [128, 76, 255],\n    'car': [107, 142, 35],\n    'cart': [255, 255, 128],\n    'case': [0, 0, 230],\n    'chandelier': [169, 164, 131],\n    'cleanser': 
[0, 165, 120],\n    'clock': [190, 153, 153],\n    'coat hanger': [179, 0, 194],\n    'coffee maker': [0, 82, 0],\n    'coil': [255, 179, 240],\n    'computer': [225, 199, 255],\n    'conducting wire': [150, 100, 100],\n    'container': [0, 0, 70],\n    'control': [255, 77, 255],\n    'cosmetics': [142, 108, 45],\n    'crate': [0, 226, 252],\n    'crib': [169, 164, 131],\n    'cube': [116, 112, 0],\n    'cup': [175, 116, 175],\n    'detergent': [255, 208, 186],\n    'device': [146, 139, 141],\n    'dish rack': [0, 0, 142],\n    'dishwasher': [92, 82, 55],\n    'dispenser': [95, 32, 0],\n    'divider': [219, 142, 185],\n    'door knob': [166, 74, 118],\n    'doorway': [134, 134, 103],\n    'dress': [0, 114, 143],\n    'drum': [107, 142, 35],\n    'duct': [0, 80, 100],\n    'dumbbell': [0, 0, 192],\n    'dustpan': [78, 180, 255],\n    'dvd': [0, 143, 149],\n    'eraser': [0, 82, 0],\n    'fan': [0, 0, 70],\n    'faucet': [84, 105, 51],\n    'fence': [190, 153, 153],\n    'file': [255, 228, 255],\n    'fire extinguisher': [107, 255, 200],\n    'flowerpot': [9, 80, 61],\n    'flush': [227, 255, 205],\n    'folder': [208, 229, 228],\n    'food': [109, 63, 54],\n    'footstool': [133, 129, 255],\n    'fruit': [179, 0, 194],\n    'furniture': [220, 20, 60],\n    'garage door': [217, 17, 255],\n    'garbage': [0, 82, 0],\n    'glass': [255, 99, 164],\n    'globe': [255, 77, 255],\n    'glove': [166, 196, 102],\n    'grab bar': [145, 148, 174],\n    'grass': [0, 60, 100],\n    'guitar': [73, 77, 174],\n    'hair dryer': [169, 164, 131],\n    'hamper': [241, 129, 0],\n    'handle': [142, 108, 45],\n    'hanger': [150, 100, 100],\n    'hat': [154, 208, 0],\n    'headboard': [171, 134, 1],\n    'headphones': [124, 74, 181],\n    'helmets': [209, 226, 140],\n    'holder': [151, 0, 95],\n    'hook': [92, 136, 89],\n    'humidifier': [209, 99, 106],\n    'ironware': [127, 167, 115],\n    'jacket': [255, 73, 97],\n    'jalousie': [255, 179, 240],\n    'jar': [106, 154, 176],\n    'kettle': [196, 172, 0],\n    'keyboard': [0, 125, 92],\n    'kitchenware': [74, 65, 105],\n    'knife': [70, 130, 180],\n    'label': [0, 228, 0],\n    'ladder': [0, 114, 143],\n    'laptop': [255, 180, 195],\n    'ledge': [58, 41, 149],\n    'letter': [0, 0, 192],\n    'light': [78, 180, 255],\n    'luggage': [0, 226, 252],\n    'machine': [197, 226, 255],\n    'magazine': [199, 100, 0],\n    'map': [183, 121, 142],\n    'mask': [74, 65, 105],\n    'mattress': [255, 179, 240],\n    'menu': [255, 255, 128],\n    'molding': [104, 84, 109],\n    'mop': [199, 100, 0],\n    'mouse': [5, 121, 0],\n    'napkins': [165, 42, 42],\n    'notebook': [175, 116, 175],\n    'pack': [0, 143, 149],\n    'package': [166, 196, 102],\n    'pad': [208, 229, 228],\n    'pan': [209, 99, 106],\n    'panel': [201, 57, 1],\n    'paper': [255, 179, 240],\n    'paper cutter': [207, 138, 255],\n    'pedestal': [64, 170, 64],\n    'pen': [193, 0, 92],\n    'person': [7, 246, 231],\n    'pipe': [255, 180, 195],\n    'pitcher': [220, 20, 60],\n    'plate': [142, 108, 45],\n    'player': [0, 143, 149],\n    'plug': [255, 77, 255],\n    'plunger': [165, 42, 42],\n    'pool': [153, 69, 1],\n    'pool table': [0, 0, 230],\n    'poster': [130, 114, 135],\n    'pot': [96, 36, 108],\n    'price tag': [255, 77, 255],\n    'projector': [179, 0, 194],\n    'purse': [0, 228, 0],\n    'radio': [116, 112, 0],\n    'range hood': [199, 100, 0],\n    'remote control': [188, 208, 182],\n    'ridge': [59, 105, 106],\n    'rod': [207, 138, 255],\n    'roll': [123, 104, 238],\n  
  'rope': [110, 76, 0],\n    'sack': [190, 153, 153],\n    'salt': [250, 0, 30],\n    'scale': [58, 41, 149],\n    'scissors': [60, 143, 255],\n    'screen': [0, 82, 0],\n    'seasoning': [254, 212, 124],\n    'shampoo': [70, 130, 180],\n    'sheet': [151, 0, 95],\n    'shirt': [190, 153, 153],\n    'shoe': [199, 100, 0],\n    'shovel': [241, 129, 0],\n    'sign': [208, 195, 210],\n    'soap': [109, 63, 54],\n    'soap dish': [166, 74, 118],\n    'soap dispenser': [95, 32, 0],\n    'socket': [255, 255, 255],\n    'speaker': [65, 70, 15],\n    'sponge': [0, 220, 176],\n    'spoon': [134, 134, 103],\n    'stall': [0, 60, 100],\n    'stapler': [246, 0, 122],\n    'statue': [196, 172, 0],\n    'stick': [0, 165, 120],\n    'stopcock': [0, 60, 100],\n    'structure': [220, 20, 60],\n    'sunglasses': [142, 108, 45],\n    'support': [209, 226, 140],\n    'switch': [7, 246, 231],\n    'tablet': [137, 54, 74],\n    'teapot': [0, 80, 100],\n    'telephone': [220, 220, 0],\n    'thermostat': [128, 76, 255],\n    'tissue': [73, 77, 174],\n    'tissue box': [96, 96, 96],\n    'toaster': [106, 0, 228],\n    'toilet paper': [84, 105, 51],\n    'toiletry': [128, 64, 128],\n    'tool': [220, 20, 60],\n    'toothbrush': [130, 114, 135],\n    'toothpaste': [0, 143, 149],\n    'toy': [255.0, 187.0, 120.0],\n    'tray': [255, 179, 240],\n    'treadmill': [166, 74, 118],\n    'trophy': [0, 220, 176],\n    'tube': [255, 255, 128],\n    'umbrella': [250, 0, 30],\n    'urn': [152, 251, 152],\n    'utensil': [220, 220, 0],\n    'vacuum cleaner': [96, 36, 108],\n    'vanity': [5, 121, 0],\n    'vase': [255, 193, 193],\n    'vent': [209, 226, 140],\n    'ventilation': [123, 104, 238],\n    'water cooler': [255, 255, 128],\n    'water heater': [145, 148, 174],\n    'wine': [220, 220, 0],\n    'wire': [96, 36, 108],\n    'wood': [127, 167, 115],\n    'wrap': [175, 116, 175],\n}\n"
  },
  {
    "path": "bip3d/utils/dist_utils.py",
    "content": "import torch.distributed as dist\n\n\ndef reduce_mean(tensor):\n    \"\"\"\"Obtain the mean of tensor on different GPUs.\"\"\"\n    if not (dist.is_available() and dist.is_initialized()):\n        return tensor\n    tensor = tensor.clone()\n    dist.all_reduce(tensor.div_(dist.get_world_size()), op=dist.ReduceOp.SUM)\n    return tensor\n"
  },
  {
    "path": "bip3d/utils/line_mesh.py",
    "content": "\"\"\"Adapted from https://github.com/isl-\norg/Open3D/pull/738#issuecomment-564785941.\n\nModule which creates mesh lines from a line set\nOpen3D relies upon using glLineWidth to set line width on a LineSet\nHowever, this method is now deprecated and not fully supported in\nnewer OpenGL versions.\nSee:\n    Open3D Github Pull Request:\n        https://github.com/intel-isl/Open3D/pull/738\n    Other Framework Issues:\n        https://github.com/openframeworks/openFrameworks/issues/3460\n\nThis module aims to solve this by converting a line into a triangular mesh\n(which has thickness). The basic idea is to create a cylinder for each line\nsegment, translate it, and then rotate it.\n\nLicense: MIT\n\"\"\"\nimport numpy as np\nimport open3d as o3d\n\n\ndef align_vector_to_another(a=np.array([0, 0, 1]), b=np.array([1, 0, 0])):\n    \"\"\"Aligns vector a to vector b with axis angle rotation.\"\"\"\n    if np.array_equal(a, b):\n        return None, None\n    axis_ = np.cross(a, b)\n    axis_ = axis_ / np.linalg.norm(axis_)\n    angle = np.arccos(np.dot(a, b))\n\n    return axis_, angle\n\n\ndef normalized(a, axis=-1, order=2):\n    \"\"\"Normalizes a numpy array of points.\"\"\"\n    l2 = np.atleast_1d(np.linalg.norm(a, order, axis))\n    l2[l2 == 0] = 1\n    return a / np.expand_dims(l2, axis), l2\n\n\nclass LineMesh(object):\n\n    def __init__(self, points, lines=None, colors=[0, 1, 0], radius=0.15):\n        \"\"\"Creates a line represented as sequence of cylinder triangular\n        meshes.\n\n        Arguments:\n            points {ndarray} -- Numpy array of ponts Nx3.\n\n        Keyword Arguments:\n            lines {list[list] or None} -- List of point index pairs denoting\n                line segments. If None, implicit lines from ordered pairwise\n                points. 
(default: {None})\n            colors {list} -- list of colors, or single color of the line\n                (default: {[0, 1, 0]})\n            radius {float} -- radius of cylinder (default: {0.15})\n        \"\"\"\n        self.points = np.array(points)\n        self.lines = np.array(\n            lines) if lines is not None else self.lines_from_ordered_points(\n                self.points)\n        self.colors = np.array(colors)\n        self.radius = radius\n        self.cylinder_segments = []\n\n        self.create_line_mesh()\n\n    @staticmethod\n    def lines_from_ordered_points(points):\n        lines = [[i, i + 1] for i in range(0, points.shape[0] - 1, 1)]\n        return np.array(lines)\n\n    def create_line_mesh(self):\n        first_points = self.points[self.lines[:, 0], :]\n        second_points = self.points[self.lines[:, 1], :]\n        line_segments = second_points - first_points\n        line_segments_unit, line_lengths = normalized(line_segments)\n\n        z_axis = np.array([0, 0, 1])\n        # Create triangular mesh cylinder segments of line\n        for i in range(line_segments_unit.shape[0]):\n            line_segment = line_segments_unit[i, :]\n            line_length = line_lengths[i]\n            # get axis angle rotation to align cylinder with line segment\n            axis, angle = align_vector_to_another(z_axis, line_segment)\n            # Get translation vector\n            translation = first_points[i, :] + \\\n                line_segment * line_length * 0.5\n            # create cylinder and apply transformations\n            cylinder_segment = o3d.geometry.TriangleMesh.create_cylinder(\n                self.radius, line_length)\n            cylinder_segment = cylinder_segment.translate(translation,\n                                                          relative=False)\n            if axis is not None:\n                axis_a = axis * angle\n                cylinder_segment = cylinder_segment.rotate(\n                    R=o3d.geometry.get_rotation_matrix_from_axis_angle(\n                        axis_a))  # center=True)\n                # cylinder_segment = cylinder_segment.rotate(\n                #   axis_a, center=True,\n                #   type=o3d.geometry.RotationType.AxisAngle)\n            # color cylinder\n            color = self.colors if self.colors.ndim == 1 \\\n                else self.colors[i, :]\n            cylinder_segment.paint_uniform_color(color)\n\n            self.cylinder_segments.append(cylinder_segment)\n\n    def add_line(self, vis):\n        \"\"\"Adds this line to the visualizer.\"\"\"\n        for cylinder in self.cylinder_segments:\n            vis.add_geometry(cylinder)\n\n    def remove_line(self, vis):\n        \"\"\"Removes this line from the visualizer.\"\"\"\n        for cylinder in self.cylinder_segments:\n            vis.remove_geometry(cylinder)\n\n\nif __name__ == '__main__':\n\n    def main():\n        print('Demonstrating LineMesh vs LineSet')\n        # Create Line Set\n        points = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1],\n                  [1, 0, 1], [0, 1, 1], [1, 1, 1]]\n        lines = [[0, 1], [0, 2], [1, 3], [2, 3], [4, 5], [4, 6], [5, 7],\n                 [6, 7], [0, 4], [1, 5], [2, 6], [3, 7]]\n        colors = [[1, 0, 0] for i in range(len(lines))]\n\n        line_set = o3d.geometry.LineSet()\n        line_set.points = o3d.utility.Vector3dVector(points)\n        line_set.lines = o3d.utility.Vector2iVector(lines)\n        line_set.colors = 
o3d.utility.Vector3dVector(colors)\n\n        # Create Line Mesh 1\n        points = np.array(points) + [0, 0, 2]\n        line_mesh1 = LineMesh(points, lines, colors, radius=0.02)\n        line_mesh1_geoms = line_mesh1.cylinder_segments\n\n        # Create Line Mesh 2\n        points = np.array(points) + [0, 2, 0]\n        line_mesh2 = LineMesh(points, radius=0.03)\n        line_mesh2_geoms = line_mesh2.cylinder_segments\n\n        o3d.visualization.draw_geometries(\n            [line_set, *line_mesh1_geoms, *line_mesh2_geoms])\n\n    main()\n"
  },
  {
    "path": "bip3d/utils/typing_config.py",
    "content": "from collections.abc import Sized\nfrom typing import Dict, List, Optional, Tuple, Union\n\nimport numpy as np\nimport torch\nfrom mmdet.models.task_modules.samplers import SamplingResult\nfrom mmengine.config import ConfigDict\nfrom mmengine.structures import BaseDataElement, InstanceData\n\n\nclass Det3DDataElement(BaseDataElement):\n\n    @property\n    def gt_instances_3d(self) -> InstanceData:\n        return self._gt_instances_3d\n\n    @gt_instances_3d.setter\n    def gt_instances_3d(self, value: InstanceData) -> None:\n        self.set_field(value, '_gt_instances_3d', dtype=InstanceData)\n\n    @gt_instances_3d.deleter\n    def gt_instances_3d(self) -> None:\n        del self._gt_instances_3d\n\n    @property\n    def pred_instances_3d(self) -> InstanceData:\n        return self._pred_instances_3d\n\n    @pred_instances_3d.setter\n    def pred_instances_3d(self, value: InstanceData) -> None:\n        self.set_field(value, '_pred_instances_3d', dtype=InstanceData)\n\n    @pred_instances_3d.deleter\n    def pred_instances_3d(self) -> None:\n        del self._pred_instances_3d\n\n\nIndexType = Union[str, slice, int, list, torch.LongTensor,\n                  torch.cuda.LongTensor, torch.BoolTensor,\n                  torch.cuda.BoolTensor, np.ndarray]\n\n\nclass PointData(BaseDataElement):\n    \"\"\"Data structure for point-level annotations or predictions.\n\n    All data items in ``data_fields`` of ``PointData`` meet the following\n    requirements:\n\n    - They are all one dimension.\n    - They should have the same length.\n\n    `PointData` is used to save point-level semantic and instance mask,\n    it also can save `instances_labels` and `instances_scores` temporarily.\n    In the future, we would consider to move the instance-level info into\n    `gt_instances_3d` and `pred_instances_3d`.\n\n    Examples:\n        >>> metainfo = dict(\n        ...     sample_idx=random.randint(0, 100))\n        >>> points = np.random.randint(0, 255, (100, 3))\n        >>> point_data = PointData(metainfo=metainfo,\n        ...                        
points=points)\n        >>> print(len(point_data))\n        100\n\n        >>> # slice\n        >>> slice_data = point_data[10:60]\n        >>> assert len(slice_data) == 50\n\n        >>> # set\n        >>> point_data.pts_semantic_mask = torch.randint(0, 255, (100,))\n        >>> point_data.pts_instance_mask = torch.randint(0, 255, (100,))\n        >>> assert tuple(point_data.pts_semantic_mask.shape) == (100,)\n        >>> assert tuple(point_data.pts_instance_mask.shape) == (100,)\n    \"\"\"\n\n    def __setattr__(self, name: str, value: Sized) -> None:\n        \"\"\"setattr is only used to set data.\n\n        The value must have the `__len__` attribute and the same length as\n        `PointData`.\n        \"\"\"\n        if name in ('_metainfo_fields', '_data_fields'):\n            if not hasattr(self, name):\n                super().__setattr__(name, value)\n            else:\n                raise AttributeError(f'{name} has been used as a '\n                                     'private attribute, which is immutable.')\n\n        else:\n            assert isinstance(value,\n                              Sized), 'value must contain `__len__` attribute'\n            # TODO: make sure the input value shares the same length\n            super().__setattr__(name, value)\n\n    __setitem__ = __setattr__\n\n    def __getitem__(self, item: IndexType) -> 'PointData':\n        \"\"\"\n        Args:\n            item (str, int, list, :obj:`slice`, :obj:`numpy.ndarray`,\n                :obj:`torch.LongTensor`, :obj:`torch.BoolTensor`):\n                Get the corresponding values according to item.\n\n        Returns:\n            :obj:`PointData`: Corresponding values.\n        \"\"\"\n        if isinstance(item, list):\n            item = np.array(item)\n        if isinstance(item, np.ndarray):\n            # The default int type of numpy is platform dependent, int32 for\n            # Windows and int64 for Linux. 
`torch.Tensor` requires the index\n            # to be int64, therefore we simply convert it to int64 here.\n            # More details in https://github.com/numpy/numpy/issues/9464\n            item = item.astype(np.int64) if item.dtype == np.int32 else item\n            item = torch.from_numpy(item)\n        assert isinstance(\n            item, (str, slice, int, torch.LongTensor, torch.cuda.LongTensor,\n                   torch.BoolTensor, torch.cuda.BoolTensor))\n\n        if isinstance(item, str):\n            return getattr(self, item)\n\n        if isinstance(item, int):\n            if item >= len(self) or item < -len(self):  # type: ignore\n                raise IndexError(f'Index {item} out of range!')\n            else:\n                # keep the dimension\n                item = slice(item, None, len(self))\n\n        new_data = self.__class__(metainfo=self.metainfo)\n        if isinstance(item, torch.Tensor):\n            assert item.dim() == 1, 'Only support getting the' \\\n                                    ' values along the first dimension.'\n            if isinstance(item, (torch.BoolTensor, torch.cuda.BoolTensor)):\n                assert len(item) == len(self), 'The shape of the ' \\\n                                               'input(BoolTensor) ' \\\n                                               f'{len(item)} ' \\\n                                               'does not match the shape ' \\\n                                               'of the indexed tensor ' \\\n                                               'in results_field ' \\\n                                               f'{len(self)} at ' \\\n                                               'first dimension.'\n\n            for k, v in self.items():\n                if isinstance(v, torch.Tensor):\n                    new_data[k] = v[item]\n                elif isinstance(v, np.ndarray):\n                    new_data[k] = v[item.cpu().numpy()]\n                elif isinstance(\n                        v, (str, list, tuple)) or (hasattr(v, '__getitem__')\n                                                   and hasattr(v, 'cat')):\n                    # convert to indexes from BoolTensor\n                    if isinstance(item,\n                                  (torch.BoolTensor, torch.cuda.BoolTensor)):\n                        indexes = torch.nonzero(item).view(\n                            -1).cpu().numpy().tolist()\n                    else:\n                        indexes = item.cpu().numpy().tolist()\n                    slice_list = []\n                    if indexes:\n                        for index in indexes:\n                            slice_list.append(slice(index, None, len(v)))\n                    else:\n                        slice_list.append(slice(None, 0, None))\n                    r_list = [v[s] for s in slice_list]\n                    if isinstance(v, (str, list, tuple)):\n                        new_value = r_list[0]\n                        for r in r_list[1:]:\n                            new_value = new_value + r\n                    else:\n                        new_value = v.cat(r_list)\n                    new_data[k] = new_value\n                else:\n                    raise ValueError(\n                        f'The type of `{k}` is `{type(v)}`, which has no '\n                        'attribute of `cat`, so it does not '\n                        'support slice with `bool`')\n        else:\n            # item is a slice\n            for k, v in 
self.items():\n                new_data[k] = v[item]\n        return new_data  # type: ignore\n\n    def __len__(self) -> int:\n        \"\"\"int: The length of `PointData`.\"\"\"\n        if len(self._data_fields) > 0:\n            return len(self.values()[0])\n        else:\n            return 0\n\n\n# Type hint of config data\nConfigType = Union[ConfigDict, dict]\nOptConfigType = Optional[ConfigType]\n\n# Type hint of one or more config data\nMultiConfig = Union[ConfigType, List[ConfigType]]\nOptMultiConfig = Optional[MultiConfig]\n\nInstanceList = List[InstanceData]\nOptInstanceList = Optional[InstanceList]\nForwardResults = Union[Dict[str, torch.Tensor], List[Det3DDataElement],\n                       Tuple[torch.Tensor], torch.Tensor]\n\nSamplingResultList = List[SamplingResult]\n\nOptSamplingResultList = Optional[SamplingResultList]\nSampleList = List[Det3DDataElement]\nOptSampleList = Optional[SampleList]\n"
  },
  {
    "path": "configs/__init__.py",
    "content": ""
  },
  {
    "path": "configs/bip3d_det.py",
    "content": "_base_ = [\"./default_runtime.py\"]\n\nimport os\nfrom bip3d.datasets.embodiedscan_det_grounding_dataset import (\n    class_names, head_labels, common_labels, tail_labels\n)\n\nDEBUG = os.environ.get(\"CLUSTER\") is None\ndel os\n\nbackend_args = None\n\nmetainfo = dict(classes=\"all\")\n\ndepth = True\ndepth_loss = True\n\nz_range=[-0.2, 3]\nmin_depth = 0.25\nmax_depth = 10\nnum_depth = 64\n\nmodel = dict(\n    type=\"BIP3D\",\n    input_3d=\"depth_img\",\n    use_depth_grid_mask=True,\n    data_preprocessor=dict(\n        type=\"CustomDet3DDataPreprocessor\",\n        mean=[123.675, 116.28, 103.53],\n        std=[58.395, 57.12, 57.375],\n        bgr_to_rgb=True,\n        pad_size_divisor=32,\n    ),\n    backbone=dict(\n        type=\"mmdet.SwinTransformer\",\n        embed_dims=96,\n        depths=[2, 2, 6, 2],\n        num_heads=[3, 6, 12, 24],\n        window_size=7,\n        mlp_ratio=4,\n        qkv_bias=True,\n        qk_scale=None,\n        drop_rate=0.0,\n        attn_drop_rate=0.0,\n        drop_path_rate=0.2,\n        patch_norm=True,\n        out_indices=(1, 2, 3),\n        with_cp=True,\n        convert_weights=False,\n    ),\n    neck=dict(\n        type=\"mmdet.ChannelMapper\",\n        in_channels=[192, 384, 768],\n        kernel_size=1,\n        out_channels=256,\n        act_cfg=None,\n        bias=True,\n        norm_cfg=dict(type=\"GN\", num_groups=32),\n        num_outs=4,\n    ),\n    text_encoder=dict(\n        type=\"BertModel\",\n        special_tokens_list=[\"[CLS]\", \"[SEP]\"],\n        name=\"./ckpt/bert-base-uncased\",\n        pad_to_max=False,\n        use_sub_sentence_represent=True,\n        add_pooling_layer=False,\n        max_tokens=768,\n        use_checkpoint=True,\n        return_tokenized=True,\n    ),\n    backbone_3d=(\n        dict(\n            type=\"mmdet.ResNet\",\n            depth=34,\n            in_channels=1,\n            base_channels=4,\n            num_stages=4,\n            out_indices=(1, 2, 3),\n            norm_cfg=dict(type=\"BN\", requires_grad=True),\n            norm_eval=True,\n            with_cp=True,\n            style=\"pytorch\",\n        )\n        if depth\n        else None\n    ),\n    neck_3d=(\n        dict(\n            type=\"mmdet.ChannelMapper\",\n            in_channels=[8, 16, 32],\n            kernel_size=1,\n            out_channels=32,\n            act_cfg=None,\n            bias=True,\n            norm_cfg=dict(type=\"GN\", num_groups=4),\n            num_outs=4,\n        )\n        if depth\n        else None\n    ),\n    feature_enhancer=dict(\n        type=\"TextImageDeformable2DEnhancer\",\n        num_layers=6,\n        text_img_attn_block=dict(\n            v_dim=256, l_dim=256, embed_dim=1024, num_heads=4, init_values=1e-4\n        ),\n        img_attn_block=dict(\n            self_attn_cfg=dict(\n                embed_dims=256, num_levels=4, dropout=0.0, im2col_step=64\n            ),\n            ffn_cfg=dict(\n                embed_dims=256, feedforward_channels=2048, ffn_drop=0.0\n            ),\n        ),\n        text_attn_block=dict(\n            self_attn_cfg=dict(num_heads=4, embed_dims=256, dropout=0.0),\n            ffn_cfg=dict(\n                embed_dims=256, feedforward_channels=1024, ffn_drop=0.0\n            ),\n        ),\n        embed_dims=256,\n        num_feature_levels=4,\n        positional_encoding=dict(\n            num_feats=128, normalize=True, offset=0.0, temperature=20\n        ),\n    ),\n    spatial_enhancer=dict(\n        
type=\"DepthFusionSpatialEnhancer\",\n        embed_dims=256,\n        feature_3d_dim=32,\n        num_depth_layers=2,\n        min_depth=min_depth,\n        max_depth=max_depth,\n        num_depth=num_depth,\n        with_feature_3d=depth,\n        loss_depth_weight=1.0 if depth_loss else -1,\n    ),\n    decoder=dict(\n        type=\"BBox3DDecoder\",\n        look_forward_twice=True,\n        instance_bank=dict(\n            type=\"InstanceBank\",\n            num_anchor=50,\n            anchor=\"anchor_files/embodiedscan_kmeans.npy\",\n            embed_dims=256,\n            anchor_in_camera=True,\n        ),\n        anchor_encoder=dict(\n            type=\"DoF9BoxEncoder\",\n            embed_dims=256,\n            rot_dims=3,\n        ),\n        graph_model=dict(\n            type=\"MultiheadAttention\",\n            embed_dims=256,\n            num_heads=8,\n            dropout=0.0,\n            batch_first=True,\n        ),\n        ffn=dict(\n            type=\"FFN\",\n            embed_dims=256,\n            feedforward_channels=2048,\n            ffn_drop=0.0,\n        ),\n        norm_layer=dict(type=\"LN\", normalized_shape=256),\n        deformable_model=dict(\n            type=\"DeformableFeatureAggregation\",\n            embed_dims=256,\n            num_groups=8,\n            num_levels=4,\n            use_camera_embed=True,\n            use_deformable_func=True,\n            with_depth=True,\n            min_depth=min_depth,\n            max_depth=max_depth,\n            kps_generator=dict(\n                type=\"SparseBox3DKeyPointsGenerator\",\n                fix_scale=[\n                    [0, 0, 0],\n                    [0.45, 0, 0],\n                    [-0.45, 0, 0],\n                    [0, 0.45, 0],\n                    [0, -0.45, 0],\n                    [0, 0, 0.45],\n                    [0, 0, -0.45],\n                ],\n                num_learnable_pts=9,\n            ),\n            with_value_proj=True,\n            filter_outlier=True,\n        ),\n        text_cross_attn=dict(\n            type=\"MultiheadAttention\",\n            embed_dims=256,\n            num_heads=8,\n            dropout=0.0,\n            batch_first=True,\n        ),\n        refine_layer=dict(\n            type=\"GroundingRefineClsHead\",\n            embed_dims=256,\n            output_dim=9,\n            cls_bias=True,\n            # cls_layers=True,\n        ),\n        loss_cls=dict(\n            type=\"mmdet.FocalLoss\",\n            use_sigmoid=True,\n            gamma=2.0,\n            alpha=0.25,\n            loss_weight=1.0,\n        ),\n        loss_reg=dict(\n            type=\"DoF9BoxLoss\",\n            loss_weight_wd=1.0,\n            loss_weight_pcd=0.0,\n            loss_weight_cd=0.8,\n        ),\n        sampler=dict(\n            type=\"Grounding3DTarget\",\n            cls_weight=1.0,\n            box_weight=1.0,\n            num_dn=100,\n            cost_weight_wd=1.0,\n            cost_weight_pcd=0.0,\n            cost_weight_cd=0.8,\n            with_dn_query=True,\n            num_classes=284,\n            embed_dims=256,\n        ),\n        gt_reg_key=\"gt_bboxes_3d\",\n        gt_cls_key=\"tokens_positive\",\n        post_processor=dict(\n            type=\"GroundingBox3DPostProcess\",\n            num_output=1000,\n        ),\n        with_instance_id=False,\n    ),\n)\n\ndataset_type = \"EmbodiedScanDetGroundingDataset\"\ndata_root = \"data\"\nmetainfo = dict(\n    classes=class_names,\n    classes_split=(head_labels, common_labels, 
tail_labels),\n    box_type_3d=\"euler-depth\",\n)\n\n\nimage_wh = (512, 512)\n\nrotate_3rscan = True\nsep_token = \"[SEP]\"\ncam_standardization = True\nif cam_standardization:\n    resize = dict(\n        type=\"CamIntrisicStandardization\",\n        dst_intrinsic=[\n            [432.57943431339237, 0.0, 256],\n            [0.0, 539.8570854208559, 256],\n            [0.0, 0.0, 1.0],\n        ],\n        dst_wh=image_wh,\n    )\nelse:\n    resize = dict(type=\"CustomResize\", scale=image_wh, keep_ratio=False)\n\ntrain_pipeline = [\n    dict(type=\"LoadAnnotations3D\"),\n    dict(\n        type=\"MultiViewPipeline\",\n        n_images=1,\n        max_n_images=18,\n        transforms=[\n            dict(type=\"LoadImageFromFile\", backend_args=backend_args),\n            dict(type=\"LoadDepthFromFile\", backend_args=backend_args),\n            resize,\n            dict(\n                type=\"ResizeCropFlipImage\",\n                data_aug_conf={\n                    \"resize_lim\": (1.0, 1.0),\n                    \"final_dim\": image_wh,\n                    \"bot_pct_lim\": (0.0, 0.05),\n                    \"rot_lim\": (0, 0),\n                    \"H\": image_wh[1],\n                    \"W\": image_wh[0],\n                    \"rand_flip\": False,\n                },\n            ),\n        ],\n        rotate_3rscan=rotate_3rscan,\n        ordered=False,\n    ),\n    dict(\n        type=\"CategoryGroundingDataPrepare\",\n        classes=class_names,\n        filter_others=True,\n        sep_token=sep_token,\n        max_class=128,\n        training=True,\n        z_range=z_range,\n    ),\n    dict(\n        type=\"Pack3DDetInputs\",\n        keys=[\"img\", \"depth_img\", \"gt_bboxes_3d\", \"gt_labels_3d\"],\n    ),\n]\nif depth_loss:\n    train_pipeline.append(\n        dict(\n            type=\"DepthProbLabelGenerator\",\n            origin_stride=4,\n            min_depth=min_depth,\n            max_depth=max_depth,\n            num_depth=num_depth,\n        ),\n    )\n\ntest_pipeline = [\n    dict(type=\"LoadAnnotations3D\"),\n    dict(\n        type=\"MultiViewPipeline\",\n        n_images=1,\n        max_n_images=50,\n        ordered=True,\n        transforms=[\n            dict(type=\"LoadImageFromFile\", backend_args=backend_args),\n            dict(type=\"LoadDepthFromFile\", backend_args=backend_args),\n            resize,\n        ],\n        rotate_3rscan=rotate_3rscan,\n    ),\n    dict(\n        type=\"CategoryGroundingDataPrepare\",\n        classes=class_names,\n        sep_token=sep_token,\n        training=False,\n        filter_others=False,\n    ),\n    dict(\n        type=\"Pack3DDetInputs\",\n        keys=[\"img\", \"depth_img\", \"gt_bboxes_3d\", \"gt_labels_3d\"],\n    ),\n]\n\ntrainval = False\ndata_version = \"v1\"\n\nif data_version == \"v1\":\n    train_ann_file = \"embodiedscan/embodiedscan_infos_train.pkl\"\n    val_ann_file = \"embodiedscan/embodiedscan_infos_val.pkl\"\nelif data_version == \"v2\":\n    train_ann_file = \"embodiedscan-v2/embodiedscan_infos_train.pkl\"\n    val_ann_file = \"embodiedscan-v2/embodiedscan_infos_val.pkl\"\nelse:\n    assert False\n\ntrain_dataset = dict(\n    type=dataset_type,\n    data_root=data_root,\n    ann_file=train_ann_file,\n    pipeline=train_pipeline,\n    test_mode=False,\n    filter_empty_gt=True,\n    box_type_3d=\"Euler-Depth\",\n    metainfo=metainfo,\n)\n\n\nif trainval:\n    train_dataset = dict(\n        type=\"ConcatDataset\",\n        datasets=[\n            train_dataset,\n            dict(\n           
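     # trainval: additionally include the validation split in training.\n           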
     type=dataset_type,\n                data_root=data_root,\n                ann_file=val_ann_file,\n                pipeline=train_pipeline,\n                test_mode=False,\n                filter_empty_gt=True,\n                box_type_3d=\"Euler-Depth\",\n                metainfo=metainfo,\n            )\n        ]\n    )\n\n\ntrain_dataloader = dict(\n    batch_size=1,\n    num_workers=4 if not DEBUG else 0,\n    persistent_workers=False,\n    sampler=dict(type=\"DefaultSampler\", shuffle=True),\n    dataset=dict(\n        type=\"RepeatDataset\",\n        times=10,\n        dataset=train_dataset,\n    ),\n)\n\n    \nval_dataloader = dict(\n    batch_size=1,\n    num_workers=4,\n    persistent_workers=False,\n    drop_last=False,\n    sampler=dict(type=\"DefaultSampler\", shuffle=False),\n    dataset=dict(\n        type=dataset_type,\n        data_root=data_root,\n        ann_file=val_ann_file,\n        pipeline=test_pipeline,\n        test_mode=True,\n        filter_empty_gt=True,\n        box_type_3d=\"Euler-Depth\",\n        metainfo=metainfo,\n    ),\n)\ntest_dataloader = val_dataloader\n\nval_evaluator = dict(\n    type=\"IndoorDetMetric\",\n    collect_dir=\"/job_data/.dist_test\" if not DEBUG else None,\n    # collect_device=\"gpu\"\n)\ntest_evaluator = val_evaluator\n\nmax_epochs = 24\ntrain_cfg = dict(\n    type=\"EpochBasedTrainLoop\", max_epochs=max_epochs, val_interval=1\n)\nval_cfg = dict(type=\"ValLoop\")\ntest_cfg = dict(type=\"TestLoop\")\n\nlr = 2e-4\noptim_wrapper = dict(\n    type=\"OptimWrapper\",\n    optimizer=dict(type=\"AdamW\", lr=lr, weight_decay=0.0005),\n    paramwise_cfg=dict(\n        custom_keys={\n            \"backbone.\": dict(lr_mult=0.1),\n            \"text_encoder\": dict(lr_mult=0.05),\n            \"absolute_pos_embed\": dict(decay_mult=0.0),\n        }\n    ),\n    clip_grad=dict(max_norm=10, norm_type=2),\n)\n\n\n# learning rate\nparam_scheduler = [\n    dict(\n        type=\"LinearLR\", start_factor=0.001, by_epoch=False, begin=0, end=500\n    ),\n    dict(\n        type=\"MultiStepLR\",\n        begin=0,\n        end=max_epochs,\n        by_epoch=True,\n        milestones=[int(max_epochs / 12 * 8), int(max_epochs / 12 * 11)],\n        gamma=0.1,\n    ),\n]\n\ncustom_hooks = [dict(type=\"EmptyCacheHook\", after_iter=False)]\ndefault_hooks = dict(\n    checkpoint=dict(type=\"CheckpointHook\", interval=1, max_keep_ckpts=3),\n)\n\nvis_backends = [\n    dict(\n        type=\"TensorboardVisBackend\",\n        save_dir=\"/job_tboard\" if not DEBUG else \"./work-dir\",\n    ),\n]\n\nvisualizer = dict(\n    type=\"Visualizer\",\n    vis_backends=vis_backends,\n    name=\"visualizer\",\n)\n\nload_from = \"ckpt/groundingdino_swint_ogc_mmdet-822d7e9d-rename.pth\"\n"
  },
  {
    "path": "configs/bip3d_det_grounding.py",
    "content": "_base_ = [\"./bip3d_grounding.py\"]\nfrom mmengine import Config\nimport os\ndet_config = Config.fromfile(\"configs/bip3d_det.py\")\ndet_train_dataset = det_config.train_dataset\ndel Config, det_config\n\ntrain_dataloader = _base_.train_dataloader\nDEBUG = os.environ.get(\"CLUSTER\") is None\n\ntrain_dataloader[\"dataset\"] = dict(\n    type=\"ConcatDataset\",\n    datasets=[\n        dict(\n            type=\"RepeatDataset\",\n            times=20,\n            dataset=det_train_dataset,\n        ),\n        train_dataloader[\"dataset\"],\n    ]\n)\n\n\nmax_iters = 50000\ntrain_cfg = dict(\n    type=\"IterBasedTrainLoop\", max_iters=max_iters, val_interval=25000,\n    _delete_=True,\n)\nval_cfg = dict(type=\"ValLoop\")\ntest_cfg = dict(type=\"TestLoop\")\n\nlr = 2e-4\noptim_wrapper = dict(\n    type=\"OptimWrapper\",\n    optimizer=dict(type=\"AdamW\", lr=lr, weight_decay=0.0005),\n    paramwise_cfg=dict(\n        custom_keys={\n            \"backbone.\": dict(lr_mult=0.1),\n            \"text_encoder\": dict(lr_mult=0.05),\n            \"absolute_pos_embed\": dict(decay_mult=0.0),\n        }\n    ),\n    clip_grad=dict(max_norm=10, norm_type=2),\n)\n\n\n# learning rate\nparam_scheduler = [\n    dict(\n        type=\"LinearLR\", start_factor=0.001, by_epoch=False, begin=0, end=500\n    ),\n    dict(\n        type=\"MultiStepLR\",\n        begin=0,\n        end=max_iters,\n        by_epoch=False,\n        milestones=[int(max_iters*0.8), int(max_iters*0.95)],\n        gamma=0.1,\n    ),\n]\n\ncustom_hooks = [dict(type=\"EmptyCacheHook\", after_iter=False)]\ndefault_hooks = dict(\n    logger=dict(\n        type=\"LoggerHook\",\n        interval=25,\n        log_metric_by_epoch=False,\n    ),\n    checkpoint=dict(\n        type=\"CheckpointHook\", interval=25000, max_keep_ckpts=3, by_epoch=False\n    ),\n)\n\nvis_backends = [\n    dict(\n        type=\"TensorboardVisBackend\",\n        save_dir=\"/job_tboard\" if not DEBUG else \"./work-dir\",\n    ),\n]\n\nvisualizer = dict(\n    type=\"Visualizer\",\n    vis_backends=vis_backends,\n    name=\"visualizer\",\n)\nlog_processor = dict(type=\"LogProcessor\", window_size=50, by_epoch=False)\n\nload_from = \"ckpt/bip3d_det.pth\"\n"
  },
  {
    "path": "configs/bip3d_det_rgb.py",
    "content": "_base_ = [\"./bip3d_det.py\"]\n\nmodel = dict(\n    backbone_3d=None,\n    neck_3d=None,\n    spatial_enhancer=dict(with_feature_3d=False),\n    decoder=dict(deformable_model=dict(with_depth=False)),\n)\n"
  },
  {
    "path": "configs/bip3d_grounding.py",
    "content": "_base_ = [\"./default_runtime.py\"]\n\nimport os\nfrom bip3d.datasets.embodiedscan_det_grounding_dataset import (\n    class_names, head_labels, common_labels, tail_labels\n)\n\nDEBUG = os.environ.get(\"CLUSTER\") is None\ndel os\n\nbackend_args = None\n\nmetainfo = dict(classes=\"all\")\n\ndepth = True\ndepth_loss = True\n\nz_range=[-0.2, 3]\nmin_depth = 0.25\nmax_depth = 10\nnum_depth = 64\n\nmodel = dict(\n    type=\"BIP3D\",\n    input_3d=\"depth_img\",\n    use_depth_grid_mask=True,\n    data_preprocessor=dict(\n        type=\"CustomDet3DDataPreprocessor\",\n        mean=[123.675, 116.28, 103.53],\n        std=[58.395, 57.12, 57.375],\n        bgr_to_rgb=True,\n        pad_size_divisor=32,\n    ),\n    backbone=dict(\n        type=\"mmdet.SwinTransformer\",\n        embed_dims=96,\n        depths=[2, 2, 6, 2],\n        num_heads=[3, 6, 12, 24],\n        window_size=7,\n        mlp_ratio=4,\n        qkv_bias=True,\n        qk_scale=None,\n        drop_rate=0.0,\n        attn_drop_rate=0.0,\n        drop_path_rate=0.2,\n        patch_norm=True,\n        out_indices=(1, 2, 3),\n        with_cp=True,\n        convert_weights=False,\n    ),\n    neck=dict(\n        type=\"mmdet.ChannelMapper\",\n        in_channels=[192, 384, 768],\n        kernel_size=1,\n        out_channels=256,\n        act_cfg=None,\n        bias=True,\n        norm_cfg=dict(type=\"GN\", num_groups=32),\n        num_outs=4,\n    ),\n    text_encoder=dict(\n        type=\"BertModel\",\n        special_tokens_list=[\"[CLS]\", \"[SEP]\"],\n        name=\"./ckpt/bert-base-uncased\",\n        pad_to_max=False,\n        use_sub_sentence_represent=True,\n        add_pooling_layer=False,\n        max_tokens=768,\n        use_checkpoint=True,\n        return_tokenized=True,\n    ),\n    backbone_3d=(\n        dict(\n            type=\"mmdet.ResNet\",\n            depth=34,\n            in_channels=1,\n            base_channels=4,\n            num_stages=4,\n            out_indices=(1, 2, 3),\n            norm_cfg=dict(type=\"BN\", requires_grad=True),\n            norm_eval=True,\n            with_cp=True,\n            style=\"pytorch\",\n        )\n        if depth\n        else None\n    ),\n    neck_3d=(\n        dict(\n            type=\"mmdet.ChannelMapper\",\n            in_channels=[8, 16, 32],\n            kernel_size=1,\n            out_channels=32,\n            act_cfg=None,\n            bias=True,\n            norm_cfg=dict(type=\"GN\", num_groups=4),\n            num_outs=4,\n        )\n        if depth\n        else None\n    ),\n    feature_enhancer=dict(\n        type=\"TextImageDeformable2DEnhancer\",\n        num_layers=6,\n        text_img_attn_block=dict(\n            v_dim=256, l_dim=256, embed_dim=1024, num_heads=4, init_values=1e-4\n        ),\n        img_attn_block=dict(\n            self_attn_cfg=dict(\n                embed_dims=256, num_levels=4, dropout=0.0, im2col_step=64\n            ),\n            ffn_cfg=dict(\n                embed_dims=256, feedforward_channels=2048, ffn_drop=0.0\n            ),\n        ),\n        text_attn_block=dict(\n            self_attn_cfg=dict(num_heads=4, embed_dims=256, dropout=0.0),\n            ffn_cfg=dict(\n                embed_dims=256, feedforward_channels=1024, ffn_drop=0.0\n            ),\n        ),\n        embed_dims=256,\n        num_feature_levels=4,\n        positional_encoding=dict(\n            num_feats=128, normalize=True, offset=0.0, temperature=20\n        ),\n    ),\n    spatial_enhancer=dict(\n        
type=\"DepthFusionSpatialEnhancer\",\n        embed_dims=256,\n        feature_3d_dim=32,\n        num_depth_layers=2,\n        min_depth=min_depth,\n        max_depth=max_depth,\n        num_depth=num_depth,\n        with_feature_3d=depth,\n        loss_depth_weight=1.0 if depth_loss else -1,\n    ),\n    decoder=dict(\n        type=\"BBox3DDecoder\",\n        look_forward_twice=True,\n        instance_bank=dict(\n            type=\"InstanceBank\",\n            num_anchor=50,\n            anchor=\"anchor_files/embodiedscan_kmeans.npy\",\n            embed_dims=256,\n            anchor_in_camera=True,\n        ),\n        anchor_encoder=dict(\n            type=\"DoF9BoxEncoder\",\n            embed_dims=256,\n            rot_dims=3,\n        ),\n        graph_model=dict(\n            type=\"MultiheadAttention\",\n            embed_dims=256,\n            num_heads=8,\n            dropout=0.0,\n            batch_first=True,\n        ),\n        ffn=dict(\n            type=\"FFN\",\n            embed_dims=256,\n            feedforward_channels=2048,\n            ffn_drop=0.0,\n        ),\n        norm_layer=dict(type=\"LN\", normalized_shape=256),\n        deformable_model=dict(\n            type=\"DeformableFeatureAggregation\",\n            embed_dims=256,\n            num_groups=8,\n            num_levels=4,\n            use_camera_embed=True,\n            use_deformable_func=True,\n            with_depth=True,\n            min_depth=min_depth,\n            max_depth=max_depth,\n            kps_generator=dict(\n                type=\"SparseBox3DKeyPointsGenerator\",\n                fix_scale=[\n                    [0, 0, 0],\n                    [0.45, 0, 0],\n                    [-0.45, 0, 0],\n                    [0, 0.45, 0],\n                    [0, -0.45, 0],\n                    [0, 0, 0.45],\n                    [0, 0, -0.45],\n                ],\n                num_learnable_pts=9,\n            ),\n            with_value_proj=True,\n            filter_outlier=True,\n        ),\n        text_cross_attn=dict(\n            type=\"MultiheadAttention\",\n            embed_dims=256,\n            num_heads=8,\n            dropout=0.0,\n            batch_first=True,\n        ),\n        refine_layer=dict(\n            type=\"GroundingRefineClsHead\",\n            embed_dims=256,\n            output_dim=9,\n            cls_bias=True,\n            # cls_layers=True,\n        ),\n        loss_cls=dict(\n            type=\"mmdet.FocalLoss\",\n            use_sigmoid=True,\n            gamma=2.0,\n            alpha=0.25,\n            loss_weight=1.0,\n        ),\n        loss_reg=dict(\n            type=\"DoF9BoxLoss\",\n            loss_weight_wd=1.0,\n            loss_weight_pcd=0.0,\n            loss_weight_cd=0.8,\n        ),\n        sampler=dict(\n            type=\"Grounding3DTarget\",\n            cls_weight=1.0,\n            box_weight=1.0,\n            num_dn=100,\n            cost_weight_wd=1.0,\n            cost_weight_pcd=0.0,\n            cost_weight_cd=0.8,\n            with_dn_query=True,\n            num_classes=284,\n            embed_dims=256,\n        ),\n        gt_reg_key=\"gt_bboxes_3d\",\n        gt_cls_key=\"tokens_positive\",\n        post_processor=dict(\n            type=\"GroundingBox3DPostProcess\",\n            num_output=1000,\n        ),\n        with_instance_id=False,\n    ),\n)\n\ndataset_type = \"EmbodiedScanDetGroundingDataset\"\ndata_root = \"data\"\nmetainfo = dict(\n    classes=class_names,\n    classes_split=(head_labels, common_labels, 
tail_labels),\n    box_type_3d=\"euler-depth\",\n)\n\n\nimage_wh = (512, 512)\n\nrotate_3rscan = True\nsep_token = \"[SEP]\"\ncam_standardization = True\nif cam_standardization:\n    resize = dict(\n        type=\"CamIntrisicStandardization\",\n        dst_intrinsic=[\n            [432.57943431339237, 0.0, 256],\n            [0.0, 539.8570854208559, 256],\n            [0.0, 0.0, 1.0],\n        ],\n        dst_wh=image_wh,\n    )\nelse:\n    resize = dict(type=\"CustomResize\", scale=image_wh, keep_ratio=False)\n\ntrain_pipeline = [\n    dict(type=\"LoadAnnotations3D\"),\n    dict(\n        type=\"MultiViewPipeline\",\n        n_images=1,\n        max_n_images=18,\n        transforms=[\n            dict(type=\"LoadImageFromFile\", backend_args=backend_args),\n            dict(type=\"LoadDepthFromFile\", backend_args=backend_args),\n            resize,\n            dict(\n                type=\"ResizeCropFlipImage\",\n                data_aug_conf={\n                    \"resize_lim\": (1.0, 1.0),\n                    \"final_dim\": image_wh,\n                    \"bot_pct_lim\": (0.0, 0.05),\n                    \"rot_lim\": (0, 0),\n                    \"H\": image_wh[1],\n                    \"W\": image_wh[0],\n                    \"rand_flip\": False,\n                },\n            ),\n        ],\n        rotate_3rscan=rotate_3rscan,\n        ordered=True,\n    ),\n    dict(\n        type=\"Pack3DDetInputs\",\n        keys=[\"img\", \"depth_img\", \"gt_bboxes_3d\", \"gt_labels_3d\"],\n    ),\n]\nif depth_loss:\n    train_pipeline.append(\n        dict(\n            type=\"DepthProbLabelGenerator\",\n            origin_stride=4,\n            min_depth=min_depth,\n            max_depth=max_depth,\n            num_depth=num_depth,\n        ),\n    )\n\ntest_pipeline = [\n    dict(type=\"LoadAnnotations3D\"),\n    dict(\n        type=\"MultiViewPipeline\",\n        n_images=1,\n        max_n_images=50,\n        ordered=True,\n        transforms=[\n            dict(type=\"LoadImageFromFile\", backend_args=backend_args),\n            dict(type=\"LoadDepthFromFile\", backend_args=backend_args),\n            resize,\n        ],\n        rotate_3rscan=rotate_3rscan,\n    ),\n    dict(\n        type=\"Pack3DDetInputs\",\n        keys=[\"img\", \"depth_img\", \"gt_bboxes_3d\", \"gt_labels_3d\"],\n    ),\n]\n\ntrainval = False\ndata_version = \"v1\"\n\nif data_version == \"v1\":\n    train_ann_file = \"embodiedscan/embodiedscan_infos_train.pkl\"\n    val_ann_file = \"embodiedscan/embodiedscan_infos_val.pkl\"\n    train_vg_file = \"embodiedscan/embodiedscan_train_vg_all.json\"\n    val_vg_file = \"embodiedscan/embodiedscan_val_vg_all.json\"\nelif data_version == \"v1-mini\":\n    train_ann_file = \"embodiedscan/embodiedscan_infos_train.pkl\"\n    val_ann_file = \"embodiedscan/embodiedscan_infos_val.pkl\"\n    train_vg_file = \"embodiedscan/embodiedscan_train_mini_vg.json\"\n    val_vg_file = \"embodiedscan/embodiedscan_val_mini_vg.json\"\nelif data_version == \"v2\":\n    train_ann_file = \"embodiedscan-v2/embodiedscan_infos_train.pkl\"\n    val_ann_file = \"embodiedscan-v2/embodiedscan_infos_val.pkl\"\n    train_vg_file = \"embodiedscan-v2/embodiedscan_train_vg.json\"\n    val_vg_file = \"embodiedscan-v2/embodiedscan_val_vg.json\"\n\ntest_ann_file = \"embodiedscan/embodiedscan_infos_test.pkl\"\ntest_vg_file = \"embodiedscan/embodiedscan_test_vg.json\"\n\ntrain_dataset = dict(\n    type=dataset_type,\n    data_root=data_root,\n    ann_file=train_ann_file,\n    pipeline=train_pipeline,\n    
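# Grounding mode: pairs each scene with referring expressions from vg_file.\n    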
test_mode=False,\n    filter_empty_gt=True,\n    box_type_3d=\"Euler-Depth\",\n    metainfo=metainfo,\n    mode=\"grounding\",\n    vg_file=train_vg_file,\n    num_text=10,\n    sep_token=sep_token,\n)\n\nif trainval:\n    train_dataset = dict(\n        type=\"ConcatDataset\",\n        datasets=[\n            train_dataset,\n            dict(\n                type=dataset_type,\n                data_root=data_root,\n                ann_file=val_ann_file,\n                pipeline=train_pipeline,\n                test_mode=False,\n                filter_empty_gt=True,\n                box_type_3d=\"Euler-Depth\",\n                metainfo=metainfo,\n                mode=\"grounding\",\n                vg_file=val_vg_file,\n                num_text=10,\n                sep_token=sep_token,\n            )\n        ],\n    )\n\ntrain_dataloader = dict(\n    batch_size=1,\n    num_workers=4 if not DEBUG else 0,\n    persistent_workers=False,\n    sampler=dict(type=\"DefaultSampler\", shuffle=True),\n    dataset=train_dataset,\n)\n\nval_dataloader = dict(\n    batch_size=1,\n    num_workers=4 if not DEBUG else 0,\n    persistent_workers=False,\n    drop_last=False,\n    sampler=dict(type=\"DefaultSampler\", shuffle=False),\n    dataset=dict(\n        type=dataset_type,\n        data_root=data_root,\n        ann_file=val_ann_file,\n        pipeline=test_pipeline,\n        test_mode=True,\n        filter_empty_gt=True,\n        box_type_3d=\"Euler-Depth\",\n        metainfo=metainfo,\n        mode=\"grounding\",\n        vg_file=val_vg_file,\n    ),\n)\n\ntest_dataloader = dict(\n    batch_size=1,\n    num_workers=4 if not DEBUG else 0,\n    persistent_workers=False,\n    drop_last=False,\n    sampler=dict(type=\"DefaultSampler\", shuffle=False),\n    dataset=dict(\n        type=dataset_type,\n        data_root=data_root,\n        ann_file=test_ann_file,\n        pipeline=[\n            dict(type=\"LoadAnnotations3D\"),\n            dict(\n                type=\"MultiViewPipeline\",\n                n_images=1,\n                max_n_images=50,\n                ordered=True,\n                transforms=[\n                    dict(type=\"LoadImageFromFile\", backend_args=backend_args),\n                    dict(type=\"LoadDepthFromFile\", backend_args=backend_args),\n                    resize,\n                ],\n                rotate_3rscan=rotate_3rscan,\n            ),\n            dict(\n                type=\"Pack3DDetInputs\",\n                keys=[\"img\", \"depth_img\", \"gt_bboxes_3d\", \"gt_labels_3d\"],\n            ),\n        ],\n        test_mode=True,\n        filter_empty_gt=True,\n        box_type_3d=\"Euler-Depth\",\n        metainfo=metainfo,\n        mode=\"grounding\",\n        vg_file=test_vg_file,\n    ),\n)\n\nval_evaluator = dict(\n    type=\"GroundingMetric\",\n    collect_dir=\"/job_data/.dist_test\" if not DEBUG else None,\n)\n\ntest_evaluator = dict(\n    type=\"GroundingMetric\",\n    collect_dir=\"/job_data/.dist_test\" if not DEBUG else None,\n    format_only=True,\n    submit_info={\n        'method': 'BIP3D',\n        'team': 'robot-lab manipulation team',\n        'authors': ['xuewu lin'],\n        'e-mail': '878585984@qq.com',\n        'institution': 'Horizon',\n        'country': 'China',\n    },\n    result_dir=\"/job_data\" if not DEBUG else \"./\",\n)\n\nmax_epochs = 2\ntrain_cfg = dict(\n    type=\"EpochBasedTrainLoop\", max_epochs=max_epochs, val_interval=1\n)\nval_cfg = dict(type=\"ValLoop\")\ntest_cfg = dict(type=\"TestLoop\")\n\nlr = 
2e-4\noptim_wrapper = dict(\n    type=\"OptimWrapper\",\n    optimizer=dict(type=\"AdamW\", lr=lr, weight_decay=0.0005),\n    paramwise_cfg=dict(\n        custom_keys={\n            # train the pretrained 2D image backbone with a lower LR\n            \"backbone.\": dict(lr_mult=0.1),\n            # \"text_encoder\": dict(lr_mult=0.05),\n            # no weight decay on Swin absolute position embeddings\n            \"absolute_pos_embed\": dict(decay_mult=0.0),\n        }\n    ),\n    clip_grad=dict(max_norm=10, norm_type=2),\n)\n\n\n# learning rate schedule: linear warmup, then iteration-based step decay\nparam_scheduler = [\n    dict(\n        type=\"LinearLR\", start_factor=0.001, by_epoch=False, begin=0, end=500\n    ),\n    dict(\n        type=\"MultiStepLR\",\n        begin=0,\n        end=62000,\n        by_epoch=False,\n        milestones=[40000, 55000],\n        gamma=0.1,\n    ),\n]\n\ncustom_hooks = [dict(type=\"EmptyCacheHook\", after_iter=False)]\ndefault_hooks = dict(\n    checkpoint=dict(type=\"CheckpointHook\", interval=1, max_keep_ckpts=3),\n)\n\nvis_backends = [\n    dict(\n        type=\"TensorboardVisBackend\",\n        save_dir=\"/job_tboard\" if not DEBUG else \"./work-dir\",\n    ),\n]\n\nvisualizer = dict(\n    type=\"Visualizer\",\n    vis_backends=vis_backends,\n    name=\"visualizer\",\n)\n\nload_from = \"ckpt/bip3d_det.pth\"\n
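# (bip3d_det.pth is assumed to be the checkpoint produced by the detection\n# stage, configs/bip3d_det.py; grounding finetuning starts from it, following\n# the mixed detection + grounding training strategy described in the README.)\n"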
  },
  {
    "path": "configs/bip3d_grounding_rgb.py",
    "content": "_base_ = [\"./bip3d_grounding.py\"]\n\nmodel = dict(\n    backbone_3d=None,\n    neck_3d=None,\n    spatial_enhancer=dict(with_feature_3d=False),\n    decoder=dict(deformable_model=dict(with_depth=False)),\n)\n\nload_from = \"ckpt/bip3d_det_rgb.pth\"\n"
  },
  {
    "path": "configs/default_runtime.py",
    "content": "default_scope = \"bip3d\"\n\ndefault_hooks = dict(\n    timer=dict(type=\"IterTimerHook\"),\n    logger=dict(\n        type=\"LoggerHook\",\n        interval=25,\n    ),\n    param_scheduler=dict(type=\"ParamSchedulerHook\"),\n    checkpoint=dict(type=\"CheckpointHook\", interval=1, max_keep_ckpts=4),\n    sampler_seed=dict(type=\"DistSamplerSeedHook\"),\n)\n# visualization=dict(type='Det3DVisualizationHook'))\n\nenv_cfg = dict(\n    cudnn_benchmark=False,\n    mp_cfg=dict(mp_start_method=\"fork\", opencv_num_threads=0),\n    dist_cfg=dict(backend=\"nccl\"),\n)\n\nlog_processor = dict(type=\"LogProcessor\", window_size=50, by_epoch=True)\n\nlog_level = \"INFO\"\nload_from = None\nresume = False\n\n# TODO: support auto scaling lr\nrandomness = dict(seed=0)\nfind_unused_parameters = False\n"
  },
  {
    "path": "docs/quick_start.md",
    "content": "# Quick Start\n\n### Set up python environment\n```bash\nvirtualenv mm_bip3d --python=python3.8\nsource mm_bip3d/bin/activate\npip3 install --upgrade pip\n\nbip3d_path=\"path/to/bip3d\"\ncd ${bip3d_path}\n# MMCV recommends installing via a wheel package, url: https://download.openmmlab.com/mmcv/dist/cu{$cuda_version}/torch{$torch_version}/index.html\npip3 install -r requirement.txt\n```\n\n### Compile the deformable_aggregation CUDA op\n```bash\ncd bip3d/ops\npython3 setup.py develop\ncd ../../\n```\n\n### Prepare the data\nDownload the [EmbodiedScan dataset](https://github.com/OpenRobotLab/EmbodiedScan) and create symbolic links.\n```bash\ncd ${bip3d_path}\nmkdir data\nln -s path/to/embodiedscan ./data/embodiedscan\n```\n\nDownload datasets [ScanNet](https://github.com/ScanNet/ScanNet), [3RScan](https://github.com/WaldJohannaU/3RScan), [Matterport3D](https://github.com/niessner/Matterport), and optionally download [ARKitScenes](https://github.com/apple/ARKitScenes). Adjust the data directory structure as follows:\n```bash\n${bip3d_path}\n└──data\n    ├──embodiedscan\n    │   ├──embodiedscan_infos_train.pkl\n    │   ├──embodiedscan_infos_val.pkl\n    │   ...\n    ├──3rscan\n    │   ├──00d42bed-778d-2ac6-86a7-0e0e5f5f5660\n    │   ...\n    ├──scannet\n    │   └──posed_images\n    ├──matterport3d\n    │   ├──17DRP5sb8fy\n    │   ...\n    └──arkitscenes\n        ├──Training\n        └──Validation\n```\n\n### Prepare pre-trained weights\nDownload the required Grounding-DINO pre-trained weights: [Swin-Tiny](https://download.openmmlab.com/mmdetection/v3.0/grounding_dino/groundingdino_swint_ogc_mmdet-822d7e9d.pth) and [Swin-Base](https://download.openmmlab.com/mmdetection/v3.0/grounding_dino/groundingdino_swinb_cogcoor_mmdet-55949c9c.pth).\n```bash\nmkdir ckpt\n\n# Swin-Tiny\nwget https://download.openmmlab.com/mmdetection/v3.0/grounding_dino/groundingdino_swint_ogc_mmdet-822d7e9d.pth -O ckpt/groundingdino_swint_ogc_mmdet-822d7e9d.pth\npython tools/ckpt_rename.py ckpt/groundingdino_swint_ogc_mmdet-822d7e9d.pth\n\n# Swin-Base\nwget https://download.openmmlab.com/mmdetection/v3.0/grounding_dino/groundingdino_swinb_cogcoor_mmdet-55949c9c.pth -O ckpt/groundingdino_swinb_cogcoor_mmdet-55949c9c.pth\npython tools/ckpt_rename.py ckpt/groundingdino_swinb_cogcoor_mmdet-55949c9c.pth\n```\nDownload bert config and pretrain weights from [huggingface](https://huggingface.co/google-bert/bert-base-uncased/tree/main).\n```bash\n${bip3d_path}\n└──ckpt\n    ├──groundingdino_swint_ogc_mmdet-822d7e9d.pth\n    ├──groundingdino_swinb_cogcoor_mmdet-55949c9c.pth\n    └──bert-base-uncased\n        ├──config.json\n        ├──tokenizer_config.json\n        ├──tokenizer.json\n        ├──pytorch_model.bin\n        ...\n```\n\n### Generate anchors by K-means\n```bash\nmkdir anchor_files\npython3 tools/anchor_bbox3d_kmeans.py \\\n    --ann_file data/embodiedscan/embodiedscan_infos_train.pkl \\\n    --output_file anchor_files/embodiedscan_kmeans.npy\n```\nYou can also download the anchor file we provide.\n\n\n### Modify config\nAccording to personal needs, modify some config items, such as [`DEBUG`](../configs/bip3d_det.py#L8), [`collect_dir`](../configs/bip3d_det.py#L425), [`save_dir`](../configs/bip3d_det.py#L475), and [`load_from`](../configs/bip3d_det.py#L485).\n\nIf performing multi-machine training, ensure that [`collect_dir`](../configs/bip3d_det.py#L425) is a shared folder accessible by all machines.\n\n### Run local training and testing\n```bash\nexport CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7\n# train\nbash 
bash engine.sh configs/xxx.py\n\n# test\nbash test.sh configs/xxx.py path/to/checkpoint\n```\n
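\n### Override config items from the command line\nBesides editing the config file, `tools/train.py` and `tools/test.py` also accept `--cfg-options` key=value overrides (the values below are only illustrative):\n```bash\n# e.g. point load_from at a different checkpoint and reduce dataloader workers\npython3 tools/train.py configs/bip3d_grounding.py \\\n    --work-dir ./work-dir \\\n    --cfg-options load_from=ckpt/bip3d_det.pth train_dataloader.num_workers=2\n```\n"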
  },
  {
    "path": "engine.sh",
    "content": "export PYTHONPATH=./:$PYTHONPATH\nCONFIG=configs/bip3d_det.py\nif ! [[ -z $1 ]]; then\n    echo $1\n    CONFIG=$1\nfi\nnvcc -V\nwhich nvcc\n\ngpus=(${CUDA_VISIBLE_DEVICES//,/ })\ngpu_num=${#gpus[@]}\necho \"number of gpus: \"${gpu_num}\necho $CONFIG\n\nif [ ${gpu_num} -gt 1 ]\nthen\n    bash ./tools/dist_train.sh \\\n        ${CONFIG} \\\n        ${gpu_num} \\\n        --work-dir=work-dirs\nelse\n    python ./tools/train.py \\\n        ${CONFIG} \\\n        --work-dir ./work-dir\nfi\n"
  },
  {
    "path": "requirements.txt",
    "content": "einops==0.8.0\nhuggingface-hub==0.26.5\nmmcv==2.1.0\nmmdet==3.3.0\nmmengine==0.10.4\nninja==1.11.1.3\nnumpy==1.26.4\nopencv-python==4.10.0.84\npytorch3d==0.7.5\nrequests==2.32.3\nscipy==1.14.1\ntensorboard==2.18.0\ntimm==1.0.3\ntokenizers==0.21.0\ntorch==2.1.0+cu118\ntorchaudio==2.1.0+cu118\ntorchmetrics==1.6.1\ntorchvision==0.16.0+cu118\n"
  },
  {
    "path": "test.sh",
    "content": "export PYTHONPATH=./:$PYTHONPATH\nCONFIG=$1\nCKPT=$2\n\nnvcc -V\nwhich nvcc\n\ngpus=(${CUDA_VISIBLE_DEVICES//,/ })\ngpu_num=${#gpus[@]}\necho \"number of gpus: \"${gpu_num}\necho \"ckeckpoint: \"$CKPT\necho $CONFIG\n\n\nif [ ${gpu_num} -gt 1 ]\nthen\n    bash ./tools/dist_test.sh \\\n        ${CONFIG} \\\n        $CKPT \\\n        $gpu_num \\\n        --work-dir ./work-dir\nelse\n    python3 tools/test.py \\\n        $CONFIG \\\n        $CKPT \\\n        --work-dir ./work-dir\nfi\n"
  },
  {
    "path": "tools/anchor_bbox3d_kmeans.py",
    "content": "import torch\nimport pickle\nimport tqdm\nfrom sklearn.cluster import KMeans\nimport numpy as np\n\nfrom bip3d.structures.bbox_3d import EulerDepthInstance3DBoxes\n\n\ndef sample(ids, n):\n    if n == len(ids):\n        return np.copy(ids)\n    elif n > len(ids):\n        return np.concatenate(\n            [ids, sample(ids, n - len(ids))]\n        )\n    else:\n        interval = len(ids) / n\n        output = []\n        for i in range(n):\n            output.append(ids[int(interval*i)])\n        return np.array(output)\n\n\ndef kmeans(\n    ann_file,\n    output_file,\n    z_min=-0.2,\n    z_max=3,\n):\n    ann = pickle.load(open(ann_file, \"rb\"))\n    all_cam_bbox = []\n    for x in tqdm.tqdm(ann[\"data_list\"]):\n        bbox = np.array([y[\"bbox_3d\"] for y in x[\"instances\"]])\n        ids = np.arange(len(x[\"images\"]))\n        ids = sample(ids, 50)\n        for idx in ids:\n            image = x[\"images\"][idx]\n            global2cam = np.linalg.inv(x['axis_align_matrix'] @ image['cam2global'])\n            _bbox = EulerDepthInstance3DBoxes(np.copy(bbox[image[\"visible_instance_ids\"]]))\n            mask = torch.logical_and(\n                _bbox.tensor[:, 2] > z_min,\n                _bbox.tensor[:, 2] < z_max\n            )\n            _bbox = _bbox[mask]\n            _bbox.transform(global2cam)\n            all_cam_bbox.append(_bbox.tensor.numpy())\n    all_cam_bbox = np.concatenate(all_cam_bbox)\n    print(\"start to kmeans, please wait\")\n    cluster_cam = KMeans(n_clusters=100).fit(all_cam_bbox).cluster_centers_\n    cluster_cam[:, 3:6] = np.log(cluster_cam[:, 3:6])\n    np.save(output_file, cluster_cam)\n\n\nif __name__ == \"__main__\":\n    import argparse\n    parser = argparse.ArgumentParser(\n        description='anchor bbox3d kmeans for embodiedscan dataset')\n    parser.add_argument(\"ann_file\")\n    parser.add_argument(\"--output_file\")\n    parser.add_argument(\"--z_min\", defaule=-0.2)\n    parser.add_argument(\"--z_max\", defaule=3)\n    args = parser.parse_args()\n    kmeans(args.ann_file, args.output_file, args.z_min, args.z_max)\n"
  },
  {
    "path": "tools/benchmark.py",
    "content": "# Copyright (c) OpenMMLab. All rights reserved.\nimport argparse\nimport os\n\nfrom mmengine import MMLogger\nfrom mmengine.config import Config, DictAction\nfrom mmengine.dist import init_dist\nfrom mmengine.registry import init_default_scope\nfrom mmengine.utils import mkdir_or_exist\n\nfrom mmdet.utils.benchmark import (DataLoaderBenchmark, DatasetBenchmark,\n                                   InferenceBenchmark)\n\nfrom embodiedscan import *\n\ndef parse_args():\n    parser = argparse.ArgumentParser(description='MMDet benchmark')\n    parser.add_argument('config', help='test config file path')\n    parser.add_argument('--checkpoint', help='checkpoint file')\n    parser.add_argument(\n        '--task',\n        choices=['inference', 'dataloader', 'dataset'],\n        default='dataloader',\n        help='Which task do you want to go to benchmark')\n    parser.add_argument(\n        '--repeat-num',\n        type=int,\n        default=1,\n        help='number of repeat times of measurement for averaging the results')\n    parser.add_argument(\n        '--max-iter', type=int, default=2000, help='num of max iter')\n    parser.add_argument(\n        '--log-interval', type=int, default=50, help='interval of logging')\n    parser.add_argument(\n        '--num-warmup', type=int, default=5, help='Number of warmup')\n    parser.add_argument(\n        '--fuse-conv-bn',\n        action='store_true',\n        help='Whether to fuse conv and bn, this will slightly increase'\n        'the inference speed')\n    parser.add_argument(\n        '--dataset-type',\n        choices=['train', 'val', 'test'],\n        default='test',\n        help='Benchmark dataset type. only supports train, val and test')\n    parser.add_argument(\n        '--work-dir',\n        help='the directory to save the file containing '\n        'benchmark metrics')\n    parser.add_argument(\n        '--cfg-options',\n        nargs='+',\n        action=DictAction,\n        help='override some settings in the used config, the key-value pair '\n        'in xxx=yyy format will be merged into config file. If the value to '\n        'be overwritten is a list, it should be like key=\"[a,b]\" or key=a,b '\n        'It also allows nested list/tuple values, e.g. 
key=\"[(a,b),(c,d)]\" '\n        'Note that the quotation marks are necessary and that no white space '\n        'is allowed.')\n    parser.add_argument(\n        '--launcher',\n        choices=['none', 'pytorch', 'slurm', 'mpi'],\n        default='none',\n        help='job launcher')\n    parser.add_argument('--local_rank', type=int, default=0)\n    args = parser.parse_args()\n    if 'LOCAL_RANK' not in os.environ:\n        os.environ['LOCAL_RANK'] = str(args.local_rank)\n    return args\n\n\ndef inference_benchmark(args, cfg, distributed, logger):\n    benchmark = InferenceBenchmark(\n        cfg,\n        args.checkpoint,\n        distributed,\n        args.fuse_conv_bn,\n        args.max_iter,\n        args.log_interval,\n        args.num_warmup,\n        logger=logger)\n    return benchmark\n\n\ndef dataloader_benchmark(args, cfg, distributed, logger):\n    benchmark = DataLoaderBenchmark(\n        cfg,\n        distributed,\n        args.dataset_type,\n        args.max_iter,\n        args.log_interval,\n        args.num_warmup,\n        logger=logger)\n    return benchmark\n\n\ndef dataset_benchmark(args, cfg, distributed, logger):\n    benchmark = DatasetBenchmark(\n        cfg,\n        args.dataset_type,\n        args.max_iter,\n        args.log_interval,\n        args.num_warmup,\n        logger=logger)\n    return benchmark\n\n\ndef main():\n    args = parse_args()\n    cfg = Config.fromfile(args.config)\n    if args.cfg_options is not None:\n        cfg.merge_from_dict(args.cfg_options)\n\n    init_default_scope(cfg.get('default_scope', 'mmdet'))\n\n    distributed = False\n    if args.launcher != 'none':\n        init_dist(args.launcher, **cfg.get('env_cfg', {}).get('dist_cfg', {}))\n        distributed = True\n\n    log_file = None\n    if args.work_dir:\n        log_file = os.path.join(args.work_dir, 'benchmark.log')\n        mkdir_or_exist(args.work_dir)\n\n    logger = MMLogger.get_instance(\n        'mmdet', log_file=log_file, log_level='INFO')\n\n    benchmark = eval(f'{args.task}_benchmark')(args, cfg, distributed, logger)\n    benchmark.run(args.repeat_num)\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "tools/ckpt_rename.py",
    "content": "import os\nimport torch\n\ndef get_renamed_ckpt(file, output=\"./\"):\n    ckpt_rename = dict()\n    ckpt = torch.load(file)\n    if \"state_dict\" in ckpt:\n        ckpt = ckpt[\"state_dict\"]\n    for key,value in ckpt.items():\n        k = None\n        if key.startswith(\"backbone\") or key.startswith(\"neck\"):\n            k = key\n        elif key.startswith(\"language_model.\"):\n            k = key.replace(\"language_model.\", \"text_encoder.\")\n        elif key.startswith(\"encoder.\"):\n            if key.startswith(\"encoder.layers.\"):\n                k = key.replace(\"encoder.layers.\", \"feature_enhancer.img_attn_blocks.\")\n            elif key.startswith(\"encoder.fusion_layers.\"):\n                k = key.replace(\"encoder.fusion_layers.\", \"feature_enhancer.text_img_attn_blocks.\")\n            elif key.startswith(\"encoder.text_layers.\"):\n                k = key.replace(\"encoder.text_layers.\", \"feature_enhancer.text_attn_blocks.\")\n        elif key.startswith(\"decoder.layers.\"):\n            ops = [\n                \"gnn\",\n                \"norm\",\n                \"text_cross_attn\",\n                \"norm\",\n                \"deformable\",\n                \"norm\",\n                \"ffn\",\n                \"norm\",\n                \"refine\",\n            ]\n            layer_id = int(key.split(\".\")[2])\n            name = key.split(\".\")[3]\n            for op in ops:\n                if name == \"self_attn\":\n                    block_id = 0\n                elif name == \"cross_attn_text\":\n                    block_id = 2\n                elif name == \"cross_attn\":\n                    block_id = 4\n                elif name == \"ffn\":\n                    block_id = 6\n                elif name == \"norms\":\n                    norm_id = int(key.split(\".\")[4])\n                    block_id = (norm_id + 1) * 2 - 1\n            op_id = block_id + layer_id * len(ops)\n            k = f\"decoder.layers.{op_id}.\"\n            if name == \"norms\":\n                k += \".\".join(key.split(\".\")[5:])\n            else:\n                k += \".\".join(key.split(\".\")[4:])\n        elif key.startswith(\"bbox_head.\"):\n            layer_id = int(key.split(\".\")[2])\n            op_id = 8 + layer_id * 9\n            if \"reg_branches\" in key:\n                k = f\"decoder.layers.{op_id}.\" + \".\".join(key.split(\".\")[3:])\n            elif \"cls_branches\" in key:\n                k = f\"decoder.layers.{op_id}.bias\"\n        elif \"pts_prob_fc\" in key:\n            k = \"spatial_enhancer.\" + key\n        elif key in [\n            \"pts_prob_pre_fc.weight\",\n            \"pts_prob_pre_fc.bias\",\n            \"pts_fc.weight\",\n            \"pts_fc.bias\",\n            \"fusion_fc.0.layers.0.0.weight\",\n            \"fusion_fc.0.layers.0.0.bias\",\n            \"fusion_fc.0.layers.1.weight\",\n            \"fusion_fc.0.layers.1.bias\",\n            \"fusion_fc.1.weight\",\n            \"fusion_fc.1.bias\",\n            \"fusion_norm.weight\",\n            \"fusion_norm.bias\",\n        ]:\n            k = \"spatial_enhancer.\" + key\n        elif key == \"level_embed\":\n            k = \"feature_enhancer.level_embed\"\n        elif key == \"query_embedding.weight\":\n            k = \"decoder.instance_bank.instance_feature\"\n        if k is None:\n    #         print(key)\n            k = key\n        if k == \"decoder.norm.weight\":\n            print(key)\n        ckpt_rename[k] = value\n\n    path, 
\n    path, file_name = os.path.split(file)\n    file_name = file_name[:-4]+\"-rename.pth\"\n    output_file = os.path.join(output, file_name)\n    torch.save(ckpt_rename, output_file)\n\nif __name__ == \"__main__\":\n    import argparse\n    parser = argparse.ArgumentParser(\n        description='rename an mmdet Grounding-DINO checkpoint to the BIP3D layout')\n    parser.add_argument(\"file\")\n    parser.add_argument(\"--output\", default=None)\n    args = parser.parse_args()\n    output = args.output if args.output is not None else \"./\"\n    get_renamed_ckpt(args.file, output)\n
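\n# Example (as used in docs/quick_start.md):\n#   python tools/ckpt_rename.py ckpt/groundingdino_swint_ogc_mmdet-822d7e9d.pth\n# writes groundingdino_swint_ogc_mmdet-822d7e9d-rename.pth into --output\n# (default: the current directory).\n"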
  },
  {
    "path": "tools/dist_test.sh",
    "content": "#!/usr/bin/env bash\n\nCONFIG=$1\nCHECKPOINT=$2\nGPUS=$3\nPORT=${PORT:-11000}\n\nPYTHONPATH=\"$(dirname $0)/..\":$PYTHONPATH \\\npython3 -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \\\n    $(dirname \"$0\")/test.py $CONFIG $CHECKPOINT --launcher pytorch ${@:4}\n"
  },
  {
    "path": "tools/dist_train.sh",
    "content": "#!/usr/bin/env bash\n\nCONFIG=$1\nGPUS=$2\nPORT=${PORT:-12050}\n\nPYTHONPATH=\"$(dirname $0)/..\":$PYTHONPATH \\\npython3 -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \\\n    $(dirname \"$0\")/train.py $CONFIG --launcher pytorch ${@:3}\n"
  },
  {
    "path": "tools/test.py",
    "content": "# Copyright (c) OpenMMLab. All rights reserved.\nimport time\nimport os\n\ndef wait_before_import_config():\n    t = int(os.environ.get('LOCAL_RANK', 0))\n    time.sleep(t * 0.5)\n\ndef wait_after_import_config():\n    t = int(os.environ.get('WORLD_SIZE', 0)) - int(os.environ.get('LOCAL_RANK', 0))\n    time.sleep(t * 0.5)\n\nwait_before_import_config()\nfrom mmengine.config import Config\nwait_after_import_config()\n\nimport argparse\nimport os\nimport os.path as osp\n\nfrom mmengine.config import Config, ConfigDict, DictAction\nfrom mmengine.registry import RUNNERS\nfrom mmengine.runner import Runner\nimport torch\ntorch.multiprocessing.set_sharing_strategy('file_system')\n\n# from embodiedscan.utils import replace_ceph_backend\n\n\n# TODO: support fuse_conv_bn and format_only\ndef parse_args():\n    parser = argparse.ArgumentParser(\n        description='MMDet3D test (and eval) a model')\n    parser.add_argument('config', help='test config file path')\n    parser.add_argument('checkpoint', help='checkpoint file')\n    parser.add_argument(\n        '--work-dir',\n        help='the directory to save the file containing evaluation metrics')\n    parser.add_argument('--task-name', help='task names')\n    parser.add_argument('--ceph',\n                        action='store_true',\n                        help='Use ceph as data storage backend')\n    parser.add_argument('--show',\n                        action='store_true',\n                        help='show prediction results')\n    parser.add_argument('--show-dir',\n                        help='directory where painted images will be saved. '\n                        'If specified, it will be automatically saved '\n                        'to the work_dir/timestamp/show_dir')\n    parser.add_argument('--score-thr',\n                        type=float,\n                        default=0.1,\n                        help='bbox score threshold')\n    parser.add_argument(\n        '--task',\n        type=str,\n        choices=[\n            'mono_det', 'multi-view_det', 'lidar_det', 'lidar_seg',\n            'multi-modality_det'\n        ],\n        help='Determine the visualization method depending on the task.')\n    parser.add_argument('--wait-time',\n                        type=float,\n                        default=2,\n                        help='the interval of show (s)')\n    parser.add_argument(\n        '--cfg-options',\n        nargs='+',\n        action=DictAction,\n        help='override some settings in the used config, the key-value pair '\n        'in xxx=yyy format will be merged into config file. If the value to '\n        'be overwritten is a list, it should be like key=\"[a,b]\" or key=a,b '\n        'It also allows nested list/tuple values, e.g. 
key=\"[(a,b),(c,d)]\" '\n        'Note that the quotation marks are necessary and that no white space '\n        'is allowed.')\n    parser.add_argument('--launcher',\n                        choices=['none', 'pytorch', 'slurm', 'mpi'],\n                        default='none',\n                        help='job launcher')\n    parser.add_argument('--tta',\n                        action='store_true',\n                        help='Test time augmentation')\n    # When using PyTorch version >= 2.0.0, the `torch.distributed.launch`\n    # will pass the `--local-rank` parameter to `tools/test.py` instead\n    # of `--local_rank`.\n    parser.add_argument('--local_rank', '--local-rank', type=int, default=0)\n    args = parser.parse_args()\n    if 'LOCAL_RANK' not in os.environ:\n        os.environ['LOCAL_RANK'] = str(args.local_rank)\n    return args\n\n\ndef trigger_visualization_hook(cfg, args):\n    default_hooks = cfg.default_hooks\n    if 'visualization' in default_hooks:\n        visualization_hook = default_hooks['visualization']\n        # Turn on visualization\n        visualization_hook['draw'] = True\n        if args.show:\n            visualization_hook['show'] = True\n            visualization_hook['wait_time'] = args.wait_time\n        if args.show_dir:\n            visualization_hook['test_out_dir'] = args.show_dir\n        all_task_choices = [\n            'mono_det', 'multi-view_det', 'lidar_det', 'lidar_seg',\n            'multi-modality_det'\n        ]\n        assert args.task in all_task_choices, 'You must set '\\\n            f\"'--task' in {all_task_choices} in the command \" \\\n            'if you want to use visualization hook'\n        visualization_hook['vis_task'] = args.task\n        visualization_hook['score_thr'] = args.score_thr\n    else:\n        raise RuntimeError(\n            'VisualizationHook must be included in default_hooks.'\n            'refer to usage '\n            '\"visualization=dict(type=\\'VisualizationHook\\')\"')\n\n    return cfg\n\n\ndef main():\n    args = parse_args()\n\n    # load config\n    cfg = Config.fromfile(args.config)\n\n    # TODO: We will unify the ceph support approach with other OpenMMLab repos\n    # if args.ceph:\n    #     cfg = replace_ceph_backend(cfg)\n\n    cfg.launcher = args.launcher\n    if args.cfg_options is not None:\n        cfg.merge_from_dict(args.cfg_options)\n\n    # work_dir is determined in this priority: CLI > segment in file > filename\n    if args.work_dir is not None:\n        # update configs according to CLI args if args.work_dir is not None\n        cfg.work_dir = args.work_dir\n    elif args.task_name is not None:\n        cfg.work_dir = osp.join('./work_dirs', args.task_name)\n    elif cfg.get('work_dir', None) is None:\n        # use config filename as default work_dir if cfg.work_dir is None\n        cfg.work_dir = osp.join('./work_dirs',\n                                osp.splitext(osp.basename(args.config))[0])\n\n    cfg.load_from = args.checkpoint\n\n    if args.show or args.show_dir:\n        cfg = trigger_visualization_hook(cfg, args)\n\n    if args.tta:\n        # Currently, we only support tta for 3D segmentation\n        # TODO: Support tta for 3D detection\n        assert 'tta_model' in cfg, 'Cannot find ``tta_model`` in config.'\n        assert 'tta_pipeline' in cfg, 'Cannot find ``tta_pipeline`` in config.'\n        cfg.test_dataloader.dataset.pipeline = cfg.tta_pipeline\n        cfg.model = ConfigDict(**cfg.tta_model, module=cfg.model)\n\n    # build the runner from config\n    if 
'runner_type' not in cfg:\n        # build the default runner\n        runner = Runner.from_cfg(cfg)\n    else:\n        # build customized runner from the registry\n        # if 'runner_type' is set in the cfg\n        runner = RUNNERS.build(cfg)\n\n    # start testing\n    runner.test()\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "tools/train.py",
    "content": "# Copyright (c) OpenMMLab. All rights reserved.\nimport time\nimport os\n\ndef wait_before_import_config():\n    t = int(os.environ.get('LOCAL_RANK', 0))\n    time.sleep(t * 0.5)\n\ndef wait_after_import_config():\n    t = int(os.environ.get('WORLD_SIZE', 0)) - int(os.environ.get('LOCAL_RANK', 0))\n    time.sleep(t * 0.5)\n\nwait_before_import_config()\nfrom mmengine.config import Config\nwait_after_import_config()\n\n\nimport argparse\nimport logging\nimport os\nimport os.path as osp\n\nfrom mmengine.config import Config, DictAction\nfrom mmengine.logging import print_log\nfrom mmengine.registry import RUNNERS\nfrom mmengine.runner import Runner\nfrom bip3d import *\nimport torch\n# torch.autograd.set_detect_anomaly(True)\ntorch.multiprocessing.set_sharing_strategy('file_system')\n\n\ndef parse_args():\n    parser = argparse.ArgumentParser(description='Train a 3D detector')\n    parser.add_argument('config', help='train config file path')\n    parser.add_argument('--work-dir', help='the dir to save logs and models')\n    parser.add_argument('--task-name', help='task names')\n    parser.add_argument('--amp',\n                        action='store_true',\n                        default=False,\n                        help='enable automatic-mixed-precision training')\n    parser.add_argument('--auto-scale-lr',\n                        action='store_true',\n                        help='enable automatically scaling LR.')\n    parser.add_argument(\n        '--resume',\n        nargs='?',\n        type=str,\n        const='auto',\n        help='If specify checkpoint path, resume from it, while if not '\n        'specify, try to auto resume from the latest checkpoint '\n        'in the work directory.')\n    parser.add_argument('--ceph',\n                        action='store_true',\n                        help='Use ceph as data storage backend')\n    parser.add_argument(\n        '--cfg-options',\n        nargs='+',\n        action=DictAction,\n        help='override some settings in the used config, the key-value pair '\n        'in xxx=yyy format will be merged into config file. If the value to '\n        'be overwritten is a list, it should be like key=\"[a,b]\" or key=a,b '\n        'It also allows nested list/tuple values, e.g. 
key=\"[(a,b),(c,d)]\" '\n        'Note that the quotation marks are necessary and that no white space '\n        'is allowed.')\n    parser.add_argument('--launcher',\n                        choices=['none', 'pytorch', 'slurm', 'mpi'],\n                        default='none',\n                        help='job launcher')\n    # When using PyTorch version >= 2.0.0, the `torch.distributed.launch`\n    # will pass the `--local-rank` parameter to `tools/train.py` instead\n    # of `--local_rank`.\n    parser.add_argument('--local_rank', '--local-rank', type=int, default=0)\n    args = parser.parse_args()\n    if 'LOCAL_RANK' not in os.environ:\n        os.environ['LOCAL_RANK'] = str(args.local_rank)\n    return args\n\n\ndef main():\n    args = parse_args()\n\n    # load config\n    cfg = Config.fromfile(args.config)\n\n    # TODO: We will unify the ceph support approach with other OpenMMLab repos\n    # if args.ceph:\n    #     cfg = replace_ceph_backend(cfg)\n\n    cfg.launcher = args.launcher\n    if args.cfg_options is not None:\n        cfg.merge_from_dict(args.cfg_options)\n\n    # work_dir is determined in this priority: CLI > segment in file > filename\n    if args.work_dir is not None:\n        # update configs according to CLI args if args.work_dir is not None\n        cfg.work_dir = args.work_dir\n    elif args.task_name is not None:\n        cfg.work_dir = osp.join('./work_dirs', args.task_name)\n    elif cfg.get('work_dir', None) is None:\n        # use config filename as default work_dir if cfg.work_dir is None\n        cfg.work_dir = osp.join('./work_dirs',\n                                osp.splitext(osp.basename(args.config))[0])\n\n    # enable automatic-mixed-precision training\n    if args.amp is True:\n        optim_wrapper = cfg.optim_wrapper.type\n        if optim_wrapper == 'AmpOptimWrapper':\n            print_log('AMP training is already enabled in your config.',\n                      logger='current',\n                      level=logging.WARNING)\n        else:\n            assert optim_wrapper == 'OptimWrapper', (\n                '`--amp` is only supported when the optimizer wrapper type is '\n                f'`OptimWrapper` but got {optim_wrapper}.')\n            cfg.optim_wrapper.type = 'AmpOptimWrapper'\n            cfg.optim_wrapper.loss_scale = 'dynamic'\n\n    # enable automatically scaling LR\n    if args.auto_scale_lr:\n        if 'auto_scale_lr' in cfg and \\\n                'enable' in cfg.auto_scale_lr and \\\n                'base_batch_size' in cfg.auto_scale_lr:\n            cfg.auto_scale_lr.enable = True\n        else:\n            raise RuntimeError('Can not find \"auto_scale_lr\" or '\n                               '\"auto_scale_lr.enable\" or '\n                               '\"auto_scale_lr.base_batch_size\" in your'\n                               ' configuration file.')\n\n    # resume is determined in this priority: resume from > auto_resume\n    if args.resume == 'auto':\n        cfg.resume = True\n        cfg.load_from = None\n    elif args.resume is not None:\n        cfg.resume = True\n        cfg.load_from = args.resume\n\n    # build the runner from config\n    if 'runner_type' not in cfg:\n        # build the default runner\n        runner = Runner.from_cfg(cfg)\n    else:\n        # build customized runner from the registry\n        # if 'runner_type' is set in the cfg\n        runner = RUNNERS.build(cfg)\n\n    # start training\n    runner.train()\n\n\nif __name__ == '__main__':\n    main()\n"
  }
]