[
  {
    "path": "DATASET_prepare.md",
    "content": "# Prepare Datasets for MaskFreeVIS\n\nA dataset can be used by accessing [DatasetCatalog](https://detectron2.readthedocs.io/modules/data.html#detectron2.data.DatasetCatalog)\nfor its data, or [MetadataCatalog](https://detectron2.readthedocs.io/modules/data.html#detectron2.data.MetadataCatalog) for its metadata (class names, etc).\nThis document explains how to setup the builtin datasets so they can be used by the above APIs.\n[Use Custom Datasets](https://detectron2.readthedocs.io/tutorials/datasets.html) gives a deeper dive on how to use `DatasetCatalog` and `MetadataCatalog`,\nand how to add new datasets to them.\n\nMaskFreeVIS has builtin support for a few datasets.\nThe datasets are assumed to exist in a directory specified by the environment variable\n`DETECTRON2_DATASETS`.\n\nYou can set the location for builtin datasets by `export DETECTRON2_DATASETS=/path/to/datasets`.\nIf left unset, the default is `./datasets` relative to your current working directory.\n\nThe model zoo contains configs and models that use these builtin datasets. 
We will convert each object mask to a box after reading the corresponding instance annotation.\n\n## Expected dataset structure for [COCO](https://cocodataset.org/#download):\n\n```\ncoco/\n  annotations/\n    instances_{train,val}2017.json\n    panoptic_{train,val}2017.json\n  {train,val}2017/\n    # image files that are mentioned in the corresponding json\n  panoptic_{train,val}2017/  # png annotations\n  panoptic_semseg_{train,val}2017/  # generated by the script mentioned below\n```\n\nInstall panopticapi by:\n```\npip install git+https://github.com/cocodataset/panopticapi.git\n```\nThen run `python datasets/prepare_coco_semantic_annos_from_panoptic_annos.py` to extract semantic annotations from panoptic annotations (only used for evaluation).\n\n\n## Expected dataset structure for [YouTubeVIS 2019](https://competitions.codalab.org/competitions/20128):\n\n```\nytvis_2019/\n  {train,valid,test}.json\n  {train,valid,test}/\n    Annotations/\n    JPEGImages/\n```\n\n## Expected dataset structure for [YouTubeVIS 2021](https://competitions.codalab.org/competitions/28988):\n\n```\nytvis_2021/\n  {train,valid,test}.json\n  {train,valid,test}/\n    Annotations/\n    JPEGImages/\n```\n"
  },
  {
    "path": "LICENSE",
    "content": "                                 Apache License\n                           Version 2.0, January 2004\n                        http://www.apache.org/licenses/\n\n   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION\n\n   1. Definitions.\n\n      \"License\" shall mean the terms and conditions for use, reproduction,\n      and distribution as defined by Sections 1 through 9 of this document.\n\n      \"Licensor\" shall mean the copyright owner or entity authorized by\n      the copyright owner that is granting the License.\n\n      \"Legal Entity\" shall mean the union of the acting entity and all\n      other entities that control, are controlled by, or are under common\n      control with that entity. For the purposes of this definition,\n      \"control\" means (i) the power, direct or indirect, to cause the\n      direction or management of such entity, whether by contract or\n      otherwise, or (ii) ownership of fifty percent (50%) or more of the\n      outstanding shares, or (iii) beneficial ownership of such entity.\n\n      \"You\" (or \"Your\") shall mean an individual or Legal Entity\n      exercising permissions granted by this License.\n\n      \"Source\" form shall mean the preferred form for making modifications,\n      including but not limited to software source code, documentation\n      source, and configuration files.\n\n      \"Object\" form shall mean any form resulting from mechanical\n      transformation or translation of a Source form, including but\n      not limited to compiled object code, generated documentation,\n      and conversions to other media types.\n\n      \"Work\" shall mean the work of authorship, whether in Source or\n      Object form, made available under the License, as indicated by a\n      copyright notice that is included in or attached to the work\n      (an example is provided in the Appendix below).\n\n      \"Derivative Works\" shall mean any work, whether in Source or Object\n      
form, that is based on (or derived from) the Work and for which the\n      editorial revisions, annotations, elaborations, or other modifications\n      represent, as a whole, an original work of authorship. For the purposes\n      of this License, Derivative Works shall not include works that remain\n      separable from, or merely link (or bind by name) to the interfaces of,\n      the Work and Derivative Works thereof.\n\n      \"Contribution\" shall mean any work of authorship, including\n      the original version of the Work and any modifications or additions\n      to that Work or Derivative Works thereof, that is intentionally\n      submitted to Licensor for inclusion in the Work by the copyright owner\n      or by an individual or Legal Entity authorized to submit on behalf of\n      the copyright owner. For the purposes of this definition, \"submitted\"\n      means any form of electronic, verbal, or written communication sent\n      to the Licensor or its representatives, including but not limited to\n      communication on electronic mailing lists, source code control systems,\n      and issue tracking systems that are managed by, or on behalf of, the\n      Licensor for the purpose of discussing and improving the Work, but\n      excluding communication that is conspicuously marked or otherwise\n      designated in writing by the copyright owner as \"Not a Contribution.\"\n\n      \"Contributor\" shall mean Licensor and any individual or Legal Entity\n      on behalf of whom a Contribution has been received by Licensor and\n      subsequently incorporated within the Work.\n\n   2. Grant of Copyright License. 
Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      copyright license to reproduce, prepare Derivative Works of,\n      publicly display, publicly perform, sublicense, and distribute the\n      Work and such Derivative Works in Source or Object form.\n\n   3. Grant of Patent License. Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      (except as stated in this section) patent license to make, have made,\n      use, offer to sell, sell, import, and otherwise transfer the Work,\n      where such license applies only to those patent claims licensable\n      by such Contributor that are necessarily infringed by their\n      Contribution(s) alone or by combination of their Contribution(s)\n      with the Work to which such Contribution(s) was submitted. If You\n      institute patent litigation against any entity (including a\n      cross-claim or counterclaim in a lawsuit) alleging that the Work\n      or a Contribution incorporated within the Work constitutes direct\n      or contributory patent infringement, then any patent licenses\n      granted to You under this License for that Work shall terminate\n      as of the date such litigation is filed.\n\n   4. Redistribution. 
You may reproduce and distribute copies of the\n      Work or Derivative Works thereof in any medium, with or without\n      modifications, and in Source or Object form, provided that You\n      meet the following conditions:\n\n      (a) You must give any other recipients of the Work or\n          Derivative Works a copy of this License; and\n\n      (b) You must cause any modified files to carry prominent notices\n          stating that You changed the files; and\n\n      (c) You must retain, in the Source form of any Derivative Works\n          that You distribute, all copyright, patent, trademark, and\n          attribution notices from the Source form of the Work,\n          excluding those notices that do not pertain to any part of\n          the Derivative Works; and\n\n      (d) If the Work includes a \"NOTICE\" text file as part of its\n          distribution, then any Derivative Works that You distribute must\n          include a readable copy of the attribution notices contained\n          within such NOTICE file, excluding those notices that do not\n          pertain to any part of the Derivative Works, in at least one\n          of the following places: within a NOTICE text file distributed\n          as part of the Derivative Works; within the Source form or\n          documentation, if provided along with the Derivative Works; or,\n          within a display generated by the Derivative Works, if and\n          wherever such third-party notices normally appear. The contents\n          of the NOTICE file are for informational purposes only and\n          do not modify the License. 
You may add Your own attribution\n          notices within Derivative Works that You distribute, alongside\n          or as an addendum to the NOTICE text from the Work, provided\n          that such additional attribution notices cannot be construed\n          as modifying the License.\n\n      You may add Your own copyright statement to Your modifications and\n      may provide additional or different license terms and conditions\n      for use, reproduction, or distribution of Your modifications, or\n      for any such Derivative Works as a whole, provided Your use,\n      reproduction, and distribution of the Work otherwise complies with\n      the conditions stated in this License.\n\n   5. Submission of Contributions. Unless You explicitly state otherwise,\n      any Contribution intentionally submitted for inclusion in the Work\n      by You to the Licensor shall be under the terms and conditions of\n      this License, without any additional terms or conditions.\n      Notwithstanding the above, nothing herein shall supersede or modify\n      the terms of any separate license agreement you may have executed\n      with Licensor regarding such Contributions.\n\n   6. Trademarks. This License does not grant permission to use the trade\n      names, trademarks, service marks, or product names of the Licensor,\n      except as required for reasonable and customary use in describing the\n      origin of the Work and reproducing the content of the NOTICE file.\n\n   7. Disclaimer of Warranty. Unless required by applicable law or\n      agreed to in writing, Licensor provides the Work (and each\n      Contributor provides its Contributions) on an \"AS IS\" BASIS,\n      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or\n      implied, including, without limitation, any warranties or conditions\n      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A\n      PARTICULAR PURPOSE. 
You are solely responsible for determining the\n      appropriateness of using or redistributing the Work and assume any\n      risks associated with Your exercise of permissions under this License.\n\n   8. Limitation of Liability. In no event and under no legal theory,\n      whether in tort (including negligence), contract, or otherwise,\n      unless required by applicable law (such as deliberate and grossly\n      negligent acts) or agreed to in writing, shall any Contributor be\n      liable to You for damages, including any direct, indirect, special,\n      incidental, or consequential damages of any character arising as a\n      result of this License or out of the use or inability to use the\n      Work (including but not limited to damages for loss of goodwill,\n      work stoppage, computer failure or malfunction, or any and all\n      other commercial damages or losses), even if such Contributor\n      has been advised of the possibility of such damages.\n\n   9. Accepting Warranty or Additional Liability. While redistributing\n      the Work or Derivative Works thereof, You may choose to offer,\n      and charge a fee for, acceptance of support, warranty, indemnity,\n      or other liability obligations and/or rights consistent with this\n      License. However, in accepting such obligations, You may act only\n      on Your own behalf and on Your sole responsibility, not on behalf\n      of any other Contributor, and only if You agree to indemnify,\n      defend, and hold each Contributor harmless for any liability\n      incurred by, or claims asserted against, such Contributor by reason\n      of your accepting any such warranty or additional liability.\n\n   END OF TERMS AND CONDITIONS\n\n   APPENDIX: How to apply the Apache License to your work.\n\n      To apply the Apache License to your work, attach the following\n      boilerplate notice, with the fields enclosed by brackets \"[]\"\n      replaced with your own identifying information. 
(Don't include\n      the brackets!)  The text should be enclosed in the appropriate\n      comment syntax for the file format. We also recommend that a\n      file or class name and description of purpose be included on the\n      same \"printed page\" as the copyright notice for easier\n      identification within third-party archives.\n\n   Copyright [yyyy] [name of copyright owner]\n\n   Licensed under the Apache License, Version 2.0 (the \"License\");\n   you may not use this file except in compliance with the License.\n   You may obtain a copy of the License at\n\n       http://www.apache.org/licenses/LICENSE-2.0\n\n   Unless required by applicable law or agreed to in writing, software\n   distributed under the License is distributed on an \"AS IS\" BASIS,\n   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n   See the License for the specific language governing permissions and\n   limitations under the License.\n"
  },
  {
    "path": "README.md",
    "content": "# MaskFreeVIS\n\nMask-Free Video Instance Segmentation [CVPR 2023].\n\nThis is the official pytorch implementation of [MaskFreeVIS](https://github.com/SysCV/MaskFreeVis/) built on the open-source detectron2. We aim to **remove the necessity for expensive video masks and even image masks** for training VIS models. Our project website contains more information, including the visual video comparison: [vis.xyz/pub/maskfreevis](https://www.vis.xyz/pub/maskfreevis/).\n\n\n> [**Mask-Free Video Instance Segmentation**](https://arxiv.org/abs/2303.15904)           \n> Lei Ke, Martin Danelljan, Henghui Ding, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu \\\n> CVPR 2023\n\nHighlights\n-----------------\n- **High-performing** video instance segmentation **without using any video masks or even image mask** labels. Using SwinL and built on Mask2Former, MaskFreeVIS achieved 56.0 AP on YTVIS without using any video masks labels. Using ResNet-101, MaskFreeVIS achieves 49.1 AP without using video masks, and 47.3 AP only using COCO mask initialized model.\n- **Novelty:** a new **parameter-free** Temporal KNN-patch Loss (TK-Loss), which leverages temporal masks consistency using unsupervised one-to-k patch correspondence.\n- **Simple:** TK-Loss is flexible to intergrated with state-of-the-art transformer-based VIS models, with no trainable parameters.\n\nVisualization results of MaskFreeVIS\n-----------------\n\n<table>\n  <tr>\n    <td><img src=\"vis_demos/example1.gif\" width=\"350\"></td>\n    <td><img src=\"vis_demos/example2.gif\" width=\"350\"></td>\n  </tr>\n  <tr>\n    <td><img src=\"vis_demos/example3.gif\" width=\"350\"></td>\n    <td><img src=\"vis_demos/example4.gif\" width=\"350\"></td>\n  </tr>\n</table>\n\nIntroduction\n-----------------\nThe recent advancement in Video Instance Segmentation (VIS) has largely been driven by the use of deeper and increasingly data-hungry transformer-based models. 
However, video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets. In this work, we aim to remove the mask-annotation requirement. We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state. We leverage the rich temporal mask consistency constraints in videos by introducing the Temporal KNN-patch Loss (TK-Loss), providing strong mask supervision without any labels. Our TK-Loss finds one-to-many matches across frames, through an efficient patch-matching step followed by a K-nearest neighbor selection. A consistency loss is then enforced on the found matches. Our mask-free objective is simple to implement, has no trainable parameters, is computationally efficient, yet outperforms baselines employing, e.g., state-of-the-art optical flow to enforce temporal mask consistency. We validate MaskFreeVIS on the YouTube-VIS 2019/2021, OVIS and BDD100K MOTS benchmarks. The results clearly demonstrate the efficacy of our method by drastically narrowing the gap between fully and weakly-supervised VIS performance.\n\n\nMethods\n-----------------\n<img width=\"1096\" alt=\"image\" src=\"https://user-images.githubusercontent.com/17427852/228353991-ff09784f-9afd-4ac2-bddf-c5b2763d25e6.png\">\n\n### **Installation**\nPlease see [Getting Started with Detectron2](https://github.com/facebookresearch/detectron2/blob/master/GETTING_STARTED.md) for full usage.\n\n### Requirements\n- Linux or macOS with Python ≥ 3.6\n- PyTorch ≥ 1.9 and [torchvision](https://github.com/pytorch/vision/) that matches the PyTorch installation.\n  Install them together at [pytorch.org](https://pytorch.org) to make sure of this. 
Note: please check that\n  the PyTorch version matches the one required by Detectron2.\n- Detectron2: follow [Detectron2 installation instructions](https://detectron2.readthedocs.io/tutorials/install.html).\n- OpenCV is optional, but needed by the demo and visualization\n- `pip install -r requirements.txt`\n\n### CUDA kernel for MSDeformAttn\nAfter preparing the required environment, run the following command to compile the CUDA kernel for MSDeformAttn.\n\n`CUDA_HOME` must be defined and point to the directory of the installed CUDA toolkit.\n\n```bash\ncd mask2former/modeling/pixel_decoder/ops\nsh make.sh\n```\n\n#### Building on another system\nTo build on a system that does not have a GPU device but provides the drivers:\n```bash\nTORCH_CUDA_ARCH_LIST='8.0' FORCE_CUDA=1 python setup.py build install\n```\n\n### Example conda environment setup\n```bash\nconda create --name maskfreevis python=3.8 -y\nconda activate maskfreevis\nconda install pytorch==1.9.0 torchvision==0.10.0 cudatoolkit=11.1 -c pytorch -c nvidia\npip install -U opencv-python\n\n# under your working directory\ngit clone git@github.com:facebookresearch/detectron2.git\ncd detectron2\npip install -e .\n\ncd ..\ngit clone https://github.com/SysCV/MaskFreeVIS.git\ncd MaskFreeVIS\npip install -r requirements.txt\ncd mask2former/modeling/pixel_decoder/ops\nsh make.sh\n```\n\n### **Dataset preparation**\nPlease see the document [here](DATASET_prepare.md).\n\n\n### **Model Zoo**\n\n## Video Instance Segmentation (YouTubeVIS) \n\nUsing COCO image masks **without YTVIS video masks** during training:\n<table><tbody>\n<!-- START TABLE -->\n<!-- TABLE HEADER -->\n<th valign=\"bottom\">Config Name</th>\n<th valign=\"bottom\">Backbone</th>\n<th valign=\"bottom\">AP</th>\n<th valign=\"bottom\">download</th>\n<th valign=\"bottom\">Training Script</th>\n<th valign=\"bottom\">COCO Init Weight</th>\n<!-- TABLE BODY -->\n<!-- ROW: maskformer2_R50_bs16_50ep -->\n <tr><td align=\"left\"><a 
href=\"configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep.yaml\">MaskFreeVIS</a></td>\n<td align=\"center\">R50</td>\n<td align=\"center\">46.6</td>\n<td align=\"center\"><a href=\"https://drive.google.com/file/d/1Jjq-YgHqwixs2AdJ3kSNp4d2DjjV5qEA/view?usp=share_link\">model</a></td>\n<td align=\"center\"><a href=\"scripts/train_8gpu_mask2former_r50_video.sh\">script</a></td>\n<td align=\"center\"><a href=\"https://dl.fbaipublicfiles.com/maskformer/mask2former/coco/instance/maskformer2_R50_bs16_50ep/model_final_3c8ec9.pkl\">Init</a></td>\n</tr>\n\n<!-- ROW: maskformer2_R101_bs16_50ep -->\n <tr><td align=\"left\"><a href=\"configs/youtubevis_2019/video_maskformer2_R101_bs16_8ep.yaml\">MaskFreeVIS</a></td>\n<td align=\"center\">R101</td>\n<td align=\"center\">49.1</td>\n<td align=\"center\"><a href=\"https://drive.google.com/file/d/1eo05Rdl5cgTEB0mxB2HLwQGhEu6vEwDu/view?usp=share_link\">model</a></td>\n<td align=\"center\"><a href=\"scripts/train_8gpu_mask2former_r101_video.sh\">script</a></td>\n<td align=\"center\"><a href=\"https://dl.fbaipublicfiles.com/maskformer/mask2former/coco/instance/maskformer2_R101_bs16_50ep/model_final_eba159.pkl\">Init</a></td>\n</tr>\n\n<!-- ROW: maskformer2_swin_base_IN21k_384_bs16_50ep -->\n <tr><td align=\"left\"><a href=\"configs/youtubevis_2019/swin/video_maskformer2_swin_large_IN21k_384_bs16_8ep.yaml\">MaskFreeVIS</a></td>\n<td align=\"center\">Swin-L</td>\n<td align=\"center\">56.0</td>\n<td align=\"center\"><a href=\"https://drive.google.com/file/d/1kvckNoaDftN5R16CRJ-izfHeKTl_rskt/view?usp=share_link\">model</a></td>\n<td align=\"center\"><a href=\"scripts/train_8gpu_mask2former_swinl_video.sh\">script</a></td>\n<td align=\"center\"><a href=\"https://dl.fbaipublicfiles.com/maskformer/mask2former/coco/instance/maskformer2_swin_large_IN21k_384_bs16_100ep/model_final_e5f453.pkl\">Init</a></td>\n</tr>\n</tbody></table>\n\n**For below two training settings without using pseudo COCO images masks** for joint video training, please 
switch to the folder:\n```\ncd mfvis_nococo\n```\n\n1) Using only a **COCO mask-initialized model, without YTVIS video masks** during training:\n<table><tbody>\n<!-- START TABLE -->\n<!-- TABLE HEADER -->\n<th valign=\"bottom\">Config Name</th>\n<th valign=\"bottom\">Backbone</th>\n<th valign=\"bottom\">AP</th>\n<th valign=\"bottom\">download</th>\n<th valign=\"bottom\">Training Script</th>\n<th valign=\"bottom\">COCO Init Weight</th>\n<!-- TABLE BODY -->\n<!-- ROW: maskformer2_R50_bs16_50ep -->\n <tr><td align=\"left\"><a href=\"mfvis_nococo/configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep_coco.yaml\">MaskFreeVIS</a></td>\n<td align=\"center\">R50</td>\n<td align=\"center\">43.8</td>\n<td align=\"center\"><a href=\"https://drive.google.com/file/d/1hAfGtRk5uxYj9BPX3PGPjufyiF5l0IsW/view?usp=share_link\">model</a></td>\n<td align=\"center\"><a href=\"mfvis_nococo/scripts/train_8gpu_mask2former_r50_video_coco.sh\">script</a></td>\n<td align=\"center\"><a href=\"https://dl.fbaipublicfiles.com/maskformer/mask2former/coco/instance/maskformer2_R50_bs16_50ep/model_final_3c8ec9.pkl\">Init</a></td>\n</tr>\n<!-- ROW: maskformer2_R101_bs16_50ep -->\n <tr><td align=\"left\"><a href=\"mfvis_nococo/configs/youtubevis_2019/video_maskformer2_R101_bs16_8ep_coco.yaml\">MaskFreeVIS</a></td>\n<td align=\"center\">R101</td>\n<td align=\"center\">47.3</td>\n<td align=\"center\"><a href=\"https://drive.google.com/file/d/1imHH-m9Q9YkJBzEe2MD0ewypjJdfdMZZ/view?usp=share_link\">model</a></td>\n<td align=\"center\"><a href=\"mfvis_nococo/scripts/train_8gpu_mask2former_r101_video_coco.sh\">script</a></td>\n<td align=\"center\"><a href=\"https://dl.fbaipublicfiles.com/maskformer/mask2former/coco/instance/maskformer2_R101_bs16_50ep/model_final_eba159.pkl\">Init</a></td>\n</tr>\n<!-- ROW: maskformer2_swin_base_IN21k_384_bs16_50ep -->\n</tbody></table>\n\n2) Using only a **COCO box-initialized model, without YTVIS video masks** during training:\n<table><tbody>\n<!-- START TABLE -->\n<!-- TABLE 
HEADER -->\n<th valign=\"bottom\">Config Name</th>\n<th valign=\"bottom\">Backbone</th>\n<th valign=\"bottom\">AP</th>\n<th valign=\"bottom\">download</th>\n<th valign=\"bottom\">Training Script</th>\n<th valign=\"bottom\">COCO Box Init Weight</th>\n<!-- TABLE BODY -->\n<!-- ROW: maskformer2_R50_bs16_50ep -->\n <tr><td align=\"left\"><a href=\"mfvis_nococo/configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep.yaml\">MaskFreeVIS</a></td>\n<td align=\"center\">R50</td>\n<td align=\"center\">42.5</td>\n<td align=\"center\"><a href=\"https://drive.google.com/file/d/1F5VZPxR4637JmFu3t4WaKgvWs4WSxPPl/view?usp=share_link\">model</a></td>\n<td align=\"center\"><a href=\"mfvis_nococo/scripts/train_8gpu_mask2former_r50_video.sh\">script</a></td>\n<td align=\"center\"><a href=\"https://drive.google.com/file/d/1qiFBqFK0VEgdj0ulylEqNKGExSguGc8V/view?usp=share_link\">Init</a></td>\n</tr>\n</tbody></table>\n\n\nPlease see our scripts folder.\n\n## Inference & Evaluation\n\nFirst, download the provided trained models from our model zoo tables and put them into the `mfvis_models` folder:
\n\n```\nmkdir mfvis_models\n```\n\nRefer to our [scripts folder](./scripts) for more commands.\n\nExample evaluation scripts:\n```\nbash scripts/eval_8gpu_mask2former_r50_video.sh\nbash scripts/eval_8gpu_mask2former_r101_video.sh\nbash scripts/eval_8gpu_mask2former_swinl_video.sh\n```\n\n## Results Visualization\n\nExample visualization script:\n```\nbash scripts/visual_video.sh\n```\n\n\nCitation\n---------------\nIf you find MaskFreeVIS useful in your research or refer to the provided baseline results, please star :star: this repository and consider citing :pencil::\n```\n@inproceedings{maskfreevis,\n    author={Ke, Lei and Danelljan, Martin and Ding, Henghui and Tai, Yu-Wing and Tang, Chi-Keung and Yu, Fisher},\n    title={Mask-Free Video Instance Segmentation},\n    booktitle = {CVPR},\n    year = {2023}\n}  \n```\n\n## Acknowledgments\n- Thanks to [BoxInst](https://github.com/aim-uofa/AdelaiDet/blob/master/configs/BoxInst/README.md) for its image-based instance segmentation losses.\n- Thanks to [Mask2Former](https://github.com/facebookresearch/Mask2Former) and [VMT](https://github.com/SysCV/vmt) for providing useful inference and evaluation toolkits.\n"
  },
  {
    "path": "configs/coco/instance-segmentation/Base-COCO-InstanceSegmentation.yaml",
    "content": "MODEL:\n  BACKBONE:\n    FREEZE_AT: 0\n    NAME: \"build_resnet_backbone\"\n  WEIGHTS: \"detectron2://ImageNetPretrained/torchvision/R-50.pkl\"\n  PIXEL_MEAN: [123.675, 116.280, 103.530]\n  PIXEL_STD: [58.395, 57.120, 57.375]\n  RESNETS:\n    DEPTH: 50\n    STEM_TYPE: \"basic\"  # not used\n    STEM_OUT_CHANNELS: 64\n    STRIDE_IN_1X1: False\n    OUT_FEATURES: [\"res2\", \"res3\", \"res4\", \"res5\"]\n    # NORM: \"SyncBN\"\n    RES5_MULTI_GRID: [1, 1, 1]  # not used\nDATASETS:\n  TRAIN: (\"coco_2017_train\",)\n  TEST: (\"coco_2017_val\",)\nSOLVER:\n  IMS_PER_BATCH: 16\n  BASE_LR: 0.0001\n  STEPS: (327778, 355092)\n  MAX_ITER: 368750\n  WARMUP_FACTOR: 1.0\n  WARMUP_ITERS: 10\n  WEIGHT_DECAY: 0.05\n  OPTIMIZER: \"ADAMW\"\n  BACKBONE_MULTIPLIER: 0.1\n  CLIP_GRADIENTS:\n    ENABLED: True\n    CLIP_TYPE: \"full_model\"\n    CLIP_VALUE: 0.01\n    NORM_TYPE: 2.0\n  AMP:\n    ENABLED: True\nINPUT:\n  IMAGE_SIZE: 1024\n  MIN_SCALE: 0.1\n  MAX_SCALE: 2.0\n  FORMAT: \"RGB\"\n  DATASET_MAPPER_NAME: \"coco_instance_lsj\"\nTEST:\n  EVAL_PERIOD: 5000\nDATALOADER:\n  FILTER_EMPTY_ANNOTATIONS: True\n  NUM_WORKERS: 4\nVERSION: 2\n"
  },
  {
    "path": "configs/coco/instance-segmentation/maskformer2_R50_bs16_50ep.yaml",
    "content": "_BASE_: Base-COCO-InstanceSegmentation.yaml\nOUTPUT_DIR: './output/'\nMODEL:\n  META_ARCHITECTURE: \"MaskFormer\"\n  SEM_SEG_HEAD:\n    NAME: \"MaskFormerHead\"\n    IGNORE_VALUE: 255\n    NUM_CLASSES: 80\n    LOSS_WEIGHT: 1.0\n    CONVS_DIM: 256\n    MASK_DIM: 256\n    NORM: \"GN\"\n    # pixel decoder\n    PIXEL_DECODER_NAME: \"MSDeformAttnPixelDecoder\"\n    IN_FEATURES: [\"res2\", \"res3\", \"res4\", \"res5\"]\n    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: [\"res3\", \"res4\", \"res5\"]\n    COMMON_STRIDE: 4\n    TRANSFORMER_ENC_LAYERS: 6\n  MASK_FORMER:\n    TRANSFORMER_DECODER_NAME: \"MultiScaleMaskedTransformerDecoder\"\n    TRANSFORMER_IN_FEATURE: \"multi_scale_pixel_decoder\"\n    DEEP_SUPERVISION: True\n    NO_OBJECT_WEIGHT: 0.1\n    CLASS_WEIGHT: 2.0\n    MASK_WEIGHT: 5.0\n    DICE_WEIGHT: 5.0\n    HIDDEN_DIM: 256\n    NUM_OBJECT_QUERIES: 100\n    NHEADS: 8\n    DROPOUT: 0.0\n    DIM_FEEDFORWARD: 2048\n    ENC_LAYERS: 0\n    PRE_NORM: False\n    ENFORCE_INPUT_PROJ: False\n    SIZE_DIVISIBILITY: 32\n    DEC_LAYERS: 10  # 9 decoder layers, add one for the loss on learnable query\n    TRAIN_NUM_POINTS: 12544\n    OVERSAMPLE_RATIO: 3.0\n    IMPORTANCE_SAMPLE_RATIO: 0.75\n    TEST:\n      SEMANTIC_ON: False\n      INSTANCE_ON: True\n      PANOPTIC_ON: False\n      OVERLAP_THRESHOLD: 0.8\n      OBJECT_MASK_THRESHOLD: 0.8\n"
  },
  {
    "path": "configs/youtubevis_2019/Base-YouTubeVIS-VideoInstanceSegmentation.yaml",
    "content": "MODEL:\n  BACKBONE:\n    FREEZE_AT: 0\n    NAME: \"build_resnet_backbone\"\n  WEIGHTS: \"detectron2://ImageNetPretrained/torchvision/R-50.pkl\"\n  PIXEL_MEAN: [123.675, 116.280, 103.530]\n  PIXEL_STD: [58.395, 57.120, 57.375]\n  MASK_ON: True\n  RESNETS:\n    DEPTH: 50\n    STEM_TYPE: \"basic\"  # not used\n    STEM_OUT_CHANNELS: 64\n    STRIDE_IN_1X1: False\n    OUT_FEATURES: [\"res2\", \"res3\", \"res4\", \"res5\"]\n    # NORM: \"SyncBN\"\n    RES5_MULTI_GRID: [1, 1, 1]  # not used\nDATASETS:\n  TRAIN: (\"ytvis_2019_train\", \"coco_2017_train_fake\",)\n  TEST: (\"ytvis_2019_val\",)\nSOLVER:\n  IMS_PER_BATCH: 16\n  BASE_LR: 0.0001\n  STEPS: (4000,)\n  MAX_ITER: 6000\n  WARMUP_FACTOR: 1.0\n  WARMUP_ITERS: 10\n  WEIGHT_DECAY: 0.05\n  OPTIMIZER: \"ADAMW\"\n  BACKBONE_MULTIPLIER: 0.1\n  CLIP_GRADIENTS:\n    ENABLED: True\n    CLIP_TYPE: \"full_model\"\n    CLIP_VALUE: 0.01\n    NORM_TYPE: 2.0\n  AMP:\n    ENABLED: True\nINPUT:\n  MIN_SIZE_TRAIN_SAMPLING: \"choice_by_clip\"\n  RANDOM_FLIP: \"flip_by_clip\"\n  AUGMENTATIONS: []\n  MIN_SIZE_TRAIN: (360, 480)\n  MIN_SIZE_TEST: 360\n  CROP:\n    ENABLED: False\n    TYPE: \"absolute_range\"\n    SIZE: (600, 720)\n  FORMAT: \"RGB\"\nTEST:\n  EVAL_PERIOD: 0\nDATALOADER:\n  FILTER_EMPTY_ANNOTATIONS: False\n  NUM_WORKERS: 4\nVERSION: 2\n"
  },
  {
    "path": "configs/youtubevis_2019/Base-YouTubeVIS-VideoInstanceSegmentation_long.yaml",
    "content": "MODEL:\n  BACKBONE:\n    FREEZE_AT: 0\n    NAME: \"build_resnet_backbone\"\n  WEIGHTS: \"detectron2://ImageNetPretrained/torchvision/R-50.pkl\"\n  PIXEL_MEAN: [123.675, 116.280, 103.530]\n  PIXEL_STD: [58.395, 57.120, 57.375]\n  MASK_ON: True\n  RESNETS:\n    DEPTH: 50\n    STEM_TYPE: \"basic\"  # not used\n    STEM_OUT_CHANNELS: 64\n    STRIDE_IN_1X1: False\n    OUT_FEATURES: [\"res2\", \"res3\", \"res4\", \"res5\"]\n    # NORM: \"SyncBN\"\n    RES5_MULTI_GRID: [1, 1, 1]  # not used\nDATASETS:\n  TRAIN: (\"coco_2017_train_fake\", \"ytvis_2019_train\",)\n  TEST: (\"ytvis_2019_val\",)\nSOLVER:\n  IMS_PER_BATCH: 8\n  BASE_LR: 0.00005\n  STEPS: (75000,)\n  MAX_ITER: 140000\n  WARMUP_FACTOR: 1.0\n  WARMUP_ITERS: 10\n  WEIGHT_DECAY: 0.05\n  OPTIMIZER: \"ADAMW\"\n  BACKBONE_MULTIPLIER: 0.1\n  CLIP_GRADIENTS:\n    ENABLED: True\n    CLIP_TYPE: \"full_model\"\n    CLIP_VALUE: 0.01\n    NORM_TYPE: 2.0\n  AMP:\n    ENABLED: True\nINPUT:\n  MIN_SIZE_TRAIN_SAMPLING: \"choice_by_clip\"\n  RANDOM_FLIP: \"flip_by_clip\"\n  AUGMENTATIONS: []\n  MIN_SIZE_TRAIN: (360, 480)\n  MIN_SIZE_TEST: 360\n  CROP:\n    ENABLED: False\n    TYPE: \"absolute_range\"\n    SIZE: (600, 720)\n  FORMAT: \"RGB\"\nTEST:\n  EVAL_PERIOD: 0\nDATALOADER:\n  FILTER_EMPTY_ANNOTATIONS: False\n  NUM_WORKERS: 4\nVERSION: 2\n"
  },
  {
    "path": "configs/youtubevis_2019/Base-YouTubeVIS-VideoInstanceSegmentation_long_bs16.yaml",
    "content": "MODEL:\n  BACKBONE:\n    FREEZE_AT: 0\n    NAME: \"build_resnet_backbone\"\n  WEIGHTS: \"detectron2://ImageNetPretrained/torchvision/R-50.pkl\"\n  PIXEL_MEAN: [123.675, 116.280, 103.530]\n  PIXEL_STD: [58.395, 57.120, 57.375]\n  MASK_ON: True\n  RESNETS:\n    DEPTH: 50\n    STEM_TYPE: \"basic\"  # not used\n    STEM_OUT_CHANNELS: 64\n    STRIDE_IN_1X1: False\n    OUT_FEATURES: [\"res2\", \"res3\", \"res4\", \"res5\"]\n    # NORM: \"SyncBN\"\n    RES5_MULTI_GRID: [1, 1, 1]  # not used\nDATASETS:\n  TRAIN: (\"coco_2017_train_fake\", \"ytvis_2019_train\",)\n  TEST: (\"ytvis_2019_val\",)\nSOLVER:\n  IMS_PER_BATCH: 16\n  BASE_LR: 0.0001\n  STEPS: (37500,)\n  MAX_ITER: 70000\n  WARMUP_FACTOR: 1.0\n  WARMUP_ITERS: 10\n  WEIGHT_DECAY: 0.05\n  OPTIMIZER: \"ADAMW\"\n  BACKBONE_MULTIPLIER: 0.1\n  CLIP_GRADIENTS:\n    ENABLED: True\n    CLIP_TYPE: \"full_model\"\n    CLIP_VALUE: 0.01\n    NORM_TYPE: 2.0\n  AMP:\n    ENABLED: True\nINPUT:\n  MIN_SIZE_TRAIN_SAMPLING: \"choice_by_clip\"\n  RANDOM_FLIP: \"flip_by_clip\"\n  AUGMENTATIONS: []\n  MIN_SIZE_TRAIN: (360, 480)\n  MIN_SIZE_TEST: 360\n  CROP:\n    ENABLED: False\n    TYPE: \"absolute_range\"\n    SIZE: (600, 720)\n  FORMAT: \"RGB\"\nTEST:\n  EVAL_PERIOD: 0\nDATALOADER:\n  FILTER_EMPTY_ANNOTATIONS: False\n  NUM_WORKERS: 4\nVERSION: 2\n"
  },
  {
    "path": "configs/youtubevis_2019/swin/video_maskformer2_swin_large_IN21k_384_bs16_8ep.yaml",
    "content": "_BASE_: ../video_maskformer2_R50_bs16_8ep_swin.yaml\nOUTPUT_DIR: 'swinl_joint_withcoco'\nMODEL:\n  WEIGHTS: \"./pretrained_model/model_final_e5f453.pkl\"\n  BACKBONE:\n    NAME: \"D2SwinTransformer\"\n  SWIN:\n    EMBED_DIM: 192\n    DEPTHS: [2, 2, 18, 2]\n    NUM_HEADS: [6, 12, 24, 48]\n    WINDOW_SIZE: 12\n    APE: False\n    DROP_PATH_RATE: 0.3\n    PATCH_NORM: True\n    PRETRAIN_IMG_SIZE: 384\n  #WEIGHTS: \"model_final_e5f453.pkl\"\n  PIXEL_MEAN: [123.675, 116.280, 103.530]\n  PIXEL_STD: [58.395, 57.120, 57.375]\n  MASK_FORMER:\n    NUM_OBJECT_QUERIES: 200\nINPUT:\n  MIN_SIZE_TEST: 480\n"
  },
  {
    "path": "configs/youtubevis_2019/video_maskformer2_R101_bs16_8ep.yaml",
    "content": "_BASE_: video_maskformer2_R50_bs16_8ep.yaml\nOUTPUT_DIR: './r101_coco_joint/'\nMODEL:\n  WEIGHTS: \"pretrained_model/model_final_eba159.pkl\"\n  RESNETS:\n    DEPTH: 101\n    STEM_TYPE: \"basic\"  # not used\n    STEM_OUT_CHANNELS: 64\n    STRIDE_IN_1X1: False\n    OUT_FEATURES: [\"res2\", \"res3\", \"res4\", \"res5\"]\n    # NORM: \"SyncBN\"\n    RES5_MULTI_GRID: [1, 1, 1]  # not used\n"
  },
  {
    "path": "configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep.yaml",
    "content": "_BASE_: Base-YouTubeVIS-VideoInstanceSegmentation_long_bs16.yaml\nOUTPUT_DIR: './r50_coco_joint/'\nSEED: 29118357\nMODEL:\n  WEIGHTS: \"./pretrained_model/model_final_3c8ec9.pkl\"\n  META_ARCHITECTURE: \"VideoMaskFormer\"\n  SEM_SEG_HEAD:\n    NAME: \"MaskFormerHead\"\n    IGNORE_VALUE: 255\n    NUM_CLASSES: 40\n    LOSS_WEIGHT: 1.0\n    CONVS_DIM: 256\n    MASK_DIM: 256\n    NORM: \"GN\"\n    # pixel decoder\n    PIXEL_DECODER_NAME: \"MSDeformAttnPixelDecoder\"\n    IN_FEATURES: [\"res2\", \"res3\", \"res4\", \"res5\"]\n    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: [\"res3\", \"res4\", \"res5\"]\n    COMMON_STRIDE: 4\n    TRANSFORMER_ENC_LAYERS: 6\n  MASK_FORMER:\n    TRANSFORMER_DECODER_NAME: \"VideoMultiScaleMaskedTransformerDecoder\"\n    TRANSFORMER_IN_FEATURE: \"multi_scale_pixel_decoder\"\n    DEEP_SUPERVISION: True\n    NO_OBJECT_WEIGHT: 0.1\n    CLASS_WEIGHT: 2.0\n    MASK_WEIGHT: 5.0\n    DICE_WEIGHT: 5.0\n    HIDDEN_DIM: 256\n    NUM_OBJECT_QUERIES: 100\n    NHEADS: 8\n    DROPOUT: 0.0\n    DIM_FEEDFORWARD: 2048\n    ENC_LAYERS: 0\n    PRE_NORM: False\n    ENFORCE_INPUT_PROJ: False\n    SIZE_DIVISIBILITY: 32\n    DEC_LAYERS: 10  # 9 decoder layers, add one for the loss on learnable query\n    TRAIN_NUM_POINTS: 20000 #20000 #12544\n    OVERSAMPLE_RATIO: 3.0\n    IMPORTANCE_SAMPLE_RATIO: 0.75\n    TEST:\n      SEMANTIC_ON: False\n      INSTANCE_ON: True\n      PANOPTIC_ON: False\n      OVERLAP_THRESHOLD: 0.8\n      OBJECT_MASK_THRESHOLD: 0.8\n\nINPUT:\n  MIN_SIZE_TRAIN_SAMPLING: \"choice_by_clip\"\n  PSEUDO:\n    SAMPLING_FRAME_NUM: 4\n    SAMPLING_FRAME_RANGE: 20\n    AUGMENTATIONS: ['rotation']\n    MIN_SIZE_TRAIN: (288, 320, 352, 384, 416, 448, 480, 512)\n    MAX_SIZE_TRAIN: 768\n    CROP:\n      ENABLED: True\n      TYPE: \"absolute_range\"\n      SIZE: (384, 600)\n  LSJ_AUG:\n    ENABLED: False\n    IMAGE_SIZE: 768\n    MIN_SCALE: 0.1\n    MAX_SCALE: 2.0\nDATALOADER:\n  FILTER_EMPTY_ANNOTATIONS: True\n  # NUM_WORKERS: 8\n"
  },
  {
    "path": "configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep_swin.yaml",
    "content": "_BASE_: Base-YouTubeVIS-VideoInstanceSegmentation_long.yaml\nOUTPUT_DIR: './swinl_joint_withcoco/'\nSEED: 29118357\nMODEL:\n  WEIGHTS: \"./pretrained_model/model_final_3c8ec9.pkl\"\n  META_ARCHITECTURE: \"VideoMaskFormer\"\n  SEM_SEG_HEAD:\n    NAME: \"MaskFormerHead\"\n    IGNORE_VALUE: 255\n    NUM_CLASSES: 40\n    LOSS_WEIGHT: 1.0\n    CONVS_DIM: 256\n    MASK_DIM: 256\n    NORM: \"GN\"\n    # pixel decoder\n    PIXEL_DECODER_NAME: \"MSDeformAttnPixelDecoder\"\n    IN_FEATURES: [\"res2\", \"res3\", \"res4\", \"res5\"]\n    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: [\"res3\", \"res4\", \"res5\"]\n    COMMON_STRIDE: 4\n    TRANSFORMER_ENC_LAYERS: 6\n  MASK_FORMER:\n    TRANSFORMER_DECODER_NAME: \"VideoMultiScaleMaskedTransformerDecoder\"\n    TRANSFORMER_IN_FEATURE: \"multi_scale_pixel_decoder\"\n    DEEP_SUPERVISION: True\n    NO_OBJECT_WEIGHT: 0.1\n    CLASS_WEIGHT: 2.0\n    MASK_WEIGHT: 5.0\n    DICE_WEIGHT: 5.0\n    HIDDEN_DIM: 256\n    NUM_OBJECT_QUERIES: 100\n    NHEADS: 8\n    DROPOUT: 0.0\n    DIM_FEEDFORWARD: 2048\n    ENC_LAYERS: 0\n    PRE_NORM: False\n    ENFORCE_INPUT_PROJ: False\n    SIZE_DIVISIBILITY: 32\n    DEC_LAYERS: 10  # 9 decoder layers, add one for the loss on learnable query\n    TRAIN_NUM_POINTS: 20000 #20000 #12544\n    OVERSAMPLE_RATIO: 3.0\n    IMPORTANCE_SAMPLE_RATIO: 0.75\n    TEST:\n      SEMANTIC_ON: False\n      INSTANCE_ON: True\n      PANOPTIC_ON: False\n      OVERLAP_THRESHOLD: 0.8\n      OBJECT_MASK_THRESHOLD: 0.8\n\nINPUT:\n  MIN_SIZE_TRAIN_SAMPLING: \"choice_by_clip\"\n  PSEUDO:\n    SAMPLING_FRAME_NUM: 4\n    SAMPLING_FRAME_RANGE: 20\n    AUGMENTATIONS: ['rotation']\n    MIN_SIZE_TRAIN: (288, 320, 352, 384, 416, 448, 480, 512)\n    MAX_SIZE_TRAIN: 768\n    CROP:\n      ENABLED: True\n      TYPE: \"absolute_range\"\n      SIZE: (384, 600)\n  LSJ_AUG:\n    ENABLED: False\n    IMAGE_SIZE: 768\n    MIN_SCALE: 0.1\n    MAX_SCALE: 2.0\nDATALOADER:\n  FILTER_EMPTY_ANNOTATIONS: True\n  # NUM_WORKERS: 8\n"
  },
  {
    "path": "demo/README.md",
    "content": "## Mask2Former Demo\n\nWe provide a command-line tool to run a simple demo of builtin configs.\nThe usage is explained in [GETTING_STARTED.md](../GETTING_STARTED.md).\n"
  },
  {
    "path": "demo/demo.py",
    "content": "# Modified by Bowen Cheng from: https://github.com/facebookresearch/detectron2/blob/master/demo/demo.py\nimport argparse\nimport glob\nimport multiprocessing as mp\nimport os\n# fmt: off\nimport sys\nsys.path.insert(1, os.path.join(sys.path[0], '..'))\n# fmt: on\nimport tempfile\nimport time\nimport warnings\nimport cv2\nimport numpy as np\nimport tqdm\nfrom detectron2.config import get_cfg\nfrom detectron2.data.detection_utils import read_image\nfrom detectron2.projects.deeplab import add_deeplab_config\nfrom detectron2.utils.logger import setup_logger\nfrom mask2former import add_maskformer2_config\nfrom predictor import VisualizationDemo\n\n# constants\nWINDOW_NAME = \"mask2former demo\"\ndef setup_cfg(args):\n    # load config from file and command-line arguments\n    cfg = get_cfg()\n    add_deeplab_config(cfg)\n    add_maskformer2_config(cfg)\n    cfg.merge_from_file(args.config_file)\n    cfg.merge_from_list(args.opts)\n    cfg.freeze()\n    return cfg\n    \ndef get_parser():\n    parser = argparse.ArgumentParser(description=\"maskformer2 demo for builtin configs\")\n    parser.add_argument(\n        \"--config-file\",\n        default=\"configs/coco/panoptic-segmentation/maskformer2_R50_bs16_50ep.yaml\",\n        metavar=\"FILE\",\n        help=\"path to config file\",\n    )\n    parser.add_argument(\"--webcam\", action=\"store_true\", help=\"Take inputs from webcam.\")\n    parser.add_argument(\"--video-input\", help=\"Path to video file.\")\n    parser.add_argument(\n        \"--input\",\n        nargs=\"+\",\n        help=\"A list of space separated input images; \"\n        \"or a single glob pattern such as 'directory/*.jpg'\",\n    )\n    parser.add_argument(\n        \"--output\",\n        help=\"A file or directory to save output visualizations. 
\"\n        \"If not given, will show output in an OpenCV window.\",\n    )\n    parser.add_argument(\n        \"--confidence-threshold\",\n        type=float,\n        default=0.5,\n        help=\"Minimum score for instance predictions to be shown\",\n    )\n    parser.add_argument(\n        \"--opts\",\n        help=\"Modify config options using the command-line 'KEY VALUE' pairs\",\n        default=[],\n        nargs=argparse.REMAINDER,\n    )\n    return parser\n\ndef test_opencv_video_format(codec, file_ext):\n    with tempfile.TemporaryDirectory(prefix=\"video_format_test\") as dir:\n        filename = os.path.join(dir, \"test_file\" + file_ext)\n        writer = cv2.VideoWriter(\n            filename=filename,\n            fourcc=cv2.VideoWriter_fourcc(*codec),\n            fps=float(30),\n            frameSize=(10, 10),\n            isColor=True,\n        )\n        [writer.write(np.zeros((10, 10, 3), np.uint8)) for _ in range(30)]\n        writer.release()\n        if os.path.isfile(filename):\n            return True\n        return False\n\nif __name__ == \"__main__\":\n    mp.set_start_method(\"spawn\", force=True)\n    args = get_parser().parse_args()\n    setup_logger(name=\"fvcore\")\n    logger = setup_logger()\n    logger.info(\"Arguments: \" + str(args))\n    cfg = setup_cfg(args)\n    demo = VisualizationDemo(cfg)\n    if args.input:\n        if len(args.input) == 1:\n            args.input = glob.glob(os.path.expanduser(args.input[0]))\n            assert args.input, \"The input path(s) was not found\"\n        for path in tqdm.tqdm(args.input, disable=not args.output):\n            # use PIL, to be consistent with evaluation\n            img = read_image(path, format=\"BGR\")\n            start_time = time.time()\n            predictions, visualized_output = demo.run_on_image(img, args.confidence_threshold)\n            logger.info(\n                \"{}: {} in {:.2f}s\".format(\n                    path,\n                    \"detected {} 
instances\".format(len(predictions[\"instances\"]))\n                    if \"instances\" in predictions\n                    else \"finished\",\n                    time.time() - start_time,\n                )\n            )\n            if args.output:\n                if not os.path.isdir(args.output):\n                    os.makedirs(args.output, exist_ok=True)\n                out_filename = os.path.join(args.output, os.path.basename(path))\n                visualized_output.save(out_filename)\n            else:\n                cv2.namedWindow(WINDOW_NAME, cv2.WINDOW_NORMAL)\n                cv2.imshow(WINDOW_NAME, visualized_output.get_image()[:, :, ::-1])\n                if cv2.waitKey(0) == 27:\n                    break  # esc to quit\n    elif args.webcam:\n        assert args.input is None, \"Cannot have both --input and --webcam!\"\n        assert args.output is None, \"output not yet supported with --webcam!\"\n        cam = cv2.VideoCapture(0)\n        for vis in tqdm.tqdm(demo.run_on_video(cam)):\n            cv2.namedWindow(WINDOW_NAME, cv2.WINDOW_NORMAL)\n            cv2.imshow(WINDOW_NAME, vis)\n            if cv2.waitKey(1) == 27:\n                break  # esc to quit\n        cam.release()\n        cv2.destroyAllWindows()\n    elif args.video_input:\n        video = cv2.VideoCapture(args.video_input)\n        width = int(video.get(cv2.CAP_PROP_FRAME_WIDTH))\n        height = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT))\n        frames_per_second = video.get(cv2.CAP_PROP_FPS)\n        num_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))\n        basename = os.path.basename(args.video_input)\n        codec, file_ext = (\n
            (\"x264\", \".mkv\") if test_opencv_video_format(\"x264\", \".mkv\") else (\"mp4v\", \".mp4\")\n        )\n        if codec == \"mp4v\":\n            warnings.warn(\"x264 codec not available, switching to mp4v\")\n        if args.output:\n            if os.path.isdir(args.output):\n                output_fname = os.path.join(args.output, basename)\n                output_fname = os.path.splitext(output_fname)[0] + file_ext\n            else:\n                output_fname = args.output\n            assert not os.path.isfile(output_fname), output_fname\n            output_file = cv2.VideoWriter(\n                filename=output_fname,\n                # some installations of opencv may not support x264 (due to its license),\n                # you can try another format (e.g. MPEG)\n                fourcc=cv2.VideoWriter_fourcc(*codec),\n                fps=float(frames_per_second),\n                frameSize=(width, height),\n                isColor=True,\n            )\n        assert os.path.isfile(args.video_input)\n        for vis_frame in tqdm.tqdm(demo.run_on_video(video), total=num_frames):\n            if args.output:\n                output_file.write(vis_frame)\n            else:\n                cv2.namedWindow(basename, cv2.WINDOW_NORMAL)\n                cv2.imshow(basename, vis_frame)\n                if cv2.waitKey(1) == 27:\n                    break  # esc to quit\n        video.release()\n        if args.output:\n            output_file.release()\n        else:\n            cv2.destroyAllWindows()\n"
  },
  {
    "path": "demo/predictor.py",
    "content": "# Copied from: https://github.com/facebookresearch/detectron2/blob/master/demo/predictor.py\nimport atexit\nimport bisect\nimport multiprocessing as mp\nfrom collections import deque\nimport cv2\nimport torch\nimport numpy as np\nfrom detectron2.data import MetadataCatalog\nfrom detectron2.engine.defaults import DefaultPredictor\nfrom detectron2.utils.video_visualizer import VideoVisualizer\nfrom detectron2.utils.visualizer import ColorMode, Visualizer\nimport matplotlib.pyplot as plt\n\nclass VisualizationDemo(object):\n    def __init__(self, cfg, instance_mode=ColorMode.IMAGE, parallel=False):\n        \"\"\"\n        Args:\n            cfg (CfgNode):\n            instance_mode (ColorMode):\n            parallel (bool): whether to run the model in different processes from visualization.\n                Useful since the visualization logic can be slow.\n        \"\"\"\n        self.metadata = MetadataCatalog.get(\n            cfg.DATASETS.TEST[0] if len(cfg.DATASETS.TEST) else \"__unused\"\n        )\n        self.cpu_device = torch.device(\"cpu\")\n        self.instance_mode = instance_mode\n        self.parallel = parallel\n        self.cfg_vis = cfg\n        if parallel:\n            num_gpu = torch.cuda.device_count()\n            self.predictor = AsyncPredictor(cfg, num_gpus=num_gpu)\n        else:\n            self.predictor = DefaultPredictor(cfg)\n\n    def run_on_image(self, image, conf_thre):\n        \"\"\"\n        Args:\n            image (np.ndarray): an image of shape (H, W, C) (in BGR order).\n                This is the format used by OpenCV.\n        Returns:\n            predictions (dict): the output of the model.\n            vis_output (VisImage): the visualized image output.\n        \"\"\"\n        vis_output = None\n        predictions = self.predictor(image)\n        # Convert image from OpenCV BGR format to Matplotlib RGB format.\n        image = image[:, :, ::-1]\n        visualizer = Visualizer(image, self.metadata, 
instance_mode=self.instance_mode)\n        if \"panoptic_seg\" in predictions:\n            panoptic_seg, segments_info = predictions[\"panoptic_seg\"]\n            vis_output = visualizer.draw_panoptic_seg_predictions(\n                panoptic_seg.to(self.cpu_device), segments_info\n            )\n        else:\n            if \"sem_seg\" in predictions:\n                vis_output = visualizer.draw_sem_seg(\n                    predictions[\"sem_seg\"].argmax(dim=0).to(self.cpu_device)\n                )\n            if \"instances\" in predictions:\n                instances = predictions[\"instances\"].to(self.cpu_device)\n                instances = instances[instances.scores >= conf_thre]\n                '''\n                mask = instances.pred_masks.squeeze(1).data.cpu().numpy()\n                for i_m in range(len(mask)):\n                    print('mask shape:', mask.shape)\n                    print('mask max:', mask.max())\n                    #heatmapshow = cv2.normalize(mask[i], heatmapshow, alpha=0, beta=255, norm_type=cv2.NORM_MINMAX, dtype=cv2.CV_8U)\n                    heatmapshow = cv2.applyColorMap((mask[i_m] * 255).astype(np.uint8), cv2.COLORMAP_JET) \n                    cv2.imwrite(str(i_m)+\"_heatmap_n.jpg\", heatmapshow)\n                '''\n                '''\n                print('instances scores:', instances.scores.shape)\n                print('instances scores:', instances.scores)\n                print('instances class:', instances.pred_classes.shape)\n                print('instances boxes:', instances.pred_boxes)\n                print('instances masks:', instances.pred_masks.shape)\n                instances.pred_boxes = None\n                '''\n                vis_output = visualizer.draw_instance_predictions(predictions=instances)\n        return predictions, vis_output\n\n    def _frame_from_video(self, video):\n        while video.isOpened():\n            success, frame = video.read()\n            if success:\n       
         yield frame\n            else:\n                break\n\n    def run_on_video(self, video):\n        \"\"\"\n        Visualizes predictions on frames of the input video.\n        Args:\n            video (cv2.VideoCapture): a :class:`VideoCapture` object, whose source can be\n                either a webcam or a video file.\n        Yields:\n            ndarray: BGR visualizations of each video frame.\n        \"\"\"\n        video_visualizer = VideoVisualizer(self.metadata, self.instance_mode)\n        def process_predictions(frame, predictions):\n            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)\n            if \"panoptic_seg\" in predictions:\n                panoptic_seg, segments_info = predictions[\"panoptic_seg\"]\n                vis_frame = video_visualizer.draw_panoptic_seg_predictions(\n                    frame, panoptic_seg.to(self.cpu_device), segments_info\n                )\n            elif \"instances\" in predictions:\n                predictions = predictions[\"instances\"].to(self.cpu_device)\n                vis_frame = video_visualizer.draw_instance_predictions(frame, predictions)\n            elif \"sem_seg\" in predictions:\n                vis_frame = video_visualizer.draw_sem_seg(\n                    frame, predictions[\"sem_seg\"].argmax(dim=0).to(self.cpu_device)\n                )\n            # Converts Matplotlib RGB format to OpenCV BGR format\n            vis_frame = cv2.cvtColor(vis_frame.get_image(), cv2.COLOR_RGB2BGR)\n            return vis_frame\n        frame_gen = self._frame_from_video(video)\n        if self.parallel:\n            buffer_size = self.predictor.default_buffer_size\n            frame_data = deque()\n            for cnt, frame in enumerate(frame_gen):\n                frame_data.append(frame)\n                self.predictor.put(frame)\n                if cnt >= buffer_size:\n                    frame = frame_data.popleft()\n                    predictions = self.predictor.get()\n             
       yield process_predictions(frame, predictions)\n            while len(frame_data):\n                frame = frame_data.popleft()\n                predictions = self.predictor.get()\n                yield process_predictions(frame, predictions)\n        else:\n            for frame in frame_gen:\n                yield process_predictions(frame, self.predictor(frame))\n\nclass AsyncPredictor:\n    \"\"\"\n    A predictor that runs the model asynchronously, possibly on >1 GPUs.\n    Because rendering the visualization takes a considerable amount of time,\n    this helps improve throughput a little bit when rendering videos.\n    \"\"\"\n    class _StopToken:\n        pass\n\n    class _PredictWorker(mp.Process):\n        def __init__(self, cfg, task_queue, result_queue):\n            self.cfg = cfg\n            self.task_queue = task_queue\n            self.result_queue = result_queue\n            super().__init__()\n        def run(self):\n            predictor = DefaultPredictor(self.cfg)\n            while True:\n                task = self.task_queue.get()\n                if isinstance(task, AsyncPredictor._StopToken):\n                    break\n                idx, data = task\n                result = predictor(data)\n                self.result_queue.put((idx, result))\n\n    def __init__(self, cfg, num_gpus: int = 1):\n        \"\"\"\n        Args:\n            cfg (CfgNode):\n            num_gpus (int): if 0, will run on CPU\n        \"\"\"\n        num_workers = max(num_gpus, 1)\n        self.task_queue = mp.Queue(maxsize=num_workers * 3)\n        self.result_queue = mp.Queue(maxsize=num_workers * 3)\n        self.procs = []\n        for gpuid in range(max(num_gpus, 1)):\n            cfg = cfg.clone()\n            cfg.defrost()\n            cfg.MODEL.DEVICE = \"cuda:{}\".format(gpuid) if num_gpus > 0 else \"cpu\"\n            self.procs.append(\n                AsyncPredictor._PredictWorker(cfg, self.task_queue, self.result_queue)\n
            )\n        self.put_idx = 0\n        self.get_idx = 0\n        self.result_rank = []\n        self.result_data = []\n        for p in self.procs:\n            p.start()\n        atexit.register(self.shutdown)\n\n    def put(self, image):\n        self.put_idx += 1\n        self.task_queue.put((self.put_idx, image))\n\n    def get(self):\n        self.get_idx += 1  # the index needed for this request\n        if len(self.result_rank) and self.result_rank[0] == self.get_idx:\n            res = self.result_data[0]\n            del self.result_data[0], self.result_rank[0]\n            return res\n        while True:\n            # make sure the results are returned in the correct order\n            idx, res = self.result_queue.get()\n            if idx == self.get_idx:\n                return res\n            insert = bisect.bisect(self.result_rank, idx)\n            self.result_rank.insert(insert, idx)\n            self.result_data.insert(insert, res)\n\n    def __len__(self):\n        return self.put_idx - self.get_idx\n\n    def __call__(self, image):\n        self.put(image)\n        return self.get()\n\n    def shutdown(self):\n        for _ in self.procs:\n            self.task_queue.put(AsyncPredictor._StopToken())\n\n    @property\n    def default_buffer_size(self):\n        return len(self.procs) * 5\n"
  },
  {
    "path": "demo_video/README.md",
    "content": "## Video Mask2Former Demo\n\nWe provide a command-line tool to run a simple demo of builtin configs.\nThe usage is explained in [GETTING_STARTED.md](../GETTING_STARTED.md).\n"
  },
  {
    "path": "demo_video/demo.py",
    "content": "# Modified by Bowen Cheng from: https://github.com/facebookresearch/detectron2/blob/master/demo/demo.py\nimport argparse\nimport glob\nimport multiprocessing as mp\nimport os\n# fmt: off\nimport sys\nsys.path.insert(1, os.path.join(sys.path[0], '..'))\n# fmt: on\nimport tempfile\nimport time\nimport warnings\nimport cv2\nimport numpy as np\nimport tqdm\nfrom torch.cuda.amp import autocast\nfrom detectron2.config import get_cfg\nfrom detectron2.data.detection_utils import read_image\nfrom detectron2.projects.deeplab import add_deeplab_config\nfrom detectron2.utils.logger import setup_logger\nfrom mask2former import add_maskformer2_config\nfrom mask2former_video import add_maskformer2_video_config\nfrom predictor import VisualizationDemo\nimport imageio\n\n# constants\nWINDOW_NAME = \"mask2former video demo\"\ndef setup_cfg(args):\n    # load config from file and command-line arguments\n    cfg = get_cfg()\n    add_deeplab_config(cfg)\n    add_maskformer2_config(cfg)\n    add_maskformer2_video_config(cfg)\n    cfg.merge_from_file(args.config_file)\n    cfg.merge_from_list(args.opts)\n    cfg.freeze()\n    return cfg\ndef get_parser():\n    parser = argparse.ArgumentParser(description=\"maskformer2 demo for builtin configs\")\n    parser.add_argument(\n        \"--config-file\",\n        default=\"configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep.yaml\",\n        metavar=\"FILE\",\n        help=\"path to config file\",\n    )\n    parser.add_argument(\"--video-input\", help=\"Path to video file.\")\n    parser.add_argument(\n        \"--input\",\n        nargs=\"+\",\n        help=\"A list of space separated input images; \"\n        \"or a single glob pattern such as 'directory/*.jpg'\"\n        \"this will be treated as frames of a video\",\n    )\n    parser.add_argument(\n        \"--output\",\n        help=\"A file or directory to save output visualizations. 
\"\n        \"If not given, will show output in an OpenCV window.\",\n    )\n    parser.add_argument(\n        \"--save-frames\",\n        default=False,\n        help=\"Save frame level image outputs.\",\n    )\n    parser.add_argument(\n        \"--confidence-threshold\",\n        type=float,\n        default=0.5,\n        help=\"Minimum score for instance predictions to be shown\",\n    )\n    parser.add_argument(\n        \"--opts\",\n        help=\"Modify config options using the command-line 'KEY VALUE' pairs\",\n        default=[],\n        nargs=argparse.REMAINDER,\n    )\n    return parser\n\ndef test_opencv_video_format(codec, file_ext):\n    with tempfile.TemporaryDirectory(prefix=\"video_format_test\") as dir:\n        filename = os.path.join(dir, \"test_file\" + file_ext)\n        writer = cv2.VideoWriter(\n            filename=filename,\n            fourcc=cv2.VideoWriter_fourcc(*codec),\n            fps=float(30),\n            frameSize=(10, 10),\n            isColor=True,\n        )\n        [writer.write(np.zeros((10, 10, 3), np.uint8)) for _ in range(30)]\n        writer.release()\n        if os.path.isfile(filename):\n            return True\n        return False\n\nif __name__ == \"__main__\":\n    mp.set_start_method(\"spawn\", force=True)\n    args = get_parser().parse_args()\n    setup_logger(name=\"fvcore\")\n    logger = setup_logger()\n    logger.info(\"Arguments: \" + str(args))\n    cfg = setup_cfg(args)\n    demo = VisualizationDemo(cfg)\n    if args.output:\n        os.makedirs(args.output, exist_ok=True)\n    if args.input:\n        # each entry under args.input is treated as a directory of video frames\n        args.input = args.input[0]\n        for file_name in os.listdir(args.input):\n            input_path_list = sorted(\n
                [os.path.join(args.input, file_name, f) for f in os.listdir(os.path.join(args.input, file_name))]\n            )\n            if len(input_path_list) == 0:\n                continue\n            vid_frames = []\n            for path in input_path_list:\n                img = read_image(path, format=\"BGR\")\n                vid_frames.append(img)\n            start_time = time.time()\n            with autocast():\n                predictions, visualized_output = demo.run_on_video(vid_frames, args.confidence_threshold)\n            logger.info(\n                \"detected {} instances per frame in {:.2f}s\".format(\n                    len(predictions[\"pred_scores\"]), time.time() - start_time\n                )\n            )\n            if args.output:\n                if args.save_frames:\n                    os.makedirs(os.path.join(args.output, file_name), exist_ok=True)\n                    for path, _vis_output in zip(input_path_list, visualized_output):\n                        out_filename = os.path.join(args.output, file_name, os.path.basename(path))\n                        _vis_output.save(out_filename)\n                images = []\n                for _vis_output in visualized_output:\n                    frame = _vis_output.get_image()  # RGB, as expected by imageio\n                    images.append(frame)\n                imageio.mimsave(os.path.join(args.output, file_name + \".gif\"), images, fps=5)\n
    elif args.video_input:\n        video = cv2.VideoCapture(args.video_input)\n        vid_frames = []\n        while video.isOpened():\n            success, frame = video.read()\n            if success:\n                vid_frames.append(frame)\n            else:\n                break\n        video.release()\n        start_time = time.time()\n        with autocast():\n            predictions, visualized_output = demo.run_on_video(vid_frames, args.confidence_threshold)\n        logger.info(\n            \"detected {} instances per frame in {:.2f}s\".format(\n                len(predictions[\"pred_scores\"]), time.time() - start_time\n            )\n        )\n        if args.output:\n            if args.save_frames:\n                for idx, _vis_output in enumerate(visualized_output):\n                    out_filename = os.path.join(args.output, f\"{idx}.jpg\")\n                    _vis_output.save(out_filename)\n            H, W = visualized_output[0].height, visualized_output[0].width\n            fourcc = cv2.VideoWriter_fourcc(*\"mp4v\")\n            out = cv2.VideoWriter(os.path.join(args.output, \"visualization.mp4\"), fourcc, 10.0, (W, H), True)\n            for _vis_output in visualized_output:\n                frame = _vis_output.get_image()[:, :, ::-1]\n                out.write(frame)\n            out.release()\n"
  },
  {
    "path": "demo_video/predictor.py",
    "content": "# reference: https://github.com/sukjunhwang/IFC/blob/master/projects/IFC/demo/predictor.py\nimport atexit\nimport bisect\nimport multiprocessing as mp\nfrom collections import deque\nimport cv2\nimport torch\nfrom visualizer import TrackVisualizer\nfrom detectron2.data import MetadataCatalog\nfrom detectron2.engine.defaults import DefaultPredictor\nfrom detectron2.structures import Instances\nfrom detectron2.utils.video_visualizer import VideoVisualizer\nfrom detectron2.utils.visualizer import ColorMode\n\nclass VisualizationDemo(object):\n    def __init__(self, cfg, instance_mode=ColorMode.IMAGE, parallel=False):\n        \"\"\"\n        Args:\n            cfg (CfgNode):\n            instance_mode (ColorMode):\n            parallel (bool): whether to run the model in different processes from visualization.\n                Useful since the visualization logic can be slow.\n        \"\"\"\n        self.metadata = MetadataCatalog.get(\n            cfg.DATASETS.TEST[0] if len(cfg.DATASETS.TEST) else \"__unused\"\n        )\n        self.cpu_device = torch.device(\"cpu\")\n        self.instance_mode = instance_mode\n        self.parallel = parallel\n        if parallel:\n            num_gpu = torch.cuda.device_count()\n            self.predictor = AsyncPredictor(cfg, num_gpus=num_gpu)\n        else:\n            self.predictor = VideoPredictor(cfg)\n\n    def run_on_video(self, frames, conf_thre):\n        \"\"\"\n        Args:\n            frames (List[np.ndarray]): a list of images of shape (H, W, C) (in BGR order).\n                This is the format used by OpenCV.\n        Returns:\n            predictions (dict): the output of the model.\n            vis_output (VisImage): the visualized image output.\n        \"\"\"\n        vis_output = None\n        predictions = self.predictor(frames)\n        image_size = predictions[\"image_size\"]\n        pred_scores = predictions[\"pred_scores\"]\n        pred_labels = predictions[\"pred_labels\"]\n      
  pred_masks = predictions[\"pred_masks\"]\n        remain_index = [ii for ii in range(len(pred_scores)) if pred_scores[ii] >= conf_thre]\n        pred_scores = [pred_scores[ind] for ind in remain_index]\n        pred_labels = [pred_labels[ind] for ind in remain_index]\n        pred_masks = [pred_masks[ind] for ind in remain_index]\n        frame_masks = list(zip(*pred_masks))\n        total_vis_output = []\n        for frame_idx in range(len(frames)):\n            frame = frames[frame_idx][:, :, ::-1]\n            visualizer = TrackVisualizer(frame, self.metadata, instance_mode=self.instance_mode)\n            ins = Instances(image_size)\n            if len(pred_scores) > 0:\n                ins.scores = pred_scores\n                ins.pred_classes = pred_labels\n                ins.pred_masks = torch.stack(frame_masks[frame_idx], dim=0)\n            vis_output = visualizer.draw_instance_predictions(predictions=ins)\n            total_vis_output.append(vis_output)\n        return predictions, total_vis_output\n\nclass VideoPredictor(DefaultPredictor):\n    \"\"\"\n    Create a simple end-to-end predictor with the given config that runs on a\n    single device for a single input image.\n    Compared to using the model directly, this class makes the following additions:\n    1. Load checkpoint from `cfg.MODEL.WEIGHTS`.\n    2. Always take BGR image as the input and apply conversion defined by `cfg.INPUT.FORMAT`.\n    3. Apply resizing defined by `cfg.INPUT.{MIN,MAX}_SIZE_TEST`.\n    4. 
Take a list of input frames and produce a single output, instead of a batch.\n    If you'd like to do anything more fancy, please refer to its source code\n    as examples to build and use the model manually.\n    Attributes:\n        metadata (Metadata): the metadata of the underlying dataset, obtained from\n            cfg.DATASETS.TEST.\n    Examples:\n    ::\n        pred = DefaultPredictor(cfg)\n        inputs = cv2.imread(\"input.jpg\")\n        outputs = pred(inputs)\n    \"\"\"\n    def __call__(self, frames):\n        \"\"\"\n        Args:\n            frames (list[np.ndarray]): a list of frames of shape (H, W, C) (in BGR order).\n        Returns:\n            predictions (dict):\n                the output of the model for the whole clip.\n                See :doc:`/tutorials/models` for details about the format.\n        \"\"\"\n        with torch.no_grad():  # https://github.com/sphinx-doc/sphinx/issues/4258\n            input_frames = []\n            for original_image in frames:\n                # Apply pre-processing to image.\n                if self.input_format == \"RGB\":\n                    # whether the model expects BGR inputs or RGB\n                    original_image = original_image[:, :, ::-1]\n                height, width = original_image.shape[:2]\n                image = self.aug.get_transform(original_image).apply_image(original_image)\n                image = torch.as_tensor(image.astype(\"float32\").transpose(2, 0, 1))\n                input_frames.append(image)\n            inputs = {\"image\": input_frames, \"height\": height, \"width\": width}\n            predictions = self.model([inputs])\n            return predictions\n\nclass AsyncPredictor:\n    \"\"\"\n    A predictor that runs the model asynchronously, possibly on >1 GPUs.\n    Because rendering the visualization takes a considerable amount of time,\n    this helps improve throughput when rendering videos.\n    \"\"\"\n    class _StopToken:\n        pass\n    class 
_PredictWorker(mp.Process):\n        def __init__(self, cfg, task_queue, result_queue):\n            self.cfg = cfg\n            self.task_queue = task_queue\n            self.result_queue = result_queue\n            super().__init__()\n        def run(self):\n            predictor = VideoPredictor(self.cfg)\n            while True:\n                task = self.task_queue.get()\n                if isinstance(task, AsyncPredictor._StopToken):\n                    break\n                idx, data = task\n                result = predictor(data)\n                self.result_queue.put((idx, result))\n\n    def __init__(self, cfg, num_gpus: int = 1):\n        \"\"\"\n        Args:\n            cfg (CfgNode):\n            num_gpus (int): if 0, will run on CPU\n        \"\"\"\n        num_workers = max(num_gpus, 1)\n        self.task_queue = mp.Queue(maxsize=num_workers * 3)\n        self.result_queue = mp.Queue(maxsize=num_workers * 3)\n        self.procs = []\n        for gpuid in range(max(num_gpus, 1)):\n            cfg = cfg.clone()\n            cfg.defrost()\n            cfg.MODEL.DEVICE = \"cuda:{}\".format(gpuid) if num_gpus > 0 else \"cpu\"\n            self.procs.append(\n                AsyncPredictor._PredictWorker(cfg, self.task_queue, self.result_queue)\n            )\n        self.put_idx = 0\n        self.get_idx = 0\n        self.result_rank = []\n        self.result_data = []\n        for p in self.procs:\n            p.start()\n        atexit.register(self.shutdown)\n\n    def put(self, image):\n        self.put_idx += 1\n        self.task_queue.put((self.put_idx, image))\n\n    def get(self):\n        self.get_idx += 1  # the index needed for this request\n        if len(self.result_rank) and self.result_rank[0] == self.get_idx:\n            res = self.result_data[0]\n            del self.result_data[0], self.result_rank[0]\n            return res\n        while True:\n            # make sure the results are returned in the correct order\n            
idx, res = self.result_queue.get()\n            if idx == self.get_idx:\n                return res\n            insert = bisect.bisect(self.result_rank, idx)\n            self.result_rank.insert(insert, idx)\n            self.result_data.insert(insert, res)\n\n    def __len__(self):\n        return self.put_idx - self.get_idx\n\n    def __call__(self, image):\n        self.put(image)\n        return self.get()\n\n    def shutdown(self):\n        for _ in self.procs:\n            self.task_queue.put(AsyncPredictor._StopToken())\n            \n    @property\n    def default_buffer_size(self):\n        return len(self.procs) * 5\n"
  },
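The `AsyncPredictor.get` method above reassembles results that workers finish out of order, keeping a buffer sorted by submission index via `bisect`. A minimal stand-alone sketch of that bookkeeping (the `OrderedResults` name and the simulated arrival order are illustrative, not part of the codebase):

```python
import bisect

class OrderedResults:
    """Reassemble out-of-order (idx, result) pairs into submission order,
    mirroring the result_rank/result_data bookkeeping in AsyncPredictor.get()."""

    def __init__(self):
        self.get_idx = 0       # index of the next result the caller expects
        self.result_rank = []  # sorted submission indices of buffered results
        self.result_data = []  # buffered results, parallel to result_rank

    def wait_for_next(self, arrivals):
        """Return the next result in submission order; `arrivals` yields
        (idx, result) pairs in whatever order the workers finished."""
        self.get_idx += 1
        # the needed result may already be sitting in the buffer
        if self.result_rank and self.result_rank[0] == self.get_idx:
            self.result_rank.pop(0)
            return self.result_data.pop(0)
        for idx, res in arrivals:
            if idx == self.get_idx:
                return res
            # buffer an out-of-order result, keeping result_rank sorted
            pos = bisect.bisect(self.result_rank, idx)
            self.result_rank.insert(pos, idx)
            self.result_data.insert(pos, res)

# workers finish in order 2, 1, 3; callers still see 1, 2, 3
arrivals = iter([(2, "b"), (1, "a"), (3, "c")])
buf = OrderedResults()
out = [buf.wait_for_next(arrivals) for _ in range(3)]
print(out)  # -> ['a', 'b', 'c']
```

The sorted insert means each out-of-order result is buffered in O(log n) lookup plus list-insert cost, and the head of `result_rank` is always the oldest pending index.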
  {
    "path": "demo_video/visualizer.py",
    "content": "# reference: https://github.com/sukjunhwang/IFC/blob/master/projects/IFC/demo/visualizer.py\nimport torch\nimport numpy as np\nimport matplotlib.colors as mplc\nfrom detectron2.utils.visualizer import ColorMode, GenericMask, Visualizer, _create_text_labels\n_ID_JITTERS = [[0.9047944201469568, 0.3241718265806123, 0.33443746665210006], [0.4590171386127151, 0.9095038146383864, 0.3143840671974788], [0.4769356899795538, 0.5044406738441948, 0.5354530846360839], [0.00820945625670777, 0.24099210193126785, 0.15471834055332978], [0.6195684374237388, 0.4020380013509799, 0.26100266066404676], [0.08281237756545068, 0.05900744492710419, 0.06106221202154216], [0.2264886829978755, 0.04925271007292076, 0.10214429345996079], [0.1888247470009874, 0.11275000298612425, 0.46112894830685514], [0.37415767691880975, 0.844284596118331, 0.950471611180866], [0.3817344218157631, 0.3483259270707101, 0.6572989333690541], [0.2403115731054466, 0.03078280287279167, 0.5385975692534737], [0.7035076951650824, 0.12352084932325424, 0.12873080308790197], [0.12607434914489934, 0.111244793010015, 0.09333334699716023], [0.6551607300342269, 0.7003064103554443, 0.4131794512286162], [0.13592107365596595, 0.5390702818232149, 0.004540643174930525], [0.38286244894454347, 0.709142545393449, 0.529074791609835], [0.4279376583651734, 0.5634708596431771, 0.8505569717104301], [0.3460488523902999, 0.464769595519293, 0.6676839675477276], [0.8544063246675081, 0.5041190233407755, 0.9081217697141578], [0.9207009090747208, 0.2403865944739051, 0.05375410999863772], [0.6515786136947107, 0.6299918449948327, 0.45292029442034387], [0.986174217295693, 0.2424849846977214, 0.3981993323108266], [0.22101915872994693, 0.3408589198278038, 0.006381420347677524], [0.3159785813515982, 0.1145748921741011, 0.595754317197274], [0.10263421488052715, 0.5864139253490858, 0.23908000741142432], [0.8272999391532938, 0.6123527260897751, 0.3365197327803193], [0.5269583712937912, 0.25668929554516506, 0.7888411215078127], 
[0.2433880265410031, 0.7240751234287827, 0.8483215810528648], [0.7254601709704898, 0.8316525547295984, 0.9325253855921963], [0.5574483824856672, 0.2935331727879944, 0.6594839453793155], [0.6209642371433579, 0.054030693198821256, 0.5080873988178534], [0.9055507077365624, 0.12865888619203514, 0.9309191861440005], [0.9914469722960537, 0.3074114506206205, 0.8762107657323488], [0.4812682518247371, 0.15055826298548158, 0.9656340505308308], [0.6459219454316445, 0.9144794010251625, 0.751338812155106], [0.860840174209798, 0.8844626353077639, 0.3604624506769899], [0.8194991672032272, 0.926399617787601, 0.8059222327343247], [0.6540413175393658, 0.04579445254618297, 0.26891917826531275], [0.37778835833987046, 0.36247927666109536, 0.7989799305827889], [0.22738304978177726, 0.9038018263773739, 0.6970838854138303], [0.6362015495896184, 0.527680794236961, 0.5570915425178721], [0.6436401915860954, 0.6316925317144524, 0.9137151236993912], [0.04161828388587163, 0.3832413349082706, 0.6880829921949752], [0.7768167825719299, 0.8933821497682587, 0.7221278391266809], [0.8632760876301346, 0.3278628094906323, 0.8421587587114462], [0.8556499133262127, 0.6497385872901932, 0.5436895688477963], [0.9861940318610894, 0.03562313777386272, 0.9183454677106616], [0.8042586091176366, 0.6167222703170994, 0.24181981557207644], [0.9504247117633057, 0.3454233714011461, 0.6883727005547743], [0.9611909135491202, 0.46384154263898114, 0.32700443315058914], [0.523542176970206, 0.446222414615845, 0.9067402987747814], [0.7536954008682911, 0.6675512338797588, 0.22538238957839196], [0.1554052265688285, 0.05746097492966129, 0.8580358872587424], [0.8540838640971405, 0.9165504335482566, 0.6806982829158964], [0.7065090319405029, 0.8683059983962002, 0.05167128320624026], [0.39134812961899124, 0.8910075505622979, 0.7639815712623922], [0.1578117311479783, 0.20047326898284668, 0.9220177338840568], [0.2017488993096358, 0.6949259970936679, 0.8729196864798128], [0.5591089340651949, 0.15576770423813258, 0.1469857469387812], 
[0.14510398622626974, 0.24451497734532168, 0.46574271993578786], [0.13286397822351492, 0.4178244533944635, 0.03728728952131943], [0.556463206310225, 0.14027595183361663, 0.2731537988657907], [0.4093837966398032, 0.8015225687789814, 0.8033567296903834], [0.527442563956637, 0.902232617214431, 0.7066626674362227], [0.9058355503297827, 0.34983989180213004, 0.8353262183839384], [0.7108382186953104, 0.08591307895133471, 0.21434688012521974], [0.22757345065207668, 0.7943075496583976, 0.2992305547627421], [0.20454109788173636, 0.8251670332103687, 0.012981987094547232], [0.7672562637297392, 0.005429019973062554, 0.022163616037108702], [0.37487345910117564, 0.5086240194440863, 0.9061216063654387], [0.9878004014101087, 0.006345852772772331, 0.17499753379350858], [0.030061528704491303, 0.1409704315546606, 0.3337131835834506], [0.5022506782611504, 0.5448435505388706, 0.40584238936140726], [0.39560774627423445, 0.8905943695833262, 0.5850815030921116], [0.058615671926786406, 0.5365713844300387, 0.1620457551256279], [0.41843842882069693, 0.1536005983609976, 0.3127878501592438], [0.05947621790155899, 0.5412421167331932, 0.2611322146455659], [0.5196159938235607, 0.7066461551682705, 0.970261497412556], [0.30443031606149007, 0.45158581060034975, 0.4331841153149706], [0.8848298403933996, 0.7241791700943656, 0.8917110054596072], [0.5720260591898779, 0.3072801598203052, 0.8891066705989902], [0.13964015336177327, 0.2531778096760302, 0.5703756837403124], [0.2156307542329836, 0.4139947500641685, 0.87051676884144], [0.10800455881891169, 0.05554646035458266, 0.2947027428551443], [0.35198009410633857, 0.365849666213808, 0.06525787683513773], [0.5223264108118847, 0.9032195574351178, 0.28579084943315025], [0.7607724246546966, 0.3087194381828555, 0.6253235528354899], [0.5060485442077824, 0.19173600467625274, 0.9931175692203702], [0.5131805830323746, 0.07719515392040577, 0.923212006754969], [0.3629762141280106, 0.02429179642710888, 0.6963754952399983], [0.7542592485456767, 0.6478893299494212, 
0.3424965345400731], [0.49944574453364454, 0.6775665366832825, 0.33758796076989583], [0.010621818120767679, 0.8221571611173205, 0.5186257457566332], [0.5857910304290109, 0.7178133992025467, 0.9729243483606071], [0.16987399482717613, 0.9942570210657463, 0.18120758122552927], [0.016362572521240848, 0.17582788603087263, 0.7255176922640298], [0.10981764283706419, 0.9078582203470377, 0.7638063718334003], [0.9252097840441119, 0.3330197086990039, 0.27888705301420136], [0.12769972651171546, 0.11121470804891687, 0.12710743734391716], [0.5753520518360334, 0.2763862879599456, 0.6115636613363361]]\n_OFF_WHITE = (1.0, 1.0, 240.0 / 255)\n\nclass TrackVisualizer(Visualizer):\n    def __init__(self, img_rgb, metadata=None, scale=1.0, instance_mode=ColorMode.IMAGE):\n        super().__init__(\n            img_rgb, metadata=metadata, scale=scale, instance_mode=instance_mode\n        )\n        self.cpu_device = torch.device(\"cpu\")\n    def _jitter(self, color, id):\n        \"\"\"\n        Randomly modifies given color to produce a slightly different color than the color given.\n        Args:\n            color (tuple[double]): a tuple of 3 elements, containing the RGB values of the color\n                picked. The values in the list are in the [0.0, 1.0] range.\n        Returns:\n            jittered_color (tuple[double]): a tuple of 3 elements, containing the RGB values of the\n                color after being jittered. 
The values in the list are in the [0.0, 1.0] range.\n        \"\"\"\n        color = mplc.to_rgb(color)\n        vec = _ID_JITTERS[id]\n        # better to do it in another color space\n        vec = vec / np.linalg.norm(vec) * 0.5\n        res = np.clip(vec + color, 0, 1)\n        return tuple(res)\n\n    def overlay_instances(\n        self,\n        *,\n        boxes=None,\n        labels=None,\n        masks=None,\n        keypoints=None,\n        assigned_colors=None,\n        alpha=0.5\n    ):\n        \"\"\"\n        Args:\n            boxes (Boxes, RotatedBoxes or ndarray): either a :class:`Boxes`,\n                or an Nx4 numpy array of XYXY_ABS format for the N objects in a single image,\n                or a :class:`RotatedBoxes`,\n                or an Nx5 numpy array of (x_center, y_center, width, height, angle_degrees) format\n                for the N objects in a single image,\n            labels (list[str]): the text to be displayed for each instance.\n            masks (masks-like object): Supported types are:\n                * :class:`detectron2.structures.PolygonMasks`,\n                  :class:`detectron2.structures.BitMasks`.\n                * list[list[ndarray]]: contains the segmentation masks for all objects in one image.\n                  The first level of the list corresponds to individual instances. The second\n                  level to all the polygon that compose the instance, and the third level\n                  to the polygon coordinates. 
The third level should have the format of\n                  [x0, y0, x1, y1, ..., xn, yn] (n >= 3).\n                * list[ndarray]: each ndarray is a binary mask of shape (H, W).\n                * list[dict]: each dict is a COCO-style RLE.\n            keypoints (Keypoint or array like): an array-like object of shape (N, K, 3),\n                where N is the number of instances and K is the number of keypoints.\n                The last dimension corresponds to (x, y, visibility or score).\n            assigned_colors (list[matplotlib.colors]): a list of colors, where each color\n                corresponds to each mask or box in the image. Refer to 'matplotlib.colors'\n                for full list of formats that the colors are accepted in.\n        Returns:\n            output (VisImage): image object with visualizations.\n        \"\"\"\n        num_instances = 0\n        if boxes is not None:\n            boxes = self._convert_boxes(boxes)\n            num_instances = len(boxes)\n        if masks is not None:\n            # masks arrive as binary (H, W) arrays here, so no conversion is needed\n            if num_instances:\n                assert len(masks) == num_instances\n            else:\n                num_instances = len(masks)\n        if keypoints is not None:\n            if num_instances:\n                assert len(keypoints) == num_instances\n            else:\n                num_instances = len(keypoints)\n            keypoints = self._convert_keypoints(keypoints)\n        if labels is not None:\n            assert len(labels) == num_instances\n        if assigned_colors is None:\n            # imported locally to keep the module-level imports unchanged\n            from detectron2.utils.colormap import random_color\n            assigned_colors = [random_color(rgb=True, maximum=1) for _ in range(num_instances)]\n        if num_instances == 0:\n            return self.output\n        if boxes is not None and boxes.shape[1] == 5:\n            return self.overlay_rotated_instances(\n                boxes=boxes, labels=labels, assigned_colors=assigned_colors\n          
  )\n        # Display in largest to smallest order to reduce occlusion.\n        areas = None\n        if boxes is not None:\n            areas = np.prod(boxes[:, 2:] - boxes[:, :2], axis=1)\n        elif masks is not None:\n            areas = np.asarray([x.sum() for x in masks])\n        if areas is not None:\n            sorted_idxs = np.argsort(-areas).tolist()\n            # Re-order overlapped instances in descending order.\n            boxes = boxes[sorted_idxs] if boxes is not None else None\n            labels = [labels[k] for k in sorted_idxs] if labels is not None else None\n            masks = [masks[idx] for idx in sorted_idxs] if masks is not None else None\n            assigned_colors = [assigned_colors[idx] for idx in sorted_idxs]\n            keypoints = keypoints[sorted_idxs] if keypoints is not None else None\n        for i in range(num_instances):\n            color = assigned_colors[i]\n            # if boxes is not None:\n            #     self.draw_box(boxes[i], edge_color=color)\n            if masks is not None:\n                binary_mask = masks[i].astype(np.uint8)\n                self.draw_binary_mask(\n                    binary_mask,\n                    color=color,\n                    edge_color=None,  # _OFF_WHITE\n                    alpha=alpha,\n                )\n            if False:\n            # if labels is not None:\n                # first get a box\n                if boxes is not None:\n                    x0, y0, x1, y1 = boxes[i]\n                    text_pos = (x0, y0)  # if drawing boxes, put text on the box corner.\n                    horiz_align = \"left\"\n                elif masks is not None:\n                    # skip small mask without polygon\n                    if len(masks[i].polygons) == 0:\n                        continue\n                    x0, y0, x1, 
y1 = masks[i].bbox()\n                    # draw text in the center (defined by median) when box is not drawn\n                    # median is less sensitive to outliers.\n                    text_pos = np.median(masks[i].mask.nonzero(), axis=1)[::-1]\n                    horiz_align = \"center\"\n                else:\n                    continue  # drawing the box confidence for keypoints isn't very useful.\n                # for small objects, draw text at the side to avoid occlusion\n                instance_area = (y1 - y0) * (x1 - x0)\n                if (\n                    instance_area < _SMALL_OBJECT_AREA_THRESH * self.output.scale\n                    or y1 - y0 < 40 * self.output.scale\n                ):\n                    if y1 >= self.output.height - 5:\n                        text_pos = (x1, y0)\n                    else:\n                        text_pos = (x0, y1)\n                height_ratio = (y1 - y0) / np.sqrt(self.output.height * self.output.width)\n                lighter_color = self._change_color_brightness(color, brightness_factor=0.7)\n                font_size = (\n                    np.clip((height_ratio - 0.02) / 0.08 + 1, 1.2, 2)\n                    * 0.5\n                    * self._default_font_size\n                )\n                # self.draw_text(\n                #     labels[i],\n                #     text_pos,\n                #     color=lighter_color,\n                #     horizontal_alignment=horiz_align,\n                #     font_size=font_size,\n                # )\n        # draw keypoints\n        if keypoints is not None:\n            for keypoints_per_instance in keypoints:\n                self.draw_and_connect_keypoints(keypoints_per_instance)\n        return self.output\n        \n    def draw_instance_predictions(self, predictions):\n        \"\"\"\n        Draw instance-level prediction results on an image.\n        Args:\n            predictions (Instances): the output of an instance 
detection/segmentation\n                model. Following fields will be used to draw:\n                \"pred_boxes\", \"pred_classes\", \"scores\", \"pred_masks\" (or \"pred_masks_rle\").\n        Returns:\n            output (VisImage): image object with visualizations.\n        \"\"\"\n        preds = predictions.to(self.cpu_device)\n        boxes = preds.pred_boxes if preds.has(\"pred_boxes\") else None\n        scores = preds.scores if preds.has(\"scores\") else None\n        classes = preds.pred_classes if preds.has(\"pred_classes\") else None\n        labels = _create_text_labels(classes, scores, self.metadata.get(\"thing_classes\", None))\n        if labels is not None:\n            labels = [\"[{}] \".format(_id) + l for _id, l in enumerate(labels)]\n        if preds.has(\"pred_masks\"):\n            masks = np.asarray(preds.pred_masks)\n            # masks = [GenericMask(x, self.output.height, self.output.width) for x in masks]\n        else:\n            masks = None\n        if classes is None:\n            return self.output\n        colors = [\n            self._jitter([x / 255 for x in self.metadata.thing_colors[c]], id) for id, c in enumerate(classes)\n        ]\n        alpha = 0.5\n        if self._instance_mode == ColorMode.IMAGE_BW:\n            self.output.img = self._create_grayscale_image(\n                (preds.pred_masks.any(dim=0) > 0).numpy()\n                if preds.has(\"pred_masks\")\n                else None\n            )\n            alpha = 0.3\n        self.overlay_instances(\n            masks=masks,\n            boxes=boxes,\n            labels=labels,\n            assigned_colors=colors,\n            alpha=alpha,\n        )\n        return self.output\n"
  },
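`TrackVisualizer._jitter` keeps each track's color stable across frames: the track id indexes a fixed vector in `_ID_JITTERS`, which is rescaled to length 0.5, added to the base color, and clipped back into [0, 1]. A torch/numpy-free sketch of that arithmetic (the `jitter` helper and the sample vector are illustrative):

```python
import math

def jitter(color, vec):
    """Offset an RGB color (components in [0, 1]) by a fixed direction `vec`,
    rescaled to length 0.5 and clipped to the valid range -- the same
    arithmetic TrackVisualizer._jitter applies with _ID_JITTERS[id]."""
    norm = math.sqrt(sum(v * v for v in vec))
    scaled = [v / norm * 0.5 for v in vec]
    return tuple(min(1.0, max(0.0, c + s)) for c, s in zip(color, scaled))

# The same vector always yields the same jittered color, so a track keeps
# its color from frame to frame (values chosen to be exact in binary).
color_a = jitter((0.25, 0.5, 0.75), (1.0, 0.0, 0.0))  # -> (0.75, 0.5, 0.75)
color_b = jitter((0.75, 0.5, 0.25), (1.0, 0.0, 0.0))  # clipped -> (1.0, 0.5, 0.25)
```

Determinism is the point: `random_color` would reshuffle instance colors on every frame, while an id-indexed jitter makes the same object recognizable throughout the clip.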
  {
    "path": "mask2former/__init__.py",
    "content": "from . import data  # register all new datasets\nfrom . import modeling\n\n# config\nfrom .config import add_maskformer2_config\n\n# dataset loading\nfrom .data.dataset_mappers.coco_instance_new_baseline_dataset_mapper import COCOInstanceNewBaselineDatasetMapper\nfrom .data.dataset_mappers.coco_panoptic_new_baseline_dataset_mapper import COCOPanopticNewBaselineDatasetMapper\nfrom .data.dataset_mappers.mask_former_instance_dataset_mapper import (\n    MaskFormerInstanceDatasetMapper,\n)\nfrom .data.dataset_mappers.mask_former_panoptic_dataset_mapper import (\n    MaskFormerPanopticDatasetMapper,\n)\nfrom .data.dataset_mappers.mask_former_semantic_dataset_mapper import (\n    MaskFormerSemanticDatasetMapper,\n)\n\n# models\nfrom .maskformer_model import MaskFormer\nfrom .test_time_augmentation import SemanticSegmentorWithTTA\n\n# evaluation\nfrom .evaluation.instance_evaluation import InstanceSegEvaluator\n"
  },
  {
    "path": "mask2former/config.py",
    "content": "# -*- coding: utf-8 -*-\nfrom detectron2.config import CfgNode as CN\n\n\ndef add_maskformer2_config(cfg):\n    \"\"\"\n    Add config for MASK_FORMER.\n    \"\"\"\n    # NOTE: configs from original maskformer\n    # data config\n    # select the dataset mapper\n    cfg.INPUT.DATASET_MAPPER_NAME = \"mask_former_semantic\"\n    # Color augmentation\n    cfg.INPUT.COLOR_AUG_SSD = False\n    # We retry random cropping until no single category in semantic segmentation GT occupies more\n    # than `SINGLE_CATEGORY_MAX_AREA` part of the crop.\n    cfg.INPUT.CROP.SINGLE_CATEGORY_MAX_AREA = 1.0\n    # Pad image and segmentation GT in dataset mapper.\n    cfg.INPUT.SIZE_DIVISIBILITY = -1\n\n    # solver config\n    # weight decay on embedding\n    cfg.SOLVER.WEIGHT_DECAY_EMBED = 0.0\n    # optimizer\n    cfg.SOLVER.OPTIMIZER = \"ADAMW\"\n    cfg.SOLVER.BACKBONE_MULTIPLIER = 0.1\n\n    # mask_former model config\n    cfg.MODEL.MASK_FORMER = CN()\n\n    # loss\n    cfg.MODEL.MASK_FORMER.DEEP_SUPERVISION = True\n    cfg.MODEL.MASK_FORMER.NO_OBJECT_WEIGHT = 0.1\n    cfg.MODEL.MASK_FORMER.CLASS_WEIGHT = 1.0\n    cfg.MODEL.MASK_FORMER.DICE_WEIGHT = 1.0\n    cfg.MODEL.MASK_FORMER.MASK_WEIGHT = 20.0\n\n    # transformer config\n    cfg.MODEL.MASK_FORMER.NHEADS = 8\n    cfg.MODEL.MASK_FORMER.DROPOUT = 0.1\n    cfg.MODEL.MASK_FORMER.DIM_FEEDFORWARD = 2048\n    cfg.MODEL.MASK_FORMER.ENC_LAYERS = 0\n    cfg.MODEL.MASK_FORMER.DEC_LAYERS = 6\n    cfg.MODEL.MASK_FORMER.PRE_NORM = False\n\n    cfg.MODEL.MASK_FORMER.HIDDEN_DIM = 256\n    cfg.MODEL.MASK_FORMER.NUM_OBJECT_QUERIES = 100\n\n    cfg.MODEL.MASK_FORMER.TRANSFORMER_IN_FEATURE = \"res5\"\n    cfg.MODEL.MASK_FORMER.ENFORCE_INPUT_PROJ = False\n\n    # mask_former inference config\n    cfg.MODEL.MASK_FORMER.TEST = CN()\n    cfg.MODEL.MASK_FORMER.TEST.SEMANTIC_ON = True\n    cfg.MODEL.MASK_FORMER.TEST.INSTANCE_ON = False\n    cfg.MODEL.MASK_FORMER.TEST.PANOPTIC_ON = False\n    
cfg.MODEL.MASK_FORMER.TEST.OBJECT_MASK_THRESHOLD = 0.0\n    cfg.MODEL.MASK_FORMER.TEST.OVERLAP_THRESHOLD = 0.0\n    cfg.MODEL.MASK_FORMER.TEST.SEM_SEG_POSTPROCESSING_BEFORE_INFERENCE = False\n\n    # Sometimes `backbone.size_divisibility` is set to 0 for some backbone (e.g. ResNet)\n    # you can use this config to override\n    cfg.MODEL.MASK_FORMER.SIZE_DIVISIBILITY = 32\n\n    # pixel decoder config\n    cfg.MODEL.SEM_SEG_HEAD.MASK_DIM = 256\n    # adding transformer in pixel decoder\n    cfg.MODEL.SEM_SEG_HEAD.TRANSFORMER_ENC_LAYERS = 0\n    # pixel decoder\n    cfg.MODEL.SEM_SEG_HEAD.PIXEL_DECODER_NAME = \"BasePixelDecoder\"\n\n    # swin transformer backbone\n    cfg.MODEL.SWIN = CN()\n    cfg.MODEL.SWIN.PRETRAIN_IMG_SIZE = 224\n    cfg.MODEL.SWIN.PATCH_SIZE = 4\n    cfg.MODEL.SWIN.EMBED_DIM = 96\n    cfg.MODEL.SWIN.DEPTHS = [2, 2, 6, 2]\n    cfg.MODEL.SWIN.NUM_HEADS = [3, 6, 12, 24]\n    cfg.MODEL.SWIN.WINDOW_SIZE = 7\n    cfg.MODEL.SWIN.MLP_RATIO = 4.0\n    cfg.MODEL.SWIN.QKV_BIAS = True\n    cfg.MODEL.SWIN.QK_SCALE = None\n    cfg.MODEL.SWIN.DROP_RATE = 0.0\n    cfg.MODEL.SWIN.ATTN_DROP_RATE = 0.0\n    cfg.MODEL.SWIN.DROP_PATH_RATE = 0.3\n    cfg.MODEL.SWIN.APE = False\n    cfg.MODEL.SWIN.PATCH_NORM = True\n    cfg.MODEL.SWIN.OUT_FEATURES = [\"res2\", \"res3\", \"res4\", \"res5\"]\n    cfg.MODEL.SWIN.USE_CHECKPOINT = False\n\n    # NOTE: maskformer2 extra configs\n    # transformer module\n    cfg.MODEL.MASK_FORMER.TRANSFORMER_DECODER_NAME = \"MultiScaleMaskedTransformerDecoder\"\n\n    # LSJ aug\n    cfg.INPUT.IMAGE_SIZE = 1024\n    cfg.INPUT.MIN_SCALE = 0.1\n    cfg.INPUT.MAX_SCALE = 2.0\n\n    # MSDeformAttn encoder configs\n    cfg.MODEL.SEM_SEG_HEAD.DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES = [\"res3\", \"res4\", \"res5\"]\n    cfg.MODEL.SEM_SEG_HEAD.DEFORMABLE_TRANSFORMER_ENCODER_N_POINTS = 4\n    cfg.MODEL.SEM_SEG_HEAD.DEFORMABLE_TRANSFORMER_ENCODER_N_HEADS = 8\n\n    # point loss configs\n    # Number of points sampled during training for a mask 
point head.\n    cfg.MODEL.MASK_FORMER.TRAIN_NUM_POINTS = 112 * 112\n    # Oversampling parameter for PointRend point sampling during training. Parameter `k` in the\n    # original paper.\n    cfg.MODEL.MASK_FORMER.OVERSAMPLE_RATIO = 3.0\n    # Importance sampling parameter for PointRend point sampling during training. Parameter `beta` in\n    # the original paper.\n    cfg.MODEL.MASK_FORMER.IMPORTANCE_SAMPLE_RATIO = 0.75\n"
  },
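`add_maskformer2_config` follows detectron2's convention of mutating a shared config in place: new sub-nodes are attached and given defaults before any YAML overrides are merged. A sketch of that pattern with `SimpleNamespace` standing in for detectron2's `CfgNode` (the `add_maskformer2_defaults` name is illustrative):

```python
from types import SimpleNamespace as NS

def add_maskformer2_defaults(cfg):
    # attach a new sub-node, then populate its defaults -- the same
    # order add_maskformer2_config uses with a real CfgNode
    cfg.MODEL.MASK_FORMER = NS()
    cfg.MODEL.MASK_FORMER.MASK_WEIGHT = 20.0
    cfg.MODEL.MASK_FORMER.NUM_OBJECT_QUERIES = 100
    # 112 * 112 points are sampled per mask for the point-based loss
    cfg.MODEL.MASK_FORMER.TRAIN_NUM_POINTS = 112 * 112

cfg = NS(MODEL=NS())
add_maskformer2_defaults(cfg)
print(cfg.MODEL.MASK_FORMER.TRAIN_NUM_POINTS)  # -> 12544
```

Registering every key with a default up front is what lets yacs reject typos in YAML files: merging a config that sets an unknown key fails loudly instead of silently adding it.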
  {
    "path": "mask2former/data/__init__.py",
    "content": "from . import datasets\n"
  },
  {
    "path": "mask2former/data/dataset_mappers/__init__.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n"
  },
  {
    "path": "mask2former/data/dataset_mappers/coco_instance_new_baseline_dataset_mapper.py",
    "content": "# Modified by Bowen Cheng from https://github.com/facebookresearch/detr/blob/master/d2/detr/dataset_mapper.py\nimport copy\nimport logging\n\nimport numpy as np\nimport torch\n\nfrom detectron2.config import configurable\nfrom detectron2.data import detection_utils as utils\nfrom detectron2.data import transforms as T\nfrom detectron2.data.transforms import TransformGen\nfrom detectron2.structures import BitMasks, Instances\n\nfrom pycocotools import mask as coco_mask\n\n__all__ = [\"COCOInstanceNewBaselineDatasetMapper\"]\n\ndef masks_to_boxes(masks: torch.Tensor) -> torch.Tensor:\n    \"\"\"\n    Fill each mask to its tight bounding box, in place.\n\n    Despite the name, this does not return [N, 4] boxes: every instance mask is\n    replaced by a box-shaped mask covering its bounding rectangle. This is how\n    the mapper converts mask annotations into box supervision.\n\n    Args:\n        masks (Tensor[N, H, W]): masks to transform where N is the number of masks\n            and (H, W) are the spatial dimensions.\n\n    Returns:\n        Tensor[N, H, W]: box-shaped masks\n    \"\"\"\n    if masks.numel() == 0:\n        return masks\n\n    for index, mask in enumerate(masks):\n        y, x = torch.where(mask != 0)\n        if len(x) * len(y) == 0:\n            # empty mask: nothing to fill\n            continue\n        # slice ends are exclusive, so add 1 to include the max row/column\n        masks[index, torch.min(y):torch.max(y) + 1, torch.min(x):torch.max(x) + 1] = 1.0\n\n    return masks\n\ndef convert_coco_poly_to_mask(segmentations, height, width):\n    masks = []\n    for polygons in segmentations:\n        rles = coco_mask.frPyObjects(polygons, height, width)\n        mask = coco_mask.decode(rles)\n        if len(mask.shape) < 3:\n            mask = mask[..., None]\n        mask = torch.as_tensor(mask, dtype=torch.uint8)\n        mask = mask.any(dim=2)\n        masks.append(mask)\n    if masks:\n        masks = torch.stack(masks, dim=0)\n        masks = masks_to_boxes(masks)\n    else:\n        masks = torch.zeros((0, 
height, width), dtype=torch.uint8)\n\n    return masks\n\n\ndef build_transform_gen(cfg, is_train):\n    \"\"\"\n    Create a list of default :class:`Augmentation` from config.\n    Now it includes resizing and flipping.\n    Returns:\n        list[Augmentation]\n    \"\"\"\n    assert is_train, \"Only support training augmentation\"\n    image_size = cfg.INPUT.IMAGE_SIZE\n    min_scale = cfg.INPUT.MIN_SCALE\n    max_scale = cfg.INPUT.MAX_SCALE\n\n    augmentation = []\n\n    if cfg.INPUT.RANDOM_FLIP != \"none\":\n        augmentation.append(\n            T.RandomFlip(\n                horizontal=cfg.INPUT.RANDOM_FLIP == \"horizontal\",\n                vertical=cfg.INPUT.RANDOM_FLIP == \"vertical\",\n            )\n        )\n\n    augmentation.extend([\n        T.ResizeScale(\n            min_scale=min_scale, max_scale=max_scale, target_height=image_size, target_width=image_size\n        ),\n        T.FixedSizeCrop(crop_size=(image_size, image_size)),\n    ])\n\n    return augmentation\n\n\n# This is specifically designed for the COCO dataset.\nclass COCOInstanceNewBaselineDatasetMapper:\n    \"\"\"\n    A callable which takes a dataset dict in Detectron2 Dataset format,\n    and maps it into a format used by MaskFormer.\n\n    This dataset mapper applies the same transformation as DETR for COCO panoptic segmentation.\n\n    The callable currently does the following:\n\n    1. Reads the image from \"file_name\"\n    2. Applies geometric transforms to the image and annotation\n    3. Finds and applies suitable cropping to the image and annotation\n    4. 
Prepares image and annotations as Tensors\n    \"\"\"\n\n    @configurable\n    def __init__(\n        self,\n        is_train=True,\n        *,\n        tfm_gens,\n        image_format,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            is_train: for training or inference\n            tfm_gens: a list of :class:`TransformGen` used for data augmentation\n            image_format: an image format supported by :func:`detection_utils.read_image`.\n        \"\"\"\n        self.tfm_gens = tfm_gens\n        logging.getLogger(__name__).info(\n            \"[COCOInstanceNewBaselineDatasetMapper] Full TransformGens used in training: {}\".format(str(self.tfm_gens))\n        )\n\n        self.img_format = image_format\n        self.is_train = is_train\n\n    @classmethod\n    def from_config(cls, cfg, is_train=True):\n        # Build augmentation\n        tfm_gens = build_transform_gen(cfg, is_train)\n\n        ret = {\n            \"is_train\": is_train,\n            \"tfm_gens\": tfm_gens,\n            \"image_format\": cfg.INPUT.FORMAT,\n        }\n        return ret\n\n    def __call__(self, dataset_dict):\n        \"\"\"\n        Args:\n            dataset_dict (dict): Metadata of one image, in Detectron2 Dataset format.\n\n        Returns:\n            dict: a format that builtin models in detectron2 accept\n        \"\"\"\n        dataset_dict = copy.deepcopy(dataset_dict)  # it will be modified by code below\n        image = utils.read_image(dataset_dict[\"file_name\"], format=self.img_format)\n        utils.check_image_size(dataset_dict, image)\n\n        # TODO: get padding mask\n        # by feeding a \"segmentation mask\" to the same transforms\n        padding_mask = np.ones(image.shape[:2])\n\n        image, transforms = T.apply_transform_gens(self.tfm_gens, image)\n        # the crop transformation has default padding value 0 for segmentation\n        padding_mask 
= transforms.apply_segmentation(padding_mask)\n        padding_mask = ~ padding_mask.astype(bool)\n\n        image_shape = image.shape[:2]  # h, w\n\n        # Pytorch's dataloader is efficient on torch.Tensor due to shared-memory,\n        # but not efficient on large generic data structures due to the use of pickle & mp.Queue.\n        # Therefore it's important to use torch.Tensor.\n        dataset_dict[\"image\"] = torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1)))\n        dataset_dict[\"padding_mask\"] = torch.as_tensor(np.ascontiguousarray(padding_mask))\n\n        if not self.is_train:\n            # USER: Modify this if you want to keep them for some reason.\n            dataset_dict.pop(\"annotations\", None)\n            return dataset_dict\n\n        if \"annotations\" in dataset_dict:\n            # USER: Modify this if you want to keep them for some reason.\n            for anno in dataset_dict[\"annotations\"]:\n                # Let's always keep mask\n                # if not self.mask_on:\n                #     anno.pop(\"segmentation\", None)\n                anno.pop(\"keypoints\", None)\n\n            # USER: Implement additional transformations if you have other types of data\n            annos = [\n                utils.transform_instance_annotations(obj, transforms, image_shape)\n                for obj in dataset_dict.pop(\"annotations\")\n                if obj.get(\"iscrowd\", 0) == 0\n            ]\n            # NOTE: does not support BitMask due to augmentation\n            # Current BitMask cannot handle empty objects\n            instances = utils.annotations_to_instances(annos, image_shape)\n            # After transforms such as cropping are applied, the bounding box may no longer\n            # tightly bound the object. As an example, imagine a triangle object\n            # [(0,0), (2,0), (0,2)] cropped by a box [(1,0),(2,2)] (XYXY format). 
The tight\n            # bounding box of the cropped triangle should be [(1,0),(2,1)], which is not equal to\n            # the intersection of original bounding box and the cropping box.\n            instances.gt_boxes = instances.gt_masks.get_bounding_boxes()\n            # Need to filter empty instances first (due to augmentation)\n            instances = utils.filter_empty_instances(instances)\n            # Generate masks from polygon\n            h, w = instances.image_size\n            # image_size_xyxy = torch.as_tensor([w, h, w, h], dtype=torch.float)\n            if hasattr(instances, 'gt_masks'):\n                gt_masks = instances.gt_masks\n                gt_masks_box = convert_coco_poly_to_mask(gt_masks.polygons, h, w)\n                instances.gt_masks = gt_masks_box\n            dataset_dict[\"instances\"] = instances\n\n        return dataset_dict\n"
  },
  {
    "path": "mask2former/data/dataset_mappers/coco_panoptic_new_baseline_dataset_mapper.py",
    "content": "# Modified by Bowen Cheng from https://github.com/facebookresearch/detr/blob/master/d2/detr/dataset_mapper.py\nimport copy\nimport logging\n\nimport numpy as np\nimport torch\n\nfrom detectron2.config import configurable\nfrom detectron2.data import detection_utils as utils\nfrom detectron2.data import transforms as T\nfrom detectron2.data.transforms import TransformGen\nfrom detectron2.structures import BitMasks, Boxes, Instances\n\n__all__ = [\"COCOPanopticNewBaselineDatasetMapper\"]\n\n\ndef build_transform_gen(cfg, is_train):\n    \"\"\"\n    Create a list of default :class:`Augmentation` from config.\n    Now it includes resizing and flipping.\n    Returns:\n        list[Augmentation]\n    \"\"\"\n    assert is_train, \"Only support training augmentation\"\n    image_size = cfg.INPUT.IMAGE_SIZE\n    min_scale = cfg.INPUT.MIN_SCALE\n    max_scale = cfg.INPUT.MAX_SCALE\n\n    augmentation = []\n\n    if cfg.INPUT.RANDOM_FLIP != \"none\":\n        augmentation.append(\n            T.RandomFlip(\n                horizontal=cfg.INPUT.RANDOM_FLIP == \"horizontal\",\n                vertical=cfg.INPUT.RANDOM_FLIP == \"vertical\",\n            )\n        )\n\n    augmentation.extend([\n        T.ResizeScale(\n            min_scale=min_scale, max_scale=max_scale, target_height=image_size, target_width=image_size\n        ),\n        T.FixedSizeCrop(crop_size=(image_size, image_size)),\n    ])\n\n    return augmentation\n\n\n# This is specifically designed for the COCO dataset.\nclass COCOPanopticNewBaselineDatasetMapper:\n    \"\"\"\n    A callable which takes a dataset dict in Detectron2 Dataset format,\n    and map it into a format used by MaskFormer.\n\n    This dataset mapper applies the same transformation as DETR for COCO panoptic segmentation.\n\n    The callable currently does the following:\n\n    1. Read the image from \"file_name\"\n    2. Applies geometric transforms to the image and annotation\n    3. 
Find and applies suitable cropping to the image and annotation\n    4. Prepare image and annotation to Tensors\n    \"\"\"\n\n    @configurable\n    def __init__(\n        self,\n        is_train=True,\n        *,\n        tfm_gens,\n        image_format,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            is_train: for training or inference\n            tfm_gens: data augmentation\n            image_format: an image format supported by :func:`detection_utils.read_image`.\n        \"\"\"\n        self.tfm_gens = tfm_gens\n        logging.getLogger(__name__).info(\n            \"[COCOPanopticNewBaselineDatasetMapper] Full TransformGens used in training: {}\".format(\n                str(self.tfm_gens)\n            )\n        )\n\n        self.img_format = image_format\n        self.is_train = is_train\n\n    @classmethod\n    def from_config(cls, cfg, is_train=True):\n        # Build augmentation\n        tfm_gens = build_transform_gen(cfg, is_train)\n\n        ret = {\n            \"is_train\": is_train,\n            \"tfm_gens\": tfm_gens,\n            \"image_format\": cfg.INPUT.FORMAT,\n        }\n        return ret\n\n    def __call__(self, dataset_dict):\n        \"\"\"\n        Args:\n            dataset_dict (dict): Metadata of one image, in Detectron2 Dataset format.\n\n        Returns:\n            dict: a format that builtin models in detectron2 accept\n        \"\"\"\n        dataset_dict = copy.deepcopy(dataset_dict)  # it will be modified by code below\n        image = utils.read_image(dataset_dict[\"file_name\"], format=self.img_format)\n        utils.check_image_size(dataset_dict, image)\n\n        image, transforms = T.apply_transform_gens(self.tfm_gens, image)\n        image_shape = image.shape[:2]  # h, w\n\n        # Pytorch's dataloader is efficient on torch.Tensor due to 
shared-memory,\n        # but not efficient on large generic data structures due to the use of pickle & mp.Queue.\n        # Therefore it's important to use torch.Tensor.\n        dataset_dict[\"image\"] = torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1)))\n\n        if not self.is_train:\n            # USER: Modify this if you want to keep them for some reason.\n            dataset_dict.pop(\"annotations\", None)\n            return dataset_dict\n\n        if \"pan_seg_file_name\" in dataset_dict:\n            pan_seg_gt = utils.read_image(dataset_dict.pop(\"pan_seg_file_name\"), \"RGB\")\n            segments_info = dataset_dict[\"segments_info\"]\n\n            # apply the same transformation to panoptic segmentation\n            pan_seg_gt = transforms.apply_segmentation(pan_seg_gt)\n\n            from panopticapi.utils import rgb2id\n\n            pan_seg_gt = rgb2id(pan_seg_gt)\n\n            instances = Instances(image_shape)\n            classes = []\n            masks = []\n            for segment_info in segments_info:\n                class_id = segment_info[\"category_id\"]\n                if not segment_info[\"iscrowd\"]:\n                    classes.append(class_id)\n                    masks.append(pan_seg_gt == segment_info[\"id\"])\n\n            classes = np.array(classes)\n            instances.gt_classes = torch.tensor(classes, dtype=torch.int64)\n            if len(masks) == 0:\n                # Some image does not have annotation (all ignored)\n                instances.gt_masks = torch.zeros((0, pan_seg_gt.shape[-2], pan_seg_gt.shape[-1]))\n                instances.gt_boxes = Boxes(torch.zeros((0, 4)))\n            else:\n                masks = BitMasks(\n                    torch.stack([torch.from_numpy(np.ascontiguousarray(x.copy())) for x in masks])\n                )\n                instances.gt_masks = masks.tensor\n                instances.gt_boxes = masks.get_bounding_boxes()\n\n            
dataset_dict[\"instances\"] = instances\n\n        return dataset_dict\n"
  },
  {
    "path": "mask2former/data/dataset_mappers/mask_former_instance_dataset_mapper.py",
    "content": "import copy\nimport logging\n\nimport numpy as np\nimport pycocotools.mask as mask_util\nimport torch\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.data import detection_utils as utils\nfrom detectron2.data import transforms as T\nfrom detectron2.projects.point_rend import ColorAugSSDTransform\nfrom detectron2.structures import BitMasks, Instances, polygons_to_bitmask\n\n__all__ = [\"MaskFormerInstanceDatasetMapper\"]\n\n\nclass MaskFormerInstanceDatasetMapper:\n    \"\"\"\n    A callable which takes a dataset dict in Detectron2 Dataset format,\n    and map it into a format used by MaskFormer for instance segmentation.\n\n    The callable currently does the following:\n\n    1. Read the image from \"file_name\"\n    2. Applies geometric transforms to the image and annotation\n    3. Find and applies suitable cropping to the image and annotation\n    4. Prepare image and annotation to Tensors\n    \"\"\"\n\n    @configurable\n    def __init__(\n        self,\n        is_train=True,\n        *,\n        augmentations,\n        image_format,\n        size_divisibility,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            is_train: for training or inference\n            augmentations: a list of augmentations or deterministic transforms to apply\n            image_format: an image format supported by :func:`detection_utils.read_image`.\n            size_divisibility: pad image size to be divisible by this value\n        \"\"\"\n        self.is_train = is_train\n        self.tfm_gens = augmentations\n        self.img_format = image_format\n        self.size_divisibility = size_divisibility\n\n        logger = logging.getLogger(__name__)\n        mode = \"training\" if is_train else \"inference\"\n        logger.info(f\"[{self.__class__.__name__}] Augmentations used in {mode}: {augmentations}\")\n\n    @classmethod\n    def from_config(cls, cfg, 
is_train=True):\n        # Build augmentation\n        augs = [\n            T.ResizeShortestEdge(\n                cfg.INPUT.MIN_SIZE_TRAIN,\n                cfg.INPUT.MAX_SIZE_TRAIN,\n                cfg.INPUT.MIN_SIZE_TRAIN_SAMPLING,\n            )\n        ]\n        if cfg.INPUT.CROP.ENABLED:\n            augs.append(\n                T.RandomCrop(\n                    cfg.INPUT.CROP.TYPE,\n                    cfg.INPUT.CROP.SIZE,\n                )\n            )\n        if cfg.INPUT.COLOR_AUG_SSD:\n            augs.append(ColorAugSSDTransform(img_format=cfg.INPUT.FORMAT))\n        augs.append(T.RandomFlip())\n\n        ret = {\n            \"is_train\": is_train,\n            \"augmentations\": augs,\n            \"image_format\": cfg.INPUT.FORMAT,\n            \"size_divisibility\": cfg.INPUT.SIZE_DIVISIBILITY,\n        }\n        return ret\n\n    def __call__(self, dataset_dict):\n        \"\"\"\n        Args:\n            dataset_dict (dict): Metadata of one image, in Detectron2 Dataset format.\n\n        Returns:\n            dict: a format that builtin models in detectron2 accept\n        \"\"\"\n        assert self.is_train, \"MaskFormerInstanceDatasetMapper should only be used for training!\"\n\n        dataset_dict = copy.deepcopy(dataset_dict)  # it will be modified by code below\n        image = utils.read_image(dataset_dict[\"file_name\"], format=self.img_format)\n        utils.check_image_size(dataset_dict, image)\n\n        aug_input = T.AugInput(image)\n        aug_input, transforms = T.apply_transform_gens(self.tfm_gens, aug_input)\n        image = aug_input.image\n\n        # transform instance masks\n        assert \"annotations\" in dataset_dict\n        for anno in dataset_dict[\"annotations\"]:\n            anno.pop(\"keypoints\", None)\n\n        annos = [\n            utils.transform_instance_annotations(obj, transforms, image.shape[:2])\n            for obj in dataset_dict.pop(\"annotations\")\n            if obj.get(\"iscrowd\", 0) 
== 0\n        ]\n\n        if len(annos):\n            assert \"segmentation\" in annos[0]\n        segms = [obj[\"segmentation\"] for obj in annos]\n        masks = []\n        for segm in segms:\n            if isinstance(segm, list):\n                # polygon\n                masks.append(polygons_to_bitmask(segm, *image.shape[:2]))\n            elif isinstance(segm, dict):\n                # COCO RLE\n                masks.append(mask_util.decode(segm))\n            elif isinstance(segm, np.ndarray):\n                assert segm.ndim == 2, \"Expect segmentation of 2 dimensions, got {}.\".format(\n                    segm.ndim\n                )\n                # mask array\n                masks.append(segm)\n            else:\n                raise ValueError(\n                    \"Cannot convert segmentation of type '{}' to BitMasks!\"\n                    \"Supported types are: polygons as list[list[float] or ndarray],\"\n                    \" COCO-style RLE as a dict, or a binary segmentation mask \"\n                    \" in a 2D numpy array of shape HxW.\".format(type(segm))\n                )\n\n        # Pad image and segmentation label here!\n        image = torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1)))\n        masks = [torch.from_numpy(np.ascontiguousarray(x)) for x in masks]\n\n        classes = [int(obj[\"category_id\"]) for obj in annos]\n        classes = torch.tensor(classes, dtype=torch.int64)\n\n        if self.size_divisibility > 0:\n            image_size = (image.shape[-2], image.shape[-1])\n            padding_size = [\n                0,\n                self.size_divisibility - image_size[1],\n                0,\n                self.size_divisibility - image_size[0],\n            ]\n            # pad image\n            image = F.pad(image, padding_size, value=128).contiguous()\n            # pad mask\n            masks = [F.pad(x, padding_size, value=0).contiguous() for x in masks]\n\n        image_shape = 
(image.shape[-2], image.shape[-1])  # h, w\n\n        # Pytorch's dataloader is efficient on torch.Tensor due to shared-memory,\n        # but not efficient on large generic data structures due to the use of pickle & mp.Queue.\n        # Therefore it's important to use torch.Tensor.\n        dataset_dict[\"image\"] = image\n\n        # Prepare per-category binary masks\n        instances = Instances(image_shape)\n        instances.gt_classes = classes\n        if len(masks) == 0:\n            # Some image does not have annotation (all ignored)\n            instances.gt_masks = torch.zeros((0, image.shape[-2], image.shape[-1]))\n        else:\n            masks = BitMasks(torch.stack(masks))\n            instances.gt_masks = masks.tensor\n\n        dataset_dict[\"instances\"] = instances\n\n        return dataset_dict\n"
  },
  {
    "path": "mask2former/data/dataset_mappers/mask_former_panoptic_dataset_mapper.py",
    "content": "import copy\nimport logging\n\nimport numpy as np\nimport torch\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.data import detection_utils as utils\nfrom detectron2.data import transforms as T\nfrom detectron2.structures import BitMasks, Instances\n\nfrom .mask_former_semantic_dataset_mapper import MaskFormerSemanticDatasetMapper\n\n__all__ = [\"MaskFormerPanopticDatasetMapper\"]\n\n\nclass MaskFormerPanopticDatasetMapper(MaskFormerSemanticDatasetMapper):\n    \"\"\"\n    A callable which takes a dataset dict in Detectron2 Dataset format,\n    and map it into a format used by MaskFormer for panoptic segmentation.\n\n    The callable currently does the following:\n\n    1. Read the image from \"file_name\"\n    2. Applies geometric transforms to the image and annotation\n    3. Find and applies suitable cropping to the image and annotation\n    4. Prepare image and annotation to Tensors\n    \"\"\"\n\n    @configurable\n    def __init__(\n        self,\n        is_train=True,\n        *,\n        augmentations,\n        image_format,\n        ignore_label,\n        size_divisibility,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            is_train: for training or inference\n            augmentations: a list of augmentations or deterministic transforms to apply\n            image_format: an image format supported by :func:`detection_utils.read_image`.\n            ignore_label: the label that is ignored during evaluation\n            size_divisibility: pad image size to be divisible by this value\n        \"\"\"\n        super().__init__(\n            is_train,\n            augmentations=augmentations,\n            image_format=image_format,\n            ignore_label=ignore_label,\n            size_divisibility=size_divisibility,\n        )\n\n    def __call__(self, dataset_dict):\n        \"\"\"\n        Args:\n            dataset_dict (dict): Metadata of 
one image, in Detectron2 Dataset format.\n\n        Returns:\n            dict: a format that builtin models in detectron2 accept\n        \"\"\"\n        assert self.is_train, \"MaskFormerPanopticDatasetMapper should only be used for training!\"\n\n        dataset_dict = copy.deepcopy(dataset_dict)  # it will be modified by code below\n        image = utils.read_image(dataset_dict[\"file_name\"], format=self.img_format)\n        utils.check_image_size(dataset_dict, image)\n\n        # semantic segmentation\n        if \"sem_seg_file_name\" in dataset_dict:\n            # PyTorch transformation not implemented for uint16, so converting it to double first\n            sem_seg_gt = utils.read_image(dataset_dict.pop(\"sem_seg_file_name\")).astype(\"double\")\n        else:\n            sem_seg_gt = None\n\n        # panoptic segmentation\n        if \"pan_seg_file_name\" in dataset_dict:\n            pan_seg_gt = utils.read_image(dataset_dict.pop(\"pan_seg_file_name\"), \"RGB\")\n            segments_info = dataset_dict[\"segments_info\"]\n        else:\n            pan_seg_gt = None\n            segments_info = None\n\n        if pan_seg_gt is None:\n            raise ValueError(\n                \"Cannot find 'pan_seg_file_name' for panoptic segmentation dataset {}.\".format(\n                    dataset_dict[\"file_name\"]\n                )\n            )\n\n        aug_input = T.AugInput(image, sem_seg=sem_seg_gt)\n        aug_input, transforms = T.apply_transform_gens(self.tfm_gens, aug_input)\n        image = aug_input.image\n        if sem_seg_gt is not None:\n            sem_seg_gt = aug_input.sem_seg\n\n        # apply the same transformation to panoptic segmentation\n        pan_seg_gt = transforms.apply_segmentation(pan_seg_gt)\n\n        from panopticapi.utils import rgb2id\n\n        pan_seg_gt = rgb2id(pan_seg_gt)\n\n        # Pad image and segmentation label here!\n        image = torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1)))\n        
if sem_seg_gt is not None:\n            sem_seg_gt = torch.as_tensor(sem_seg_gt.astype(\"long\"))\n        pan_seg_gt = torch.as_tensor(pan_seg_gt.astype(\"long\"))\n\n        if self.size_divisibility > 0:\n            image_size = (image.shape[-2], image.shape[-1])\n            padding_size = [\n                0,\n                self.size_divisibility - image_size[1],\n                0,\n                self.size_divisibility - image_size[0],\n            ]\n            image = F.pad(image, padding_size, value=128).contiguous()\n            if sem_seg_gt is not None:\n                sem_seg_gt = F.pad(sem_seg_gt, padding_size, value=self.ignore_label).contiguous()\n            pan_seg_gt = F.pad(\n                pan_seg_gt, padding_size, value=0\n            ).contiguous()  # 0 is the VOID panoptic label\n\n        image_shape = (image.shape[-2], image.shape[-1])  # h, w\n\n        # Pytorch's dataloader is efficient on torch.Tensor due to shared-memory,\n        # but not efficient on large generic data structures due to the use of pickle & mp.Queue.\n        # Therefore it's important to use torch.Tensor.\n        dataset_dict[\"image\"] = image\n        if sem_seg_gt is not None:\n            dataset_dict[\"sem_seg\"] = sem_seg_gt.long()\n\n        if \"annotations\" in dataset_dict:\n            raise ValueError(\"Panoptic segmentation dataset should not have 'annotations'.\")\n\n        # Prepare per-category binary masks\n        pan_seg_gt = pan_seg_gt.numpy()\n        instances = Instances(image_shape)\n        classes = []\n        masks = []\n        for segment_info in segments_info:\n            class_id = segment_info[\"category_id\"]\n            if not segment_info[\"iscrowd\"]:\n                classes.append(class_id)\n                masks.append(pan_seg_gt == segment_info[\"id\"])\n\n        classes = np.array(classes)\n        instances.gt_classes = torch.tensor(classes, dtype=torch.int64)\n        if len(masks) == 0:\n            # Some 
image does not have annotation (all ignored)\n            instances.gt_masks = torch.zeros((0, pan_seg_gt.shape[-2], pan_seg_gt.shape[-1]))\n        else:\n            masks = BitMasks(\n                torch.stack([torch.from_numpy(np.ascontiguousarray(x.copy())) for x in masks])\n            )\n            instances.gt_masks = masks.tensor\n\n        dataset_dict[\"instances\"] = instances\n\n        return dataset_dict\n"
  },
  {
    "path": "mask2former/data/dataset_mappers/mask_former_semantic_dataset_mapper.py",
    "content": "import copy\nimport logging\n\nimport numpy as np\nimport torch\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.data import MetadataCatalog\nfrom detectron2.data import detection_utils as utils\nfrom detectron2.data import transforms as T\nfrom detectron2.projects.point_rend import ColorAugSSDTransform\nfrom detectron2.structures import BitMasks, Instances\n\n__all__ = [\"MaskFormerSemanticDatasetMapper\"]\n\n\nclass MaskFormerSemanticDatasetMapper:\n    \"\"\"\n    A callable which takes a dataset dict in Detectron2 Dataset format,\n    and map it into a format used by MaskFormer for semantic segmentation.\n\n    The callable currently does the following:\n\n    1. Read the image from \"file_name\"\n    2. Applies geometric transforms to the image and annotation\n    3. Find and applies suitable cropping to the image and annotation\n    4. Prepare image and annotation to Tensors\n    \"\"\"\n\n    @configurable\n    def __init__(\n        self,\n        is_train=True,\n        *,\n        augmentations,\n        image_format,\n        ignore_label,\n        size_divisibility,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            is_train: for training or inference\n            augmentations: a list of augmentations or deterministic transforms to apply\n            image_format: an image format supported by :func:`detection_utils.read_image`.\n            ignore_label: the label that is ignored during evaluation\n            size_divisibility: pad image size to be divisible by this value\n        \"\"\"\n        self.is_train = is_train\n        self.tfm_gens = augmentations\n        self.img_format = image_format\n        self.ignore_label = ignore_label\n        self.size_divisibility = size_divisibility\n\n        logger = logging.getLogger(__name__)\n        mode = \"training\" if is_train else \"inference\"\n        logger.info(f\"[{self.__class__.__name__}] Augmentations used in {mode}: {augmentations}\")\n\n    @classmethod\n    def from_config(cls, cfg, is_train=True):\n        # Build augmentation\n        augs = [\n            T.ResizeShortestEdge(\n                cfg.INPUT.MIN_SIZE_TRAIN,\n                cfg.INPUT.MAX_SIZE_TRAIN,\n                cfg.INPUT.MIN_SIZE_TRAIN_SAMPLING,\n            )\n        ]\n        if cfg.INPUT.CROP.ENABLED:\n            augs.append(\n                T.RandomCrop_CategoryAreaConstraint(\n                    cfg.INPUT.CROP.TYPE,\n                    cfg.INPUT.CROP.SIZE,\n                    cfg.INPUT.CROP.SINGLE_CATEGORY_MAX_AREA,\n                    cfg.MODEL.SEM_SEG_HEAD.IGNORE_VALUE,\n                )\n            )\n        if cfg.INPUT.COLOR_AUG_SSD:\n            augs.append(ColorAugSSDTransform(img_format=cfg.INPUT.FORMAT))\n        augs.append(T.RandomFlip())\n\n        # Assume always applies to the training set.\n        dataset_names = cfg.DATASETS.TRAIN\n        meta = MetadataCatalog.get(dataset_names[0])\n        ignore_label = meta.ignore_label\n\n        ret = {\n            \"is_train\": is_train,\n            \"augmentations\": augs,\n            \"image_format\": cfg.INPUT.FORMAT,\n            \"ignore_label\": ignore_label,\n            \"size_divisibility\": cfg.INPUT.SIZE_DIVISIBILITY,\n        }\n        return ret\n\n    def __call__(self, dataset_dict):\n        \"\"\"\n        Args:\n            dataset_dict (dict): Metadata of one image, in Detectron2 Dataset format.\n\n        Returns:\n            dict: a format that builtin models in detectron2 accept\n        \"\"\"\n        assert self.is_train, \"MaskFormerSemanticDatasetMapper should only be used for training!\"\n\n        dataset_dict = copy.deepcopy(dataset_dict)  # it will be modified by code below\n        image = utils.read_image(dataset_dict[\"file_name\"], format=self.img_format)\n        utils.check_image_size(dataset_dict, image)\n\n        if 
\"sem_seg_file_name\" in dataset_dict:\n            # PyTorch transformation not implemented for uint16, so converting it to double first\n            sem_seg_gt = utils.read_image(dataset_dict.pop(\"sem_seg_file_name\")).astype(\"double\")\n        else:\n            sem_seg_gt = None\n\n        if sem_seg_gt is None:\n            raise ValueError(\n                \"Cannot find 'sem_seg_file_name' for semantic segmentation dataset {}.\".format(\n                    dataset_dict[\"file_name\"]\n                )\n            )\n\n        aug_input = T.AugInput(image, sem_seg=sem_seg_gt)\n        aug_input, transforms = T.apply_transform_gens(self.tfm_gens, aug_input)\n        image = aug_input.image\n        sem_seg_gt = aug_input.sem_seg\n\n        # Pad image and segmentation label here!\n        image = torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1)))\n        if sem_seg_gt is not None:\n            sem_seg_gt = torch.as_tensor(sem_seg_gt.astype(\"long\"))\n\n        if self.size_divisibility > 0:\n            image_size = (image.shape[-2], image.shape[-1])\n            padding_size = [\n                0,\n                self.size_divisibility - image_size[1],\n                0,\n                self.size_divisibility - image_size[0],\n            ]\n            image = F.pad(image, padding_size, value=128).contiguous()\n            if sem_seg_gt is not None:\n                sem_seg_gt = F.pad(sem_seg_gt, padding_size, value=self.ignore_label).contiguous()\n\n        image_shape = (image.shape[-2], image.shape[-1])  # h, w\n\n        # Pytorch's dataloader is efficient on torch.Tensor due to shared-memory,\n        # but not efficient on large generic data structures due to the use of pickle & mp.Queue.\n        # Therefore it's important to use torch.Tensor.\n        dataset_dict[\"image\"] = image\n\n        if sem_seg_gt is not None:\n            dataset_dict[\"sem_seg\"] = sem_seg_gt.long()\n\n        if \"annotations\" in dataset_dict:\n 
           raise ValueError(\"Semantic segmentation dataset should not have 'annotations'.\")\n\n        # Prepare per-category binary masks\n        if sem_seg_gt is not None:\n            sem_seg_gt = sem_seg_gt.numpy()\n            instances = Instances(image_shape)\n            classes = np.unique(sem_seg_gt)\n            # remove ignored region\n            classes = classes[classes != self.ignore_label]\n            instances.gt_classes = torch.tensor(classes, dtype=torch.int64)\n\n            masks = []\n            for class_id in classes:\n                masks.append(sem_seg_gt == class_id)\n\n            if len(masks) == 0:\n                # Some image does not have annotation (all ignored)\n                instances.gt_masks = torch.zeros((0, sem_seg_gt.shape[-2], sem_seg_gt.shape[-1]))\n            else:\n                masks = BitMasks(\n                    torch.stack([torch.from_numpy(np.ascontiguousarray(x.copy())) for x in masks])\n                )\n                instances.gt_masks = masks.tensor\n\n            dataset_dict[\"instances\"] = instances\n\n        return dataset_dict\n"
  },
  {
    "path": "mask2former/data/datasets/__init__.py",
    "content": "from . import (\n    register_ade20k_full,\n    register_ade20k_panoptic,\n    register_coco_stuff_10k,\n    register_mapillary_vistas,\n    register_coco_panoptic_annos_semseg,\n    register_ade20k_instance,\n    register_mapillary_vistas_panoptic,\n)\n"
  },
  {
    "path": "mask2former/data/datasets/register_ade20k_full.py",
    "content": "import os\n\nfrom detectron2.data import DatasetCatalog, MetadataCatalog\nfrom detectron2.data.datasets import load_sem_seg\n\nADE20K_SEM_SEG_FULL_CATEGORIES = [\n    {\"name\": \"wall\", \"id\": 2978, \"trainId\": 0},\n    {\"name\": \"building, edifice\", \"id\": 312, \"trainId\": 1},\n    {\"name\": \"sky\", \"id\": 2420, \"trainId\": 2},\n    {\"name\": \"tree\", \"id\": 2855, \"trainId\": 3},\n    {\"name\": \"road, route\", \"id\": 2131, \"trainId\": 4},\n    {\"name\": \"floor, flooring\", \"id\": 976, \"trainId\": 5},\n    {\"name\": \"ceiling\", \"id\": 447, \"trainId\": 6},\n    {\"name\": \"bed\", \"id\": 165, \"trainId\": 7},\n    {\"name\": \"sidewalk, pavement\", \"id\": 2377, \"trainId\": 8},\n    {\"name\": \"earth, ground\", \"id\": 838, \"trainId\": 9},\n    {\"name\": \"cabinet\", \"id\": 350, \"trainId\": 10},\n    {\"name\": \"person, individual, someone, somebody, mortal, soul\", \"id\": 1831, \"trainId\": 11},\n    {\"name\": \"grass\", \"id\": 1125, \"trainId\": 12},\n    {\"name\": \"windowpane, window\", \"id\": 3055, \"trainId\": 13},\n    {\"name\": \"car, auto, automobile, machine, motorcar\", \"id\": 401, \"trainId\": 14},\n    {\"name\": \"mountain, mount\", \"id\": 1610, \"trainId\": 15},\n    {\"name\": \"plant, flora, plant life\", \"id\": 1910, \"trainId\": 16},\n    {\"name\": \"table\", \"id\": 2684, \"trainId\": 17},\n    {\"name\": \"chair\", \"id\": 471, \"trainId\": 18},\n    {\"name\": \"curtain, drape, drapery, mantle, pall\", \"id\": 687, \"trainId\": 19},\n    {\"name\": \"door\", \"id\": 774, \"trainId\": 20},\n    {\"name\": \"sofa, couch, lounge\", \"id\": 2473, \"trainId\": 21},\n    {\"name\": \"sea\", \"id\": 2264, \"trainId\": 22},\n    {\"name\": \"painting, picture\", \"id\": 1735, \"trainId\": 23},\n    {\"name\": \"water\", \"id\": 2994, \"trainId\": 24},\n    {\"name\": \"mirror\", \"id\": 1564, \"trainId\": 25},\n    {\"name\": \"house\", \"id\": 1276, \"trainId\": 26},\n    {\"name\": \"rug, 
carpet, carpeting\", \"id\": 2178, \"trainId\": 27},\n    {\"name\": \"shelf\", \"id\": 2329, \"trainId\": 28},\n    {\"name\": \"armchair\", \"id\": 57, \"trainId\": 29},\n    {\"name\": \"fence, fencing\", \"id\": 907, \"trainId\": 30},\n    {\"name\": \"field\", \"id\": 913, \"trainId\": 31},\n    {\"name\": \"lamp\", \"id\": 1395, \"trainId\": 32},\n    {\"name\": \"rock, stone\", \"id\": 2138, \"trainId\": 33},\n    {\"name\": \"seat\", \"id\": 2272, \"trainId\": 34},\n    {\"name\": \"river\", \"id\": 2128, \"trainId\": 35},\n    {\"name\": \"desk\", \"id\": 724, \"trainId\": 36},\n    {\"name\": \"bathtub, bathing tub, bath, tub\", \"id\": 155, \"trainId\": 37},\n    {\"name\": \"railing, rail\", \"id\": 2053, \"trainId\": 38},\n    {\"name\": \"signboard, sign\", \"id\": 2380, \"trainId\": 39},\n    {\"name\": \"cushion\", \"id\": 689, \"trainId\": 40},\n    {\"name\": \"path\", \"id\": 1788, \"trainId\": 41},\n    {\"name\": \"work surface\", \"id\": 3087, \"trainId\": 42},\n    {\"name\": \"stairs, steps\", \"id\": 2530, \"trainId\": 43},\n    {\"name\": \"column, pillar\", \"id\": 581, \"trainId\": 44},\n    {\"name\": \"sink\", \"id\": 2388, \"trainId\": 45},\n    {\"name\": \"wardrobe, closet, press\", \"id\": 2985, \"trainId\": 46},\n    {\"name\": \"snow\", \"id\": 2454, \"trainId\": 47},\n    {\"name\": \"refrigerator, icebox\", \"id\": 2096, \"trainId\": 48},\n    {\"name\": \"base, pedestal, stand\", \"id\": 137, \"trainId\": 49},\n    {\"name\": \"bridge, span\", \"id\": 294, \"trainId\": 50},\n    {\"name\": \"blind, screen\", \"id\": 212, \"trainId\": 51},\n    {\"name\": \"runway\", \"id\": 2185, \"trainId\": 52},\n    {\"name\": \"cliff, drop, drop-off\", \"id\": 524, \"trainId\": 53},\n    {\"name\": \"sand\", \"id\": 2212, \"trainId\": 54},\n    {\"name\": \"fireplace, hearth, open fireplace\", \"id\": 943, \"trainId\": 55},\n    {\"name\": \"pillow\", \"id\": 1869, \"trainId\": 56},\n    {\"name\": \"screen door, screen\", \"id\": 2251, 
\"trainId\": 57},\n    {\"name\": \"toilet, can, commode, crapper, pot, potty, stool, throne\", \"id\": 2793, \"trainId\": 58},\n    {\"name\": \"skyscraper\", \"id\": 2423, \"trainId\": 59},\n    {\"name\": \"grandstand, covered stand\", \"id\": 1121, \"trainId\": 60},\n    {\"name\": \"box\", \"id\": 266, \"trainId\": 61},\n    {\"name\": \"pool table, billiard table, snooker table\", \"id\": 1948, \"trainId\": 62},\n    {\"name\": \"palm, palm tree\", \"id\": 1744, \"trainId\": 63},\n    {\"name\": \"double door\", \"id\": 783, \"trainId\": 64},\n    {\"name\": \"coffee table, cocktail table\", \"id\": 571, \"trainId\": 65},\n    {\"name\": \"counter\", \"id\": 627, \"trainId\": 66},\n    {\"name\": \"countertop\", \"id\": 629, \"trainId\": 67},\n    {\"name\": \"chest of drawers, chest, bureau, dresser\", \"id\": 491, \"trainId\": 68},\n    {\"name\": \"kitchen island\", \"id\": 1374, \"trainId\": 69},\n    {\"name\": \"boat\", \"id\": 223, \"trainId\": 70},\n    {\"name\": \"waterfall, falls\", \"id\": 3016, \"trainId\": 71},\n    {\n        \"name\": \"stove, kitchen stove, range, kitchen range, cooking stove\",\n        \"id\": 2598,\n        \"trainId\": 72,\n    },\n    {\"name\": \"flower\", \"id\": 978, \"trainId\": 73},\n    {\"name\": \"bookcase\", \"id\": 239, \"trainId\": 74},\n    {\"name\": \"controls\", \"id\": 608, \"trainId\": 75},\n    {\"name\": \"book\", \"id\": 236, \"trainId\": 76},\n    {\"name\": \"stairway, staircase\", \"id\": 2531, \"trainId\": 77},\n    {\"name\": \"streetlight, street lamp\", \"id\": 2616, \"trainId\": 78},\n    {\n        \"name\": \"computer, computing machine, computing device, data processor, electronic computer, information processing system\",\n        \"id\": 591,\n        \"trainId\": 79,\n    },\n    {\n        \"name\": \"bus, autobus, coach, charabanc, double-decker, jitney, motorbus, motorcoach, omnibus, passenger vehicle\",\n        \"id\": 327,\n        \"trainId\": 80,\n    },\n    {\"name\": \"swivel 
chair\", \"id\": 2679, \"trainId\": 81},\n    {\"name\": \"light, light source\", \"id\": 1451, \"trainId\": 82},\n    {\"name\": \"bench\", \"id\": 181, \"trainId\": 83},\n    {\"name\": \"case, display case, showcase, vitrine\", \"id\": 420, \"trainId\": 84},\n    {\"name\": \"towel\", \"id\": 2821, \"trainId\": 85},\n    {\"name\": \"fountain\", \"id\": 1023, \"trainId\": 86},\n    {\"name\": \"embankment\", \"id\": 855, \"trainId\": 87},\n    {\n        \"name\": \"television receiver, television, television set, tv, tv set, idiot box, boob tube, telly, goggle box\",\n        \"id\": 2733,\n        \"trainId\": 88,\n    },\n    {\"name\": \"van\", \"id\": 2928, \"trainId\": 89},\n    {\"name\": \"hill\", \"id\": 1240, \"trainId\": 90},\n    {\"name\": \"awning, sunshade, sunblind\", \"id\": 77, \"trainId\": 91},\n    {\"name\": \"poster, posting, placard, notice, bill, card\", \"id\": 1969, \"trainId\": 92},\n    {\"name\": \"truck, motortruck\", \"id\": 2880, \"trainId\": 93},\n    {\"name\": \"airplane, aeroplane, plane\", \"id\": 14, \"trainId\": 94},\n    {\"name\": \"pole\", \"id\": 1936, \"trainId\": 95},\n    {\"name\": \"tower\", \"id\": 2828, \"trainId\": 96},\n    {\"name\": \"court\", \"id\": 631, \"trainId\": 97},\n    {\"name\": \"ball\", \"id\": 103, \"trainId\": 98},\n    {\n        \"name\": \"aircraft carrier, carrier, flattop, attack aircraft carrier\",\n        \"id\": 3144,\n        \"trainId\": 99,\n    },\n    {\"name\": \"buffet, counter, sideboard\", \"id\": 308, \"trainId\": 100},\n    {\"name\": \"hovel, hut, hutch, shack, shanty\", \"id\": 1282, \"trainId\": 101},\n    {\"name\": \"apparel, wearing apparel, dress, clothes\", \"id\": 38, \"trainId\": 102},\n    {\"name\": \"minibike, motorbike\", \"id\": 1563, \"trainId\": 103},\n    {\"name\": \"animal, animate being, beast, brute, creature, fauna\", \"id\": 29, \"trainId\": 104},\n    {\"name\": \"chandelier, pendant, pendent\", \"id\": 480, \"trainId\": 105},\n    {\"name\": \"step, 
stair\", \"id\": 2569, \"trainId\": 106},\n    {\"name\": \"booth, cubicle, stall, kiosk\", \"id\": 247, \"trainId\": 107},\n    {\"name\": \"bicycle, bike, wheel, cycle\", \"id\": 187, \"trainId\": 108},\n    {\"name\": \"doorframe, doorcase\", \"id\": 778, \"trainId\": 109},\n    {\"name\": \"sconce\", \"id\": 2243, \"trainId\": 110},\n    {\"name\": \"pond\", \"id\": 1941, \"trainId\": 111},\n    {\"name\": \"trade name, brand name, brand, marque\", \"id\": 2833, \"trainId\": 112},\n    {\"name\": \"bannister, banister, balustrade, balusters, handrail\", \"id\": 120, \"trainId\": 113},\n    {\"name\": \"bag\", \"id\": 95, \"trainId\": 114},\n    {\"name\": \"traffic light, traffic signal, stoplight\", \"id\": 2836, \"trainId\": 115},\n    {\"name\": \"gazebo\", \"id\": 1087, \"trainId\": 116},\n    {\"name\": \"escalator, moving staircase, moving stairway\", \"id\": 868, \"trainId\": 117},\n    {\"name\": \"land, ground, soil\", \"id\": 1401, \"trainId\": 118},\n    {\"name\": \"board, plank\", \"id\": 220, \"trainId\": 119},\n    {\"name\": \"arcade machine\", \"id\": 47, \"trainId\": 120},\n    {\"name\": \"eiderdown, duvet, continental quilt\", \"id\": 843, \"trainId\": 121},\n    {\"name\": \"bar\", \"id\": 123, \"trainId\": 122},\n    {\"name\": \"stall, stand, sales booth\", \"id\": 2537, \"trainId\": 123},\n    {\"name\": \"playground\", \"id\": 1927, \"trainId\": 124},\n    {\"name\": \"ship\", \"id\": 2337, \"trainId\": 125},\n    {\"name\": \"ottoman, pouf, pouffe, puff, hassock\", \"id\": 1702, \"trainId\": 126},\n    {\n        \"name\": \"ashcan, trash can, garbage can, wastebin, ash bin, ash-bin, ashbin, dustbin, trash barrel, trash bin\",\n        \"id\": 64,\n        \"trainId\": 127,\n    },\n    {\"name\": \"bottle\", \"id\": 249, \"trainId\": 128},\n    {\"name\": \"cradle\", \"id\": 642, \"trainId\": 129},\n    {\"name\": \"pot, flowerpot\", \"id\": 1981, \"trainId\": 130},\n    {\n        \"name\": \"conveyer belt, conveyor belt, conveyer, 
conveyor, transporter\",\n        \"id\": 609,\n        \"trainId\": 131,\n    },\n    {\"name\": \"train, railroad train\", \"id\": 2840, \"trainId\": 132},\n    {\"name\": \"stool\", \"id\": 2586, \"trainId\": 133},\n    {\"name\": \"lake\", \"id\": 1393, \"trainId\": 134},\n    {\"name\": \"tank, storage tank\", \"id\": 2704, \"trainId\": 135},\n    {\"name\": \"ice, water ice\", \"id\": 1304, \"trainId\": 136},\n    {\"name\": \"basket, handbasket\", \"id\": 146, \"trainId\": 137},\n    {\"name\": \"manhole\", \"id\": 1494, \"trainId\": 138},\n    {\"name\": \"tent, collapsible shelter\", \"id\": 2739, \"trainId\": 139},\n    {\"name\": \"canopy\", \"id\": 389, \"trainId\": 140},\n    {\"name\": \"microwave, microwave oven\", \"id\": 1551, \"trainId\": 141},\n    {\"name\": \"barrel, cask\", \"id\": 131, \"trainId\": 142},\n    {\"name\": \"dirt track\", \"id\": 738, \"trainId\": 143},\n    {\"name\": \"beam\", \"id\": 161, \"trainId\": 144},\n    {\"name\": \"dishwasher, dish washer, dishwashing machine\", \"id\": 747, \"trainId\": 145},\n    {\"name\": \"plate\", \"id\": 1919, \"trainId\": 146},\n    {\"name\": \"screen, crt screen\", \"id\": 3109, \"trainId\": 147},\n    {\"name\": \"ruins\", \"id\": 2179, \"trainId\": 148},\n    {\"name\": \"washer, automatic washer, washing machine\", \"id\": 2989, \"trainId\": 149},\n    {\"name\": \"blanket, cover\", \"id\": 206, \"trainId\": 150},\n    {\"name\": \"plaything, toy\", \"id\": 1930, \"trainId\": 151},\n    {\"name\": \"food, solid food\", \"id\": 1002, \"trainId\": 152},\n    {\"name\": \"screen, silver screen, projection screen\", \"id\": 2254, \"trainId\": 153},\n    {\"name\": \"oven\", \"id\": 1708, \"trainId\": 154},\n    {\"name\": \"stage\", \"id\": 2526, \"trainId\": 155},\n    {\"name\": \"beacon, lighthouse, beacon light, pharos\", \"id\": 160, \"trainId\": 156},\n    {\"name\": \"umbrella\", \"id\": 2901, \"trainId\": 157},\n    {\"name\": \"sculpture\", \"id\": 2262, \"trainId\": 158},\n    
{\"name\": \"aqueduct\", \"id\": 44, \"trainId\": 159},\n    {\"name\": \"container\", \"id\": 597, \"trainId\": 160},\n    {\"name\": \"scaffolding, staging\", \"id\": 2235, \"trainId\": 161},\n    {\"name\": \"hood, exhaust hood\", \"id\": 1260, \"trainId\": 162},\n    {\"name\": \"curb, curbing, kerb\", \"id\": 682, \"trainId\": 163},\n    {\"name\": \"roller coaster\", \"id\": 2151, \"trainId\": 164},\n    {\"name\": \"horse, equus caballus\", \"id\": 3107, \"trainId\": 165},\n    {\"name\": \"catwalk\", \"id\": 432, \"trainId\": 166},\n    {\"name\": \"glass, drinking glass\", \"id\": 1098, \"trainId\": 167},\n    {\"name\": \"vase\", \"id\": 2932, \"trainId\": 168},\n    {\"name\": \"central reservation\", \"id\": 461, \"trainId\": 169},\n    {\"name\": \"carousel\", \"id\": 410, \"trainId\": 170},\n    {\"name\": \"radiator\", \"id\": 2046, \"trainId\": 171},\n    {\"name\": \"closet\", \"id\": 533, \"trainId\": 172},\n    {\"name\": \"machine\", \"id\": 1481, \"trainId\": 173},\n    {\"name\": \"pier, wharf, wharfage, dock\", \"id\": 1858, \"trainId\": 174},\n    {\"name\": \"fan\", \"id\": 894, \"trainId\": 175},\n    {\"name\": \"inflatable bounce game\", \"id\": 1322, \"trainId\": 176},\n    {\"name\": \"pitch\", \"id\": 1891, \"trainId\": 177},\n    {\"name\": \"paper\", \"id\": 1756, \"trainId\": 178},\n    {\"name\": \"arcade, colonnade\", \"id\": 49, \"trainId\": 179},\n    {\"name\": \"hot tub\", \"id\": 1272, \"trainId\": 180},\n    {\"name\": \"helicopter\", \"id\": 1229, \"trainId\": 181},\n    {\"name\": \"tray\", \"id\": 2850, \"trainId\": 182},\n    {\"name\": \"partition, divider\", \"id\": 1784, \"trainId\": 183},\n    {\"name\": \"vineyard\", \"id\": 2962, \"trainId\": 184},\n    {\"name\": \"bowl\", \"id\": 259, \"trainId\": 185},\n    {\"name\": \"bullring\", \"id\": 319, \"trainId\": 186},\n    {\"name\": \"flag\", \"id\": 954, \"trainId\": 187},\n    {\"name\": \"pot\", \"id\": 1974, \"trainId\": 188},\n    {\"name\": \"footbridge, 
overcrossing, pedestrian bridge\", \"id\": 1013, \"trainId\": 189},\n    {\"name\": \"shower\", \"id\": 2356, \"trainId\": 190},\n    {\"name\": \"bag, traveling bag, travelling bag, grip, suitcase\", \"id\": 97, \"trainId\": 191},\n    {\"name\": \"bulletin board, notice board\", \"id\": 318, \"trainId\": 192},\n    {\"name\": \"confessional booth\", \"id\": 592, \"trainId\": 193},\n    {\"name\": \"trunk, tree trunk, bole\", \"id\": 2885, \"trainId\": 194},\n    {\"name\": \"forest\", \"id\": 1017, \"trainId\": 195},\n    {\"name\": \"elevator door\", \"id\": 851, \"trainId\": 196},\n    {\"name\": \"laptop, laptop computer\", \"id\": 1407, \"trainId\": 197},\n    {\"name\": \"instrument panel\", \"id\": 1332, \"trainId\": 198},\n    {\"name\": \"bucket, pail\", \"id\": 303, \"trainId\": 199},\n    {\"name\": \"tapestry, tapis\", \"id\": 2714, \"trainId\": 200},\n    {\"name\": \"platform\", \"id\": 1924, \"trainId\": 201},\n    {\"name\": \"jacket\", \"id\": 1346, \"trainId\": 202},\n    {\"name\": \"gate\", \"id\": 1081, \"trainId\": 203},\n    {\"name\": \"monitor, monitoring device\", \"id\": 1583, \"trainId\": 204},\n    {\n        \"name\": \"telephone booth, phone booth, call box, telephone box, telephone kiosk\",\n        \"id\": 2727,\n        \"trainId\": 205,\n    },\n    {\"name\": \"spotlight, spot\", \"id\": 2509, \"trainId\": 206},\n    {\"name\": \"ring\", \"id\": 2123, \"trainId\": 207},\n    {\"name\": \"control panel\", \"id\": 602, \"trainId\": 208},\n    {\"name\": \"blackboard, chalkboard\", \"id\": 202, \"trainId\": 209},\n    {\"name\": \"air conditioner, air conditioning\", \"id\": 10, \"trainId\": 210},\n    {\"name\": \"chest\", \"id\": 490, \"trainId\": 211},\n    {\"name\": \"clock\", \"id\": 530, \"trainId\": 212},\n    {\"name\": \"sand dune\", \"id\": 2213, \"trainId\": 213},\n    {\"name\": \"pipe, pipage, piping\", \"id\": 1884, \"trainId\": 214},\n    {\"name\": \"vault\", \"id\": 2934, \"trainId\": 215},\n    {\"name\": \"table 
football\", \"id\": 2687, \"trainId\": 216},\n    {\"name\": \"cannon\", \"id\": 387, \"trainId\": 217},\n    {\"name\": \"swimming pool, swimming bath, natatorium\", \"id\": 2668, \"trainId\": 218},\n    {\"name\": \"fluorescent, fluorescent fixture\", \"id\": 982, \"trainId\": 219},\n    {\"name\": \"statue\", \"id\": 2547, \"trainId\": 220},\n    {\n        \"name\": \"loudspeaker, speaker, speaker unit, loudspeaker system, speaker system\",\n        \"id\": 1474,\n        \"trainId\": 221,\n    },\n    {\"name\": \"exhibitor\", \"id\": 877, \"trainId\": 222},\n    {\"name\": \"ladder\", \"id\": 1391, \"trainId\": 223},\n    {\"name\": \"carport\", \"id\": 414, \"trainId\": 224},\n    {\"name\": \"dam\", \"id\": 698, \"trainId\": 225},\n    {\"name\": \"pulpit\", \"id\": 2019, \"trainId\": 226},\n    {\"name\": \"skylight, fanlight\", \"id\": 2422, \"trainId\": 227},\n    {\"name\": \"water tower\", \"id\": 3010, \"trainId\": 228},\n    {\"name\": \"grill, grille, grillwork\", \"id\": 1139, \"trainId\": 229},\n    {\"name\": \"display board\", \"id\": 753, \"trainId\": 230},\n    {\"name\": \"pane, pane of glass, window glass\", \"id\": 1747, \"trainId\": 231},\n    {\"name\": \"rubbish, trash, scrap\", \"id\": 2175, \"trainId\": 232},\n    {\"name\": \"ice rink\", \"id\": 1301, \"trainId\": 233},\n    {\"name\": \"fruit\", \"id\": 1033, \"trainId\": 234},\n    {\"name\": \"patio\", \"id\": 1789, \"trainId\": 235},\n    {\"name\": \"vending machine\", \"id\": 2939, \"trainId\": 236},\n    {\"name\": \"telephone, phone, telephone set\", \"id\": 2730, \"trainId\": 237},\n    {\"name\": \"net\", \"id\": 1652, \"trainId\": 238},\n    {\n        \"name\": \"backpack, back pack, knapsack, packsack, rucksack, haversack\",\n        \"id\": 90,\n        \"trainId\": 239,\n    },\n    {\"name\": \"jar\", \"id\": 1349, \"trainId\": 240},\n    {\"name\": \"track\", \"id\": 2830, \"trainId\": 241},\n    {\"name\": \"magazine\", \"id\": 1485, \"trainId\": 242},\n    
{\"name\": \"shutter\", \"id\": 2370, \"trainId\": 243},\n    {\"name\": \"roof\", \"id\": 2155, \"trainId\": 244},\n    {\"name\": \"banner, streamer\", \"id\": 118, \"trainId\": 245},\n    {\"name\": \"landfill\", \"id\": 1402, \"trainId\": 246},\n    {\"name\": \"post\", \"id\": 1957, \"trainId\": 247},\n    {\"name\": \"altarpiece, reredos\", \"id\": 3130, \"trainId\": 248},\n    {\"name\": \"hat, chapeau, lid\", \"id\": 1197, \"trainId\": 249},\n    {\"name\": \"arch, archway\", \"id\": 52, \"trainId\": 250},\n    {\"name\": \"table game\", \"id\": 2688, \"trainId\": 251},\n    {\"name\": \"bag, handbag, pocketbook, purse\", \"id\": 96, \"trainId\": 252},\n    {\"name\": \"document, written document, papers\", \"id\": 762, \"trainId\": 253},\n    {\"name\": \"dome\", \"id\": 772, \"trainId\": 254},\n    {\"name\": \"pier\", \"id\": 1857, \"trainId\": 255},\n    {\"name\": \"shanties\", \"id\": 2315, \"trainId\": 256},\n    {\"name\": \"forecourt\", \"id\": 1016, \"trainId\": 257},\n    {\"name\": \"crane\", \"id\": 643, \"trainId\": 258},\n    {\"name\": \"dog, domestic dog, canis familiaris\", \"id\": 3105, \"trainId\": 259},\n    {\"name\": \"piano, pianoforte, forte-piano\", \"id\": 1849, \"trainId\": 260},\n    {\"name\": \"drawing\", \"id\": 791, \"trainId\": 261},\n    {\"name\": \"cabin\", \"id\": 349, \"trainId\": 262},\n    {\n        \"name\": \"ad, advertisement, advertizement, advertising, advertizing, advert\",\n        \"id\": 6,\n        \"trainId\": 263,\n    },\n    {\"name\": \"amphitheater, amphitheatre, coliseum\", \"id\": 3114, \"trainId\": 264},\n    {\"name\": \"monument\", \"id\": 1587, \"trainId\": 265},\n    {\"name\": \"henhouse\", \"id\": 1233, \"trainId\": 266},\n    {\"name\": \"cockpit\", \"id\": 559, \"trainId\": 267},\n    {\"name\": \"heater, warmer\", \"id\": 1223, \"trainId\": 268},\n    {\"name\": \"windmill, aerogenerator, wind generator\", \"id\": 3049, \"trainId\": 269},\n    {\"name\": \"pool\", \"id\": 1943, 
\"trainId\": 270},\n    {\"name\": \"elevator, lift\", \"id\": 853, \"trainId\": 271},\n    {\"name\": \"decoration, ornament, ornamentation\", \"id\": 709, \"trainId\": 272},\n    {\"name\": \"labyrinth\", \"id\": 1390, \"trainId\": 273},\n    {\"name\": \"text, textual matter\", \"id\": 2748, \"trainId\": 274},\n    {\"name\": \"printer\", \"id\": 2007, \"trainId\": 275},\n    {\"name\": \"mezzanine, first balcony\", \"id\": 1546, \"trainId\": 276},\n    {\"name\": \"mattress\", \"id\": 1513, \"trainId\": 277},\n    {\"name\": \"straw\", \"id\": 2600, \"trainId\": 278},\n    {\"name\": \"stalls\", \"id\": 2538, \"trainId\": 279},\n    {\"name\": \"patio, terrace\", \"id\": 1790, \"trainId\": 280},\n    {\"name\": \"billboard, hoarding\", \"id\": 194, \"trainId\": 281},\n    {\"name\": \"bus stop\", \"id\": 326, \"trainId\": 282},\n    {\"name\": \"trouser, pant\", \"id\": 2877, \"trainId\": 283},\n    {\"name\": \"console table, console\", \"id\": 594, \"trainId\": 284},\n    {\"name\": \"rack\", \"id\": 2036, \"trainId\": 285},\n    {\"name\": \"notebook\", \"id\": 1662, \"trainId\": 286},\n    {\"name\": \"shrine\", \"id\": 2366, \"trainId\": 287},\n    {\"name\": \"pantry\", \"id\": 1754, \"trainId\": 288},\n    {\"name\": \"cart\", \"id\": 418, \"trainId\": 289},\n    {\"name\": \"steam shovel\", \"id\": 2553, \"trainId\": 290},\n    {\"name\": \"porch\", \"id\": 1951, \"trainId\": 291},\n    {\"name\": \"postbox, mailbox, letter box\", \"id\": 1963, \"trainId\": 292},\n    {\"name\": \"figurine, statuette\", \"id\": 918, \"trainId\": 293},\n    {\"name\": \"recycling bin\", \"id\": 2086, \"trainId\": 294},\n    {\"name\": \"folding screen\", \"id\": 997, \"trainId\": 295},\n    {\"name\": \"telescope\", \"id\": 2731, \"trainId\": 296},\n    {\"name\": \"deck chair, beach chair\", \"id\": 704, \"trainId\": 297},\n    {\"name\": \"kennel\", \"id\": 1365, \"trainId\": 298},\n    {\"name\": \"coffee maker\", \"id\": 569, \"trainId\": 299},\n    {\"name\": 
\"altar, communion table, lord's table\", \"id\": 3108, \"trainId\": 300},\n    {\"name\": \"fish\", \"id\": 948, \"trainId\": 301},\n    {\"name\": \"easel\", \"id\": 839, \"trainId\": 302},\n    {\"name\": \"artificial golf green\", \"id\": 63, \"trainId\": 303},\n    {\"name\": \"iceberg\", \"id\": 1305, \"trainId\": 304},\n    {\"name\": \"candlestick, candle holder\", \"id\": 378, \"trainId\": 305},\n    {\"name\": \"shower stall, shower bath\", \"id\": 2362, \"trainId\": 306},\n    {\"name\": \"television stand\", \"id\": 2734, \"trainId\": 307},\n    {\n        \"name\": \"wall socket, wall plug, electric outlet, electrical outlet, outlet, electric receptacle\",\n        \"id\": 2982,\n        \"trainId\": 308,\n    },\n    {\"name\": \"skeleton\", \"id\": 2398, \"trainId\": 309},\n    {\"name\": \"grand piano, grand\", \"id\": 1119, \"trainId\": 310},\n    {\"name\": \"candy, confect\", \"id\": 382, \"trainId\": 311},\n    {\"name\": \"grille door\", \"id\": 1141, \"trainId\": 312},\n    {\"name\": \"pedestal, plinth, footstall\", \"id\": 1805, \"trainId\": 313},\n    {\"name\": \"jersey, t-shirt, tee shirt\", \"id\": 3102, \"trainId\": 314},\n    {\"name\": \"shoe\", \"id\": 2341, \"trainId\": 315},\n    {\"name\": \"gravestone, headstone, tombstone\", \"id\": 1131, \"trainId\": 316},\n    {\"name\": \"shanty\", \"id\": 2316, \"trainId\": 317},\n    {\"name\": \"structure\", \"id\": 2626, \"trainId\": 318},\n    {\"name\": \"rocking chair, rocker\", \"id\": 3104, \"trainId\": 319},\n    {\"name\": \"bird\", \"id\": 198, \"trainId\": 320},\n    {\"name\": \"place mat\", \"id\": 1896, \"trainId\": 321},\n    {\"name\": \"tomb\", \"id\": 2800, \"trainId\": 322},\n    {\"name\": \"big top\", \"id\": 190, \"trainId\": 323},\n    {\"name\": \"gas pump, gasoline pump, petrol pump, island dispenser\", \"id\": 3131, \"trainId\": 324},\n    {\"name\": \"lockers\", \"id\": 1463, \"trainId\": 325},\n    {\"name\": \"cage\", \"id\": 357, \"trainId\": 326},\n    
{\"name\": \"finger\", \"id\": 929, \"trainId\": 327},\n    {\"name\": \"bleachers\", \"id\": 209, \"trainId\": 328},\n    {\"name\": \"ferris wheel\", \"id\": 912, \"trainId\": 329},\n    {\"name\": \"hairdresser chair\", \"id\": 1164, \"trainId\": 330},\n    {\"name\": \"mat\", \"id\": 1509, \"trainId\": 331},\n    {\"name\": \"stands\", \"id\": 2539, \"trainId\": 332},\n    {\"name\": \"aquarium, fish tank, marine museum\", \"id\": 3116, \"trainId\": 333},\n    {\"name\": \"streetcar, tram, tramcar, trolley, trolley car\", \"id\": 2615, \"trainId\": 334},\n    {\"name\": \"napkin, table napkin, serviette\", \"id\": 1644, \"trainId\": 335},\n    {\"name\": \"dummy\", \"id\": 818, \"trainId\": 336},\n    {\"name\": \"booklet, brochure, folder, leaflet, pamphlet\", \"id\": 242, \"trainId\": 337},\n    {\"name\": \"sand trap\", \"id\": 2217, \"trainId\": 338},\n    {\"name\": \"shop, store\", \"id\": 2347, \"trainId\": 339},\n    {\"name\": \"table cloth\", \"id\": 2686, \"trainId\": 340},\n    {\"name\": \"service station\", \"id\": 2300, \"trainId\": 341},\n    {\"name\": \"coffin\", \"id\": 572, \"trainId\": 342},\n    {\"name\": \"drawer\", \"id\": 789, \"trainId\": 343},\n    {\"name\": \"cages\", \"id\": 358, \"trainId\": 344},\n    {\"name\": \"slot machine, coin machine\", \"id\": 2443, \"trainId\": 345},\n    {\"name\": \"balcony\", \"id\": 101, \"trainId\": 346},\n    {\"name\": \"volleyball court\", \"id\": 2969, \"trainId\": 347},\n    {\"name\": \"table tennis\", \"id\": 2692, \"trainId\": 348},\n    {\"name\": \"control table\", \"id\": 606, \"trainId\": 349},\n    {\"name\": \"shirt\", \"id\": 2339, \"trainId\": 350},\n    {\"name\": \"merchandise, ware, product\", \"id\": 1533, \"trainId\": 351},\n    {\"name\": \"railway\", \"id\": 2060, \"trainId\": 352},\n    {\"name\": \"parterre\", \"id\": 1782, \"trainId\": 353},\n    {\"name\": \"chimney\", \"id\": 495, \"trainId\": 354},\n    {\"name\": \"can, tin, tin can\", \"id\": 371, \"trainId\": 355},\n 
   {\"name\": \"tanks\", \"id\": 2707, \"trainId\": 356},\n    {\"name\": \"fabric, cloth, material, textile\", \"id\": 889, \"trainId\": 357},\n    {\"name\": \"alga, algae\", \"id\": 3156, \"trainId\": 358},\n    {\"name\": \"system\", \"id\": 2683, \"trainId\": 359},\n    {\"name\": \"map\", \"id\": 1499, \"trainId\": 360},\n    {\"name\": \"greenhouse\", \"id\": 1135, \"trainId\": 361},\n    {\"name\": \"mug\", \"id\": 1619, \"trainId\": 362},\n    {\"name\": \"barbecue\", \"id\": 125, \"trainId\": 363},\n    {\"name\": \"trailer\", \"id\": 2838, \"trainId\": 364},\n    {\"name\": \"toilet tissue, toilet paper, bathroom tissue\", \"id\": 2792, \"trainId\": 365},\n    {\"name\": \"organ\", \"id\": 1695, \"trainId\": 366},\n    {\"name\": \"dishrag, dishcloth\", \"id\": 746, \"trainId\": 367},\n    {\"name\": \"island\", \"id\": 1343, \"trainId\": 368},\n    {\"name\": \"keyboard\", \"id\": 1370, \"trainId\": 369},\n    {\"name\": \"trench\", \"id\": 2858, \"trainId\": 370},\n    {\"name\": \"basket, basketball hoop, hoop\", \"id\": 145, \"trainId\": 371},\n    {\"name\": \"steering wheel, wheel\", \"id\": 2565, \"trainId\": 372},\n    {\"name\": \"pitcher, ewer\", \"id\": 1892, \"trainId\": 373},\n    {\"name\": \"goal\", \"id\": 1103, \"trainId\": 374},\n    {\"name\": \"bread, breadstuff, staff of life\", \"id\": 286, \"trainId\": 375},\n    {\"name\": \"beds\", \"id\": 170, \"trainId\": 376},\n    {\"name\": \"wood\", \"id\": 3073, \"trainId\": 377},\n    {\"name\": \"file cabinet\", \"id\": 922, \"trainId\": 378},\n    {\"name\": \"newspaper, paper\", \"id\": 1655, \"trainId\": 379},\n    {\"name\": \"motorboat\", \"id\": 1602, \"trainId\": 380},\n    {\"name\": \"rope\", \"id\": 2160, \"trainId\": 381},\n    {\"name\": \"guitar\", \"id\": 1151, \"trainId\": 382},\n    {\"name\": \"rubble\", \"id\": 2176, \"trainId\": 383},\n    {\"name\": \"scarf\", \"id\": 2239, \"trainId\": 384},\n    {\"name\": \"barrels\", \"id\": 132, \"trainId\": 385},\n    {\"name\": 
\"cap\", \"id\": 394, \"trainId\": 386},\n    {\"name\": \"leaves\", \"id\": 1424, \"trainId\": 387},\n    {\"name\": \"control tower\", \"id\": 607, \"trainId\": 388},\n    {\"name\": \"dashboard\", \"id\": 700, \"trainId\": 389},\n    {\"name\": \"bandstand\", \"id\": 116, \"trainId\": 390},\n    {\"name\": \"lectern\", \"id\": 1425, \"trainId\": 391},\n    {\"name\": \"switch, electric switch, electrical switch\", \"id\": 2676, \"trainId\": 392},\n    {\"name\": \"baseboard, mopboard, skirting board\", \"id\": 141, \"trainId\": 393},\n    {\"name\": \"shower room\", \"id\": 2360, \"trainId\": 394},\n    {\"name\": \"smoke\", \"id\": 2449, \"trainId\": 395},\n    {\"name\": \"faucet, spigot\", \"id\": 897, \"trainId\": 396},\n    {\"name\": \"bulldozer\", \"id\": 317, \"trainId\": 397},\n    {\"name\": \"saucepan\", \"id\": 2228, \"trainId\": 398},\n    {\"name\": \"shops\", \"id\": 2351, \"trainId\": 399},\n    {\"name\": \"meter\", \"id\": 1543, \"trainId\": 400},\n    {\"name\": \"crevasse\", \"id\": 656, \"trainId\": 401},\n    {\"name\": \"gear\", \"id\": 1088, \"trainId\": 402},\n    {\"name\": \"candelabrum, candelabra\", \"id\": 373, \"trainId\": 403},\n    {\"name\": \"sofa bed\", \"id\": 2472, \"trainId\": 404},\n    {\"name\": \"tunnel\", \"id\": 2892, \"trainId\": 405},\n    {\"name\": \"pallet\", \"id\": 1740, \"trainId\": 406},\n    {\"name\": \"wire, conducting wire\", \"id\": 3067, \"trainId\": 407},\n    {\"name\": \"kettle, boiler\", \"id\": 1367, \"trainId\": 408},\n    {\"name\": \"bidet\", \"id\": 188, \"trainId\": 409},\n    {\n        \"name\": \"baby buggy, baby carriage, carriage, perambulator, pram, stroller, go-cart, pushchair, pusher\",\n        \"id\": 79,\n        \"trainId\": 410,\n    },\n    {\"name\": \"music stand\", \"id\": 1633, \"trainId\": 411},\n    {\"name\": \"pipe, tube\", \"id\": 1885, \"trainId\": 412},\n    {\"name\": \"cup\", \"id\": 677, \"trainId\": 413},\n    {\"name\": \"parking meter\", \"id\": 1779, 
\"trainId\": 414},\n    {\"name\": \"ice hockey rink\", \"id\": 1297, \"trainId\": 415},\n    {\"name\": \"shelter\", \"id\": 2334, \"trainId\": 416},\n    {\"name\": \"weeds\", \"id\": 3027, \"trainId\": 417},\n    {\"name\": \"temple\", \"id\": 2735, \"trainId\": 418},\n    {\"name\": \"patty, cake\", \"id\": 1791, \"trainId\": 419},\n    {\"name\": \"ski slope\", \"id\": 2405, \"trainId\": 420},\n    {\"name\": \"panel\", \"id\": 1748, \"trainId\": 421},\n    {\"name\": \"wallet\", \"id\": 2983, \"trainId\": 422},\n    {\"name\": \"wheel\", \"id\": 3035, \"trainId\": 423},\n    {\"name\": \"towel rack, towel horse\", \"id\": 2824, \"trainId\": 424},\n    {\"name\": \"roundabout\", \"id\": 2168, \"trainId\": 425},\n    {\"name\": \"canister, cannister, tin\", \"id\": 385, \"trainId\": 426},\n    {\"name\": \"rod\", \"id\": 2148, \"trainId\": 427},\n    {\"name\": \"soap dispenser\", \"id\": 2465, \"trainId\": 428},\n    {\"name\": \"bell\", \"id\": 175, \"trainId\": 429},\n    {\"name\": \"canvas\", \"id\": 390, \"trainId\": 430},\n    {\"name\": \"box office, ticket office, ticket booth\", \"id\": 268, \"trainId\": 431},\n    {\"name\": \"teacup\", \"id\": 2722, \"trainId\": 432},\n    {\"name\": \"trellis\", \"id\": 2857, \"trainId\": 433},\n    {\"name\": \"workbench\", \"id\": 3088, \"trainId\": 434},\n    {\"name\": \"valley, vale\", \"id\": 2926, \"trainId\": 435},\n    {\"name\": \"toaster\", \"id\": 2782, \"trainId\": 436},\n    {\"name\": \"knife\", \"id\": 1378, \"trainId\": 437},\n    {\"name\": \"podium\", \"id\": 1934, \"trainId\": 438},\n    {\"name\": \"ramp\", \"id\": 2072, \"trainId\": 439},\n    {\"name\": \"tumble dryer\", \"id\": 2889, \"trainId\": 440},\n    {\"name\": \"fireplug, fire hydrant, plug\", \"id\": 944, \"trainId\": 441},\n    {\"name\": \"gym shoe, sneaker, tennis shoe\", \"id\": 1158, \"trainId\": 442},\n    {\"name\": \"lab bench\", \"id\": 1383, \"trainId\": 443},\n    {\"name\": \"equipment\", \"id\": 867, \"trainId\": 
444},\n    {\"name\": \"rocky formation\", \"id\": 2145, \"trainId\": 445},\n    {\"name\": \"plastic\", \"id\": 1915, \"trainId\": 446},\n    {\"name\": \"calendar\", \"id\": 361, \"trainId\": 447},\n    {\"name\": \"caravan\", \"id\": 402, \"trainId\": 448},\n    {\"name\": \"check-in-desk\", \"id\": 482, \"trainId\": 449},\n    {\"name\": \"ticket counter\", \"id\": 2761, \"trainId\": 450},\n    {\"name\": \"brush\", \"id\": 300, \"trainId\": 451},\n    {\"name\": \"mill\", \"id\": 1554, \"trainId\": 452},\n    {\"name\": \"covered bridge\", \"id\": 636, \"trainId\": 453},\n    {\"name\": \"bowling alley\", \"id\": 260, \"trainId\": 454},\n    {\"name\": \"hanger\", \"id\": 1186, \"trainId\": 455},\n    {\"name\": \"excavator\", \"id\": 871, \"trainId\": 456},\n    {\"name\": \"trestle\", \"id\": 2859, \"trainId\": 457},\n    {\"name\": \"revolving door\", \"id\": 2103, \"trainId\": 458},\n    {\"name\": \"blast furnace\", \"id\": 208, \"trainId\": 459},\n    {\"name\": \"scale, weighing machine\", \"id\": 2236, \"trainId\": 460},\n    {\"name\": \"projector\", \"id\": 2012, \"trainId\": 461},\n    {\"name\": \"soap\", \"id\": 2462, \"trainId\": 462},\n    {\"name\": \"locker\", \"id\": 1462, \"trainId\": 463},\n    {\"name\": \"tractor\", \"id\": 2832, \"trainId\": 464},\n    {\"name\": \"stretcher\", \"id\": 2617, \"trainId\": 465},\n    {\"name\": \"frame\", \"id\": 1024, \"trainId\": 466},\n    {\"name\": \"grating\", \"id\": 1129, \"trainId\": 467},\n    {\"name\": \"alembic\", \"id\": 18, \"trainId\": 468},\n    {\"name\": \"candle, taper, wax light\", \"id\": 376, \"trainId\": 469},\n    {\"name\": \"barrier\", \"id\": 134, \"trainId\": 470},\n    {\"name\": \"cardboard\", \"id\": 407, \"trainId\": 471},\n    {\"name\": \"cave\", \"id\": 434, \"trainId\": 472},\n    {\"name\": \"puddle\", \"id\": 2017, \"trainId\": 473},\n    {\"name\": \"tarp\", \"id\": 2717, \"trainId\": 474},\n    {\"name\": \"price tag\", \"id\": 2005, \"trainId\": 475},\n    
{\"name\": \"watchtower\", \"id\": 2993, \"trainId\": 476},\n    {\"name\": \"meters\", \"id\": 1545, \"trainId\": 477},\n    {\n        \"name\": \"light bulb, lightbulb, bulb, incandescent lamp, electric light, electric-light bulb\",\n        \"id\": 1445,\n        \"trainId\": 478,\n    },\n    {\"name\": \"tracks\", \"id\": 2831, \"trainId\": 479},\n    {\"name\": \"hair dryer\", \"id\": 1161, \"trainId\": 480},\n    {\"name\": \"skirt\", \"id\": 2411, \"trainId\": 481},\n    {\"name\": \"viaduct\", \"id\": 2949, \"trainId\": 482},\n    {\"name\": \"paper towel\", \"id\": 1769, \"trainId\": 483},\n    {\"name\": \"coat\", \"id\": 552, \"trainId\": 484},\n    {\"name\": \"sheet\", \"id\": 2327, \"trainId\": 485},\n    {\"name\": \"fire extinguisher, extinguisher, asphyxiator\", \"id\": 939, \"trainId\": 486},\n    {\"name\": \"water wheel\", \"id\": 3013, \"trainId\": 487},\n    {\"name\": \"pottery, clayware\", \"id\": 1986, \"trainId\": 488},\n    {\"name\": \"magazine rack\", \"id\": 1486, \"trainId\": 489},\n    {\"name\": \"teapot\", \"id\": 2723, \"trainId\": 490},\n    {\"name\": \"microphone, mike\", \"id\": 1549, \"trainId\": 491},\n    {\"name\": \"support\", \"id\": 2649, \"trainId\": 492},\n    {\"name\": \"forklift\", \"id\": 1020, \"trainId\": 493},\n    {\"name\": \"canyon\", \"id\": 392, \"trainId\": 494},\n    {\"name\": \"cash register, register\", \"id\": 422, \"trainId\": 495},\n    {\"name\": \"leaf, leafage, foliage\", \"id\": 1419, \"trainId\": 496},\n    {\"name\": \"remote control, remote\", \"id\": 2099, \"trainId\": 497},\n    {\"name\": \"soap dish\", \"id\": 2464, \"trainId\": 498},\n    {\"name\": \"windshield, windscreen\", \"id\": 3058, \"trainId\": 499},\n    {\"name\": \"cat\", \"id\": 430, \"trainId\": 500},\n    {\"name\": \"cue, cue stick, pool cue, pool stick\", \"id\": 675, \"trainId\": 501},\n    {\"name\": \"vent, venthole, vent-hole, blowhole\", \"id\": 2941, \"trainId\": 502},\n    {\"name\": \"videos\", \"id\": 2955, 
\"trainId\": 503},\n    {\"name\": \"shovel\", \"id\": 2355, \"trainId\": 504},\n    {\"name\": \"eaves\", \"id\": 840, \"trainId\": 505},\n    {\"name\": \"antenna, aerial, transmitting aerial\", \"id\": 32, \"trainId\": 506},\n    {\"name\": \"shipyard\", \"id\": 2338, \"trainId\": 507},\n    {\"name\": \"hen, biddy\", \"id\": 1232, \"trainId\": 508},\n    {\"name\": \"traffic cone\", \"id\": 2834, \"trainId\": 509},\n    {\"name\": \"washing machines\", \"id\": 2991, \"trainId\": 510},\n    {\"name\": \"truck crane\", \"id\": 2879, \"trainId\": 511},\n    {\"name\": \"cds\", \"id\": 444, \"trainId\": 512},\n    {\"name\": \"niche\", \"id\": 1657, \"trainId\": 513},\n    {\"name\": \"scoreboard\", \"id\": 2246, \"trainId\": 514},\n    {\"name\": \"briefcase\", \"id\": 296, \"trainId\": 515},\n    {\"name\": \"boot\", \"id\": 245, \"trainId\": 516},\n    {\"name\": \"sweater, jumper\", \"id\": 2661, \"trainId\": 517},\n    {\"name\": \"hay\", \"id\": 1202, \"trainId\": 518},\n    {\"name\": \"pack\", \"id\": 1714, \"trainId\": 519},\n    {\"name\": \"bottle rack\", \"id\": 251, \"trainId\": 520},\n    {\"name\": \"glacier\", \"id\": 1095, \"trainId\": 521},\n    {\"name\": \"pergola\", \"id\": 1828, \"trainId\": 522},\n    {\"name\": \"building materials\", \"id\": 311, \"trainId\": 523},\n    {\"name\": \"television camera\", \"id\": 2732, \"trainId\": 524},\n    {\"name\": \"first floor\", \"id\": 947, \"trainId\": 525},\n    {\"name\": \"rifle\", \"id\": 2115, \"trainId\": 526},\n    {\"name\": \"tennis table\", \"id\": 2738, \"trainId\": 527},\n    {\"name\": \"stadium\", \"id\": 2525, \"trainId\": 528},\n    {\"name\": \"safety belt\", \"id\": 2194, \"trainId\": 529},\n    {\"name\": \"cover\", \"id\": 634, \"trainId\": 530},\n    {\"name\": \"dish rack\", \"id\": 740, \"trainId\": 531},\n    {\"name\": \"synthesizer\", \"id\": 2682, \"trainId\": 532},\n    {\"name\": \"pumpkin\", \"id\": 2020, \"trainId\": 533},\n    {\"name\": \"gutter\", \"id\": 1156, 
\"trainId\": 534},\n    {\"name\": \"fruit stand\", \"id\": 1036, \"trainId\": 535},\n    {\"name\": \"ice floe, floe\", \"id\": 1295, \"trainId\": 536},\n    {\"name\": \"handle, grip, handgrip, hold\", \"id\": 1181, \"trainId\": 537},\n    {\"name\": \"wheelchair\", \"id\": 3037, \"trainId\": 538},\n    {\"name\": \"mousepad, mouse mat\", \"id\": 1614, \"trainId\": 539},\n    {\"name\": \"diploma\", \"id\": 736, \"trainId\": 540},\n    {\"name\": \"fairground ride\", \"id\": 893, \"trainId\": 541},\n    {\"name\": \"radio\", \"id\": 2047, \"trainId\": 542},\n    {\"name\": \"hotplate\", \"id\": 1274, \"trainId\": 543},\n    {\"name\": \"junk\", \"id\": 1361, \"trainId\": 544},\n    {\"name\": \"wheelbarrow\", \"id\": 3036, \"trainId\": 545},\n    {\"name\": \"stream\", \"id\": 2606, \"trainId\": 546},\n    {\"name\": \"toll plaza\", \"id\": 2797, \"trainId\": 547},\n    {\"name\": \"punching bag\", \"id\": 2022, \"trainId\": 548},\n    {\"name\": \"trough\", \"id\": 2876, \"trainId\": 549},\n    {\"name\": \"throne\", \"id\": 2758, \"trainId\": 550},\n    {\"name\": \"chair desk\", \"id\": 472, \"trainId\": 551},\n    {\"name\": \"weighbridge\", \"id\": 3028, \"trainId\": 552},\n    {\"name\": \"extractor fan\", \"id\": 882, \"trainId\": 553},\n    {\"name\": \"hanging clothes\", \"id\": 1189, \"trainId\": 554},\n    {\"name\": \"dish, dish aerial, dish antenna, saucer\", \"id\": 743, \"trainId\": 555},\n    {\"name\": \"alarm clock, alarm\", \"id\": 3122, \"trainId\": 556},\n    {\"name\": \"ski lift\", \"id\": 2401, \"trainId\": 557},\n    {\"name\": \"chain\", \"id\": 468, \"trainId\": 558},\n    {\"name\": \"garage\", \"id\": 1061, \"trainId\": 559},\n    {\"name\": \"mechanical shovel\", \"id\": 1523, \"trainId\": 560},\n    {\"name\": \"wine rack\", \"id\": 3059, \"trainId\": 561},\n    {\"name\": \"tramway\", \"id\": 2843, \"trainId\": 562},\n    {\"name\": \"treadmill\", \"id\": 2853, \"trainId\": 563},\n    {\"name\": \"menu\", \"id\": 1529, \"trainId\": 
564},\n    {\"name\": \"block\", \"id\": 214, \"trainId\": 565},\n    {\"name\": \"well\", \"id\": 3032, \"trainId\": 566},\n    {\"name\": \"witness stand\", \"id\": 3071, \"trainId\": 567},\n    {\"name\": \"branch\", \"id\": 277, \"trainId\": 568},\n    {\"name\": \"duck\", \"id\": 813, \"trainId\": 569},\n    {\"name\": \"casserole\", \"id\": 426, \"trainId\": 570},\n    {\"name\": \"frying pan\", \"id\": 1039, \"trainId\": 571},\n    {\"name\": \"desk organizer\", \"id\": 727, \"trainId\": 572},\n    {\"name\": \"mast\", \"id\": 1508, \"trainId\": 573},\n    {\"name\": \"spectacles, specs, eyeglasses, glasses\", \"id\": 2490, \"trainId\": 574},\n    {\"name\": \"service elevator\", \"id\": 2299, \"trainId\": 575},\n    {\"name\": \"dollhouse\", \"id\": 768, \"trainId\": 576},\n    {\"name\": \"hammock\", \"id\": 1172, \"trainId\": 577},\n    {\"name\": \"clothes hanging\", \"id\": 537, \"trainId\": 578},\n    {\"name\": \"photocopier\", \"id\": 1847, \"trainId\": 579},\n    {\"name\": \"notepad\", \"id\": 1664, \"trainId\": 580},\n    {\"name\": \"golf cart\", \"id\": 1110, \"trainId\": 581},\n    {\"name\": \"footpath\", \"id\": 1014, \"trainId\": 582},\n    {\"name\": \"cross\", \"id\": 662, \"trainId\": 583},\n    {\"name\": \"baptismal font\", \"id\": 121, \"trainId\": 584},\n    {\"name\": \"boiler\", \"id\": 227, \"trainId\": 585},\n    {\"name\": \"skip\", \"id\": 2410, \"trainId\": 586},\n    {\"name\": \"rotisserie\", \"id\": 2165, \"trainId\": 587},\n    {\"name\": \"tables\", \"id\": 2696, \"trainId\": 588},\n    {\"name\": \"water mill\", \"id\": 3005, \"trainId\": 589},\n    {\"name\": \"helmet\", \"id\": 1231, \"trainId\": 590},\n    {\"name\": \"cover curtain\", \"id\": 635, \"trainId\": 591},\n    {\"name\": \"brick\", \"id\": 292, \"trainId\": 592},\n    {\"name\": \"table runner\", \"id\": 2690, \"trainId\": 593},\n    {\"name\": \"ashtray\", \"id\": 65, \"trainId\": 594},\n    {\"name\": \"street box\", \"id\": 2607, \"trainId\": 595},\n    
{\"name\": \"stick\", \"id\": 2574, \"trainId\": 596},\n    {\"name\": \"hangers\", \"id\": 1188, \"trainId\": 597},\n    {\"name\": \"cells\", \"id\": 456, \"trainId\": 598},\n    {\"name\": \"urinal\", \"id\": 2913, \"trainId\": 599},\n    {\"name\": \"centerpiece\", \"id\": 459, \"trainId\": 600},\n    {\"name\": \"portable fridge\", \"id\": 1955, \"trainId\": 601},\n    {\"name\": \"dvds\", \"id\": 827, \"trainId\": 602},\n    {\"name\": \"golf club\", \"id\": 1111, \"trainId\": 603},\n    {\"name\": \"skirting board\", \"id\": 2412, \"trainId\": 604},\n    {\"name\": \"water cooler\", \"id\": 2997, \"trainId\": 605},\n    {\"name\": \"clipboard\", \"id\": 528, \"trainId\": 606},\n    {\"name\": \"camera, photographic camera\", \"id\": 366, \"trainId\": 607},\n    {\"name\": \"pigeonhole\", \"id\": 1863, \"trainId\": 608},\n    {\"name\": \"chips\", \"id\": 500, \"trainId\": 609},\n    {\"name\": \"food processor\", \"id\": 1001, \"trainId\": 610},\n    {\"name\": \"post box\", \"id\": 1958, \"trainId\": 611},\n    {\"name\": \"lid\", \"id\": 1441, \"trainId\": 612},\n    {\"name\": \"drum\", \"id\": 809, \"trainId\": 613},\n    {\"name\": \"blender\", \"id\": 210, \"trainId\": 614},\n    {\"name\": \"cave entrance\", \"id\": 435, \"trainId\": 615},\n    {\"name\": \"dental chair\", \"id\": 718, \"trainId\": 616},\n    {\"name\": \"obelisk\", \"id\": 1674, \"trainId\": 617},\n    {\"name\": \"canoe\", \"id\": 388, \"trainId\": 618},\n    {\"name\": \"mobile\", \"id\": 1572, \"trainId\": 619},\n    {\"name\": \"monitors\", \"id\": 1584, \"trainId\": 620},\n    {\"name\": \"pool ball\", \"id\": 1944, \"trainId\": 621},\n    {\"name\": \"cue rack\", \"id\": 674, \"trainId\": 622},\n    {\"name\": \"baggage carts\", \"id\": 99, \"trainId\": 623},\n    {\"name\": \"shore\", \"id\": 2352, \"trainId\": 624},\n    {\"name\": \"fork\", \"id\": 1019, \"trainId\": 625},\n    {\"name\": \"paper filer\", \"id\": 1763, \"trainId\": 626},\n    {\"name\": \"bicycle rack\", 
\"id\": 185, \"trainId\": 627},\n    {\"name\": \"coat rack\", \"id\": 554, \"trainId\": 628},\n    {\"name\": \"garland\", \"id\": 1066, \"trainId\": 629},\n    {\"name\": \"sports bag\", \"id\": 2508, \"trainId\": 630},\n    {\"name\": \"fish tank\", \"id\": 951, \"trainId\": 631},\n    {\"name\": \"towel dispenser\", \"id\": 2822, \"trainId\": 632},\n    {\"name\": \"carriage\", \"id\": 415, \"trainId\": 633},\n    {\"name\": \"brochure\", \"id\": 297, \"trainId\": 634},\n    {\"name\": \"plaque\", \"id\": 1914, \"trainId\": 635},\n    {\"name\": \"stringer\", \"id\": 2619, \"trainId\": 636},\n    {\"name\": \"iron\", \"id\": 1338, \"trainId\": 637},\n    {\"name\": \"spoon\", \"id\": 2505, \"trainId\": 638},\n    {\"name\": \"flag pole\", \"id\": 955, \"trainId\": 639},\n    {\"name\": \"toilet brush\", \"id\": 2786, \"trainId\": 640},\n    {\"name\": \"book stand\", \"id\": 238, \"trainId\": 641},\n    {\"name\": \"water faucet, water tap, tap, hydrant\", \"id\": 3000, \"trainId\": 642},\n    {\"name\": \"ticket office\", \"id\": 2763, \"trainId\": 643},\n    {\"name\": \"broom\", \"id\": 299, \"trainId\": 644},\n    {\"name\": \"dvd\", \"id\": 822, \"trainId\": 645},\n    {\"name\": \"ice bucket\", \"id\": 1288, \"trainId\": 646},\n    {\"name\": \"carapace, shell, cuticle, shield\", \"id\": 3101, \"trainId\": 647},\n    {\"name\": \"tureen\", \"id\": 2894, \"trainId\": 648},\n    {\"name\": \"folders\", \"id\": 992, \"trainId\": 649},\n    {\"name\": \"chess\", \"id\": 489, \"trainId\": 650},\n    {\"name\": \"root\", \"id\": 2157, \"trainId\": 651},\n    {\"name\": \"sewing machine\", \"id\": 2309, \"trainId\": 652},\n    {\"name\": \"model\", \"id\": 1576, \"trainId\": 653},\n    {\"name\": \"pen\", \"id\": 1810, \"trainId\": 654},\n    {\"name\": \"violin\", \"id\": 2964, \"trainId\": 655},\n    {\"name\": \"sweatshirt\", \"id\": 2662, \"trainId\": 656},\n    {\"name\": \"recycling materials\", \"id\": 2087, \"trainId\": 657},\n    {\"name\": \"mitten\", 
\"id\": 1569, \"trainId\": 658},\n    {\"name\": \"chopping board, cutting board\", \"id\": 503, \"trainId\": 659},\n    {\"name\": \"mask\", \"id\": 1505, \"trainId\": 660},\n    {\"name\": \"log\", \"id\": 1468, \"trainId\": 661},\n    {\"name\": \"mouse, computer mouse\", \"id\": 1613, \"trainId\": 662},\n    {\"name\": \"grill\", \"id\": 1138, \"trainId\": 663},\n    {\"name\": \"hole\", \"id\": 1256, \"trainId\": 664},\n    {\"name\": \"target\", \"id\": 2715, \"trainId\": 665},\n    {\"name\": \"trash bag\", \"id\": 2846, \"trainId\": 666},\n    {\"name\": \"chalk\", \"id\": 477, \"trainId\": 667},\n    {\"name\": \"sticks\", \"id\": 2576, \"trainId\": 668},\n    {\"name\": \"balloon\", \"id\": 108, \"trainId\": 669},\n    {\"name\": \"score\", \"id\": 2245, \"trainId\": 670},\n    {\"name\": \"hair spray\", \"id\": 1162, \"trainId\": 671},\n    {\"name\": \"roll\", \"id\": 2149, \"trainId\": 672},\n    {\"name\": \"runner\", \"id\": 2183, \"trainId\": 673},\n    {\"name\": \"engine\", \"id\": 858, \"trainId\": 674},\n    {\"name\": \"inflatable glove\", \"id\": 1324, \"trainId\": 675},\n    {\"name\": \"games\", \"id\": 1055, \"trainId\": 676},\n    {\"name\": \"pallets\", \"id\": 1741, \"trainId\": 677},\n    {\"name\": \"baskets\", \"id\": 149, \"trainId\": 678},\n    {\"name\": \"coop\", \"id\": 615, \"trainId\": 679},\n    {\"name\": \"dvd player\", \"id\": 825, \"trainId\": 680},\n    {\"name\": \"rocking horse\", \"id\": 2143, \"trainId\": 681},\n    {\"name\": \"buckets\", \"id\": 304, \"trainId\": 682},\n    {\"name\": \"bread rolls\", \"id\": 283, \"trainId\": 683},\n    {\"name\": \"shawl\", \"id\": 2322, \"trainId\": 684},\n    {\"name\": \"watering can\", \"id\": 3017, \"trainId\": 685},\n    {\"name\": \"spotlights\", \"id\": 2510, \"trainId\": 686},\n    {\"name\": \"post-it\", \"id\": 1960, \"trainId\": 687},\n    {\"name\": \"bowls\", \"id\": 265, \"trainId\": 688},\n    {\"name\": \"security camera\", \"id\": 2282, \"trainId\": 689},\n    
{\"name\": \"runner cloth\", \"id\": 2184, \"trainId\": 690},\n    {\"name\": \"lock\", \"id\": 1461, \"trainId\": 691},\n    {\"name\": \"alarm, warning device, alarm system\", \"id\": 3113, \"trainId\": 692},\n    {\"name\": \"side\", \"id\": 2372, \"trainId\": 693},\n    {\"name\": \"roulette\", \"id\": 2166, \"trainId\": 694},\n    {\"name\": \"bone\", \"id\": 232, \"trainId\": 695},\n    {\"name\": \"cutlery\", \"id\": 693, \"trainId\": 696},\n    {\"name\": \"pool balls\", \"id\": 1945, \"trainId\": 697},\n    {\"name\": \"wheels\", \"id\": 3039, \"trainId\": 698},\n    {\"name\": \"spice rack\", \"id\": 2494, \"trainId\": 699},\n    {\"name\": \"plant pots\", \"id\": 1908, \"trainId\": 700},\n    {\"name\": \"towel ring\", \"id\": 2827, \"trainId\": 701},\n    {\"name\": \"bread box\", \"id\": 280, \"trainId\": 702},\n    {\"name\": \"video\", \"id\": 2950, \"trainId\": 703},\n    {\"name\": \"funfair\", \"id\": 1044, \"trainId\": 704},\n    {\"name\": \"breads\", \"id\": 288, \"trainId\": 705},\n    {\"name\": \"tripod\", \"id\": 2863, \"trainId\": 706},\n    {\"name\": \"ironing board\", \"id\": 1342, \"trainId\": 707},\n    {\"name\": \"skimmer\", \"id\": 2409, \"trainId\": 708},\n    {\"name\": \"hollow\", \"id\": 1258, \"trainId\": 709},\n    {\"name\": \"scratching post\", \"id\": 2249, \"trainId\": 710},\n    {\"name\": \"tricycle\", \"id\": 2862, \"trainId\": 711},\n    {\"name\": \"file box\", \"id\": 920, \"trainId\": 712},\n    {\"name\": \"mountain pass\", \"id\": 1607, \"trainId\": 713},\n    {\"name\": \"tombstones\", \"id\": 2802, \"trainId\": 714},\n    {\"name\": \"cooker\", \"id\": 610, \"trainId\": 715},\n    {\"name\": \"card game, cards\", \"id\": 3129, \"trainId\": 716},\n    {\"name\": \"golf bag\", \"id\": 1108, \"trainId\": 717},\n    {\"name\": \"towel paper\", \"id\": 2823, \"trainId\": 718},\n    {\"name\": \"chaise lounge\", \"id\": 476, \"trainId\": 719},\n    {\"name\": \"sun\", \"id\": 2641, \"trainId\": 720},\n    {\"name\": 
\"toilet paper holder\", \"id\": 2788, \"trainId\": 721},\n    {\"name\": \"rake\", \"id\": 2070, \"trainId\": 722},\n    {\"name\": \"key\", \"id\": 1368, \"trainId\": 723},\n    {\"name\": \"umbrella stand\", \"id\": 2903, \"trainId\": 724},\n    {\"name\": \"dartboard\", \"id\": 699, \"trainId\": 725},\n    {\"name\": \"transformer\", \"id\": 2844, \"trainId\": 726},\n    {\"name\": \"fireplace utensils\", \"id\": 942, \"trainId\": 727},\n    {\"name\": \"sweatshirts\", \"id\": 2663, \"trainId\": 728},\n    {\n        \"name\": \"cellular telephone, cellular phone, cellphone, cell, mobile phone\",\n        \"id\": 457,\n        \"trainId\": 729,\n    },\n    {\"name\": \"tallboy\", \"id\": 2701, \"trainId\": 730},\n    {\"name\": \"stapler\", \"id\": 2540, \"trainId\": 731},\n    {\"name\": \"sauna\", \"id\": 2231, \"trainId\": 732},\n    {\"name\": \"test tube\", \"id\": 2746, \"trainId\": 733},\n    {\"name\": \"palette\", \"id\": 1738, \"trainId\": 734},\n    {\"name\": \"shopping carts\", \"id\": 2350, \"trainId\": 735},\n    {\"name\": \"tools\", \"id\": 2808, \"trainId\": 736},\n    {\"name\": \"push button, push, button\", \"id\": 2025, \"trainId\": 737},\n    {\"name\": \"star\", \"id\": 2541, \"trainId\": 738},\n    {\"name\": \"roof rack\", \"id\": 2156, \"trainId\": 739},\n    {\"name\": \"barbed wire\", \"id\": 126, \"trainId\": 740},\n    {\"name\": \"spray\", \"id\": 2512, \"trainId\": 741},\n    {\"name\": \"ear\", \"id\": 831, \"trainId\": 742},\n    {\"name\": \"sponge\", \"id\": 2503, \"trainId\": 743},\n    {\"name\": \"racket\", \"id\": 2039, \"trainId\": 744},\n    {\"name\": \"tins\", \"id\": 2774, \"trainId\": 745},\n    {\"name\": \"eyeglasses\", \"id\": 886, \"trainId\": 746},\n    {\"name\": \"file\", \"id\": 919, \"trainId\": 747},\n    {\"name\": \"scarfs\", \"id\": 2240, \"trainId\": 748},\n    {\"name\": \"sugar bowl\", \"id\": 2636, \"trainId\": 749},\n    {\"name\": \"flip flop\", \"id\": 963, \"trainId\": 750},\n    {\"name\": 
\"headstones\", \"id\": 1218, \"trainId\": 751},\n    {\"name\": \"laptop bag\", \"id\": 1406, \"trainId\": 752},\n    {\"name\": \"leash\", \"id\": 1420, \"trainId\": 753},\n    {\"name\": \"climbing frame\", \"id\": 526, \"trainId\": 754},\n    {\"name\": \"suit hanger\", \"id\": 2639, \"trainId\": 755},\n    {\"name\": \"floor spotlight\", \"id\": 975, \"trainId\": 756},\n    {\"name\": \"plate rack\", \"id\": 1921, \"trainId\": 757},\n    {\"name\": \"sewer\", \"id\": 2305, \"trainId\": 758},\n    {\"name\": \"hard drive\", \"id\": 1193, \"trainId\": 759},\n    {\"name\": \"sprinkler\", \"id\": 2517, \"trainId\": 760},\n    {\"name\": \"tools box\", \"id\": 2809, \"trainId\": 761},\n    {\"name\": \"necklace\", \"id\": 1647, \"trainId\": 762},\n    {\"name\": \"bulbs\", \"id\": 314, \"trainId\": 763},\n    {\"name\": \"steel industry\", \"id\": 2560, \"trainId\": 764},\n    {\"name\": \"club\", \"id\": 545, \"trainId\": 765},\n    {\"name\": \"jack\", \"id\": 1345, \"trainId\": 766},\n    {\"name\": \"door bars\", \"id\": 775, \"trainId\": 767},\n    {\n        \"name\": \"control panel, instrument panel, control board, board, panel\",\n        \"id\": 603,\n        \"trainId\": 768,\n    },\n    {\"name\": \"hairbrush\", \"id\": 1163, \"trainId\": 769},\n    {\"name\": \"napkin holder\", \"id\": 1641, \"trainId\": 770},\n    {\"name\": \"office\", \"id\": 1678, \"trainId\": 771},\n    {\"name\": \"smoke detector\", \"id\": 2450, \"trainId\": 772},\n    {\"name\": \"utensils\", \"id\": 2915, \"trainId\": 773},\n    {\"name\": \"apron\", \"id\": 42, \"trainId\": 774},\n    {\"name\": \"scissors\", \"id\": 2242, \"trainId\": 775},\n    {\"name\": \"terminal\", \"id\": 2741, \"trainId\": 776},\n    {\"name\": \"grinder\", \"id\": 1143, \"trainId\": 777},\n    {\"name\": \"entry phone\", \"id\": 862, \"trainId\": 778},\n    {\"name\": \"newspaper stand\", \"id\": 1654, \"trainId\": 779},\n    {\"name\": \"pepper shaker\", \"id\": 1826, \"trainId\": 780},\n    
{\"name\": \"onions\", \"id\": 1689, \"trainId\": 781},\n    {\n        \"name\": \"central processing unit, cpu, c p u , central processor, processor, mainframe\",\n        \"id\": 3124,\n        \"trainId\": 782,\n    },\n    {\"name\": \"tape\", \"id\": 2710, \"trainId\": 783},\n    {\"name\": \"bat\", \"id\": 152, \"trainId\": 784},\n    {\"name\": \"coaster\", \"id\": 549, \"trainId\": 785},\n    {\"name\": \"calculator\", \"id\": 360, \"trainId\": 786},\n    {\"name\": \"potatoes\", \"id\": 1982, \"trainId\": 787},\n    {\"name\": \"luggage rack\", \"id\": 1478, \"trainId\": 788},\n    {\"name\": \"salt\", \"id\": 2203, \"trainId\": 789},\n    {\"name\": \"street number\", \"id\": 2612, \"trainId\": 790},\n    {\"name\": \"viewpoint\", \"id\": 2956, \"trainId\": 791},\n    {\"name\": \"sword\", \"id\": 2681, \"trainId\": 792},\n    {\"name\": \"cd\", \"id\": 437, \"trainId\": 793},\n    {\"name\": \"rowing machine\", \"id\": 2171, \"trainId\": 794},\n    {\"name\": \"plug\", \"id\": 1933, \"trainId\": 795},\n    {\"name\": \"andiron, firedog, dog, dog-iron\", \"id\": 3110, \"trainId\": 796},\n    {\"name\": \"pepper\", \"id\": 1824, \"trainId\": 797},\n    {\"name\": \"tongs\", \"id\": 2803, \"trainId\": 798},\n    {\"name\": \"bonfire\", \"id\": 234, \"trainId\": 799},\n    {\"name\": \"dog dish\", \"id\": 764, \"trainId\": 800},\n    {\"name\": \"belt\", \"id\": 177, \"trainId\": 801},\n    {\"name\": \"dumbbells\", \"id\": 817, \"trainId\": 802},\n    {\"name\": \"videocassette recorder, vcr\", \"id\": 3145, \"trainId\": 803},\n    {\"name\": \"hook\", \"id\": 1262, \"trainId\": 804},\n    {\"name\": \"envelopes\", \"id\": 864, \"trainId\": 805},\n    {\"name\": \"shower faucet\", \"id\": 2359, \"trainId\": 806},\n    {\"name\": \"watch\", \"id\": 2992, \"trainId\": 807},\n    {\"name\": \"padlock\", \"id\": 1725, \"trainId\": 808},\n    {\"name\": \"swimming pool ladder\", \"id\": 2667, \"trainId\": 809},\n    {\"name\": \"spanners\", \"id\": 2484, 
\"trainId\": 810},\n    {\"name\": \"gravy boat\", \"id\": 1133, \"trainId\": 811},\n    {\"name\": \"notice board\", \"id\": 1667, \"trainId\": 812},\n    {\"name\": \"trash bags\", \"id\": 2847, \"trainId\": 813},\n    {\"name\": \"fire alarm\", \"id\": 932, \"trainId\": 814},\n    {\"name\": \"ladle\", \"id\": 1392, \"trainId\": 815},\n    {\"name\": \"stethoscope\", \"id\": 2573, \"trainId\": 816},\n    {\"name\": \"rocket\", \"id\": 2140, \"trainId\": 817},\n    {\"name\": \"funnel\", \"id\": 1046, \"trainId\": 818},\n    {\"name\": \"bowling pins\", \"id\": 264, \"trainId\": 819},\n    {\"name\": \"valve\", \"id\": 2927, \"trainId\": 820},\n    {\"name\": \"thermometer\", \"id\": 2752, \"trainId\": 821},\n    {\"name\": \"cups\", \"id\": 679, \"trainId\": 822},\n    {\"name\": \"spice jar\", \"id\": 2493, \"trainId\": 823},\n    {\"name\": \"night light\", \"id\": 1658, \"trainId\": 824},\n    {\"name\": \"soaps\", \"id\": 2466, \"trainId\": 825},\n    {\"name\": \"games table\", \"id\": 1057, \"trainId\": 826},\n    {\"name\": \"slotted spoon\", \"id\": 2444, \"trainId\": 827},\n    {\"name\": \"reel\", \"id\": 2093, \"trainId\": 828},\n    {\"name\": \"scourer\", \"id\": 2248, \"trainId\": 829},\n    {\"name\": \"sleeping robe\", \"id\": 2432, \"trainId\": 830},\n    {\"name\": \"desk mat\", \"id\": 726, \"trainId\": 831},\n    {\"name\": \"dumbbell\", \"id\": 816, \"trainId\": 832},\n    {\"name\": \"hammer\", \"id\": 1171, \"trainId\": 833},\n    {\"name\": \"tie\", \"id\": 2766, \"trainId\": 834},\n    {\"name\": \"typewriter\", \"id\": 2900, \"trainId\": 835},\n    {\"name\": \"shaker\", \"id\": 2313, \"trainId\": 836},\n    {\"name\": \"cheese dish\", \"id\": 488, \"trainId\": 837},\n    {\"name\": \"sea star\", \"id\": 2265, \"trainId\": 838},\n    {\"name\": \"racquet\", \"id\": 2043, \"trainId\": 839},\n    {\"name\": \"butane gas cylinder\", \"id\": 332, \"trainId\": 840},\n    {\"name\": \"paper weight\", \"id\": 1771, \"trainId\": 841},\n    
{\"name\": \"shaving brush\", \"id\": 2320, \"trainId\": 842},\n    {\"name\": \"sunglasses\", \"id\": 2646, \"trainId\": 843},\n    {\"name\": \"gear shift\", \"id\": 1089, \"trainId\": 844},\n    {\"name\": \"towel rail\", \"id\": 2826, \"trainId\": 845},\n    {\"name\": \"adding machine, totalizer, totaliser\", \"id\": 3148, \"trainId\": 846},\n]\n\n\ndef _get_ade20k_full_meta():\n    # Id 0 is reserved for ignore_label; we change ignore_label from 0\n    # to 255 in our pre-processing, so all ids are shifted by 1.\n    stuff_ids = [k[\"id\"] for k in ADE20K_SEM_SEG_FULL_CATEGORIES]\n    assert len(stuff_ids) == 847, len(stuff_ids)\n\n    # For semantic segmentation, this mapping maps from dataset ids (used for\n    # processing results) to contiguous stuff ids in [0, 846] (used in models).\n    stuff_dataset_id_to_contiguous_id = {k: i for i, k in enumerate(stuff_ids)}\n    stuff_classes = [k[\"name\"] for k in ADE20K_SEM_SEG_FULL_CATEGORIES]\n\n    ret = {\n        \"stuff_dataset_id_to_contiguous_id\": stuff_dataset_id_to_contiguous_id,\n        \"stuff_classes\": stuff_classes,\n    }\n    return ret\n\n\ndef register_all_ade20k_full(root):\n    root = os.path.join(root, \"ADE20K_2021_17_01\")\n    meta = _get_ade20k_full_meta()\n    for name, dirname in [(\"train\", \"training\"), (\"val\", \"validation\")]:\n        image_dir = os.path.join(root, \"images_detectron2\", dirname)\n        gt_dir = os.path.join(root, \"annotations_detectron2\", dirname)\n        name = f\"ade20k_full_sem_seg_{name}\"\n        DatasetCatalog.register(\n            name, lambda x=image_dir, y=gt_dir: load_sem_seg(y, x, gt_ext=\"tif\", image_ext=\"jpg\")\n        )\n        MetadataCatalog.get(name).set(\n            stuff_classes=meta[\"stuff_classes\"][:],\n            image_root=image_dir,\n            sem_seg_root=gt_dir,\n            evaluator_type=\"sem_seg\",\n            ignore_label=65535,  # NOTE: gt is saved in 16-bit TIFF images\n        )\n\n\n_root = 
os.getenv(\"DETECTRON2_DATASETS\", \"datasets\")\nregister_all_ade20k_full(_root)\n"
  },
  {
    "path": "mask2former/data/datasets/register_ade20k_instance.py",
    "content": "import json\nimport logging\nimport numpy as np\nimport os\nfrom PIL import Image\n\nfrom detectron2.data import DatasetCatalog, MetadataCatalog\nfrom detectron2.data.datasets.coco import load_coco_json, register_coco_instances\nfrom detectron2.utils.file_io import PathManager\n\nADE_CATEGORIES = [{'id': 7, 'name': 'bed'}, {'id': 8, 'name': 'windowpane'}, {'id': 10, 'name': 'cabinet'}, {'id': 12, 'name': 'person'}, {'id': 14, 'name': 'door'}, {'id': 15, 'name': 'table'}, {'id': 18, 'name': 'curtain'}, {'id': 19, 'name': 'chair'}, {'id': 20, 'name': 'car'}, {'id': 22, 'name': 'painting'}, {'id': 23, 'name': 'sofa'}, {'id': 24, 'name': 'shelf'}, {'id': 27, 'name': 'mirror'}, {'id': 30, 'name': 'armchair'}, {'id': 31, 'name': 'seat'}, {'id': 32, 'name': 'fence'}, {'id': 33, 'name': 'desk'}, {'id': 35, 'name': 'wardrobe'}, {'id': 36, 'name': 'lamp'}, {'id': 37, 'name': 'bathtub'}, {'id': 38, 'name': 'railing'}, {'id': 39, 'name': 'cushion'}, {'id': 41, 'name': 'box'}, {'id': 42, 'name': 'column'}, {'id': 43, 'name': 'signboard'}, {'id': 44, 'name': 'chest of drawers'}, {'id': 45, 'name': 'counter'}, {'id': 47, 'name': 'sink'}, {'id': 49, 'name': 'fireplace'}, {'id': 50, 'name': 'refrigerator'}, {'id': 53, 'name': 'stairs'}, {'id': 55, 'name': 'case'}, {'id': 56, 'name': 'pool table'}, {'id': 57, 'name': 'pillow'}, {'id': 58, 'name': 'screen door'}, {'id': 62, 'name': 'bookcase'}, {'id': 64, 'name': 'coffee table'}, {'id': 65, 'name': 'toilet'}, {'id': 66, 'name': 'flower'}, {'id': 67, 'name': 'book'}, {'id': 69, 'name': 'bench'}, {'id': 70, 'name': 'countertop'}, {'id': 71, 'name': 'stove'}, {'id': 72, 'name': 'palm'}, {'id': 73, 'name': 'kitchen island'}, {'id': 74, 'name': 'computer'}, {'id': 75, 'name': 'swivel chair'}, {'id': 76, 'name': 'boat'}, {'id': 78, 'name': 'arcade machine'}, {'id': 80, 'name': 'bus'}, {'id': 81, 'name': 'towel'}, {'id': 82, 'name': 'light'}, {'id': 83, 'name': 'truck'}, {'id': 85, 'name': 'chandelier'}, {'id': 86, 'name': 
'awning'}, {'id': 87, 'name': 'streetlight'}, {'id': 88, 'name': 'booth'}, {'id': 89, 'name': 'television receiver'}, {'id': 90, 'name': 'airplane'}, {'id': 92, 'name': 'apparel'}, {'id': 93, 'name': 'pole'}, {'id': 95, 'name': 'bannister'}, {'id': 97, 'name': 'ottoman'}, {'id': 98, 'name': 'bottle'}, {'id': 102, 'name': 'van'}, {'id': 103, 'name': 'ship'}, {'id': 104, 'name': 'fountain'}, {'id': 107, 'name': 'washer'}, {'id': 108, 'name': 'plaything'}, {'id': 110, 'name': 'stool'}, {'id': 111, 'name': 'barrel'}, {'id': 112, 'name': 'basket'}, {'id': 115, 'name': 'bag'}, {'id': 116, 'name': 'minibike'}, {'id': 118, 'name': 'oven'}, {'id': 119, 'name': 'ball'}, {'id': 120, 'name': 'food'}, {'id': 121, 'name': 'step'}, {'id': 123, 'name': 'trade name'}, {'id': 124, 'name': 'microwave'}, {'id': 125, 'name': 'pot'}, {'id': 126, 'name': 'animal'}, {'id': 127, 'name': 'bicycle'}, {'id': 129, 'name': 'dishwasher'}, {'id': 130, 'name': 'screen'}, {'id': 132, 'name': 'sculpture'}, {'id': 133, 'name': 'hood'}, {'id': 134, 'name': 'sconce'}, {'id': 135, 'name': 'vase'}, {'id': 136, 'name': 'traffic light'}, {'id': 137, 'name': 'tray'}, {'id': 138, 'name': 'ashcan'}, {'id': 139, 'name': 'fan'}, {'id': 142, 'name': 'plate'}, {'id': 143, 'name': 'monitor'}, {'id': 144, 'name': 'bulletin board'}, {'id': 146, 'name': 'radiator'}, {'id': 147, 'name': 'glass'}, {'id': 148, 'name': 'clock'}, {'id': 149, 'name': 'flag'}]\n\n\n_PREDEFINED_SPLITS = {\n    # point annotations without masks\n    \"ade20k_instance_train\": (\n        \"ADEChallengeData2016/images/training\",\n        \"ADEChallengeData2016/ade20k_instance_train.json\",\n    ),\n    \"ade20k_instance_val\": (\n        \"ADEChallengeData2016/images/validation\",\n        \"ADEChallengeData2016/ade20k_instance_val.json\",\n    ),\n}\n\n\ndef _get_ade_instances_meta():\n    thing_ids = [k[\"id\"] for k in ADE_CATEGORIES]\n    assert len(thing_ids) == 100, len(thing_ids)\n    # Mapping from the non-contiguous ADE category id to an id in [0, 99]\n    thing_dataset_id_to_contiguous_id = {k: i for i, k in enumerate(thing_ids)}\n    thing_classes = [k[\"name\"] for k in ADE_CATEGORIES]\n    ret = {\n        \"thing_dataset_id_to_contiguous_id\": thing_dataset_id_to_contiguous_id,\n        \"thing_classes\": thing_classes,\n    }\n    return ret\n\n\ndef register_all_ade20k_instance(root):\n    for key, (image_root, json_file) in _PREDEFINED_SPLITS.items():\n        # Assume pre-defined datasets live in `./datasets`.\n        register_coco_instances(\n            key,\n            _get_ade_instances_meta(),\n            os.path.join(root, json_file) if \"://\" not in json_file else json_file,\n            os.path.join(root, image_root),\n        )\n\n\n_root = os.getenv(\"DETECTRON2_DATASETS\", \"datasets\")\nregister_all_ade20k_instance(_root)\n"
  },
  {
    "path": "mask2former/data/datasets/register_ade20k_panoptic.py",
    "content": "import json\nimport os\n\nfrom detectron2.data import DatasetCatalog, MetadataCatalog\nfrom detectron2.utils.file_io import PathManager\n\nADE20K_150_CATEGORIES = [\n    {\"color\": [120, 120, 120], \"id\": 0, \"isthing\": 0, \"name\": \"wall\"},\n    {\"color\": [180, 120, 120], \"id\": 1, \"isthing\": 0, \"name\": \"building\"},\n    {\"color\": [6, 230, 230], \"id\": 2, \"isthing\": 0, \"name\": \"sky\"},\n    {\"color\": [80, 50, 50], \"id\": 3, \"isthing\": 0, \"name\": \"floor\"},\n    {\"color\": [4, 200, 3], \"id\": 4, \"isthing\": 0, \"name\": \"tree\"},\n    {\"color\": [120, 120, 80], \"id\": 5, \"isthing\": 0, \"name\": \"ceiling\"},\n    {\"color\": [140, 140, 140], \"id\": 6, \"isthing\": 0, \"name\": \"road, route\"},\n    {\"color\": [204, 5, 255], \"id\": 7, \"isthing\": 1, \"name\": \"bed\"},\n    {\"color\": [230, 230, 230], \"id\": 8, \"isthing\": 1, \"name\": \"window \"},\n    {\"color\": [4, 250, 7], \"id\": 9, \"isthing\": 0, \"name\": \"grass\"},\n    {\"color\": [224, 5, 255], \"id\": 10, \"isthing\": 1, \"name\": \"cabinet\"},\n    {\"color\": [235, 255, 7], \"id\": 11, \"isthing\": 0, \"name\": \"sidewalk, pavement\"},\n    {\"color\": [150, 5, 61], \"id\": 12, \"isthing\": 1, \"name\": \"person\"},\n    {\"color\": [120, 120, 70], \"id\": 13, \"isthing\": 0, \"name\": \"earth, ground\"},\n    {\"color\": [8, 255, 51], \"id\": 14, \"isthing\": 1, \"name\": \"door\"},\n    {\"color\": [255, 6, 82], \"id\": 15, \"isthing\": 1, \"name\": \"table\"},\n    {\"color\": [143, 255, 140], \"id\": 16, \"isthing\": 0, \"name\": \"mountain, mount\"},\n    {\"color\": [204, 255, 4], \"id\": 17, \"isthing\": 0, \"name\": \"plant\"},\n    {\"color\": [255, 51, 7], \"id\": 18, \"isthing\": 1, \"name\": \"curtain\"},\n    {\"color\": [204, 70, 3], \"id\": 19, \"isthing\": 1, \"name\": \"chair\"},\n    {\"color\": [0, 102, 200], \"id\": 20, \"isthing\": 1, \"name\": \"car\"},\n    {\"color\": [61, 230, 250], \"id\": 21, \"isthing\": 0, 
\"name\": \"water\"},\n    {\"color\": [255, 6, 51], \"id\": 22, \"isthing\": 1, \"name\": \"painting, picture\"},\n    {\"color\": [11, 102, 255], \"id\": 23, \"isthing\": 1, \"name\": \"sofa\"},\n    {\"color\": [255, 7, 71], \"id\": 24, \"isthing\": 1, \"name\": \"shelf\"},\n    {\"color\": [255, 9, 224], \"id\": 25, \"isthing\": 0, \"name\": \"house\"},\n    {\"color\": [9, 7, 230], \"id\": 26, \"isthing\": 0, \"name\": \"sea\"},\n    {\"color\": [220, 220, 220], \"id\": 27, \"isthing\": 1, \"name\": \"mirror\"},\n    {\"color\": [255, 9, 92], \"id\": 28, \"isthing\": 0, \"name\": \"rug\"},\n    {\"color\": [112, 9, 255], \"id\": 29, \"isthing\": 0, \"name\": \"field\"},\n    {\"color\": [8, 255, 214], \"id\": 30, \"isthing\": 1, \"name\": \"armchair\"},\n    {\"color\": [7, 255, 224], \"id\": 31, \"isthing\": 1, \"name\": \"seat\"},\n    {\"color\": [255, 184, 6], \"id\": 32, \"isthing\": 1, \"name\": \"fence\"},\n    {\"color\": [10, 255, 71], \"id\": 33, \"isthing\": 1, \"name\": \"desk\"},\n    {\"color\": [255, 41, 10], \"id\": 34, \"isthing\": 0, \"name\": \"rock, stone\"},\n    {\"color\": [7, 255, 255], \"id\": 35, \"isthing\": 1, \"name\": \"wardrobe, closet, press\"},\n    {\"color\": [224, 255, 8], \"id\": 36, \"isthing\": 1, \"name\": \"lamp\"},\n    {\"color\": [102, 8, 255], \"id\": 37, \"isthing\": 1, \"name\": \"tub\"},\n    {\"color\": [255, 61, 6], \"id\": 38, \"isthing\": 1, \"name\": \"rail\"},\n    {\"color\": [255, 194, 7], \"id\": 39, \"isthing\": 1, \"name\": \"cushion\"},\n    {\"color\": [255, 122, 8], \"id\": 40, \"isthing\": 0, \"name\": \"base, pedestal, stand\"},\n    {\"color\": [0, 255, 20], \"id\": 41, \"isthing\": 1, \"name\": \"box\"},\n    {\"color\": [255, 8, 41], \"id\": 42, \"isthing\": 1, \"name\": \"column, pillar\"},\n    {\"color\": [255, 5, 153], \"id\": 43, \"isthing\": 1, \"name\": \"signboard, sign\"},\n    {\n        \"color\": [6, 51, 255],\n        \"id\": 44,\n        \"isthing\": 1,\n        \"name\": \"chest 
of drawers, chest, bureau, dresser\",\n    },\n    {\"color\": [235, 12, 255], \"id\": 45, \"isthing\": 1, \"name\": \"counter\"},\n    {\"color\": [160, 150, 20], \"id\": 46, \"isthing\": 0, \"name\": \"sand\"},\n    {\"color\": [0, 163, 255], \"id\": 47, \"isthing\": 1, \"name\": \"sink\"},\n    {\"color\": [140, 140, 140], \"id\": 48, \"isthing\": 0, \"name\": \"skyscraper\"},\n    {\"color\": [250, 10, 15], \"id\": 49, \"isthing\": 1, \"name\": \"fireplace\"},\n    {\"color\": [20, 255, 0], \"id\": 50, \"isthing\": 1, \"name\": \"refrigerator, icebox\"},\n    {\"color\": [31, 255, 0], \"id\": 51, \"isthing\": 0, \"name\": \"grandstand, covered stand\"},\n    {\"color\": [255, 31, 0], \"id\": 52, \"isthing\": 0, \"name\": \"path\"},\n    {\"color\": [255, 224, 0], \"id\": 53, \"isthing\": 1, \"name\": \"stairs\"},\n    {\"color\": [153, 255, 0], \"id\": 54, \"isthing\": 0, \"name\": \"runway\"},\n    {\"color\": [0, 0, 255], \"id\": 55, \"isthing\": 1, \"name\": \"case, display case, showcase, vitrine\"},\n    {\n        \"color\": [255, 71, 0],\n        \"id\": 56,\n        \"isthing\": 1,\n        \"name\": \"pool table, billiard table, snooker table\",\n    },\n    {\"color\": [0, 235, 255], \"id\": 57, \"isthing\": 1, \"name\": \"pillow\"},\n    {\"color\": [0, 173, 255], \"id\": 58, \"isthing\": 1, \"name\": \"screen door, screen\"},\n    {\"color\": [31, 0, 255], \"id\": 59, \"isthing\": 0, \"name\": \"stairway, staircase\"},\n    {\"color\": [11, 200, 200], \"id\": 60, \"isthing\": 0, \"name\": \"river\"},\n    {\"color\": [255, 82, 0], \"id\": 61, \"isthing\": 0, \"name\": \"bridge, span\"},\n    {\"color\": [0, 255, 245], \"id\": 62, \"isthing\": 1, \"name\": \"bookcase\"},\n    {\"color\": [0, 61, 255], \"id\": 63, \"isthing\": 0, \"name\": \"blind, screen\"},\n    {\"color\": [0, 255, 112], \"id\": 64, \"isthing\": 1, \"name\": \"coffee table\"},\n    {\n        \"color\": [0, 255, 133],\n        \"id\": 65,\n        \"isthing\": 1,\n        \"name\": 
\"toilet, can, commode, crapper, pot, potty, stool, throne\",\n    },\n    {\"color\": [255, 0, 0], \"id\": 66, \"isthing\": 1, \"name\": \"flower\"},\n    {\"color\": [255, 163, 0], \"id\": 67, \"isthing\": 1, \"name\": \"book\"},\n    {\"color\": [255, 102, 0], \"id\": 68, \"isthing\": 0, \"name\": \"hill\"},\n    {\"color\": [194, 255, 0], \"id\": 69, \"isthing\": 1, \"name\": \"bench\"},\n    {\"color\": [0, 143, 255], \"id\": 70, \"isthing\": 1, \"name\": \"countertop\"},\n    {\"color\": [51, 255, 0], \"id\": 71, \"isthing\": 1, \"name\": \"stove\"},\n    {\"color\": [0, 82, 255], \"id\": 72, \"isthing\": 1, \"name\": \"palm, palm tree\"},\n    {\"color\": [0, 255, 41], \"id\": 73, \"isthing\": 1, \"name\": \"kitchen island\"},\n    {\"color\": [0, 255, 173], \"id\": 74, \"isthing\": 1, \"name\": \"computer\"},\n    {\"color\": [10, 0, 255], \"id\": 75, \"isthing\": 1, \"name\": \"swivel chair\"},\n    {\"color\": [173, 255, 0], \"id\": 76, \"isthing\": 1, \"name\": \"boat\"},\n    {\"color\": [0, 255, 153], \"id\": 77, \"isthing\": 0, \"name\": \"bar\"},\n    {\"color\": [255, 92, 0], \"id\": 78, \"isthing\": 1, \"name\": \"arcade machine\"},\n    {\"color\": [255, 0, 255], \"id\": 79, \"isthing\": 0, \"name\": \"hovel, hut, hutch, shack, shanty\"},\n    {\"color\": [255, 0, 245], \"id\": 80, \"isthing\": 1, \"name\": \"bus\"},\n    {\"color\": [255, 0, 102], \"id\": 81, \"isthing\": 1, \"name\": \"towel\"},\n    {\"color\": [255, 173, 0], \"id\": 82, \"isthing\": 1, \"name\": \"light\"},\n    {\"color\": [255, 0, 20], \"id\": 83, \"isthing\": 1, \"name\": \"truck\"},\n    {\"color\": [255, 184, 184], \"id\": 84, \"isthing\": 0, \"name\": \"tower\"},\n    {\"color\": [0, 31, 255], \"id\": 85, \"isthing\": 1, \"name\": \"chandelier\"},\n    {\"color\": [0, 255, 61], \"id\": 86, \"isthing\": 1, \"name\": \"awning, sunshade, sunblind\"},\n    {\"color\": [0, 71, 255], \"id\": 87, \"isthing\": 1, \"name\": \"street lamp\"},\n    {\"color\": [255, 0, 204], 
\"id\": 88, \"isthing\": 1, \"name\": \"booth\"},\n    {\"color\": [0, 255, 194], \"id\": 89, \"isthing\": 1, \"name\": \"tv\"},\n    {\"color\": [0, 255, 82], \"id\": 90, \"isthing\": 1, \"name\": \"plane\"},\n    {\"color\": [0, 10, 255], \"id\": 91, \"isthing\": 0, \"name\": \"dirt track\"},\n    {\"color\": [0, 112, 255], \"id\": 92, \"isthing\": 1, \"name\": \"clothes\"},\n    {\"color\": [51, 0, 255], \"id\": 93, \"isthing\": 1, \"name\": \"pole\"},\n    {\"color\": [0, 194, 255], \"id\": 94, \"isthing\": 0, \"name\": \"land, ground, soil\"},\n    {\n        \"color\": [0, 122, 255],\n        \"id\": 95,\n        \"isthing\": 1,\n        \"name\": \"bannister, banister, balustrade, balusters, handrail\",\n    },\n    {\n        \"color\": [0, 255, 163],\n        \"id\": 96,\n        \"isthing\": 0,\n        \"name\": \"escalator, moving staircase, moving stairway\",\n    },\n    {\n        \"color\": [255, 153, 0],\n        \"id\": 97,\n        \"isthing\": 1,\n        \"name\": \"ottoman, pouf, pouffe, puff, hassock\",\n    },\n    {\"color\": [0, 255, 10], \"id\": 98, \"isthing\": 1, \"name\": \"bottle\"},\n    {\"color\": [255, 112, 0], \"id\": 99, \"isthing\": 0, \"name\": \"buffet, counter, sideboard\"},\n    {\n        \"color\": [143, 255, 0],\n        \"id\": 100,\n        \"isthing\": 0,\n        \"name\": \"poster, posting, placard, notice, bill, card\",\n    },\n    {\"color\": [82, 0, 255], \"id\": 101, \"isthing\": 0, \"name\": \"stage\"},\n    {\"color\": [163, 255, 0], \"id\": 102, \"isthing\": 1, \"name\": \"van\"},\n    {\"color\": [255, 235, 0], \"id\": 103, \"isthing\": 1, \"name\": \"ship\"},\n    {\"color\": [8, 184, 170], \"id\": 104, \"isthing\": 1, \"name\": \"fountain\"},\n    {\n        \"color\": [133, 0, 255],\n        \"id\": 105,\n        \"isthing\": 0,\n        \"name\": \"conveyer belt, conveyor belt, conveyer, conveyor, transporter\",\n    },\n    {\"color\": [0, 255, 92], \"id\": 106, \"isthing\": 0, \"name\": \"canopy\"},\n 
   {\n        \"color\": [184, 0, 255],\n        \"id\": 107,\n        \"isthing\": 1,\n        \"name\": \"washer, automatic washer, washing machine\",\n    },\n    {\"color\": [255, 0, 31], \"id\": 108, \"isthing\": 1, \"name\": \"plaything, toy\"},\n    {\"color\": [0, 184, 255], \"id\": 109, \"isthing\": 0, \"name\": \"pool\"},\n    {\"color\": [0, 214, 255], \"id\": 110, \"isthing\": 1, \"name\": \"stool\"},\n    {\"color\": [255, 0, 112], \"id\": 111, \"isthing\": 1, \"name\": \"barrel, cask\"},\n    {\"color\": [92, 255, 0], \"id\": 112, \"isthing\": 1, \"name\": \"basket, handbasket\"},\n    {\"color\": [0, 224, 255], \"id\": 113, \"isthing\": 0, \"name\": \"falls\"},\n    {\"color\": [112, 224, 255], \"id\": 114, \"isthing\": 0, \"name\": \"tent\"},\n    {\"color\": [70, 184, 160], \"id\": 115, \"isthing\": 1, \"name\": \"bag\"},\n    {\"color\": [163, 0, 255], \"id\": 116, \"isthing\": 1, \"name\": \"minibike, motorbike\"},\n    {\"color\": [153, 0, 255], \"id\": 117, \"isthing\": 0, \"name\": \"cradle\"},\n    {\"color\": [71, 255, 0], \"id\": 118, \"isthing\": 1, \"name\": \"oven\"},\n    {\"color\": [255, 0, 163], \"id\": 119, \"isthing\": 1, \"name\": \"ball\"},\n    {\"color\": [255, 204, 0], \"id\": 120, \"isthing\": 1, \"name\": \"food, solid food\"},\n    {\"color\": [255, 0, 143], \"id\": 121, \"isthing\": 1, \"name\": \"step, stair\"},\n    {\"color\": [0, 255, 235], \"id\": 122, \"isthing\": 0, \"name\": \"tank, storage tank\"},\n    {\"color\": [133, 255, 0], \"id\": 123, \"isthing\": 1, \"name\": \"trade name\"},\n    {\"color\": [255, 0, 235], \"id\": 124, \"isthing\": 1, \"name\": \"microwave\"},\n    {\"color\": [245, 0, 255], \"id\": 125, \"isthing\": 1, \"name\": \"pot\"},\n    {\"color\": [255, 0, 122], \"id\": 126, \"isthing\": 1, \"name\": \"animal\"},\n    {\"color\": [255, 245, 0], \"id\": 127, \"isthing\": 1, \"name\": \"bicycle\"},\n    {\"color\": [10, 190, 212], \"id\": 128, \"isthing\": 0, \"name\": \"lake\"},\n    {\"color\": 
[214, 255, 0], \"id\": 129, \"isthing\": 1, \"name\": \"dishwasher\"},\n    {\"color\": [0, 204, 255], \"id\": 130, \"isthing\": 1, \"name\": \"screen\"},\n    {\"color\": [20, 0, 255], \"id\": 131, \"isthing\": 0, \"name\": \"blanket, cover\"},\n    {\"color\": [255, 255, 0], \"id\": 132, \"isthing\": 1, \"name\": \"sculpture\"},\n    {\"color\": [0, 153, 255], \"id\": 133, \"isthing\": 1, \"name\": \"hood, exhaust hood\"},\n    {\"color\": [0, 41, 255], \"id\": 134, \"isthing\": 1, \"name\": \"sconce\"},\n    {\"color\": [0, 255, 204], \"id\": 135, \"isthing\": 1, \"name\": \"vase\"},\n    {\"color\": [41, 0, 255], \"id\": 136, \"isthing\": 1, \"name\": \"traffic light\"},\n    {\"color\": [41, 255, 0], \"id\": 137, \"isthing\": 1, \"name\": \"tray\"},\n    {\"color\": [173, 0, 255], \"id\": 138, \"isthing\": 1, \"name\": \"trash can\"},\n    {\"color\": [0, 245, 255], \"id\": 139, \"isthing\": 1, \"name\": \"fan\"},\n    {\"color\": [71, 0, 255], \"id\": 140, \"isthing\": 0, \"name\": \"pier\"},\n    {\"color\": [122, 0, 255], \"id\": 141, \"isthing\": 0, \"name\": \"crt screen\"},\n    {\"color\": [0, 255, 184], \"id\": 142, \"isthing\": 1, \"name\": \"plate\"},\n    {\"color\": [0, 92, 255], \"id\": 143, \"isthing\": 1, \"name\": \"monitor\"},\n    {\"color\": [184, 255, 0], \"id\": 144, \"isthing\": 1, \"name\": \"bulletin board\"},\n    {\"color\": [0, 133, 255], \"id\": 145, \"isthing\": 0, \"name\": \"shower\"},\n    {\"color\": [255, 214, 0], \"id\": 146, \"isthing\": 1, \"name\": \"radiator\"},\n    {\"color\": [25, 194, 194], \"id\": 147, \"isthing\": 1, \"name\": \"glass, drinking glass\"},\n    {\"color\": [102, 255, 0], \"id\": 148, \"isthing\": 1, \"name\": \"clock\"},\n    {\"color\": [92, 0, 255], \"id\": 149, \"isthing\": 1, \"name\": \"flag\"},\n]\n\nADE20k_COLORS = [k[\"color\"] for k in ADE20K_150_CATEGORIES]\n\nMetadataCatalog.get(\"ade20k_sem_seg_train\").set(\n    
stuff_colors=ADE20k_COLORS[:],\n)\n\nMetadataCatalog.get(\"ade20k_sem_seg_val\").set(\n    stuff_colors=ADE20k_COLORS[:],\n)\n\n\ndef load_ade20k_panoptic_json(json_file, image_dir, gt_dir, semseg_dir, meta):\n    \"\"\"\n    Args:\n        image_dir (str): path to the raw dataset. e.g., \"~/coco/train2017\".\n        gt_dir (str): path to the raw annotations. e.g., \"~/coco/panoptic_train2017\".\n        json_file (str): path to the json file. e.g., \"~/coco/annotations/panoptic_train2017.json\".\n        semseg_dir (str): path to the semantic segmentation annotations.\n        meta (dict): dataset metadata with the category id mappings.\n    Returns:\n        list[dict]: a list of dicts in Detectron2 standard format. (See\n        `Using Custom Datasets </tutorials/datasets.html>`_ )\n    \"\"\"\n\n    def _convert_category_id(segment_info, meta):\n        if segment_info[\"category_id\"] in meta[\"thing_dataset_id_to_contiguous_id\"]:\n            segment_info[\"category_id\"] = meta[\"thing_dataset_id_to_contiguous_id\"][\n                segment_info[\"category_id\"]\n            ]\n            segment_info[\"isthing\"] = True\n        else:\n            segment_info[\"category_id\"] = meta[\"stuff_dataset_id_to_contiguous_id\"][\n                segment_info[\"category_id\"]\n            ]\n            segment_info[\"isthing\"] = False\n        return segment_info\n\n    with PathManager.open(json_file) as f:\n        json_info = json.load(f)\n\n    ret = []\n    for ann in json_info[\"annotations\"]:\n        image_id = ann[\"image_id\"]\n        # TODO: currently we assume image and label have the same filename but\n        # different extension, and images have extension \".jpg\" for COCO. 
Need\n        # to make image extension a user-provided argument if we extend this\n        # function to support other COCO-like datasets.\n        image_file = os.path.join(image_dir, os.path.splitext(ann[\"file_name\"])[0] + \".jpg\")\n        label_file = os.path.join(gt_dir, ann[\"file_name\"])\n        sem_label_file = os.path.join(semseg_dir, ann[\"file_name\"])\n        segments_info = [_convert_category_id(x, meta) for x in ann[\"segments_info\"]]\n        ret.append(\n            {\n                \"file_name\": image_file,\n                \"image_id\": image_id,\n                \"pan_seg_file_name\": label_file,\n                \"sem_seg_file_name\": sem_label_file,\n                \"segments_info\": segments_info,\n            }\n        )\n    assert len(ret), f\"No images found in {image_dir}!\"\n    assert PathManager.isfile(ret[0][\"file_name\"]), ret[0][\"file_name\"]\n    assert PathManager.isfile(ret[0][\"pan_seg_file_name\"]), ret[0][\"pan_seg_file_name\"]\n    assert PathManager.isfile(ret[0][\"sem_seg_file_name\"]), ret[0][\"sem_seg_file_name\"]\n    return ret\n\n\ndef register_ade20k_panoptic(\n    name, metadata, image_root, panoptic_root, semantic_root, panoptic_json, instances_json=None\n):\n    \"\"\"\n    Register a \"standard\" version of ADE20k panoptic segmentation dataset named `name`.\n    The dictionaries in this registered dataset follow detectron2's standard format.\n    Hence it's called \"standard\".\n    Args:\n        name (str): the name that identifies a dataset,\n            e.g. 
\"ade20k_panoptic_train\"\n        metadata (dict): extra metadata associated with this dataset.\n        image_root (str): directory which contains all the images\n        panoptic_root (str): directory which contains panoptic annotation images in COCO format\n        panoptic_json (str): path to the json panoptic annotation file in COCO format\n        semantic_root (str): directory which contains the semantic segmentation annotations\n        instances_json (str): path to the json instance annotation file\n    \"\"\"\n    panoptic_name = name\n    DatasetCatalog.register(\n        panoptic_name,\n        lambda: load_ade20k_panoptic_json(\n            panoptic_json, image_root, panoptic_root, semantic_root, metadata\n        ),\n    )\n    MetadataCatalog.get(panoptic_name).set(\n        panoptic_root=panoptic_root,\n        image_root=image_root,\n        panoptic_json=panoptic_json,\n        json_file=instances_json,\n        evaluator_type=\"ade20k_panoptic_seg\",\n        ignore_label=255,\n        label_divisor=1000,\n        **metadata,\n    )\n\n\n_PREDEFINED_SPLITS_ADE20K_PANOPTIC = {\n    \"ade20k_panoptic_train\": (\n        \"ADEChallengeData2016/images/training\",\n        \"ADEChallengeData2016/ade20k_panoptic_train\",\n        \"ADEChallengeData2016/ade20k_panoptic_train.json\",\n        \"ADEChallengeData2016/annotations_detectron2/training\",\n        \"ADEChallengeData2016/ade20k_instance_train.json\",\n    ),\n    \"ade20k_panoptic_val\": (\n        \"ADEChallengeData2016/images/validation\",\n        \"ADEChallengeData2016/ade20k_panoptic_val\",\n        \"ADEChallengeData2016/ade20k_panoptic_val.json\",\n        \"ADEChallengeData2016/annotations_detectron2/validation\",\n        \"ADEChallengeData2016/ade20k_instance_val.json\",\n    ),\n}\n\n\ndef get_metadata():\n    meta = {}\n    # The following metadata maps contiguous id from [0, #thing categories +\n    # #stuff categories) to their names and colors. 
We have two copies of the\n    # same name and color under \"thing_*\" and \"stuff_*\" because the current\n    # visualization function in D2 handles thing and stuff classes differently\n    # due to some heuristic used in Panoptic FPN. We keep the same naming to\n    # enable reusing existing visualization functions.\n    thing_classes = [k[\"name\"] for k in ADE20K_150_CATEGORIES if k[\"isthing\"] == 1]\n    thing_colors = [k[\"color\"] for k in ADE20K_150_CATEGORIES if k[\"isthing\"] == 1]\n    stuff_classes = [k[\"name\"] for k in ADE20K_150_CATEGORIES]\n    stuff_colors = [k[\"color\"] for k in ADE20K_150_CATEGORIES]\n\n    meta[\"thing_classes\"] = thing_classes\n    meta[\"thing_colors\"] = thing_colors\n    meta[\"stuff_classes\"] = stuff_classes\n    meta[\"stuff_colors\"] = stuff_colors\n\n    # Convert category id for training:\n    #   category id: like semantic segmentation, it is the class id for each\n    #   pixel. Since there are some classes not used in evaluation, the category\n    #   id is not always contiguous and thus we have two sets of category ids:\n    #       - original category id: category id in the original dataset, mainly\n    #           used for evaluation.\n    #       - contiguous category id: [0, #classes), in order to train the linear\n    #           softmax classifier.\n    thing_dataset_id_to_contiguous_id = {}\n    stuff_dataset_id_to_contiguous_id = {}\n\n    for i, cat in enumerate(ADE20K_150_CATEGORIES):\n        if cat[\"isthing\"]:\n            thing_dataset_id_to_contiguous_id[cat[\"id\"]] = i\n        # else:\n        #     stuff_dataset_id_to_contiguous_id[cat[\"id\"]] = i\n\n        # in order to use sem_seg evaluator\n        stuff_dataset_id_to_contiguous_id[cat[\"id\"]] = i\n\n    meta[\"thing_dataset_id_to_contiguous_id\"] = thing_dataset_id_to_contiguous_id\n    meta[\"stuff_dataset_id_to_contiguous_id\"] = stuff_dataset_id_to_contiguous_id\n\n    return meta\n\n\ndef register_all_ade20k_panoptic(root):\n    metadata = get_metadata()\n    for (\n        prefix,\n        (image_root, panoptic_root, panoptic_json, semantic_root, instance_json),\n    ) in _PREDEFINED_SPLITS_ADE20K_PANOPTIC.items():\n        # The \"standard\" version of ADE20k panoptic segmentation dataset,\n        # e.g. used by Panoptic-DeepLab\n        register_ade20k_panoptic(\n            prefix,\n            metadata,\n            os.path.join(root, image_root),\n            os.path.join(root, panoptic_root),\n            os.path.join(root, semantic_root),\n            os.path.join(root, panoptic_json),\n            os.path.join(root, instance_json),\n        )\n\n\n_root = os.getenv(\"DETECTRON2_DATASETS\", \"datasets\")\nregister_all_ade20k_panoptic(_root)\n"
  },
  {
    "path": "mask2former/data/datasets/register_coco_panoptic_annos_semseg.py",
"content": "import json\nimport os\n\nfrom detectron2.data import DatasetCatalog, MetadataCatalog\nfrom detectron2.data.datasets import load_sem_seg\nfrom detectron2.data.datasets.builtin_meta import COCO_CATEGORIES\nfrom detectron2.utils.file_io import PathManager\n\n\n_PREDEFINED_SPLITS_COCO_PANOPTIC = {\n    \"coco_2017_train_panoptic\": (\n        # This is the original panoptic annotation directory\n        \"coco/panoptic_train2017\",\n        \"coco/annotations/panoptic_train2017.json\",\n        # This directory contains semantic annotations that are\n        # converted from panoptic annotations.\n        # It is used by PanopticFPN.\n        # You can use the script at detectron2/datasets/prepare_panoptic_fpn.py\n        # to create these directories.\n        \"coco/panoptic_semseg_train2017\",\n    ),\n    \"coco_2017_val_panoptic\": (\n        \"coco/panoptic_val2017\",\n        \"coco/annotations/panoptic_val2017.json\",\n        \"coco/panoptic_semseg_val2017\",\n    ),\n}\n\n\ndef get_metadata():\n    meta = {}\n    # The following metadata maps contiguous id from [0, #thing categories +\n    # #stuff categories) to their names and colors. We have two copies of the\n    # same name and color under \"thing_*\" and \"stuff_*\" because the current\n    # visualization function in D2 handles thing and stuff classes differently\n    # due to some heuristic used in Panoptic FPN. 
We keep the same naming to\n    # enable reusing existing visualization functions.\n    thing_classes = [k[\"name\"] for k in COCO_CATEGORIES if k[\"isthing\"] == 1]\n    thing_colors = [k[\"color\"] for k in COCO_CATEGORIES if k[\"isthing\"] == 1]\n    stuff_classes = [k[\"name\"] for k in COCO_CATEGORIES]\n    stuff_colors = [k[\"color\"] for k in COCO_CATEGORIES]\n\n    meta[\"thing_classes\"] = thing_classes\n    meta[\"thing_colors\"] = thing_colors\n    meta[\"stuff_classes\"] = stuff_classes\n    meta[\"stuff_colors\"] = stuff_colors\n\n    # Convert category id for training:\n    #   category id: like semantic segmentation, it is the class id for each\n    #   pixel. Since there are some classes not used in evaluation, the category\n    #   id is not always contiguous and thus we have two sets of category ids:\n    #       - original category id: category id in the original dataset, mainly\n    #           used for evaluation.\n    #       - contiguous category id: [0, #classes), in order to train the linear\n    #           softmax classifier.\n    thing_dataset_id_to_contiguous_id = {}\n    stuff_dataset_id_to_contiguous_id = {}\n\n    for i, cat in enumerate(COCO_CATEGORIES):\n        if cat[\"isthing\"]:\n            thing_dataset_id_to_contiguous_id[cat[\"id\"]] = i\n        # else:\n        #     stuff_dataset_id_to_contiguous_id[cat[\"id\"]] = i\n\n        # in order to use sem_seg evaluator\n        stuff_dataset_id_to_contiguous_id[cat[\"id\"]] = i\n\n    meta[\"thing_dataset_id_to_contiguous_id\"] = thing_dataset_id_to_contiguous_id\n    meta[\"stuff_dataset_id_to_contiguous_id\"] = stuff_dataset_id_to_contiguous_id\n\n    return meta\n\n\ndef load_coco_panoptic_json(json_file, image_dir, gt_dir, semseg_dir, meta):\n    \"\"\"\n    Args:\n        image_dir (str): path to the raw dataset. e.g., \"~/coco/train2017\".\n        gt_dir (str): path to the raw annotations. 
e.g., \"~/coco/panoptic_train2017\".\n        json_file (str): path to the json file. e.g., \"~/coco/annotations/panoptic_train2017.json\".\n        semseg_dir (str): path to the semantic segmentation annotations.\n        meta (dict): dataset metadata with the category id mappings.\n    Returns:\n        list[dict]: a list of dicts in Detectron2 standard format. (See\n        `Using Custom Datasets </tutorials/datasets.html>`_ )\n    \"\"\"\n\n    def _convert_category_id(segment_info, meta):\n        if segment_info[\"category_id\"] in meta[\"thing_dataset_id_to_contiguous_id\"]:\n            segment_info[\"category_id\"] = meta[\"thing_dataset_id_to_contiguous_id\"][\n                segment_info[\"category_id\"]\n            ]\n            segment_info[\"isthing\"] = True\n        else:\n            segment_info[\"category_id\"] = meta[\"stuff_dataset_id_to_contiguous_id\"][\n                segment_info[\"category_id\"]\n            ]\n            segment_info[\"isthing\"] = False\n        return segment_info\n\n    with PathManager.open(json_file) as f:\n        json_info = json.load(f)\n\n    ret = []\n    for ann in json_info[\"annotations\"]:\n        image_id = int(ann[\"image_id\"])\n        # TODO: currently we assume image and label have the same filename but\n        # different extension, and images have extension \".jpg\" for COCO. 
Need\n        # to make image extension a user-provided argument if we extend this\n        # function to support other COCO-like datasets.\n        image_file = os.path.join(image_dir, os.path.splitext(ann[\"file_name\"])[0] + \".jpg\")\n        label_file = os.path.join(gt_dir, ann[\"file_name\"])\n        sem_label_file = os.path.join(semseg_dir, ann[\"file_name\"])\n        segments_info = [_convert_category_id(x, meta) for x in ann[\"segments_info\"]]\n        ret.append(\n            {\n                \"file_name\": image_file,\n                \"image_id\": image_id,\n                \"pan_seg_file_name\": label_file,\n                \"sem_seg_file_name\": sem_label_file,\n                \"segments_info\": segments_info,\n            }\n        )\n    assert len(ret), f\"No images found in {image_dir}!\"\n    assert PathManager.isfile(ret[0][\"file_name\"]), ret[0][\"file_name\"]\n    assert PathManager.isfile(ret[0][\"pan_seg_file_name\"]), ret[0][\"pan_seg_file_name\"]\n    assert PathManager.isfile(ret[0][\"sem_seg_file_name\"]), ret[0][\"sem_seg_file_name\"]\n    return ret\n\n\ndef register_coco_panoptic_annos_sem_seg(\n    name, metadata, image_root, panoptic_root, panoptic_json, sem_seg_root, instances_json\n):\n    panoptic_name = name\n    delattr(MetadataCatalog.get(panoptic_name), \"thing_classes\")\n    delattr(MetadataCatalog.get(panoptic_name), \"thing_colors\")\n    MetadataCatalog.get(panoptic_name).set(\n        thing_classes=metadata[\"thing_classes\"],\n        thing_colors=metadata[\"thing_colors\"],\n        # thing_dataset_id_to_contiguous_id=metadata[\"thing_dataset_id_to_contiguous_id\"],\n    )\n\n    # the name is \"coco_2017_train_panoptic_with_sem_seg\" and \"coco_2017_val_panoptic_with_sem_seg\"\n    semantic_name = name + \"_with_sem_seg\"\n    DatasetCatalog.register(\n        semantic_name,\n        lambda: load_coco_panoptic_json(panoptic_json, image_root, panoptic_root, sem_seg_root, metadata),\n    )\n    
MetadataCatalog.get(semantic_name).set(\n        sem_seg_root=sem_seg_root,\n        panoptic_root=panoptic_root,\n        image_root=image_root,\n        panoptic_json=panoptic_json,\n        json_file=instances_json,\n        evaluator_type=\"coco_panoptic_seg\",\n        ignore_label=255,\n        label_divisor=1000,\n        **metadata,\n    )\n\n\ndef register_all_coco_panoptic_annos_sem_seg(root):\n    for (\n        prefix,\n        (panoptic_root, panoptic_json, semantic_root),\n    ) in _PREDEFINED_SPLITS_COCO_PANOPTIC.items():\n        prefix_instances = prefix[: -len(\"_panoptic\")]\n        instances_meta = MetadataCatalog.get(prefix_instances)\n        image_root, instances_json = instances_meta.image_root, instances_meta.json_file\n\n        register_coco_panoptic_annos_sem_seg(\n            prefix,\n            get_metadata(),\n            image_root,\n            os.path.join(root, panoptic_root),\n            os.path.join(root, panoptic_json),\n            os.path.join(root, semantic_root),\n            instances_json,\n        )\n\n\n_root = os.getenv(\"DETECTRON2_DATASETS\", \"datasets\")\nregister_all_coco_panoptic_annos_sem_seg(_root)\n"
  },
  {
    "path": "mask2former/data/datasets/register_coco_stuff_10k.py",
    "content": "import os\n\nfrom detectron2.data import DatasetCatalog, MetadataCatalog\nfrom detectron2.data.datasets import load_sem_seg\n\nCOCO_CATEGORIES = [\n    {\"color\": [220, 20, 60], \"isthing\": 1, \"id\": 1, \"name\": \"person\"},\n    {\"color\": [119, 11, 32], \"isthing\": 1, \"id\": 2, \"name\": \"bicycle\"},\n    {\"color\": [0, 0, 142], \"isthing\": 1, \"id\": 3, \"name\": \"car\"},\n    {\"color\": [0, 0, 230], \"isthing\": 1, \"id\": 4, \"name\": \"motorcycle\"},\n    {\"color\": [106, 0, 228], \"isthing\": 1, \"id\": 5, \"name\": \"airplane\"},\n    {\"color\": [0, 60, 100], \"isthing\": 1, \"id\": 6, \"name\": \"bus\"},\n    {\"color\": [0, 80, 100], \"isthing\": 1, \"id\": 7, \"name\": \"train\"},\n    {\"color\": [0, 0, 70], \"isthing\": 1, \"id\": 8, \"name\": \"truck\"},\n    {\"color\": [0, 0, 192], \"isthing\": 1, \"id\": 9, \"name\": \"boat\"},\n    {\"color\": [250, 170, 30], \"isthing\": 1, \"id\": 10, \"name\": \"traffic light\"},\n    {\"color\": [100, 170, 30], \"isthing\": 1, \"id\": 11, \"name\": \"fire hydrant\"},\n    {\"color\": [220, 220, 0], \"isthing\": 1, \"id\": 13, \"name\": \"stop sign\"},\n    {\"color\": [175, 116, 175], \"isthing\": 1, \"id\": 14, \"name\": \"parking meter\"},\n    {\"color\": [250, 0, 30], \"isthing\": 1, \"id\": 15, \"name\": \"bench\"},\n    {\"color\": [165, 42, 42], \"isthing\": 1, \"id\": 16, \"name\": \"bird\"},\n    {\"color\": [255, 77, 255], \"isthing\": 1, \"id\": 17, \"name\": \"cat\"},\n    {\"color\": [0, 226, 252], \"isthing\": 1, \"id\": 18, \"name\": \"dog\"},\n    {\"color\": [182, 182, 255], \"isthing\": 1, \"id\": 19, \"name\": \"horse\"},\n    {\"color\": [0, 82, 0], \"isthing\": 1, \"id\": 20, \"name\": \"sheep\"},\n    {\"color\": [120, 166, 157], \"isthing\": 1, \"id\": 21, \"name\": \"cow\"},\n    {\"color\": [110, 76, 0], \"isthing\": 1, \"id\": 22, \"name\": \"elephant\"},\n    {\"color\": [174, 57, 255], \"isthing\": 1, \"id\": 23, \"name\": \"bear\"},\n    {\"color\": 
[199, 100, 0], \"isthing\": 1, \"id\": 24, \"name\": \"zebra\"},\n    {\"color\": [72, 0, 118], \"isthing\": 1, \"id\": 25, \"name\": \"giraffe\"},\n    {\"color\": [255, 179, 240], \"isthing\": 1, \"id\": 27, \"name\": \"backpack\"},\n    {\"color\": [0, 125, 92], \"isthing\": 1, \"id\": 28, \"name\": \"umbrella\"},\n    {\"color\": [209, 0, 151], \"isthing\": 1, \"id\": 31, \"name\": \"handbag\"},\n    {\"color\": [188, 208, 182], \"isthing\": 1, \"id\": 32, \"name\": \"tie\"},\n    {\"color\": [0, 220, 176], \"isthing\": 1, \"id\": 33, \"name\": \"suitcase\"},\n    {\"color\": [255, 99, 164], \"isthing\": 1, \"id\": 34, \"name\": \"frisbee\"},\n    {\"color\": [92, 0, 73], \"isthing\": 1, \"id\": 35, \"name\": \"skis\"},\n    {\"color\": [133, 129, 255], \"isthing\": 1, \"id\": 36, \"name\": \"snowboard\"},\n    {\"color\": [78, 180, 255], \"isthing\": 1, \"id\": 37, \"name\": \"sports ball\"},\n    {\"color\": [0, 228, 0], \"isthing\": 1, \"id\": 38, \"name\": \"kite\"},\n    {\"color\": [174, 255, 243], \"isthing\": 1, \"id\": 39, \"name\": \"baseball bat\"},\n    {\"color\": [45, 89, 255], \"isthing\": 1, \"id\": 40, \"name\": \"baseball glove\"},\n    {\"color\": [134, 134, 103], \"isthing\": 1, \"id\": 41, \"name\": \"skateboard\"},\n    {\"color\": [145, 148, 174], \"isthing\": 1, \"id\": 42, \"name\": \"surfboard\"},\n    {\"color\": [255, 208, 186], \"isthing\": 1, \"id\": 43, \"name\": \"tennis racket\"},\n    {\"color\": [197, 226, 255], \"isthing\": 1, \"id\": 44, \"name\": \"bottle\"},\n    {\"color\": [171, 134, 1], \"isthing\": 1, \"id\": 46, \"name\": \"wine glass\"},\n    {\"color\": [109, 63, 54], \"isthing\": 1, \"id\": 47, \"name\": \"cup\"},\n    {\"color\": [207, 138, 255], \"isthing\": 1, \"id\": 48, \"name\": \"fork\"},\n    {\"color\": [151, 0, 95], \"isthing\": 1, \"id\": 49, \"name\": \"knife\"},\n    {\"color\": [9, 80, 61], \"isthing\": 1, \"id\": 50, \"name\": \"spoon\"},\n    {\"color\": [84, 105, 51], \"isthing\": 1, \"id\": 51, 
\"name\": \"bowl\"},\n    {\"color\": [74, 65, 105], \"isthing\": 1, \"id\": 52, \"name\": \"banana\"},\n    {\"color\": [166, 196, 102], \"isthing\": 1, \"id\": 53, \"name\": \"apple\"},\n    {\"color\": [208, 195, 210], \"isthing\": 1, \"id\": 54, \"name\": \"sandwich\"},\n    {\"color\": [255, 109, 65], \"isthing\": 1, \"id\": 55, \"name\": \"orange\"},\n    {\"color\": [0, 143, 149], \"isthing\": 1, \"id\": 56, \"name\": \"broccoli\"},\n    {\"color\": [179, 0, 194], \"isthing\": 1, \"id\": 57, \"name\": \"carrot\"},\n    {\"color\": [209, 99, 106], \"isthing\": 1, \"id\": 58, \"name\": \"hot dog\"},\n    {\"color\": [5, 121, 0], \"isthing\": 1, \"id\": 59, \"name\": \"pizza\"},\n    {\"color\": [227, 255, 205], \"isthing\": 1, \"id\": 60, \"name\": \"donut\"},\n    {\"color\": [147, 186, 208], \"isthing\": 1, \"id\": 61, \"name\": \"cake\"},\n    {\"color\": [153, 69, 1], \"isthing\": 1, \"id\": 62, \"name\": \"chair\"},\n    {\"color\": [3, 95, 161], \"isthing\": 1, \"id\": 63, \"name\": \"couch\"},\n    {\"color\": [163, 255, 0], \"isthing\": 1, \"id\": 64, \"name\": \"potted plant\"},\n    {\"color\": [119, 0, 170], \"isthing\": 1, \"id\": 65, \"name\": \"bed\"},\n    {\"color\": [0, 182, 199], \"isthing\": 1, \"id\": 67, \"name\": \"dining table\"},\n    {\"color\": [0, 165, 120], \"isthing\": 1, \"id\": 70, \"name\": \"toilet\"},\n    {\"color\": [183, 130, 88], \"isthing\": 1, \"id\": 72, \"name\": \"tv\"},\n    {\"color\": [95, 32, 0], \"isthing\": 1, \"id\": 73, \"name\": \"laptop\"},\n    {\"color\": [130, 114, 135], \"isthing\": 1, \"id\": 74, \"name\": \"mouse\"},\n    {\"color\": [110, 129, 133], \"isthing\": 1, \"id\": 75, \"name\": \"remote\"},\n    {\"color\": [166, 74, 118], \"isthing\": 1, \"id\": 76, \"name\": \"keyboard\"},\n    {\"color\": [219, 142, 185], \"isthing\": 1, \"id\": 77, \"name\": \"cell phone\"},\n    {\"color\": [79, 210, 114], \"isthing\": 1, \"id\": 78, \"name\": \"microwave\"},\n    {\"color\": [178, 90, 62], \"isthing\": 
1, \"id\": 79, \"name\": \"oven\"},\n    {\"color\": [65, 70, 15], \"isthing\": 1, \"id\": 80, \"name\": \"toaster\"},\n    {\"color\": [127, 167, 115], \"isthing\": 1, \"id\": 81, \"name\": \"sink\"},\n    {\"color\": [59, 105, 106], \"isthing\": 1, \"id\": 82, \"name\": \"refrigerator\"},\n    {\"color\": [142, 108, 45], \"isthing\": 1, \"id\": 84, \"name\": \"book\"},\n    {\"color\": [196, 172, 0], \"isthing\": 1, \"id\": 85, \"name\": \"clock\"},\n    {\"color\": [95, 54, 80], \"isthing\": 1, \"id\": 86, \"name\": \"vase\"},\n    {\"color\": [128, 76, 255], \"isthing\": 1, \"id\": 87, \"name\": \"scissors\"},\n    {\"color\": [201, 57, 1], \"isthing\": 1, \"id\": 88, \"name\": \"teddy bear\"},\n    {\"color\": [246, 0, 122], \"isthing\": 1, \"id\": 89, \"name\": \"hair drier\"},\n    {\"color\": [191, 162, 208], \"isthing\": 1, \"id\": 90, \"name\": \"toothbrush\"},\n    {\"id\": 92, \"name\": \"banner\", \"supercategory\": \"textile\"},\n    {\"id\": 93, \"name\": \"blanket\", \"supercategory\": \"textile\"},\n    {\"id\": 94, \"name\": \"branch\", \"supercategory\": \"plant\"},\n    {\"id\": 95, \"name\": \"bridge\", \"supercategory\": \"building\"},\n    {\"id\": 96, \"name\": \"building-other\", \"supercategory\": \"building\"},\n    {\"id\": 97, \"name\": \"bush\", \"supercategory\": \"plant\"},\n    {\"id\": 98, \"name\": \"cabinet\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 99, \"name\": \"cage\", \"supercategory\": \"structural\"},\n    {\"id\": 100, \"name\": \"cardboard\", \"supercategory\": \"raw-material\"},\n    {\"id\": 101, \"name\": \"carpet\", \"supercategory\": \"floor\"},\n    {\"id\": 102, \"name\": \"ceiling-other\", \"supercategory\": \"ceiling\"},\n    {\"id\": 103, \"name\": \"ceiling-tile\", \"supercategory\": \"ceiling\"},\n    {\"id\": 104, \"name\": \"cloth\", \"supercategory\": \"textile\"},\n    {\"id\": 105, \"name\": \"clothes\", \"supercategory\": \"textile\"},\n    {\"id\": 106, \"name\": \"clouds\", 
\"supercategory\": \"sky\"},\n    {\"id\": 107, \"name\": \"counter\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 108, \"name\": \"cupboard\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 109, \"name\": \"curtain\", \"supercategory\": \"textile\"},\n    {\"id\": 110, \"name\": \"desk-stuff\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 111, \"name\": \"dirt\", \"supercategory\": \"ground\"},\n    {\"id\": 112, \"name\": \"door-stuff\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 113, \"name\": \"fence\", \"supercategory\": \"structural\"},\n    {\"id\": 114, \"name\": \"floor-marble\", \"supercategory\": \"floor\"},\n    {\"id\": 115, \"name\": \"floor-other\", \"supercategory\": \"floor\"},\n    {\"id\": 116, \"name\": \"floor-stone\", \"supercategory\": \"floor\"},\n    {\"id\": 117, \"name\": \"floor-tile\", \"supercategory\": \"floor\"},\n    {\"id\": 118, \"name\": \"floor-wood\", \"supercategory\": \"floor\"},\n    {\"id\": 119, \"name\": \"flower\", \"supercategory\": \"plant\"},\n    {\"id\": 120, \"name\": \"fog\", \"supercategory\": \"water\"},\n    {\"id\": 121, \"name\": \"food-other\", \"supercategory\": \"food-stuff\"},\n    {\"id\": 122, \"name\": \"fruit\", \"supercategory\": \"food-stuff\"},\n    {\"id\": 123, \"name\": \"furniture-other\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 124, \"name\": \"grass\", \"supercategory\": \"plant\"},\n    {\"id\": 125, \"name\": \"gravel\", \"supercategory\": \"ground\"},\n    {\"id\": 126, \"name\": \"ground-other\", \"supercategory\": \"ground\"},\n    {\"id\": 127, \"name\": \"hill\", \"supercategory\": \"solid\"},\n    {\"id\": 128, \"name\": \"house\", \"supercategory\": \"building\"},\n    {\"id\": 129, \"name\": \"leaves\", \"supercategory\": \"plant\"},\n    {\"id\": 130, \"name\": \"light\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 131, \"name\": \"mat\", \"supercategory\": \"textile\"},\n    {\"id\": 132, \"name\": \"metal\", 
\"supercategory\": \"raw-material\"},\n    {\"id\": 133, \"name\": \"mirror-stuff\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 134, \"name\": \"moss\", \"supercategory\": \"plant\"},\n    {\"id\": 135, \"name\": \"mountain\", \"supercategory\": \"solid\"},\n    {\"id\": 136, \"name\": \"mud\", \"supercategory\": \"ground\"},\n    {\"id\": 137, \"name\": \"napkin\", \"supercategory\": \"textile\"},\n    {\"id\": 138, \"name\": \"net\", \"supercategory\": \"structural\"},\n    {\"id\": 139, \"name\": \"paper\", \"supercategory\": \"raw-material\"},\n    {\"id\": 140, \"name\": \"pavement\", \"supercategory\": \"ground\"},\n    {\"id\": 141, \"name\": \"pillow\", \"supercategory\": \"textile\"},\n    {\"id\": 142, \"name\": \"plant-other\", \"supercategory\": \"plant\"},\n    {\"id\": 143, \"name\": \"plastic\", \"supercategory\": \"raw-material\"},\n    {\"id\": 144, \"name\": \"platform\", \"supercategory\": \"ground\"},\n    {\"id\": 145, \"name\": \"playingfield\", \"supercategory\": \"ground\"},\n    {\"id\": 146, \"name\": \"railing\", \"supercategory\": \"structural\"},\n    {\"id\": 147, \"name\": \"railroad\", \"supercategory\": \"ground\"},\n    {\"id\": 148, \"name\": \"river\", \"supercategory\": \"water\"},\n    {\"id\": 149, \"name\": \"road\", \"supercategory\": \"ground\"},\n    {\"id\": 150, \"name\": \"rock\", \"supercategory\": \"solid\"},\n    {\"id\": 151, \"name\": \"roof\", \"supercategory\": \"building\"},\n    {\"id\": 152, \"name\": \"rug\", \"supercategory\": \"textile\"},\n    {\"id\": 153, \"name\": \"salad\", \"supercategory\": \"food-stuff\"},\n    {\"id\": 154, \"name\": \"sand\", \"supercategory\": \"ground\"},\n    {\"id\": 155, \"name\": \"sea\", \"supercategory\": \"water\"},\n    {\"id\": 156, \"name\": \"shelf\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 157, \"name\": \"sky-other\", \"supercategory\": \"sky\"},\n    {\"id\": 158, \"name\": \"skyscraper\", \"supercategory\": \"building\"},\n    {\"id\": 
159, \"name\": \"snow\", \"supercategory\": \"ground\"},\n    {\"id\": 160, \"name\": \"solid-other\", \"supercategory\": \"solid\"},\n    {\"id\": 161, \"name\": \"stairs\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 162, \"name\": \"stone\", \"supercategory\": \"solid\"},\n    {\"id\": 163, \"name\": \"straw\", \"supercategory\": \"plant\"},\n    {\"id\": 164, \"name\": \"structural-other\", \"supercategory\": \"structural\"},\n    {\"id\": 165, \"name\": \"table\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 166, \"name\": \"tent\", \"supercategory\": \"building\"},\n    {\"id\": 167, \"name\": \"textile-other\", \"supercategory\": \"textile\"},\n    {\"id\": 168, \"name\": \"towel\", \"supercategory\": \"textile\"},\n    {\"id\": 169, \"name\": \"tree\", \"supercategory\": \"plant\"},\n    {\"id\": 170, \"name\": \"vegetable\", \"supercategory\": \"food-stuff\"},\n    {\"id\": 171, \"name\": \"wall-brick\", \"supercategory\": \"wall\"},\n    {\"id\": 172, \"name\": \"wall-concrete\", \"supercategory\": \"wall\"},\n    {\"id\": 173, \"name\": \"wall-other\", \"supercategory\": \"wall\"},\n    {\"id\": 174, \"name\": \"wall-panel\", \"supercategory\": \"wall\"},\n    {\"id\": 175, \"name\": \"wall-stone\", \"supercategory\": \"wall\"},\n    {\"id\": 176, \"name\": \"wall-tile\", \"supercategory\": \"wall\"},\n    {\"id\": 177, \"name\": \"wall-wood\", \"supercategory\": \"wall\"},\n    {\"id\": 178, \"name\": \"water-other\", \"supercategory\": \"water\"},\n    {\"id\": 179, \"name\": \"waterdrops\", \"supercategory\": \"water\"},\n    {\"id\": 180, \"name\": \"window-blind\", \"supercategory\": \"window\"},\n    {\"id\": 181, \"name\": \"window-other\", \"supercategory\": \"window\"},\n    {\"id\": 182, \"name\": \"wood\", \"supercategory\": \"solid\"},\n]\n\n\ndef _get_coco_stuff_meta():\n    # Id 0 is reserved for ignore_label, we change ignore_label for 0\n    # to 255 in our pre-processing.\n    stuff_ids = [k[\"id\"] for k in 
COCO_CATEGORIES]\n    assert len(stuff_ids) == 171, len(stuff_ids)\n\n    # For semantic segmentation, this mapping maps from contiguous stuff id\n    # (in [0, 170], used in models) to ids in the dataset (used for processing results)\n    stuff_dataset_id_to_contiguous_id = {k: i for i, k in enumerate(stuff_ids)}\n    stuff_classes = [k[\"name\"] for k in COCO_CATEGORIES]\n\n    ret = {\n        \"stuff_dataset_id_to_contiguous_id\": stuff_dataset_id_to_contiguous_id,\n        \"stuff_classes\": stuff_classes,\n    }\n    return ret\n\n\ndef register_all_coco_stuff_10k(root):\n    root = os.path.join(root, \"coco\", \"coco_stuff_10k\")\n    meta = _get_coco_stuff_meta()\n    for name, image_dirname, sem_seg_dirname in [\n        (\"train\", \"images_detectron2/train\", \"annotations_detectron2/train\"),\n        (\"test\", \"images_detectron2/test\", \"annotations_detectron2/test\"),\n    ]:\n        image_dir = os.path.join(root, image_dirname)\n        gt_dir = os.path.join(root, sem_seg_dirname)\n        name = f\"coco_2017_{name}_stuff_10k_sem_seg\"\n        DatasetCatalog.register(\n            name, lambda x=image_dir, y=gt_dir: load_sem_seg(y, x, gt_ext=\"png\", image_ext=\"jpg\")\n        )\n        MetadataCatalog.get(name).set(\n            image_root=image_dir,\n            sem_seg_root=gt_dir,\n            evaluator_type=\"sem_seg\",\n            ignore_label=255,\n            **meta,\n        )\n\n\n_root = os.getenv(\"DETECTRON2_DATASETS\", \"datasets\")\nregister_all_coco_stuff_10k(_root)\n"
  },
  {
    "path": "mask2former/data/datasets/register_mapillary_vistas.py",
    "content": "import os\n\nfrom detectron2.data import DatasetCatalog, MetadataCatalog\nfrom detectron2.data.datasets import load_sem_seg\n\nMAPILLARY_VISTAS_SEM_SEG_CATEGORIES = [\n    {\n        \"color\": [165, 42, 42],\n        \"instances\": True,\n        \"readable\": \"Bird\",\n        \"name\": \"animal--bird\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 192, 0],\n        \"instances\": True,\n        \"readable\": \"Ground Animal\",\n        \"name\": \"animal--ground-animal\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [196, 196, 196],\n        \"instances\": False,\n        \"readable\": \"Curb\",\n        \"name\": \"construction--barrier--curb\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [190, 153, 153],\n        \"instances\": False,\n        \"readable\": \"Fence\",\n        \"name\": \"construction--barrier--fence\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [180, 165, 180],\n        \"instances\": False,\n        \"readable\": \"Guard Rail\",\n        \"name\": \"construction--barrier--guard-rail\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [90, 120, 150],\n        \"instances\": False,\n        \"readable\": \"Barrier\",\n        \"name\": \"construction--barrier--other-barrier\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [102, 102, 156],\n        \"instances\": False,\n        \"readable\": \"Wall\",\n        \"name\": \"construction--barrier--wall\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [128, 64, 255],\n        \"instances\": False,\n        \"readable\": \"Bike Lane\",\n        \"name\": \"construction--flat--bike-lane\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [140, 140, 200],\n        \"instances\": True,\n        \"readable\": \"Crosswalk - Plain\",\n        \"name\": \"construction--flat--crosswalk-plain\",\n        \"evaluate\": True,\n    },\n    {\n     
   \"color\": [170, 170, 170],\n        \"instances\": False,\n        \"readable\": \"Curb Cut\",\n        \"name\": \"construction--flat--curb-cut\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [250, 170, 160],\n        \"instances\": False,\n        \"readable\": \"Parking\",\n        \"name\": \"construction--flat--parking\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [96, 96, 96],\n        \"instances\": False,\n        \"readable\": \"Pedestrian Area\",\n        \"name\": \"construction--flat--pedestrian-area\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [230, 150, 140],\n        \"instances\": False,\n        \"readable\": \"Rail Track\",\n        \"name\": \"construction--flat--rail-track\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [128, 64, 128],\n        \"instances\": False,\n        \"readable\": \"Road\",\n        \"name\": \"construction--flat--road\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [110, 110, 110],\n        \"instances\": False,\n        \"readable\": \"Service Lane\",\n        \"name\": \"construction--flat--service-lane\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [244, 35, 232],\n        \"instances\": False,\n        \"readable\": \"Sidewalk\",\n        \"name\": \"construction--flat--sidewalk\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [150, 100, 100],\n        \"instances\": False,\n        \"readable\": \"Bridge\",\n        \"name\": \"construction--structure--bridge\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [70, 70, 70],\n        \"instances\": False,\n        \"readable\": \"Building\",\n        \"name\": \"construction--structure--building\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [150, 120, 90],\n        \"instances\": False,\n        \"readable\": \"Tunnel\",\n        \"name\": \"construction--structure--tunnel\",\n        
\"evaluate\": True,\n    },\n    {\n        \"color\": [220, 20, 60],\n        \"instances\": True,\n        \"readable\": \"Person\",\n        \"name\": \"human--person\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [255, 0, 0],\n        \"instances\": True,\n        \"readable\": \"Bicyclist\",\n        \"name\": \"human--rider--bicyclist\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [255, 0, 100],\n        \"instances\": True,\n        \"readable\": \"Motorcyclist\",\n        \"name\": \"human--rider--motorcyclist\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [255, 0, 200],\n        \"instances\": True,\n        \"readable\": \"Other Rider\",\n        \"name\": \"human--rider--other-rider\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [200, 128, 128],\n        \"instances\": True,\n        \"readable\": \"Lane Marking - Crosswalk\",\n        \"name\": \"marking--crosswalk-zebra\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [255, 255, 255],\n        \"instances\": False,\n        \"readable\": \"Lane Marking - General\",\n        \"name\": \"marking--general\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [64, 170, 64],\n        \"instances\": False,\n        \"readable\": \"Mountain\",\n        \"name\": \"nature--mountain\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [230, 160, 50],\n        \"instances\": False,\n        \"readable\": \"Sand\",\n        \"name\": \"nature--sand\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [70, 130, 180],\n        \"instances\": False,\n        \"readable\": \"Sky\",\n        \"name\": \"nature--sky\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [190, 255, 255],\n        \"instances\": False,\n        \"readable\": \"Snow\",\n        \"name\": \"nature--snow\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [152, 251, 152],\n     
   \"instances\": False,\n        \"readable\": \"Terrain\",\n        \"name\": \"nature--terrain\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [107, 142, 35],\n        \"instances\": False,\n        \"readable\": \"Vegetation\",\n        \"name\": \"nature--vegetation\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 170, 30],\n        \"instances\": False,\n        \"readable\": \"Water\",\n        \"name\": \"nature--water\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [255, 255, 128],\n        \"instances\": True,\n        \"readable\": \"Banner\",\n        \"name\": \"object--banner\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [250, 0, 30],\n        \"instances\": True,\n        \"readable\": \"Bench\",\n        \"name\": \"object--bench\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [100, 140, 180],\n        \"instances\": True,\n        \"readable\": \"Bike Rack\",\n        \"name\": \"object--bike-rack\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [220, 220, 220],\n        \"instances\": True,\n        \"readable\": \"Billboard\",\n        \"name\": \"object--billboard\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [220, 128, 128],\n        \"instances\": True,\n        \"readable\": \"Catch Basin\",\n        \"name\": \"object--catch-basin\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [222, 40, 40],\n        \"instances\": True,\n        \"readable\": \"CCTV Camera\",\n        \"name\": \"object--cctv-camera\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [100, 170, 30],\n        \"instances\": True,\n        \"readable\": \"Fire Hydrant\",\n        \"name\": \"object--fire-hydrant\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [40, 40, 40],\n        \"instances\": True,\n        \"readable\": \"Junction Box\",\n        \"name\": 
\"object--junction-box\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [33, 33, 33],\n        \"instances\": True,\n        \"readable\": \"Mailbox\",\n        \"name\": \"object--mailbox\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [100, 128, 160],\n        \"instances\": True,\n        \"readable\": \"Manhole\",\n        \"name\": \"object--manhole\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [142, 0, 0],\n        \"instances\": True,\n        \"readable\": \"Phone Booth\",\n        \"name\": \"object--phone-booth\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [70, 100, 150],\n        \"instances\": False,\n        \"readable\": \"Pothole\",\n        \"name\": \"object--pothole\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [210, 170, 100],\n        \"instances\": True,\n        \"readable\": \"Street Light\",\n        \"name\": \"object--street-light\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [153, 153, 153],\n        \"instances\": True,\n        \"readable\": \"Pole\",\n        \"name\": \"object--support--pole\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [128, 128, 128],\n        \"instances\": True,\n        \"readable\": \"Traffic Sign Frame\",\n        \"name\": \"object--support--traffic-sign-frame\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 0, 80],\n        \"instances\": True,\n        \"readable\": \"Utility Pole\",\n        \"name\": \"object--support--utility-pole\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [250, 170, 30],\n        \"instances\": True,\n        \"readable\": \"Traffic Light\",\n        \"name\": \"object--traffic-light\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [192, 192, 192],\n        \"instances\": True,\n        \"readable\": \"Traffic Sign (Back)\",\n        \"name\": \"object--traffic-sign--back\",\n        
\"evaluate\": True,\n    },\n    {\n        \"color\": [220, 220, 0],\n        \"instances\": True,\n        \"readable\": \"Traffic Sign (Front)\",\n        \"name\": \"object--traffic-sign--front\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [140, 140, 20],\n        \"instances\": True,\n        \"readable\": \"Trash Can\",\n        \"name\": \"object--trash-can\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [119, 11, 32],\n        \"instances\": True,\n        \"readable\": \"Bicycle\",\n        \"name\": \"object--vehicle--bicycle\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [150, 0, 255],\n        \"instances\": True,\n        \"readable\": \"Boat\",\n        \"name\": \"object--vehicle--boat\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 60, 100],\n        \"instances\": True,\n        \"readable\": \"Bus\",\n        \"name\": \"object--vehicle--bus\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 0, 142],\n        \"instances\": True,\n        \"readable\": \"Car\",\n        \"name\": \"object--vehicle--car\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 0, 90],\n        \"instances\": True,\n        \"readable\": \"Caravan\",\n        \"name\": \"object--vehicle--caravan\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 0, 230],\n        \"instances\": True,\n        \"readable\": \"Motorcycle\",\n        \"name\": \"object--vehicle--motorcycle\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 80, 100],\n        \"instances\": False,\n        \"readable\": \"On Rails\",\n        \"name\": \"object--vehicle--on-rails\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [128, 64, 64],\n        \"instances\": True,\n        \"readable\": \"Other Vehicle\",\n        \"name\": \"object--vehicle--other-vehicle\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": 
[0, 0, 110],\n        \"instances\": True,\n        \"readable\": \"Trailer\",\n        \"name\": \"object--vehicle--trailer\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 0, 70],\n        \"instances\": True,\n        \"readable\": \"Truck\",\n        \"name\": \"object--vehicle--truck\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 0, 192],\n        \"instances\": True,\n        \"readable\": \"Wheeled Slow\",\n        \"name\": \"object--vehicle--wheeled-slow\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [32, 32, 32],\n        \"instances\": False,\n        \"readable\": \"Car Mount\",\n        \"name\": \"void--car-mount\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [120, 10, 10],\n        \"instances\": False,\n        \"readable\": \"Ego Vehicle\",\n        \"name\": \"void--ego-vehicle\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 0, 0],\n        \"instances\": False,\n        \"readable\": \"Unlabeled\",\n        \"name\": \"void--unlabeled\",\n        \"evaluate\": False,\n    },\n]\n\n\ndef _get_mapillary_vistas_meta():\n    stuff_classes = [k[\"readable\"] for k in MAPILLARY_VISTAS_SEM_SEG_CATEGORIES if k[\"evaluate\"]]\n    assert len(stuff_classes) == 65\n\n    stuff_colors = [k[\"color\"] for k in MAPILLARY_VISTAS_SEM_SEG_CATEGORIES if k[\"evaluate\"]]\n    assert len(stuff_colors) == 65\n\n    ret = {\n        \"stuff_classes\": stuff_classes,\n        \"stuff_colors\": stuff_colors,\n    }\n    return ret\n\n\ndef register_all_mapillary_vistas(root):\n    root = os.path.join(root, \"mapillary_vistas\")\n    meta = _get_mapillary_vistas_meta()\n    for name, dirname in [(\"train\", \"training\"), (\"val\", \"validation\")]:\n        image_dir = os.path.join(root, dirname, \"images\")\n        gt_dir = os.path.join(root, dirname, \"labels\")\n        name = f\"mapillary_vistas_sem_seg_{name}\"\n        DatasetCatalog.register(\n            
name, lambda x=image_dir, y=gt_dir: load_sem_seg(y, x, gt_ext=\"png\", image_ext=\"jpg\")\n        )\n        MetadataCatalog.get(name).set(\n            image_root=image_dir,\n            sem_seg_root=gt_dir,\n            evaluator_type=\"sem_seg\",\n            ignore_label=65,  # different from other datasets, Mapillary Vistas sets ignore_label to 65\n            **meta,\n        )\n\n\n_root = os.getenv(\"DETECTRON2_DATASETS\", \"datasets\")\nregister_all_mapillary_vistas(_root)\n"
  },
  {
    "path": "mask2former/data/datasets/register_mapillary_vistas_panoptic.py",
    "content": "import json\nimport os\n\nfrom detectron2.data import DatasetCatalog, MetadataCatalog\nfrom detectron2.utils.file_io import PathManager\n\n\nMAPILLARY_VISTAS_SEM_SEG_CATEGORIES = [\n    {'color': [165, 42, 42],\n    'id': 1,\n    'isthing': 1,\n    'name': 'Bird',\n    'supercategory': 'animal--bird'},\n    {'color': [0, 192, 0],\n    'id': 2,\n    'isthing': 1,\n    'name': 'Ground Animal',\n    'supercategory': 'animal--ground-animal'},\n    {'color': [196, 196, 196],\n    'id': 3,\n    'isthing': 0,\n    'name': 'Curb',\n    'supercategory': 'construction--barrier--curb'},\n    {'color': [190, 153, 153],\n    'id': 4,\n    'isthing': 0,\n    'name': 'Fence',\n    'supercategory': 'construction--barrier--fence'},\n    {'color': [180, 165, 180],\n    'id': 5,\n    'isthing': 0,\n    'name': 'Guard Rail',\n    'supercategory': 'construction--barrier--guard-rail'},\n    {'color': [90, 120, 150],\n    'id': 6,\n    'isthing': 0,\n    'name': 'Barrier',\n    'supercategory': 'construction--barrier--other-barrier'},\n    {'color': [102, 102, 156],\n    'id': 7,\n    'isthing': 0,\n    'name': 'Wall',\n    'supercategory': 'construction--barrier--wall'},\n    {'color': [128, 64, 255],\n    'id': 8,\n    'isthing': 0,\n    'name': 'Bike Lane',\n    'supercategory': 'construction--flat--bike-lane'},\n    {'color': [140, 140, 200],\n    'id': 9,\n    'isthing': 1,\n    'name': 'Crosswalk - Plain',\n    'supercategory': 'construction--flat--crosswalk-plain'},\n    {'color': [170, 170, 170],\n    'id': 10,\n    'isthing': 0,\n    'name': 'Curb Cut',\n    'supercategory': 'construction--flat--curb-cut'},\n    {'color': [250, 170, 160],\n    'id': 11,\n    'isthing': 0,\n    'name': 'Parking',\n    'supercategory': 'construction--flat--parking'},\n    {'color': [96, 96, 96],\n    'id': 12,\n    'isthing': 0,\n    'name': 'Pedestrian Area',\n    'supercategory': 'construction--flat--pedestrian-area'},\n    {'color': [230, 150, 140],\n    'id': 13,\n    
'isthing': 0,\n    'name': 'Rail Track',\n    'supercategory': 'construction--flat--rail-track'},\n    {'color': [128, 64, 128],\n    'id': 14,\n    'isthing': 0,\n    'name': 'Road',\n    'supercategory': 'construction--flat--road'},\n    {'color': [110, 110, 110],\n    'id': 15,\n    'isthing': 0,\n    'name': 'Service Lane',\n    'supercategory': 'construction--flat--service-lane'},\n    {'color': [244, 35, 232],\n    'id': 16,\n    'isthing': 0,\n    'name': 'Sidewalk',\n    'supercategory': 'construction--flat--sidewalk'},\n    {'color': [150, 100, 100],\n    'id': 17,\n    'isthing': 0,\n    'name': 'Bridge',\n    'supercategory': 'construction--structure--bridge'},\n    {'color': [70, 70, 70],\n    'id': 18,\n    'isthing': 0,\n    'name': 'Building',\n    'supercategory': 'construction--structure--building'},\n    {'color': [150, 120, 90],\n    'id': 19,\n    'isthing': 0,\n    'name': 'Tunnel',\n    'supercategory': 'construction--structure--tunnel'},\n    {'color': [220, 20, 60],\n    'id': 20,\n    'isthing': 1,\n    'name': 'Person',\n    'supercategory': 'human--person'},\n    {'color': [255, 0, 0],\n    'id': 21,\n    'isthing': 1,\n    'name': 'Bicyclist',\n    'supercategory': 'human--rider--bicyclist'},\n    {'color': [255, 0, 100],\n    'id': 22,\n    'isthing': 1,\n    'name': 'Motorcyclist',\n    'supercategory': 'human--rider--motorcyclist'},\n    {'color': [255, 0, 200],\n    'id': 23,\n    'isthing': 1,\n    'name': 'Other Rider',\n    'supercategory': 'human--rider--other-rider'},\n    {'color': [200, 128, 128],\n    'id': 24,\n    'isthing': 1,\n    'name': 'Lane Marking - Crosswalk',\n    'supercategory': 'marking--crosswalk-zebra'},\n    {'color': [255, 255, 255],\n    'id': 25,\n    'isthing': 0,\n    'name': 'Lane Marking - General',\n    'supercategory': 'marking--general'},\n    {'color': [64, 170, 64],\n    'id': 26,\n    'isthing': 0,\n    'name': 'Mountain',\n    'supercategory': 'nature--mountain'},\n    {'color': [230, 160, 
50],\n    'id': 27,\n    'isthing': 0,\n    'name': 'Sand',\n    'supercategory': 'nature--sand'},\n    {'color': [70, 130, 180],\n    'id': 28,\n    'isthing': 0,\n    'name': 'Sky',\n    'supercategory': 'nature--sky'},\n    {'color': [190, 255, 255],\n    'id': 29,\n    'isthing': 0,\n    'name': 'Snow',\n    'supercategory': 'nature--snow'},\n    {'color': [152, 251, 152],\n    'id': 30,\n    'isthing': 0,\n    'name': 'Terrain',\n    'supercategory': 'nature--terrain'},\n    {'color': [107, 142, 35],\n    'id': 31,\n    'isthing': 0,\n    'name': 'Vegetation',\n    'supercategory': 'nature--vegetation'},\n    {'color': [0, 170, 30],\n    'id': 32,\n    'isthing': 0,\n    'name': 'Water',\n    'supercategory': 'nature--water'},\n    {'color': [255, 255, 128],\n    'id': 33,\n    'isthing': 1,\n    'name': 'Banner',\n    'supercategory': 'object--banner'},\n    {'color': [250, 0, 30],\n    'id': 34,\n    'isthing': 1,\n    'name': 'Bench',\n    'supercategory': 'object--bench'},\n    {'color': [100, 140, 180],\n    'id': 35,\n    'isthing': 1,\n    'name': 'Bike Rack',\n    'supercategory': 'object--bike-rack'},\n    {'color': [220, 220, 220],\n    'id': 36,\n    'isthing': 1,\n    'name': 'Billboard',\n    'supercategory': 'object--billboard'},\n    {'color': [220, 128, 128],\n    'id': 37,\n    'isthing': 1,\n    'name': 'Catch Basin',\n    'supercategory': 'object--catch-basin'},\n    {'color': [222, 40, 40],\n    'id': 38,\n    'isthing': 1,\n    'name': 'CCTV Camera',\n    'supercategory': 'object--cctv-camera'},\n    {'color': [100, 170, 30],\n    'id': 39,\n    'isthing': 1,\n    'name': 'Fire Hydrant',\n    'supercategory': 'object--fire-hydrant'},\n    {'color': [40, 40, 40],\n    'id': 40,\n    'isthing': 1,\n    'name': 'Junction Box',\n    'supercategory': 'object--junction-box'},\n    {'color': [33, 33, 33],\n    'id': 41,\n    'isthing': 1,\n    'name': 'Mailbox',\n    'supercategory': 'object--mailbox'},\n    {'color': [100, 128, 160],\n    'id': 
42,\n    'isthing': 1,\n    'name': 'Manhole',\n    'supercategory': 'object--manhole'},\n    {'color': [142, 0, 0],\n    'id': 43,\n    'isthing': 1,\n    'name': 'Phone Booth',\n    'supercategory': 'object--phone-booth'},\n    {'color': [70, 100, 150],\n    'id': 44,\n    'isthing': 0,\n    'name': 'Pothole',\n    'supercategory': 'object--pothole'},\n    {'color': [210, 170, 100],\n    'id': 45,\n    'isthing': 1,\n    'name': 'Street Light',\n    'supercategory': 'object--street-light'},\n    {'color': [153, 153, 153],\n    'id': 46,\n    'isthing': 1,\n    'name': 'Pole',\n    'supercategory': 'object--support--pole'},\n    {'color': [128, 128, 128],\n    'id': 47,\n    'isthing': 1,\n    'name': 'Traffic Sign Frame',\n    'supercategory': 'object--support--traffic-sign-frame'},\n    {'color': [0, 0, 80],\n    'id': 48,\n    'isthing': 1,\n    'name': 'Utility Pole',\n    'supercategory': 'object--support--utility-pole'},\n    {'color': [250, 170, 30],\n    'id': 49,\n    'isthing': 1,\n    'name': 'Traffic Light',\n    'supercategory': 'object--traffic-light'},\n    {'color': [192, 192, 192],\n    'id': 50,\n    'isthing': 1,\n    'name': 'Traffic Sign (Back)',\n    'supercategory': 'object--traffic-sign--back'},\n    {'color': [220, 220, 0],\n    'id': 51,\n    'isthing': 1,\n    'name': 'Traffic Sign (Front)',\n    'supercategory': 'object--traffic-sign--front'},\n    {'color': [140, 140, 20],\n    'id': 52,\n    'isthing': 1,\n    'name': 'Trash Can',\n    'supercategory': 'object--trash-can'},\n    {'color': [119, 11, 32],\n    'id': 53,\n    'isthing': 1,\n    'name': 'Bicycle',\n    'supercategory': 'object--vehicle--bicycle'},\n    {'color': [150, 0, 255],\n    'id': 54,\n    'isthing': 1,\n    'name': 'Boat',\n    'supercategory': 'object--vehicle--boat'},\n    {'color': [0, 60, 100],\n    'id': 55,\n    'isthing': 1,\n    'name': 'Bus',\n    'supercategory': 'object--vehicle--bus'},\n    {'color': [0, 0, 142],\n    'id': 56,\n    'isthing': 1,\n    
'name': 'Car',\n    'supercategory': 'object--vehicle--car'},\n    {'color': [0, 0, 90],\n    'id': 57,\n    'isthing': 1,\n    'name': 'Caravan',\n    'supercategory': 'object--vehicle--caravan'},\n    {'color': [0, 0, 230],\n    'id': 58,\n    'isthing': 1,\n    'name': 'Motorcycle',\n    'supercategory': 'object--vehicle--motorcycle'},\n    {'color': [0, 80, 100],\n    'id': 59,\n    'isthing': 0,\n    'name': 'On Rails',\n    'supercategory': 'object--vehicle--on-rails'},\n    {'color': [128, 64, 64],\n    'id': 60,\n    'isthing': 1,\n    'name': 'Other Vehicle',\n    'supercategory': 'object--vehicle--other-vehicle'},\n    {'color': [0, 0, 110],\n    'id': 61,\n    'isthing': 1,\n    'name': 'Trailer',\n    'supercategory': 'object--vehicle--trailer'},\n    {'color': [0, 0, 70],\n    'id': 62,\n    'isthing': 1,\n    'name': 'Truck',\n    'supercategory': 'object--vehicle--truck'},\n    {'color': [0, 0, 192],\n    'id': 63,\n    'isthing': 1,\n    'name': 'Wheeled Slow',\n    'supercategory': 'object--vehicle--wheeled-slow'},\n    {'color': [32, 32, 32],\n    'id': 64,\n    'isthing': 0,\n    'name': 'Car Mount',\n    'supercategory': 'void--car-mount'},\n    {'color': [120, 10, 10],\n    'id': 65,\n    'isthing': 0,\n    'name': 'Ego Vehicle',\n    'supercategory': 'void--ego-vehicle'}\n]\n\n\ndef load_mapillary_vistas_panoptic_json(json_file, image_dir, gt_dir, semseg_dir, meta):\n    \"\"\"\n    Args:\n        image_dir (str): path to the raw dataset. e.g., \"~/coco/train2017\".\n        gt_dir (str): path to the raw annotations. e.g., \"~/coco/panoptic_train2017\".\n        json_file (str): path to the json file. e.g., \"~/coco/annotations/panoptic_train2017.json\".\n    Returns:\n        list[dict]: a list of dicts in Detectron2 standard format. 
(See\n        `Using Custom Datasets </tutorials/datasets.html>`_ )\n    \"\"\"\n\n    def _convert_category_id(segment_info, meta):\n        if segment_info[\"category_id\"] in meta[\"thing_dataset_id_to_contiguous_id\"]:\n            segment_info[\"category_id\"] = meta[\"thing_dataset_id_to_contiguous_id\"][\n                segment_info[\"category_id\"]\n            ]\n            segment_info[\"isthing\"] = True\n        else:\n            segment_info[\"category_id\"] = meta[\"stuff_dataset_id_to_contiguous_id\"][\n                segment_info[\"category_id\"]\n            ]\n            segment_info[\"isthing\"] = False\n        return segment_info\n\n    with PathManager.open(json_file) as f:\n        json_info = json.load(f)\n\n    ret = []\n    for ann in json_info[\"annotations\"]:\n        image_id = ann[\"image_id\"]\n        # TODO: currently we assume image and label has the same filename but\n        # different extension, and images have extension \".jpg\" for COCO. Need\n        # to make image extension a user-provided argument if we extend this\n        # function to support other COCO-like datasets.\n        image_file = os.path.join(image_dir, os.path.splitext(ann[\"file_name\"])[0] + \".jpg\")\n        label_file = os.path.join(gt_dir, ann[\"file_name\"])\n        sem_label_file = os.path.join(semseg_dir, ann[\"file_name\"])\n        segments_info = [_convert_category_id(x, meta) for x in ann[\"segments_info\"]]\n        ret.append(\n            {\n                \"file_name\": image_file,\n                \"image_id\": image_id,\n                \"pan_seg_file_name\": label_file,\n                \"sem_seg_file_name\": sem_label_file,\n                \"segments_info\": segments_info,\n            }\n        )\n    assert len(ret), f\"No images found in {image_dir}!\"\n    assert PathManager.isfile(ret[0][\"file_name\"]), ret[0][\"file_name\"]\n    assert PathManager.isfile(ret[0][\"pan_seg_file_name\"]), ret[0][\"pan_seg_file_name\"]\n    
assert PathManager.isfile(ret[0][\"sem_seg_file_name\"]), ret[0][\"sem_seg_file_name\"]\n    return ret\n\n\ndef register_mapillary_vistas_panoptic(\n    name, metadata, image_root, panoptic_root, semantic_root, panoptic_json, instances_json=None\n):\n    \"\"\"\n    Register a \"standard\" version of the Mapillary Vistas panoptic segmentation dataset named `name`.\n    The dictionaries in this registered dataset follow detectron2's standard format.\n    Hence it's called \"standard\".\n    Args:\n        name (str): the name that identifies a dataset,\n            e.g. \"mapillary_vistas_panoptic_train\"\n        metadata (dict): extra metadata associated with this dataset.\n        image_root (str): directory which contains all the images\n        panoptic_root (str): directory which contains panoptic annotation images in COCO format\n        panoptic_json (str): path to the json panoptic annotation file in COCO format\n        semantic_root (str): directory which contains the semantic segmentation\n            annotation images\n        instances_json (str): path to the json instance annotation file\n    \"\"\"\n    panoptic_name = name\n    DatasetCatalog.register(\n        panoptic_name,\n        lambda: load_mapillary_vistas_panoptic_json(\n            panoptic_json, image_root, panoptic_root, semantic_root, metadata\n        ),\n    )\n    MetadataCatalog.get(panoptic_name).set(\n        panoptic_root=panoptic_root,\n        image_root=image_root,\n        panoptic_json=panoptic_json,\n        json_file=instances_json,\n        evaluator_type=\"mapillary_vistas_panoptic_seg\",\n        ignore_label=65,  # different from other datasets, Mapillary Vistas sets ignore_label to 65\n        label_divisor=1000,\n        **metadata,\n    )\n\n\n_PREDEFINED_SPLITS_ADE20K_PANOPTIC = {\n    \"mapillary_vistas_panoptic_train\": (\n        \"mapillary_vistas/training/images\",\n        \"mapillary_vistas/training/panoptic\",\n        \"mapillary_vistas/training/panoptic/panoptic_2018.json\",\n  
      \"mapillary_vistas/training/labels\",\n    ),\n    \"mapillary_vistas_panoptic_val\": (\n        \"mapillary_vistas/validation/images\",\n        \"mapillary_vistas/validation/panoptic\",\n        \"mapillary_vistas/validation/panoptic/panoptic_2018.json\",\n        \"mapillary_vistas/validation/labels\",\n    ),\n}\n\n\ndef get_metadata():\n    meta = {}\n    # The following metadata maps contiguous id from [0, #thing categories +\n    # #stuff categories) to their names and colors. We have to replica of the\n    # same name and color under \"thing_*\" and \"stuff_*\" because the current\n    # visualization function in D2 handles thing and class classes differently\n    # due to some heuristic used in Panoptic FPN. We keep the same naming to\n    # enable reusing existing visualization functions.\n    thing_classes = [k[\"name\"] for k in MAPILLARY_VISTAS_SEM_SEG_CATEGORIES]\n    thing_colors = [k[\"color\"] for k in MAPILLARY_VISTAS_SEM_SEG_CATEGORIES]\n    stuff_classes = [k[\"name\"] for k in MAPILLARY_VISTAS_SEM_SEG_CATEGORIES]\n    stuff_colors = [k[\"color\"] for k in MAPILLARY_VISTAS_SEM_SEG_CATEGORIES]\n\n    meta[\"thing_classes\"] = thing_classes\n    meta[\"thing_colors\"] = thing_colors\n    meta[\"stuff_classes\"] = stuff_classes\n    meta[\"stuff_colors\"] = stuff_colors\n\n    # Convert category id for training:\n    #   category id: like semantic segmentation, it is the class id for each\n    #   pixel. 
Since there are some classes not used in evaluation, the category\n    #   id is not always contiguous and thus we have two sets of category ids:\n    #       - original category id: category id in the original dataset, mainly\n    #           used for evaluation.\n    #       - contiguous category id: [0, #classes), in order to train the linear\n    #           softmax classifier.\n    thing_dataset_id_to_contiguous_id = {}\n    stuff_dataset_id_to_contiguous_id = {}\n\n    for i, cat in enumerate(MAPILLARY_VISTAS_SEM_SEG_CATEGORIES):\n        if cat[\"isthing\"]:\n            thing_dataset_id_to_contiguous_id[cat[\"id\"]] = i\n        # else:\n        #     stuff_dataset_id_to_contiguous_id[cat[\"id\"]] = i\n\n        # in order to use sem_seg evaluator\n        stuff_dataset_id_to_contiguous_id[cat[\"id\"]] = i\n\n    meta[\"thing_dataset_id_to_contiguous_id\"] = thing_dataset_id_to_contiguous_id\n    meta[\"stuff_dataset_id_to_contiguous_id\"] = stuff_dataset_id_to_contiguous_id\n\n    return meta\n\n\ndef register_all_mapillary_vistas_panoptic(root):\n    metadata = get_metadata()\n    for (\n        prefix,\n        (image_root, panoptic_root, panoptic_json, semantic_root),\n    ) in _PREDEFINED_SPLITS_ADE20K_PANOPTIC.items():\n        # The \"standard\" version of the Mapillary Vistas panoptic segmentation dataset,\n        # e.g. used by Panoptic-DeepLab\n        register_mapillary_vistas_panoptic(\n            prefix,\n            metadata,\n            os.path.join(root, image_root),\n            os.path.join(root, panoptic_root),\n            os.path.join(root, semantic_root),\n            os.path.join(root, panoptic_json),\n        )\n\n\n_root = os.getenv(\"DETECTRON2_DATASETS\", \"datasets\")\nregister_all_mapillary_vistas_panoptic(_root)\n"
  },
  {
    "path": "mask2former/evaluation/__init__.py",
    "content": ""
  },
  {
    "path": "mask2former/evaluation/instance_evaluation.py",
    "content": "import contextlib\nimport copy\nimport io\nimport itertools\nimport json\nimport logging\nimport numpy as np\nimport os\nimport pickle\nfrom collections import OrderedDict\nimport pycocotools.mask as mask_util\nimport torch\nfrom pycocotools.coco import COCO\nfrom pycocotools.cocoeval import COCOeval\nfrom tabulate import tabulate\n\nimport detectron2.utils.comm as comm\nfrom detectron2.config import CfgNode\nfrom detectron2.data import MetadataCatalog\nfrom detectron2.data.datasets.coco import convert_to_coco_json\nfrom detectron2.evaluation.coco_evaluation import COCOEvaluator, _evaluate_predictions_on_coco\nfrom detectron2.evaluation.fast_eval_api import COCOeval_opt\nfrom detectron2.structures import Boxes, BoxMode, pairwise_iou\nfrom detectron2.utils.file_io import PathManager\nfrom detectron2.utils.logger import create_small_table\n\n\n# Modified from COCOEvaluator for instance segmentation\nclass InstanceSegEvaluator(COCOEvaluator):\n    \"\"\"\n    Evaluate AR for object proposals, AP for instance detection/segmentation, AP\n    for keypoint detection outputs using COCO's metrics.\n    See http://cocodataset.org/#detection-eval and\n    http://cocodataset.org/#keypoints-eval to understand its metrics.\n    The metrics range from 0 to 100 (instead of 0 to 1), where a -1 or NaN means\n    the metric cannot be computed (e.g. due to no predictions made).\n\n    In addition to COCO, this evaluator is able to support any bounding box detection,\n    instance segmentation, or keypoint detection dataset.\n    \"\"\"\n\n    def _eval_predictions(self, predictions, img_ids=None):\n        \"\"\"\n        Evaluate predictions. 
Fill self._results with the metrics of the tasks.\n        \"\"\"\n        self._logger.info(\"Preparing results for COCO format ...\")\n        coco_results = list(itertools.chain(*[x[\"instances\"] for x in predictions]))\n        tasks = self._tasks or self._tasks_from_predictions(coco_results)\n\n        # unmap the category ids for COCO\n        if hasattr(self._metadata, \"thing_dataset_id_to_contiguous_id\"):\n            dataset_id_to_contiguous_id = self._metadata.thing_dataset_id_to_contiguous_id\n            # all_contiguous_ids = list(dataset_id_to_contiguous_id.values())\n            # num_classes = len(all_contiguous_ids)\n            # assert min(all_contiguous_ids) == 0 and max(all_contiguous_ids) == num_classes - 1\n\n            reverse_id_mapping = {v: k for k, v in dataset_id_to_contiguous_id.items()}\n            for result in coco_results:\n                category_id = result[\"category_id\"]\n                # assert category_id < num_classes, (\n                #     f\"A prediction has class={category_id}, \"\n                #     f\"but the dataset only has {num_classes} classes and \"\n                #     f\"predicted class id should be in [0, {num_classes - 1}].\"\n                # )\n                assert category_id in reverse_id_mapping, (\n                    f\"A prediction has class={category_id}, \"\n                    f\"but the dataset only has class ids in {dataset_id_to_contiguous_id}.\"\n                )\n                result[\"category_id\"] = reverse_id_mapping[category_id]\n\n        if self._output_dir:\n            file_path = os.path.join(self._output_dir, \"coco_instances_results.json\")\n            self._logger.info(\"Saving results to {}\".format(file_path))\n            with PathManager.open(file_path, \"w\") as f:\n                f.write(json.dumps(coco_results))\n                f.flush()\n\n        if not self._do_evaluation:\n            self._logger.info(\"Annotations are not available for 
evaluation.\")\n            return\n\n        self._logger.info(\n            \"Evaluating predictions with {} COCO API...\".format(\n                \"unofficial\" if self._use_fast_impl else \"official\"\n            )\n        )\n        for task in sorted(tasks):\n            assert task in {\"bbox\", \"segm\", \"keypoints\"}, f\"Got unknown task: {task}!\"\n            coco_eval = (\n                _evaluate_predictions_on_coco(\n                    self._coco_api,\n                    coco_results,\n                    task,\n                    kpt_oks_sigmas=self._kpt_oks_sigmas,\n                    use_fast_impl=self._use_fast_impl,\n                    img_ids=img_ids,\n                    max_dets_per_image=self._max_dets_per_image,\n                )\n                if len(coco_results) > 0\n                else None  # cocoapi does not handle empty results very well\n            )\n\n            res = self._derive_coco_results(\n                coco_eval, task, class_names=self._metadata.get(\"thing_classes\")\n            )\n            self._results[task] = res\n"
  },
  {
    "path": "mask2former/maskformer_model.py",
    "content": "from typing import Tuple\n\nimport torch\nfrom torch import nn\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.data import MetadataCatalog\nfrom detectron2.modeling import META_ARCH_REGISTRY, build_backbone, build_sem_seg_head\nfrom detectron2.modeling.backbone import Backbone\nfrom detectron2.modeling.postprocessing import sem_seg_postprocess\nfrom detectron2.structures import Boxes, ImageList, Instances, BitMasks\nfrom detectron2.utils.memory import retry_if_cuda_oom\n\nfrom .modeling.criterion import SetCriterion\nfrom .modeling.matcher import HungarianMatcher\n\nfrom skimage import color\nimport cv2\nimport numpy as np\n\ndef unfold_wo_center(x, kernel_size, dilation):\n    assert x.dim() == 4\n    assert kernel_size % 2 == 1\n\n    # using SAME padding\n    padding = (kernel_size + (dilation - 1) * (kernel_size - 1)) // 2\n    unfolded_x = F.unfold(\n        x, kernel_size=kernel_size,\n        padding=padding,\n        dilation=dilation\n    )\n\n    unfolded_x = unfolded_x.reshape(\n        x.size(0), x.size(1), -1, x.size(2), x.size(3)\n    )\n\n    # remove the center pixels\n    size = kernel_size ** 2\n    unfolded_x = torch.cat((\n        unfolded_x[:, :, :size // 2],\n        unfolded_x[:, :, size // 2 + 1:]\n    ), dim=2)\n\n    return unfolded_x\n\ndef get_images_color_similarity(images, kernel_size, dilation):\n    assert images.dim() == 4\n    assert images.size(0) == 1\n\n    unfolded_images = unfold_wo_center(\n        images, kernel_size=kernel_size, dilation=dilation\n    )\n\n    diff = images[:, :, None] - unfolded_images\n    similarity = torch.exp(-torch.norm(diff, dim=1) * 0.5)\n\n    return similarity\n\n\n@META_ARCH_REGISTRY.register()\nclass MaskFormer(nn.Module):\n    \"\"\"\n    Main class for mask classification semantic segmentation architectures.\n    \"\"\"\n\n    @configurable\n    def __init__(\n        self,\n        *,\n        backbone: Backbone,\n        
sem_seg_head: nn.Module,\n        criterion: nn.Module,\n        num_queries: int,\n        object_mask_threshold: float,\n        overlap_threshold: float,\n        metadata,\n        size_divisibility: int,\n        sem_seg_postprocess_before_inference: bool,\n        pixel_mean: Tuple[float],\n        pixel_std: Tuple[float],\n        # inference\n        semantic_on: bool,\n        panoptic_on: bool,\n        instance_on: bool,\n        test_topk_per_image: int,\n    ):\n        \"\"\"\n        Args:\n            backbone: a backbone module, must follow detectron2's backbone interface\n            sem_seg_head: a module that predicts semantic segmentation from backbone features\n            criterion: a module that defines the loss\n            num_queries: int, number of queries\n            object_mask_threshold: float, threshold to filter query based on classification score\n                for panoptic segmentation inference\n            overlap_threshold: overlap threshold used in general inference for panoptic segmentation\n            metadata: dataset meta, get `thing` and `stuff` category names for panoptic\n                segmentation inference\n            size_divisibility: Some backbones require the input height and width to be divisible by a\n                specific integer. 
We can use this to override such requirement.\n            sem_seg_postprocess_before_inference: whether to resize the prediction back\n                to original input size before semantic segmentation inference or after.\n                For high-resolution dataset like Mapillary, resizing predictions before\n                inference will cause OOM error.\n            pixel_mean, pixel_std: list or tuple with #channels element, representing\n                the per-channel mean and std to be used to normalize the input image\n            semantic_on: bool, whether to output semantic segmentation prediction\n            instance_on: bool, whether to output instance segmentation prediction\n            panoptic_on: bool, whether to output panoptic segmentation prediction\n            test_topk_per_image: int, instance segmentation parameter, keep topk instances per image\n        \"\"\"\n        super().__init__()\n        self.backbone = backbone\n        self.sem_seg_head = sem_seg_head\n        self.criterion = criterion\n        self.num_queries = num_queries\n        self.overlap_threshold = overlap_threshold\n        self.object_mask_threshold = object_mask_threshold\n        self.metadata = metadata\n        if size_divisibility < 0:\n            # use backbone size_divisibility if not set\n            size_divisibility = self.backbone.size_divisibility\n        self.size_divisibility = size_divisibility\n        self.sem_seg_postprocess_before_inference = sem_seg_postprocess_before_inference\n        self.register_buffer(\"pixel_mean\", torch.Tensor(pixel_mean).view(-1, 1, 1), False)\n        self.register_buffer(\"pixel_std\", torch.Tensor(pixel_std).view(-1, 1, 1), False)\n\n        # additional args\n        self.semantic_on = semantic_on\n        self.instance_on = instance_on\n        self.panoptic_on = panoptic_on\n        self.test_topk_per_image = test_topk_per_image\n\n        if not self.semantic_on:\n            assert 
self.sem_seg_postprocess_before_inference\n\n    @classmethod\n    def from_config(cls, cfg):\n        backbone = build_backbone(cfg)\n        sem_seg_head = build_sem_seg_head(cfg, backbone.output_shape())\n\n        # Loss parameters:\n        deep_supervision = cfg.MODEL.MASK_FORMER.DEEP_SUPERVISION\n        no_object_weight = cfg.MODEL.MASK_FORMER.NO_OBJECT_WEIGHT\n\n        # loss weights\n        class_weight = cfg.MODEL.MASK_FORMER.CLASS_WEIGHT\n        dice_weight = cfg.MODEL.MASK_FORMER.DICE_WEIGHT\n        mask_weight = cfg.MODEL.MASK_FORMER.MASK_WEIGHT\n\n        # building criterion\n        matcher = HungarianMatcher(\n            cost_class=class_weight,\n            cost_mask=mask_weight,\n            cost_dice=dice_weight,\n            num_points=cfg.MODEL.MASK_FORMER.TRAIN_NUM_POINTS,\n        )\n\n        weight_dict = {\"loss_ce\": class_weight, \"loss_mask\": mask_weight, \"loss_dice\": dice_weight, \"loss_bound\": mask_weight}\n\n        if deep_supervision:\n            dec_layers = cfg.MODEL.MASK_FORMER.DEC_LAYERS\n            aux_weight_dict = {}\n            for i in range(dec_layers - 1):\n                aux_weight_dict.update({k + f\"_{i}\": v for k, v in weight_dict.items()})\n            weight_dict.update(aux_weight_dict)\n\n        losses = [\"labels\", \"masks\"]\n\n        criterion = SetCriterion(\n            sem_seg_head.num_classes,\n            matcher=matcher,\n            weight_dict=weight_dict,\n            eos_coef=no_object_weight,\n            losses=losses,\n            num_points=cfg.MODEL.MASK_FORMER.TRAIN_NUM_POINTS,\n            oversample_ratio=cfg.MODEL.MASK_FORMER.OVERSAMPLE_RATIO,\n            importance_sample_ratio=cfg.MODEL.MASK_FORMER.IMPORTANCE_SAMPLE_RATIO,\n        )\n\n        return {\n            \"backbone\": backbone,\n            \"sem_seg_head\": sem_seg_head,\n            \"criterion\": criterion,\n            \"num_queries\": cfg.MODEL.MASK_FORMER.NUM_OBJECT_QUERIES,\n            
\"object_mask_threshold\": cfg.MODEL.MASK_FORMER.TEST.OBJECT_MASK_THRESHOLD,\n            \"overlap_threshold\": cfg.MODEL.MASK_FORMER.TEST.OVERLAP_THRESHOLD,\n            \"metadata\": MetadataCatalog.get(cfg.DATASETS.TRAIN[0]),\n            \"size_divisibility\": cfg.MODEL.MASK_FORMER.SIZE_DIVISIBILITY,\n            \"sem_seg_postprocess_before_inference\": (\n                cfg.MODEL.MASK_FORMER.TEST.SEM_SEG_POSTPROCESSING_BEFORE_INFERENCE\n                or cfg.MODEL.MASK_FORMER.TEST.PANOPTIC_ON\n                or cfg.MODEL.MASK_FORMER.TEST.INSTANCE_ON\n            ),\n            \"pixel_mean\": cfg.MODEL.PIXEL_MEAN,\n            \"pixel_std\": cfg.MODEL.PIXEL_STD,\n            # inference\n            \"semantic_on\": cfg.MODEL.MASK_FORMER.TEST.SEMANTIC_ON,\n            \"instance_on\": cfg.MODEL.MASK_FORMER.TEST.INSTANCE_ON,\n            \"panoptic_on\": cfg.MODEL.MASK_FORMER.TEST.PANOPTIC_ON,\n            \"test_topk_per_image\": cfg.TEST.DETECTIONS_PER_IMAGE,\n        }\n\n    @property\n    def device(self):\n        return self.pixel_mean.device\n\n    def forward(self, batched_inputs):\n        \"\"\"\n        Args:\n            batched_inputs: a list, batched outputs of :class:`DatasetMapper`.\n                Each item in the list contains the inputs for one image.\n                For now, each item in the list is a dict that contains:\n                   * \"image\": Tensor, image in (C, H, W) format.\n                   * \"instances\": per-region ground truth\n                   * Other information that's included in the original dicts, such as:\n                     \"height\", \"width\" (int): the output resolution of the model (may be different\n                     from input resolution), used in inference.\n        Returns:\n            list[dict]:\n                each dict has the results for one image. 
The dict contains the following keys:\n\n                * \"sem_seg\":\n                    A Tensor that represents the\n                    per-pixel segmentation predicted by the head.\n                    The prediction has shape KxHxW that represents the logits of\n                    each class for each pixel.\n                * \"panoptic_seg\":\n                    A tuple that represents the panoptic output\n                    panoptic_seg (Tensor): of shape (height, width) where the values are ids for each segment.\n                    segments_info (list[dict]): Describes each segment in `panoptic_seg`.\n                        Each dict contains keys \"id\", \"category_id\", \"isthing\".\n        \"\"\"\n        images = [x[\"image\"].to(self.device) for x in batched_inputs]\n        # if self.training:\n        #     downsampled_images = [F.avg_pool2d(img.float(), kernel_size=4, stride=4, padding=0)[[2, 1, 0]] for img in images]\n        #     images_lab = [torch.as_tensor(color.rgb2lab(ds_image.byte().permute(1, 2, 0).cpu().numpy()), device=ds_image.device, dtype=torch.float32).permute(2, 0, 1) for ds_image in downsampled_images]\n        #     images_lab_sim = [get_images_color_similarity(img_lab.unsqueeze(0), 3, 2) for img_lab in images_lab] # ori is 0.3, 0.5, 0.7\n            \n        #     # for i_m, im_sim in enumerate(images_lab_sim):\n        #     #     heatmapshow = cv2.applyColorMap((im_sim[0, 0] * 255).cpu().numpy().astype(np.uint8), cv2.COLORMAP_JET)\n        #     #     cv2.imwrite('./vis_debug3/'+str(batched_inputs[i_m]['image_id'])+\"_heatmap_n_bina_new1.jpg\", heatmapshow)\n        #     #     cv2.imwrite('./vis_debug3/'+str(batched_inputs[i_m]['image_id'])+\"_img.jpg\", downsampled_images[i_m].byte().permute(1, 2, 0).cpu().numpy())\n\n        #     # print('images_lab_sim shape:', [im_sim.shape1 for im_sim in images_lab_sim])\n\n        # print('mask in image_masks:', [m.shape for m in image_masks])\n        # print('mask in image_masks 
max:', [m.max() for m in image_masks])\n        # print('mask in image_masks min:', [m.min() for m in image_masks])\n        # print('mask in image_masks percent:', [m.sum() / (m.shape[0] * m.shape[1]) for m in image_masks])\n\n        if self.training:\n            rs_images = ImageList.from_tensors(images, self.size_divisibility)\n            image_masks = [~ x[\"padding_mask\"].to(self.device) for x in batched_inputs]\n            image_masks_back = [x[\"padding_mask\"].to(self.device) for x in batched_inputs]\n            # for ii, i_mask in enumerate(image_masks):\n            #     print('index:', ii, 'i_mask:', i_mask.shape)\n            #     print('index:', ii, 'i_mask:', i_mask.max())\n            #     cv2.imwrite('vis_mask_check/'+str(batched_inputs[ii]['image_id'])+str(ii)+'_mask.jpg', i_mask.float().cpu().numpy() * 255)\n            # print('mask in image_masks:', [m.shape for m in image_masks])\n            # print('mask in image_masks max:', [m.max() for m in image_masks])\n            # print('mask in image_masks min:', [m.min() for m in image_masks])\n            image_masks_bool = [((m.sum() / (m.shape[0] * m.shape[1])) > 0.25).float()*((m_b.sum() / (m.shape[0] * m.shape[1])) > 0.25).float() for m, m_b in zip(image_masks, image_masks_back)] #0.25, 0.64\n            #image_masks_bool = [((m.sum() / (m.shape[0] * m.shape[1])) > 1.0).float() for m in image_masks] #0.25, 0.64\n            # print('len image_masks_bool:', image_masks_bool)\n            downsampled_images = F.avg_pool2d(rs_images.tensor.float(), kernel_size=4, stride=4, padding=0) #for img in images]\n            # print('len downsampled_images:', len(downsampled_images))\n            images_lab = [torch.as_tensor(color.rgb2lab(ds_image[[2, 1, 0]].byte().permute(1, 2, 0).cpu().numpy()), device=ds_image.device, dtype=torch.float32).permute(2, 0, 1) for ds_image in downsampled_images]\n            images_lab_sim = [get_images_color_similarity(img_lab.unsqueeze(0), 3, 2) * 
float(img_m_bool) for img_lab, img_m_bool in zip(images_lab, image_masks_bool)] # ori is 0.3, 0.5, 0.7\n            \n            # for i_m, im_sim in enumerate(images_lab_sim):\n            #     heatmapshow = cv2.applyColorMap((im_sim[0, 0] * 255).cpu().numpy().astype(np.uint8), cv2.COLORMAP_JET)\n            #     cv2.imwrite('./vis_debug3/'+str(batched_inputs[i_m]['image_id'])+\"_heatmap_n_bina_new1.jpg\", heatmapshow)\n            #     cv2.imwrite('./vis_debug3/'+str(batched_inputs[i_m]['image_id'])+\"_img.jpg\", downsampled_images[i_m].byte().permute(1, 2, 0).cpu().numpy())\n\n            # print('images_lab_sim shape:', [im_sim.shape1 for im_sim in images_lab_sim])\n\n        # ori_images = ImageList.from_tensors(images, self.size_divisibility)\n        # ori_images_tensor = ori_images.tensor[:, :, ::4, ::4]\n        # print('ori images:', ori_images_tensor.shape)\n\n        images = [(x - self.pixel_mean) / self.pixel_std for x in images]\n        images = ImageList.from_tensors(images, self.size_divisibility)\n\n        features = self.backbone(images.tensor)\n        outputs = self.sem_seg_head(features)\n\n        if self.training:\n            # mask classification target\n            if \"instances\" in batched_inputs[0]:\n                gt_instances = [x[\"instances\"].to(self.device) for x in batched_inputs]\n                targets = self.prepare_targets(gt_instances, images)\n            else:\n                targets = None\n\n            # bipartite matching-based loss\n            losses = self.criterion(outputs, targets, images_lab_sim)\n\n            for k in list(losses.keys()):\n                if k in self.criterion.weight_dict:\n                    losses[k] *= self.criterion.weight_dict[k]\n                else:\n                    # remove this loss if not specified in `weight_dict`\n                    losses.pop(k)\n            return losses\n        else:\n            mask_cls_results = outputs[\"pred_logits\"]\n            
mask_pred_results = outputs[\"pred_masks\"]\n            # upsample masks\n            mask_pred_results = F.interpolate(\n                mask_pred_results,\n                size=(images.tensor.shape[-2], images.tensor.shape[-1]),\n                mode=\"bilinear\",\n                align_corners=False,\n            )\n\n            del outputs\n\n            processed_results = []\n            for mask_cls_result, mask_pred_result, input_per_image, image_size in zip(\n                mask_cls_results, mask_pred_results, batched_inputs, images.image_sizes\n            ):\n                height = input_per_image.get(\"height\", image_size[0])\n                width = input_per_image.get(\"width\", image_size[1])\n                processed_results.append({})\n\n                if self.sem_seg_postprocess_before_inference:\n                    mask_pred_result = retry_if_cuda_oom(sem_seg_postprocess)(\n                        mask_pred_result, image_size, height, width\n                    )\n                    mask_cls_result = mask_cls_result.to(mask_pred_result)\n\n                # semantic segmentation inference\n                if self.semantic_on:\n                    r = retry_if_cuda_oom(self.semantic_inference)(mask_cls_result, mask_pred_result)\n                    if not self.sem_seg_postprocess_before_inference:\n                        r = retry_if_cuda_oom(sem_seg_postprocess)(r, image_size, height, width)\n                    processed_results[-1][\"sem_seg\"] = r\n\n                # panoptic segmentation inference\n                if self.panoptic_on:\n                    panoptic_r = retry_if_cuda_oom(self.panoptic_inference)(mask_cls_result, mask_pred_result)\n                    processed_results[-1][\"panoptic_seg\"] = panoptic_r\n                \n                # instance segmentation inference\n                if self.instance_on:\n                    instance_r = retry_if_cuda_oom(self.instance_inference)(mask_cls_result, 
mask_pred_result)\n                    processed_results[-1][\"instances\"] = instance_r\n\n            return processed_results\n\n    def prepare_targets(self, targets, images):\n        h_pad, w_pad = images.tensor.shape[-2:]\n        new_targets = []\n        for targets_per_image in targets:\n            # pad gt\n            gt_masks = targets_per_image.gt_masks\n            padded_masks = torch.zeros((gt_masks.shape[0], h_pad, w_pad), dtype=gt_masks.dtype, device=gt_masks.device)\n            padded_masks[:, : gt_masks.shape[1], : gt_masks.shape[2]] = gt_masks\n            new_targets.append(\n                {\n                    \"labels\": targets_per_image.gt_classes,\n                    \"masks\": padded_masks,\n                }\n            )\n        return new_targets\n\n    def semantic_inference(self, mask_cls, mask_pred):\n        mask_cls = F.softmax(mask_cls, dim=-1)[..., :-1]\n        mask_pred = mask_pred.sigmoid()\n        semseg = torch.einsum(\"qc,qhw->chw\", mask_cls, mask_pred)\n        return semseg\n\n    def panoptic_inference(self, mask_cls, mask_pred):\n        scores, labels = F.softmax(mask_cls, dim=-1).max(-1)\n        mask_pred = mask_pred.sigmoid()\n\n        keep = labels.ne(self.sem_seg_head.num_classes) & (scores > self.object_mask_threshold)\n        cur_scores = scores[keep]\n        cur_classes = labels[keep]\n        cur_masks = mask_pred[keep]\n        cur_mask_cls = mask_cls[keep]\n        cur_mask_cls = cur_mask_cls[:, :-1]\n\n        cur_prob_masks = cur_scores.view(-1, 1, 1) * cur_masks\n\n        h, w = cur_masks.shape[-2:]\n        panoptic_seg = torch.zeros((h, w), dtype=torch.int32, device=cur_masks.device)\n        segments_info = []\n\n        current_segment_id = 0\n\n        if cur_masks.shape[0] == 0:\n            # We didn't detect any mask :(\n            return panoptic_seg, segments_info\n        else:\n            # take argmax\n            cur_mask_ids = cur_prob_masks.argmax(0)\n            
stuff_memory_list = {}\n            for k in range(cur_classes.shape[0]):\n                pred_class = cur_classes[k].item()\n                isthing = pred_class in self.metadata.thing_dataset_id_to_contiguous_id.values()\n                mask_area = (cur_mask_ids == k).sum().item()\n                original_area = (cur_masks[k] >= 0.5).sum().item()\n                mask = (cur_mask_ids == k) & (cur_masks[k] >= 0.5)\n\n                if mask_area > 0 and original_area > 0 and mask.sum().item() > 0:\n                    if mask_area / original_area < self.overlap_threshold:\n                        continue\n\n                    # merge stuff regions\n                    if not isthing:\n                        if int(pred_class) in stuff_memory_list.keys():\n                            panoptic_seg[mask] = stuff_memory_list[int(pred_class)]\n                            continue\n                        else:\n                            stuff_memory_list[int(pred_class)] = current_segment_id + 1\n\n                    current_segment_id += 1\n                    panoptic_seg[mask] = current_segment_id\n\n                    segments_info.append(\n                        {\n                            \"id\": current_segment_id,\n                            \"isthing\": bool(isthing),\n                            \"category_id\": int(pred_class),\n                        }\n                    )\n\n            return panoptic_seg, segments_info\n\n    def instance_inference(self, mask_cls, mask_pred):\n        # mask_pred is already processed to have the same shape as original input\n        image_size = mask_pred.shape[-2:]\n\n        # [Q, K]\n        scores = F.softmax(mask_cls, dim=-1)[:, :-1]\n        labels = torch.arange(self.sem_seg_head.num_classes, device=self.device).unsqueeze(0).repeat(self.num_queries, 1).flatten(0, 1)\n        # scores_per_image, topk_indices = scores.flatten(0, 1).topk(self.num_queries, sorted=False)\n        scores_per_image, 
topk_indices = scores.flatten(0, 1).topk(self.test_topk_per_image, sorted=False)\n        labels_per_image = labels[topk_indices]\n\n        topk_indices = topk_indices // self.sem_seg_head.num_classes\n        # mask_pred = mask_pred.unsqueeze(1).repeat(1, self.sem_seg_head.num_classes, 1).flatten(0, 1)\n        mask_pred = mask_pred[topk_indices]\n\n        # if this is panoptic segmentation, we only keep the \"thing\" classes\n        if self.panoptic_on:\n            keep = torch.zeros_like(scores_per_image).bool()\n            for i, lab in enumerate(labels_per_image):\n                keep[i] = lab in self.metadata.thing_dataset_id_to_contiguous_id.values()\n\n            scores_per_image = scores_per_image[keep]\n            labels_per_image = labels_per_image[keep]\n            mask_pred = mask_pred[keep]\n\n        result = Instances(image_size)\n        # mask (before sigmoid)\n        result.pred_masks = (mask_pred > 0).float()\n        # result.pred_masks = (mask_pred.sigmoid() >= 0.5)*(mask_pred.sigmoid() < 0.75).float()\n        # result.pred_boxes = Boxes(torch.zeros(mask_pred.size(0), 4))\n        # Compute boxes from masks (this is slow)\n        result.pred_boxes = BitMasks(mask_pred > 0).get_bounding_boxes()\n\n        # calculate average mask prob\n        mask_scores_per_image = (mask_pred.sigmoid().flatten(1) * result.pred_masks.flatten(1)).sum(1) / (result.pred_masks.flatten(1).sum(1) + 1e-6)\n        result.scores = scores_per_image * mask_scores_per_image\n        result.pred_classes = labels_per_image\n        return result\n"
  },
  {
    "path": "mask2former/modeling/__init__.py",
    "content": "from .backbone.swin import D2SwinTransformer\nfrom .pixel_decoder.fpn import BasePixelDecoder\nfrom .pixel_decoder.msdeformattn import MSDeformAttnPixelDecoder\nfrom .meta_arch.mask_former_head import MaskFormerHead\nfrom .meta_arch.per_pixel_baseline import PerPixelBaselineHead, PerPixelBaselinePlusHead\n"
  },
  {
    "path": "mask2former/modeling/backbone/__init__.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n"
  },
  {
    "path": "mask2former/modeling/backbone/swin.py",
    "content": "# --------------------------------------------------------\n# Swin Transformer\n# Copyright (c) 2021 Microsoft\n# Licensed under The MIT License [see LICENSE for details]\n# Written by Ze Liu, Yutong Lin, Yixuan Wei\n# --------------------------------------------------------\n\n# Modified by Bowen Cheng from https://github.com/SwinTransformer/Swin-Transformer-Semantic-Segmentation/blob/main/mmseg/models/backbones/swin_transformer.py\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport torch.utils.checkpoint as checkpoint\nfrom timm.models.layers import DropPath, to_2tuple, trunc_normal_\n\nfrom detectron2.modeling import BACKBONE_REGISTRY, Backbone, ShapeSpec\n\n\nclass Mlp(nn.Module):\n    \"\"\"Multilayer perceptron.\"\"\"\n\n    def __init__(\n        self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.0\n    ):\n        super().__init__()\n        out_features = out_features or in_features\n        hidden_features = hidden_features or in_features\n        self.fc1 = nn.Linear(in_features, hidden_features)\n        self.act = act_layer()\n        self.fc2 = nn.Linear(hidden_features, out_features)\n        self.drop = nn.Dropout(drop)\n\n    def forward(self, x):\n        x = self.fc1(x)\n        x = self.act(x)\n        x = self.drop(x)\n        x = self.fc2(x)\n        x = self.drop(x)\n        return x\n\n\ndef window_partition(x, window_size):\n    \"\"\"\n    Args:\n        x: (B, H, W, C)\n        window_size (int): window size\n    Returns:\n        windows: (num_windows*B, window_size, window_size, C)\n    \"\"\"\n    B, H, W, C = x.shape\n    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)\n    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)\n    return windows\n\n\ndef window_reverse(windows, window_size, H, W):\n    \"\"\"\n    Args:\n        windows: (num_windows*B, window_size, 
window_size, C)\n        window_size (int): Window size\n        H (int): Height of image\n        W (int): Width of image\n    Returns:\n        x: (B, H, W, C)\n    \"\"\"\n    B = int(windows.shape[0] / (H * W / window_size / window_size))\n    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)\n    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)\n    return x\n\n\nclass WindowAttention(nn.Module):\n    \"\"\"Window based multi-head self attention (W-MSA) module with relative position bias.\n    It supports both shifted and non-shifted windows.\n    Args:\n        dim (int): Number of input channels.\n        window_size (tuple[int]): The height and width of the window.\n        num_heads (int): Number of attention heads.\n        qkv_bias (bool, optional):  If True, add a learnable bias to query, key, value. Default: True\n        qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set\n        attn_drop (float, optional): Dropout ratio of attention weight. Default: 0.0\n        proj_drop (float, optional): Dropout ratio of output. 
Default: 0.0\n    \"\"\"\n\n    def __init__(\n        self,\n        dim,\n        window_size,\n        num_heads,\n        qkv_bias=True,\n        qk_scale=None,\n        attn_drop=0.0,\n        proj_drop=0.0,\n    ):\n\n        super().__init__()\n        self.dim = dim\n        self.window_size = window_size  # Wh, Ww\n        self.num_heads = num_heads\n        head_dim = dim // num_heads\n        self.scale = qk_scale or head_dim ** -0.5\n\n        # define a parameter table of relative position bias\n        self.relative_position_bias_table = nn.Parameter(\n            torch.zeros((2 * window_size[0] - 1) * (2 * window_size[1] - 1), num_heads)\n        )  # 2*Wh-1 * 2*Ww-1, nH\n\n        # get pair-wise relative position index for each token inside the window\n        coords_h = torch.arange(self.window_size[0])\n        coords_w = torch.arange(self.window_size[1])\n        coords = torch.stack(torch.meshgrid([coords_h, coords_w]))  # 2, Wh, Ww\n        coords_flatten = torch.flatten(coords, 1)  # 2, Wh*Ww\n        relative_coords = coords_flatten[:, :, None] - coords_flatten[:, None, :]  # 2, Wh*Ww, Wh*Ww\n        relative_coords = relative_coords.permute(1, 2, 0).contiguous()  # Wh*Ww, Wh*Ww, 2\n        relative_coords[:, :, 0] += self.window_size[0] - 1  # shift to start from 0\n        relative_coords[:, :, 1] += self.window_size[1] - 1\n        relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1\n        relative_position_index = relative_coords.sum(-1)  # Wh*Ww, Wh*Ww\n        self.register_buffer(\"relative_position_index\", relative_position_index)\n\n        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)\n        self.attn_drop = nn.Dropout(attn_drop)\n        self.proj = nn.Linear(dim, dim)\n        self.proj_drop = nn.Dropout(proj_drop)\n\n        trunc_normal_(self.relative_position_bias_table, std=0.02)\n        self.softmax = nn.Softmax(dim=-1)\n\n    def forward(self, x, mask=None):\n        \"\"\"Forward function.\n        Args:\n    
        x: input features with shape of (num_windows*B, N, C)\n            mask: (0/-inf) mask with shape of (num_windows, Wh*Ww, Wh*Ww) or None\n        \"\"\"\n        B_, N, C = x.shape\n        qkv = (\n            self.qkv(x)\n            .reshape(B_, N, 3, self.num_heads, C // self.num_heads)\n            .permute(2, 0, 3, 1, 4)\n        )\n        q, k, v = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)\n\n        q = q * self.scale\n        attn = q @ k.transpose(-2, -1)\n\n        relative_position_bias = self.relative_position_bias_table[\n            self.relative_position_index.view(-1)\n        ].view(\n            self.window_size[0] * self.window_size[1], self.window_size[0] * self.window_size[1], -1\n        )  # Wh*Ww,Wh*Ww,nH\n        relative_position_bias = relative_position_bias.permute(\n            2, 0, 1\n        ).contiguous()  # nH, Wh*Ww, Wh*Ww\n        attn = attn + relative_position_bias.unsqueeze(0)\n\n        if mask is not None:\n            nW = mask.shape[0]\n            attn = attn.view(B_ // nW, nW, self.num_heads, N, N) + mask.unsqueeze(1).unsqueeze(0)\n            attn = attn.view(-1, self.num_heads, N, N)\n            attn = self.softmax(attn)\n        else:\n            attn = self.softmax(attn)\n\n        attn = self.attn_drop(attn)\n\n        x = (attn @ v).transpose(1, 2).reshape(B_, N, C)\n        x = self.proj(x)\n        x = self.proj_drop(x)\n        return x\n\n\nclass SwinTransformerBlock(nn.Module):\n    \"\"\"Swin Transformer Block.\n    Args:\n        dim (int): Number of input channels.\n        num_heads (int): Number of attention heads.\n        window_size (int): Window size.\n        shift_size (int): Shift size for SW-MSA.\n        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.\n        qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. 
Default: True\n        qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set.\n        drop (float, optional): Dropout rate. Default: 0.0\n        attn_drop (float, optional): Attention dropout rate. Default: 0.0\n        drop_path (float, optional): Stochastic depth rate. Default: 0.0\n        act_layer (nn.Module, optional): Activation layer. Default: nn.GELU\n        norm_layer (nn.Module, optional): Normalization layer.  Default: nn.LayerNorm\n    \"\"\"\n\n    def __init__(\n        self,\n        dim,\n        num_heads,\n        window_size=7,\n        shift_size=0,\n        mlp_ratio=4.0,\n        qkv_bias=True,\n        qk_scale=None,\n        drop=0.0,\n        attn_drop=0.0,\n        drop_path=0.0,\n        act_layer=nn.GELU,\n        norm_layer=nn.LayerNorm,\n    ):\n        super().__init__()\n        self.dim = dim\n        self.num_heads = num_heads\n        self.window_size = window_size\n        self.shift_size = shift_size\n        self.mlp_ratio = mlp_ratio\n        assert 0 <= self.shift_size < self.window_size, \"shift_size must be in [0, window_size)\"\n\n        self.norm1 = norm_layer(dim)\n        self.attn = WindowAttention(\n            dim,\n            window_size=to_2tuple(self.window_size),\n            num_heads=num_heads,\n            qkv_bias=qkv_bias,\n            qk_scale=qk_scale,\n            attn_drop=attn_drop,\n            proj_drop=drop,\n        )\n\n        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()\n        self.norm2 = norm_layer(dim)\n        mlp_hidden_dim = int(dim * mlp_ratio)\n        self.mlp = Mlp(\n            in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop\n        )\n\n        self.H = None\n        self.W = None\n\n    def forward(self, x, mask_matrix):\n        \"\"\"Forward function.\n        Args:\n            x: Input feature, tensor size (B, H*W, C).\n            H, W: Spatial resolution of the input feature.\n    
        mask_matrix: Attention mask for cyclic shift.\n        \"\"\"\n        B, L, C = x.shape\n        H, W = self.H, self.W\n        assert L == H * W, \"input feature has wrong size\"\n\n        shortcut = x\n        x = self.norm1(x)\n        x = x.view(B, H, W, C)\n\n        # pad feature maps to multiples of window size\n        pad_l = pad_t = 0\n        pad_r = (self.window_size - W % self.window_size) % self.window_size\n        pad_b = (self.window_size - H % self.window_size) % self.window_size\n        x = F.pad(x, (0, 0, pad_l, pad_r, pad_t, pad_b))\n        _, Hp, Wp, _ = x.shape\n\n        # cyclic shift\n        if self.shift_size > 0:\n            shifted_x = torch.roll(x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))\n            attn_mask = mask_matrix\n        else:\n            shifted_x = x\n            attn_mask = None\n\n        # partition windows\n        x_windows = window_partition(\n            shifted_x, self.window_size\n        )  # nW*B, window_size, window_size, C\n        x_windows = x_windows.view(\n            -1, self.window_size * self.window_size, C\n        )  # nW*B, window_size*window_size, C\n\n        # W-MSA/SW-MSA\n        attn_windows = self.attn(x_windows, mask=attn_mask)  # nW*B, window_size*window_size, C\n\n        # merge windows\n        attn_windows = attn_windows.view(-1, self.window_size, self.window_size, C)\n        shifted_x = window_reverse(attn_windows, self.window_size, Hp, Wp)  # B H' W' C\n\n        # reverse cyclic shift\n        if self.shift_size > 0:\n            x = torch.roll(shifted_x, shifts=(self.shift_size, self.shift_size), dims=(1, 2))\n        else:\n            x = shifted_x\n\n        if pad_r > 0 or pad_b > 0:\n            x = x[:, :H, :W, :].contiguous()\n\n        x = x.view(B, H * W, C)\n\n        # FFN\n        x = shortcut + self.drop_path(x)\n        x = x + self.drop_path(self.mlp(self.norm2(x)))\n\n        return x\n\n\nclass PatchMerging(nn.Module):\n    
\"\"\"Patch Merging Layer\n    Args:\n        dim (int): Number of input channels.\n        norm_layer (nn.Module, optional): Normalization layer.  Default: nn.LayerNorm\n    \"\"\"\n\n    def __init__(self, dim, norm_layer=nn.LayerNorm):\n        super().__init__()\n        self.dim = dim\n        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)\n        self.norm = norm_layer(4 * dim)\n\n    def forward(self, x, H, W):\n        \"\"\"Forward function.\n        Args:\n            x: Input feature, tensor size (B, H*W, C).\n            H, W: Spatial resolution of the input feature.\n        \"\"\"\n        B, L, C = x.shape\n        assert L == H * W, \"input feature has wrong size\"\n\n        x = x.view(B, H, W, C)\n\n        # padding\n        pad_input = (H % 2 == 1) or (W % 2 == 1)\n        if pad_input:\n            x = F.pad(x, (0, 0, 0, W % 2, 0, H % 2))\n\n        x0 = x[:, 0::2, 0::2, :]  # B H/2 W/2 C\n        x1 = x[:, 1::2, 0::2, :]  # B H/2 W/2 C\n        x2 = x[:, 0::2, 1::2, :]  # B H/2 W/2 C\n        x3 = x[:, 1::2, 1::2, :]  # B H/2 W/2 C\n        x = torch.cat([x0, x1, x2, x3], -1)  # B H/2 W/2 4*C\n        x = x.view(B, -1, 4 * C)  # B H/2*W/2 4*C\n\n        x = self.norm(x)\n        x = self.reduction(x)\n\n        return x\n\n\nclass BasicLayer(nn.Module):\n    \"\"\"A basic Swin Transformer layer for one stage.\n    Args:\n        dim (int): Number of feature channels.\n        depth (int): Depth (number of blocks) of this stage.\n        num_heads (int): Number of attention heads.\n        window_size (int): Local window size. Default: 7.\n        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4.\n        qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True\n        qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set.\n        drop (float, optional): Dropout rate. Default: 0.0\n        attn_drop (float, optional): Attention dropout rate. 
Default: 0.0\n        drop_path (float | tuple[float], optional): Stochastic depth rate. Default: 0.0\n        norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm\n        downsample (nn.Module | None, optional): Downsample layer at the end of the layer. Default: None\n        use_checkpoint (bool): Whether to use checkpointing to save memory. Default: False.\n    \"\"\"\n\n    def __init__(\n        self,\n        dim,\n        depth,\n        num_heads,\n        window_size=7,\n        mlp_ratio=4.0,\n        qkv_bias=True,\n        qk_scale=None,\n        drop=0.0,\n        attn_drop=0.0,\n        drop_path=0.0,\n        norm_layer=nn.LayerNorm,\n        downsample=None,\n        use_checkpoint=False,\n    ):\n        super().__init__()\n        self.window_size = window_size\n        self.shift_size = window_size // 2\n        self.depth = depth\n        self.use_checkpoint = use_checkpoint\n\n        # build blocks\n        self.blocks = nn.ModuleList(\n            [\n                SwinTransformerBlock(\n                    dim=dim,\n                    num_heads=num_heads,\n                    window_size=window_size,\n                    shift_size=0 if (i % 2 == 0) else window_size // 2,\n                    mlp_ratio=mlp_ratio,\n                    qkv_bias=qkv_bias,\n                    qk_scale=qk_scale,\n                    drop=drop,\n                    attn_drop=attn_drop,\n                    drop_path=drop_path[i] if isinstance(drop_path, list) else drop_path,\n                    norm_layer=norm_layer,\n                )\n                for i in range(depth)\n            ]\n        )\n\n        # patch merging layer\n        if downsample is not None:\n            self.downsample = downsample(dim=dim, norm_layer=norm_layer)\n        else:\n            self.downsample = None\n\n    def forward(self, x, H, W):\n        \"\"\"Forward function.\n        Args:\n            x: Input feature, tensor size (B, H*W, C).\n         
   H, W: Spatial resolution of the input feature.\n        \"\"\"\n\n        # calculate attention mask for SW-MSA\n        Hp = int(np.ceil(H / self.window_size)) * self.window_size\n        Wp = int(np.ceil(W / self.window_size)) * self.window_size\n        img_mask = torch.zeros((1, Hp, Wp, 1), device=x.device)  # 1 Hp Wp 1\n        h_slices = (\n            slice(0, -self.window_size),\n            slice(-self.window_size, -self.shift_size),\n            slice(-self.shift_size, None),\n        )\n        w_slices = (\n            slice(0, -self.window_size),\n            slice(-self.window_size, -self.shift_size),\n            slice(-self.shift_size, None),\n        )\n        cnt = 0\n        for h in h_slices:\n            for w in w_slices:\n                img_mask[:, h, w, :] = cnt\n                cnt += 1\n\n        mask_windows = window_partition(\n            img_mask, self.window_size\n        )  # nW, window_size, window_size, 1\n        mask_windows = mask_windows.view(-1, self.window_size * self.window_size)\n        attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)\n        attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0)).masked_fill(\n            attn_mask == 0, float(0.0)\n        )\n\n        for blk in self.blocks:\n            blk.H, blk.W = H, W\n            if self.use_checkpoint:\n                x = checkpoint.checkpoint(blk, x, attn_mask)\n            else:\n                x = blk(x, attn_mask)\n        if self.downsample is not None:\n            x_down = self.downsample(x, H, W)\n            Wh, Ww = (H + 1) // 2, (W + 1) // 2\n            return x, H, W, x_down, Wh, Ww\n        else:\n            return x, H, W, x, H, W\n\n\nclass PatchEmbed(nn.Module):\n    \"\"\"Image to Patch Embedding\n    Args:\n        patch_size (int): Patch token size. Default: 4.\n        in_chans (int): Number of input image channels. Default: 3.\n        embed_dim (int): Number of linear projection output channels. 
Default: 96.\n        norm_layer (nn.Module, optional): Normalization layer. Default: None\n    \"\"\"\n\n    def __init__(self, patch_size=4, in_chans=3, embed_dim=96, norm_layer=None):\n        super().__init__()\n        patch_size = to_2tuple(patch_size)\n        self.patch_size = patch_size\n\n        self.in_chans = in_chans\n        self.embed_dim = embed_dim\n\n        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)\n        if norm_layer is not None:\n            self.norm = norm_layer(embed_dim)\n        else:\n            self.norm = None\n\n    def forward(self, x):\n        \"\"\"Forward function.\"\"\"\n        # padding\n        _, _, H, W = x.size()\n        if W % self.patch_size[1] != 0:\n            x = F.pad(x, (0, self.patch_size[1] - W % self.patch_size[1]))\n        if H % self.patch_size[0] != 0:\n            x = F.pad(x, (0, 0, 0, self.patch_size[0] - H % self.patch_size[0]))\n\n        x = self.proj(x)  # B C Wh Ww\n        if self.norm is not None:\n            Wh, Ww = x.size(2), x.size(3)\n            x = x.flatten(2).transpose(1, 2)\n            x = self.norm(x)\n            x = x.transpose(1, 2).view(-1, self.embed_dim, Wh, Ww)\n\n        return x\n\n\nclass SwinTransformer(nn.Module):\n    \"\"\"Swin Transformer backbone.\n        A PyTorch impl of: `Swin Transformer: Hierarchical Vision Transformer using Shifted Windows`  -\n          https://arxiv.org/pdf/2103.14030\n    Args:\n        pretrain_img_size (int): Input image size for training the pretrained model,\n            used in absolute position embedding. Default 224.\n        patch_size (int | tuple(int)): Patch size. Default: 4.\n        in_chans (int): Number of input image channels. Default: 3.\n        embed_dim (int): Number of linear projection output channels. 
Default: 96.\n        depths (tuple[int]): Depths of each Swin Transformer stage.\n        num_heads (tuple[int]): Number of attention heads of each stage.\n        window_size (int): Window size. Default: 7.\n        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4.\n        qkv_bias (bool): If True, add a learnable bias to query, key, value. Default: True\n        qk_scale (float): Override default qk scale of head_dim ** -0.5 if set.\n        drop_rate (float): Dropout rate.\n        attn_drop_rate (float): Attention dropout rate. Default: 0.\n        drop_path_rate (float): Stochastic depth rate. Default: 0.2.\n        norm_layer (nn.Module): Normalization layer. Default: nn.LayerNorm.\n        ape (bool): If True, add absolute position embedding to the patch embedding. Default: False.\n        patch_norm (bool): If True, add normalization after patch embedding. Default: True.\n        out_indices (Sequence[int]): Output from which stages.\n        frozen_stages (int): Stages to be frozen (stop grad and set eval mode).\n            -1 means not freezing any parameters.\n        use_checkpoint (bool): Whether to use checkpointing to save memory. 
Default: False.\n    \"\"\"\n\n    def __init__(\n        self,\n        pretrain_img_size=224,\n        patch_size=4,\n        in_chans=3,\n        embed_dim=96,\n        depths=[2, 2, 6, 2],\n        num_heads=[3, 6, 12, 24],\n        window_size=7,\n        mlp_ratio=4.0,\n        qkv_bias=True,\n        qk_scale=None,\n        drop_rate=0.0,\n        attn_drop_rate=0.0,\n        drop_path_rate=0.2,\n        norm_layer=nn.LayerNorm,\n        ape=False,\n        patch_norm=True,\n        out_indices=(0, 1, 2, 3),\n        frozen_stages=-1,\n        use_checkpoint=False,\n    ):\n        super().__init__()\n\n        self.pretrain_img_size = pretrain_img_size\n        self.num_layers = len(depths)\n        self.embed_dim = embed_dim\n        self.ape = ape\n        self.patch_norm = patch_norm\n        self.out_indices = out_indices\n        self.frozen_stages = frozen_stages\n\n        # split image into non-overlapping patches\n        self.patch_embed = PatchEmbed(\n            patch_size=patch_size,\n            in_chans=in_chans,\n            embed_dim=embed_dim,\n            norm_layer=norm_layer if self.patch_norm else None,\n        )\n\n        # absolute position embedding\n        if self.ape:\n            pretrain_img_size = to_2tuple(pretrain_img_size)\n            patch_size = to_2tuple(patch_size)\n            patches_resolution = [\n                pretrain_img_size[0] // patch_size[0],\n                pretrain_img_size[1] // patch_size[1],\n            ]\n\n            self.absolute_pos_embed = nn.Parameter(\n                torch.zeros(1, embed_dim, patches_resolution[0], patches_resolution[1])\n            )\n            trunc_normal_(self.absolute_pos_embed, std=0.02)\n\n        self.pos_drop = nn.Dropout(p=drop_rate)\n\n        # stochastic depth\n        dpr = [\n            x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))\n        ]  # stochastic depth decay rule\n\n        # build layers\n        self.layers = 
nn.ModuleList()\n        for i_layer in range(self.num_layers):\n            layer = BasicLayer(\n                dim=int(embed_dim * 2 ** i_layer),\n                depth=depths[i_layer],\n                num_heads=num_heads[i_layer],\n                window_size=window_size,\n                mlp_ratio=mlp_ratio,\n                qkv_bias=qkv_bias,\n                qk_scale=qk_scale,\n                drop=drop_rate,\n                attn_drop=attn_drop_rate,\n                drop_path=dpr[sum(depths[:i_layer]) : sum(depths[: i_layer + 1])],\n                norm_layer=norm_layer,\n                downsample=PatchMerging if (i_layer < self.num_layers - 1) else None,\n                use_checkpoint=use_checkpoint,\n            )\n            self.layers.append(layer)\n\n        num_features = [int(embed_dim * 2 ** i) for i in range(self.num_layers)]\n        self.num_features = num_features\n\n        # add a norm layer for each output\n        for i_layer in out_indices:\n            layer = norm_layer(num_features[i_layer])\n            layer_name = f\"norm{i_layer}\"\n            self.add_module(layer_name, layer)\n\n        self._freeze_stages()\n\n    def _freeze_stages(self):\n        if self.frozen_stages >= 0:\n            self.patch_embed.eval()\n            for param in self.patch_embed.parameters():\n                param.requires_grad = False\n\n        if self.frozen_stages >= 1 and self.ape:\n            self.absolute_pos_embed.requires_grad = False\n\n        if self.frozen_stages >= 2:\n            self.pos_drop.eval()\n            for i in range(0, self.frozen_stages - 1):\n                m = self.layers[i]\n                m.eval()\n                for param in m.parameters():\n                    param.requires_grad = False\n\n    def init_weights(self, pretrained=None):\n        \"\"\"Initialize the weights in backbone.\n        Args:\n            pretrained (str, optional): Path to pre-trained weights.\n                Defaults to None.\n       
 \"\"\"\n\n        def _init_weights(m):\n            if isinstance(m, nn.Linear):\n                trunc_normal_(m.weight, std=0.02)\n                if isinstance(m, nn.Linear) and m.bias is not None:\n                    nn.init.constant_(m.bias, 0)\n            elif isinstance(m, nn.LayerNorm):\n                nn.init.constant_(m.bias, 0)\n                nn.init.constant_(m.weight, 1.0)\n\n        # actually apply the initializer; without this call the helper above is dead code\n        self.apply(_init_weights)\n\n    def forward(self, x):\n        \"\"\"Forward function.\"\"\"\n        x = self.patch_embed(x)\n\n        Wh, Ww = x.size(2), x.size(3)\n        if self.ape:\n            # interpolate the position embedding to the corresponding size\n            absolute_pos_embed = F.interpolate(\n                self.absolute_pos_embed, size=(Wh, Ww), mode=\"bicubic\"\n            )\n            x = (x + absolute_pos_embed).flatten(2).transpose(1, 2)  # B Wh*Ww C\n        else:\n            x = x.flatten(2).transpose(1, 2)\n        x = self.pos_drop(x)\n\n        outs = {}\n        for i in range(self.num_layers):\n            layer = self.layers[i]\n            x_out, H, W, x, Wh, Ww = layer(x, Wh, Ww)\n\n            if i in self.out_indices:\n                norm_layer = getattr(self, f\"norm{i}\")\n                x_out = norm_layer(x_out)\n\n                out = x_out.view(-1, H, W, self.num_features[i]).permute(0, 3, 1, 2).contiguous()\n                outs[\"res{}\".format(i + 2)] = out\n\n        return outs\n\n    def train(self, mode=True):\n        \"\"\"Convert the model into training mode while keeping frozen stages frozen.\"\"\"\n        super(SwinTransformer, self).train(mode)\n        self._freeze_stages()\n\n\n@BACKBONE_REGISTRY.register()\nclass D2SwinTransformer(SwinTransformer, Backbone):\n    def __init__(self, cfg, input_shape):\n\n        pretrain_img_size = cfg.MODEL.SWIN.PRETRAIN_IMG_SIZE\n        patch_size = cfg.MODEL.SWIN.PATCH_SIZE\n        in_chans = 3\n        embed_dim = cfg.MODEL.SWIN.EMBED_DIM\n        depths = cfg.MODEL.SWIN.DEPTHS\n        num_heads = 
cfg.MODEL.SWIN.NUM_HEADS\n        window_size = cfg.MODEL.SWIN.WINDOW_SIZE\n        mlp_ratio = cfg.MODEL.SWIN.MLP_RATIO\n        qkv_bias = cfg.MODEL.SWIN.QKV_BIAS\n        qk_scale = cfg.MODEL.SWIN.QK_SCALE\n        drop_rate = cfg.MODEL.SWIN.DROP_RATE\n        attn_drop_rate = cfg.MODEL.SWIN.ATTN_DROP_RATE\n        drop_path_rate = cfg.MODEL.SWIN.DROP_PATH_RATE\n        norm_layer = nn.LayerNorm\n        ape = cfg.MODEL.SWIN.APE\n        patch_norm = cfg.MODEL.SWIN.PATCH_NORM\n        use_checkpoint = cfg.MODEL.SWIN.USE_CHECKPOINT\n\n        super().__init__(\n            pretrain_img_size,\n            patch_size,\n            in_chans,\n            embed_dim,\n            depths,\n            num_heads,\n            window_size,\n            mlp_ratio,\n            qkv_bias,\n            qk_scale,\n            drop_rate,\n            attn_drop_rate,\n            drop_path_rate,\n            norm_layer,\n            ape,\n            patch_norm,\n            use_checkpoint=use_checkpoint,\n        )\n\n        self._out_features = cfg.MODEL.SWIN.OUT_FEATURES\n\n        self._out_feature_strides = {\n            \"res2\": 4,\n            \"res3\": 8,\n            \"res4\": 16,\n            \"res5\": 32,\n        }\n        self._out_feature_channels = {\n            \"res2\": self.num_features[0],\n            \"res3\": self.num_features[1],\n            \"res4\": self.num_features[2],\n            \"res5\": self.num_features[3],\n        }\n\n    def forward(self, x):\n        \"\"\"\n        Args:\n            x: Tensor of shape (N,C,H,W). H, W must be a multiple of ``self.size_divisibility``.\n        Returns:\n            dict[str->Tensor]: names and the corresponding features\n        \"\"\"\n        assert (\n            x.dim() == 4\n        ), f\"SwinTransformer takes an input of shape (N, C, H, W). 
Got {x.shape} instead!\"\n        outputs = {}\n        y = super().forward(x)\n        for k in y.keys():\n            if k in self._out_features:\n                outputs[k] = y[k]\n        return outputs\n\n    def output_shape(self):\n        return {\n            name: ShapeSpec(\n                channels=self._out_feature_channels[name], stride=self._out_feature_strides[name]\n            )\n            for name in self._out_features\n        }\n\n    @property\n    def size_divisibility(self):\n        return 32\n"
  },
  {
    "path": "mask2former/modeling/criterion.py",
"content": "# Modified by Bowen Cheng from https://github.com/facebookresearch/detr/blob/master/models/detr.py\n\"\"\"\nMaskFormer criterion.\n\"\"\"\nimport logging\n\nimport torch\nimport torch.nn.functional as F\nfrom torch import nn\n\nfrom detectron2.utils.comm import get_world_size\nfrom detectron2.projects.point_rend.point_features import (\n    get_uncertain_point_coords_with_randomness,\n    point_sample,\n)\n\nfrom ..utils.misc import is_dist_avail_and_initialized, nested_tensor_from_tensor_list\n\ndef unfold_wo_center(x, kernel_size, dilation):\n    assert x.dim() == 4\n    assert kernel_size % 2 == 1\n\n    # using SAME padding\n    padding = (kernel_size + (dilation - 1) * (kernel_size - 1)) // 2\n    unfolded_x = F.unfold(\n        x, kernel_size=kernel_size,\n        padding=padding,\n        dilation=dilation\n    )\n\n    unfolded_x = unfolded_x.reshape(\n        x.size(0), x.size(1), -1, x.size(2), x.size(3)\n    )\n\n    # remove the center pixels\n    size = kernel_size ** 2\n    unfolded_x = torch.cat((\n        unfolded_x[:, :, :size // 2],\n        unfolded_x[:, :, size // 2 + 1:]\n    ), dim=2)\n\n    return unfolded_x\n\ndef compute_pairwise_term(mask_logits, pairwise_size, pairwise_dilation):\n    assert mask_logits.dim() == 4\n\n    log_fg_prob = F.logsigmoid(mask_logits)\n    log_bg_prob = F.logsigmoid(-mask_logits)\n\n    log_fg_prob_unfold = unfold_wo_center(\n        log_fg_prob, kernel_size=pairwise_size,\n        dilation=pairwise_dilation\n    )\n    log_bg_prob_unfold = unfold_wo_center(\n        log_bg_prob, kernel_size=pairwise_size,\n        dilation=pairwise_dilation\n    )\n\n    # the probability of making the same prediction = p_i * p_j + (1 - p_i) * (1 - p_j)\n    # we compute the probability in log space to avoid numerical instability\n    log_same_fg_prob = log_fg_prob[:, :, None] + log_fg_prob_unfold\n    log_same_bg_prob = log_bg_prob[:, :, None] + log_bg_prob_unfold\n\n    max_ = torch.max(log_same_fg_prob, 
log_same_bg_prob)\n    log_same_prob = torch.log(\n        torch.exp(log_same_fg_prob - max_) +\n        torch.exp(log_same_bg_prob - max_)\n    ) + max_\n\n    # loss = -log(prob)\n    return -log_same_prob[:, 0]\n\ndef get_incoherent_mask(input_masks, sfact):\n    mask = input_masks.float()\n    w = input_masks.shape[-1]\n    h = input_masks.shape[-2]\n    mask_small = F.interpolate(mask, (h//sfact, w//sfact), mode='bilinear')\n    mask_recover = F.interpolate(mask_small, (h, w), mode='bilinear')\n    mask_uncertain = (mask - mask_recover).abs()\n    \n    mask_uncertain = (mask_uncertain > 0.01).float()\n    return mask_uncertain\n\ndef dice_coefficient(x, target):\n    eps = 1e-5\n    n_inst = x.size(0)\n    x = x.reshape(n_inst, -1)\n    target = target.reshape(n_inst, -1)\n    intersection = (x * target).sum(dim=1)\n    union = (x ** 2.0).sum(dim=1) + (target ** 2.0).sum(dim=1) + eps\n    loss = 1. - (2 * intersection / union)\n    return loss\n\ndef compute_project_term(mask_scores, gt_bitmasks):\n    mask_losses_y = dice_coefficient(\n        mask_scores.max(dim=2, keepdim=True)[0],\n        gt_bitmasks.max(dim=2, keepdim=True)[0]\n    )\n    mask_losses_x = dice_coefficient(\n        mask_scores.max(dim=3, keepdim=True)[0],\n        gt_bitmasks.max(dim=3, keepdim=True)[0]\n    )\n    return (mask_losses_x + mask_losses_y).mean()\n\ndef dice_loss(\n        inputs: torch.Tensor,\n        targets: torch.Tensor,\n        num_masks: float,\n    ):\n    \"\"\"\n    Compute the DICE loss, similar to generalized IOU for masks\n    Args:\n        inputs: A float tensor of arbitrary shape.\n                The predictions for each example.\n        targets: A float tensor with the same shape as inputs. 
Stores the binary\n                 classification label for each element in inputs\n                (0 for the negative class and 1 for the positive class).\n    \"\"\"\n    inputs = inputs.sigmoid()\n    inputs = inputs.flatten(1)\n    numerator = 2 * (inputs * targets).sum(-1)\n    denominator = inputs.sum(-1) + targets.sum(-1)\n    loss = 1 - (numerator + 1) / (denominator + 1)\n    return loss.sum() / num_masks\n\n\ndice_loss_jit = torch.jit.script(\n    dice_loss\n)  # type: torch.jit.ScriptModule\n\n\ndef sigmoid_ce_loss(\n        inputs: torch.Tensor,\n        targets: torch.Tensor,\n        num_masks: float,\n    ):\n    \"\"\"\n    Args:\n        inputs: A float tensor of arbitrary shape.\n                The predictions for each example.\n        targets: A float tensor with the same shape as inputs. Stores the binary\n                 classification label for each element in inputs\n                (0 for the negative class and 1 for the positive class).\n    Returns:\n        Loss tensor\n    \"\"\"\n    loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction=\"none\")\n\n    return loss.mean(1).sum() / num_masks\n\n\nsigmoid_ce_loss_jit = torch.jit.script(\n    sigmoid_ce_loss\n)  # type: torch.jit.ScriptModule\n\n\ndef calculate_uncertainty(logits):\n    \"\"\"\n    We estimate uncertainty as the L1 distance between 0.0 and the logit prediction in 'logits' for the\n        foreground class in `classes`.\n    Args:\n        logits (Tensor): A tensor of shape (R, 1, ...) for class-specific or\n            class-agnostic, where R is the total number of predicted masks in all images and C is\n            the number of foreground classes. The values are logits.\n    Returns:\n        scores (Tensor): A tensor of shape (R, 1, ...) 
that contains uncertainty scores with\n            the most uncertain locations having the highest uncertainty score.\n    \"\"\"\n    assert logits.shape[1] == 1\n    gt_class_logits = logits.clone()\n    return -(torch.abs(gt_class_logits))\n\n\nclass SetCriterion(nn.Module):\n    \"\"\"This class computes the loss for DETR.\n    The process happens in two steps:\n        1) we compute hungarian assignment between ground truth boxes and the outputs of the model\n        2) we supervise each pair of matched ground-truth / prediction (supervise class and box)\n    \"\"\"\n\n    def __init__(self, num_classes, matcher, weight_dict, eos_coef, losses,\n                 num_points, oversample_ratio, importance_sample_ratio):\n        \"\"\"Create the criterion.\n        Parameters:\n            num_classes: number of object categories, omitting the special no-object category\n            matcher: module able to compute a matching between targets and proposals\n            weight_dict: dict containing as key the names of the losses and as values their relative weight.\n            eos_coef: relative classification weight applied to the no-object category\n            losses: list of all the losses to be applied. 
See get_loss for list of available losses.\n        \"\"\"\n        super().__init__()\n        self.num_classes = num_classes\n        self.matcher = matcher\n        self.weight_dict = weight_dict\n        self.eos_coef = eos_coef\n        self.losses = losses\n        empty_weight = torch.ones(self.num_classes + 1)\n        empty_weight[-1] = self.eos_coef\n        self.register_buffer(\"empty_weight\", empty_weight)\n\n        # pointwise mask loss parameters\n        self.num_points = num_points\n        self.oversample_ratio = oversample_ratio\n        self.importance_sample_ratio = importance_sample_ratio\n        self.laplacian_kernel = torch.tensor([-1, -1, -1, -1, 8, -1, -1, -1, -1], dtype=torch.float32).reshape(1, 1, 3, 3).requires_grad_(False)\n\n        self.register_buffer(\"_iter\", torch.zeros([1]))\n        self._warmup_iters = 1000 #20000\n\n    def loss_labels(self, outputs, targets, indices, num_masks):\n        \"\"\"Classification loss (NLL)\n        targets dicts must contain the key \"labels\" containing a tensor of dim [nb_target_boxes]\n        \"\"\"\n        assert \"pred_logits\" in outputs\n        src_logits = outputs[\"pred_logits\"].float()\n\n        idx = self._get_src_permutation_idx(indices)\n        target_classes_o = torch.cat([t[\"labels\"][J] for t, (_, J) in zip(targets, indices)])\n        target_classes = torch.full(\n            src_logits.shape[:2], self.num_classes, dtype=torch.int64, device=src_logits.device\n        )\n        target_classes[idx] = target_classes_o\n\n        loss_ce = F.cross_entropy(src_logits.transpose(1, 2), target_classes, self.empty_weight)\n        losses = {\"loss_ce\": loss_ce}\n        return losses\n\n    \n    def loss_masks_proj(self, outputs, targets, indices, num_masks, images_lab_sim):\n        assert \"pred_masks\" in outputs\n\n        self._iter += 1\n\n        src_idx = self._get_src_permutation_idx(indices)\n        tgt_idx = self._get_tgt_permutation_idx(indices)\n        
src_masks = outputs[\"pred_masks\"]\n        src_masks = src_masks[src_idx]\n        masks = [t[\"masks\"] for t in targets]\n        # TODO use valid to mask invalid areas due to padding in loss\n        target_masks, valid = nested_tensor_from_tensor_list(masks).decompose()\n        target_masks = target_masks.to(src_masks)\n        target_masks = target_masks[tgt_idx]\n        \n        if len(src_idx[0].tolist()) > 0:\n            images_lab_sim = torch.cat([images_lab_sim[ind] for ind in src_idx[0].tolist()])\n        \n\n        # No need to upsample predictions as we are using normalized coordinates :)\n        # N x 1 x H x W\n        src_masks = src_masks[:, None]\n        target_masks = target_masks[:, None]\n        target_masks = F.interpolate(target_masks, (src_masks.shape[-2], src_masks.shape[-1]), mode='bilinear')\n        \n        if src_masks.shape[0] > 0:\n            loss_prj_term = compute_project_term(src_masks.sigmoid(), target_masks)  \n            pairwise_losses = compute_pairwise_term(\n                src_masks, 3, 2\n            )\n            \n            inc_mask = get_incoherent_mask(src_masks.detach().sigmoid() > 0.5, 2) #* images_lab_sim).bool()\n            inc_mask = F.conv2d(inc_mask, self.laplacian_kernel.to(inc_mask.device), padding=1).abs()\n            inc_mask = (inc_mask > 0.5).float()\n            \n            weights = (images_lab_sim >= 0.3).float() * target_masks.float() #* inc_mask\n            loss_pairwise = ((pairwise_losses * weights).sum() / weights.sum().clamp(min=1.0)) * 0.25\n            warmup_factor = min(self._iter.item() / float(self._warmup_iters), 1.0)\n            loss_pairwise = loss_pairwise * warmup_factor #* 0.\n        else:\n            loss_prj_term = src_masks.sum() * 0.\n            loss_pairwise = src_masks.sum() * 0.\n\n       \n\n        losses = {\n            \"loss_mask\": loss_prj_term,\n            \"loss_bound\": loss_pairwise,\n        }\n\n        del src_masks\n        del 
target_masks\n        return losses\n\n\n\n    def loss_masks(self, outputs, targets, indices, num_masks):\n        \"\"\"Compute the losses related to the masks: the focal loss and the dice loss.\n        targets dicts must contain the key \"masks\" containing a tensor of dim [nb_target_boxes, h, w]\n        \"\"\"\n        assert \"pred_masks\" in outputs\n\n        src_idx = self._get_src_permutation_idx(indices)\n        tgt_idx = self._get_tgt_permutation_idx(indices)\n        src_masks = outputs[\"pred_masks\"]\n        src_masks = src_masks[src_idx]\n        masks = [t[\"masks\"] for t in targets]\n        # TODO use valid to mask invalid areas due to padding in loss\n        target_masks, valid = nested_tensor_from_tensor_list(masks).decompose()\n        target_masks = target_masks.to(src_masks)\n        target_masks = target_masks[tgt_idx]\n\n        # No need to upsample predictions as we are using normalized coordinates :)\n        # N x 1 x H x W\n        src_masks = src_masks[:, None]\n        target_masks = target_masks[:, None]\n\n        with torch.no_grad():\n            # sample point_coords\n            point_coords = get_uncertain_point_coords_with_randomness(\n                src_masks,\n                lambda logits: calculate_uncertainty(logits),\n                self.num_points,\n                self.oversample_ratio,\n                self.importance_sample_ratio,\n            )\n            # get gt labels\n            point_labels = point_sample(\n                target_masks,\n                point_coords,\n                align_corners=False,\n            ).squeeze(1)\n\n        point_logits = point_sample(\n            src_masks,\n            point_coords,\n            align_corners=False,\n        ).squeeze(1)\n\n        losses = {\n            \"loss_mask\": sigmoid_ce_loss_jit(point_logits, point_labels, num_masks),\n            \"loss_dice\": dice_loss_jit(point_logits, point_labels, num_masks),\n        }\n\n        del src_masks\n 
       del target_masks\n        return losses\n\n    def _get_src_permutation_idx(self, indices):\n        # permute predictions following indices\n        batch_idx = torch.cat([torch.full_like(src, i) for i, (src, _) in enumerate(indices)])\n        src_idx = torch.cat([src for (src, _) in indices])\n        return batch_idx, src_idx\n\n    def _get_tgt_permutation_idx(self, indices):\n        # permute targets following indices\n        batch_idx = torch.cat([torch.full_like(tgt, i) for i, (_, tgt) in enumerate(indices)])\n        tgt_idx = torch.cat([tgt for (_, tgt) in indices])\n        return batch_idx, tgt_idx\n\n    def get_loss(self, loss, outputs, targets, indices, num_masks, images_lab_sim):\n        loss_map = {\n            'labels': self.loss_labels,\n            'masks': self.loss_masks_proj,\n        }\n        assert loss in loss_map, f\"do you really want to compute {loss} loss?\"\n        if loss == 'masks':\n            return loss_map[loss](outputs, targets, indices, num_masks, images_lab_sim)\n        else:\n            return loss_map[loss](outputs, targets, indices, num_masks)\n\n    def forward(self, outputs, targets, images_lab_sim):\n        \"\"\"This performs the loss computation.\n        Parameters:\n             outputs: dict of tensors, see the output specification of the model for the format\n             targets: list of dicts, such that len(targets) == batch_size.\n                      The expected keys in each dict depend on the losses applied, see each loss' doc\n        \"\"\"\n        outputs_without_aux = {k: v for k, v in outputs.items() if k != \"aux_outputs\"}\n\n        # Retrieve the matching between the outputs of the last layer and the targets\n        indices = self.matcher(outputs_without_aux, targets)\n\n        # Compute the average number of target boxes across all nodes, for normalization purposes\n        num_masks = sum(len(t[\"labels\"]) for t in targets)\n        num_masks = torch.as_tensor(\n           
 [num_masks], dtype=torch.float, device=next(iter(outputs.values())).device\n        )\n        if is_dist_avail_and_initialized():\n            torch.distributed.all_reduce(num_masks)\n        num_masks = torch.clamp(num_masks / get_world_size(), min=1).item()\n\n        # Compute all the requested losses\n        losses = {}\n        for loss in self.losses:\n            losses.update(self.get_loss(loss, outputs, targets, indices, num_masks, images_lab_sim))\n\n        # In case of auxiliary losses, we repeat this process with the output of each intermediate layer.\n        if \"aux_outputs\" in outputs:\n            for i, aux_outputs in enumerate(outputs[\"aux_outputs\"]):\n                indices = self.matcher(aux_outputs, targets)\n                for loss in self.losses:\n                    l_dict = self.get_loss(loss, aux_outputs, targets, indices, num_masks, images_lab_sim)\n                    l_dict = {k + f\"_{i}\": v for k, v in l_dict.items()}\n                    losses.update(l_dict)\n\n        return losses\n\n    def __repr__(self):\n        head = \"Criterion \" + self.__class__.__name__\n        body = [\n            \"matcher: {}\".format(self.matcher.__repr__(_repr_indent=8)),\n            \"losses: {}\".format(self.losses),\n            \"weight_dict: {}\".format(self.weight_dict),\n            \"num_classes: {}\".format(self.num_classes),\n            \"eos_coef: {}\".format(self.eos_coef),\n            \"num_points: {}\".format(self.num_points),\n            \"oversample_ratio: {}\".format(self.oversample_ratio),\n            \"importance_sample_ratio: {}\".format(self.importance_sample_ratio),\n        ]\n        _repr_indent = 4\n        lines = [head] + [\" \" * _repr_indent + line for line in body]\n        return \"\\n\".join(lines)\n"
  },
  {
    "path": "mask2former/modeling/matcher.py",
    "content": "# Modified by Bowen Cheng from https://github.com/facebookresearch/detr/blob/master/models/matcher.py\n\"\"\"\nModules to compute the matching cost and solve the corresponding LSAP.\n\"\"\"\nimport torch\nimport torch.nn.functional as F\nfrom scipy.optimize import linear_sum_assignment\nfrom torch import nn\nfrom torch.cuda.amp import autocast\n\nfrom detectron2.projects.point_rend.point_features import point_sample\nfrom util.box_ops import box_cxcywh_to_xyxy, generalized_box_iou, generalized_multi_box_iou\n\ndef batch_dice_loss(inputs: torch.Tensor, targets: torch.Tensor):\n    \"\"\"\n    Compute the DICE loss, similar to generalized IOU for masks\n    Args:\n        inputs: A float tensor of arbitrary shape.\n                The predictions for each example.\n        targets: A float tensor with the same shape as inputs. Stores the binary\n                 classification label for each element in inputs\n                (0 for the negative class and 1 for the positive class).\n    \"\"\"\n    inputs = inputs #.sigmoid()\n    inputs = inputs.flatten(1)\n    numerator = 2 * torch.einsum(\"nc,mc->nm\", inputs, targets)\n    denominator = inputs.sum(-1)[:, None] + targets.sum(-1)[None, :]\n    loss = 1 - (numerator + 1) / (denominator + 1)\n    return loss\n\n\nbatch_dice_loss_jit = torch.jit.script(\n    batch_dice_loss\n)  # type: torch.jit.ScriptModule\n\n\ndef batch_sigmoid_ce_loss(inputs: torch.Tensor, targets: torch.Tensor):\n    \"\"\"\n    Args:\n        inputs: A float tensor of arbitrary shape.\n                The predictions for each example.\n        targets: A float tensor with the same shape as inputs. 
Stores the binary\n                 classification label for each element in inputs\n                (0 for the negative class and 1 for the positive class).\n    Returns:\n        Loss tensor\n    \"\"\"\n    hw = inputs.shape[1]\n\n    pos = F.binary_cross_entropy(\n        inputs, torch.ones_like(inputs), reduction=\"none\"\n    )\n    neg = F.binary_cross_entropy(\n        inputs, torch.zeros_like(inputs), reduction=\"none\"\n    )\n\n    loss = torch.einsum(\"nc,mc->nm\", pos, targets) + torch.einsum(\n        \"nc,mc->nm\", neg, (1 - targets)\n    )\n\n    return loss / hw\n\n\nbatch_sigmoid_ce_loss_jit = torch.jit.script(\n    batch_sigmoid_ce_loss\n)  # type: torch.jit.ScriptModule\n\ndef masks_to_boxes(masks: torch.Tensor) -> torch.Tensor:\n    \"\"\"\n    Fill the bounding box of each provided mask with ones, in place.\n\n    Despite its name, this function does not return box coordinates: each mask is\n    expanded to its axis-aligned bounding rectangle, and the modified masks are returned.\n\n    Args:\n        masks (Tensor[N, H, W]): masks to transform where N is the number of masks\n            and (H, W) are the spatial dimensions.\n\n    Returns:\n        Tensor[N, H, W]: the masks, with each bounding-box region filled with 1\n    \"\"\"\n    if masks.numel() == 0:\n        return masks\n\n    for index, mask in enumerate(masks):\n        y, x = torch.where(mask != 0)\n        if len(x) * len(y) == 0:\n            continue\n\n        masks[index, torch.min(y):torch.max(y), torch.min(x):torch.max(x)] = 1.0\n\n    return masks\n\ndef masks_to_boxes_cc(masks: torch.Tensor) -> torch.Tensor:\n    \"\"\"\n    Compute the bounding boxes around the provided masks.\n\n    Returns a [N, 4] tensor containing bounding boxes. 
The boxes are in ``(x1, y1, x2, y2)`` format with\n    ``0 <= x1 < x2`` and ``0 <= y1 < y2``.\n\n    Args:\n        masks (Tensor[N, H, W]): masks to transform where N is the number of masks\n            and (H, W) are the spatial dimensions.\n\n    Returns:\n        Tensor[N, 4]: bounding boxes\n    \"\"\"\n    if masks.numel() == 0:\n        return torch.zeros((0, 4), device=masks.device, dtype=torch.float)\n\n    n = masks.shape[0]\n    h = masks.shape[1]\n    w = masks.shape[2]\n\n    bounding_boxes = torch.zeros((n, 4), device=masks.device, dtype=torch.float)\n\n    for index, mask in enumerate(masks):\n        y, x = torch.where(mask != 0)\n        if len(x) * len(y) == 0:\n            continue\n\n        bounding_boxes[index, 0] = torch.min(x) / float(w)\n        bounding_boxes[index, 1] = torch.min(y) / float(h)\n        bounding_boxes[index, 2] = torch.max(x) / float(w)\n        bounding_boxes[index, 3] = torch.max(y) / float(h)\n\n    return bounding_boxes\n\n\nclass HungarianMatcher(nn.Module):\n    \"\"\"This class computes an assignment between the targets and the predictions of the network\n\n    For efficiency reasons, the targets don't include the no_object. Because of this, in general,\n    there are more predictions than targets. 
In this case, we do a 1-to-1 matching of the best predictions,\n    while the others are un-matched (and thus treated as non-objects).\n    \"\"\"\n\n    def __init__(self, cost_class: float = 1, cost_mask: float = 1, cost_dice: float = 1, num_points: int = 0):\n        \"\"\"Creates the matcher\n\n        Params:\n            cost_class: This is the relative weight of the classification error in the matching cost\n            cost_mask: This is the relative weight of the focal loss of the binary mask in the matching cost\n            cost_dice: This is the relative weight of the dice loss of the binary mask in the matching cost\n        \"\"\"\n        super().__init__()\n        self.cost_class = cost_class\n        self.cost_mask = cost_mask\n        self.cost_dice = cost_dice\n\n        self.cost_giou = 2.0\n        self.cost_bbox = 5.0\n\n        assert cost_class != 0 or cost_mask != 0 or cost_dice != 0, \"all costs can't be 0\"\n\n        self.num_points = num_points\n\n    @torch.no_grad()\n    def memory_efficient_forward(self, outputs, targets):\n        \"\"\"More memory-friendly matching\"\"\"\n        bs, num_queries = outputs[\"pred_logits\"].shape[:2]\n\n        indices = []\n\n        # Iterate through batch size\n        for b in range(bs):\n\n            out_prob = outputs[\"pred_logits\"][b].softmax(-1)  # [num_queries, num_classes]\n            tgt_ids = targets[b][\"labels\"]\n\n            # Compute the classification cost. 
Contrary to the loss, we don't use the NLL,\n            # but approximate it in 1 - proba[target class].\n            # The 1 is a constant that doesn't change the matching, it can be omitted.\n            cost_class = -out_prob[:, tgt_ids]\n\n            out_mask = outputs[\"pred_masks\"][b]  # [num_queries, H_pred, W_pred]\n            out_mask_box = masks_to_boxes_cc((out_mask.sigmoid() > 0.5).float())\n            # gt masks are already padded when preparing target\n            tgt_mask = targets[b][\"masks\"].to(out_mask)\n            tgt_mask_box = masks_to_boxes_cc(tgt_mask)\n            # print('tgt_mask_box shape:', tgt_mask_box.shape)\n\n            with autocast(enabled=False):\n                cost_bbox = torch.cdist(out_mask_box, tgt_mask_box)\n                cost_giou = -generalized_box_iou(out_mask_box, tgt_mask_box)\n                if torch.isnan(cost_bbox).any():\n                    print('cost_bbox:', cost_bbox)\n                if torch.isnan(cost_giou).any():\n                    print('cost_giou:', cost_giou)\n\n            C = (\n                self.cost_bbox * cost_bbox\n                + self.cost_class * cost_class\n                + self.cost_giou * cost_giou\n            )\n\n            C = C.reshape(num_queries, -1).cpu()\n\n            indices.append(linear_sum_assignment(C))\n\n        return [\n            (torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64))\n            for i, j in indices\n        ]\n\n    @torch.no_grad()\n    def forward(self, outputs, targets):\n        \"\"\"Performs the matching\n\n        Params:\n            outputs: This is a dict that contains at least these entries:\n                 \"pred_logits\": Tensor of dim [batch_size, num_queries, num_classes] with the classification logits\n                 \"pred_masks\": Tensor of dim [batch_size, num_queries, H_pred, W_pred] with the predicted masks\n\n            targets: This is a list of targets 
(len(targets) = batch_size), where each target is a dict containing:\n                 \"labels\": Tensor of dim [num_target_boxes] (where num_target_boxes is the number of ground-truth\n                           objects in the target) containing the class labels\n                 \"masks\": Tensor of dim [num_target_boxes, H_gt, W_gt] containing the target masks\n\n        Returns:\n            A list of size batch_size, containing tuples of (index_i, index_j) where:\n                - index_i is the indices of the selected predictions (in order)\n                - index_j is the indices of the corresponding selected targets (in order)\n            For each batch element, it holds:\n                len(index_i) = len(index_j) = min(num_queries, num_target_boxes)\n        \"\"\"\n        return self.memory_efficient_forward(outputs, targets)\n\n    def __repr__(self, _repr_indent=4):\n        head = \"Matcher \" + self.__class__.__name__\n        body = [\n            \"cost_class: {}\".format(self.cost_class),\n            \"cost_mask: {}\".format(self.cost_mask),\n            \"cost_dice: {}\".format(self.cost_dice),\n        ]\n        lines = [head] + [\" \" * _repr_indent + line for line in body]\n        return \"\\n\".join(lines)\n"
  },
  {
    "path": "mask2former/modeling/meta_arch/__init__.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n"
  },
  {
    "path": "mask2former/modeling/meta_arch/mask_former_head.py",
    "content": "import logging\nfrom copy import deepcopy\nfrom typing import Callable, Dict, List, Optional, Tuple, Union\n\nimport fvcore.nn.weight_init as weight_init\nfrom torch import nn\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.layers import Conv2d, ShapeSpec, get_norm\nfrom detectron2.modeling import SEM_SEG_HEADS_REGISTRY\n\nfrom ..transformer_decoder.maskformer_transformer_decoder import build_transformer_decoder\nfrom ..pixel_decoder.fpn import build_pixel_decoder\n\n    \n@SEM_SEG_HEADS_REGISTRY.register()\nclass MaskFormerHead(nn.Module):\n\n    _version = 2\n\n    def _load_from_state_dict(\n        self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs\n    ):\n        version = local_metadata.get(\"version\", None)\n        if version is None or version < 2:\n            # Do not warn if train from scratch\n            scratch = True\n            logger = logging.getLogger(__name__)\n            for k in list(state_dict.keys()):\n                newk = k\n                '''\n                if \"sem_seg_head\" in k and not k.startswith(prefix + \"predictor\"):\n                    newk = k.replace(prefix, prefix + \"pixel_decoder.\")\n                    # logger.debug(f\"{k} ==> {newk}\")\n                '''\n                if newk != k:\n                    state_dict[newk] = state_dict[k]\n                    del state_dict[k]\n                    scratch = False\n\n            if not scratch:\n                logger.warning(\n                    f\"Weight format of {self.__class__.__name__} have changed! \"\n                    \"Please upgrade your models. 
Applying automatic conversion now ...\"\n                )\n\n    @configurable\n    def __init__(\n        self,\n        input_shape: Dict[str, ShapeSpec],\n        *,\n        num_classes: int,\n        pixel_decoder: nn.Module,\n        loss_weight: float = 1.0,\n        ignore_value: int = -1,\n        # extra parameters\n        transformer_predictor: nn.Module,\n        transformer_in_feature: str,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            input_shape: shapes (channels and stride) of the input features\n            num_classes: number of classes to predict\n            pixel_decoder: the pixel decoder module\n            loss_weight: loss weight\n            ignore_value: category id to be ignored during training.\n            transformer_predictor: the transformer decoder that makes prediction\n            transformer_in_feature: input feature name to the transformer_predictor\n        \"\"\"\n        super().__init__()\n        input_shape = sorted(input_shape.items(), key=lambda x: x[1].stride)\n        self.in_features = [k for k, v in input_shape]\n        feature_strides = [v.stride for k, v in input_shape]\n        feature_channels = [v.channels for k, v in input_shape]\n\n        self.ignore_value = ignore_value\n        self.common_stride = 4\n        self.loss_weight = loss_weight\n\n        self.pixel_decoder = pixel_decoder\n        self.predictor = transformer_predictor\n        self.transformer_in_feature = transformer_in_feature\n\n        self.num_classes = num_classes\n\n    @classmethod\n    def from_config(cls, cfg, input_shape: Dict[str, ShapeSpec]):\n        # figure out in_channels to transformer predictor\n        if cfg.MODEL.MASK_FORMER.TRANSFORMER_IN_FEATURE == \"transformer_encoder\":\n            transformer_predictor_in_channels = cfg.MODEL.SEM_SEG_HEAD.CONVS_DIM\n        elif cfg.MODEL.MASK_FORMER.TRANSFORMER_IN_FEATURE == \"pixel_embedding\":\n            
transformer_predictor_in_channels = cfg.MODEL.SEM_SEG_HEAD.MASK_DIM\n        elif cfg.MODEL.MASK_FORMER.TRANSFORMER_IN_FEATURE == \"multi_scale_pixel_decoder\":  # for maskformer2\n            transformer_predictor_in_channels = cfg.MODEL.SEM_SEG_HEAD.CONVS_DIM\n        else:\n            transformer_predictor_in_channels = input_shape[cfg.MODEL.MASK_FORMER.TRANSFORMER_IN_FEATURE].channels\n\n        return {\n            \"input_shape\": {\n                k: v for k, v in input_shape.items() if k in cfg.MODEL.SEM_SEG_HEAD.IN_FEATURES\n            },\n            \"ignore_value\": cfg.MODEL.SEM_SEG_HEAD.IGNORE_VALUE,\n            \"num_classes\": cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES,\n            \"pixel_decoder\": build_pixel_decoder(cfg, input_shape),\n            \"loss_weight\": cfg.MODEL.SEM_SEG_HEAD.LOSS_WEIGHT,\n            \"transformer_in_feature\": cfg.MODEL.MASK_FORMER.TRANSFORMER_IN_FEATURE,\n            \"transformer_predictor\": build_transformer_decoder(\n                cfg,\n                transformer_predictor_in_channels,\n                mask_classification=True,\n            ),\n        }\n\n    def forward(self, features, mask=None):\n        return self.layers(features, mask)\n\n    def layers(self, features, mask=None):\n        mask_features, transformer_encoder_features, multi_scale_features = self.pixel_decoder.forward_features(features)\n        if self.transformer_in_feature == \"multi_scale_pixel_decoder\":\n            predictions = self.predictor(multi_scale_features, mask_features, mask)\n        else:\n            if self.transformer_in_feature == \"transformer_encoder\":\n                assert (\n                    transformer_encoder_features is not None\n                ), \"Please use the TransformerEncoderPixelDecoder.\"\n                predictions = self.predictor(transformer_encoder_features, mask_features, mask)\n            elif self.transformer_in_feature == \"pixel_embedding\":\n                predictions = 
self.predictor(mask_features, mask_features, mask)\n            else:\n                predictions = self.predictor(features[self.transformer_in_feature], mask_features, mask)\n        return predictions\n"
  },
  {
    "path": "mask2former/modeling/meta_arch/per_pixel_baseline.py",
    "content": "import logging\nfrom typing import Callable, Dict, List, Optional, Tuple, Union\n\nimport fvcore.nn.weight_init as weight_init\nfrom torch import nn\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.layers import Conv2d, ShapeSpec, get_norm\nfrom detectron2.modeling import SEM_SEG_HEADS_REGISTRY\n\nfrom ..transformer_decoder.maskformer_transformer_decoder import StandardTransformerDecoder\nfrom ..pixel_decoder.fpn import build_pixel_decoder\n\n\n@SEM_SEG_HEADS_REGISTRY.register()\nclass PerPixelBaselineHead(nn.Module):\n\n    _version = 2\n\n    def _load_from_state_dict(\n        self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs\n    ):\n        version = local_metadata.get(\"version\", None)\n        if version is None or version < 2:\n            logger = logging.getLogger(__name__)\n            # Do not warn if train from scratch\n            scratch = True\n            logger = logging.getLogger(__name__)\n            for k in list(state_dict.keys()):\n                newk = k\n                if \"sem_seg_head\" in k and not k.startswith(prefix + \"predictor\"):\n                    newk = k.replace(prefix, prefix + \"pixel_decoder.\")\n                    # logger.warning(f\"{k} ==> {newk}\")\n                if newk != k:\n                    state_dict[newk] = state_dict[k]\n                    del state_dict[k]\n                    scratch = False\n\n            if not scratch:\n                logger.warning(\n                    f\"Weight format of {self.__class__.__name__} have changed! \"\n                    \"Please upgrade your models. 
Applying automatic conversion now ...\"\n                )\n\n    @configurable\n    def __init__(\n        self,\n        input_shape: Dict[str, ShapeSpec],\n        *,\n        num_classes: int,\n        pixel_decoder: nn.Module,\n        loss_weight: float = 1.0,\n        ignore_value: int = -1,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            input_shape: shapes (channels and stride) of the input features\n            num_classes: number of classes to predict\n            pixel_decoder: the pixel decoder module\n            loss_weight: loss weight\n            ignore_value: category id to be ignored during training.\n        \"\"\"\n        super().__init__()\n        input_shape = sorted(input_shape.items(), key=lambda x: x[1].stride)\n        self.in_features = [k for k, v in input_shape]\n        feature_strides = [v.stride for k, v in input_shape]\n        feature_channels = [v.channels for k, v in input_shape]\n\n        self.ignore_value = ignore_value\n        self.common_stride = 4\n        self.loss_weight = loss_weight\n\n        self.pixel_decoder = pixel_decoder\n        self.predictor = Conv2d(\n            self.pixel_decoder.mask_dim, num_classes, kernel_size=1, stride=1, padding=0\n        )\n        weight_init.c2_msra_fill(self.predictor)\n\n    @classmethod\n    def from_config(cls, cfg, input_shape: Dict[str, ShapeSpec]):\n        return {\n            \"input_shape\": {\n                k: v for k, v in input_shape.items() if k in cfg.MODEL.SEM_SEG_HEAD.IN_FEATURES\n            },\n            \"ignore_value\": cfg.MODEL.SEM_SEG_HEAD.IGNORE_VALUE,\n            \"num_classes\": cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES,\n            \"pixel_decoder\": build_pixel_decoder(cfg, input_shape),\n            \"loss_weight\": cfg.MODEL.SEM_SEG_HEAD.LOSS_WEIGHT,\n        }\n\n    def forward(self, features, targets=None):\n        \"\"\"\n        Returns:\n            In training, returns (None, dict of 
losses)\n            In inference, returns (CxHxW logits, {})\n        \"\"\"\n        x = self.layers(features)\n        if self.training:\n            return None, self.losses(x, targets)\n        else:\n            x = F.interpolate(\n                x, scale_factor=self.common_stride, mode=\"bilinear\", align_corners=False\n            )\n            return x, {}\n\n    def layers(self, features):\n        x, _, _ = self.pixel_decoder.forward_features(features)\n        x = self.predictor(x)\n        return x\n\n    def losses(self, predictions, targets):\n        predictions = predictions.float()  # https://github.com/pytorch/pytorch/issues/48163\n        predictions = F.interpolate(\n            predictions, scale_factor=self.common_stride, mode=\"bilinear\", align_corners=False\n        )\n        loss = F.cross_entropy(\n            predictions, targets, reduction=\"mean\", ignore_index=self.ignore_value\n        )\n        losses = {\"loss_sem_seg\": loss * self.loss_weight}\n        return losses\n\n\n@SEM_SEG_HEADS_REGISTRY.register()\nclass PerPixelBaselinePlusHead(PerPixelBaselineHead):\n    def _load_from_state_dict(\n        self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs\n    ):\n        version = local_metadata.get(\"version\", None)\n        if version is None or version < 2:\n            # Do not warn if train from scratch\n            scratch = True\n            logger = logging.getLogger(__name__)\n            for k in list(state_dict.keys()):\n                newk = k\n                if \"sem_seg_head\" in k and not k.startswith(prefix + \"predictor\"):\n                    newk = k.replace(prefix, prefix + \"pixel_decoder.\")\n                    logger.debug(f\"{k} ==> {newk}\")\n                if newk != k:\n                    state_dict[newk] = state_dict[k]\n                    del state_dict[k]\n                    scratch = False\n\n            if not scratch:\n                
logger.warning(\n                    f\"Weight format of {self.__class__.__name__} has changed! \"\n                    \"Please upgrade your models. Applying automatic conversion now ...\"\n                )\n\n    @configurable\n    def __init__(\n        self,\n        input_shape: Dict[str, ShapeSpec],\n        *,\n        # extra parameters\n        transformer_predictor: nn.Module,\n        transformer_in_feature: str,\n        deep_supervision: bool,\n        # inherit parameters\n        num_classes: int,\n        pixel_decoder: nn.Module,\n        loss_weight: float = 1.0,\n        ignore_value: int = -1,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            input_shape: shapes (channels and stride) of the input features\n            transformer_predictor: the transformer decoder that makes prediction\n            transformer_in_feature: input feature name to the transformer_predictor\n            deep_supervision: whether or not to add supervision to the output of\n                every transformer decoder layer\n            num_classes: number of classes to predict\n            pixel_decoder: the pixel decoder module\n            loss_weight: loss weight\n            ignore_value: category id to be ignored during training.\n        \"\"\"\n        super().__init__(\n            input_shape,\n            num_classes=num_classes,\n            pixel_decoder=pixel_decoder,\n            loss_weight=loss_weight,\n            ignore_value=ignore_value,\n        )\n\n        del self.predictor\n\n        self.predictor = transformer_predictor\n        self.transformer_in_feature = transformer_in_feature\n        self.deep_supervision = deep_supervision\n\n    @classmethod\n    def from_config(cls, cfg, input_shape: Dict[str, ShapeSpec]):\n        ret = super().from_config(cfg, input_shape)\n        ret[\"transformer_in_feature\"] = cfg.MODEL.MASK_FORMER.TRANSFORMER_IN_FEATURE\n        if 
cfg.MODEL.MASK_FORMER.TRANSFORMER_IN_FEATURE == \"transformer_encoder\":\n            in_channels = cfg.MODEL.SEM_SEG_HEAD.CONVS_DIM\n        else:\n            in_channels = input_shape[ret[\"transformer_in_feature\"]].channels\n        ret[\"transformer_predictor\"] = StandardTransformerDecoder(\n            cfg, in_channels, mask_classification=False\n        )\n        ret[\"deep_supervision\"] = cfg.MODEL.MASK_FORMER.DEEP_SUPERVISION\n        return ret\n\n    def forward(self, features, targets=None):\n        \"\"\"\n        Returns:\n            In training, returns (None, dict of losses)\n            In inference, returns (CxHxW logits, {})\n        \"\"\"\n        x, aux_outputs = self.layers(features)\n        if self.training:\n            if self.deep_supervision:\n                losses = self.losses(x, targets)\n                for i, aux_output in enumerate(aux_outputs):\n                    losses[\"loss_sem_seg\" + f\"_{i}\"] = self.losses(\n                        aux_output[\"pred_masks\"], targets\n                    )[\"loss_sem_seg\"]\n                return None, losses\n            else:\n                return None, self.losses(x, targets)\n        else:\n            x = F.interpolate(\n                x, scale_factor=self.common_stride, mode=\"bilinear\", align_corners=False\n            )\n            return x, {}\n\n    def layers(self, features):\n        mask_features, transformer_encoder_features, _ = self.pixel_decoder.forward_features(features)\n        if self.transformer_in_feature == \"transformer_encoder\":\n            assert (\n                transformer_encoder_features is not None\n            ), \"Please use the TransformerEncoderPixelDecoder.\"\n            predictions = self.predictor(transformer_encoder_features, mask_features)\n        else:\n            predictions = self.predictor(features[self.transformer_in_feature], mask_features)\n        if self.deep_supervision:\n            return 
predictions[\"pred_masks\"], predictions[\"aux_outputs\"]\n        else:\n            return predictions[\"pred_masks\"], None\n"
  },
  {
    "path": "mask2former/modeling/pixel_decoder/__init__.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n"
  },
  {
    "path": "mask2former/modeling/pixel_decoder/fpn.py",
    "content": "import logging\nimport numpy as np\nfrom typing import Callable, Dict, List, Optional, Tuple, Union\n\nimport fvcore.nn.weight_init as weight_init\nimport torch\nfrom torch import nn\nfrom torch.nn import functional as F\nfrom torch.nn.init import xavier_uniform_, constant_, uniform_, normal_\nfrom torch.cuda.amp import autocast\n\nfrom detectron2.config import configurable\nfrom detectron2.layers import Conv2d, DeformConv, ShapeSpec, get_norm\nfrom detectron2.modeling import SEM_SEG_HEADS_REGISTRY\n\nfrom ..transformer_decoder.position_encoding import PositionEmbeddingSine\nfrom ..transformer_decoder.transformer import TransformerEncoder, TransformerEncoderLayer, _get_clones, _get_activation_fn\n\n\ndef build_pixel_decoder(cfg, input_shape):\n    \"\"\"\n    Build a pixel decoder from `cfg.MODEL.MASK_FORMER.PIXEL_DECODER_NAME`.\n    \"\"\"\n    name = cfg.MODEL.SEM_SEG_HEAD.PIXEL_DECODER_NAME\n    model = SEM_SEG_HEADS_REGISTRY.get(name)(cfg, input_shape)\n    forward_features = getattr(model, \"forward_features\", None)\n    if not callable(forward_features):\n        raise ValueError(\n            \"Only SEM_SEG_HEADS with forward_features method can be used as pixel decoder. 
\"\n            f\"Please implement forward_features for {name} to only return mask features.\"\n        )\n    return model\n\n\n# This is a modified FPN decoder.\n@SEM_SEG_HEADS_REGISTRY.register()\nclass BasePixelDecoder(nn.Module):\n    @configurable\n    def __init__(\n        self,\n        input_shape: Dict[str, ShapeSpec],\n        *,\n        conv_dim: int,\n        mask_dim: int,\n        norm: Optional[Union[str, Callable]] = None,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            input_shape: shapes (channels and stride) of the input features\n            conv_dims: number of output channels for the intermediate conv layers.\n            mask_dim: number of output channels for the final conv layer.\n            norm (str or callable): normalization for all conv layers\n        \"\"\"\n        super().__init__()\n\n        input_shape = sorted(input_shape.items(), key=lambda x: x[1].stride)\n        self.in_features = [k for k, v in input_shape]  # starting from \"res2\" to \"res5\"\n        feature_channels = [v.channels for k, v in input_shape]\n\n        lateral_convs = []\n        output_convs = []\n\n        use_bias = norm == \"\"\n        for idx, in_channels in enumerate(feature_channels):\n            if idx == len(self.in_features) - 1:\n                output_norm = get_norm(norm, conv_dim)\n                output_conv = Conv2d(\n                    in_channels,\n                    conv_dim,\n                    kernel_size=3,\n                    stride=1,\n                    padding=1,\n                    bias=use_bias,\n                    norm=output_norm,\n                    activation=F.relu,\n                )\n                weight_init.c2_xavier_fill(output_conv)\n                self.add_module(\"layer_{}\".format(idx + 1), output_conv)\n\n                lateral_convs.append(None)\n                output_convs.append(output_conv)\n            else:\n                lateral_norm = 
get_norm(norm, conv_dim)\n                output_norm = get_norm(norm, conv_dim)\n\n                lateral_conv = Conv2d(\n                    in_channels, conv_dim, kernel_size=1, bias=use_bias, norm=lateral_norm\n                )\n                output_conv = Conv2d(\n                    conv_dim,\n                    conv_dim,\n                    kernel_size=3,\n                    stride=1,\n                    padding=1,\n                    bias=use_bias,\n                    norm=output_norm,\n                    activation=F.relu,\n                )\n                weight_init.c2_xavier_fill(lateral_conv)\n                weight_init.c2_xavier_fill(output_conv)\n                self.add_module(\"adapter_{}\".format(idx + 1), lateral_conv)\n                self.add_module(\"layer_{}\".format(idx + 1), output_conv)\n\n                lateral_convs.append(lateral_conv)\n                output_convs.append(output_conv)\n        # Place convs into top-down order (from low to high resolution)\n        # to make the top-down computation in forward clearer.\n        self.lateral_convs = lateral_convs[::-1]\n        self.output_convs = output_convs[::-1]\n\n        self.mask_dim = mask_dim\n        self.mask_features = Conv2d(\n            conv_dim,\n            mask_dim,\n            kernel_size=3,\n            stride=1,\n            padding=1,\n        )\n        weight_init.c2_xavier_fill(self.mask_features)\n\n        self.maskformer_num_feature_levels = 3  # always use 3 scales\n\n    @classmethod\n    def from_config(cls, cfg, input_shape: Dict[str, ShapeSpec]):\n        ret = {}\n        ret[\"input_shape\"] = {\n            k: v for k, v in input_shape.items() if k in cfg.MODEL.SEM_SEG_HEAD.IN_FEATURES\n        }\n        ret[\"conv_dim\"] = cfg.MODEL.SEM_SEG_HEAD.CONVS_DIM\n        ret[\"mask_dim\"] = cfg.MODEL.SEM_SEG_HEAD.MASK_DIM\n        ret[\"norm\"] = cfg.MODEL.SEM_SEG_HEAD.NORM\n        return ret\n\n    def forward_features(self, features):\n   
     multi_scale_features = []\n        num_cur_levels = 0\n        # Reverse feature maps into top-down order (from low to high resolution)\n        for idx, f in enumerate(self.in_features[::-1]):\n            x = features[f]\n            lateral_conv = self.lateral_convs[idx]\n            output_conv = self.output_convs[idx]\n            if lateral_conv is None:\n                y = output_conv(x)\n            else:\n                cur_fpn = lateral_conv(x)\n                # Following FPN implementation, we use nearest upsampling here\n                y = cur_fpn + F.interpolate(y, size=cur_fpn.shape[-2:], mode=\"nearest\")\n                y = output_conv(y)\n            if num_cur_levels < self.maskformer_num_feature_levels:\n                multi_scale_features.append(y)\n                num_cur_levels += 1\n        return self.mask_features(y), None, multi_scale_features\n\n    def forward(self, features, targets=None):\n        logger = logging.getLogger(__name__)\n        logger.warning(\"Calling forward() may cause unpredicted behavior of PixelDecoder module.\")\n        return self.forward_features(features)\n\n\nclass TransformerEncoderOnly(nn.Module):\n    def __init__(\n        self,\n        d_model=512,\n        nhead=8,\n        num_encoder_layers=6,\n        dim_feedforward=2048,\n        dropout=0.1,\n        activation=\"relu\",\n        normalize_before=False,\n    ):\n        super().__init__()\n\n        encoder_layer = TransformerEncoderLayer(\n            d_model, nhead, dim_feedforward, dropout, activation, normalize_before\n        )\n        encoder_norm = nn.LayerNorm(d_model) if normalize_before else None\n        self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)\n\n        self._reset_parameters()\n\n        self.d_model = d_model\n        self.nhead = nhead\n\n    def _reset_parameters(self):\n        for p in self.parameters():\n            if p.dim() > 1:\n                
nn.init.xavier_uniform_(p)\n\n    def forward(self, src, mask, pos_embed):\n        # flatten NxCxHxW to HWxNxC\n        bs, c, h, w = src.shape\n        src = src.flatten(2).permute(2, 0, 1)\n        pos_embed = pos_embed.flatten(2).permute(2, 0, 1)\n        if mask is not None:\n            mask = mask.flatten(1)\n\n        memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)\n        return memory.permute(1, 2, 0).view(bs, c, h, w)\n\n\n# This is a modified FPN decoder with extra Transformer encoder that processes the lowest-resolution feature map.\n@SEM_SEG_HEADS_REGISTRY.register()\nclass TransformerEncoderPixelDecoder(BasePixelDecoder):\n    @configurable\n    def __init__(\n        self,\n        input_shape: Dict[str, ShapeSpec],\n        *,\n        transformer_dropout: float,\n        transformer_nheads: int,\n        transformer_dim_feedforward: int,\n        transformer_enc_layers: int,\n        transformer_pre_norm: bool,\n        conv_dim: int,\n        mask_dim: int,\n        norm: Optional[Union[str, Callable]] = None,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            input_shape: shapes (channels and stride) of the input features\n            transformer_dropout: dropout probability in transformer\n            transformer_nheads: number of heads in transformer\n            transformer_dim_feedforward: dimension of feedforward network\n            transformer_enc_layers: number of transformer encoder layers\n            transformer_pre_norm: whether to use pre-layernorm or not\n            conv_dims: number of output channels for the intermediate conv layers.\n            mask_dim: number of output channels for the final conv layer.\n            norm (str or callable): normalization for all conv layers\n        \"\"\"\n        super().__init__(input_shape, conv_dim=conv_dim, mask_dim=mask_dim, norm=norm)\n\n        input_shape = sorted(input_shape.items(), key=lambda x: x[1].stride)\n  
      self.in_features = [k for k, v in input_shape]  # starting from \"res2\" to \"res5\"\n        feature_strides = [v.stride for k, v in input_shape]\n        feature_channels = [v.channels for k, v in input_shape]\n\n        in_channels = feature_channels[len(self.in_features) - 1]\n        self.input_proj = Conv2d(in_channels, conv_dim, kernel_size=1)\n        weight_init.c2_xavier_fill(self.input_proj)\n        self.transformer = TransformerEncoderOnly(\n            d_model=conv_dim,\n            dropout=transformer_dropout,\n            nhead=transformer_nheads,\n            dim_feedforward=transformer_dim_feedforward,\n            num_encoder_layers=transformer_enc_layers,\n            normalize_before=transformer_pre_norm,\n        )\n        N_steps = conv_dim // 2\n        self.pe_layer = PositionEmbeddingSine(N_steps, normalize=True)\n\n        # update layer\n        use_bias = norm == \"\"\n        output_norm = get_norm(norm, conv_dim)\n        output_conv = Conv2d(\n            conv_dim,\n            conv_dim,\n            kernel_size=3,\n            stride=1,\n            padding=1,\n            bias=use_bias,\n            norm=output_norm,\n            activation=F.relu,\n        )\n        weight_init.c2_xavier_fill(output_conv)\n        delattr(self, \"layer_{}\".format(len(self.in_features)))\n        self.add_module(\"layer_{}\".format(len(self.in_features)), output_conv)\n        self.output_convs[0] = output_conv\n\n    @classmethod\n    def from_config(cls, cfg, input_shape: Dict[str, ShapeSpec]):\n        ret = super().from_config(cfg, input_shape)\n        ret[\"transformer_dropout\"] = cfg.MODEL.MASK_FORMER.DROPOUT\n        ret[\"transformer_nheads\"] = cfg.MODEL.MASK_FORMER.NHEADS\n        ret[\"transformer_dim_feedforward\"] = cfg.MODEL.MASK_FORMER.DIM_FEEDFORWARD\n        ret[\n            \"transformer_enc_layers\"\n        ] = cfg.MODEL.SEM_SEG_HEAD.TRANSFORMER_ENC_LAYERS  # a separate config\n        ret[\"transformer_pre_norm\"] = 
cfg.MODEL.MASK_FORMER.PRE_NORM\n        return ret\n\n    def forward_features(self, features):\n        multi_scale_features = []\n        num_cur_levels = 0\n        # Reverse feature maps into top-down order (from low to high resolution)\n        for idx, f in enumerate(self.in_features[::-1]):\n            x = features[f]\n            lateral_conv = self.lateral_convs[idx]\n            output_conv = self.output_convs[idx]\n            if lateral_conv is None:\n                transformer = self.input_proj(x)\n                pos = self.pe_layer(x)\n                transformer = self.transformer(transformer, None, pos)\n                y = output_conv(transformer)\n                # save intermediate feature as input to Transformer decoder\n                transformer_encoder_features = transformer\n            else:\n                cur_fpn = lateral_conv(x)\n                # Following FPN implementation, we use nearest upsampling here\n                y = cur_fpn + F.interpolate(y, size=cur_fpn.shape[-2:], mode=\"nearest\")\n                y = output_conv(y)\n            if num_cur_levels < self.maskformer_num_feature_levels:\n                multi_scale_features.append(y)\n                num_cur_levels += 1\n        return self.mask_features(y), transformer_encoder_features, multi_scale_features\n\n    def forward(self, features, targets=None):\n        logger = logging.getLogger(__name__)\n        logger.warning(\"Calling forward() may cause unpredictable behavior of PixelDecoder module.\")\n        return self.forward_features(features)\n"
  },
  {
    "path": "mask2former/modeling/pixel_decoder/msdeformattn.py",
    "content": "import logging\nimport numpy as np\nfrom typing import Callable, Dict, List, Optional, Tuple, Union\n\nimport fvcore.nn.weight_init as weight_init\nimport torch\nfrom torch import nn\nfrom torch.nn import functional as F\nfrom torch.nn.init import xavier_uniform_, constant_, uniform_, normal_\nfrom torch.cuda.amp import autocast\n\nfrom detectron2.config import configurable\nfrom detectron2.layers import Conv2d, ShapeSpec, get_norm\nfrom detectron2.modeling import SEM_SEG_HEADS_REGISTRY\n\nfrom ..transformer_decoder.position_encoding import PositionEmbeddingSine\nfrom ..transformer_decoder.transformer import _get_clones, _get_activation_fn\nfrom .ops.modules import MSDeformAttn\n\n\n# MSDeformAttn Transformer encoder in deformable detr\nclass MSDeformAttnTransformerEncoderOnly(nn.Module):\n    def __init__(self, d_model=256, nhead=8,\n                 num_encoder_layers=6, dim_feedforward=1024, dropout=0.1,\n                 activation=\"relu\",\n                 num_feature_levels=4, enc_n_points=4,\n        ):\n        super().__init__()\n\n        self.d_model = d_model\n        self.nhead = nhead\n\n        encoder_layer = MSDeformAttnTransformerEncoderLayer(d_model, dim_feedforward,\n                                                            dropout, activation,\n                                                            num_feature_levels, nhead, enc_n_points)\n        self.encoder = MSDeformAttnTransformerEncoder(encoder_layer, num_encoder_layers)\n\n        self.level_embed = nn.Parameter(torch.Tensor(num_feature_levels, d_model))\n\n        self._reset_parameters()\n\n    def _reset_parameters(self):\n        for p in self.parameters():\n            if p.dim() > 1:\n                nn.init.xavier_uniform_(p)\n        for m in self.modules():\n            if isinstance(m, MSDeformAttn):\n                m._reset_parameters()\n        normal_(self.level_embed)\n\n    def get_valid_ratio(self, mask):\n        _, H, W = mask.shape\n        
valid_H = torch.sum(~mask[:, :, 0], 1)\n        valid_W = torch.sum(~mask[:, 0, :], 1)\n        valid_ratio_h = valid_H.float() / H\n        valid_ratio_w = valid_W.float() / W\n        valid_ratio = torch.stack([valid_ratio_w, valid_ratio_h], -1)\n        return valid_ratio\n\n    def forward(self, srcs, pos_embeds):\n        masks = [torch.zeros((x.size(0), x.size(2), x.size(3)), device=x.device, dtype=torch.bool) for x in srcs]\n        # prepare input for encoder\n        src_flatten = []\n        mask_flatten = []\n        lvl_pos_embed_flatten = []\n        spatial_shapes = []\n        for lvl, (src, mask, pos_embed) in enumerate(zip(srcs, masks, pos_embeds)):\n            bs, c, h, w = src.shape\n            spatial_shape = (h, w)\n            spatial_shapes.append(spatial_shape)\n            src = src.flatten(2).transpose(1, 2)\n            mask = mask.flatten(1)\n            pos_embed = pos_embed.flatten(2).transpose(1, 2)\n            lvl_pos_embed = pos_embed + self.level_embed[lvl].view(1, 1, -1)\n            lvl_pos_embed_flatten.append(lvl_pos_embed)\n            src_flatten.append(src)\n            mask_flatten.append(mask)\n        src_flatten = torch.cat(src_flatten, 1)\n        mask_flatten = torch.cat(mask_flatten, 1)\n        lvl_pos_embed_flatten = torch.cat(lvl_pos_embed_flatten, 1)\n        spatial_shapes = torch.as_tensor(spatial_shapes, dtype=torch.long, device=src_flatten.device)\n        level_start_index = torch.cat((spatial_shapes.new_zeros((1, )), spatial_shapes.prod(1).cumsum(0)[:-1]))\n        valid_ratios = torch.stack([self.get_valid_ratio(m) for m in masks], 1)\n\n        # encoder\n        memory = self.encoder(src_flatten, spatial_shapes, level_start_index, valid_ratios, lvl_pos_embed_flatten, mask_flatten)\n\n        return memory, spatial_shapes, level_start_index\n\n\nclass MSDeformAttnTransformerEncoderLayer(nn.Module):\n    def __init__(self,\n                 d_model=256, d_ffn=1024,\n                 dropout=0.1, 
activation=\"relu\",\n                 n_levels=4, n_heads=8, n_points=4):\n        super().__init__()\n\n        # self attention\n        self.self_attn = MSDeformAttn(d_model, n_levels, n_heads, n_points)\n        self.dropout1 = nn.Dropout(dropout)\n        self.norm1 = nn.LayerNorm(d_model)\n\n        # ffn\n        self.linear1 = nn.Linear(d_model, d_ffn)\n        self.activation = _get_activation_fn(activation)\n        self.dropout2 = nn.Dropout(dropout)\n        self.linear2 = nn.Linear(d_ffn, d_model)\n        self.dropout3 = nn.Dropout(dropout)\n        self.norm2 = nn.LayerNorm(d_model)\n\n    @staticmethod\n    def with_pos_embed(tensor, pos):\n        return tensor if pos is None else tensor + pos\n\n    def forward_ffn(self, src):\n        src2 = self.linear2(self.dropout2(self.activation(self.linear1(src))))\n        src = src + self.dropout3(src2)\n        src = self.norm2(src)\n        return src\n\n    def forward(self, src, pos, reference_points, spatial_shapes, level_start_index, padding_mask=None):\n        # self attention\n        src2 = self.self_attn(self.with_pos_embed(src, pos), reference_points, src, spatial_shapes, level_start_index, padding_mask)\n        src = src + self.dropout1(src2)\n        src = self.norm1(src)\n\n        # ffn\n        src = self.forward_ffn(src)\n\n        return src\n\n\nclass MSDeformAttnTransformerEncoder(nn.Module):\n    def __init__(self, encoder_layer, num_layers):\n        super().__init__()\n        self.layers = _get_clones(encoder_layer, num_layers)\n        self.num_layers = num_layers\n\n    @staticmethod\n    def get_reference_points(spatial_shapes, valid_ratios, device):\n        reference_points_list = []\n        for lvl, (H_, W_) in enumerate(spatial_shapes):\n\n            ref_y, ref_x = torch.meshgrid(torch.linspace(0.5, H_ - 0.5, H_, dtype=torch.float32, device=device),\n                                          torch.linspace(0.5, W_ - 0.5, W_, dtype=torch.float32, device=device))\n        
    ref_y = ref_y.reshape(-1)[None] / (valid_ratios[:, None, lvl, 1] * H_)\n            ref_x = ref_x.reshape(-1)[None] / (valid_ratios[:, None, lvl, 0] * W_)\n            ref = torch.stack((ref_x, ref_y), -1)\n            reference_points_list.append(ref)\n        reference_points = torch.cat(reference_points_list, 1)\n        reference_points = reference_points[:, :, None] * valid_ratios[:, None]\n        return reference_points\n\n    def forward(self, src, spatial_shapes, level_start_index, valid_ratios, pos=None, padding_mask=None):\n        output = src\n        reference_points = self.get_reference_points(spatial_shapes, valid_ratios, device=src.device)\n        for _, layer in enumerate(self.layers):\n            output = layer(output, pos, reference_points, spatial_shapes, level_start_index, padding_mask)\n\n        return output\n\n\n@SEM_SEG_HEADS_REGISTRY.register()\nclass MSDeformAttnPixelDecoder(nn.Module):\n    @configurable\n    def __init__(\n        self,\n        input_shape: Dict[str, ShapeSpec],\n        *,\n        transformer_dropout: float,\n        transformer_nheads: int,\n        transformer_dim_feedforward: int,\n        transformer_enc_layers: int,\n        conv_dim: int,\n        mask_dim: int,\n        norm: Optional[Union[str, Callable]] = None,\n        # deformable transformer encoder args\n        transformer_in_features: List[str],\n        common_stride: int,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            input_shape: shapes (channels and stride) of the input features\n            transformer_dropout: dropout probability in transformer\n            transformer_nheads: number of heads in transformer\n            transformer_dim_feedforward: dimension of feedforward network\n            transformer_enc_layers: number of transformer encoder layers\n            conv_dims: number of output channels for the intermediate conv layers.\n            mask_dim: number of output channels 
for the final conv layer.\n            norm (str or callable): normalization for all conv layers\n        \"\"\"\n        super().__init__()\n        transformer_input_shape = {\n            k: v for k, v in input_shape.items() if k in transformer_in_features\n        }\n\n        # this is the input shape of pixel decoder\n        input_shape = sorted(input_shape.items(), key=lambda x: x[1].stride)\n        self.in_features = [k for k, v in input_shape]  # starting from \"res2\" to \"res5\"\n        self.feature_strides = [v.stride for k, v in input_shape]\n        self.feature_channels = [v.channels for k, v in input_shape]\n        \n        # this is the input shape of transformer encoder (could use less features than pixel decoder\n        transformer_input_shape = sorted(transformer_input_shape.items(), key=lambda x: x[1].stride)\n        self.transformer_in_features = [k for k, v in transformer_input_shape]  # starting from \"res2\" to \"res5\"\n        transformer_in_channels = [v.channels for k, v in transformer_input_shape]\n        self.transformer_feature_strides = [v.stride for k, v in transformer_input_shape]  # to decide extra FPN layers\n\n        self.transformer_num_feature_levels = len(self.transformer_in_features)\n        if self.transformer_num_feature_levels > 1:\n            input_proj_list = []\n            # from low resolution to high resolution (res5 -> res2)\n            for in_channels in transformer_in_channels[::-1]:\n                input_proj_list.append(nn.Sequential(\n                    nn.Conv2d(in_channels, conv_dim, kernel_size=1),\n                    nn.GroupNorm(32, conv_dim),\n                ))\n            self.input_proj = nn.ModuleList(input_proj_list)\n        else:\n            self.input_proj = nn.ModuleList([\n                nn.Sequential(\n                    nn.Conv2d(transformer_in_channels[-1], conv_dim, kernel_size=1),\n                    nn.GroupNorm(32, conv_dim),\n                )])\n\n        for proj 
in self.input_proj:\n            nn.init.xavier_uniform_(proj[0].weight, gain=1)\n            nn.init.constant_(proj[0].bias, 0)\n\n        self.transformer = MSDeformAttnTransformerEncoderOnly(\n            d_model=conv_dim,\n            dropout=transformer_dropout,\n            nhead=transformer_nheads,\n            dim_feedforward=transformer_dim_feedforward,\n            num_encoder_layers=transformer_enc_layers,\n            num_feature_levels=self.transformer_num_feature_levels,\n        )\n        N_steps = conv_dim // 2\n        self.pe_layer = PositionEmbeddingSine(N_steps, normalize=True)\n\n        self.mask_dim = mask_dim\n        # use 1x1 conv instead\n        self.mask_features = Conv2d(\n            conv_dim,\n            mask_dim,\n            kernel_size=1,\n            stride=1,\n            padding=0,\n        )\n        weight_init.c2_xavier_fill(self.mask_features)\n        \n        self.maskformer_num_feature_levels = 3  # always use 3 scales\n        self.common_stride = common_stride\n\n        # extra fpn levels\n        stride = min(self.transformer_feature_strides)\n        self.num_fpn_levels = int(np.log2(stride) - np.log2(self.common_stride))\n\n        lateral_convs = []\n        output_convs = []\n\n        use_bias = norm == \"\"\n        for idx, in_channels in enumerate(self.feature_channels[:self.num_fpn_levels]):\n            lateral_norm = get_norm(norm, conv_dim)\n            output_norm = get_norm(norm, conv_dim)\n\n            lateral_conv = Conv2d(\n                in_channels, conv_dim, kernel_size=1, bias=use_bias, norm=lateral_norm\n            )\n            output_conv = Conv2d(\n                conv_dim,\n                conv_dim,\n                kernel_size=3,\n                stride=1,\n                padding=1,\n                bias=use_bias,\n                norm=output_norm,\n                activation=F.relu,\n            )\n            weight_init.c2_xavier_fill(lateral_conv)\n            
weight_init.c2_xavier_fill(output_conv)\n            self.add_module(\"adapter_{}\".format(idx + 1), lateral_conv)\n            self.add_module(\"layer_{}\".format(idx + 1), output_conv)\n\n            lateral_convs.append(lateral_conv)\n            output_convs.append(output_conv)\n        # Place convs into top-down order (from low to high resolution)\n        # to make the top-down computation in forward clearer.\n        self.lateral_convs = lateral_convs[::-1]\n        self.output_convs = output_convs[::-1]\n\n    @classmethod\n    def from_config(cls, cfg, input_shape: Dict[str, ShapeSpec]):\n        ret = {}\n        ret[\"input_shape\"] = {\n            k: v for k, v in input_shape.items() if k in cfg.MODEL.SEM_SEG_HEAD.IN_FEATURES\n        }\n        ret[\"conv_dim\"] = cfg.MODEL.SEM_SEG_HEAD.CONVS_DIM\n        ret[\"mask_dim\"] = cfg.MODEL.SEM_SEG_HEAD.MASK_DIM\n        ret[\"norm\"] = cfg.MODEL.SEM_SEG_HEAD.NORM\n        ret[\"transformer_dropout\"] = cfg.MODEL.MASK_FORMER.DROPOUT\n        ret[\"transformer_nheads\"] = cfg.MODEL.MASK_FORMER.NHEADS\n        # ret[\"transformer_dim_feedforward\"] = cfg.MODEL.MASK_FORMER.DIM_FEEDFORWARD\n        ret[\"transformer_dim_feedforward\"] = 1024  # use 1024 for deformable transformer encoder\n        ret[\n            \"transformer_enc_layers\"\n        ] = cfg.MODEL.SEM_SEG_HEAD.TRANSFORMER_ENC_LAYERS  # a separate config\n        ret[\"transformer_in_features\"] = cfg.MODEL.SEM_SEG_HEAD.DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES\n        ret[\"common_stride\"] = cfg.MODEL.SEM_SEG_HEAD.COMMON_STRIDE\n        return ret\n\n    @autocast(enabled=False)\n    def forward_features(self, features):\n        srcs = []\n        pos = []\n        # Reverse feature maps into top-down order (from low to high resolution)\n        for idx, f in enumerate(self.transformer_in_features[::-1]):\n            x = features[f].float()  # deformable detr does not support half precision\n            
srcs.append(self.input_proj[idx](x))\n            pos.append(self.pe_layer(x))\n\n        y, spatial_shapes, level_start_index = self.transformer(srcs, pos)\n        bs = y.shape[0]\n\n        split_size_or_sections = [None] * self.transformer_num_feature_levels\n        for i in range(self.transformer_num_feature_levels):\n            if i < self.transformer_num_feature_levels - 1:\n                split_size_or_sections[i] = level_start_index[i + 1] - level_start_index[i]\n            else:\n                split_size_or_sections[i] = y.shape[1] - level_start_index[i]\n        y = torch.split(y, split_size_or_sections, dim=1)\n\n        out = []\n        multi_scale_features = []\n        num_cur_levels = 0\n        for i, z in enumerate(y):\n            out.append(z.transpose(1, 2).view(bs, -1, spatial_shapes[i][0], spatial_shapes[i][1]))\n\n        # append `out` with extra FPN levels\n        # Reverse feature maps into top-down order (from low to high resolution)\n        for idx, f in enumerate(self.in_features[:self.num_fpn_levels][::-1]):\n            x = features[f].float()\n            lateral_conv = self.lateral_convs[idx]\n            output_conv = self.output_convs[idx]\n            cur_fpn = lateral_conv(x)\n            # Following FPN implementation, we use nearest upsampling here\n            y = cur_fpn + F.interpolate(out[-1], size=cur_fpn.shape[-2:], mode=\"bilinear\", align_corners=False)\n            y = output_conv(y)\n            out.append(y)\n\n        for o in out:\n            if num_cur_levels < self.maskformer_num_feature_levels:\n                multi_scale_features.append(o)\n                num_cur_levels += 1\n\n        return self.mask_features(out[-1]), out[0], multi_scale_features\n"
  },
  {
    "path": "mask2former/modeling/pixel_decoder/ops/functions/__init__.py",
    "content": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\nfrom .ms_deform_attn_func import MSDeformAttnFunction\n\n"
  },
  {
    "path": "mask2former/modeling/pixel_decoder/ops/functions/ms_deform_attn_func.py",
    "content": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\nfrom __future__ import absolute_import\nfrom __future__ import print_function\nfrom __future__ import division\n\nimport torch\nimport torch.nn.functional as F\nfrom torch.autograd import Function\nfrom torch.autograd.function import once_differentiable\n\ntry:\n    import MultiScaleDeformableAttention as MSDA\nexcept ModuleNotFoundError as e:\n    info_string = (\n        \"\\n\\nPlease compile MultiScaleDeformableAttention CUDA op with the following commands:\\n\"\n        \"\\t`cd mask2former/modeling/pixel_decoder/ops`\\n\"\n        \"\\t`sh make.sh`\\n\"\n    )\n    raise ModuleNotFoundError(info_string)\n\n\nclass MSDeformAttnFunction(Function):\n    @staticmethod\n    def forward(ctx, value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights, im2col_step):\n        ctx.im2col_step = im2col_step\n        output = MSDA.ms_deform_attn_forward(\n            value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights, ctx.im2col_step)\n        ctx.save_for_backward(value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights)\n        return output\n\n    @staticmethod\n    @once_differentiable\n    def backward(ctx, grad_output):\n        value, value_spatial_shapes, value_level_start_index, sampling_locations, 
attention_weights = ctx.saved_tensors\n        grad_value, grad_sampling_loc, grad_attn_weight = \\\n            MSDA.ms_deform_attn_backward(\n                value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights, grad_output, ctx.im2col_step)\n\n        return grad_value, None, None, grad_sampling_loc, grad_attn_weight, None\n\n\ndef ms_deform_attn_core_pytorch(value, value_spatial_shapes, sampling_locations, attention_weights):\n    # PyTorch reference implementation, for debugging and testing only;\n    # use the CUDA version for real workloads\n    N_, S_, M_, D_ = value.shape\n    _, Lq_, M_, L_, P_, _ = sampling_locations.shape\n    value_list = value.split([H_ * W_ for H_, W_ in value_spatial_shapes], dim=1)\n    sampling_grids = 2 * sampling_locations - 1\n    sampling_value_list = []\n    for lid_, (H_, W_) in enumerate(value_spatial_shapes):\n        # N_, H_*W_, M_, D_ -> N_, H_*W_, M_*D_ -> N_, M_*D_, H_*W_ -> N_*M_, D_, H_, W_\n        value_l_ = value_list[lid_].flatten(2).transpose(1, 2).reshape(N_*M_, D_, H_, W_)\n        # N_, Lq_, M_, P_, 2 -> N_, M_, Lq_, P_, 2 -> N_*M_, Lq_, P_, 2\n        sampling_grid_l_ = sampling_grids[:, :, :, lid_].transpose(1, 2).flatten(0, 1)\n        # N_*M_, D_, Lq_, P_\n        sampling_value_l_ = F.grid_sample(value_l_, sampling_grid_l_,\n                                          mode='bilinear', padding_mode='zeros', align_corners=False)\n        sampling_value_list.append(sampling_value_l_)\n    # (N_, Lq_, M_, L_, P_) -> (N_, M_, Lq_, L_, P_) -> (N_*M_, 1, Lq_, L_*P_)\n    attention_weights = attention_weights.transpose(1, 2).reshape(N_*M_, 1, Lq_, L_*P_)\n    output = (torch.stack(sampling_value_list, dim=-2).flatten(-2) * attention_weights).sum(-1).view(N_, M_*D_, Lq_)\n    return output.transpose(1, 2).contiguous()\n"
  },
  {
    "path": "mask2former/modeling/pixel_decoder/ops/make.sh",
    "content": "#!/usr/bin/env bash\n# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\npython setup.py build install\n"
  },
  {
    "path": "mask2former/modeling/pixel_decoder/ops/modules/__init__.py",
    "content": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\nfrom .ms_deform_attn import MSDeformAttn\n"
  },
  {
    "path": "mask2former/modeling/pixel_decoder/ops/modules/ms_deform_attn.py",
    "content": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\nfrom __future__ import absolute_import\nfrom __future__ import print_function\nfrom __future__ import division\n\nimport warnings\nimport math\n\nimport torch\nfrom torch import nn\nimport torch.nn.functional as F\nfrom torch.nn.init import xavier_uniform_, constant_\n\nfrom ..functions import MSDeformAttnFunction\nfrom ..functions.ms_deform_attn_func import ms_deform_attn_core_pytorch\n\n\ndef _is_power_of_2(n):\n    if (not isinstance(n, int)) or (n < 0):\n        raise ValueError(\"invalid input for _is_power_of_2: {} (type: {})\".format(n, type(n)))\n    return (n & (n-1) == 0) and n != 0\n\n\nclass MSDeformAttn(nn.Module):\n    def __init__(self, d_model=256, n_levels=4, n_heads=8, n_points=4):\n        \"\"\"\n        Multi-Scale Deformable Attention Module\n        :param d_model      hidden dimension\n        :param n_levels     number of feature levels\n        :param n_heads      number of attention heads\n        :param n_points     number of sampling points per attention head per feature level\n        \"\"\"\n        super().__init__()\n        if d_model % n_heads != 0:\n            raise ValueError('d_model must be divisible by n_heads, but got {} and {}'.format(d_model, n_heads))\n        _d_per_head = d_model // n_heads\n        # you'd better set _d_per_head to a power of 2 which is more efficient in our 
CUDA implementation\n        if not _is_power_of_2(_d_per_head):\n            warnings.warn(\"You'd better set d_model in MSDeformAttn to make the dimension of each attention head a power of 2 \"\n                          \"which is more efficient in our CUDA implementation.\")\n\n        self.im2col_step = 128\n\n        self.d_model = d_model\n        self.n_levels = n_levels\n        self.n_heads = n_heads\n        self.n_points = n_points\n\n        self.sampling_offsets = nn.Linear(d_model, n_heads * n_levels * n_points * 2)\n        self.attention_weights = nn.Linear(d_model, n_heads * n_levels * n_points)\n        self.value_proj = nn.Linear(d_model, d_model)\n        self.output_proj = nn.Linear(d_model, d_model)\n\n        self._reset_parameters()\n\n    def _reset_parameters(self):\n        constant_(self.sampling_offsets.weight.data, 0.)\n        thetas = torch.arange(self.n_heads, dtype=torch.float32) * (2.0 * math.pi / self.n_heads)\n        grid_init = torch.stack([thetas.cos(), thetas.sin()], -1)\n        grid_init = (grid_init / grid_init.abs().max(-1, keepdim=True)[0]).view(self.n_heads, 1, 1, 2).repeat(1, self.n_levels, self.n_points, 1)\n        for i in range(self.n_points):\n            grid_init[:, :, i, :] *= i + 1\n        with torch.no_grad():\n            self.sampling_offsets.bias = nn.Parameter(grid_init.view(-1))\n        constant_(self.attention_weights.weight.data, 0.)\n        constant_(self.attention_weights.bias.data, 0.)\n        xavier_uniform_(self.value_proj.weight.data)\n        constant_(self.value_proj.bias.data, 0.)\n        xavier_uniform_(self.output_proj.weight.data)\n        constant_(self.output_proj.bias.data, 0.)\n\n    def forward(self, query, reference_points, input_flatten, input_spatial_shapes, input_level_start_index, input_padding_mask=None):\n        \"\"\"\n        :param query                       (N, Length_{query}, C)\n        :param reference_points            (N, Length_{query}, n_levels, 2), range in 
[0, 1], top-left (0,0), bottom-right (1, 1), including padding area\n                                        or (N, Length_{query}, n_levels, 4), add additional (w, h) to form reference boxes\n        :param input_flatten               (N, \\sum_{l=0}^{L-1} H_l \\cdot W_l, C)\n        :param input_spatial_shapes        (n_levels, 2), [(H_0, W_0), (H_1, W_1), ..., (H_{L-1}, W_{L-1})]\n        :param input_level_start_index     (n_levels, ), [0, H_0*W_0, H_0*W_0+H_1*W_1, H_0*W_0+H_1*W_1+H_2*W_2, ..., H_0*W_0+H_1*W_1+...+H_{L-1}*W_{L-1}]\n        :param input_padding_mask          (N, \\sum_{l=0}^{L-1} H_l \\cdot W_l), True for padding elements, False for non-padding elements\n\n        :return output                     (N, Length_{query}, C)\n        \"\"\"\n        N, Len_q, _ = query.shape\n        N, Len_in, _ = input_flatten.shape\n        assert (input_spatial_shapes[:, 0] * input_spatial_shapes[:, 1]).sum() == Len_in\n\n        value = self.value_proj(input_flatten)\n        if input_padding_mask is not None:\n            value = value.masked_fill(input_padding_mask[..., None], float(0))\n        value = value.view(N, Len_in, self.n_heads, self.d_model // self.n_heads)\n        sampling_offsets = self.sampling_offsets(query).view(N, Len_q, self.n_heads, self.n_levels, self.n_points, 2)\n        attention_weights = self.attention_weights(query).view(N, Len_q, self.n_heads, self.n_levels * self.n_points)\n        attention_weights = F.softmax(attention_weights, -1).view(N, Len_q, self.n_heads, self.n_levels, self.n_points)\n        # N, Len_q, n_heads, n_levels, n_points, 2\n        if reference_points.shape[-1] == 2:\n            offset_normalizer = torch.stack([input_spatial_shapes[..., 1], input_spatial_shapes[..., 0]], -1)\n            sampling_locations = reference_points[:, :, None, :, None, :] \\\n                                 + sampling_offsets / offset_normalizer[None, None, None, :, None, :]\n        elif reference_points.shape[-1] == 4:\n           
 sampling_locations = reference_points[:, :, None, :, None, :2] \\\n                                 + sampling_offsets / self.n_points * reference_points[:, :, None, :, None, 2:] * 0.5\n        else:\n            raise ValueError(\n                'Last dim of reference_points must be 2 or 4, but got {} instead.'.format(reference_points.shape[-1]))\n        try:\n            output = MSDeformAttnFunction.apply(\n                value, input_spatial_shapes, input_level_start_index, sampling_locations, attention_weights, self.im2col_step)\n        except Exception:\n            # fall back to the pure-PyTorch implementation (e.g. CPU tensors or the CUDA op is not compiled)\n            output = ms_deform_attn_core_pytorch(value, input_spatial_shapes, sampling_locations, attention_weights)\n        # # For FLOPs calculation only\n        # output = ms_deform_attn_core_pytorch(value, input_spatial_shapes, sampling_locations, attention_weights)\n        output = self.output_proj(output)\n        return output\n"
  },
  {
    "path": "mask2former/modeling/pixel_decoder/ops/setup.py",
    "content": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\nimport os\nimport glob\n\nimport torch\n\nfrom torch.utils.cpp_extension import CUDA_HOME\nfrom torch.utils.cpp_extension import CppExtension\nfrom torch.utils.cpp_extension import CUDAExtension\n\nfrom setuptools import find_packages\nfrom setuptools import setup\n\nrequirements = [\"torch\", \"torchvision\"]\n\ndef get_extensions():\n    this_dir = os.path.dirname(os.path.abspath(__file__))\n    extensions_dir = os.path.join(this_dir, \"src\")\n\n    main_file = glob.glob(os.path.join(extensions_dir, \"*.cpp\"))\n    source_cpu = glob.glob(os.path.join(extensions_dir, \"cpu\", \"*.cpp\"))\n    source_cuda = glob.glob(os.path.join(extensions_dir, \"cuda\", \"*.cu\"))\n\n    sources = main_file + source_cpu\n    extension = CppExtension\n    extra_compile_args = {\"cxx\": []}\n    define_macros = []\n\n    # Force cuda since torch ask for a device, not if cuda is in fact available.\n    if (os.environ.get('FORCE_CUDA') or torch.cuda.is_available()) and CUDA_HOME is not None:\n        extension = CUDAExtension\n        sources += source_cuda\n        define_macros += [(\"WITH_CUDA\", None)]\n        extra_compile_args[\"nvcc\"] = [\n            \"-DCUDA_HAS_FP16=1\",\n            \"-D__CUDA_NO_HALF_OPERATORS__\",\n            \"-D__CUDA_NO_HALF_CONVERSIONS__\",\n            \"-D__CUDA_NO_HALF2_OPERATORS__\",\n        ]\n    
else:\n        if CUDA_HOME is None:\n            raise NotImplementedError('CUDA_HOME is None. Please set the environment variable CUDA_HOME.')\n        else:\n            raise NotImplementedError('No CUDA runtime was found. Set FORCE_CUDA=1 to force building the CUDA extension, or verify your installation with torch.cuda.is_available().')\n\n    sources = [os.path.join(extensions_dir, s) for s in sources]\n    include_dirs = [extensions_dir]\n    ext_modules = [\n        extension(\n            \"MultiScaleDeformableAttention\",\n            sources,\n            include_dirs=include_dirs,\n            define_macros=define_macros,\n            extra_compile_args=extra_compile_args,\n        )\n    ]\n    return ext_modules\n\nsetup(\n    name=\"MultiScaleDeformableAttention\",\n    version=\"1.0\",\n    author=\"Weijie Su\",\n    url=\"https://github.com/fundamentalvision/Deformable-DETR\",\n    description=\"PyTorch Wrapper for CUDA Functions of Multi-Scale Deformable Attention\",\n    packages=find_packages(exclude=(\"configs\", \"tests\",)),\n    ext_modules=get_extensions(),\n    cmdclass={\"build_ext\": torch.utils.cpp_extension.BuildExtension},\n)\n"
  },
  {
    "path": "mask2former/modeling/pixel_decoder/ops/src/cpu/ms_deform_attn_cpu.cpp",
    "content": "/*!\n**************************************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************************************\n* Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n**************************************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#include <vector>\n\n#include <ATen/ATen.h>\n#include <ATen/cuda/CUDAContext.h>\n\n\nat::Tensor\nms_deform_attn_cpu_forward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const int im2col_step)\n{\n    AT_ERROR(\"Not implement on cpu\");\n}\n\nstd::vector<at::Tensor>\nms_deform_attn_cpu_backward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const at::Tensor &grad_output,\n    const int im2col_step)\n{\n    AT_ERROR(\"Not implement on cpu\");\n}\n\n"
  },
  {
    "path": "mask2former/modeling/pixel_decoder/ops/src/cpu/ms_deform_attn_cpu.h",
    "content": "/*!\n**************************************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************************************\n* Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n**************************************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#pragma once\n#include <torch/extension.h>\n\nat::Tensor\nms_deform_attn_cpu_forward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const int im2col_step);\n\nstd::vector<at::Tensor>\nms_deform_attn_cpu_backward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const at::Tensor &grad_output,\n    const int im2col_step);\n\n\n"
  },
  {
    "path": "mask2former/modeling/pixel_decoder/ops/src/cuda/ms_deform_attn_cuda.cu",
    "content": "/*!\n**************************************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************************************\n* Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n**************************************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#include <vector>\n#include \"cuda/ms_deform_im2col_cuda.cuh\"\n\n#include <ATen/ATen.h>\n#include <ATen/cuda/CUDAContext.h>\n#include <cuda.h>\n#include <cuda_runtime.h>\n\n\nat::Tensor ms_deform_attn_cuda_forward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const int im2col_step)\n{\n    AT_ASSERTM(value.is_contiguous(), \"value tensor has to be contiguous\");\n    AT_ASSERTM(spatial_shapes.is_contiguous(), \"spatial_shapes tensor has to be contiguous\");\n    AT_ASSERTM(level_start_index.is_contiguous(), \"level_start_index tensor has to be contiguous\");\n    AT_ASSERTM(sampling_loc.is_contiguous(), \"sampling_loc tensor has to be contiguous\");\n    AT_ASSERTM(attn_weight.is_contiguous(), \"attn_weight tensor has to be contiguous\");\n\n    AT_ASSERTM(value.type().is_cuda(), \"value must be a CUDA tensor\");\n    AT_ASSERTM(spatial_shapes.type().is_cuda(), \"spatial_shapes must be a CUDA tensor\");\n    AT_ASSERTM(level_start_index.type().is_cuda(), \"level_start_index must be a CUDA tensor\");\n    AT_ASSERTM(sampling_loc.type().is_cuda(), \"sampling_loc must be a CUDA tensor\");\n    
AT_ASSERTM(attn_weight.type().is_cuda(), \"attn_weight must be a CUDA tensor\");\n\n    const int batch = value.size(0);\n    const int spatial_size = value.size(1);\n    const int num_heads = value.size(2);\n    const int channels = value.size(3);\n\n    const int num_levels = spatial_shapes.size(0);\n\n    const int num_query = sampling_loc.size(1);\n    const int num_point = sampling_loc.size(4);\n\n    const int im2col_step_ = std::min(batch, im2col_step);\n\n    AT_ASSERTM(batch % im2col_step_ == 0, \"batch(%d) must be divisible by im2col_step(%d)\", batch, im2col_step_);\n    \n    auto output = at::zeros({batch, num_query, num_heads, channels}, value.options());\n\n    const int batch_n = im2col_step_;\n    auto output_n = output.view({batch/im2col_step_, batch_n, num_query, num_heads, channels});\n    auto per_value_size = spatial_size * num_heads * channels;\n    auto per_sample_loc_size = num_query * num_heads * num_levels * num_point * 2;\n    auto per_attn_weight_size = num_query * num_heads * num_levels * num_point;\n    for (int n = 0; n < batch/im2col_step_; ++n)\n    {\n        auto columns = output_n.select(0, n);\n        AT_DISPATCH_FLOATING_TYPES(value.type(), \"ms_deform_attn_forward_cuda\", ([&] {\n            ms_deformable_im2col_cuda(at::cuda::getCurrentCUDAStream(),\n                value.data<scalar_t>() + n * im2col_step_ * per_value_size,\n                spatial_shapes.data<int64_t>(),\n                level_start_index.data<int64_t>(),\n                sampling_loc.data<scalar_t>() + n * im2col_step_ * per_sample_loc_size,\n                attn_weight.data<scalar_t>() + n * im2col_step_ * per_attn_weight_size,\n                batch_n, spatial_size, num_heads, channels, num_levels, num_query, num_point,\n                columns.data<scalar_t>());\n\n        }));\n    }\n\n    output = output.view({batch, num_query, num_heads*channels});\n\n    return output;\n}\n\n\nstd::vector<at::Tensor> ms_deform_attn_cuda_backward(\n    const at::Tensor 
&value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const at::Tensor &grad_output,\n    const int im2col_step)\n{\n\n    AT_ASSERTM(value.is_contiguous(), \"value tensor has to be contiguous\");\n    AT_ASSERTM(spatial_shapes.is_contiguous(), \"spatial_shapes tensor has to be contiguous\");\n    AT_ASSERTM(level_start_index.is_contiguous(), \"level_start_index tensor has to be contiguous\");\n    AT_ASSERTM(sampling_loc.is_contiguous(), \"sampling_loc tensor has to be contiguous\");\n    AT_ASSERTM(attn_weight.is_contiguous(), \"attn_weight tensor has to be contiguous\");\n    AT_ASSERTM(grad_output.is_contiguous(), \"grad_output tensor has to be contiguous\");\n\n    AT_ASSERTM(value.type().is_cuda(), \"value must be a CUDA tensor\");\n    AT_ASSERTM(spatial_shapes.type().is_cuda(), \"spatial_shapes must be a CUDA tensor\");\n    AT_ASSERTM(level_start_index.type().is_cuda(), \"level_start_index must be a CUDA tensor\");\n    AT_ASSERTM(sampling_loc.type().is_cuda(), \"sampling_loc must be a CUDA tensor\");\n    AT_ASSERTM(attn_weight.type().is_cuda(), \"attn_weight must be a CUDA tensor\");\n    AT_ASSERTM(grad_output.type().is_cuda(), \"grad_output must be a CUDA tensor\");\n\n    const int batch = value.size(0);\n    const int spatial_size = value.size(1);\n    const int num_heads = value.size(2);\n    const int channels = value.size(3);\n\n    const int num_levels = spatial_shapes.size(0);\n\n    const int num_query = sampling_loc.size(1);\n    const int num_point = sampling_loc.size(4);\n\n    const int im2col_step_ = std::min(batch, im2col_step);\n\n    AT_ASSERTM(batch % im2col_step_ == 0, \"batch(%d) must be divisible by im2col_step(%d)\", batch, im2col_step_);\n\n    auto grad_value = at::zeros_like(value);\n    auto grad_sampling_loc = at::zeros_like(sampling_loc);\n    auto grad_attn_weight = at::zeros_like(attn_weight);\n\n    const int batch_n = 
im2col_step_;\n    auto per_value_size = spatial_size * num_heads * channels;\n    auto per_sample_loc_size = num_query * num_heads * num_levels * num_point * 2;\n    auto per_attn_weight_size = num_query * num_heads * num_levels * num_point;\n    auto grad_output_n = grad_output.view({batch/im2col_step_, batch_n, num_query, num_heads, channels});\n    \n    for (int n = 0; n < batch/im2col_step_; ++n)\n    {\n        auto grad_output_g = grad_output_n.select(0, n);\n        AT_DISPATCH_FLOATING_TYPES(value.type(), \"ms_deform_attn_backward_cuda\", ([&] {\n            ms_deformable_col2im_cuda(at::cuda::getCurrentCUDAStream(),\n                                    grad_output_g.data<scalar_t>(),\n                                    value.data<scalar_t>() + n * im2col_step_ * per_value_size,\n                                    spatial_shapes.data<int64_t>(),\n                                    level_start_index.data<int64_t>(),\n                                    sampling_loc.data<scalar_t>() + n * im2col_step_ * per_sample_loc_size,\n                                    attn_weight.data<scalar_t>() + n * im2col_step_ * per_attn_weight_size,\n                                    batch_n, spatial_size, num_heads, channels, num_levels, num_query, num_point,\n                                    grad_value.data<scalar_t>() +  n * im2col_step_ * per_value_size,\n                                    grad_sampling_loc.data<scalar_t>() + n * im2col_step_ * per_sample_loc_size,\n                                    grad_attn_weight.data<scalar_t>() + n * im2col_step_ * per_attn_weight_size);\n\n        }));\n    }\n\n    return {\n        grad_value, grad_sampling_loc, grad_attn_weight\n    };\n}"
  },
  {
    "path": "mask2former/modeling/pixel_decoder/ops/src/cuda/ms_deform_attn_cuda.h",
    "content": "/*!\n**************************************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************************************\n* Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n**************************************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#pragma once\n#include <torch/extension.h>\n\nat::Tensor ms_deform_attn_cuda_forward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const int im2col_step);\n\nstd::vector<at::Tensor> ms_deform_attn_cuda_backward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const at::Tensor &grad_output,\n    const int im2col_step);\n\n"
  },
  {
    "path": "mask2former/modeling/pixel_decoder/ops/src/cuda/ms_deform_im2col_cuda.cuh",
    "content": "/*!\n**************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************\n* Modified from DCN (https://github.com/msracver/Deformable-ConvNets)\n* Copyright (c) 2018 Microsoft\n**************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#include <cstdio>\n#include <algorithm>\n#include <cstring>\n\n#include <ATen/ATen.h>\n#include <ATen/cuda/CUDAContext.h>\n\n#include <THC/THCAtomics.cuh>\n\n#define CUDA_KERNEL_LOOP(i, n)                          \\\n  for (int i = blockIdx.x * blockDim.x + threadIdx.x;   \\\n      i < (n);                                          \\\n      i += blockDim.x * gridDim.x)\n\nconst int CUDA_NUM_THREADS = 1024;\ninline int GET_BLOCKS(const int N, const int num_threads)\n{\n  return (N + num_threads - 1) / num_threads;\n}\n\n\ntemplate <typename scalar_t>\n__device__ scalar_t ms_deform_attn_im2col_bilinear(const scalar_t* &bottom_data, \n                                                   const int &height, const int &width, const int &nheads, const int &channels,\n                                                   const scalar_t &h, const scalar_t &w, const int &m, const int &c)\n{\n  const int h_low = floor(h);\n  const int w_low = floor(w);\n  const int h_high = h_low + 1;\n  const int w_high = w_low + 1;\n\n  const scalar_t lh = h - h_low;\n  const scalar_t lw = w - w_low;\n  const scalar_t hh = 1 - lh, hw = 1 - lw;\n\n  const int w_stride = nheads * channels;\n  const int h_stride = width * w_stride;\n  const int h_low_ptr_offset = h_low * h_stride;\n  const int h_high_ptr_offset = h_low_ptr_offset + h_stride;\n  const int 
w_low_ptr_offset = w_low * w_stride;\n  const int w_high_ptr_offset = w_low_ptr_offset + w_stride;\n  const int base_ptr = m * channels + c;\n\n  scalar_t v1 = 0;\n  if (h_low >= 0 && w_low >= 0)\n  {\n    const int ptr1 = h_low_ptr_offset + w_low_ptr_offset + base_ptr;\n    v1 = bottom_data[ptr1];\n  }\n  scalar_t v2 = 0;\n  if (h_low >= 0 && w_high <= width - 1)\n  {\n    const int ptr2 = h_low_ptr_offset + w_high_ptr_offset + base_ptr;\n    v2 = bottom_data[ptr2];\n  }\n  scalar_t v3 = 0;\n  if (h_high <= height - 1 && w_low >= 0)\n  {\n    const int ptr3 = h_high_ptr_offset + w_low_ptr_offset + base_ptr;\n    v3 = bottom_data[ptr3];\n  }\n  scalar_t v4 = 0;\n  if (h_high <= height - 1 && w_high <= width - 1)\n  {\n    const int ptr4 = h_high_ptr_offset + w_high_ptr_offset + base_ptr;\n    v4 = bottom_data[ptr4];\n  }\n\n  const scalar_t w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;\n\n  const scalar_t val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);\n  return val;\n}\n\n\ntemplate <typename scalar_t>\n__device__ void ms_deform_attn_col2im_bilinear(const scalar_t* &bottom_data, \n                                                   const int &height, const int &width, const int &nheads, const int &channels,\n                                                   const scalar_t &h, const scalar_t &w, const int &m, const int &c,\n                                                   const scalar_t &top_grad,\n                                                   const scalar_t &attn_weight,\n                                                   scalar_t* &grad_value, \n                                                   scalar_t* grad_sampling_loc,\n                                                   scalar_t* grad_attn_weight)\n{\n  const int h_low = floor(h);\n  const int w_low = floor(w);\n  const int h_high = h_low + 1;\n  const int w_high = w_low + 1;\n\n  const scalar_t lh = h - h_low;\n  const scalar_t lw = w - w_low;\n  const scalar_t hh = 1 - lh, hw = 1 - lw;\n\n  
const int w_stride = nheads * channels;\n  const int h_stride = width * w_stride;\n  const int h_low_ptr_offset = h_low * h_stride;\n  const int h_high_ptr_offset = h_low_ptr_offset + h_stride;\n  const int w_low_ptr_offset = w_low * w_stride;\n  const int w_high_ptr_offset = w_low_ptr_offset + w_stride;\n  const int base_ptr = m * channels + c;\n\n  const scalar_t w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;\n  const scalar_t top_grad_value = top_grad * attn_weight;\n  scalar_t grad_h_weight = 0, grad_w_weight = 0;\n\n  scalar_t v1 = 0;\n  if (h_low >= 0 && w_low >= 0)\n  {\n    const int ptr1 = h_low_ptr_offset + w_low_ptr_offset + base_ptr;\n    v1 = bottom_data[ptr1];\n    grad_h_weight -= hw * v1;\n    grad_w_weight -= hh * v1;\n    atomicAdd(grad_value+ptr1, w1*top_grad_value);\n  }\n  scalar_t v2 = 0;\n  if (h_low >= 0 && w_high <= width - 1)\n  {\n    const int ptr2 = h_low_ptr_offset + w_high_ptr_offset + base_ptr;\n    v2 = bottom_data[ptr2];\n    grad_h_weight -= lw * v2;\n    grad_w_weight += hh * v2;\n    atomicAdd(grad_value+ptr2, w2*top_grad_value);\n  }\n  scalar_t v3 = 0;\n  if (h_high <= height - 1 && w_low >= 0)\n  {\n    const int ptr3 = h_high_ptr_offset + w_low_ptr_offset + base_ptr;\n    v3 = bottom_data[ptr3];\n    grad_h_weight += hw * v3;\n    grad_w_weight -= lh * v3;\n    atomicAdd(grad_value+ptr3, w3*top_grad_value); \n  }\n  scalar_t v4 = 0;\n  if (h_high <= height - 1 && w_high <= width - 1)\n  {\n    const int ptr4 = h_high_ptr_offset + w_high_ptr_offset + base_ptr;\n    v4 = bottom_data[ptr4];\n    grad_h_weight += lw * v4;\n    grad_w_weight += lh * v4;\n    atomicAdd(grad_value+ptr4, w4*top_grad_value);\n  }\n\n  const scalar_t val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);\n  *grad_attn_weight = top_grad * val;\n  *grad_sampling_loc = width * grad_w_weight * top_grad_value;\n  *(grad_sampling_loc + 1) = height * grad_h_weight * top_grad_value;\n}\n\n\ntemplate <typename scalar_t>\n__device__ void 
ms_deform_attn_col2im_bilinear_gm(const scalar_t* &bottom_data, \n                                                   const int &height, const int &width, const int &nheads, const int &channels,\n                                                   const scalar_t &h, const scalar_t &w, const int &m, const int &c,\n                                                   const scalar_t &top_grad,\n                                                   const scalar_t &attn_weight,\n                                                   scalar_t* &grad_value, \n                                                   scalar_t* grad_sampling_loc,\n                                                   scalar_t* grad_attn_weight)\n{\n  const int h_low = floor(h);\n  const int w_low = floor(w);\n  const int h_high = h_low + 1;\n  const int w_high = w_low + 1;\n\n  const scalar_t lh = h - h_low;\n  const scalar_t lw = w - w_low;\n  const scalar_t hh = 1 - lh, hw = 1 - lw;\n\n  const int w_stride = nheads * channels;\n  const int h_stride = width * w_stride;\n  const int h_low_ptr_offset = h_low * h_stride;\n  const int h_high_ptr_offset = h_low_ptr_offset + h_stride;\n  const int w_low_ptr_offset = w_low * w_stride;\n  const int w_high_ptr_offset = w_low_ptr_offset + w_stride;\n  const int base_ptr = m * channels + c;\n\n  const scalar_t w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;\n  const scalar_t top_grad_value = top_grad * attn_weight;\n  scalar_t grad_h_weight = 0, grad_w_weight = 0;\n\n  scalar_t v1 = 0;\n  if (h_low >= 0 && w_low >= 0)\n  {\n    const int ptr1 = h_low_ptr_offset + w_low_ptr_offset + base_ptr;\n    v1 = bottom_data[ptr1];\n    grad_h_weight -= hw * v1;\n    grad_w_weight -= hh * v1;\n    atomicAdd(grad_value+ptr1, w1*top_grad_value);\n  }\n  scalar_t v2 = 0;\n  if (h_low >= 0 && w_high <= width - 1)\n  {\n    const int ptr2 = h_low_ptr_offset + w_high_ptr_offset + base_ptr;\n    v2 = bottom_data[ptr2];\n    grad_h_weight -= lw * v2;\n    grad_w_weight += hh * v2;\n 
   atomicAdd(grad_value+ptr2, w2*top_grad_value);\n  }\n  scalar_t v3 = 0;\n  if (h_high <= height - 1 && w_low >= 0)\n  {\n    const int ptr3 = h_high_ptr_offset + w_low_ptr_offset + base_ptr;\n    v3 = bottom_data[ptr3];\n    grad_h_weight += hw * v3;\n    grad_w_weight -= lh * v3;\n    atomicAdd(grad_value+ptr3, w3*top_grad_value); \n  }\n  scalar_t v4 = 0;\n  if (h_high <= height - 1 && w_high <= width - 1)\n  {\n    const int ptr4 = h_high_ptr_offset + w_high_ptr_offset + base_ptr;\n    v4 = bottom_data[ptr4];\n    grad_h_weight += lw * v4;\n    grad_w_weight += lh * v4;\n    atomicAdd(grad_value+ptr4, w4*top_grad_value);\n  }\n\n  const scalar_t val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);\n  atomicAdd(grad_attn_weight, top_grad * val); \n  atomicAdd(grad_sampling_loc, width * grad_w_weight * top_grad_value);\n  atomicAdd(grad_sampling_loc + 1, height * grad_h_weight * top_grad_value);\n}\n\n\ntemplate <typename scalar_t>\n__global__ void ms_deformable_im2col_gpu_kernel(const int n,\n                                                const scalar_t *data_value, \n                                                const int64_t *data_spatial_shapes,\n                                                const int64_t *data_level_start_index, \n                                                const scalar_t *data_sampling_loc,\n                                                const scalar_t *data_attn_weight,\n                                                const int batch_size, \n                                                const int spatial_size, \n                                                const int num_heads,\n                                                const int channels, \n                                                const int num_levels,\n                                                const int num_query,\n                                                const int num_point,\n                                                scalar_t *data_col)\n{\n  
CUDA_KERNEL_LOOP(index, n)
  {
    // decompose the flat output index into (batch b_col, query q_col, head m_col, channel c_col)
    int _temp = index;
    const int c_col = _temp % channels;
    _temp /= channels;
    const int sampling_index = _temp;
    const int m_col = _temp % num_heads;
    _temp /= num_heads;
    const int q_col = _temp % num_query;
    _temp /= num_query;
    const int b_col = _temp;

    scalar_t *data_col_ptr = data_col + index;
    int data_weight_ptr = sampling_index * num_levels * num_point;
    int data_loc_w_ptr = data_weight_ptr << 1;
    const int qid_stride = num_heads * channels;
    const int data_value_ptr_init_offset = b_col * spatial_size * qid_stride;
    scalar_t col = 0;

    // accumulate attention-weighted bilinear samples over all levels and points
    for (int l_col=0; l_col < num_levels; ++l_col)
    {
      const int level_start_id = data_level_start_index[l_col];
      const int spatial_h_ptr = l_col << 1;
      const int spatial_h = data_spatial_shapes[spatial_h_ptr];
      const int spatial_w = data_spatial_shapes[spatial_h_ptr + 1];
      const scalar_t *data_value_ptr = data_value + (data_value_ptr_init_offset + level_start_id * qid_stride);
      for (int p_col=0; p_col < num_point; ++p_col)
      {
        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];
        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];
        const scalar_t weight = data_attn_weight[data_weight_ptr];

        // map normalized sampling locations to pixel coordinates (half-pixel offset)
        const scalar_t h_im = loc_h * spatial_h - 0.5;
        const scalar_t w_im = loc_w * spatial_w - 0.5;

        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < spatial_w)
        {
          col += ms_deform_attn_im2col_bilinear(data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col) * weight;
        }

        data_weight_ptr += 1;
        data_loc_w_ptr += 2;
      }
    }
    *data_col_ptr = col;
  }
}

template <typename scalar_t, unsigned int blockSize>
__global__ void ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1(const int n,
  const scalar_t *grad_col,\n                                                const scalar_t *data_value,\n                                                const int64_t *data_spatial_shapes,\n                                                const int64_t *data_level_start_index, \n                                                const scalar_t *data_sampling_loc,\n                                                const scalar_t *data_attn_weight,\n                                                const int batch_size, \n                                                const int spatial_size, \n                                                const int num_heads,\n                                                const int channels, \n                                                const int num_levels,\n                                                const int num_query,\n                                                const int num_point,\n                                                scalar_t *grad_value,\n                                                scalar_t *grad_sampling_loc,\n                                                scalar_t *grad_attn_weight)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    __shared__ scalar_t cache_grad_sampling_loc[blockSize * 2];\n    __shared__ scalar_t cache_grad_attn_weight[blockSize];\n    unsigned int tid = threadIdx.x;\n    int _temp = index;\n    const int c_col = _temp % channels;\n    _temp /= channels;\n    const int sampling_index = _temp; \n    const int m_col = _temp % num_heads;\n    _temp /= num_heads;\n    const int q_col = _temp % num_query;\n    _temp /= num_query;\n    const int b_col = _temp;\n\n    const scalar_t top_grad = grad_col[index];\n\n    int data_weight_ptr = sampling_index * num_levels * num_point;\n    int data_loc_w_ptr = data_weight_ptr << 1;\n    const int grad_sampling_ptr = data_weight_ptr;\n    grad_sampling_loc += grad_sampling_ptr << 1;\n    grad_attn_weight += grad_sampling_ptr;\n    const int 
grad_weight_stride = 1;
    const int grad_loc_stride = 2;
    const int qid_stride = num_heads * channels;
    const int data_value_ptr_init_offset = b_col * spatial_size * qid_stride;

    for (int l_col=0; l_col < num_levels; ++l_col)
    {
      const int level_start_id = data_level_start_index[l_col];
      const int spatial_h_ptr = l_col << 1;
      const int spatial_h = data_spatial_shapes[spatial_h_ptr];
      const int spatial_w = data_spatial_shapes[spatial_h_ptr + 1];
      const int value_ptr_offset = data_value_ptr_init_offset + level_start_id * qid_stride;
      const scalar_t *data_value_ptr = data_value + value_ptr_offset;
      scalar_t *grad_value_ptr = grad_value + value_ptr_offset;

      for (int p_col=0; p_col < num_point; ++p_col)
      {
        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];
        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];
        const scalar_t weight = data_attn_weight[data_weight_ptr];

        const scalar_t h_im = loc_h * spatial_h - 0.5;
        const scalar_t w_im = loc_w * spatial_w - 0.5;
        *(cache_grad_sampling_loc+(threadIdx.x << 1)) = 0;
        *(cache_grad_sampling_loc+((threadIdx.x << 1) + 1)) = 0;
        *(cache_grad_attn_weight+threadIdx.x)=0;
        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < spatial_w)
        {
          ms_deform_attn_col2im_bilinear(
            data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col,
            top_grad, weight, grad_value_ptr,
            cache_grad_sampling_loc+(threadIdx.x << 1), cache_grad_attn_weight+threadIdx.x);
        }

        __syncthreads();
        // thread 0 serially sums the per-thread partial gradients cached in shared memory
        if (tid == 0)
        {
          scalar_t _grad_w=cache_grad_sampling_loc[0], _grad_h=cache_grad_sampling_loc[1], _grad_a=cache_grad_attn_weight[0];
          int sid=2;
          for (unsigned int i = 1; i < blockSize; ++i)
          {
            _grad_w += cache_grad_sampling_loc[sid];
            _grad_h += cache_grad_sampling_loc[sid + 1];
            _grad_a += cache_grad_attn_weight[i];
            sid += 2;
          }

          *grad_sampling_loc = _grad_w;
          *(grad_sampling_loc + 1) = _grad_h;
          *grad_attn_weight = _grad_a;
        }
        __syncthreads();

        data_weight_ptr += 1;
        data_loc_w_ptr += 2;
        grad_attn_weight += grad_weight_stride;
        grad_sampling_loc += grad_loc_stride;
      }
    }
  }
}


template <typename scalar_t, unsigned int blockSize>
__global__ void ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v2(const int n,
                                                const scalar_t *grad_col,
                                                const scalar_t *data_value,
                                                const int64_t *data_spatial_shapes,
                                                const int64_t *data_level_start_index,
                                                const scalar_t *data_sampling_loc,
                                                const scalar_t *data_attn_weight,
                                                const int batch_size,
                                                const int spatial_size,
                                                const int num_heads,
                                                const int channels,
                                                const int num_levels,
                                                const int num_query,
                                                const int num_point,
                                                scalar_t *grad_value,
                                                scalar_t *grad_sampling_loc,
                                                scalar_t *grad_attn_weight)
{
  CUDA_KERNEL_LOOP(index, n)
  {
    __shared__ scalar_t
cache_grad_sampling_loc[blockSize * 2];\n    __shared__ scalar_t cache_grad_attn_weight[blockSize];\n    unsigned int tid = threadIdx.x;\n    int _temp = index;\n    const int c_col = _temp % channels;\n    _temp /= channels;\n    const int sampling_index = _temp; \n    const int m_col = _temp % num_heads;\n    _temp /= num_heads;\n    const int q_col = _temp % num_query;\n    _temp /= num_query;\n    const int b_col = _temp;\n\n    const scalar_t top_grad = grad_col[index];\n\n    int data_weight_ptr = sampling_index * num_levels * num_point;\n    int data_loc_w_ptr = data_weight_ptr << 1;\n    const int grad_sampling_ptr = data_weight_ptr;\n    grad_sampling_loc += grad_sampling_ptr << 1;\n    grad_attn_weight += grad_sampling_ptr;\n    const int grad_weight_stride = 1;\n    const int grad_loc_stride = 2;\n    const int qid_stride = num_heads * channels;\n    const int data_value_ptr_init_offset = b_col * spatial_size * qid_stride;\n\n    for (int l_col=0; l_col < num_levels; ++l_col)\n    {\n      const int level_start_id = data_level_start_index[l_col];\n      const int spatial_h_ptr = l_col << 1;\n      const int spatial_h = data_spatial_shapes[spatial_h_ptr];\n      const int spatial_w = data_spatial_shapes[spatial_h_ptr + 1];\n      const int value_ptr_offset = data_value_ptr_init_offset + level_start_id * qid_stride;\n      const scalar_t *data_value_ptr = data_value + value_ptr_offset;\n      scalar_t *grad_value_ptr = grad_value + value_ptr_offset;\n\n      for (int p_col=0; p_col < num_point; ++p_col)\n      {\n        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];\n        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];\n        const scalar_t weight = data_attn_weight[data_weight_ptr];\n\n        const scalar_t h_im = loc_h * spatial_h - 0.5;\n        const scalar_t w_im = loc_w * spatial_w - 0.5;\n        *(cache_grad_sampling_loc+(threadIdx.x << 1)) = 0;\n        *(cache_grad_sampling_loc+((threadIdx.x << 1) + 1)) = 0;\n      
  *(cache_grad_attn_weight+threadIdx.x)=0;\n        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < spatial_w)\n        {\n          ms_deform_attn_col2im_bilinear(\n            data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col,\n            top_grad, weight, grad_value_ptr, \n            cache_grad_sampling_loc+(threadIdx.x << 1), cache_grad_attn_weight+threadIdx.x);\n        }\n        \n        __syncthreads();\n\n        for (unsigned int s=blockSize/2; s>0; s>>=1)\n        {\n          if (tid < s) {\n            const unsigned int xid1 = tid << 1;\n            const unsigned int xid2 = (tid + s) << 1;\n            cache_grad_attn_weight[tid] += cache_grad_attn_weight[tid + s];\n            cache_grad_sampling_loc[xid1] += cache_grad_sampling_loc[xid2];\n            cache_grad_sampling_loc[xid1 + 1] += cache_grad_sampling_loc[xid2 + 1];\n          }\n          __syncthreads();\n        }\n\n        if (tid == 0)\n        { \n          *grad_sampling_loc = cache_grad_sampling_loc[0];\n          *(grad_sampling_loc + 1) = cache_grad_sampling_loc[1];\n          *grad_attn_weight = cache_grad_attn_weight[0];\n        }\n        __syncthreads();\n\n        data_weight_ptr += 1;\n        data_loc_w_ptr += 2;\n        grad_attn_weight += grad_weight_stride;\n        grad_sampling_loc += grad_loc_stride;\n      }\n    }\n  }\n}\n\n\ntemplate <typename scalar_t>\n__global__ void ms_deformable_col2im_gpu_kernel_shm_reduce_v1(const int n,\n                                                const scalar_t *grad_col,\n                                                const scalar_t *data_value,\n                                                const int64_t *data_spatial_shapes,\n                                                const int64_t *data_level_start_index, \n                                                const scalar_t *data_sampling_loc,\n                                                const scalar_t *data_attn_weight,\n      
                                          const int batch_size, \n                                                const int spatial_size, \n                                                const int num_heads,\n                                                const int channels, \n                                                const int num_levels,\n                                                const int num_query,\n                                                const int num_point,\n                                                scalar_t *grad_value,\n                                                scalar_t *grad_sampling_loc,\n                                                scalar_t *grad_attn_weight)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    extern __shared__ int _s[];\n    scalar_t* cache_grad_sampling_loc = (scalar_t*)_s;\n    scalar_t* cache_grad_attn_weight = cache_grad_sampling_loc + 2 * blockDim.x;\n    unsigned int tid = threadIdx.x;\n    int _temp = index;\n    const int c_col = _temp % channels;\n    _temp /= channels;\n    const int sampling_index = _temp; \n    const int m_col = _temp % num_heads;\n    _temp /= num_heads;\n    const int q_col = _temp % num_query;\n    _temp /= num_query;\n    const int b_col = _temp;\n\n    const scalar_t top_grad = grad_col[index];\n\n    int data_weight_ptr = sampling_index * num_levels * num_point;\n    int data_loc_w_ptr = data_weight_ptr << 1;\n    const int grad_sampling_ptr = data_weight_ptr;\n    grad_sampling_loc += grad_sampling_ptr << 1;\n    grad_attn_weight += grad_sampling_ptr;\n    const int grad_weight_stride = 1;\n    const int grad_loc_stride = 2;\n    const int qid_stride = num_heads * channels;\n    const int data_value_ptr_init_offset = b_col * spatial_size * qid_stride;\n\n    for (int l_col=0; l_col < num_levels; ++l_col)\n    {\n      const int level_start_id = data_level_start_index[l_col];\n      const int spatial_h_ptr = l_col << 1;\n      const int spatial_h = 
data_spatial_shapes[spatial_h_ptr];
      const int spatial_w = data_spatial_shapes[spatial_h_ptr + 1];
      const int value_ptr_offset = data_value_ptr_init_offset + level_start_id * qid_stride;
      const scalar_t *data_value_ptr = data_value + value_ptr_offset;
      scalar_t *grad_value_ptr = grad_value + value_ptr_offset;

      for (int p_col=0; p_col < num_point; ++p_col)
      {
        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];
        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];
        const scalar_t weight = data_attn_weight[data_weight_ptr];

        const scalar_t h_im = loc_h * spatial_h - 0.5;
        const scalar_t w_im = loc_w * spatial_w - 0.5;
        *(cache_grad_sampling_loc+(threadIdx.x << 1)) = 0;
        *(cache_grad_sampling_loc+((threadIdx.x << 1) + 1)) = 0;
        *(cache_grad_attn_weight+threadIdx.x)=0;
        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < spatial_w)
        {
          ms_deform_attn_col2im_bilinear(
            data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col,
            top_grad, weight, grad_value_ptr,
            cache_grad_sampling_loc+(threadIdx.x << 1), cache_grad_attn_weight+threadIdx.x);
        }

        __syncthreads();
        // thread 0 serially sums the per-thread partial gradients cached in shared memory
        if (tid == 0)
        {
          scalar_t _grad_w=cache_grad_sampling_loc[0], _grad_h=cache_grad_sampling_loc[1], _grad_a=cache_grad_attn_weight[0];
          int sid=2;
          for (unsigned int i = 1; i < blockDim.x; ++i)
          {
            _grad_w += cache_grad_sampling_loc[sid];
            _grad_h += cache_grad_sampling_loc[sid + 1];
            _grad_a += cache_grad_attn_weight[i];
            sid += 2;
          }

          *grad_sampling_loc = _grad_w;
          *(grad_sampling_loc + 1) = _grad_h;
          *grad_attn_weight = _grad_a;
        }
        __syncthreads();

        data_weight_ptr += 1;
    data_loc_w_ptr += 2;\n        grad_attn_weight += grad_weight_stride;\n        grad_sampling_loc += grad_loc_stride;\n      }\n    }\n  }\n}\n\ntemplate <typename scalar_t>\n__global__ void ms_deformable_col2im_gpu_kernel_shm_reduce_v2(const int n,\n                                                const scalar_t *grad_col,\n                                                const scalar_t *data_value,\n                                                const int64_t *data_spatial_shapes,\n                                                const int64_t *data_level_start_index, \n                                                const scalar_t *data_sampling_loc,\n                                                const scalar_t *data_attn_weight,\n                                                const int batch_size, \n                                                const int spatial_size, \n                                                const int num_heads,\n                                                const int channels, \n                                                const int num_levels,\n                                                const int num_query,\n                                                const int num_point,\n                                                scalar_t *grad_value,\n                                                scalar_t *grad_sampling_loc,\n                                                scalar_t *grad_attn_weight)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    extern __shared__ int _s[];\n    scalar_t* cache_grad_sampling_loc = (scalar_t*)_s;\n    scalar_t* cache_grad_attn_weight = cache_grad_sampling_loc + 2 * blockDim.x;\n    unsigned int tid = threadIdx.x;\n    int _temp = index;\n    const int c_col = _temp % channels;\n    _temp /= channels;\n    const int sampling_index = _temp; \n    const int m_col = _temp % num_heads;\n    _temp /= num_heads;\n    const int q_col = _temp % num_query;\n    _temp /= num_query;\n    const int b_col 
= _temp;\n\n    const scalar_t top_grad = grad_col[index];\n\n    int data_weight_ptr = sampling_index * num_levels * num_point;\n    int data_loc_w_ptr = data_weight_ptr << 1;\n    const int grad_sampling_ptr = data_weight_ptr;\n    grad_sampling_loc += grad_sampling_ptr << 1;\n    grad_attn_weight += grad_sampling_ptr;\n    const int grad_weight_stride = 1;\n    const int grad_loc_stride = 2;\n    const int qid_stride = num_heads * channels;\n    const int data_value_ptr_init_offset = b_col * spatial_size * qid_stride;\n\n    for (int l_col=0; l_col < num_levels; ++l_col)\n    {\n      const int level_start_id = data_level_start_index[l_col];\n      const int spatial_h_ptr = l_col << 1;\n      const int spatial_h = data_spatial_shapes[spatial_h_ptr];\n      const int spatial_w = data_spatial_shapes[spatial_h_ptr + 1];\n      const int value_ptr_offset = data_value_ptr_init_offset + level_start_id * qid_stride;\n      const scalar_t *data_value_ptr = data_value + value_ptr_offset;\n      scalar_t *grad_value_ptr = grad_value + value_ptr_offset;\n\n      for (int p_col=0; p_col < num_point; ++p_col)\n      {\n        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];\n        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];\n        const scalar_t weight = data_attn_weight[data_weight_ptr];\n\n        const scalar_t h_im = loc_h * spatial_h - 0.5;\n        const scalar_t w_im = loc_w * spatial_w - 0.5;\n        *(cache_grad_sampling_loc+(threadIdx.x << 1)) = 0;\n        *(cache_grad_sampling_loc+((threadIdx.x << 1) + 1)) = 0;\n        *(cache_grad_attn_weight+threadIdx.x)=0;\n        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < spatial_w)\n        {\n          ms_deform_attn_col2im_bilinear(\n            data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col,\n            top_grad, weight, grad_value_ptr, \n            cache_grad_sampling_loc+(threadIdx.x << 1), cache_grad_attn_weight+threadIdx.x);\n      
  }\n        \n        __syncthreads();\n\n        for (unsigned int s=blockDim.x/2, spre=blockDim.x; s>0; s>>=1, spre>>=1)\n        {\n          if (tid < s) {\n            const unsigned int xid1 = tid << 1;\n            const unsigned int xid2 = (tid + s) << 1;\n            cache_grad_attn_weight[tid] += cache_grad_attn_weight[tid + s];\n            cache_grad_sampling_loc[xid1] += cache_grad_sampling_loc[xid2];\n            cache_grad_sampling_loc[xid1 + 1] += cache_grad_sampling_loc[xid2 + 1];\n            if (tid + (s << 1) < spre)\n            {\n              cache_grad_attn_weight[tid] += cache_grad_attn_weight[tid + (s << 1)];\n              cache_grad_sampling_loc[xid1] += cache_grad_sampling_loc[xid2 + (s << 1)];\n              cache_grad_sampling_loc[xid1 + 1] += cache_grad_sampling_loc[xid2 + 1 + (s << 1)];\n            } \n          }\n          __syncthreads();\n        }\n\n        if (tid == 0)\n        {\n          *grad_sampling_loc = cache_grad_sampling_loc[0];\n          *(grad_sampling_loc + 1) = cache_grad_sampling_loc[1];\n          *grad_attn_weight = cache_grad_attn_weight[0];\n        }\n        __syncthreads();\n\n        data_weight_ptr += 1;\n        data_loc_w_ptr += 2;\n        grad_attn_weight += grad_weight_stride;\n        grad_sampling_loc += grad_loc_stride;\n      }\n    }\n  }\n}\n\ntemplate <typename scalar_t>\n__global__ void ms_deformable_col2im_gpu_kernel_shm_reduce_v2_multi_blocks(const int n,\n                                                const scalar_t *grad_col,\n                                                const scalar_t *data_value,\n                                                const int64_t *data_spatial_shapes,\n                                                const int64_t *data_level_start_index, \n                                                const scalar_t *data_sampling_loc,\n                                                const scalar_t *data_attn_weight,\n                                            
    const int batch_size, \n                                                const int spatial_size, \n                                                const int num_heads,\n                                                const int channels, \n                                                const int num_levels,\n                                                const int num_query,\n                                                const int num_point,\n                                                scalar_t *grad_value,\n                                                scalar_t *grad_sampling_loc,\n                                                scalar_t *grad_attn_weight)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    extern __shared__ int _s[];\n    scalar_t* cache_grad_sampling_loc = (scalar_t*)_s;\n    scalar_t* cache_grad_attn_weight = cache_grad_sampling_loc + 2 * blockDim.x;\n    unsigned int tid = threadIdx.x;\n    int _temp = index;\n    const int c_col = _temp % channels;\n    _temp /= channels;\n    const int sampling_index = _temp; \n    const int m_col = _temp % num_heads;\n    _temp /= num_heads;\n    const int q_col = _temp % num_query;\n    _temp /= num_query;\n    const int b_col = _temp;\n\n    const scalar_t top_grad = grad_col[index];\n\n    int data_weight_ptr = sampling_index * num_levels * num_point;\n    int data_loc_w_ptr = data_weight_ptr << 1;\n    const int grad_sampling_ptr = data_weight_ptr;\n    grad_sampling_loc += grad_sampling_ptr << 1;\n    grad_attn_weight += grad_sampling_ptr;\n    const int grad_weight_stride = 1;\n    const int grad_loc_stride = 2;\n    const int qid_stride = num_heads * channels;\n    const int data_value_ptr_init_offset = b_col * spatial_size * qid_stride;\n\n    for (int l_col=0; l_col < num_levels; ++l_col)\n    {\n      const int level_start_id = data_level_start_index[l_col];\n      const int spatial_h_ptr = l_col << 1;\n      const int spatial_h = data_spatial_shapes[spatial_h_ptr];\n      const int spatial_w = 
data_spatial_shapes[spatial_h_ptr + 1];\n      const int value_ptr_offset = data_value_ptr_init_offset + level_start_id * qid_stride;\n      const scalar_t *data_value_ptr = data_value + value_ptr_offset;\n      scalar_t *grad_value_ptr = grad_value + value_ptr_offset;\n\n      for (int p_col=0; p_col < num_point; ++p_col)\n      {\n        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];\n        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];\n        const scalar_t weight = data_attn_weight[data_weight_ptr];\n\n        const scalar_t h_im = loc_h * spatial_h - 0.5;\n        const scalar_t w_im = loc_w * spatial_w - 0.5;\n        *(cache_grad_sampling_loc+(threadIdx.x << 1)) = 0;\n        *(cache_grad_sampling_loc+((threadIdx.x << 1) + 1)) = 0;\n        *(cache_grad_attn_weight+threadIdx.x)=0;\n        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < spatial_w)\n        {\n          ms_deform_attn_col2im_bilinear(\n            data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col,\n            top_grad, weight, grad_value_ptr, \n            cache_grad_sampling_loc+(threadIdx.x << 1), cache_grad_attn_weight+threadIdx.x);\n        }\n        \n        __syncthreads();\n\n        for (unsigned int s=blockDim.x/2, spre=blockDim.x; s>0; s>>=1, spre>>=1)\n        {\n          if (tid < s) {\n            const unsigned int xid1 = tid << 1;\n            const unsigned int xid2 = (tid + s) << 1;\n            cache_grad_attn_weight[tid] += cache_grad_attn_weight[tid + s];\n            cache_grad_sampling_loc[xid1] += cache_grad_sampling_loc[xid2];\n            cache_grad_sampling_loc[xid1 + 1] += cache_grad_sampling_loc[xid2 + 1];\n            if (tid + (s << 1) < spre)\n            {\n              cache_grad_attn_weight[tid] += cache_grad_attn_weight[tid + (s << 1)];\n              cache_grad_sampling_loc[xid1] += cache_grad_sampling_loc[xid2 + (s << 1)];\n              cache_grad_sampling_loc[xid1 + 1] += 
cache_grad_sampling_loc[xid2 + 1 + (s << 1)];\n            }\n          }\n          __syncthreads();\n        }\n\n        if (tid == 0)\n        {\n          atomicAdd(grad_sampling_loc, cache_grad_sampling_loc[0]);\n          atomicAdd(grad_sampling_loc + 1, cache_grad_sampling_loc[1]);\n          atomicAdd(grad_attn_weight, cache_grad_attn_weight[0]);\n        }\n        __syncthreads();\n\n        data_weight_ptr += 1;\n        data_loc_w_ptr += 2;\n        grad_attn_weight += grad_weight_stride;\n        grad_sampling_loc += grad_loc_stride;\n      }\n    }\n  }\n}\n\n\ntemplate <typename scalar_t>\n__global__ void ms_deformable_col2im_gpu_kernel_gm(const int n,\n                                                const scalar_t *grad_col,\n                                                const scalar_t *data_value,\n                                                const int64_t *data_spatial_shapes,\n                                                const int64_t *data_level_start_index, \n                                                const scalar_t *data_sampling_loc,\n                                                const scalar_t *data_attn_weight,\n                                                const int batch_size, \n                                                const int spatial_size, \n                                                const int num_heads,\n                                                const int channels, \n                                                const int num_levels,\n                                                const int num_query,\n                                                const int num_point,\n                                                scalar_t *grad_value,\n                                                scalar_t *grad_sampling_loc,\n                                                scalar_t *grad_attn_weight)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    int _temp = index;\n    const int c_col = _temp % channels;\n   
 _temp /= channels;\n    const int sampling_index = _temp; \n    const int m_col = _temp % num_heads;\n    _temp /= num_heads;\n    const int q_col = _temp % num_query;\n    _temp /= num_query;\n    const int b_col = _temp;\n\n    const scalar_t top_grad = grad_col[index];\n\n    int data_weight_ptr = sampling_index * num_levels * num_point;\n    int data_loc_w_ptr = data_weight_ptr << 1;\n    const int grad_sampling_ptr = data_weight_ptr;\n    grad_sampling_loc += grad_sampling_ptr << 1;\n    grad_attn_weight += grad_sampling_ptr;\n    const int grad_weight_stride = 1;\n    const int grad_loc_stride = 2;\n    const int qid_stride = num_heads * channels;\n    const int data_value_ptr_init_offset = b_col * spatial_size * qid_stride;\n\n    for (int l_col=0; l_col < num_levels; ++l_col)\n    {\n      const int level_start_id = data_level_start_index[l_col];\n      const int spatial_h_ptr = l_col << 1;\n      const int spatial_h = data_spatial_shapes[spatial_h_ptr];\n      const int spatial_w = data_spatial_shapes[spatial_h_ptr + 1];\n      const int value_ptr_offset = data_value_ptr_init_offset + level_start_id * qid_stride;\n      const scalar_t *data_value_ptr = data_value + value_ptr_offset;\n      scalar_t *grad_value_ptr = grad_value + value_ptr_offset;\n\n      for (int p_col=0; p_col < num_point; ++p_col)\n      {\n        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];\n        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];\n        const scalar_t weight = data_attn_weight[data_weight_ptr];\n\n        const scalar_t h_im = loc_h * spatial_h - 0.5;\n        const scalar_t w_im = loc_w * spatial_w - 0.5;\n        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < spatial_w)\n        {\n          ms_deform_attn_col2im_bilinear_gm(\n            data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col,\n            top_grad, weight, grad_value_ptr, \n            grad_sampling_loc, grad_attn_weight);\n       
 }\n        data_weight_ptr += 1;\n        data_loc_w_ptr += 2;\n        grad_attn_weight += grad_weight_stride;\n        grad_sampling_loc += grad_loc_stride;\n      }\n    }\n  }\n}\n\n\ntemplate <typename scalar_t>\nvoid ms_deformable_im2col_cuda(cudaStream_t stream,\n                              const scalar_t* data_value,\n                              const int64_t* data_spatial_shapes, \n                              const int64_t* data_level_start_index, \n                              const scalar_t* data_sampling_loc,\n                              const scalar_t* data_attn_weight,\n                              const int batch_size,\n                              const int spatial_size, \n                              const int num_heads, \n                              const int channels, \n                              const int num_levels, \n                              const int num_query,\n                              const int num_point,\n                              scalar_t* data_col)\n{\n  const int num_kernels = batch_size * num_query * num_heads * channels;\n  const int num_actual_kernels = batch_size * num_query * num_heads * channels;\n  const int num_threads = CUDA_NUM_THREADS;\n  ms_deformable_im2col_gpu_kernel<scalar_t>\n      <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n          0, stream>>>(\n      num_kernels, data_value, data_spatial_shapes, data_level_start_index, data_sampling_loc, data_attn_weight, \n      batch_size, spatial_size, num_heads, channels, num_levels, num_query, num_point, data_col);\n  \n  cudaError_t err = cudaGetLastError();\n  if (err != cudaSuccess)\n  {\n    printf(\"error in ms_deformable_im2col_cuda: %s\\n\", cudaGetErrorString(err));\n  }\n\n}\n\ntemplate <typename scalar_t>\nvoid ms_deformable_col2im_cuda(cudaStream_t stream,\n                              const scalar_t* grad_col,\n                              const scalar_t* data_value,\n                              const int64_t * 
data_spatial_shapes,\n                              const int64_t * data_level_start_index,\n                              const scalar_t * data_sampling_loc,\n                              const scalar_t * data_attn_weight,\n                              const int batch_size, \n                              const int spatial_size, \n                              const int num_heads,\n                              const int channels, \n                              const int num_levels,\n                              const int num_query,\n                              const int num_point, \n                              scalar_t* grad_value,\n                              scalar_t* grad_sampling_loc,\n                              scalar_t* grad_attn_weight)\n{\n  const int num_threads = (channels > CUDA_NUM_THREADS)?CUDA_NUM_THREADS:channels;\n  const int num_kernels = batch_size * num_query * num_heads * channels;\n  const int num_actual_kernels = batch_size * num_query * num_heads * channels;\n  if (channels > 1024)\n  {\n    if ((channels & 1023) == 0)\n    {\n      ms_deformable_col2im_gpu_kernel_shm_reduce_v2_multi_blocks<scalar_t>\n          <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n              num_threads*3*sizeof(scalar_t), stream>>>(\n                        num_kernels, \n                        grad_col,\n                        data_value,\n                        data_spatial_shapes,\n                        data_level_start_index, \n                        data_sampling_loc,\n                        data_attn_weight,\n                        batch_size, \n                        spatial_size, \n                        num_heads,\n                        channels, \n                        num_levels,\n                        num_query,\n                        num_point,\n                        grad_value,\n                        grad_sampling_loc,\n                        grad_attn_weight);\n    }\n    else\n    {\n      
ms_deformable_col2im_gpu_kernel_gm<scalar_t>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n    }\n  }\n  else{\n    switch(channels)\n    {\n      case 1:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1<scalar_t, 1>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 2:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1<scalar_t, 2>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n      
                data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 4:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1<scalar_t, 4>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 8:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1<scalar_t, 8>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n           
           channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 16:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1<scalar_t, 16>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 32:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1<scalar_t, 32>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 64:\n       
 ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v2<scalar_t, 64>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 128:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v2<scalar_t, 128>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 256:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v2<scalar_t, 256>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      
data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 512:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v2<scalar_t, 512>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 1024:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v2<scalar_t, 1024>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                
      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      default:\n        if (channels < 64)\n        {\n          ms_deformable_col2im_gpu_kernel_shm_reduce_v1<scalar_t>\n          <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n              num_threads*3*sizeof(scalar_t), stream>>>(\n                        num_kernels, \n                        grad_col,\n                        data_value,\n                        data_spatial_shapes,\n                        data_level_start_index, \n                        data_sampling_loc,\n                        data_attn_weight,\n                        batch_size, \n                        spatial_size, \n                        num_heads,\n                        channels, \n                        num_levels,\n                        num_query,\n                        num_point,\n                        grad_value,\n                        grad_sampling_loc,\n                        grad_attn_weight);\n        }\n        else\n        {\n          ms_deformable_col2im_gpu_kernel_shm_reduce_v2<scalar_t>\n          <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n              num_threads*3*sizeof(scalar_t), stream>>>(\n                        num_kernels, \n                        grad_col,\n                        data_value,\n                        data_spatial_shapes,\n                        data_level_start_index, \n                        data_sampling_loc,\n                        data_attn_weight,\n                        batch_size, \n                        spatial_size, \n                        num_heads,\n                        channels, \n                        num_levels,\n                        num_query,\n                        
num_point,\n                        grad_value,\n                        grad_sampling_loc,\n                        grad_attn_weight);\n        }\n    }\n  }\n  cudaError_t err = cudaGetLastError();\n  if (err != cudaSuccess)\n  {\n    printf(\"error in ms_deformable_col2im_cuda: %s\\n\", cudaGetErrorString(err));\n  }\n\n}"
  },
  {
    "path": "mask2former/modeling/pixel_decoder/ops/src/ms_deform_attn.h",
    "content": "/*!\n**************************************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************************************\n* Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n**************************************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#pragma once\n\n#include \"cpu/ms_deform_attn_cpu.h\"\n\n#ifdef WITH_CUDA\n#include \"cuda/ms_deform_attn_cuda.h\"\n#endif\n\n\nat::Tensor\nms_deform_attn_forward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const int im2col_step)\n{\n    if (value.type().is_cuda())\n    {\n#ifdef WITH_CUDA\n        return ms_deform_attn_cuda_forward(\n            value, spatial_shapes, level_start_index, sampling_loc, attn_weight, im2col_step);\n#else\n        AT_ERROR(\"Not compiled with GPU support\");\n#endif\n    }\n    AT_ERROR(\"Not implemented on the CPU\");\n}\n\nstd::vector<at::Tensor>\nms_deform_attn_backward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const at::Tensor &grad_output,\n    const int im2col_step)\n{\n    if (value.type().is_cuda())\n    {\n#ifdef WITH_CUDA\n        return ms_deform_attn_cuda_backward(\n            value, spatial_shapes, level_start_index, sampling_loc, attn_weight, grad_output, im2col_step);\n#else\n        AT_ERROR(\"Not compiled with GPU 
support\");\n#endif\n    }\n    AT_ERROR(\"Not implemented on the CPU\");\n}\n\n"
  },
  {
    "path": "mask2former/modeling/pixel_decoder/ops/src/vision.cpp",
    "content": "/*!\n**************************************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************************************\n* Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n**************************************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#include \"ms_deform_attn.h\"\n\nPYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {\n  m.def(\"ms_deform_attn_forward\", &ms_deform_attn_forward, \"ms_deform_attn_forward\");\n  m.def(\"ms_deform_attn_backward\", &ms_deform_attn_backward, \"ms_deform_attn_backward\");\n}\n"
  },
  {
    "path": "mask2former/modeling/pixel_decoder/ops/test.py",
    "content": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\nfrom __future__ import absolute_import\nfrom __future__ import print_function\nfrom __future__ import division\n\nimport time\nimport torch\nimport torch.nn as nn\nfrom torch.autograd import gradcheck\n\nfrom functions.ms_deform_attn_func import MSDeformAttnFunction, ms_deform_attn_core_pytorch\n\n\nN, M, D = 1, 2, 2\nLq, L, P = 2, 2, 2\nshapes = torch.as_tensor([(6, 4), (3, 2)], dtype=torch.long).cuda()\nlevel_start_index = torch.cat((shapes.new_zeros((1, )), shapes.prod(1).cumsum(0)[:-1]))\nS = sum([(H*W).item() for H, W in shapes])\n\n\ntorch.manual_seed(3)\n\n\n@torch.no_grad()\ndef check_forward_equal_with_pytorch_double():\n    value = torch.rand(N, S, M, D).cuda() * 0.01\n    sampling_locations = torch.rand(N, Lq, M, L, P, 2).cuda()\n    attention_weights = torch.rand(N, Lq, M, L, P).cuda() + 1e-5\n    attention_weights /= attention_weights.sum(-1, keepdim=True).sum(-2, keepdim=True)\n    im2col_step = 2\n    output_pytorch = ms_deform_attn_core_pytorch(value.double(), shapes, sampling_locations.double(), attention_weights.double()).detach().cpu()\n    output_cuda = MSDeformAttnFunction.apply(value.double(), shapes, level_start_index, sampling_locations.double(), attention_weights.double(), im2col_step).detach().cpu()\n    fwdok = torch.allclose(output_cuda, output_pytorch)\n    max_abs_err = (output_cuda - 
output_pytorch).abs().max()\n    max_rel_err = ((output_cuda - output_pytorch).abs() / output_pytorch.abs()).max()\n\n    print(f'* {fwdok} check_forward_equal_with_pytorch_double: max_abs_err {max_abs_err:.2e} max_rel_err {max_rel_err:.2e}')\n\n\n@torch.no_grad()\ndef check_forward_equal_with_pytorch_float():\n    value = torch.rand(N, S, M, D).cuda() * 0.01\n    sampling_locations = torch.rand(N, Lq, M, L, P, 2).cuda()\n    attention_weights = torch.rand(N, Lq, M, L, P).cuda() + 1e-5\n    attention_weights /= attention_weights.sum(-1, keepdim=True).sum(-2, keepdim=True)\n    im2col_step = 2\n    output_pytorch = ms_deform_attn_core_pytorch(value, shapes, sampling_locations, attention_weights).detach().cpu()\n    output_cuda = MSDeformAttnFunction.apply(value, shapes, level_start_index, sampling_locations, attention_weights, im2col_step).detach().cpu()\n    fwdok = torch.allclose(output_cuda, output_pytorch, rtol=1e-2, atol=1e-3)\n    max_abs_err = (output_cuda - output_pytorch).abs().max()\n    max_rel_err = ((output_cuda - output_pytorch).abs() / output_pytorch.abs()).max()\n\n    print(f'* {fwdok} check_forward_equal_with_pytorch_float: max_abs_err {max_abs_err:.2e} max_rel_err {max_rel_err:.2e}')\n\n\ndef check_gradient_numerical(channels=4, grad_value=True, grad_sampling_loc=True, grad_attn_weight=True):\n\n    value = torch.rand(N, S, M, channels).cuda() * 0.01\n    sampling_locations = torch.rand(N, Lq, M, L, P, 2).cuda()\n    attention_weights = torch.rand(N, Lq, M, L, P).cuda() + 1e-5\n    attention_weights /= attention_weights.sum(-1, keepdim=True).sum(-2, keepdim=True)\n    im2col_step = 2\n    func = MSDeformAttnFunction.apply\n\n    value.requires_grad = grad_value\n    sampling_locations.requires_grad = grad_sampling_loc\n    attention_weights.requires_grad = grad_attn_weight\n\n    gradok = gradcheck(func, (value.double(), shapes, level_start_index, sampling_locations.double(), attention_weights.double(), im2col_step))\n\n    print(f'* {gradok} 
check_gradient_numerical(D={channels})')\n\n\nif __name__ == '__main__':\n    check_forward_equal_with_pytorch_double()\n    check_forward_equal_with_pytorch_float()\n\n    for channels in [30, 32, 64, 71, 1025, 2048, 3096]:\n        check_gradient_numerical(channels, True, True, True)\n\n\n\n"
  },
  {
    "path": "mask2former/modeling/transformer_decoder/__init__.py",
    "content": "from .maskformer_transformer_decoder import StandardTransformerDecoder\nfrom .mask2former_transformer_decoder import MultiScaleMaskedTransformerDecoder\n"
  },
  {
    "path": "mask2former/modeling/transformer_decoder/mask2former_transformer_decoder.py",
    "content": "# Modified by Bowen Cheng from: https://github.com/facebookresearch/detr/blob/master/models/detr.py\nimport logging\nimport fvcore.nn.weight_init as weight_init\nfrom typing import Optional\nimport torch\nfrom torch import nn, Tensor\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.layers import Conv2d\n\nfrom .position_encoding import PositionEmbeddingSine\nfrom .maskformer_transformer_decoder import TRANSFORMER_DECODER_REGISTRY\n\n\nclass SelfAttentionLayer(nn.Module):\n\n    def __init__(self, d_model, nhead, dropout=0.0,\n                 activation=\"relu\", normalize_before=False):\n        super().__init__()\n        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)\n\n        self.norm = nn.LayerNorm(d_model)\n        self.dropout = nn.Dropout(dropout)\n\n        self.activation = _get_activation_fn(activation)\n        self.normalize_before = normalize_before\n\n        self._reset_parameters()\n    \n    def _reset_parameters(self):\n        for p in self.parameters():\n            if p.dim() > 1:\n                nn.init.xavier_uniform_(p)\n\n    def with_pos_embed(self, tensor, pos: Optional[Tensor]):\n        return tensor if pos is None else tensor + pos\n\n    def forward_post(self, tgt,\n                     tgt_mask: Optional[Tensor] = None,\n                     tgt_key_padding_mask: Optional[Tensor] = None,\n                     query_pos: Optional[Tensor] = None):\n        q = k = self.with_pos_embed(tgt, query_pos)\n        tgt2 = self.self_attn(q, k, value=tgt, attn_mask=tgt_mask,\n                              key_padding_mask=tgt_key_padding_mask)[0]\n        tgt = tgt + self.dropout(tgt2)\n        tgt = self.norm(tgt)\n\n        return tgt\n\n    def forward_pre(self, tgt,\n                    tgt_mask: Optional[Tensor] = None,\n                    tgt_key_padding_mask: Optional[Tensor] = None,\n                    query_pos: Optional[Tensor] = 
None):\n        tgt2 = self.norm(tgt)\n        q = k = self.with_pos_embed(tgt2, query_pos)\n        tgt2 = self.self_attn(q, k, value=tgt2, attn_mask=tgt_mask,\n                              key_padding_mask=tgt_key_padding_mask)[0]\n        tgt = tgt + self.dropout(tgt2)\n        \n        return tgt\n\n    def forward(self, tgt,\n                tgt_mask: Optional[Tensor] = None,\n                tgt_key_padding_mask: Optional[Tensor] = None,\n                query_pos: Optional[Tensor] = None):\n        if self.normalize_before:\n            return self.forward_pre(tgt, tgt_mask,\n                                    tgt_key_padding_mask, query_pos)\n        return self.forward_post(tgt, tgt_mask,\n                                 tgt_key_padding_mask, query_pos)\n\n\nclass CrossAttentionLayer(nn.Module):\n\n    def __init__(self, d_model, nhead, dropout=0.0,\n                 activation=\"relu\", normalize_before=False):\n        super().__init__()\n        self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)\n\n        self.norm = nn.LayerNorm(d_model)\n        self.dropout = nn.Dropout(dropout)\n\n        self.activation = _get_activation_fn(activation)\n        self.normalize_before = normalize_before\n\n        self._reset_parameters()\n    \n    def _reset_parameters(self):\n        for p in self.parameters():\n            if p.dim() > 1:\n                nn.init.xavier_uniform_(p)\n\n    def with_pos_embed(self, tensor, pos: Optional[Tensor]):\n        return tensor if pos is None else tensor + pos\n\n    def forward_post(self, tgt, memory,\n                     memory_mask: Optional[Tensor] = None,\n                     memory_key_padding_mask: Optional[Tensor] = None,\n                     pos: Optional[Tensor] = None,\n                     query_pos: Optional[Tensor] = None):\n        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt, query_pos),\n                                   key=self.with_pos_embed(memory, pos),\n  
                                 value=memory, attn_mask=memory_mask,\n                                   key_padding_mask=memory_key_padding_mask)[0]\n        tgt = tgt + self.dropout(tgt2)\n        tgt = self.norm(tgt)\n        \n        return tgt\n\n    def forward_pre(self, tgt, memory,\n                    memory_mask: Optional[Tensor] = None,\n                    memory_key_padding_mask: Optional[Tensor] = None,\n                    pos: Optional[Tensor] = None,\n                    query_pos: Optional[Tensor] = None):\n        tgt2 = self.norm(tgt)\n        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt2, query_pos),\n                                   key=self.with_pos_embed(memory, pos),\n                                   value=memory, attn_mask=memory_mask,\n                                   key_padding_mask=memory_key_padding_mask)[0]\n        tgt = tgt + self.dropout(tgt2)\n\n        return tgt\n\n    def forward(self, tgt, memory,\n                memory_mask: Optional[Tensor] = None,\n                memory_key_padding_mask: Optional[Tensor] = None,\n                pos: Optional[Tensor] = None,\n                query_pos: Optional[Tensor] = None):\n        if self.normalize_before:\n            return self.forward_pre(tgt, memory, memory_mask,\n                                    memory_key_padding_mask, pos, query_pos)\n        return self.forward_post(tgt, memory, memory_mask,\n                                 memory_key_padding_mask, pos, query_pos)\n\n\nclass FFNLayer(nn.Module):\n\n    def __init__(self, d_model, dim_feedforward=2048, dropout=0.0,\n                 activation=\"relu\", normalize_before=False):\n        super().__init__()\n        # Implementation of Feedforward model\n        self.linear1 = nn.Linear(d_model, dim_feedforward)\n        self.dropout = nn.Dropout(dropout)\n        self.linear2 = nn.Linear(dim_feedforward, d_model)\n\n        self.norm = nn.LayerNorm(d_model)\n\n        self.activation = 
_get_activation_fn(activation)\n        self.normalize_before = normalize_before\n\n        self._reset_parameters()\n    \n    def _reset_parameters(self):\n        for p in self.parameters():\n            if p.dim() > 1:\n                nn.init.xavier_uniform_(p)\n\n    def with_pos_embed(self, tensor, pos: Optional[Tensor]):\n        return tensor if pos is None else tensor + pos\n\n    def forward_post(self, tgt):\n        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))\n        tgt = tgt + self.dropout(tgt2)\n        tgt = self.norm(tgt)\n        return tgt\n\n    def forward_pre(self, tgt):\n        tgt2 = self.norm(tgt)\n        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt2))))\n        tgt = tgt + self.dropout(tgt2)\n        return tgt\n\n    def forward(self, tgt):\n        if self.normalize_before:\n            return self.forward_pre(tgt)\n        return self.forward_post(tgt)\n\n\ndef _get_activation_fn(activation):\n    \"\"\"Return an activation function given a string\"\"\"\n    if activation == \"relu\":\n        return F.relu\n    if activation == \"gelu\":\n        return F.gelu\n    if activation == \"glu\":\n        return F.glu\n    raise RuntimeError(F\"activation should be relu/gelu/glu, not {activation}.\")\n\n\nclass MLP(nn.Module):\n    \"\"\" Very simple multi-layer perceptron (also called FFN)\"\"\"\n\n    def __init__(self, input_dim, hidden_dim, output_dim, num_layers):\n        super().__init__()\n        self.num_layers = num_layers\n        h = [hidden_dim] * (num_layers - 1)\n        self.layers = nn.ModuleList(nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim]))\n\n    def forward(self, x):\n        for i, layer in enumerate(self.layers):\n            x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x)\n        return x\n\n\n@TRANSFORMER_DECODER_REGISTRY.register()\nclass MultiScaleMaskedTransformerDecoder(nn.Module):\n\n    _version = 2\n\n    def 
_load_from_state_dict(\n        self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs\n    ):\n        version = local_metadata.get(\"version\", None)\n        if version is None or version < 2:\n            # Do not warn if train from scratch\n            scratch = True\n            logger = logging.getLogger(__name__)\n            for k in list(state_dict.keys()):\n                newk = k\n                if \"static_query\" in k:\n                    newk = k.replace(\"static_query\", \"query_feat\")\n                if newk != k:\n                    state_dict[newk] = state_dict[k]\n                    del state_dict[k]\n                    scratch = False\n\n            if not scratch:\n                logger.warning(\n                    f\"Weight format of {self.__class__.__name__} has changed! \"\n                    \"Please upgrade your models. Applying automatic conversion now ...\"\n                )\n\n    @configurable\n    def __init__(\n        self,\n        in_channels,\n        mask_classification=True,\n        *,\n        num_classes: int,\n        hidden_dim: int,\n        num_queries: int,\n        nheads: int,\n        dim_feedforward: int,\n        dec_layers: int,\n        pre_norm: bool,\n        mask_dim: int,\n        enforce_input_project: bool,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            in_channels: channels of the input features\n            mask_classification: whether to add mask classifier or not\n            num_classes: number of classes\n            hidden_dim: Transformer feature dimension\n            num_queries: number of queries\n            nheads: number of heads\n            dim_feedforward: feature dimension in feedforward network\n            dec_layers: number of Transformer decoder layers\n            pre_norm: whether to use pre-LayerNorm or not\n            
mask_dim: mask feature dimension\n            enforce_input_project: add input project 1x1 conv even if input\n                channels and hidden dim are identical\n        \"\"\"\n        super().__init__()\n\n        assert mask_classification, \"Only support mask classification model\"\n        self.mask_classification = mask_classification\n\n        # positional encoding\n        N_steps = hidden_dim // 2\n        self.pe_layer = PositionEmbeddingSine(N_steps, normalize=True)\n\n        # define Transformer decoder here\n        self.num_heads = nheads\n        self.num_layers = dec_layers\n        self.transformer_self_attention_layers = nn.ModuleList()\n        self.transformer_cross_attention_layers = nn.ModuleList()\n        self.transformer_ffn_layers = nn.ModuleList()\n\n        for _ in range(self.num_layers):\n            self.transformer_self_attention_layers.append(\n                SelfAttentionLayer(\n                    d_model=hidden_dim,\n                    nhead=nheads,\n                    dropout=0.0,\n                    normalize_before=pre_norm,\n                )\n            )\n\n            self.transformer_cross_attention_layers.append(\n                CrossAttentionLayer(\n                    d_model=hidden_dim,\n                    nhead=nheads,\n                    dropout=0.0,\n                    normalize_before=pre_norm,\n                )\n            )\n\n            self.transformer_ffn_layers.append(\n                FFNLayer(\n                    d_model=hidden_dim,\n                    dim_feedforward=dim_feedforward,\n                    dropout=0.0,\n                    normalize_before=pre_norm,\n                )\n            )\n\n        self.decoder_norm = nn.LayerNorm(hidden_dim)\n\n        self.num_queries = num_queries\n        # learnable query features\n        self.query_feat = nn.Embedding(num_queries, hidden_dim)\n        # learnable query p.e.\n        self.query_embed = nn.Embedding(num_queries, 
hidden_dim)\n\n        # level embedding (we always use 3 scales)\n        self.num_feature_levels = 3\n        self.level_embed = nn.Embedding(self.num_feature_levels, hidden_dim)\n        self.input_proj = nn.ModuleList()\n        for _ in range(self.num_feature_levels):\n            if in_channels != hidden_dim or enforce_input_project:\n                self.input_proj.append(Conv2d(in_channels, hidden_dim, kernel_size=1))\n                weight_init.c2_xavier_fill(self.input_proj[-1])\n            else:\n                self.input_proj.append(nn.Sequential())\n\n        # output FFNs\n        if self.mask_classification:\n            self.class_embed = nn.Linear(hidden_dim, num_classes + 1)\n        self.mask_embed = MLP(hidden_dim, hidden_dim, mask_dim, 3)\n\n    @classmethod\n    def from_config(cls, cfg, in_channels, mask_classification):\n        ret = {}\n        ret[\"in_channels\"] = in_channels\n        ret[\"mask_classification\"] = mask_classification\n        \n        ret[\"num_classes\"] = cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES\n        ret[\"hidden_dim\"] = cfg.MODEL.MASK_FORMER.HIDDEN_DIM\n        ret[\"num_queries\"] = cfg.MODEL.MASK_FORMER.NUM_OBJECT_QUERIES\n        # Transformer parameters:\n        ret[\"nheads\"] = cfg.MODEL.MASK_FORMER.NHEADS\n        ret[\"dim_feedforward\"] = cfg.MODEL.MASK_FORMER.DIM_FEEDFORWARD\n\n        # NOTE: because we add learnable query features which requires supervision,\n        # we add minus 1 to decoder layers to be consistent with our loss\n        # implementation: that is, number of auxiliary losses is always\n        # equal to number of decoder layers. 
With learnable query features, the number of\n        # auxiliary losses equals number of decoders plus 1.\n        assert cfg.MODEL.MASK_FORMER.DEC_LAYERS >= 1\n        ret[\"dec_layers\"] = cfg.MODEL.MASK_FORMER.DEC_LAYERS - 1\n        ret[\"pre_norm\"] = cfg.MODEL.MASK_FORMER.PRE_NORM\n        ret[\"enforce_input_project\"] = cfg.MODEL.MASK_FORMER.ENFORCE_INPUT_PROJ\n\n        ret[\"mask_dim\"] = cfg.MODEL.SEM_SEG_HEAD.MASK_DIM\n\n        return ret\n\n    def forward(self, x, mask_features, mask=None):\n        # x is a list of multi-scale features\n        assert len(x) == self.num_feature_levels\n        src = []\n        pos = []\n        size_list = []\n\n        # disable mask, it does not affect performance\n        del mask\n\n        for i in range(self.num_feature_levels):\n            size_list.append(x[i].shape[-2:])\n            pos.append(self.pe_layer(x[i], None).flatten(2))\n            src.append(self.input_proj[i](x[i]).flatten(2) + self.level_embed.weight[i][None, :, None])\n\n            # flatten NxCxHxW to HWxNxC\n            pos[-1] = pos[-1].permute(2, 0, 1)\n            src[-1] = src[-1].permute(2, 0, 1)\n\n        _, bs, _ = src[0].shape\n\n        # QxNxC\n        query_embed = self.query_embed.weight.unsqueeze(1).repeat(1, bs, 1)\n        output = self.query_feat.weight.unsqueeze(1).repeat(1, bs, 1)\n\n        predictions_class = []\n        predictions_mask = []\n\n        # prediction heads on learnable query features\n        outputs_class, outputs_mask, attn_mask = self.forward_prediction_heads(output, mask_features, attn_mask_target_size=size_list[0])\n        predictions_class.append(outputs_class)\n        predictions_mask.append(outputs_mask)\n\n        for i in range(self.num_layers):\n            level_index = i % self.num_feature_levels\n            attn_mask[torch.where(attn_mask.sum(-1) == attn_mask.shape[-1])] = False\n            # attention: 
cross-attention first\n            output = self.transformer_cross_attention_layers[i](\n                output, src[level_index],\n                memory_mask=attn_mask,\n                memory_key_padding_mask=None,  # here we do not apply masking on padded region\n                pos=pos[level_index], query_pos=query_embed\n            )\n\n            output = self.transformer_self_attention_layers[i](\n                output, tgt_mask=None,\n                tgt_key_padding_mask=None,\n                query_pos=query_embed\n            )\n\n            # FFN\n            output = self.transformer_ffn_layers[i](\n                output\n            )\n\n            outputs_class, outputs_mask, attn_mask = self.forward_prediction_heads(output, mask_features, attn_mask_target_size=size_list[(i + 1) % self.num_feature_levels])\n            predictions_class.append(outputs_class)\n            predictions_mask.append(outputs_mask)\n\n        assert len(predictions_class) == self.num_layers + 1\n        out = {\n            'pred_logits': predictions_class[-1],\n            'pred_masks': predictions_mask[-1],\n            'aux_outputs': self._set_aux_loss(\n                predictions_class if self.mask_classification else None, predictions_mask\n            )\n        }\n        return out\n\n    def forward_prediction_heads(self, output, mask_features, attn_mask_target_size):\n        decoder_output = self.decoder_norm(output)\n        decoder_output = decoder_output.transpose(0, 1)\n        outputs_class = self.class_embed(decoder_output)\n        mask_embed = self.mask_embed(decoder_output)\n        outputs_mask = torch.einsum(\"bqc,bchw->bqhw\", mask_embed, mask_features)\n\n        # NOTE: prediction is of higher-resolution\n        # [B, Q, H, W] -> [B, Q, H*W] -> [B, h, Q, H*W] -> [B*h, Q, HW]\n        attn_mask = F.interpolate(outputs_mask, size=attn_mask_target_size, 
mode=\"bilinear\", align_corners=False)\n        # must use bool type\n        # If a BoolTensor is provided, positions with ``True`` are not allowed to attend while ``False`` values will be unchanged.\n        attn_mask = (attn_mask.sigmoid().flatten(2).unsqueeze(1).repeat(1, self.num_heads, 1, 1).flatten(0, 1) < 0.5).bool()\n        attn_mask = attn_mask.detach()\n\n        return outputs_class, outputs_mask, attn_mask\n\n    @torch.jit.unused\n    def _set_aux_loss(self, outputs_class, outputs_seg_masks):\n        # this is a workaround to make torchscript happy, as torchscript\n        # doesn't support dictionary with non-homogeneous values, such\n        # as a dict having both a Tensor and a list.\n        if self.mask_classification:\n            return [\n                {\"pred_logits\": a, \"pred_masks\": b}\n                for a, b in zip(outputs_class[:-1], outputs_seg_masks[:-1])\n            ]\n        else:\n            return [{\"pred_masks\": b} for b in outputs_seg_masks[:-1]]\n"
  },
  {
    "path": "mask2former/modeling/transformer_decoder/maskformer_transformer_decoder.py",
    "content": "# Modified by Bowen Cheng from: https://github.com/facebookresearch/detr/blob/master/models/detr.py\nimport fvcore.nn.weight_init as weight_init\nimport torch\nfrom torch import nn\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.layers import Conv2d\nfrom detectron2.utils.registry import Registry\n\nfrom .position_encoding import PositionEmbeddingSine\nfrom .transformer import Transformer\n\n\nTRANSFORMER_DECODER_REGISTRY = Registry(\"TRANSFORMER_MODULE\")\nTRANSFORMER_DECODER_REGISTRY.__doc__ = \"\"\"\nRegistry for transformer module in MaskFormer.\n\"\"\"\n\n\ndef build_transformer_decoder(cfg, in_channels, mask_classification=True):\n    \"\"\"\n    Build a instance embedding branch from `cfg.MODEL.INS_EMBED_HEAD.NAME`.\n    \"\"\"\n    name = cfg.MODEL.MASK_FORMER.TRANSFORMER_DECODER_NAME\n    return TRANSFORMER_DECODER_REGISTRY.get(name)(cfg, in_channels, mask_classification)\n\n\n@TRANSFORMER_DECODER_REGISTRY.register()\nclass StandardTransformerDecoder(nn.Module):\n    @configurable\n    def __init__(\n        self,\n        in_channels,\n        mask_classification=True,\n        *,\n        num_classes: int,\n        hidden_dim: int,\n        num_queries: int,\n        nheads: int,\n        dropout: float,\n        dim_feedforward: int,\n        enc_layers: int,\n        dec_layers: int,\n        pre_norm: bool,\n        deep_supervision: bool,\n        mask_dim: int,\n        enforce_input_project: bool,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            in_channels: channels of the input features\n            mask_classification: whether to add mask classifier or not\n            num_classes: number of classes\n            hidden_dim: Transformer feature dimension\n            num_queries: number of queries\n            nheads: number of heads\n            dropout: dropout in Transformer\n            dim_feedforward: feature dimension in 
feedforward network\n            enc_layers: number of Transformer encoder layers\n            dec_layers: number of Transformer decoder layers\n            pre_norm: whether to use pre-LayerNorm or not\n            deep_supervision: whether to add supervision to every decoder layer\n            mask_dim: mask feature dimension\n            enforce_input_project: add input project 1x1 conv even if input\n                channels and hidden dim are identical\n        \"\"\"\n        super().__init__()\n\n        self.mask_classification = mask_classification\n\n        # positional encoding\n        N_steps = hidden_dim // 2\n        self.pe_layer = PositionEmbeddingSine(N_steps, normalize=True)\n\n        transformer = Transformer(\n            d_model=hidden_dim,\n            dropout=dropout,\n            nhead=nheads,\n            dim_feedforward=dim_feedforward,\n            num_encoder_layers=enc_layers,\n            num_decoder_layers=dec_layers,\n            normalize_before=pre_norm,\n            return_intermediate_dec=deep_supervision,\n        )\n\n        self.num_queries = num_queries\n        self.transformer = transformer\n        hidden_dim = transformer.d_model\n\n        self.query_embed = nn.Embedding(num_queries, hidden_dim)\n\n        if in_channels != hidden_dim or enforce_input_project:\n            self.input_proj = Conv2d(in_channels, hidden_dim, kernel_size=1)\n            weight_init.c2_xavier_fill(self.input_proj)\n        else:\n            self.input_proj = nn.Sequential()\n        self.aux_loss = deep_supervision\n\n        # output FFNs\n        if self.mask_classification:\n            self.class_embed = nn.Linear(hidden_dim, num_classes + 1)\n        self.mask_embed = MLP(hidden_dim, hidden_dim, mask_dim, 3)\n\n    @classmethod\n    def from_config(cls, cfg, in_channels, mask_classification):\n        ret = {}\n        ret[\"in_channels\"] = in_channels\n        ret[\"mask_classification\"] = mask_classification\n\n        
ret[\"num_classes\"] = cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES\n        ret[\"hidden_dim\"] = cfg.MODEL.MASK_FORMER.HIDDEN_DIM\n        ret[\"num_queries\"] = cfg.MODEL.MASK_FORMER.NUM_OBJECT_QUERIES\n        # Transformer parameters:\n        ret[\"nheads\"] = cfg.MODEL.MASK_FORMER.NHEADS\n        ret[\"dropout\"] = cfg.MODEL.MASK_FORMER.DROPOUT\n        ret[\"dim_feedforward\"] = cfg.MODEL.MASK_FORMER.DIM_FEEDFORWARD\n        ret[\"enc_layers\"] = cfg.MODEL.MASK_FORMER.ENC_LAYERS\n        ret[\"dec_layers\"] = cfg.MODEL.MASK_FORMER.DEC_LAYERS\n        ret[\"pre_norm\"] = cfg.MODEL.MASK_FORMER.PRE_NORM\n        ret[\"deep_supervision\"] = cfg.MODEL.MASK_FORMER.DEEP_SUPERVISION\n        ret[\"enforce_input_project\"] = cfg.MODEL.MASK_FORMER.ENFORCE_INPUT_PROJ\n\n        ret[\"mask_dim\"] = cfg.MODEL.SEM_SEG_HEAD.MASK_DIM\n\n        return ret\n\n    def forward(self, x, mask_features, mask=None):\n        if mask is not None:\n            mask = F.interpolate(mask[None].float(), size=x.shape[-2:]).to(torch.bool)[0]\n        pos = self.pe_layer(x, mask)\n\n        src = x\n        hs, memory = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos)\n\n        if self.mask_classification:\n            outputs_class = self.class_embed(hs)\n            out = {\"pred_logits\": outputs_class[-1]}\n        else:\n            out = {}\n\n        if self.aux_loss:\n            # [l, bs, queries, embed]\n            mask_embed = self.mask_embed(hs)\n            outputs_seg_masks = torch.einsum(\"lbqc,bchw->lbqhw\", mask_embed, mask_features)\n            out[\"pred_masks\"] = outputs_seg_masks[-1]\n            out[\"aux_outputs\"] = self._set_aux_loss(\n                outputs_class if self.mask_classification else None, outputs_seg_masks\n            )\n        else:\n            # FIXME h_boxes takes the last one computed, keep this in mind\n            # [bs, queries, embed]\n            mask_embed = self.mask_embed(hs[-1])\n            outputs_seg_masks = 
torch.einsum(\"bqc,bchw->bqhw\", mask_embed, mask_features)\n            out[\"pred_masks\"] = outputs_seg_masks\n        return out\n\n    @torch.jit.unused\n    def _set_aux_loss(self, outputs_class, outputs_seg_masks):\n        # this is a workaround to make torchscript happy, as torchscript\n        # doesn't support dictionary with non-homogeneous values, such\n        # as a dict having both a Tensor and a list.\n        if self.mask_classification:\n            return [\n                {\"pred_logits\": a, \"pred_masks\": b}\n                for a, b in zip(outputs_class[:-1], outputs_seg_masks[:-1])\n            ]\n        else:\n            return [{\"pred_masks\": b} for b in outputs_seg_masks[:-1]]\n\n\nclass MLP(nn.Module):\n    \"\"\"Very simple multi-layer perceptron (also called FFN)\"\"\"\n\n    def __init__(self, input_dim, hidden_dim, output_dim, num_layers):\n        super().__init__()\n        self.num_layers = num_layers\n        h = [hidden_dim] * (num_layers - 1)\n        self.layers = nn.ModuleList(\n            nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim])\n        )\n\n    def forward(self, x):\n        for i, layer in enumerate(self.layers):\n            x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x)\n        return x\n"
  },
  {
    "path": "mask2former/modeling/transformer_decoder/position_encoding.py",
    "content": "# # Modified by Bowen Cheng from: https://github.com/facebookresearch/detr/blob/master/models/position_encoding.py\n\"\"\"\nVarious positional encodings for the transformer.\n\"\"\"\nimport math\n\nimport torch\nfrom torch import nn\n\n\nclass PositionEmbeddingSine(nn.Module):\n    \"\"\"\n    This is a more standard version of the position embedding, very similar to the one\n    used by the Attention is all you need paper, generalized to work on images.\n    \"\"\"\n\n    def __init__(self, num_pos_feats=64, temperature=10000, normalize=False, scale=None):\n        super().__init__()\n        self.num_pos_feats = num_pos_feats\n        self.temperature = temperature\n        self.normalize = normalize\n        if scale is not None and normalize is False:\n            raise ValueError(\"normalize should be True if scale is passed\")\n        if scale is None:\n            scale = 2 * math.pi\n        self.scale = scale\n\n    def forward(self, x, mask=None):\n        if mask is None:\n            mask = torch.zeros((x.size(0), x.size(2), x.size(3)), device=x.device, dtype=torch.bool)\n        not_mask = ~mask\n        y_embed = not_mask.cumsum(1, dtype=torch.float32)\n        x_embed = not_mask.cumsum(2, dtype=torch.float32)\n        if self.normalize:\n            eps = 1e-6\n            y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale\n            x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale\n\n        dim_t = torch.arange(self.num_pos_feats, dtype=torch.float32, device=x.device)\n        dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)\n\n        pos_x = x_embed[:, :, :, None] / dim_t\n        pos_y = y_embed[:, :, :, None] / dim_t\n        pos_x = torch.stack(\n            (pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4\n        ).flatten(3)\n        pos_y = torch.stack(\n            (pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4\n        ).flatten(3)\n        pos = 
torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)\n        return pos\n\n    def __repr__(self, _repr_indent=4):\n        head = \"Positional encoding \" + self.__class__.__name__\n        body = [\n            \"num_pos_feats: {}\".format(self.num_pos_feats),\n            \"temperature: {}\".format(self.temperature),\n            \"normalize: {}\".format(self.normalize),\n            \"scale: {}\".format(self.scale),\n        ]\n        lines = [head] + [\" \" * _repr_indent + line for line in body]\n        return \"\\n\".join(lines)\n"
  },
  {
    "path": "mask2former/modeling/transformer_decoder/transformer.py",
    "content": "# Modified by Bowen Cheng from: https://github.com/facebookresearch/detr/blob/master/models/transformer.py\n\"\"\"\nTransformer class.\n\nCopy-paste from torch.nn.Transformer with modifications:\n    * positional encodings are passed in MHattention\n    * extra LN at the end of encoder is removed\n    * decoder returns a stack of activations from all decoding layers\n\"\"\"\nimport copy\nfrom typing import List, Optional\n\nimport torch\nimport torch.nn.functional as F\nfrom torch import Tensor, nn\n\n\nclass Transformer(nn.Module):\n    def __init__(\n        self,\n        d_model=512,\n        nhead=8,\n        num_encoder_layers=6,\n        num_decoder_layers=6,\n        dim_feedforward=2048,\n        dropout=0.1,\n        activation=\"relu\",\n        normalize_before=False,\n        return_intermediate_dec=False,\n    ):\n        super().__init__()\n\n        encoder_layer = TransformerEncoderLayer(\n            d_model, nhead, dim_feedforward, dropout, activation, normalize_before\n        )\n        encoder_norm = nn.LayerNorm(d_model) if normalize_before else None\n        self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)\n\n        decoder_layer = TransformerDecoderLayer(\n            d_model, nhead, dim_feedforward, dropout, activation, normalize_before\n        )\n        decoder_norm = nn.LayerNorm(d_model)\n        self.decoder = TransformerDecoder(\n            decoder_layer,\n            num_decoder_layers,\n            decoder_norm,\n            return_intermediate=return_intermediate_dec,\n        )\n\n        self._reset_parameters()\n\n        self.d_model = d_model\n        self.nhead = nhead\n\n    def _reset_parameters(self):\n        for p in self.parameters():\n            if p.dim() > 1:\n                nn.init.xavier_uniform_(p)\n\n    def forward(self, src, mask, query_embed, pos_embed):\n        # flatten NxCxHxW to HWxNxC\n        bs, c, h, w = src.shape\n        src = 
src.flatten(2).permute(2, 0, 1)\n        pos_embed = pos_embed.flatten(2).permute(2, 0, 1)\n        query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)\n        if mask is not None:\n            mask = mask.flatten(1)\n\n        tgt = torch.zeros_like(query_embed)\n        memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)\n        hs = self.decoder(\n            tgt, memory, memory_key_padding_mask=mask, pos=pos_embed, query_pos=query_embed\n        )\n        return hs.transpose(1, 2), memory.permute(1, 2, 0).view(bs, c, h, w)\n\n\nclass TransformerEncoder(nn.Module):\n    def __init__(self, encoder_layer, num_layers, norm=None):\n        super().__init__()\n        self.layers = _get_clones(encoder_layer, num_layers)\n        self.num_layers = num_layers\n        self.norm = norm\n\n    def forward(\n        self,\n        src,\n        mask: Optional[Tensor] = None,\n        src_key_padding_mask: Optional[Tensor] = None,\n        pos: Optional[Tensor] = None,\n    ):\n        output = src\n\n        for layer in self.layers:\n            output = layer(\n                output, src_mask=mask, src_key_padding_mask=src_key_padding_mask, pos=pos\n            )\n\n        if self.norm is not None:\n            output = self.norm(output)\n\n        return output\n\n\nclass TransformerDecoder(nn.Module):\n    def __init__(self, decoder_layer, num_layers, norm=None, return_intermediate=False):\n        super().__init__()\n        self.layers = _get_clones(decoder_layer, num_layers)\n        self.num_layers = num_layers\n        self.norm = norm\n        self.return_intermediate = return_intermediate\n\n    def forward(\n        self,\n        tgt,\n        memory,\n        tgt_mask: Optional[Tensor] = None,\n        memory_mask: Optional[Tensor] = None,\n        tgt_key_padding_mask: Optional[Tensor] = None,\n        memory_key_padding_mask: Optional[Tensor] = None,\n        pos: Optional[Tensor] = None,\n        query_pos: Optional[Tensor] = 
None,\n    ):\n        output = tgt\n\n        intermediate = []\n\n        for layer in self.layers:\n            output = layer(\n                output,\n                memory,\n                tgt_mask=tgt_mask,\n                memory_mask=memory_mask,\n                tgt_key_padding_mask=tgt_key_padding_mask,\n                memory_key_padding_mask=memory_key_padding_mask,\n                pos=pos,\n                query_pos=query_pos,\n            )\n            if self.return_intermediate:\n                intermediate.append(self.norm(output))\n\n        if self.norm is not None:\n            output = self.norm(output)\n            if self.return_intermediate:\n                intermediate.pop()\n                intermediate.append(output)\n\n        if self.return_intermediate:\n            return torch.stack(intermediate)\n\n        return output.unsqueeze(0)\n\n\nclass TransformerEncoderLayer(nn.Module):\n    def __init__(\n        self,\n        d_model,\n        nhead,\n        dim_feedforward=2048,\n        dropout=0.1,\n        activation=\"relu\",\n        normalize_before=False,\n    ):\n        super().__init__()\n        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)\n        # Implementation of Feedforward model\n        self.linear1 = nn.Linear(d_model, dim_feedforward)\n        self.dropout = nn.Dropout(dropout)\n        self.linear2 = nn.Linear(dim_feedforward, d_model)\n\n        self.norm1 = nn.LayerNorm(d_model)\n        self.norm2 = nn.LayerNorm(d_model)\n        self.dropout1 = nn.Dropout(dropout)\n        self.dropout2 = nn.Dropout(dropout)\n\n        self.activation = _get_activation_fn(activation)\n        self.normalize_before = normalize_before\n\n    def with_pos_embed(self, tensor, pos: Optional[Tensor]):\n        return tensor if pos is None else tensor + pos\n\n    def forward_post(\n        self,\n        src,\n        src_mask: Optional[Tensor] = None,\n        src_key_padding_mask: 
Optional[Tensor] = None,\n        pos: Optional[Tensor] = None,\n    ):\n        q = k = self.with_pos_embed(src, pos)\n        src2 = self.self_attn(\n            q, k, value=src, attn_mask=src_mask, key_padding_mask=src_key_padding_mask\n        )[0]\n        src = src + self.dropout1(src2)\n        src = self.norm1(src)\n        src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))\n        src = src + self.dropout2(src2)\n        src = self.norm2(src)\n        return src\n\n    def forward_pre(\n        self,\n        src,\n        src_mask: Optional[Tensor] = None,\n        src_key_padding_mask: Optional[Tensor] = None,\n        pos: Optional[Tensor] = None,\n    ):\n        src2 = self.norm1(src)\n        q = k = self.with_pos_embed(src2, pos)\n        src2 = self.self_attn(\n            q, k, value=src2, attn_mask=src_mask, key_padding_mask=src_key_padding_mask\n        )[0]\n        src = src + self.dropout1(src2)\n        src2 = self.norm2(src)\n        src2 = self.linear2(self.dropout(self.activation(self.linear1(src2))))\n        src = src + self.dropout2(src2)\n        return src\n\n    def forward(\n        self,\n        src,\n        src_mask: Optional[Tensor] = None,\n        src_key_padding_mask: Optional[Tensor] = None,\n        pos: Optional[Tensor] = None,\n    ):\n        if self.normalize_before:\n            return self.forward_pre(src, src_mask, src_key_padding_mask, pos)\n        return self.forward_post(src, src_mask, src_key_padding_mask, pos)\n\n\nclass TransformerDecoderLayer(nn.Module):\n    def __init__(\n        self,\n        d_model,\n        nhead,\n        dim_feedforward=2048,\n        dropout=0.1,\n        activation=\"relu\",\n        normalize_before=False,\n    ):\n        super().__init__()\n        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)\n        self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)\n        # Implementation of Feedforward model\n       
 self.linear1 = nn.Linear(d_model, dim_feedforward)\n        self.dropout = nn.Dropout(dropout)\n        self.linear2 = nn.Linear(dim_feedforward, d_model)\n\n        self.norm1 = nn.LayerNorm(d_model)\n        self.norm2 = nn.LayerNorm(d_model)\n        self.norm3 = nn.LayerNorm(d_model)\n        self.dropout1 = nn.Dropout(dropout)\n        self.dropout2 = nn.Dropout(dropout)\n        self.dropout3 = nn.Dropout(dropout)\n\n        self.activation = _get_activation_fn(activation)\n        self.normalize_before = normalize_before\n\n    def with_pos_embed(self, tensor, pos: Optional[Tensor]):\n        return tensor if pos is None else tensor + pos\n\n    def forward_post(\n        self,\n        tgt,\n        memory,\n        tgt_mask: Optional[Tensor] = None,\n        memory_mask: Optional[Tensor] = None,\n        tgt_key_padding_mask: Optional[Tensor] = None,\n        memory_key_padding_mask: Optional[Tensor] = None,\n        pos: Optional[Tensor] = None,\n        query_pos: Optional[Tensor] = None,\n    ):\n        q = k = self.with_pos_embed(tgt, query_pos)\n        tgt2 = self.self_attn(\n            q, k, value=tgt, attn_mask=tgt_mask, key_padding_mask=tgt_key_padding_mask\n        )[0]\n        tgt = tgt + self.dropout1(tgt2)\n        tgt = self.norm1(tgt)\n        tgt2 = self.multihead_attn(\n            query=self.with_pos_embed(tgt, query_pos),\n            key=self.with_pos_embed(memory, pos),\n            value=memory,\n            attn_mask=memory_mask,\n            key_padding_mask=memory_key_padding_mask,\n        )[0]\n        tgt = tgt + self.dropout2(tgt2)\n        tgt = self.norm2(tgt)\n        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))\n        tgt = tgt + self.dropout3(tgt2)\n        tgt = self.norm3(tgt)\n        return tgt\n\n    def forward_pre(\n        self,\n        tgt,\n        memory,\n        tgt_mask: Optional[Tensor] = None,\n        memory_mask: Optional[Tensor] = None,\n        tgt_key_padding_mask: 
Optional[Tensor] = None,\n        memory_key_padding_mask: Optional[Tensor] = None,\n        pos: Optional[Tensor] = None,\n        query_pos: Optional[Tensor] = None,\n    ):\n        tgt2 = self.norm1(tgt)\n        q = k = self.with_pos_embed(tgt2, query_pos)\n        tgt2 = self.self_attn(\n            q, k, value=tgt2, attn_mask=tgt_mask, key_padding_mask=tgt_key_padding_mask\n        )[0]\n        tgt = tgt + self.dropout1(tgt2)\n        tgt2 = self.norm2(tgt)\n        tgt2 = self.multihead_attn(\n            query=self.with_pos_embed(tgt2, query_pos),\n            key=self.with_pos_embed(memory, pos),\n            value=memory,\n            attn_mask=memory_mask,\n            key_padding_mask=memory_key_padding_mask,\n        )[0]\n        tgt = tgt + self.dropout2(tgt2)\n        tgt2 = self.norm3(tgt)\n        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt2))))\n        tgt = tgt + self.dropout3(tgt2)\n        return tgt\n\n    def forward(\n        self,\n        tgt,\n        memory,\n        tgt_mask: Optional[Tensor] = None,\n        memory_mask: Optional[Tensor] = None,\n        tgt_key_padding_mask: Optional[Tensor] = None,\n        memory_key_padding_mask: Optional[Tensor] = None,\n        pos: Optional[Tensor] = None,\n        query_pos: Optional[Tensor] = None,\n    ):\n        if self.normalize_before:\n            return self.forward_pre(\n                tgt,\n                memory,\n                tgt_mask,\n                memory_mask,\n                tgt_key_padding_mask,\n                memory_key_padding_mask,\n                pos,\n                query_pos,\n            )\n        return self.forward_post(\n            tgt,\n            memory,\n            tgt_mask,\n            memory_mask,\n            tgt_key_padding_mask,\n            memory_key_padding_mask,\n            pos,\n            query_pos,\n        )\n\n\ndef _get_clones(module, N):\n    return nn.ModuleList([copy.deepcopy(module) for i in 
range(N)])\n\n\ndef _get_activation_fn(activation):\n    \"\"\"Return an activation function given a string\"\"\"\n    if activation == \"relu\":\n        return F.relu\n    if activation == \"gelu\":\n        return F.gelu\n    if activation == \"glu\":\n        return F.glu\n    raise RuntimeError(f\"activation should be relu/gelu/glu, not {activation}.\")\n"
  },
  {
    "path": "mask2former/test_time_augmentation.py",
    "content": "import copy\nimport logging\nfrom itertools import count\n\nimport numpy as np\nimport torch\nfrom fvcore.transforms import HFlipTransform\nfrom torch import nn\nfrom torch.nn.parallel import DistributedDataParallel\n\nfrom detectron2.data.detection_utils import read_image\nfrom detectron2.modeling import DatasetMapperTTA\n\n\n__all__ = [\n    \"SemanticSegmentorWithTTA\",\n]\n\n\nclass SemanticSegmentorWithTTA(nn.Module):\n    \"\"\"\n    A SemanticSegmentor with test-time augmentation enabled.\n    Its :meth:`__call__` method has the same interface as :meth:`SemanticSegmentor.forward`.\n    \"\"\"\n\n    def __init__(self, cfg, model, tta_mapper=None, batch_size=1):\n        \"\"\"\n        Args:\n            cfg (CfgNode):\n            model (SemanticSegmentor): a SemanticSegmentor to apply TTA on.\n            tta_mapper (callable): takes a dataset dict and returns a list of\n                augmented versions of the dataset dict. Defaults to\n                `DatasetMapperTTA(cfg)`.\n            batch_size (int): batch the augmented images into this batch size for inference.\n        \"\"\"\n        super().__init__()\n        if isinstance(model, DistributedDataParallel):\n            model = model.module\n        self.cfg = cfg.clone()\n\n        self.model = model\n\n        if tta_mapper is None:\n            tta_mapper = DatasetMapperTTA(cfg)\n        self.tta_mapper = tta_mapper\n        self.batch_size = batch_size\n\n    def __call__(self, batched_inputs):\n        \"\"\"\n        Same input/output format as :meth:`SemanticSegmentor.forward`\n        \"\"\"\n\n        def _maybe_read_image(dataset_dict):\n            ret = copy.copy(dataset_dict)\n            if \"image\" not in ret:\n                image = read_image(ret.pop(\"file_name\"), self.model.input_format)\n                image = torch.from_numpy(np.ascontiguousarray(image.transpose(2, 0, 1)))  # CHW\n                ret[\"image\"] = image\n            if \"height\" not in 
ret and \"width\" not in ret:\n                ret[\"height\"] = ret[\"image\"].shape[1]\n                ret[\"width\"] = ret[\"image\"].shape[2]\n            return ret\n\n        processed_results = []\n        for x in batched_inputs:\n            result = self._inference_one_image(_maybe_read_image(x))\n            processed_results.append(result)\n        return processed_results\n\n    def _inference_one_image(self, input):\n        \"\"\"\n        Args:\n            input (dict): one dataset dict with \"image\" field being a CHW tensor\n        Returns:\n            dict: one output dict\n        \"\"\"\n        orig_shape = (input[\"height\"], input[\"width\"])\n        augmented_inputs, tfms = self._get_augmented_inputs(input)\n\n        final_predictions = None\n        count_predictions = 0\n        for input, tfm in zip(augmented_inputs, tfms):\n            count_predictions += 1\n            with torch.no_grad():\n                if final_predictions is None:\n                    if any(isinstance(t, HFlipTransform) for t in tfm.transforms):\n                        final_predictions = self.model([input])[0].pop(\"sem_seg\").flip(dims=[2])\n                    else:\n                        final_predictions = self.model([input])[0].pop(\"sem_seg\")\n                else:\n                    if any(isinstance(t, HFlipTransform) for t in tfm.transforms):\n                        final_predictions += self.model([input])[0].pop(\"sem_seg\").flip(dims=[2])\n                    else:\n                        final_predictions += self.model([input])[0].pop(\"sem_seg\")\n\n        final_predictions = final_predictions / count_predictions\n        return {\"sem_seg\": final_predictions}\n\n    def _get_augmented_inputs(self, input):\n        augmented_inputs = self.tta_mapper(input)\n        tfms = [x.pop(\"transforms\") for x in augmented_inputs]\n        return augmented_inputs, tfms\n"
  },
  {
    "path": "mask2former/utils/__init__.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n"
  },
  {
    "path": "mask2former/utils/misc.py",
    "content": "# Modified by Bowen Cheng from https://github.com/facebookresearch/detr/blob/master/util/misc.py\n\"\"\"\nMisc functions, including distributed helpers.\n\nMostly copy-paste from torchvision references.\n\"\"\"\nfrom typing import List, Optional\n\nimport torch\nimport torch.distributed as dist\nimport torchvision\nfrom torch import Tensor\n\n\ndef _max_by_axis(the_list):\n    # type: (List[List[int]]) -> List[int]\n    maxes = the_list[0]\n    for sublist in the_list[1:]:\n        for index, item in enumerate(sublist):\n            maxes[index] = max(maxes[index], item)\n    return maxes\n\n\nclass NestedTensor(object):\n    def __init__(self, tensors, mask: Optional[Tensor]):\n        self.tensors = tensors\n        self.mask = mask\n\n    def to(self, device):\n        # type: (Device) -> NestedTensor # noqa\n        cast_tensor = self.tensors.to(device)\n        mask = self.mask\n        if mask is not None:\n            assert mask is not None\n            cast_mask = mask.to(device)\n        else:\n            cast_mask = None\n        return NestedTensor(cast_tensor, cast_mask)\n\n    def decompose(self):\n        return self.tensors, self.mask\n\n    def __repr__(self):\n        return str(self.tensors)\n\n\ndef nested_tensor_from_tensor_list(tensor_list: List[Tensor]):\n    # TODO make this more general\n    if tensor_list[0].ndim == 3:\n        if torchvision._is_tracing():\n            # nested_tensor_from_tensor_list() does not export well to ONNX\n            # call _onnx_nested_tensor_from_tensor_list() instead\n            return _onnx_nested_tensor_from_tensor_list(tensor_list)\n\n        # TODO make it support different-sized images\n        max_size = _max_by_axis([list(img.shape) for img in tensor_list])\n        # min_size = tuple(min(s) for s in zip(*[img.shape for img in tensor_list]))\n        batch_shape = [len(tensor_list)] + max_size\n        b, c, h, w = batch_shape\n        dtype = tensor_list[0].dtype\n        device = 
tensor_list[0].device\n        tensor = torch.zeros(batch_shape, dtype=dtype, device=device)\n        mask = torch.ones((b, h, w), dtype=torch.bool, device=device)\n        for img, pad_img, m in zip(tensor_list, tensor, mask):\n            pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)\n            m[: img.shape[1], : img.shape[2]] = False\n    else:\n        raise ValueError(\"not supported\")\n    return NestedTensor(tensor, mask)\n\n\n# _onnx_nested_tensor_from_tensor_list() is an implementation of\n# nested_tensor_from_tensor_list() that is supported by ONNX tracing.\n@torch.jit.unused\ndef _onnx_nested_tensor_from_tensor_list(tensor_list: List[Tensor]) -> NestedTensor:\n    max_size = []\n    for i in range(tensor_list[0].dim()):\n        max_size_i = torch.max(\n            torch.stack([img.shape[i] for img in tensor_list]).to(torch.float32)\n        ).to(torch.int64)\n        max_size.append(max_size_i)\n    max_size = tuple(max_size)\n\n    # work around for\n    # pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)\n    # m[: img.shape[1], :img.shape[2]] = False\n    # which is not yet supported in onnx\n    padded_imgs = []\n    padded_masks = []\n    for img in tensor_list:\n        padding = [(s1 - s2) for s1, s2 in zip(max_size, tuple(img.shape))]\n        padded_img = torch.nn.functional.pad(img, (0, padding[2], 0, padding[1], 0, padding[0]))\n        padded_imgs.append(padded_img)\n\n        m = torch.zeros_like(img[0], dtype=torch.int, device=img.device)\n        padded_mask = torch.nn.functional.pad(m, (0, padding[2], 0, padding[1]), \"constant\", 1)\n        padded_masks.append(padded_mask.to(torch.bool))\n\n    tensor = torch.stack(padded_imgs)\n    mask = torch.stack(padded_masks)\n\n    return NestedTensor(tensor, mask=mask)\n\n\ndef is_dist_avail_and_initialized():\n    if not dist.is_available():\n        return False\n    if not dist.is_initialized():\n        return False\n    return True\n"
  },
  {
    "path": "mask2former_video/__init__.py",
    "content": "from . import modeling\n\n# config\nfrom .config import add_maskformer2_video_config\n\n# models\nfrom .video_maskformer_model import VideoMaskFormer\n\n# video\nfrom .data_video import (\n    YTVISDatasetMapper,\n    CocoClipDatasetMapper,\n    YTVISEvaluator,\n    build_detection_train_loader,\n    build_detection_test_loader,\n    build_combined_loader,\n    get_detection_dataset_dicts,\n)\n"
  },
  {
    "path": "mask2former_video/config.py",
    "content": "# -*- coding: utf-8 -*-\nfrom detectron2.config import CfgNode as CN\n\n\ndef add_maskformer2_video_config(cfg):\n    # video data\n    # DataLoader\n    cfg.INPUT.SAMPLING_FRAME_NUM = 3 \n    cfg.INPUT.SAMPLING_FRAME_RANGE = 5\n    cfg.INPUT.SAMPLING_FRAME_SHUFFLE = True \n    cfg.INPUT.AUGMENTATIONS = [] \n\n    cfg.INPUT.PSEUDO = CN()\n    cfg.INPUT.PSEUDO.AUGMENTATIONS = ['rotation']\n    cfg.INPUT.PSEUDO.MIN_SIZE_TRAIN = (480, 512, 544, 576, 608, 640, 672, 704, 736, 768)\n    cfg.INPUT.PSEUDO.MAX_SIZE_TRAIN = 768\n    cfg.INPUT.PSEUDO.MIN_SIZE_TRAIN_SAMPLING = \"choice_by_clip\"\n    cfg.INPUT.PSEUDO.SAMPLING_FRAME_NUM = 4 \n    cfg.INPUT.PSEUDO.SAMPLING_FRAME_RANGE = 20 \n    cfg.INPUT.PSEUDO.CROP = CN()\n    cfg.INPUT.PSEUDO.CROP.ENABLED = False\n    cfg.INPUT.PSEUDO.CROP.TYPE = \"absolute_range\"\n    cfg.INPUT.PSEUDO.CROP.SIZE = (384, 600)\n\n    # LSJ\n    cfg.INPUT.LSJ_AUG = CN()\n    cfg.INPUT.LSJ_AUG.ENABLED = False\n    cfg.INPUT.LSJ_AUG.IMAGE_SIZE = 1024\n    cfg.INPUT.LSJ_AUG.MIN_SCALE = 0.1\n    cfg.INPUT.LSJ_AUG.MAX_SCALE = 2.0\n\n"
  },
  {
    "path": "mask2former_video/data_video/__init__.py",
    "content": "# Modified by Bowen Cheng from https://github.com/sukjunhwang/IFC\n\nfrom .dataset_mapper import YTVISDatasetMapper, CocoClipDatasetMapper\nfrom .build import *\n\nfrom .datasets import *\nfrom .ytvis_eval import YTVISEvaluator\n"
  },
  {
    "path": "mask2former_video/data_video/augmentation.py",
    "content": "# Modified by Bowen Cheng from https://github.com/sukjunhwang/IFC\n\nimport numpy as np\nimport logging\nimport sys\nfrom fvcore.transforms.transform import (\n    HFlipTransform,\n    NoOpTransform,\n    VFlipTransform,\n)\nfrom PIL import Image\nfrom typing import Tuple\nfrom detectron2.data import transforms as T\n\nclass RandomApplyClip(T.Augmentation):\n    \"\"\"\n    Randomly apply an augmentation with a given probability.\n    \"\"\"\n\n    def __init__(self, tfm_or_aug, prob=0.5, clip_frame_cnt=1):\n        \"\"\"\n        Args:\n            tfm_or_aug (Transform, Augmentation): the transform or augmentation\n                to be applied. It can either be a `Transform` or `Augmentation`\n                instance.\n            prob (float): probability between 0.0 and 1.0 that\n                the wrapper transformation is applied\n        \"\"\"\n        super().__init__()\n        self.aug = T.augmentation._transform_to_aug(tfm_or_aug)\n        assert 0.0 <= prob <= 1.0, f\"Probablity must be between 0.0 and 1.0 (given: {prob})\"\n        self.prob = prob\n        self._cnt = 0\n        self.clip_frame_cnt = clip_frame_cnt\n\n    def get_transform(self, *args):\n        if self._cnt % self.clip_frame_cnt == 0:\n            self.do = self._rand_range() < self.prob\n            self._cnt = 0   # avoiding overflow\n        self._cnt += 1\n\n        if self.do:\n            return self.aug.get_transform(*args)\n        else:\n            return NoOpTransform()\n\n    def __call__(self, aug_input):\n        if self._cnt % self.clip_frame_cnt == 0:\n            self.do = self._rand_range() < self.prob\n            self._cnt = 0   # avoiding overflow\n        self._cnt += 1\n\n        if self.do:\n            return self.aug(aug_input)\n        else:\n            return NoOpTransform()\n\n\nclass RandomRotationClip(T.Augmentation):\n    \"\"\"\n    This method returns a copy of this image, rotated the given\n    number of degrees counter 
clockwise around the given center.\n    \"\"\"\n\n    def __init__(self, angle, prob=0.5, expand=True, center=None, interp=None, clip_frame_cnt=1):\n        \"\"\"\n        Args:\n            angle (list[float]): a [min, max] interval (in degrees) from which one\n                angle per frame of the clip is sampled; the sampled angles are\n                sorted so the rotation changes monotonically across the clip.\n            prob (float): probability that the sorted angle sequence is reversed,\n                i.e. the rotation decreases instead of increases over the clip.\n            expand (bool): choose if the image should be resized to fit the whole\n                rotated image (default), or simply cropped\n            center (list[[float, float]]): a [[minx, miny], [maxx, maxy]] relative\n                interval from which to sample the center, [0, 0] being the top left\n                of the image and [1, 1] the bottom right.\n                Default: None, which means that the center of rotation is the center\n                of the image.\n                center has no effect if expand=True because it only affects shifting\n        \"\"\"\n        super().__init__()\n        if isinstance(angle, (float, int)):\n            angle = (angle, angle)\n        if center is not None and isinstance(center[0], (float, int)):\n            center = (center, center)\n        self.angle_save = None\n        self.center_save = None\n        self._cnt = 0\n        self._init(locals())\n\n    def get_transform(self, image):\n        h, w = image.shape[:2]\n        if self._cnt % self.clip_frame_cnt == 0:\n            center = None\n            angle = np.random.uniform(self.angle[0], self.angle[1], size=self.clip_frame_cnt)\n            if self.center is not None:\n                center = (\n                    np.random.uniform(self.center[0][0], self.center[1][0]),\n                    np.random.uniform(self.center[0][1], self.center[1][1]),\n                )\n            angle = np.sort(angle)\n            if self._rand_range() < 
self.prob:\n                angle = angle[::-1]\n            self.angle_save = angle\n            self.center_save = center\n\n            self._cnt = 0   # avoiding overflow\n\n        angle = self.angle_save[self._cnt]\n        center = self.center_save\n\n        self._cnt += 1\n\n        if center is not None:\n            center = (w * center[0], h * center[1])  # Convert to absolute coordinates\n\n        if angle % 360 == 0:\n            return NoOpTransform()\n\n        return T.RotationTransform(h, w, angle, expand=self.expand, center=center, interp=self.interp)\n\n\nclass ResizeShortestEdge(T.Augmentation):\n    \"\"\"\n    Scale the shorter edge to the given size, with a limit of `max_size` on the longer edge.\n    If `max_size` is reached, then downscale so that the longer edge does not exceed max_size.\n    \"\"\"\n\n    def __init__(\n        self, short_edge_length, max_size=sys.maxsize, sample_style=\"range\", interp=Image.BILINEAR, clip_frame_cnt=1\n    ):\n        \"\"\"\n        Args:\n            short_edge_length (list[int]): If ``sample_style==\"range\"``,\n                a [min, max] interval from which to sample the shortest edge length.\n                If ``sample_style==\"choice\"``, a list of shortest edge lengths to sample from.\n            max_size (int): maximum allowed longest edge length.\n            sample_style (str): either \"range\" or \"choice\".\n        \"\"\"\n        super().__init__()\n        assert sample_style in [\"range\", \"choice\", \"range_by_clip\", \"choice_by_clip\"], sample_style\n\n        self.is_range = (\"range\" in sample_style)\n        if isinstance(short_edge_length, int):\n            short_edge_length = (short_edge_length, short_edge_length)\n        if self.is_range:\n            assert len(short_edge_length) == 2, (\n                \"short_edge_length must be two values using 'range' sample style.\"\n                f\" Got {short_edge_length}!\"\n            )\n        self._cnt = 0\n        
self._init(locals())\n\n    def get_transform(self, image):\n        if self._cnt % self.clip_frame_cnt == 0:\n            if self.is_range:\n                self.size = np.random.randint(self.short_edge_length[0], self.short_edge_length[1] + 1)\n            else:\n                self.size = np.random.choice(self.short_edge_length)\n            if self.size == 0:\n                return NoOpTransform()\n\n            self._cnt = 0   # avoiding overflow\n        self._cnt += 1\n\n        h, w = image.shape[:2]\n\n        scale = self.size * 1.0 / min(h, w)\n        if h < w:\n            newh, neww = self.size, scale * w\n        else:\n            newh, neww = scale * h, self.size\n        if max(newh, neww) > self.max_size:\n            scale = self.max_size * 1.0 / max(newh, neww)\n            newh = newh * scale\n            neww = neww * scale\n        neww = int(neww + 0.5)\n        newh = int(newh + 0.5)\n        return T.ResizeTransform(h, w, newh, neww, self.interp)\n\n\nclass RandomFlip(T.Augmentation):\n    \"\"\"\n    Flip the image horizontally or vertically with the given probability.\n    \"\"\"\n\n    def __init__(self, prob=0.5, *, horizontal=True, vertical=False, clip_frame_cnt=1):\n        \"\"\"\n        Args:\n            prob (float): probability of flip.\n            horizontal (boolean): whether to apply horizontal flipping\n            vertical (boolean): whether to apply vertical flipping\n        \"\"\"\n        super().__init__()\n\n        if horizontal and vertical:\n            raise ValueError(\"Cannot do both horiz and vert. 
Please use two Flip instead.\")\n        if not horizontal and not vertical:\n            raise ValueError(\"At least one of horiz or vert has to be True!\")\n        self._cnt = 0\n\n        self._init(locals())\n\n    def get_transform(self, image):\n        if self._cnt % self.clip_frame_cnt == 0:\n            self.do = self._rand_range() < self.prob\n            self._cnt = 0   # avoiding overflow\n        self._cnt += 1\n\n        h, w = image.shape[:2]\n\n        if self.do:\n            if self.horizontal:\n                return HFlipTransform(w)\n            elif self.vertical:\n                return VFlipTransform(h)\n        else:\n            return NoOpTransform()\n\nclass RandomCropClip(T.Augmentation):\n    \"\"\"\n    Randomly crop a rectangle region out of an image.\n    \"\"\"\n\n    def __init__(self, crop_type: str, crop_size, clip_frame_cnt=1):\n        \"\"\"\n        Args:\n            crop_type (str): one of \"relative_range\", \"relative\", \"absolute\", \"absolute_range\".\n            crop_size (tuple[float, float]): two floats, explained below.\n        - \"relative\": crop a (H * crop_size[0], W * crop_size[1]) region from an input image of\n          size (H, W). 
crop size should be in (0, 1]\n        - \"relative_range\": uniformly sample two values from [crop_size[0], 1]\n          and [crop_size[1], 1], and use them as in \"relative\" crop type.\n        - \"absolute\" crop a (crop_size[0], crop_size[1]) region from input image.\n          crop_size must be smaller than the input image size.\n        - \"absolute_range\", for an input of size (H, W), uniformly sample H_crop in\n          [crop_size[0], min(H, crop_size[1])] and W_crop in [crop_size[0], min(W, crop_size[1])].\n          Then crop a region (H_crop, W_crop).\n        \"\"\"\n        # TODO style of relative_range and absolute_range are not consistent:\n        # one takes (h, w) but another takes (min, max)\n        super().__init__()\n        assert crop_type in [\"relative_range\", \"relative\", \"absolute\", \"absolute_range\"]\n        self._init(locals())\n        self._cnt = 0\n\n    def get_transform(self, image):\n        h, w = image.shape[:2]\n        if self._cnt % self.clip_frame_cnt == 0:\n            croph, cropw = self.get_crop_size((h, w))\n            assert h >= croph and w >= cropw, \"Shape computation in {} has bugs.\".format(self)\n\n            h0 = np.random.randint(h - croph + 1)\n            w0 = np.random.randint(w - cropw + 1)\n\n            h1 = np.random.randint(h0, h - croph + 1)\n            w1 = np.random.randint(w0, w - cropw + 1)\n\n            # interpolate crop origins between (h0, w0) and (h1, w1) across the clip\n            x = np.sort(np.random.rand(self.clip_frame_cnt))\n\n            hs = np.round(h0 * x + h1 * (1 - x)).astype(int)\n            ws = np.round(w0 * x + w1 * (1 - x)).astype(int)\n\n            if self._rand_range() < 0.5:\n                hs = hs[::-1]\n                ws = ws[::-1]\n\n            self.hw_save = (hs, ws)\n            self.crop_h_save, self.crop_w_save = croph, cropw\n            self._cnt = 0   # avoiding overflow\n        _h, _w = self.hw_save[0][self._cnt], 
self.hw_save[1][self._cnt]\n        self._cnt += 1\n\n        return T.CropTransform(_w, _h, self.crop_w_save, self.crop_h_save)\n\n    def get_crop_size(self, image_size):\n        \"\"\"\n        Args:\n            image_size (tuple): height, width\n        Returns:\n            crop_size (tuple): height, width in absolute pixels\n        \"\"\"\n        h, w = image_size\n        if self.crop_type == \"relative\":\n            ch, cw = self.crop_size\n            return int(h * ch + 0.5), int(w * cw + 0.5)\n        elif self.crop_type == \"relative_range\":\n            crop_size = np.asarray(self.crop_size, dtype=np.float32)\n            ch, cw = crop_size + np.random.rand(2) * (1 - crop_size)\n            return int(h * ch + 0.5), int(w * cw + 0.5)\n        elif self.crop_type == \"absolute\":\n            return (min(self.crop_size[0], h), min(self.crop_size[1], w))\n        elif self.crop_type == \"absolute_range\":\n            assert self.crop_size[0] <= self.crop_size[1]\n            ch = np.random.randint(min(h, self.crop_size[0]), min(h, self.crop_size[1]) + 1)\n            cw = np.random.randint(min(w, self.crop_size[0]), min(w, self.crop_size[1]) + 1)\n            return ch, cw\n        else:\n            raise NotImplementedError(\"Unknown crop type {}\".format(self.crop_type))\n\n\nclass FixedSizeCropClip(T.Augmentation):\n    \"\"\"\n    If `crop_size` is smaller than the input image size, then it uses a random crop of\n    the crop size. 
If `crop_size` is larger than the input image size, then it pads\n    the right and the bottom of the image to the crop size if `pad` is True, otherwise\n    it returns the smaller image.\n    \"\"\"\n\n    def __init__(self, crop_size: Tuple[int], pad: bool = True, pad_value: float = 128.0, clip_frame_cnt=1):\n        \"\"\"\n        Args:\n            crop_size: target image (height, width).\n            pad: if True, will pad images smaller than `crop_size` up to `crop_size`\n            pad_value: the padding value.\n        \"\"\"\n        super().__init__()\n        self._init(locals())\n        self._cnt = 0\n\n    def _get_crop(self, image: np.ndarray):\n        # Compute the image scale and scaled size.\n        input_size = image.shape[:2]\n        output_size = self.crop_size\n\n        # Add random crop if the image is scaled up.\n        max_offset = np.subtract(input_size, output_size)\n        max_offset = np.maximum(max_offset, 0)\n\n        if self._cnt % self.clip_frame_cnt == 0:\n            offset = np.multiply(max_offset, np.random.uniform(0.0, 1.0))\n            offset = np.round(offset).astype(int)\n            self.offset_save = offset\n            self._cnt = 0   # avoiding overflow\n        self._cnt += 1\n        offset = self.offset_save\n        return CropTransform(\n            offset[1], offset[0], output_size[1], output_size[0], input_size[1], input_size[0]\n        )\n\n    def _get_pad(self, image: np.ndarray):\n        # Compute the image scale and scaled size.\n        input_size = image.shape[:2]\n        output_size = self.crop_size\n\n        # Add padding if the image is scaled down.\n        pad_size = np.subtract(output_size, input_size)\n        pad_size = np.maximum(pad_size, 0)\n        original_size = np.minimum(input_size, output_size)\n        return PadTransform(\n            0, 0, pad_size[1], pad_size[0], original_size[1], original_size[0], self.pad_value\n        )\n\n    def get_transform(self, image: 
np.ndarray):\n        transforms = [self._get_crop(image)]\n        if self.pad:\n            transforms.append(self._get_pad(image))\n        return TransformList(transforms)\n\nclass ResizeShortestEdgeClip(T.Augmentation):\n    \"\"\"\n    Scale the shorter edge to the given size, with a limit of `max_size` on the longer edge.\n    If `max_size` is reached, then downscale so that the longer edge does not exceed max_size.\n    \"\"\"\n\n    def __init__(\n        self, short_edge_length, max_size=sys.maxsize, sample_style=\"range\", interp=Image.BILINEAR, clip_frame_cnt=1\n    ):\n        \"\"\"\n        Args:\n            short_edge_length (list[int]): If ``sample_style==\"range\"``,\n                a [min, max] interval from which to sample the shortest edge length.\n                If ``sample_style==\"choice\"``, a list of shortest edge lengths to sample from.\n            max_size (int): maximum allowed longest edge length.\n            sample_style (str): either \"range\" or \"choice\".\n        \"\"\"\n        super().__init__()\n        assert sample_style in [\"range\", \"choice\", \"range_by_clip\", \"choice_by_clip\"], sample_style\n\n        self.is_range = (\"range\" in sample_style)\n        if isinstance(short_edge_length, int):\n            short_edge_length = (short_edge_length, short_edge_length)\n        if self.is_range:\n            assert len(short_edge_length) == 2, (\n                \"short_edge_length must be two values using 'range' sample style.\"\n                f\" Got {short_edge_length}!\"\n            )\n        self._cnt = 0\n        self._init(locals())\n\n    def get_transform(self, image):\n        if self._cnt % self.clip_frame_cnt == 0:\n            if self.is_range:\n                self.size = np.random.randint(self.short_edge_length[0], self.short_edge_length[1] + 1)\n            else:\n                self.size = np.random.choice(self.short_edge_length)\n            self._cnt = 0   # avoiding overflow\n\n            if 
self.size == 0:\n                return NoOpTransform()\n        self._cnt += 1\n\n        h, w = image.shape[:2]\n\n        scale = self.size * 1.0 / min(h, w)\n        if h < w:\n            newh, neww = self.size, scale * w\n        else:\n            newh, neww = scale * h, self.size\n        if max(newh, neww) > self.max_size:\n            scale = self.max_size * 1.0 / max(newh, neww)\n            newh = newh * scale\n            neww = neww * scale\n        neww = int(neww + 0.5)\n        newh = int(newh + 0.5)\n        return T.ResizeTransform(h, w, newh, neww, self.interp)\n\n\nclass RandomFlipClip(T.Augmentation):\n    \"\"\"\n    Flip the image horizontally or vertically with the given probability.\n    \"\"\"\n\n    def __init__(self, prob=0.5, *, horizontal=True, vertical=False, clip_frame_cnt=1):\n        \"\"\"\n        Args:\n            prob (float): probability of flip.\n            horizontal (boolean): whether to apply horizontal flipping\n            vertical (boolean): whether to apply vertical flipping\n        \"\"\"\n        super().__init__()\n\n        if horizontal and vertical:\n            raise ValueError(\"Cannot do both horiz and vert. 
Please use two Flip instead.\")\n        if not horizontal and not vertical:\n            raise ValueError(\"At least one of horiz or vert has to be True!\")\n        self._cnt = 0\n\n        self._init(locals())\n\n    def get_transform(self, image):\n        if self._cnt % self.clip_frame_cnt == 0:\n            self.do = self._rand_range() < self.prob\n            self._cnt = 0   # avoiding overflow\n        self._cnt += 1\n\n        h, w = image.shape[:2]\n\n        if self.do:\n            if self.horizontal:\n                return HFlipTransform(w)\n            elif self.vertical:\n                return VFlipTransform(h)\n        else:\n            return NoOpTransform()\n\n\ndef build_augmentation(cfg, is_train):\n    logger = logging.getLogger(__name__)\n    aug_list = []\n    if is_train:\n        # Crop\n        if cfg.INPUT.CROP.ENABLED:\n            aug_list.append(T.RandomCrop(cfg.INPUT.CROP.TYPE, cfg.INPUT.CROP.SIZE))\n\n        # Resize\n        min_size = cfg.INPUT.MIN_SIZE_TRAIN\n        max_size = cfg.INPUT.MAX_SIZE_TRAIN\n        sample_style = cfg.INPUT.MIN_SIZE_TRAIN_SAMPLING\n        ms_clip_frame_cnt = cfg.INPUT.SAMPLING_FRAME_NUM if \"by_clip\" in cfg.INPUT.MIN_SIZE_TRAIN_SAMPLING else 1\n        aug_list.append(ResizeShortestEdge(min_size, max_size, sample_style, clip_frame_cnt=ms_clip_frame_cnt))\n\n        # Flip\n        if cfg.INPUT.RANDOM_FLIP != \"none\":\n            if cfg.INPUT.RANDOM_FLIP == \"flip_by_clip\":\n                flip_clip_frame_cnt = cfg.INPUT.SAMPLING_FRAME_NUM\n            else:\n                flip_clip_frame_cnt = 1\n\n            aug_list.append(\n                # NOTE using RandomFlip modified for the support of flip maintenance\n                RandomFlip(\n                    horizontal=(cfg.INPUT.RANDOM_FLIP == \"horizontal\") or (cfg.INPUT.RANDOM_FLIP == \"flip_by_clip\"),\n                    vertical=cfg.INPUT.RANDOM_FLIP == \"vertical\",\n                    clip_frame_cnt=flip_clip_frame_cnt,\n       
         )\n            )\n\n        # Additional augmentations : brightness, contrast, saturation, rotation\n        augmentations = cfg.INPUT.AUGMENTATIONS\n\n        if \"brightness\" in augmentations:\n            aug_list.append(T.RandomBrightness(0.9, 1.1))\n        if \"contrast\" in augmentations:\n            aug_list.append(T.RandomContrast(0.9, 1.1))\n        if \"saturation\" in augmentations:\n            aug_list.append(T.RandomSaturation(0.9, 1.1))\n        if \"rotation\" in augmentations:\n            aug_list.append(\n                T.RandomRotation(\n                    [-10, 10], expand=False, center=[(0.4, 0.4), (0.6, 0.6)], sample_style=\"range\"\n                )\n            )\n    else:\n        # Resize\n        min_size = cfg.INPUT.MIN_SIZE_TEST\n        max_size = cfg.INPUT.MAX_SIZE_TEST\n        sample_style = \"choice\"\n        aug_list.append(T.ResizeShortestEdge(min_size, max_size, sample_style))\n\n    return aug_list\n\n\ndef build_pseudo_augmentation(cfg, is_train):\n    logger = logging.getLogger(__name__)\n    aug_list = []\n    if is_train:\n        use_lsj = cfg.INPUT.LSJ_AUG.ENABLED\n        if use_lsj:\n            image_size = cfg.INPUT.LSJ_AUG.IMAGE_SIZE\n            min_scale = cfg.INPUT.LSJ_AUG.MIN_SCALE\n            max_scale = cfg.INPUT.LSJ_AUG.MAX_SCALE\n\n            if cfg.INPUT.RANDOM_FLIP != \"none\":\n                if cfg.INPUT.RANDOM_FLIP == \"flip_by_clip\":\n                    clip_frame_cnt = cfg.INPUT.PSEUDO.SAMPLING_FRAME_NUM\n                else:\n                    clip_frame_cnt = 1\n\n                aug_list.append(\n                    # NOTE using RandomFlip modified for the support of flip maintenance\n                    RandomFlipClip(\n                        horizontal=(cfg.INPUT.RANDOM_FLIP == \"horizontal\") or (cfg.INPUT.RANDOM_FLIP == \"flip_by_clip\"),\n                        vertical=cfg.INPUT.RANDOM_FLIP == \"vertical\",\n        
                clip_frame_cnt=clip_frame_cnt,\n                    )\n                )\n\n            # Additional augmentations : brightness, contrast, saturation, rotation\n            augmentations = cfg.INPUT.PSEUDO.AUGMENTATIONS\n            if \"brightness\" in augmentations:\n                aug_list.append(T.RandomBrightness(0.9, 1.1))\n            if \"contrast\" in augmentations:\n                aug_list.append(T.RandomContrast(0.9, 1.1))\n            if \"saturation\" in augmentations:\n                aug_list.append(T.RandomSaturation(0.9, 1.1))\n            if \"rotation\" in augmentations:\n                aug_list.append(\n                    RandomRotationClip(\n                        [-15, 15], expand=False, center=[(0.4, 0.4), (0.6, 0.6)], clip_frame_cnt=clip_frame_cnt,\n                    )\n                )\n\n            aug_list.extend([\n                ResizeScaleClip(\n                    min_scale=min_scale, max_scale=max_scale, target_height=image_size, target_width=image_size,\n                    clip_frame_cnt=clip_frame_cnt,\n                ),\n                FixedSizeCropClip(crop_size=(image_size, image_size), clip_frame_cnt=clip_frame_cnt),\n            ])\n        else:\n            min_size = cfg.INPUT.PSEUDO.MIN_SIZE_TRAIN\n            max_size = cfg.INPUT.PSEUDO.MAX_SIZE_TRAIN\n            sample_style = cfg.INPUT.MIN_SIZE_TRAIN_SAMPLING\n            clip_frame_cnt = cfg.INPUT.PSEUDO.SAMPLING_FRAME_NUM\n\n            # Crop\n            if cfg.INPUT.PSEUDO.CROP.ENABLED:\n                crop_aug = RandomApplyClip(\n                    T.AugmentationList([\n                        ResizeShortestEdgeClip([400, 500, 600], 1333, sample_style, clip_frame_cnt=clip_frame_cnt),\n                        RandomCropClip(cfg.INPUT.PSEUDO.CROP.TYPE, cfg.INPUT.PSEUDO.CROP.SIZE, clip_frame_cnt=clip_frame_cnt)\n                    ]),\n                    clip_frame_cnt=clip_frame_cnt\n                )\n                
aug_list.append(crop_aug)\n\n            # Resize\n            aug_list.append(ResizeShortestEdgeClip(min_size, max_size, sample_style, clip_frame_cnt=clip_frame_cnt))\n\n            # Flip\n            aug_list.append(\n                # NOTE using RandomFlip modified for the support of flip maintenance\n                RandomFlipClip(\n                    horizontal=(cfg.INPUT.RANDOM_FLIP == \"horizontal\") or (cfg.INPUT.RANDOM_FLIP == \"flip_by_clip\"),\n                    vertical=cfg.INPUT.RANDOM_FLIP == \"vertical\",\n                    clip_frame_cnt=clip_frame_cnt,\n                )\n            )\n\n            # Additional augmentations : brightness, contrast, saturation, rotation\n            augmentations = cfg.INPUT.PSEUDO.AUGMENTATIONS\n            if \"brightness\" in augmentations:\n                aug_list.append(T.RandomBrightness(0.9, 1.1))\n            if \"contrast\" in augmentations:\n                aug_list.append(T.RandomContrast(0.9, 1.1))\n            if \"saturation\" in augmentations:\n                aug_list.append(T.RandomSaturation(0.9, 1.1))\n            if \"rotation\" in augmentations:\n                aug_list.append(\n                    RandomRotationClip(\n                        [-15, 15], expand=False, center=[(0.4, 0.4), (0.6, 0.6)], clip_frame_cnt=clip_frame_cnt,\n                    )\n                )\n    else:\n        # Resize\n        min_size = cfg.INPUT.MIN_SIZE_TEST\n        max_size = cfg.INPUT.MAX_SIZE_TEST\n        sample_style = \"choice\"\n        aug_list.append(T.ResizeShortestEdge(min_size, max_size, sample_style))\n\n    return aug_list\n\n\n\n\n"
  },
  {
    "path": "mask2former_video/data_video/build.py",
    "content": "# Modified by Bowen Cheng from https://github.com/sukjunhwang/IFC\n\nimport itertools\nimport logging\nimport torch.utils.data\nfrom typing import Collection, Sequence\n\nfrom detectron2.config import CfgNode, configurable\nfrom detectron2.data.build import (\n    build_batch_data_loader,\n    load_proposals_into_dataset,\n    trivial_batch_collator,\n)\nfrom detectron2.data.catalog import DatasetCatalog\nfrom detectron2.data.common import DatasetFromList, MapDataset\nfrom detectron2.data.dataset_mapper import DatasetMapper\nfrom detectron2.data.samplers import InferenceSampler, TrainingSampler\nfrom detectron2.utils.comm import get_world_size\n\nfrom .combined_loader import CombinedDataLoader, Loader\n\ndef _compute_num_images_per_worker(cfg: CfgNode):\n    num_workers = get_world_size()\n    images_per_batch = cfg.SOLVER.IMS_PER_BATCH\n    assert (\n        images_per_batch % num_workers == 0\n    ), \"SOLVER.IMS_PER_BATCH ({}) must be divisible by the number of workers ({}).\".format(\n        images_per_batch, num_workers\n    )\n    assert (\n        images_per_batch >= num_workers\n    ), \"SOLVER.IMS_PER_BATCH ({}) must be larger than the number of workers ({}).\".format(\n        images_per_batch, num_workers\n    )\n    images_per_worker = images_per_batch // num_workers\n    return images_per_worker\n\n\ndef filter_images_with_only_crowd_annotations(dataset_dicts, dataset_names):\n    \"\"\"\n    Filter out images with none annotations or only crowd annotations\n    (i.e., images without non-crowd annotations).\n    A common training-time preprocessing on COCO dataset.\n\n    Args:\n        dataset_dicts (list[dict]): annotations in Detectron2 Dataset format.\n\n    Returns:\n        list[dict]: the same format, but filtered.\n    \"\"\"\n    num_before = len(dataset_dicts)\n\n    def valid(anns):\n        for ann in anns:\n            if isinstance(ann, list):\n                for instance in ann:\n                    if 
instance.get(\"iscrowd\", 0) == 0:\n                        return True\n            else:\n                if ann.get(\"iscrowd\", 0) == 0:\n                    return True\n        return False\n\n    dataset_dicts = [x for x in dataset_dicts if valid(x[\"annotations\"])]\n    num_after = len(dataset_dicts)\n    logger = logging.getLogger(__name__)\n    logger.info(\n        \"Removed {} images with no usable annotations. {} images left.\".format(\n            num_before - num_after, num_after\n        )\n    )\n    return dataset_dicts\n\n\ndef get_detection_dataset_dicts(\n    dataset_names, filter_empty=True, proposal_files=None\n):\n    \"\"\"\n    Load and prepare dataset dicts for instance detection/segmentation and semantic segmentation.\n\n    Args:\n        dataset_names (str or list[str]): a dataset name or a list of dataset names\n        filter_empty (bool): whether to filter out images without instance annotations\n        proposal_files (list[str]): if given, a list of object proposal files\n            that match each dataset in `dataset_names`.\n\n    Returns:\n        list[dict]: a list of dicts following the standard dataset dict format.\n    \"\"\"\n    if isinstance(dataset_names, str):\n        dataset_names = [dataset_names]\n    assert len(dataset_names)\n    dataset_dicts = [DatasetCatalog.get(dataset_name) for dataset_name in dataset_names]\n    for dataset_name, dicts in zip(dataset_names, dataset_dicts):\n        assert len(dicts), \"Dataset '{}' is empty!\".format(dataset_name)\n\n    if proposal_files is not None:\n        assert len(dataset_names) == len(proposal_files)\n        # load precomputed proposals from proposal files\n        dataset_dicts = [\n            load_proposals_into_dataset(dataset_i_dicts, proposal_file)\n            for dataset_i_dicts, proposal_file in zip(dataset_dicts, proposal_files)\n        ]\n\n    dataset_dicts = list(itertools.chain.from_iterable(dataset_dicts))\n\n    has_instances = \"annotations\" in 
dataset_dicts[0]\n    if filter_empty and has_instances:\n        dataset_dicts = filter_images_with_only_crowd_annotations(dataset_dicts, dataset_names)\n\n    assert len(dataset_dicts), \"No valid data found in {}.\".format(\",\".join(dataset_names))\n    return dataset_dicts\n\n\ndef _train_loader_from_config(cfg, mapper, dataset_name=None, *, dataset=None, sampler=None):\n    if dataset is None:\n        dataset = get_detection_dataset_dicts(\n            dataset_name,\n            filter_empty=cfg.DATALOADER.FILTER_EMPTY_ANNOTATIONS,\n            proposal_files=cfg.DATASETS.PROPOSAL_FILES_TRAIN if cfg.MODEL.LOAD_PROPOSALS else None,\n        )\n\n    if mapper is None:\n        mapper = DatasetMapper(cfg, True)\n\n    if sampler is None:\n        sampler_name = cfg.DATALOADER.SAMPLER_TRAIN\n        logger = logging.getLogger(__name__)\n        logger.info(\"Using training sampler {}\".format(sampler_name))\n        sampler = TrainingSampler(len(dataset))\n\n    return {\n        \"dataset\": dataset,\n        \"sampler\": sampler,\n        \"mapper\": mapper,\n        \"total_batch_size\": cfg.SOLVER.IMS_PER_BATCH,\n        \"aspect_ratio_grouping\": cfg.DATALOADER.ASPECT_RATIO_GROUPING,\n        \"num_workers\": cfg.DATALOADER.NUM_WORKERS,\n    }\n\n\n\n# TODO can allow dataset as an iterable or IterableDataset to make this function more general\n@configurable(from_config=_train_loader_from_config)\ndef build_detection_train_loader(\n    dataset, *, mapper, sampler=None, total_batch_size, aspect_ratio_grouping=True, num_workers=0\n):\n    \"\"\"\n    Build a dataloader for object detection with some default features.\n    This interface is experimental.\n\n    Args:\n        dataset (list or torch.utils.data.Dataset): a list of dataset dicts,\n            or a map-style pytorch dataset. 
They can be obtained by using\n            :func:`DatasetCatalog.get` or :func:`get_detection_dataset_dicts`.\n        mapper (callable): a callable which takes a sample (dict) from dataset and\n            returns the format to be consumed by the model.\n            When using cfg, the default choice is ``DatasetMapper(cfg, is_train=True)``.\n        sampler (torch.utils.data.sampler.Sampler or None): a sampler that\n            produces indices to be applied to ``dataset``.\n            Defaults to :class:`TrainingSampler`, which coordinates a random shuffle\n            sequence across all workers.\n        total_batch_size (int): total batch size across all workers. Batching\n            simply puts data into a list.\n        aspect_ratio_grouping (bool): whether to group images with similar\n            aspect ratios for efficiency. When enabled, it requires each\n            element in the dataset to be a dict with keys \"width\" and \"height\".\n        num_workers (int): number of parallel data loading workers\n\n    Returns:\n        torch.utils.data.DataLoader: a dataloader. 
Each output from it is a\n            ``list[mapped_element]`` of length ``total_batch_size / num_workers``,\n            where ``mapped_element`` is produced by the ``mapper``.\n    \"\"\"\n    if isinstance(dataset, list):\n        dataset = DatasetFromList(dataset, copy=False)\n    if mapper is not None:\n        dataset = MapDataset(dataset, mapper)\n\n    if sampler is None:\n        sampler = TrainingSampler(len(dataset))\n    \n    assert isinstance(sampler, torch.utils.data.sampler.Sampler)\n    return build_batch_data_loader(\n        dataset,\n        sampler,\n        total_batch_size,\n        aspect_ratio_grouping=aspect_ratio_grouping,\n        num_workers=num_workers,\n    )\n\n\ndef build_combined_loader(cfg: CfgNode, loaders: Collection[Loader], ratios: Sequence[float]):\n    images_per_worker = _compute_num_images_per_worker(cfg)\n    return CombinedDataLoader(loaders, images_per_worker, ratios)\n\ndef _test_loader_from_config(cfg, dataset_name, mapper=None):\n    \"\"\"\n    Uses the given `dataset_name` argument (instead of the names in cfg), because the\n    standard practice is to evaluate each test set individually (not combining them).\n    \"\"\"\n    dataset = get_detection_dataset_dicts(\n        [dataset_name],\n        filter_empty=False,\n        proposal_files=[\n            cfg.DATASETS.PROPOSAL_FILES_TEST[list(cfg.DATASETS.TEST).index(dataset_name)]\n        ]\n        if cfg.MODEL.LOAD_PROPOSALS\n        else None,\n    )\n    if mapper is None:\n        mapper = DatasetMapper(cfg, False)\n    return {\"dataset\": dataset, \"mapper\": mapper, \"num_workers\": cfg.DATALOADER.NUM_WORKERS}\n\n\n@configurable(from_config=_test_loader_from_config)\ndef build_detection_test_loader(dataset, *, mapper, num_workers=0):\n    \"\"\"\n    Similar to `build_detection_train_loader`, but uses a batch size of 1.\n    This interface is experimental.\n\n    Args:\n        dataset (list or torch.utils.data.Dataset): a list of dataset dicts,\n         
   or a map-style pytorch dataset. They can be obtained by using\n            :func:`DatasetCatalog.get` or :func:`get_detection_dataset_dicts`.\n        mapper (callable): a callable which takes a sample (dict) from dataset\n           and returns the format to be consumed by the model.\n           When using cfg, the default choice is ``DatasetMapper(cfg, is_train=False)``.\n        num_workers (int): number of parallel data loading workers\n\n    Returns:\n        DataLoader: a torch DataLoader, that loads the given detection\n        dataset, with test-time transformation and batching.\n\n    Examples:\n    ::\n        data_loader = build_detection_test_loader(\n            DatasetRegistry.get(\"my_test\"),\n            mapper=DatasetMapper(...))\n\n        # or, instantiate with a CfgNode:\n        data_loader = build_detection_test_loader(cfg, \"my_test\")\n    \"\"\"\n    if isinstance(dataset, list):\n        dataset = DatasetFromList(dataset, copy=False)\n    if mapper is not None:\n        dataset = MapDataset(dataset, mapper)\n    sampler = InferenceSampler(len(dataset))\n    # Always use 1 image per worker during inference since this is the\n    # standard when reporting inference time in papers.\n    batch_sampler = torch.utils.data.sampler.BatchSampler(sampler, 1, drop_last=False)\n    data_loader = torch.utils.data.DataLoader(\n        dataset,\n        num_workers=num_workers,\n        batch_sampler=batch_sampler,\n        collate_fn=trivial_batch_collator,\n    )\n    return data_loader\n"
  },
  {
    "path": "mask2former_video/data_video/combined_loader.py",
    "content": "import random\nfrom collections import deque\nfrom typing import Any, Collection, Deque, Iterable, Iterator, List, Sequence\n\nLoader = Iterable[Any]\n\n\ndef _pooled_next(iterator: Iterator[Any], pool: Deque[Any]):\n    if not pool:\n        pool.extend(next(iterator))\n    return pool.popleft()\n\n\nclass CombinedDataLoader:\n    \"\"\"\n    Combines data loaders using the provided sampling ratios\n    \"\"\"\n\n    BATCH_COUNT = 100\n\n    def __init__(self, loaders: Collection[Loader], batch_size: int, ratios: Sequence[float]):\n        self.loaders = loaders\n        self.batch_size = batch_size\n        self.ratios = ratios\n\n    def __iter__(self) -> Iterator[List[Any]]:\n        iters = [iter(loader) for loader in self.loaders]\n        indices = []\n        pool = [deque()] * len(iters)\n        # infinite iterator, as in D2\n        while True:\n            if not indices:\n                # just a buffer of indices, its size doesn't matter\n                # as long as it's a multiple of batch_size\n                k = self.batch_size * self.BATCH_COUNT\n                indices = random.choices(range(len(self.loaders)), self.ratios, k=k)\n            try:\n                batch = [_pooled_next(iters[i], pool[i]) for i in indices[: self.batch_size]]\n            except StopIteration:\n                break\n            indices = indices[self.batch_size :]\n            yield batch\n"
  },
  {
    "path": "mask2former_video/data_video/dataset_mapper.py",
    "content": "# Modified by Bowen Cheng from https://github.com/sukjunhwang/IFC\n\nimport copy\nimport logging\nimport random\nimport numpy as np\nfrom typing import List, Union\nimport torch\n\nfrom detectron2.config import configurable\nfrom detectron2.structures import (\n    BitMasks,\n    Boxes,\n    BoxMode,\n    Instances,\n)\n\nfrom detectron2.data import detection_utils as utils\nfrom detectron2.data import transforms as T\nfrom detectron2.data import MetadataCatalog\n\nfrom .augmentation import build_augmentation, build_pseudo_augmentation #build_coco_augmentation\nfrom .datasets.ytvis import COCO_TO_YTVIS_2019, COCO_TO_YTVIS_2021\nimport os\n\nfrom pycocotools import mask as coco_mask\n\n__all__ = [\"YTVISDatasetMapper\", \"CocoClipDatasetMapper\"]\n\ndef seed_everything(seed):\n    random.seed(seed)\n    os.environ['PYTHONHASHSEED'] = str(seed)\n    np.random.seed(seed)\n    torch.manual_seed(seed)\n\ndef filter_empty_instances(instances, by_box=True, by_mask=True, box_threshold=1e-5):\n    \"\"\"\n    Filter out empty instances in an `Instances` object.\n\n    Args:\n        instances (Instances):\n        by_box (bool): whether to filter out instances with empty boxes\n        by_mask (bool): whether to filter out instances with empty masks\n        box_threshold (float): minimum width and height to be considered non-empty\n\n    Returns:\n        Instances: the filtered instances.\n    \"\"\"\n    assert by_box or by_mask\n    r = []\n    if by_box:\n        r.append(instances.gt_boxes.nonempty(threshold=box_threshold))\n    if instances.has(\"gt_masks\") and by_mask:\n        r.append(instances.gt_masks.nonempty())\n\n    if not r:\n        return instances\n    m = r[0]\n    for x in r[1:]:\n        m = m & x\n\n    instances.gt_ids[~m] = -1\n    return instances\n\n\n\ndef _get_dummy_anno():\n    return {\n        \"iscrowd\": 0,\n        \"category_id\": -1,\n        \"id\": -1,\n        \"bbox\": np.array([0, 0, 0, 0]),\n        \"bbox_mode\": 
BoxMode.XYXY_ABS,\n        \"segmentation\": [np.array([0.0] * 6)]\n    }\n\ndef ytvis_annotations_to_instances(annos, image_size):\n    \"\"\"\n    Create an :class:`Instances` object used by the models,\n    from instance annotations in the dataset dict.\n\n    Args:\n        annos (list[dict]): a list of instance annotations in one image, each\n            element for one instance.\n        image_size (tuple): height, width\n\n    Returns:\n        Instances:\n            It will contain fields \"gt_boxes\", \"gt_classes\", \"gt_ids\",\n            \"gt_masks\", if they can be obtained from `annos`.\n            This is the format that builtin models expect.\n    \"\"\"\n    boxes = [BoxMode.convert(obj[\"bbox\"], obj[\"bbox_mode\"], BoxMode.XYXY_ABS) for obj in annos]\n    target = Instances(image_size)\n    target.gt_boxes = Boxes(boxes)\n\n    classes = [int(obj[\"category_id\"]) for obj in annos]\n    classes = torch.tensor(classes, dtype=torch.int64)\n    target.gt_classes = classes\n\n    ids = [int(obj[\"id\"]) for obj in annos]\n    ids = torch.tensor(ids, dtype=torch.int64)\n    target.gt_ids = ids\n\n    if len(annos) and \"segmentation\" in annos[0]:\n        segms = [obj[\"segmentation\"] for obj in annos]\n        masks = []\n        for segm in segms:\n            assert segm.ndim == 2, \"Expect segmentation of 2 dimensions, got {}.\".format(\n                segm.ndim\n            )\n            # mask array\n            masks.append(segm)\n        # torch.from_numpy does not support array with negative stride.\n        masks = BitMasks(\n            torch.stack([torch.from_numpy(np.ascontiguousarray(x)) for x in masks])\n        )\n        target.gt_masks = masks\n\n    return target\n\ndef convert_coco_poly_to_mask(segmentations, height, width):\n    masks = []\n    for polygons in segmentations:\n        rles = coco_mask.frPyObjects(polygons, height, width)\n        mask = coco_mask.decode(rles)\n        if len(mask.shape) < 3:\n            
mask = mask[..., None]\n        mask = torch.as_tensor(mask, dtype=torch.uint8)\n        mask = mask.any(dim=2)\n        masks.append(mask)\n    if masks:\n        masks = torch.stack(masks, dim=0)\n    else:\n        masks = torch.zeros((0, height, width), dtype=torch.uint8)\n    return masks\n\nclass YTVISDatasetMapper:\n    \"\"\"\n    A callable which takes a dataset dict in YouTube-VIS Dataset format,\n    and map it into a format used by the model.\n    \"\"\"\n\n    @configurable\n    def __init__(\n        self,\n        is_train: bool,\n        is_tgt: bool,\n        *,\n        augmentations: List[Union[T.Augmentation, T.Transform]],\n        image_format: str,\n        use_instance_mask: bool = False,\n        sampling_frame_num: int = 2,\n        sampling_frame_range: int = 5,\n        sampling_frame_shuffle: bool = False,\n        num_classes: int = 40,\n        src_dataset_name: str = \"\",\n        tgt_dataset_name: str = \"\",\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            is_train: whether it's used in training or inference\n            augmentations: a list of augmentations or deterministic transforms to apply\n            image_format: an image format supported by :func:`detection_utils.read_image`.\n            use_instance_mask: whether to process instance segmentation annotations, if available\n        \"\"\"\n        # fmt: off\n        self.is_train               = is_train\n        self.is_tgt                 = is_tgt\n        self.augmentations          = T.AugmentationList(augmentations)\n        self.image_format           = image_format\n        self.use_instance_mask      = use_instance_mask\n        self.sampling_frame_num     = sampling_frame_num\n        self.sampling_frame_range   = sampling_frame_range\n        self.sampling_frame_shuffle = sampling_frame_shuffle\n        self.num_classes            = num_classes\n\n        if not is_tgt:\n            self.src_metadata = 
MetadataCatalog.get(src_dataset_name)\n            self.tgt_metadata = MetadataCatalog.get(tgt_dataset_name)\n            print('tgt_dataset_name:', tgt_dataset_name)\n            if tgt_dataset_name.startswith(\"ytvis_2019\"):\n                src2tgt = OVIS_TO_YTVIS_2019\n            elif tgt_dataset_name.startswith(\"ytvis_2021\"):\n                src2tgt = OVIS_TO_YTVIS_2021\n            elif tgt_dataset_name.startswith(\"ovis\"):\n                if src_dataset_name.startswith(\"ytvis_2019\"):\n                    src2tgt = YTVIS_2019_TO_OVIS\n                elif src_dataset_name.startswith(\"ytvis_2021\"):\n                    src2tgt = YTVIS_2021_TO_OVIS\n                else:\n                    raise NotImplementedError\n            else:\n                raise NotImplementedError\n\n            self.src2tgt = {}\n            for k, v in src2tgt.items():\n                self.src2tgt[\n                    self.src_metadata.thing_dataset_id_to_contiguous_id[k]\n                ] = self.tgt_metadata.thing_dataset_id_to_contiguous_id[v]\n\n        # fmt: on\n        logger = logging.getLogger(__name__)\n        mode = \"training\" if is_train else \"inference\"\n        logger.info(f\"[DatasetMapper] Augmentations used in {mode}: {augmentations}\")\n\n    @classmethod\n    def from_config(cls, cfg, is_train: bool = True, is_tgt: bool = True):\n        augs = build_augmentation(cfg, is_train)\n\n        sampling_frame_num = cfg.INPUT.SAMPLING_FRAME_NUM\n        sampling_frame_range = cfg.INPUT.SAMPLING_FRAME_RANGE\n        sampling_frame_shuffle = cfg.INPUT.SAMPLING_FRAME_SHUFFLE\n\n        ret = {\n            \"is_train\": is_train,\n            \"is_tgt\": is_tgt,\n            \"augmentations\": augs,\n            \"image_format\": cfg.INPUT.FORMAT,\n            \"use_instance_mask\": cfg.MODEL.MASK_ON,\n            \"sampling_frame_num\": sampling_frame_num,\n            \"sampling_frame_range\": sampling_frame_range,\n            
\"sampling_frame_shuffle\": sampling_frame_shuffle,\n            \"num_classes\": cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES,\n            \"tgt_dataset_name\": cfg.DATASETS.TRAIN[-1],\n        }\n\n        return ret\n\n    def __call__(self, dataset_dict):\n        \"\"\"\n        Args:\n            dataset_dict (dict): Metadata of one video, in YTVIS Dataset format.\n\n        Returns:\n            dict: a format that builtin models in detectron2 accept\n        \"\"\"\n        # TODO consider examining below deepcopy as it costs huge amount of computations.\n        dataset_dict = copy.deepcopy(dataset_dict)  # it will be modified by code below\n\n        video_length = dataset_dict[\"length\"]\n        if self.is_train:\n            ref_frame = random.randrange(video_length)\n\n            start_idx = max(0, ref_frame-self.sampling_frame_range)\n            end_idx = min(video_length, ref_frame+self.sampling_frame_range + 1)\n\n            selected_idx = np.random.choice(\n                np.array(list(range(start_idx, ref_frame)) + list(range(ref_frame+1, end_idx))),\n                self.sampling_frame_num - 1,\n            )\n            selected_idx = selected_idx.tolist() + [ref_frame]\n            selected_idx = sorted(selected_idx)\n            if self.sampling_frame_shuffle:\n                random.shuffle(selected_idx)\n        else:\n            selected_idx = range(video_length)\n\n        video_annos = dataset_dict.pop(\"annotations\", None)\n        file_names = dataset_dict.pop(\"file_names\", None)\n\n        if self.is_train:\n            _ids = set()\n            for frame_idx in selected_idx:\n                _ids.update([anno[\"id\"] for anno in video_annos[frame_idx]])\n            ids = dict()\n            for i, _id in enumerate(_ids):\n                ids[_id] = i\n\n        dataset_dict[\"video_len\"] = len(video_annos)\n        dataset_dict[\"frame_idx\"] = list(selected_idx)\n        dataset_dict[\"image\"] = []\n        
dataset_dict[\"instances\"] = []\n        dataset_dict[\"file_names\"] = []\n        for frame_idx in selected_idx:\n            dataset_dict[\"file_names\"].append(file_names[frame_idx])\n\n            # Read image\n            image = utils.read_image(file_names[frame_idx], format=self.image_format)\n            utils.check_image_size(dataset_dict, image)\n\n            aug_input = T.AugInput(image)\n            transforms = self.augmentations(aug_input)\n            image = aug_input.image\n\n            image_shape = image.shape[:2]  # h, w\n            # Pytorch's dataloader is efficient on torch.Tensor due to shared-memory,\n            # but not efficient on large generic data structures due to the use of pickle & mp.Queue.\n            # Therefore it's important to use torch.Tensor.\n            dataset_dict[\"image\"].append(torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1))))\n\n            if (video_annos is None) or (not self.is_train):\n                continue\n\n            # NOTE copy() is to prevent annotations getting changed from applying augmentations\n            _frame_annos = []\n            for anno in video_annos[frame_idx]:\n                _anno = {}\n                for k, v in anno.items():\n                    _anno[k] = copy.deepcopy(v)\n                _frame_annos.append(_anno)\n\n            # USER: Implement additional transformations if you have other types of data\n            annos = [\n                utils.transform_instance_annotations(obj, transforms, image_shape)\n                for obj in _frame_annos\n                if obj.get(\"iscrowd\", 0) == 0\n            ]\n            sorted_annos = [_get_dummy_anno() for _ in range(len(ids))]\n\n            for _anno in annos:\n                idx = ids[_anno[\"id\"]]\n                sorted_annos[idx] = _anno\n            _gt_ids = [_anno[\"id\"] for _anno in sorted_annos]\n\n            instances = utils.annotations_to_instances(sorted_annos, image_shape, 
mask_format=\"bitmask\")\n            if not self.is_tgt:\n                instances.gt_classes = torch.tensor(\n                    [self.src2tgt[c] if c in self.src2tgt else -1 for c in instances.gt_classes.tolist()]\n                )\n            instances.gt_ids = torch.tensor(_gt_ids)\n            instances = filter_empty_instances(instances)\n            # if instances.has(\"gt_masks\"):\n            #     instances.gt_boxes = instances.gt_masks.get_bounding_boxes()\n            #     instances = filter_empty_instances(instances)\n            if not instances.has(\"gt_masks\"):\n                instances.gt_masks = BitMasks(torch.empty((0, *image_shape)))\n            dataset_dict[\"instances\"].append(instances)\n\n        return dataset_dict\n\n\nclass CocoClipDatasetMapper:\n    \"\"\"\n    A callable which takes a COCO image which converts into multiple frames,\n    and map it into a format used by the model.\n    \"\"\"\n\n    @configurable\n    def __init__(\n        self,\n        is_train: bool,\n        is_tgt: bool,\n        *,\n        augmentations: List[Union[T.Augmentation, T.Transform]],\n        image_format: str,\n        sampling_frame_num: int = 2,\n        sampling_frame_range: int = 5,\n        src_dataset_name: str = \"\",\n        tgt_dataset_name: str = \"\",\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            is_train: whether it's used in training or inference\n            augmentations: a list of augmentations or deterministic transforms to apply\n            image_format: an image format supported by :func:`detection_utils.read_image`.\n        \"\"\"\n        # fmt: off\n        self.is_train               = is_train\n        self.is_tgt                 = is_tgt\n        self.augmentations          = T.AugmentationList(augmentations)\n        self.image_format           = image_format\n        self.sampling_frame_num     = sampling_frame_num\n        self.sampling_frame_range   = 
sampling_frame_range\n\n        if not is_tgt:\n            self.src_metadata = MetadataCatalog.get(src_dataset_name)\n            self.tgt_metadata = MetadataCatalog.get(tgt_dataset_name)\n            if tgt_dataset_name.startswith(\"ytvis_2019\"):\n                src2tgt = COCO_TO_YTVIS_2019\n            elif tgt_dataset_name.startswith(\"ytvis_2021\"):\n                src2tgt = COCO_TO_YTVIS_2021\n            elif tgt_dataset_name.startswith(\"ovis\"):\n                src2tgt = COCO_TO_OVIS\n            else:\n                raise NotImplementedError\n\n            self.src2tgt = {}\n            for k, v in src2tgt.items():\n                self.src2tgt[\n                    self.src_metadata.thing_dataset_id_to_contiguous_id[k]\n                ] = self.tgt_metadata.thing_dataset_id_to_contiguous_id[v]\n\n        # fmt: on\n        logger = logging.getLogger(__name__)\n        mode = \"training\" if is_train else \"inference\"\n        logger.info(f\"[DatasetMapper] Augmentations used in {mode}: {augmentations}\")\n\n    @classmethod\n    def from_config(cls, cfg, is_train: bool = True, is_tgt: bool = True):\n        if is_tgt:\n            augs = build_augmentation(cfg, is_train)\n        else:\n            # print('come here')\n            augs = build_pseudo_augmentation(cfg, is_train)\n\n        sampling_frame_num = cfg.INPUT.PSEUDO.SAMPLING_FRAME_NUM\n        sampling_frame_range = cfg.INPUT.PSEUDO.SAMPLING_FRAME_RANGE\n\n        ret = {\n            \"is_train\": is_train,\n            \"is_tgt\": is_tgt,\n            \"augmentations\": augs,\n            \"image_format\": cfg.INPUT.FORMAT,\n            \"sampling_frame_num\": sampling_frame_num,\n            \"sampling_frame_range\": sampling_frame_range,\n            \"tgt_dataset_name\": cfg.DATASETS.TRAIN[-1],\n        }\n\n        return ret\n\n    def __call__(self, dataset_dict):\n        \"\"\"\n        Args:\n            dataset_dict (dict): Metadata of one image, in Detectron2 Dataset 
format.\n        Returns:\n            dict: a format that builtin models in detectron2 accept\n        \"\"\"\n        dataset_dict = copy.deepcopy(dataset_dict)  # it will be modified by code below\n\n        img_annos = dataset_dict.pop(\"annotations\", None)\n        file_name = dataset_dict.pop(\"file_name\", None)\n        original_image = utils.read_image(file_name, format=self.image_format)\n\n        if self.is_train:\n            video_length = random.randrange(16, 49)\n            ref_frame = random.randrange(video_length)\n\n            start_idx = max(0, ref_frame-self.sampling_frame_range)\n            end_idx = min(video_length, ref_frame+self.sampling_frame_range + 1)\n\n            selected_idx = np.random.choice(\n                np.array(list(range(start_idx, ref_frame)) + list(range(ref_frame+1, end_idx))),\n                self.sampling_frame_num - 1,\n            )\n            selected_idx = selected_idx.tolist() + [ref_frame]\n            selected_idx = sorted(selected_idx)\n        else:\n            video_length = self.sampling_frame_num\n            selected_idx = list(range(self.sampling_frame_num))\n\n        dataset_dict[\"video_len\"] = video_length\n        dataset_dict[\"frame_idx\"] = selected_idx\n        dataset_dict[\"image\"] = []\n        dataset_dict[\"instances\"] = []\n        dataset_dict[\"file_names\"] = [file_name] * self.sampling_frame_num\n        for _ in range(self.sampling_frame_num):\n            utils.check_image_size(dataset_dict, original_image)\n\n            aug_input = T.AugInput(original_image)\n            transforms = self.augmentations(aug_input)\n            image = aug_input.image\n\n            image_shape = image.shape[:2]  # h, w\n            # Pytorch's dataloader is efficient on torch.Tensor due to shared-memory,\n            # but not efficient on large generic data structures due to the use of pickle & mp.Queue.\n            # Therefore it's important to use torch.Tensor.\n            
dataset_dict[\"image\"].append(torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1))))\n\n            if (img_annos is None) or (not self.is_train):\n                continue\n\n            _img_annos = []\n            for anno in img_annos:\n                _anno = {}\n                for k, v in anno.items():\n                    _anno[k] = copy.deepcopy(v)\n                _img_annos.append(_anno)\n\n            # USER: Implement additional transformations if you have other types of data\n            annos = [\n                utils.transform_instance_annotations(obj, transforms, image_shape)\n                for obj in _img_annos\n                if obj.get(\"iscrowd\", 0) == 0\n            ]\n            _gt_ids = list(range(len(annos)))\n            for idx in range(len(annos)):\n                if len(annos[idx][\"segmentation\"]) == 0:\n                    annos[idx][\"segmentation\"] = [np.array([0.0] * 6)]\n\n            instances = utils.annotations_to_instances(annos, image_shape)\n            if not self.is_tgt:\n                instances.gt_classes = torch.tensor(\n                    [self.src2tgt[c] if c in self.src2tgt else -1 for c in instances.gt_classes.tolist()]\n                )\n            instances.gt_ids = torch.tensor(_gt_ids)\n            # instances.gt_boxes = instances.gt_masks.get_bounding_boxes()  # NOTE we don't need boxes\n            instances = filter_empty_instances(instances)\n            h, w = instances.image_size\n            if hasattr(instances, 'gt_masks'):\n                gt_masks = instances.gt_masks\n                gt_masks = convert_coco_poly_to_mask(gt_masks.polygons, h, w)\n                instances.gt_masks = gt_masks\n            else:\n                instances.gt_masks = torch.zeros((0, h, w), dtype=torch.uint8)\n            dataset_dict[\"instances\"].append(instances)\n\n        return dataset_dict\n"
  },
  {
    "path": "mask2former_video/data_video/datasets/__init__.py",
    "content": "# Modified by Bowen Cheng from https://github.com/sukjunhwang/IFC\n\nfrom . import builtin  # ensure the builtin datasets are registered\n\n__all__ = [k for k in globals().keys() if \"builtin\" not in k and not k.startswith(\"_\")]\n"
  },
  {
    "path": "mask2former_video/data_video/datasets/builtin.py",
    "content": "# Modified by Bowen Cheng from https://github.com/sukjunhwang/IFC\n\nimport os\n\nfrom .ytvis import (\n    register_ytvis_instances,\n    _get_ytvis_2019_instances_meta,\n    _get_ytvis_2021_instances_meta,\n)\n\nfrom detectron2.data.datasets.coco import register_coco_instances\nfrom detectron2.data.datasets.builtin_meta import _get_builtin_metadata\n\n_PREDEFINED_SPLITS_COCO = {}\n_PREDEFINED_SPLITS_COCO[\"coco\"] = {\n    \"coco_2017_train_fake\": (\"coco/train2017\", \"coco/annotations/coco2ytvis2019_train.json\"),\n}\n\n# ==== Predefined splits for YTVIS 2019 ===========\n_PREDEFINED_SPLITS_YTVIS_2019 = {\n    \"ytvis_2019_train\": (\"ytvis_2019/train/JPEGImages\",\n                         \"ytvis_2019/train.json\"), \n    \"ytvis_2019_val\": (\"ytvis_2019/valid/JPEGImages\",\n                       \"ytvis_2019/valid.json\"),\n    \"ytvis_2019_test\": (\"ytvis_2019/test/JPEGImages\",\n                        \"ytvis_2019/test.json\"),\n}\n\n\n# ==== Predefined splits for YTVIS 2021 ===========\n_PREDEFINED_SPLITS_YTVIS_2021 = {\n    \"ytvis_2021_train\": (\"ytvis_2021/train/JPEGImages\",\n                         \"ytvis_2021/train.json\"),\n    \"ytvis_2021_val\": (\"ytvis_2021/valid/JPEGImages\",\n                       \"ytvis_2021/valid.json\"),\n    \"ytvis_2021_test\": (\"ytvis_2021/test/JPEGImages\",\n                        \"ytvis_2021/test.json\"),\n}\n\n\ndef register_all_ytvis_2019(root):\n    for key, (image_root, json_file) in _PREDEFINED_SPLITS_YTVIS_2019.items():\n        # Assume pre-defined datasets live in `./datasets`.\n        register_ytvis_instances(\n            key,\n            _get_ytvis_2019_instances_meta(),\n            os.path.join(root, json_file) if \"://\" not in json_file else json_file,\n            os.path.join(root, image_root),\n        )\n\n\ndef register_all_ytvis_2021(root):\n    for key, (image_root, json_file) in _PREDEFINED_SPLITS_YTVIS_2021.items():\n        # Assume pre-defined datasets live in 
`./datasets`.\n        register_ytvis_instances(\n            key,\n            _get_ytvis_2021_instances_meta(),\n            os.path.join(root, json_file) if \"://\" not in json_file else json_file,\n            os.path.join(root, image_root),\n        )\n\ndef register_all_coco(root):\n    for dataset_name, splits_per_dataset in _PREDEFINED_SPLITS_COCO.items():\n        for key, (image_root, json_file) in splits_per_dataset.items():\n            # Assume pre-defined datasets live in `./datasets`.\n            register_coco_instances(\n                key,\n                _get_builtin_metadata(dataset_name),\n                os.path.join(root, json_file) if \"://\" not in json_file else json_file,\n                os.path.join(root, image_root),\n            )\n\nif __name__.endswith(\".builtin\"):\n    # Assume pre-defined datasets live in `./datasets`.\n    _root = os.getenv(\"DETECTRON2_DATASETS\", \"datasets\")\n    register_all_ytvis_2019(_root)\n    register_all_ytvis_2021(_root)\n    register_all_coco(_root)\n"
  },
  {
    "path": "mask2former_video/data_video/datasets/ytvis.py",
    "content": "# Modified by Bowen Cheng from https://github.com/sukjunhwang/IFC\n\nimport contextlib\nimport io\nimport json\nimport logging\nimport numpy as np\nimport os\nimport pycocotools.mask as mask_util\nfrom fvcore.common.file_io import PathManager\nfrom fvcore.common.timer import Timer\n\nfrom detectron2.structures import Boxes, BoxMode, PolygonMasks\nfrom detectron2.data import DatasetCatalog, MetadataCatalog\n\n\"\"\"\nThis file contains functions to parse YTVIS dataset of\nCOCO-format annotations into dicts in \"Detectron2 format\".\n\"\"\"\n\nlogger = logging.getLogger(__name__)\n\n__all__ = [\"load_ytvis_json\", \"register_ytvis_instances\"]\n\nCOCO_TO_YTVIS_2019 = {\n    1:1, 2:21, 3:6, 4:21, 5:28, 7:17, 8:29, 9:34, 17:14, 18:8, 19:18, 21:15, 22:32, 23:20, 24:30, 25:22, 35:33, 36:33, 41:5, 42:27, 43:40\n}\nCOCO_TO_YTVIS_2021 = {\n    1:26, 2:23, 3:5, 4:23, 5:1, 7:36, 8:37, 9:4, 16:3, 17:6, 18:9, 19:19, 21:7, 22:12, 23:2, 24:40, 25:18, 34:14, 35:31, 36:31, 41:29, 42:33, 43:34\n}\n\nYTVIS_CATEGORIES_2019 = [\n    {\"color\": [220, 20, 60], \"isthing\": 1, \"id\": 1, \"name\": \"person\"},\n    {\"color\": [0, 82, 0], \"isthing\": 1, \"id\": 2, \"name\": \"giant_panda\"},\n    {\"color\": [119, 11, 32], \"isthing\": 1, \"id\": 3, \"name\": \"lizard\"},\n    {\"color\": [165, 42, 42], \"isthing\": 1, \"id\": 4, \"name\": \"parrot\"},\n    {\"color\": [134, 134, 103], \"isthing\": 1, \"id\": 5, \"name\": \"skateboard\"},\n    {\"color\": [0, 0, 142], \"isthing\": 1, \"id\": 6, \"name\": \"sedan\"},\n    {\"color\": [255, 109, 65], \"isthing\": 1, \"id\": 7, \"name\": \"ape\"},\n    {\"color\": [0, 226, 252], \"isthing\": 1, \"id\": 8, \"name\": \"dog\"},\n    {\"color\": [5, 121, 0], \"isthing\": 1, \"id\": 9, \"name\": \"snake\"},\n    {\"color\": [0, 60, 100], \"isthing\": 1, \"id\": 10, \"name\": \"monkey\"},\n    {\"color\": [250, 170, 30], \"isthing\": 1, \"id\": 11, \"name\": \"hand\"},\n    {\"color\": [100, 170, 30], \"isthing\": 1, \"id\": 12, 
\"name\": \"rabbit\"},\n    {\"color\": [179, 0, 194], \"isthing\": 1, \"id\": 13, \"name\": \"duck\"},\n    {\"color\": [255, 77, 255], \"isthing\": 1, \"id\": 14, \"name\": \"cat\"},\n    {\"color\": [120, 166, 157], \"isthing\": 1, \"id\": 15, \"name\": \"cow\"},\n    {\"color\": [73, 77, 174], \"isthing\": 1, \"id\": 16, \"name\": \"fish\"},\n    {\"color\": [0, 80, 100], \"isthing\": 1, \"id\": 17, \"name\": \"train\"},\n    {\"color\": [182, 182, 255], \"isthing\": 1, \"id\": 18, \"name\": \"horse\"},\n    {\"color\": [0, 143, 149], \"isthing\": 1, \"id\": 19, \"name\": \"turtle\"},\n    {\"color\": [174, 57, 255], \"isthing\": 1, \"id\": 20, \"name\": \"bear\"},\n    {\"color\": [0, 0, 230], \"isthing\": 1, \"id\": 21, \"name\": \"motorbike\"},\n    {\"color\": [72, 0, 118], \"isthing\": 1, \"id\": 22, \"name\": \"giraffe\"},\n    {\"color\": [255, 179, 240], \"isthing\": 1, \"id\": 23, \"name\": \"leopard\"},\n    {\"color\": [0, 125, 92], \"isthing\": 1, \"id\": 24, \"name\": \"fox\"},\n    {\"color\": [209, 0, 151], \"isthing\": 1, \"id\": 25, \"name\": \"deer\"},\n    {\"color\": [188, 208, 182], \"isthing\": 1, \"id\": 26, \"name\": \"owl\"},\n    {\"color\": [145, 148, 174], \"isthing\": 1, \"id\": 27, \"name\": \"surfboard\"},\n    {\"color\": [106, 0, 228], \"isthing\": 1, \"id\": 28, \"name\": \"airplane\"},\n    {\"color\": [0, 0, 70], \"isthing\": 1, \"id\": 29, \"name\": \"truck\"},\n    {\"color\": [199, 100, 0], \"isthing\": 1, \"id\": 30, \"name\": \"zebra\"},\n    {\"color\": [166, 196, 102], \"isthing\": 1, \"id\": 31, \"name\": \"tiger\"},\n    {\"color\": [110, 76, 0], \"isthing\": 1, \"id\": 32, \"name\": \"elephant\"},\n    {\"color\": [133, 129, 255], \"isthing\": 1, \"id\": 33, \"name\": \"snowboard\"},\n    {\"color\": [0, 0, 192], \"isthing\": 1, \"id\": 34, \"name\": \"boat\"},\n    {\"color\": [183, 130, 88], \"isthing\": 1, \"id\": 35, \"name\": \"shark\"},\n    {\"color\": [130, 114, 135], \"isthing\": 1, \"id\": 36, \"name\": 
\"mouse\"},\n    {\"color\": [107, 142, 35], \"isthing\": 1, \"id\": 37, \"name\": \"frog\"},\n    {\"color\": [0, 228, 0], \"isthing\": 1, \"id\": 38, \"name\": \"eagle\"},\n    {\"color\": [174, 255, 243], \"isthing\": 1, \"id\": 39, \"name\": \"earless_seal\"},\n    {\"color\": [255, 208, 186], \"isthing\": 1, \"id\": 40, \"name\": \"tennis_racket\"},\n]\n\n\nYTVIS_CATEGORIES_2021 = [\n    {\"color\": [106, 0, 228], \"isthing\": 1, \"id\": 1, \"name\": \"airplane\"},\n    {\"color\": [174, 57, 255], \"isthing\": 1, \"id\": 2, \"name\": \"bear\"},\n    {\"color\": [255, 109, 65], \"isthing\": 1, \"id\": 3, \"name\": \"bird\"},\n    {\"color\": [0, 0, 192], \"isthing\": 1, \"id\": 4, \"name\": \"boat\"},\n    {\"color\": [0, 0, 142], \"isthing\": 1, \"id\": 5, \"name\": \"car\"},\n    {\"color\": [255, 77, 255], \"isthing\": 1, \"id\": 6, \"name\": \"cat\"},\n    {\"color\": [120, 166, 157], \"isthing\": 1, \"id\": 7, \"name\": \"cow\"},\n    {\"color\": [209, 0, 151], \"isthing\": 1, \"id\": 8, \"name\": \"deer\"},\n    {\"color\": [0, 226, 252], \"isthing\": 1, \"id\": 9, \"name\": \"dog\"},\n    {\"color\": [179, 0, 194], \"isthing\": 1, \"id\": 10, \"name\": \"duck\"},\n    {\"color\": [174, 255, 243], \"isthing\": 1, \"id\": 11, \"name\": \"earless_seal\"},\n    {\"color\": [110, 76, 0], \"isthing\": 1, \"id\": 12, \"name\": \"elephant\"},\n    {\"color\": [73, 77, 174], \"isthing\": 1, \"id\": 13, \"name\": \"fish\"},\n    {\"color\": [250, 170, 30], \"isthing\": 1, \"id\": 14, \"name\": \"flying_disc\"},\n    {\"color\": [0, 125, 92], \"isthing\": 1, \"id\": 15, \"name\": \"fox\"},\n    {\"color\": [107, 142, 35], \"isthing\": 1, \"id\": 16, \"name\": \"frog\"},\n    {\"color\": [0, 82, 0], \"isthing\": 1, \"id\": 17, \"name\": \"giant_panda\"},\n    {\"color\": [72, 0, 118], \"isthing\": 1, \"id\": 18, \"name\": \"giraffe\"},\n    {\"color\": [182, 182, 255], \"isthing\": 1, \"id\": 19, \"name\": \"horse\"},\n    {\"color\": [255, 179, 240], \"isthing\": 
1, \"id\": 20, \"name\": \"leopard\"},\n    {\"color\": [119, 11, 32], \"isthing\": 1, \"id\": 21, \"name\": \"lizard\"},\n    {\"color\": [0, 60, 100], \"isthing\": 1, \"id\": 22, \"name\": \"monkey\"},\n    {\"color\": [0, 0, 230], \"isthing\": 1, \"id\": 23, \"name\": \"motorbike\"},\n    {\"color\": [130, 114, 135], \"isthing\": 1, \"id\": 24, \"name\": \"mouse\"},\n    {\"color\": [165, 42, 42], \"isthing\": 1, \"id\": 25, \"name\": \"parrot\"},\n    {\"color\": [220, 20, 60], \"isthing\": 1, \"id\": 26, \"name\": \"person\"},\n    {\"color\": [100, 170, 30], \"isthing\": 1, \"id\": 27, \"name\": \"rabbit\"},\n    {\"color\": [183, 130, 88], \"isthing\": 1, \"id\": 28, \"name\": \"shark\"},\n    {\"color\": [134, 134, 103], \"isthing\": 1, \"id\": 29, \"name\": \"skateboard\"},\n    {\"color\": [5, 121, 0], \"isthing\": 1, \"id\": 30, \"name\": \"snake\"},\n    {\"color\": [133, 129, 255], \"isthing\": 1, \"id\": 31, \"name\": \"snowboard\"},\n    {\"color\": [188, 208, 182], \"isthing\": 1, \"id\": 32, \"name\": \"squirrel\"},\n    {\"color\": [145, 148, 174], \"isthing\": 1, \"id\": 33, \"name\": \"surfboard\"},\n    {\"color\": [255, 208, 186], \"isthing\": 1, \"id\": 34, \"name\": \"tennis_racket\"},\n    {\"color\": [166, 196, 102], \"isthing\": 1, \"id\": 35, \"name\": \"tiger\"},\n    {\"color\": [0, 80, 100], \"isthing\": 1, \"id\": 36, \"name\": \"train\"},\n    {\"color\": [0, 0, 70], \"isthing\": 1, \"id\": 37, \"name\": \"truck\"},\n    {\"color\": [0, 143, 149], \"isthing\": 1, \"id\": 38, \"name\": \"turtle\"},\n    {\"color\": [0, 228, 0], \"isthing\": 1, \"id\": 39, \"name\": \"whale\"},\n    {\"color\": [199, 100, 0], \"isthing\": 1, \"id\": 40, \"name\": \"zebra\"},\n]\n\n\ndef _get_ytvis_2019_instances_meta():\n    thing_ids = [k[\"id\"] for k in YTVIS_CATEGORIES_2019 if k[\"isthing\"] == 1]\n    thing_colors = [k[\"color\"] for k in YTVIS_CATEGORIES_2019 if k[\"isthing\"] == 1]\n    assert len(thing_ids) == 40, len(thing_ids)\n    # Mapping 
from the incontiguous YTVIS category id to an id in [0, 39]\n    thing_dataset_id_to_contiguous_id = {k: i for i, k in enumerate(thing_ids)}\n    thing_classes = [k[\"name\"] for k in YTVIS_CATEGORIES_2019 if k[\"isthing\"] == 1]\n    ret = {\n        \"thing_dataset_id_to_contiguous_id\": thing_dataset_id_to_contiguous_id,\n        \"thing_classes\": thing_classes,\n        \"thing_colors\": thing_colors,\n    }\n    return ret\n\n\ndef _get_ytvis_2021_instances_meta():\n    thing_ids = [k[\"id\"] for k in YTVIS_CATEGORIES_2021 if k[\"isthing\"] == 1]\n    thing_colors = [k[\"color\"] for k in YTVIS_CATEGORIES_2021 if k[\"isthing\"] == 1]\n    assert len(thing_ids) == 40, len(thing_ids)\n    # Mapping from the incontiguous YTVIS category id to an id in [0, 39]\n    thing_dataset_id_to_contiguous_id = {k: i for i, k in enumerate(thing_ids)}\n    thing_classes = [k[\"name\"] for k in YTVIS_CATEGORIES_2021 if k[\"isthing\"] == 1]\n    ret = {\n        \"thing_dataset_id_to_contiguous_id\": thing_dataset_id_to_contiguous_id,\n        \"thing_classes\": thing_classes,\n        \"thing_colors\": thing_colors,\n    }\n    return ret\n\n\ndef load_ytvis_json(json_file, image_root, dataset_name=None, extra_annotation_keys=None):\n    from .ytvis_api.ytvos import YTVOS\n\n    timer = Timer()\n    json_file = PathManager.get_local_path(json_file)\n    with contextlib.redirect_stdout(io.StringIO()):\n        ytvis_api = YTVOS(json_file)\n    if timer.seconds() > 1:\n        logger.info(\"Loading {} takes {:.2f} seconds.\".format(json_file, timer.seconds()))\n\n    id_map = None\n    if dataset_name is not None:\n        meta = MetadataCatalog.get(dataset_name)\n        cat_ids = sorted(ytvis_api.getCatIds())\n        cats = ytvis_api.loadCats(cat_ids)\n        # The categories in a custom json file may not be sorted.\n        thing_classes = [c[\"name\"] for c in sorted(cats, key=lambda x: x[\"id\"])]\n        meta.thing_classes = thing_classes\n\n        # In COCO, certain 
category ids are artificially removed,\n        # and by convention they are always ignored.\n        # We deal with COCO's id issue and translate\n        # the category ids to contiguous ids in [0, 80).\n\n        # It works by looking at the \"categories\" field in the json, therefore\n        # if users' own json also have incontiguous ids, we'll\n        # apply this mapping as well but print a warning.\n        if not (min(cat_ids) == 1 and max(cat_ids) == len(cat_ids)):\n            if \"coco\" not in dataset_name:\n                logger.warning(\n                    \"\"\"\nCategory ids in annotations are not in [1, #categories]! We'll apply a mapping for you.\n\"\"\"\n                )\n        id_map = {v: i for i, v in enumerate(cat_ids)}\n        meta.thing_dataset_id_to_contiguous_id = id_map\n\n    # sort indices for reproducible results\n    vid_ids = sorted(ytvis_api.vids.keys())\n    # vids is a list of dicts, each looks something like:\n    # {'license': 1,\n    #  'flickr_url': ' ',\n    #  'file_names': ['ff25f55852/00000.jpg', 'ff25f55852/00005.jpg', ..., 'ff25f55852/00175.jpg'],\n    #  'height': 720,\n    #  'width': 1280,\n    #  'length': 36,\n    #  'date_captured': '2019-04-11 00:55:41.903902',\n    #  'id': 2232}\n    vids = ytvis_api.loadVids(vid_ids)\n\n    anns = [ytvis_api.vidToAnns[vid_id] for vid_id in vid_ids]\n    total_num_valid_anns = sum([len(x) for x in anns])\n    total_num_anns = len(ytvis_api.anns)\n    if total_num_valid_anns < total_num_anns:\n        logger.warning(\n            f\"{json_file} contains {total_num_anns} annotations, but only \"\n            f\"{total_num_valid_anns} of them match to images in the file.\"\n        )\n\n    vids_anns = list(zip(vids, anns))\n    logger.info(\"Loaded {} videos in YTVIS format from {}\".format(len(vids_anns), json_file))\n\n    dataset_dicts = []\n\n    ann_keys = [\"iscrowd\", \"category_id\", \"id\"] + (extra_annotation_keys or [])\n\n    
num_instances_without_valid_segmentation = 0\n\n    for (vid_dict, anno_dict_list) in vids_anns:\n        record = {}\n        record[\"file_names\"] = [os.path.join(image_root, vid_dict[\"file_names\"][i]) for i in range(vid_dict[\"length\"])]\n        record[\"height\"] = vid_dict[\"height\"]\n        record[\"width\"] = vid_dict[\"width\"]\n        record[\"length\"] = vid_dict[\"length\"]\n        video_id = record[\"video_id\"] = vid_dict[\"id\"]\n\n        video_objs = []\n        for frame_idx in range(record[\"length\"]):\n            frame_objs = []\n            for anno in anno_dict_list:\n                assert anno[\"video_id\"] == video_id\n\n                obj = {key: anno[key] for key in ann_keys if key in anno}\n\n                _bboxes = anno.get(\"bboxes\", None)\n                _segm = anno.get(\"segmentations\", None)\n\n                if not (_bboxes and _segm and _bboxes[frame_idx] and _segm[frame_idx]):\n                    continue\n\n                bbox = _bboxes[frame_idx]\n                segm = _segm[frame_idx]\n\n                obj[\"bbox\"] = bbox\n                obj[\"bbox_mode\"] = BoxMode.XYWH_ABS\n\n                if isinstance(segm, dict):\n                    if isinstance(segm[\"counts\"], list):\n                        # convert to compressed RLE\n                        segm = mask_util.frPyObjects(segm, *segm[\"size\"])\n                elif segm:\n                    # filter out invalid polygons (< 3 points)\n                    segm = [poly for poly in segm if len(poly) % 2 == 0 and len(poly) >= 6]\n                    if len(segm) == 0:\n                        num_instances_without_valid_segmentation += 1\n                        continue  # ignore this instance\n                obj[\"segmentation\"] = segm\n\n                if id_map:\n                    obj[\"category_id\"] = id_map[obj[\"category_id\"]]\n                frame_objs.append(obj)\n            video_objs.append(frame_objs)\n        
record[\"annotations\"] = video_objs\n        dataset_dicts.append(record)\n\n    if num_instances_without_valid_segmentation > 0:\n        logger.warning(\n            \"Filtered out {} instances without valid segmentation. \".format(\n                num_instances_without_valid_segmentation\n            )\n            + \"There might be issues in your dataset generation process. \"\n            \"A valid polygon should be a list[float] with even length >= 6.\"\n        )\n    return dataset_dicts\n\n\ndef register_ytvis_instances(name, metadata, json_file, image_root):\n    \"\"\"\n    Register a dataset in YTVIS's json annotation format for\n    instance tracking.\n\n    Args:\n        name (str): the name that identifies a dataset, e.g. \"ytvis_train\".\n        metadata (dict): extra metadata associated with this dataset.  You can\n            leave it as an empty dict.\n        json_file (str): path to the json instance annotation file.\n        image_root (str or path-like): directory which contains all the images.\n    \"\"\"\n    assert isinstance(name, str), name\n    assert isinstance(json_file, (str, os.PathLike)), json_file\n    assert isinstance(image_root, (str, os.PathLike)), image_root\n    # 1. register a function which returns dicts\n    DatasetCatalog.register(name, lambda: load_ytvis_json(json_file, image_root, name))\n\n    # 2. 
Optionally, add metadata about this dataset,\n    # since they might be useful in evaluation, visualization or logging\n    MetadataCatalog.get(name).set(\n        json_file=json_file, image_root=image_root, evaluator_type=\"ytvis\", **metadata\n    )\n\n\nif __name__ == \"__main__\":\n    \"\"\"\n    Test the YTVIS json dataset loader.\n    \"\"\"\n    from detectron2.utils.logger import setup_logger\n    from detectron2.utils.visualizer import Visualizer\n    import detectron2.data.datasets  # noqa # add pre-defined metadata\n    import sys\n    from PIL import Image\n\n    logger = setup_logger(name=__name__)\n    #assert sys.argv[3] in DatasetCatalog.list()\n    meta = MetadataCatalog.get(\"ytvis_2019_train\")\n\n    json_file = \"./datasets/ytvis/instances_train_sub.json\"\n    image_root = \"./datasets/ytvis/train/JPEGImages\"\n    dicts = load_ytvis_json(json_file, image_root, dataset_name=\"ytvis_2019_train\")\n    logger.info(\"Done loading {} samples.\".format(len(dicts)))\n\n    dirname = \"ytvis-data-vis\"\n    os.makedirs(dirname, exist_ok=True)\n\n    def extract_frame_dic(dic, frame_idx):\n        import copy\n        frame_dic = copy.deepcopy(dic)\n        annos = frame_dic.get(\"annotations\", None)\n        if annos:\n            frame_dic[\"annotations\"] = annos[frame_idx]\n\n        return frame_dic\n\n    for d in dicts:\n        vid_name = d[\"file_names\"][0].split('/')[-2]\n        os.makedirs(os.path.join(dirname, vid_name), exist_ok=True)\n        for idx, file_name in enumerate(d[\"file_names\"]):\n            img = np.array(Image.open(file_name))\n            visualizer = Visualizer(img, metadata=meta)\n            vis = visualizer.draw_dataset_dict(extract_frame_dic(d, idx))\n            fpath = os.path.join(dirname, vid_name, file_name.split('/')[-1])\n            vis.save(fpath)\n"
  },
  {
    "path": "mask2former_video/data_video/datasets/ytvis_api/__init__.py",
    "content": "# Modified by Bowen Cheng from https://github.com/youtubevos/cocoapi\n"
  },
  {
    "path": "mask2former_video/data_video/datasets/ytvis_api/ytvos.py",
    "content": "# Modified by Bowen Cheng from https://github.com/youtubevos/cocoapi\n\n__author__ = 'ychfan'\n# Interface for accessing the YouTubeVIS dataset.\n\n# The following API functions are defined:\n#  YTVOS       - YTVOS api class that loads YouTubeVIS annotation file and prepares data structures.\n#  decodeMask - Decode binary mask M encoded via run-length encoding.\n#  encodeMask - Encode binary mask M using run-length encoding.\n#  getAnnIds  - Get ann ids that satisfy given filter conditions.\n#  getCatIds  - Get cat ids that satisfy given filter conditions.\n#  getVidIds  - Get vid ids that satisfy given filter conditions.\n#  loadAnns   - Load anns with the specified ids.\n#  loadCats   - Load cats with the specified ids.\n#  loadVids   - Load vids with the specified ids.\n#  annToMask  - Convert segmentation in an annotation to binary mask.\n#  loadRes    - Load algorithm results and create API for accessing them.\n\n# Microsoft COCO Toolbox.      version 2.0\n# Data, paper, and tutorials available at:  http://mscoco.org/\n# Code written by Piotr Dollar and Tsung-Yi Lin, 2014.\n# Licensed under the Simplified BSD License [see bsd.txt]\n\nimport json\nimport time\nimport matplotlib.pyplot as plt\nfrom matplotlib.collections import PatchCollection\nfrom matplotlib.patches import Polygon\nimport numpy as np\nimport copy\nimport itertools\nfrom pycocotools import mask as maskUtils\nimport os\nfrom collections import defaultdict\nimport sys\nPYTHON_VERSION = sys.version_info[0]\nif PYTHON_VERSION == 2:\n    from urllib import urlretrieve\nelif PYTHON_VERSION == 3:\n    from urllib.request import urlretrieve\n\n\ndef _isArrayLike(obj):\n    return hasattr(obj, '__iter__') and hasattr(obj, '__len__')\n\n\nclass YTVOS:\n    def __init__(self, annotation_file=None):\n        \"\"\"\n        Constructor of Microsoft COCO helper class for reading and visualizing annotations.\n        :param annotation_file (str): location of annotation file\n        :param 
image_folder (str): location to the folder that hosts images.\n        :return:\n        \"\"\"\n        # load dataset\n        self.dataset,self.anns,self.cats,self.vids = dict(),dict(),dict(),dict()\n        self.vidToAnns, self.catToVids = defaultdict(list), defaultdict(list)\n        if not annotation_file == None:\n            print('loading annotations into memory...')\n            tic = time.time()\n            dataset = json.load(open(annotation_file, 'r'))\n            assert type(dataset)==dict, 'annotation file format {} not supported'.format(type(dataset))\n            print('Done (t={:0.2f}s)'.format(time.time()- tic))\n            self.dataset = dataset\n            self.createIndex()\n\n    def createIndex(self):\n        # create index\n        print('creating index...')\n        anns, cats, vids = {}, {}, {}\n        vidToAnns,catToVids = defaultdict(list),defaultdict(list)\n        if 'annotations' in self.dataset:\n            for ann in self.dataset['annotations']:\n                vidToAnns[ann['video_id']].append(ann)\n                anns[ann['id']] = ann\n\n        if 'videos' in self.dataset:\n            for vid in self.dataset['videos']:\n                vids[vid['id']] = vid\n\n        if 'categories' in self.dataset:\n            for cat in self.dataset['categories']:\n                cats[cat['id']] = cat\n\n        if 'annotations' in self.dataset and 'categories' in self.dataset:\n            for ann in self.dataset['annotations']:\n                catToVids[ann['category_id']].append(ann['video_id'])\n\n        print('index created!')\n\n        # create class members\n        self.anns = anns\n        self.vidToAnns = vidToAnns\n        self.catToVids = catToVids\n        self.vids = vids\n        self.cats = cats\n\n    def info(self):\n        \"\"\"\n        Print information about the annotation file.\n        :return:\n        \"\"\"\n        for key, value in self.dataset['info'].items():\n            print('{}: 
{}'.format(key, value))\n\n    def getAnnIds(self, vidIds=[], catIds=[], areaRng=[], iscrowd=None):\n        \"\"\"\n        Get ann ids that satisfy given filter conditions. default skips that filter\n        :param vidIds  (int array)     : get anns for given vids\n               catIds  (int array)     : get anns for given cats\n               areaRng (float array)   : get anns for given area range (e.g. [0 inf])\n               iscrowd (boolean)       : get anns for given crowd label (False or True)\n        :return: ids (int array)       : integer array of ann ids\n        \"\"\"\n        vidIds = vidIds if _isArrayLike(vidIds) else [vidIds]\n        catIds = catIds if _isArrayLike(catIds) else [catIds]\n\n        if len(vidIds) == len(catIds) == len(areaRng) == 0:\n            anns = self.dataset['annotations']\n        else:\n            if not len(vidIds) == 0:\n                lists = [self.vidToAnns[vidId] for vidId in vidIds if vidId in self.vidToAnns]\n                anns = list(itertools.chain.from_iterable(lists))\n            else:\n                anns = self.dataset['annotations']\n            anns = anns if len(catIds)  == 0 else [ann for ann in anns if ann['category_id'] in catIds]\n            anns = anns if len(areaRng) == 0 else [ann for ann in anns if ann['avg_area'] > areaRng[0] and ann['avg_area'] < areaRng[1]]\n        if not iscrowd == None:\n            ids = [ann['id'] for ann in anns if ann['iscrowd'] == iscrowd]\n        else:\n            ids = [ann['id'] for ann in anns]\n        return ids\n\n    def getCatIds(self, catNms=[], supNms=[], catIds=[]):\n        \"\"\"\n        filtering parameters. 
default skips that filter.\n        :param catNms (str array)  : get cats for given cat names\n        :param supNms (str array)  : get cats for given supercategory names\n        :param catIds (int array)  : get cats for given cat ids\n        :return: ids (int array)   : integer array of cat ids\n        \"\"\"\n        catNms = catNms if _isArrayLike(catNms) else [catNms]\n        supNms = supNms if _isArrayLike(supNms) else [supNms]\n        catIds = catIds if _isArrayLike(catIds) else [catIds]\n\n        if len(catNms) == len(supNms) == len(catIds) == 0:\n            cats = self.dataset['categories']\n        else:\n            cats = self.dataset['categories']\n            cats = cats if len(catNms) == 0 else [cat for cat in cats if cat['name']          in catNms]\n            cats = cats if len(supNms) == 0 else [cat for cat in cats if cat['supercategory'] in supNms]\n            cats = cats if len(catIds) == 0 else [cat for cat in cats if cat['id']            in catIds]\n        ids = [cat['id'] for cat in cats]\n        return ids\n\n    def getVidIds(self, vidIds=[], catIds=[]):\n        '''\n        Get vid ids that satisfy given filter conditions.\n        :param vidIds (int array) : get vids for given ids\n        :param catIds (int array) : get vids with all given cats\n        :return: ids (int array)  : integer array of vid ids\n        '''\n        vidIds = vidIds if _isArrayLike(vidIds) else [vidIds]\n        catIds = catIds if _isArrayLike(catIds) else [catIds]\n\n        if len(vidIds) == len(catIds) == 0:\n            ids = self.vids.keys()\n        else:\n            ids = set(vidIds)\n            for i, catId in enumerate(catIds):\n                if i == 0 and len(ids) == 0:\n                    ids = set(self.catToVids[catId])\n                else:\n                    ids &= set(self.catToVids[catId])\n        return list(ids)\n\n    def loadAnns(self, ids=[]):\n        \"\"\"\n        Load anns with the specified ids.\n        :param ids 
(int array)       : integer ids specifying anns\n        :return: anns (object array) : loaded ann objects\n        \"\"\"\n        if _isArrayLike(ids):\n            return [self.anns[id] for id in ids]\n        elif type(ids) == int:\n            return [self.anns[ids]]\n\n    def loadCats(self, ids=[]):\n        \"\"\"\n        Load cats with the specified ids.\n        :param ids (int array)       : integer ids specifying cats\n        :return: cats (object array) : loaded cat objects\n        \"\"\"\n        if _isArrayLike(ids):\n            return [self.cats[id] for id in ids]\n        elif type(ids) == int:\n            return [self.cats[ids]]\n\n    def loadVids(self, ids=[]):\n        \"\"\"\n        Load vids with the specified ids.\n        :param ids (int array)       : integer ids specifying vids\n        :return: vids (object array) : loaded vid objects\n        \"\"\"\n        if _isArrayLike(ids):\n            return [self.vids[id] for id in ids]\n        elif type(ids) == int:\n            return [self.vids[ids]]\n\n\n    def loadRes(self, resFile):\n        \"\"\"\n        Load result file and return a result api object.\n        :param   resFile (str)     : file name of result file\n        :return: res (obj)         : result api object\n        \"\"\"\n        res = YTVOS()\n        res.dataset['videos'] = [img for img in self.dataset['videos']]\n\n        print('Loading and preparing results...')\n        tic = time.time()\n        if type(resFile) == str or (PYTHON_VERSION == 2 and type(resFile) == unicode):\n            anns = json.load(open(resFile))\n        elif type(resFile) == np.ndarray:\n            anns = self.loadNumpyAnnotations(resFile)\n        else:\n            anns = resFile\n        assert type(anns) == list, 'results is not an array of objects'\n        annsVidIds = [ann['video_id'] for ann in anns]\n        assert set(annsVidIds) == (set(annsVidIds) & set(self.getVidIds())), \\\n               'Results do not correspond to 
current coco set'\n        if 'segmentations' in anns[0]:\n            res.dataset['categories'] = copy.deepcopy(self.dataset['categories'])\n            for id, ann in enumerate(anns):\n                ann['areas'] = []\n                if not 'bboxes' in ann:\n                    ann['bboxes'] = []\n                for seg in ann['segmentations']:\n                    # now only support compressed RLE format as segmentation results\n                    if seg:\n                        ann['areas'].append(maskUtils.area(seg))\n                        if len(ann['bboxes']) < len(ann['areas']):\n                            ann['bboxes'].append(maskUtils.toBbox(seg))\n                    else:\n                        ann['areas'].append(None)\n                        if len(ann['bboxes']) < len(ann['areas']):\n                            ann['bboxes'].append(None)\n                ann['id'] = id+1\n                l = [a for a in ann['areas'] if a]\n                if len(l)==0:\n                  ann['avg_area'] = 0\n                else:\n                  ann['avg_area'] = np.array(l).mean() \n                ann['iscrowd'] = 0\n        print('DONE (t={:0.2f}s)'.format(time.time()- tic))\n\n        res.dataset['annotations'] = anns\n        res.createIndex()\n        return res\n\n    def annToRLE(self, ann, frameId):\n        \"\"\"\n        Convert annotation which can be polygons, uncompressed RLE to RLE.\n        :return: binary mask (numpy 2D array)\n        \"\"\"\n        t = self.vids[ann['video_id']]\n        h, w = t['height'], t['width']\n        segm = ann['segmentations'][frameId]\n        if type(segm) == list:\n            # polygon -- a single object might consist of multiple parts\n            # we merge all parts into one mask rle code\n            rles = maskUtils.frPyObjects(segm, h, w)\n            rle = maskUtils.merge(rles)\n        elif type(segm['counts']) == list:\n            # uncompressed RLE\n            rle = 
maskUtils.frPyObjects(segm, h, w)\n        else:\n            # rle\n            rle = segm\n        return rle\n\n    def annToMask(self, ann, frameId):\n        \"\"\"\n        Convert annotation which can be polygons, uncompressed RLE, or RLE to binary mask.\n        :return: binary mask (numpy 2D array)\n        \"\"\"\n        rle = self.annToRLE(ann, frameId)\n        m = maskUtils.decode(rle)\n        return m\n"
  },
  {
    "path": "mask2former_video/data_video/datasets/ytvis_api/ytvoseval.py",
    "content": "# Modified by Bowen Cheng from https://github.com/youtubevos/cocoapi\n\n__author__ = 'ychfan'\n\nimport numpy as np\nimport datetime\nimport time\nfrom collections import defaultdict\nfrom pycocotools import mask as maskUtils\nimport copy\n\nclass YTVOSeval:\n    # Interface for evaluating video instance segmentation on the YouTubeVIS dataset.\n    #\n    # The usage for YTVOSeval is as follows:\n    #  cocoGt=..., cocoDt=...       # load dataset and results\n    #  E = YTVOSeval(cocoGt,cocoDt); # initialize YTVOSeval object\n    #  E.params.recThrs = ...;      # set parameters as desired\n    #  E.evaluate();                # run per image evaluation\n    #  E.accumulate();              # accumulate per image results\n    #  E.summarize();               # display summary metrics of results\n    # For example usage see evalDemo.m and http://mscoco.org/.\n    #\n    # The evaluation parameters are as follows (defaults in brackets):\n    #  imgIds     - [all] N img ids to use for evaluation\n    #  catIds     - [all] K cat ids to use for evaluation\n    #  iouThrs    - [.5:.05:.95] T=10 IoU thresholds for evaluation\n    #  recThrs    - [0:.01:1] R=101 recall thresholds for evaluation\n    #  areaRng    - [...] 
A=4 object area ranges for evaluation\n    #  maxDets    - [1 10 100] M=3 thresholds on max detections per image\n    #  iouType    - ['segm'] set iouType to 'segm', 'bbox' or 'keypoints'\n    #  iouType replaced the now DEPRECATED useSegm parameter.\n    #  useCats    - [1] if true use category labels for evaluation\n    # Note: if useCats=0 category labels are ignored as in proposal scoring.\n    # Note: multiple areaRngs [Ax2] and maxDets [Mx1] can be specified.\n    #\n    # evaluate(): evaluates detections on every image and every category and\n    # concats the results into the \"evalImgs\" with fields:\n    #  dtIds      - [1xD] id for each of the D detections (dt)\n    #  gtIds      - [1xG] id for each of the G ground truths (gt)\n    #  dtMatches  - [TxD] matching gt id at each IoU or 0\n    #  gtMatches  - [TxG] matching dt id at each IoU or 0\n    #  dtScores   - [1xD] confidence of each dt\n    #  gtIgnore   - [1xG] ignore flag for each gt\n    #  dtIgnore   - [TxD] ignore flag for each dt at each IoU\n    #\n    # accumulate(): accumulates the per-image, per-category evaluation\n    # results in \"evalImgs\" into the dictionary \"eval\" with fields:\n    #  params     - parameters used for evaluation\n    #  date       - date evaluation was performed\n    #  counts     - [T,R,K,A,M] parameter dimensions (see above)\n    #  precision  - [TxRxKxAxM] precision for every evaluation setting\n    #  recall     - [TxKxAxM] max recall for every evaluation setting\n    # Note: precision and recall==-1 for settings with no gt objects.\n    #\n    # See also coco, mask, pycocoDemo, pycocoEvalDemo\n    #\n    # Microsoft COCO Toolbox.      
version 2.0\n    # Data, paper, and tutorials available at:  http://mscoco.org/\n    # Code written by Piotr Dollar and Tsung-Yi Lin, 2015.\n    # Licensed under the Simplified BSD License [see coco/license.txt]\n    def __init__(self, cocoGt=None, cocoDt=None, iouType='segm'):\n        '''\n        Initialize CocoEval using coco APIs for gt and dt\n        :param cocoGt: coco object with ground truth annotations\n        :param cocoDt: coco object with detection results\n        :return: None\n        '''\n        if not iouType:\n            print('iouType not specified. use default iouType segm')\n        self.cocoGt   = cocoGt              # ground truth COCO API\n        self.cocoDt   = cocoDt              # detections COCO API\n        self.params   = {}                  # evaluation parameters\n        self.evalVids = defaultdict(list)   # per-image per-category evaluation results [KxAxI] elements\n        self.eval     = {}                  # accumulated evaluation results\n        self._gts = defaultdict(list)       # gt for evaluation\n        self._dts = defaultdict(list)       # dt for evaluation\n        self.params = Params(iouType=iouType) # parameters\n        self._paramsEval = {}               # parameters for evaluation\n        self.stats = []                     # result summarization\n        self.ious = {}                      # ious between all gts and dts\n        if not cocoGt is None:\n            self.params.vidIds = sorted(cocoGt.getVidIds())\n            self.params.catIds = sorted(cocoGt.getCatIds())\n\n\n    def _prepare(self):\n        '''\n        Prepare ._gts and ._dts for evaluation based on params\n        :return: None\n        '''\n        def _toMask(anns, coco):\n            # modify ann['segmentation'] by reference\n            for ann in anns:\n                for i, a in enumerate(ann['segmentations']):\n                    if a:\n                        rle = coco.annToRLE(ann, i)\n                        
ann['segmentations'][i] = rle\n                l = [a for a in ann['areas'] if a]\n                if len(l)==0:\n                  ann['avg_area'] = 0\n                else:\n                  ann['avg_area'] = np.array(l).mean() \n        p = self.params\n        if p.useCats:\n            gts=self.cocoGt.loadAnns(self.cocoGt.getAnnIds(vidIds=p.vidIds, catIds=p.catIds))\n            dts=self.cocoDt.loadAnns(self.cocoDt.getAnnIds(vidIds=p.vidIds, catIds=p.catIds))\n        else:\n            gts=self.cocoGt.loadAnns(self.cocoGt.getAnnIds(vidIds=p.vidIds))\n            dts=self.cocoDt.loadAnns(self.cocoDt.getAnnIds(vidIds=p.vidIds))\n\n        # convert ground truth to mask if iouType == 'segm'\n        if p.iouType == 'segm':\n            _toMask(gts, self.cocoGt)\n            _toMask(dts, self.cocoDt)\n        # set ignore flag\n        for gt in gts:\n            gt['ignore'] = gt['ignore'] if 'ignore' in gt else 0\n            gt['ignore'] = 'iscrowd' in gt and gt['iscrowd']\n            if p.iouType == 'keypoints':\n                gt['ignore'] = (gt['num_keypoints'] == 0) or gt['ignore']\n        self._gts = defaultdict(list)       # gt for evaluation\n        self._dts = defaultdict(list)       # dt for evaluation\n        for gt in gts:\n            self._gts[gt['video_id'], gt['category_id']].append(gt)\n        for dt in dts:\n            self._dts[dt['video_id'], dt['category_id']].append(dt)\n        self.evalVids = defaultdict(list)   # per-image per-category evaluation results\n        self.eval     = {}                  # accumulated evaluation results\n\n    def evaluate(self):\n        '''\n        Run per image evaluation on given images and store results (a list of dict) in self.evalVids\n        :return: None\n        '''\n        tic = time.time()\n        print('Running per image evaluation...')\n        p = self.params\n        # add backward compatibility if useSegm is specified in params\n        if not p.useSegm is None:\n            
p.iouType = 'segm' if p.useSegm == 1 else 'bbox'\n            print('useSegm (deprecated) is not None. Running {} evaluation'.format(p.iouType))\n        print('Evaluate annotation type *{}*'.format(p.iouType))\n        p.vidIds = list(np.unique(p.vidIds))\n        if p.useCats:\n            p.catIds = list(np.unique(p.catIds))\n        p.maxDets = sorted(p.maxDets)\n        self.params=p\n\n        self._prepare()\n        # loop through images, area range, max detection number\n        catIds = p.catIds if p.useCats else [-1]\n\n        if p.iouType == 'segm' or p.iouType == 'bbox':\n            computeIoU = self.computeIoU\n        elif p.iouType == 'keypoints':\n            computeIoU = self.computeOks\n        self.ious = {(vidId, catId): computeIoU(vidId, catId) \\\n                        for vidId in p.vidIds\n                        for catId in catIds}\n\n        evaluateVid = self.evaluateVid\n        maxDet = p.maxDets[-1]\n        \n        \n        self.evalImgs = [evaluateVid(vidId, catId, areaRng, maxDet)\n                 for catId in catIds\n                 for areaRng in p.areaRng\n                 for vidId in p.vidIds\n             ]\n        self._paramsEval = copy.deepcopy(self.params)\n        toc = time.time()\n        print('DONE (t={:0.2f}s).'.format(toc-tic))\n\n    def computeIoU(self, vidId, catId):\n        p = self.params\n        if p.useCats:\n            gt = self._gts[vidId,catId]\n            dt = self._dts[vidId,catId]\n        else:\n            gt = [_ for cId in p.catIds for _ in self._gts[vidId,cId]]\n            dt = [_ for cId in p.catIds for _ in self._dts[vidId,cId]]\n        if len(gt) == 0 and len(dt) ==0:\n            return []\n        inds = np.argsort([-d['score'] for d in dt], kind='mergesort')\n        dt = [dt[i] for i in inds]\n        if len(dt) > p.maxDets[-1]:\n            dt=dt[0:p.maxDets[-1]]\n\n        if p.iouType == 'segm':\n            g = [g['segmentations'] for g in gt]\n            d = 
[d['segmentations'] for d in dt]\n        elif p.iouType == 'bbox':\n            g = [g['bboxes'] for g in gt]\n            d = [d['bboxes'] for d in dt]\n        else:\n            raise Exception('unknown iouType for iou computation')\n\n        # compute iou between each dt and gt region\n        iscrowd = [int(o['iscrowd']) for o in gt]\n        #ious = maskUtils.iou(d,g,iscrowd)\n        def iou_seq(d_seq, g_seq):\n            i = .0\n            u = .0\n            for d, g in zip(d_seq, g_seq):\n                if d and g:\n                    i += maskUtils.area(maskUtils.merge([d, g], True))\n                    u += maskUtils.area(maskUtils.merge([d, g], False))\n                elif not d and g:\n                    u += maskUtils.area(g)\n                elif d and not g:\n                    u += maskUtils.area(d)\n            if not u > .0:\n                print(\"Mask sizes in video {} and category {} may not match!\".format(vidId, catId))\n            iou = i / u if u > .0 else .0\n            return iou\n        ious = np.zeros([len(d), len(g)])\n        for i, j in np.ndindex(ious.shape):\n            ious[i, j] = iou_seq(d[i], g[j])\n        #print(vidId, catId, ious.shape, ious)\n        return ious\n\n    def computeOks(self, imgId, catId):\n        p = self.params\n        # dimension here should be Nxm\n        gts = self._gts[imgId, catId]\n        dts = self._dts[imgId, catId]\n        inds = np.argsort([-d['score'] for d in dts], kind='mergesort')\n        dts = [dts[i] for i in inds]\n        if len(dts) > p.maxDets[-1]:\n            dts = dts[0:p.maxDets[-1]]\n        # if len(gts) == 0 and len(dts) == 0:\n        if len(gts) == 0 or len(dts) == 0:\n            return []\n        ious = np.zeros((len(dts), len(gts)))\n        sigmas = np.array([.26, .25, .25, .35, .35, .79, .79, .72, .72, .62,.62, 1.07, 1.07, .87, .87, .89, .89])/10.0\n        vars = (sigmas * 2)**2\n        k = len(sigmas)\n        # compute oks between each detection 
and ground truth object\n        for j, gt in enumerate(gts):\n            # create bounds for ignore regions(double the gt bbox)\n            g = np.array(gt['keypoints'])\n            xg = g[0::3]; yg = g[1::3]; vg = g[2::3]\n            k1 = np.count_nonzero(vg > 0)\n            bb = gt['bbox']\n            x0 = bb[0] - bb[2]; x1 = bb[0] + bb[2] * 2\n            y0 = bb[1] - bb[3]; y1 = bb[1] + bb[3] * 2\n            for i, dt in enumerate(dts):\n                d = np.array(dt['keypoints'])\n                xd = d[0::3]; yd = d[1::3]\n                if k1>0:\n                    # measure the per-keypoint distance if keypoints visible\n                    dx = xd - xg\n                    dy = yd - yg\n                else:\n                    # measure minimum distance to keypoints in (x0,y0) & (x1,y1)\n                    z = np.zeros((k))\n                    dx = np.max((z, x0-xd),axis=0)+np.max((z, xd-x1),axis=0)\n                    dy = np.max((z, y0-yd),axis=0)+np.max((z, yd-y1),axis=0)\n                e = (dx**2 + dy**2) / vars / (gt['avg_area']+np.spacing(1)) / 2\n                if k1 > 0:\n                    e=e[vg > 0]\n                ious[i, j] = np.sum(np.exp(-e)) / e.shape[0]\n        return ious\n\n    def evaluateVid(self, vidId, catId, aRng, maxDet):\n        '''\n        perform evaluation for single category and image\n        :return: dict (single image results)\n        '''\n        p = self.params\n        if p.useCats:\n            gt = self._gts[vidId,catId]\n            dt = self._dts[vidId,catId]\n        else:\n            gt = [_ for cId in p.catIds for _ in self._gts[vidId,cId]]\n            dt = [_ for cId in p.catIds for _ in self._dts[vidId,cId]]\n        if len(gt) == 0 and len(dt) ==0:\n            return None\n\n        for g in gt:\n            if g['ignore'] or (g['avg_area']<aRng[0] or g['avg_area']>aRng[1]):\n                g['_ignore'] = 1\n            else:\n                g['_ignore'] = 0\n\n        # sort dt 
highest score first, sort gt ignore last\n        gtind = np.argsort([g['_ignore'] for g in gt], kind='mergesort')\n        gt = [gt[i] for i in gtind]\n        dtind = np.argsort([-d['score'] for d in dt], kind='mergesort')\n        dt = [dt[i] for i in dtind[0:maxDet]]\n        iscrowd = [int(o['iscrowd']) for o in gt]\n        # load computed ious\n        ious = self.ious[vidId, catId][:, gtind] if len(self.ious[vidId, catId]) > 0 else self.ious[vidId, catId]\n\n        T = len(p.iouThrs)\n        G = len(gt)\n        D = len(dt)\n        gtm  = np.zeros((T,G))\n        dtm  = np.zeros((T,D))\n        gtIg = np.array([g['_ignore'] for g in gt])\n        dtIg = np.zeros((T,D))\n        if not len(ious)==0:\n            for tind, t in enumerate(p.iouThrs):\n                for dind, d in enumerate(dt):\n                    # information about best match so far (m=-1 -> unmatched)\n                    iou = min([t,1-1e-10])\n                    m   = -1\n                    for gind, g in enumerate(gt):\n                        # if this gt already matched, and not a crowd, continue\n                        if gtm[tind,gind]>0 and not iscrowd[gind]:\n                            continue\n                        # if dt matched to reg gt, and on ignore gt, stop\n                        if m>-1 and gtIg[m]==0 and gtIg[gind]==1:\n                            break\n                        # continue to next gt unless better match made\n                        if ious[dind,gind] < iou:\n                            continue\n                        # if match successful and best so far, store appropriately\n                        iou=ious[dind,gind]\n                        m=gind\n                    # if match made store id of match for both dt and gt\n                    if m ==-1:\n                        continue\n                    dtIg[tind,dind] = gtIg[m]\n                    dtm[tind,dind]  = gt[m]['id']\n                    gtm[tind,m]     = d['id']\n        
# set unmatched detections outside of area range to ignore\n        a = np.array([d['avg_area']<aRng[0] or d['avg_area']>aRng[1] for d in dt]).reshape((1, len(dt)))\n        dtIg = np.logical_or(dtIg, np.logical_and(dtm==0, np.repeat(a,T,0)))\n        # store results for given image and category\n        return {\n                'video_id':     vidId,\n                'category_id':  catId,\n                'aRng':         aRng,\n                'maxDet':       maxDet,\n                'dtIds':        [d['id'] for d in dt],\n                'gtIds':        [g['id'] for g in gt],\n                'dtMatches':    dtm,\n                'gtMatches':    gtm,\n                'dtScores':     [d['score'] for d in dt],\n                'gtIgnore':     gtIg,\n                'dtIgnore':     dtIg,\n            }\n\n    def accumulate(self, p = None):\n        '''\n        Accumulate per image evaluation results and store the result in self.eval\n        :param p: input params for evaluation\n        :return: None\n        '''\n        print('Accumulating evaluation results...')\n        tic = time.time()\n        if not self.evalImgs:\n            print('Please run evaluate() first')\n        # allows input customized parameters\n        if p is None:\n            p = self.params\n        p.catIds = p.catIds if p.useCats == 1 else [-1]\n        T           = len(p.iouThrs)\n        R           = len(p.recThrs)\n        K           = len(p.catIds) if p.useCats else 1\n        A           = len(p.areaRng)\n        M           = len(p.maxDets)\n        precision   = -np.ones((T,R,K,A,M)) # -1 for the precision of absent categories\n        recall      = -np.ones((T,K,A,M))\n        scores      = -np.ones((T,R,K,A,M))\n\n        # create dictionary for future indexing\n        _pe = self._paramsEval\n        catIds = _pe.catIds if _pe.useCats else [-1]\n        setK = set(catIds)\n        setA = set(map(tuple, _pe.areaRng))\n        setM = set(_pe.maxDets)\n        setI = 
set(_pe.vidIds)\n        # get inds to evaluate\n        k_list = [n for n, k in enumerate(p.catIds)  if k in setK]\n        m_list = [m for n, m in enumerate(p.maxDets) if m in setM]\n        a_list = [n for n, a in enumerate(map(lambda x: tuple(x), p.areaRng)) if a in setA]\n        i_list = [n for n, i in enumerate(p.vidIds)  if i in setI]\n        I0 = len(_pe.vidIds)\n        A0 = len(_pe.areaRng)\n        # retrieve E at each category, area range, and max number of detections\n        for k, k0 in enumerate(k_list):\n            Nk = k0*A0*I0\n            for a, a0 in enumerate(a_list):\n                Na = a0*I0\n                for m, maxDet in enumerate(m_list):\n                    E = [self.evalImgs[Nk + Na + i] for i in i_list]\n                    E = [e for e in E if not e is None]\n                    if len(E) == 0:\n                        continue\n                    dtScores = np.concatenate([e['dtScores'][0:maxDet] for e in E])\n\n                    # different sorting method generates slightly different results.\n                    # mergesort is used to be consistent with the Matlab implementation.\n                    inds = np.argsort(-dtScores, kind='mergesort')\n                    dtScoresSorted = dtScores[inds]\n\n                    dtm  = np.concatenate([e['dtMatches'][:,0:maxDet] for e in E], axis=1)[:,inds]\n                    dtIg = np.concatenate([e['dtIgnore'][:,0:maxDet]  for e in E], axis=1)[:,inds]\n                    gtIg = np.concatenate([e['gtIgnore'] for e in E])\n                    npig = np.count_nonzero(gtIg==0 )\n                    if npig == 0:\n                        continue\n                    tps = np.logical_and(               dtm,  np.logical_not(dtIg) )\n                    fps = np.logical_and(np.logical_not(dtm), np.logical_not(dtIg) )\n\n                    tp_sum = np.cumsum(tps, axis=1).astype(dtype=float)\n                    fp_sum = np.cumsum(fps, axis=1).astype(dtype=float)\n
                    for t, (tp, fp) in enumerate(zip(tp_sum, fp_sum)):\n                        tp = np.array(tp)\n                        fp = np.array(fp)\n                        nd = len(tp)\n                        rc = tp / npig\n                        pr = tp / (fp+tp+np.spacing(1))\n                        q  = np.zeros((R,))\n                        ss = np.zeros((R,))\n\n                        if nd:\n                            recall[t,k,a,m] = rc[-1]\n                        else:\n                            recall[t,k,a,m] = 0\n\n                        # numpy is slow without cython optimization for accessing elements\n                        # using python lists gives a significant speed improvement\n                        pr = pr.tolist(); q = q.tolist()\n\n                        for i in range(nd-1, 0, -1):\n                            if pr[i] > pr[i-1]:\n                                pr[i-1] = pr[i]\n\n                        inds = np.searchsorted(rc, p.recThrs, side='left')\n                        try:\n                            for ri, pi in enumerate(inds):\n                                q[ri] = pr[pi]\n                                ss[ri] = dtScoresSorted[pi]\n                        except IndexError:\n                            pass\n                        precision[t,:,k,a,m] = np.array(q)\n                        scores[t,:,k,a,m] = np.array(ss)\n        self.eval = {\n            'params': p,\n            'counts': [T, R, K, A, M],\n            'date': datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),\n            'precision': precision,\n            'recall':   recall,\n            'scores': scores,\n        }\n        toc = time.time()\n        print('DONE (t={:0.2f}s).'.format( toc-tic))\n\n    def summarize(self):\n        '''\n        Compute and display summary metrics for evaluation results.\n        Note this function can *only* be applied on the default parameter setting\n        '''\n        def _summarize( ap=1, iouThr=None, areaRng='all', 
maxDets=100 ):\n            p = self.params\n            iStr = ' {:<18} {} @[ IoU={:<9} | area={:>6s} | maxDets={:>3d} ] = {:0.3f}'\n            titleStr = 'Average Precision' if ap == 1 else 'Average Recall'\n            typeStr = '(AP)' if ap==1 else '(AR)'\n            iouStr = '{:0.2f}:{:0.2f}'.format(p.iouThrs[0], p.iouThrs[-1]) \\\n                if iouThr is None else '{:0.2f}'.format(iouThr)\n\n            aind = [i for i, aRng in enumerate(p.areaRngLbl) if aRng == areaRng]\n            mind = [i for i, mDet in enumerate(p.maxDets) if mDet == maxDets]\n            if ap == 1:\n                # dimension of precision: [TxRxKxAxM]\n                s = self.eval['precision']\n                # IoU\n                if iouThr is not None:\n                    t = np.where(iouThr == p.iouThrs)[0]\n                    s = s[t]\n                s = s[:,:,:,aind,mind]\n            else:\n                # dimension of recall: [TxKxAxM]\n                s = self.eval['recall']\n                if iouThr is not None:\n                    t = np.where(iouThr == p.iouThrs)[0]\n                    s = s[t]\n                s = s[:,:,aind,mind]\n            if len(s[s>-1])==0:\n                mean_s = -1\n            else:\n                mean_s = np.mean(s[s>-1])\n            print(iStr.format(titleStr, typeStr, iouStr, areaRng, maxDets, mean_s))\n            return mean_s\n        def _summarizeDets():\n            stats = np.zeros((12,))\n            stats[0] = _summarize(1)\n            stats[1] = _summarize(1, iouThr=.5, maxDets=self.params.maxDets[2])\n            stats[2] = _summarize(1, iouThr=.75, maxDets=self.params.maxDets[2])\n            stats[3] = _summarize(1, areaRng='small', maxDets=self.params.maxDets[2])\n            stats[4] = _summarize(1, areaRng='medium', maxDets=self.params.maxDets[2])\n            stats[5] = _summarize(1, areaRng='large', maxDets=self.params.maxDets[2])\n            stats[6] = _summarize(0, maxDets=self.params.maxDets[0])\n   
         stats[7] = _summarize(0, maxDets=self.params.maxDets[1])\n            stats[8] = _summarize(0, maxDets=self.params.maxDets[2])\n            stats[9] = _summarize(0, areaRng='small', maxDets=self.params.maxDets[2])\n            stats[10] = _summarize(0, areaRng='medium', maxDets=self.params.maxDets[2])\n            stats[11] = _summarize(0, areaRng='large', maxDets=self.params.maxDets[2])\n            return stats\n        def _summarizeKps():\n            stats = np.zeros((10,))\n            stats[0] = _summarize(1, maxDets=20)\n            stats[1] = _summarize(1, maxDets=20, iouThr=.5)\n            stats[2] = _summarize(1, maxDets=20, iouThr=.75)\n            stats[3] = _summarize(1, maxDets=20, areaRng='medium')\n            stats[4] = _summarize(1, maxDets=20, areaRng='large')\n            stats[5] = _summarize(0, maxDets=20)\n            stats[6] = _summarize(0, maxDets=20, iouThr=.5)\n            stats[7] = _summarize(0, maxDets=20, iouThr=.75)\n            stats[8] = _summarize(0, maxDets=20, areaRng='medium')\n            stats[9] = _summarize(0, maxDets=20, areaRng='large')\n            return stats\n        if not self.eval:\n            raise Exception('Please run accumulate() first')\n        iouType = self.params.iouType\n        if iouType == 'segm' or iouType == 'bbox':\n            summarize = _summarizeDets\n        elif iouType == 'keypoints':\n            summarize = _summarizeKps\n        self.stats = summarize()\n\n    def __str__(self):\n        self.summarize()\n\nclass Params:\n    '''\n    Params for coco evaluation api\n    '''\n    def setDetParams(self):\n        self.vidIds = []\n        self.catIds = []\n        # np.arange causes trouble.  
the data point on arange is slightly larger than the true value\n        #self.iouThrs = np.linspace(.5, 0.95, np.round((0.95 - .5) / .05) + 1, endpoint=True)\n        #self.recThrs = np.linspace(.0, 1.00, np.round((1.00 - .0) / .01) + 1, endpoint=True)\n        self.iouThrs = np.linspace(.5, 0.95, int(np.round((0.95 - .5) / .05)) + 1, endpoint=True)\n        self.recThrs = np.linspace(.0, 1.00, int(np.round((1.00 - .0) / .01)) + 1, endpoint=True)\n        self.maxDets = [1, 10, 100]\n        self.areaRng = [[0 ** 2, 1e5 ** 2], [0 ** 2, 128 ** 2], [ 128 ** 2, 256 ** 2], [256 ** 2, 1e5 ** 2]]\n        self.areaRngLbl = ['all', 'small', 'medium', 'large']\n        self.useCats = 1\n\n    def setKpParams(self):\n        self.vidIds = []\n        self.catIds = []\n        # np.arange causes trouble.  the data point on arange is slightly larger than the true value\n        self.iouThrs = np.linspace(.5, 0.95, int(np.round((0.95 - .5) / .05)) + 1, endpoint=True)\n        self.recThrs = np.linspace(.0, 1.00, int(np.round((1.00 - .0) / .01)) + 1, endpoint=True)\n        self.maxDets = [20]\n        self.areaRng = [[0 ** 2, 1e5 ** 2], [32 ** 2, 96 ** 2], [96 ** 2, 1e5 ** 2]]\n        self.areaRngLbl = ['all', 'medium', 'large']\n        self.useCats = 1\n\n    def __init__(self, iouType='segm'):\n        if iouType == 'segm' or iouType == 'bbox':\n            self.setDetParams()\n        elif iouType == 'keypoints':\n            self.setKpParams()\n        else:\n            raise Exception('iouType not supported')\n        self.iouType = iouType\n        # useSegm is deprecated\n        self.useSegm = None\n"
  },
  {
    "path": "mask2former_video/data_video/ytvis_eval.py",
    "content": "# Modified by Bowen Cheng from https://github.com/sukjunhwang/IFC\n\nimport contextlib\nimport copy\nimport io\nimport itertools\nimport json\nimport logging\nimport numpy as np\nimport os\nfrom collections import OrderedDict\nimport pycocotools.mask as mask_util\nimport torch\nfrom .datasets.ytvis_api.ytvos import YTVOS\nfrom .datasets.ytvis_api.ytvoseval import YTVOSeval\nfrom tabulate import tabulate\n\nimport detectron2.utils.comm as comm\nfrom detectron2.config import CfgNode\nfrom detectron2.data import MetadataCatalog\nfrom detectron2.evaluation import DatasetEvaluator\nfrom detectron2.utils.file_io import PathManager\nfrom detectron2.utils.logger import create_small_table\n\n\nclass YTVISEvaluator(DatasetEvaluator):\n    \"\"\"\n    Evaluate AR for object proposals, AP for instance detection/segmentation, AP\n    for keypoint detection outputs using COCO's metrics.\n    See http://cocodataset.org/#detection-eval and\n    http://cocodataset.org/#keypoints-eval to understand its metrics.\n\n    In addition to COCO, this evaluator is able to support any bounding box detection,\n    instance segmentation, or keypoint detection dataset.\n    \"\"\"\n\n    def __init__(\n        self,\n        dataset_name,\n        tasks=None,\n        distributed=True,\n        output_dir=None,\n        *,\n        use_fast_impl=True,\n    ):\n        \"\"\"\n        Args:\n            dataset_name (str): name of the dataset to be evaluated.\n                It must have either the following corresponding metadata:\n\n                    \"json_file\": the path to the COCO format annotation\n\n                Or it must be in detectron2's standard dataset format\n                so it can be converted to COCO format automatically.\n            tasks (tuple[str]): tasks that can be evaluated under the given\n                configuration. 
A task is one of \"bbox\", \"segm\", \"keypoints\".\n                By default, will infer this automatically from predictions.\n            distributed (True): if True, will collect results from all ranks and run evaluation\n                in the main process.\n                Otherwise, will only evaluate the results in the current process.\n            output_dir (str): optional, an output directory to dump all\n                results predicted on the dataset. The dump contains two files:\n\n                1. \"instances_predictions.pth\" a file in torch serialization\n                   format that contains all the raw original predictions.\n                2. \"coco_instances_results.json\" a json file in COCO's result\n                   format.\n            use_fast_impl (bool): use a fast but **unofficial** implementation to compute AP.\n                Although the results should be very close to the official implementation in COCO\n                API, it is still recommended to compute results with the official API for use in\n                papers. 
The faster implementation also uses more RAM.\n        \"\"\"\n        self._logger = logging.getLogger(__name__)\n        self._distributed = distributed\n        self._output_dir = output_dir\n        self._use_fast_impl = use_fast_impl\n\n        if tasks is not None and isinstance(tasks, CfgNode):\n            self._logger.warning(\n                \"COCO Evaluator instantiated using config, this is deprecated behavior.\"\n                \" Please pass in explicit arguments instead.\"\n            )\n            self._tasks = None  # Infering it from predictions should be better\n        else:\n            self._tasks = tasks\n\n        self._cpu_device = torch.device(\"cpu\")\n\n        self._metadata = MetadataCatalog.get(dataset_name)\n\n        json_file = PathManager.get_local_path(self._metadata.json_file)\n        with contextlib.redirect_stdout(io.StringIO()):\n            self._ytvis_api = YTVOS(json_file)\n\n        # Test set json files do not contain annotations (evaluation must be\n        # performed using the COCO evaluation server).\n        self._do_evaluation = \"annotations\" in self._ytvis_api.dataset\n\n    def reset(self):\n        self._predictions = []\n\n    def process(self, inputs, outputs):\n        \"\"\"\n        Args:\n            inputs: the inputs to a COCO model (e.g., GeneralizedRCNN).\n                It is a list of dict. Each dict corresponds to an image and\n                contains keys like \"height\", \"width\", \"file_name\", \"image_id\".\n            outputs: the outputs of a COCO model. It is a list of dicts with key\n                \"instances\" that contains :class:`Instances`.\n        \"\"\"\n        prediction = instances_to_coco_json_video(inputs, outputs)\n        self._predictions.extend(prediction)\n\n    def evaluate(self):\n        \"\"\"\n        Args:\n            img_ids: a list of image IDs to evaluate on. 
Default to None for the whole dataset\n        \"\"\"\n        if self._distributed:\n            comm.synchronize()\n            predictions = comm.gather(self._predictions, dst=0)\n            predictions = list(itertools.chain(*predictions))\n\n            if not comm.is_main_process():\n                return {}\n        else:\n            predictions = self._predictions\n\n        if len(predictions) == 0:\n            self._logger.warning(\"[COCOEvaluator] Did not receive valid predictions.\")\n            return {}\n\n        if self._output_dir:\n            PathManager.mkdirs(self._output_dir)\n            file_path = os.path.join(self._output_dir, \"instances_predictions.pth\")\n            with PathManager.open(file_path, \"wb\") as f:\n                torch.save(predictions, f)\n\n        self._results = OrderedDict()\n        self._eval_predictions(predictions)\n        # Copy so the caller can do whatever with results\n        return copy.deepcopy(self._results)\n\n    def _eval_predictions(self, predictions):\n        \"\"\"\n        Evaluate predictions. 
Fill self._results with the metrics of the tasks.\n        \"\"\"\n        self._logger.info(\"Preparing results for YTVIS format ...\")\n\n        # unmap the category ids for COCO\n        if hasattr(self._metadata, \"thing_dataset_id_to_contiguous_id\"):\n            dataset_id_to_contiguous_id = self._metadata.thing_dataset_id_to_contiguous_id\n            all_contiguous_ids = list(dataset_id_to_contiguous_id.values())\n            num_classes = len(all_contiguous_ids)\n            assert min(all_contiguous_ids) == 0 and max(all_contiguous_ids) == num_classes - 1\n\n            reverse_id_mapping = {v: k for k, v in dataset_id_to_contiguous_id.items()}\n            for result in predictions:\n                category_id = result[\"category_id\"]\n                assert category_id < num_classes, (\n                    f\"A prediction has class={category_id}, \"\n                    f\"but the dataset only has {num_classes} classes and \"\n                    f\"predicted class id should be in [0, {num_classes - 1}].\"\n                )\n                result[\"category_id\"] = reverse_id_mapping[category_id]\n\n        if self._output_dir:\n            file_path = os.path.join(self._output_dir, \"results.json\")\n            self._logger.info(\"Saving results to {}\".format(file_path))\n            with PathManager.open(file_path, \"w\") as f:\n                f.write(json.dumps(predictions))\n                f.flush()\n\n        if not self._do_evaluation:\n            self._logger.info(\"Annotations are not available for evaluation.\")\n            return\n\n        coco_eval = (\n            _evaluate_predictions_on_coco(\n                self._ytvis_api,\n                predictions,\n            )\n            if len(predictions) > 0\n            else None  # cocoapi does not handle empty results very well\n        )\n\n        res = self._derive_coco_results(\n            coco_eval, class_names=self._metadata.get(\"thing_classes\")\n        )\n        
self._results[\"segm\"] = res\n\n    def _derive_coco_results(self, coco_eval, class_names=None):\n        \"\"\"\n        Derive the desired score numbers from summarized COCOeval.\n        Args:\n            coco_eval (None or COCOEval): None represents no predictions from model.\n            class_names (None or list[str]): if provided, will also report\n                per-category AP.\n        Returns:\n            a dict of {metric name: score}\n        \"\"\"\n\n        metrics = [\"AP\", \"AP50\", \"AP75\", \"APs\", \"APm\", \"APl\", \"AR1\", \"AR10\"]\n\n        if coco_eval is None:\n            self._logger.warning(\"No predictions from the model!\")\n            return {metric: float(\"nan\") for metric in metrics}\n\n        # the standard metrics\n        results = {\n            metric: float(coco_eval.stats[idx] * 100 if coco_eval.stats[idx] >= 0 else \"nan\")\n            for idx, metric in enumerate(metrics)\n        }\n        self._logger.info(\n            \"Evaluation results for {}: \\n\".format(\"segm\") + create_small_table(results)\n        )\n        if not np.isfinite(sum(results.values())):\n            self._logger.info(\"Some metrics cannot be computed and are shown as NaN.\")\n\n        if class_names is None or len(class_names) <= 1:\n            return results\n        # Compute per-category AP\n        # from https://github.com/facebookresearch/Detectron/blob/a6a835f5b8208c45d0dce217ce9bbda915f44df7/detectron/datasets/json_dataset_evaluator.py#L222-L252 # noqa\n        precisions = coco_eval.eval[\"precision\"]\n        # precision has dims (iou, recall, cls, area range, max dets)\n        assert len(class_names) == precisions.shape[2]\n\n        results_per_category = []\n        for idx, name in enumerate(class_names):\n            # area range index 0: all area ranges\n            # max dets index -1: typically 100 per image\n            precision = precisions[:, :, idx, 0, -1]\n            
precision = precision[precision > -1]\n            ap = np.mean(precision) if precision.size else float(\"nan\")\n            results_per_category.append((\"{}\".format(name), float(ap * 100)))\n\n        # tabulate it\n        N_COLS = min(6, len(results_per_category) * 2)\n        results_flatten = list(itertools.chain(*results_per_category))\n        results_2d = itertools.zip_longest(*[results_flatten[i::N_COLS] for i in range(N_COLS)])\n        table = tabulate(\n            results_2d,\n            tablefmt=\"pipe\",\n            floatfmt=\".3f\",\n            headers=[\"category\", \"AP\"] * (N_COLS // 2),\n            numalign=\"left\",\n        )\n        self._logger.info(\"Per-category {} AP: \\n\".format(\"segm\") + table)\n\n        results.update({\"AP-\" + name: ap for name, ap in results_per_category})\n        return results\n\n\ndef instances_to_coco_json_video(inputs, outputs):\n    \"\"\"\n    Dump the predictions for a single video to the YTVIS json format used for evaluation.\n\n    Args:\n        inputs (list[dict]): a single-element list holding the input dict of one video.\n        outputs (dict): model outputs with keys \"pred_scores\", \"pred_labels\" and \"pred_masks\".\n\n    Returns:\n        list[dict]: list of json annotations in COCO format.\n    \"\"\"\n    assert len(inputs) == 1, \"More than one input is loaded for inference!\"\n\n    video_id = inputs[0][\"video_id\"]\n    video_length = inputs[0][\"length\"]\n\n    scores = outputs[\"pred_scores\"]\n    labels = outputs[\"pred_labels\"]\n    masks = outputs[\"pred_masks\"]\n\n    ytvis_results = []\n    for instance_id, (s, l, m) in enumerate(zip(scores, labels, masks)):\n        segms = [\n            mask_util.encode(np.array(_mask[:, :, None], order=\"F\", dtype=\"uint8\"))[0]\n            for _mask in m\n        ]\n        for rle in segms:\n            rle[\"counts\"] = rle[\"counts\"].decode(\"utf-8\")\n\n        res = {\n            \"video_id\": video_id,\n            \"score\": s,\n            \"category_id\": l,\n            \"segmentations\": segms,\n        }\n        
ytvis_results.append(res)\n\n    return ytvis_results\n\n\ndef _evaluate_predictions_on_coco(\n    coco_gt,\n    coco_results,\n    img_ids=None,\n):\n    \"\"\"\n    Evaluate the results using the YTVOSeval API.\n    \"\"\"\n    assert len(coco_results) > 0\n\n    coco_results = copy.deepcopy(coco_results)\n    # When evaluating mask AP, if the results contain bbox, cocoapi will\n    # use the box area as the area of the instance, instead of the mask area.\n    # This leads to a different definition of small/medium/large.\n    # We remove the bbox field to let mask AP use mask area.\n    for c in coco_results:\n        c.pop(\"bbox\", None)\n\n    coco_dt = coco_gt.loadRes(coco_results)\n    coco_eval = YTVOSeval(coco_gt, coco_dt)\n    max_dets_per_image = [1, 10, 100]  # the COCOEval default\n    coco_eval.params.maxDets = max_dets_per_image\n\n    if img_ids is not None:\n        coco_eval.params.imgIds = img_ids\n\n    coco_eval.evaluate()\n    coco_eval.accumulate()\n    coco_eval.summarize()\n\n    return coco_eval\n"
  },
  {
    "path": "mask2former_video/modeling/__init__.py",
    "content": "from .transformer_decoder.video_mask2former_transformer_decoder import VideoMultiScaleMaskedTransformerDecoder\n"
  },
  {
    "path": "mask2former_video/modeling/criterion.py",
    "content": "# Modified by Bowen Cheng from https://github.com/facebookresearch/detr/blob/master/models/detr.py\n\nimport logging\n\nimport torch\nimport torch.nn.functional as F\nfrom torch import nn\n\nfrom detectron2.utils.comm import get_world_size\nfrom detectron2.projects.point_rend.point_features import (\n    get_uncertain_point_coords_with_randomness,\n    point_sample,\n)\n\nfrom mask2former.utils.misc import is_dist_avail_and_initialized\nimport random\nimport cv2\nimport os\n\ndef unfold_wo_center(x, kernel_size, dilation):\n    assert x.dim() == 4\n    assert kernel_size % 2 == 1\n\n    # using SAME padding\n    padding = (kernel_size + (dilation - 1) * (kernel_size - 1)) // 2\n    unfolded_x = F.unfold(\n        x, kernel_size=kernel_size,\n        padding=padding,\n        dilation=dilation\n    )\n\n    unfolded_x = unfolded_x.reshape(\n        x.size(0), x.size(1), -1, x.size(2), x.size(3)\n    )\n\n    # remove the center pixels\n    size = kernel_size ** 2\n    unfolded_x = torch.cat((\n        unfolded_x[:, :, :size // 2],\n        unfolded_x[:, :, size // 2 + 1:]\n    ), dim=2)\n\n    return unfolded_x\n\ndef unfold_w_center(x, kernel_size, dilation):\n    assert x.dim() == 4\n    assert kernel_size % 2 == 1\n\n    # using SAME padding\n    padding = (kernel_size + (dilation - 1) * (kernel_size - 1)) // 2\n    unfolded_x = F.unfold(\n        x, kernel_size=kernel_size,\n        padding=padding,\n        dilation=dilation\n    )\n\n    unfolded_x = unfolded_x.reshape(\n        x.size(0), x.size(1), -1, x.size(2), x.size(3)\n    )\n\n    \n    return unfolded_x\n\ndef compute_pairwise_term(mask_logits, pairwise_size, pairwise_dilation):\n    assert mask_logits.dim() == 4\n\n    log_fg_prob = F.logsigmoid(mask_logits)\n    log_bg_prob = F.logsigmoid(-mask_logits)\n\n    log_fg_prob_unfold = unfold_wo_center(\n        log_fg_prob, kernel_size=pairwise_size,\n        dilation=pairwise_dilation\n    )\n    log_bg_prob_unfold = unfold_wo_center(\n  
      log_bg_prob, kernel_size=pairwise_size,\n        dilation=pairwise_dilation\n    )\n\n    # the probability of making the same prediction = p_i * p_j + (1 - p_i) * (1 - p_j)\n    # we compute the probability in log space to avoid numerical instability\n    log_same_fg_prob = log_fg_prob[:, :, None] + log_fg_prob_unfold\n    log_same_bg_prob = log_bg_prob[:, :, None] + log_bg_prob_unfold\n\n    max_ = torch.max(log_same_fg_prob, log_same_bg_prob)\n    log_same_prob = torch.log(\n        torch.exp(log_same_fg_prob - max_) +\n        torch.exp(log_same_bg_prob - max_)\n    ) + max_\n\n    # loss = -log(prob)\n    return -log_same_prob[:, 0]\n\ndef compute_pairwise_term_neighbor(mask_logits, mask_logits_neighbor, pairwise_size, pairwise_dilation):\n    assert mask_logits.dim() == 4\n\n    log_fg_prob_neigh = F.logsigmoid(mask_logits_neighbor)\n    log_bg_prob_neigh = F.logsigmoid(-mask_logits_neighbor)\n\n    log_fg_prob = F.logsigmoid(mask_logits)\n    log_bg_prob = F.logsigmoid(-mask_logits)\n    \n    log_fg_prob_unfold = unfold_w_center(\n        log_fg_prob, kernel_size=pairwise_size,\n        dilation=pairwise_dilation\n    )\n    # print('log_fg_prob shape:', log_fg_prob.shape, 'log_fg_prob unfold:', log_fg_prob_unfold.shape)\n    log_bg_prob_unfold = unfold_w_center(\n        log_bg_prob, kernel_size=pairwise_size,\n        dilation=pairwise_dilation\n    )\n\n    # the probability of making the same prediction = p_i * p_j + (1 - p_i) * (1 - p_j)\n    # we compute the probability in log space to avoid numerical instability\n    log_same_fg_prob = log_fg_prob_neigh[:, :, None] + log_fg_prob_unfold\n    log_same_bg_prob = log_bg_prob_neigh[:, :, None] + log_bg_prob_unfold\n\n    max_ = torch.max(log_same_fg_prob, log_same_bg_prob)\n    log_same_prob = torch.log(\n        torch.exp(log_same_fg_prob - max_) +\n        torch.exp(log_same_bg_prob - max_)\n    ) + max_\n\n    # loss = -log(prob)\n    return -log_same_prob[:, 0]\n\ndef dice_coefficient(x, 
target):\n    eps = 1e-5\n    n_inst = x.size(0)\n    x = x.reshape(n_inst, -1)\n    target = target.reshape(n_inst, -1)\n    intersection = (x * target).sum(dim=1)\n    union = (x ** 2.0).sum(dim=1) + (target ** 2.0).sum(dim=1) + eps\n    loss = 1. - (2 * intersection / union)\n    return loss\n\ndef compute_project_term(mask_scores, gt_bitmasks):\n    mask_losses_y = dice_coefficient(\n        mask_scores.max(dim=2, keepdim=True)[0],\n        gt_bitmasks.max(dim=2, keepdim=True)[0]\n    )\n    mask_losses_x = dice_coefficient(\n        mask_scores.max(dim=3, keepdim=True)[0],\n        gt_bitmasks.max(dim=3, keepdim=True)[0]\n    )\n    return (mask_losses_x + mask_losses_y).mean()\n\ndef dice_loss(\n        inputs: torch.Tensor,\n        targets: torch.Tensor,\n        num_masks: float,\n    ):\n    \"\"\"\n    Compute the DICE loss, similar to generalized IOU for masks\n    Args:\n        inputs: A float tensor of arbitrary shape.\n                The predictions for each example.\n        targets: A float tensor with the same shape as inputs. Stores the binary\n                 classification label for each element in inputs\n                (0 for the negative class and 1 for the positive class).\n    \"\"\"\n    inputs = inputs.sigmoid()\n    inputs = inputs.flatten(1)\n    numerator = 2 * (inputs * targets).sum(-1)\n    denominator = inputs.sum(-1) + targets.sum(-1)\n    loss = 1 - (numerator + 1) / (denominator + 1)\n    return loss.sum() / num_masks\n\n\ndice_loss_jit = torch.jit.script(\n    dice_loss\n)  # type: torch.jit.ScriptModule\n\n\ndef sigmoid_ce_loss(\n        inputs: torch.Tensor,\n        targets: torch.Tensor,\n        num_masks: float,\n    ):\n    \"\"\"\n    Args:\n        inputs: A float tensor of arbitrary shape.\n                The predictions for each example.\n        targets: A float tensor with the same shape as inputs. 
Stores the binary\n                 classification label for each element in inputs\n                (0 for the negative class and 1 for the positive class).\n    Returns:\n        Loss tensor\n    \"\"\"\n    loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction=\"none\")\n    return loss.mean(1).sum() / num_masks\n\n\nsigmoid_ce_loss_jit = torch.jit.script(\n    sigmoid_ce_loss\n)  # type: torch.jit.ScriptModule\n\n\ndef visualize_masks(masks, output_dir='masks'):\n    \"\"\"\n    Visualize a binary mask tensor of shape (N, H, W) by saving each mask as a JPEG image in the output directory.\n    \"\"\"\n    os.makedirs(output_dir, exist_ok=True)\n\n    n, h, w = masks.shape\n    masks = masks.cpu().numpy()\n    for i in range(n):\n        mask = (masks[i] * 255).astype('uint8') \n        print('mask sum', mask.sum(), mask.max(), mask.min())\n        # mask = cv2.cvtColor(mask, cv2.COLOR_GRAY2BGR)\n        # mask = mask * 255\n        # mask = cv2.cvtColor(mask, cv2.COLOR_GRAY2BGR)\n        filename = os.path.join(output_dir, f'mask_{i}.jpg')\n        cv2.imwrite(filename, mask)\n\ndef calculate_uncertainty(logits):\n    \"\"\"\n    We estimate uncertainty as the L1 distance between 0.0 and the logit prediction in 'logits'.\n    Args:\n        logits (Tensor): A class-agnostic tensor of shape (R, 1, ...), where R is the\n            total number of predicted masks in all images. The values are logits.\n    Returns:\n        scores (Tensor): A tensor of shape (R, 1, ...) 
that contains uncertainty scores with\n            the most uncertain locations having the highest uncertainty score.\n    \"\"\"\n    assert logits.shape[1] == 1\n    gt_class_logits = logits.clone()\n    return -(torch.abs(gt_class_logits))\n\n\nclass VideoSetCriterion(nn.Module):\n    \"\"\"This class computes the loss for DETR.\n    The process happens in two steps:\n        1) we compute hungarian assignment between ground truth boxes and the outputs of the model\n        2) we supervise each pair of matched ground-truth / prediction (supervise class and box)\n    \"\"\"\n\n    def __init__(self, num_classes, matcher, weight_dict, eos_coef, losses,\n                 num_points, oversample_ratio, importance_sample_ratio):\n        \"\"\"Create the criterion.\n        Parameters:\n            num_classes: number of object categories, omitting the special no-object category\n            matcher: module able to compute a matching between targets and proposals\n            weight_dict: dict containing as key the names of the losses and as values their relative weight.\n            eos_coef: relative classification weight applied to the no-object category\n            losses: list of all the losses to be applied. 
See get_loss for list of available losses.\n        \"\"\"\n        super().__init__()\n        self.num_classes = num_classes\n        self.matcher = matcher\n        self.weight_dict = weight_dict\n        self.eos_coef = eos_coef\n        self.losses = losses\n        empty_weight = torch.ones(self.num_classes + 1)\n        empty_weight[-1] = self.eos_coef\n        self.register_buffer(\"empty_weight\", empty_weight)\n\n        # pointwise mask loss parameters\n        self.num_points = num_points\n        self.oversample_ratio = oversample_ratio\n        self.importance_sample_ratio = importance_sample_ratio\n\n        self._warmup_iters = 2000\n        self.register_buffer(\"_iter\", torch.zeros([1]))\n\n    def loss_labels(self, outputs, targets, indices, num_masks):\n        \"\"\"Classification loss (NLL)\n        targets dicts must contain the key \"labels\" containing a tensor of dim [nb_target_boxes]\n        \"\"\"\n        assert \"pred_logits\" in outputs\n        src_logits = outputs[\"pred_logits\"].float()\n\n        idx = self._get_src_permutation_idx(indices)\n        target_classes_o = torch.cat([t[\"labels\"][J] for t, (_, J) in zip(targets, indices)])\n        target_classes = torch.full(\n            src_logits.shape[:2], self.num_classes, dtype=torch.int64, device=src_logits.device\n        )\n        target_classes[idx] = target_classes_o\n\n        loss_ce = F.cross_entropy(src_logits.transpose(1, 2), target_classes, self.empty_weight)\n        losses = {\"loss_ce\": loss_ce}\n        return losses\n    \n    def loss_masks(self, outputs, targets, indices, num_masks):\n        \"\"\"Compute the losses related to the masks: the sigmoid cross-entropy loss and the dice loss.\n        targets dicts must contain the key \"masks\" containing a tensor of dim [nb_target_boxes, h, w]\n        \"\"\"\n        assert \"pred_masks\" in outputs\n\n        src_idx = self._get_src_permutation_idx(indices)\n        src_masks = outputs[\"pred_masks\"]\n        src_masks = 
src_masks[src_idx]\n        # Modified to handle video\n        target_masks = torch.cat([t['masks'][i] for t, (_, i) in zip(targets, indices)]).to(src_masks)\n\n        # No need to upsample predictions as we are using normalized coordinates :)\n        # NT x 1 x H x W\n        src_masks = src_masks.flatten(0, 1)[:, None]\n        target_masks = target_masks.flatten(0, 1)[:, None]\n        # print('src_masks shape:', src_masks.shape)\n        # print('target_masks shape:', target_masks.shape)\n\n        with torch.no_grad():\n            # sample point_coords\n            point_coords = get_uncertain_point_coords_with_randomness(\n                src_masks,\n                lambda logits: calculate_uncertainty(logits),\n                self.num_points,\n                self.oversample_ratio,\n                self.importance_sample_ratio,\n            )\n            # get gt labels\n            point_labels = point_sample(\n                target_masks,\n                point_coords,\n                align_corners=False,\n            ).squeeze(1)\n\n        point_logits = point_sample(\n            src_masks,\n            point_coords,\n            align_corners=False,\n        ).squeeze(1)\n\n        losses = {\n            \"loss_mask\": sigmoid_ce_loss_jit(point_logits, point_labels, num_masks),\n            \"loss_mask_proj\": src_masks.sum() * 0.,\n            \"loss_dice\": dice_loss_jit(point_logits, point_labels, num_masks),\n            \"loss_bound\": src_masks.sum() * 0.,\n            \"loss_bound_neighbor\": src_masks.sum() * 0.,\n        }\n\n        del src_masks\n        del target_masks\n        return losses\n    \n    def topk_mask(self, images_lab_sim):\n        images_lab_sim_mask = torch.zeros_like(images_lab_sim)\n        topk, indices = torch.topk(images_lab_sim, 5, dim =1) \n        images_lab_sim_mask = images_lab_sim_mask.scatter(1, indices, topk)\n        return images_lab_sim_mask\n\n    def loss_masks_proj(self, outputs, targets, 
indices, num_masks, images_lab_sim, images_lab_sim_nei, images_lab_sim_nei1, images_lab_sim_nei2):\n        \"\"\"Compute the box-supervised mask losses: the projection loss and the pairwise affinity losses.\n        targets dicts must contain the key \"masks\" containing a tensor of dim [nb_target_boxes, h, w]\n        \"\"\"\n        assert \"pred_masks\" in outputs\n        self._iter += 1\n        \n        # print('images_lab_sim is None:', (images_lab_sim is None))\n        if images_lab_sim is None:\n            return self.loss_masks(outputs, targets, indices, num_masks)\n        \n        src_idx = self._get_src_permutation_idx(indices)\n        src_masks = outputs[\"pred_masks\"]\n        src_masks = src_masks[src_idx]\n        # Modified to handle video\n        target_masks = torch.cat([t['masks'][i] for t, (_, i) in zip(targets, indices)]).to(src_masks)\n\n        images_lab_sim = torch.cat(images_lab_sim, dim=0)\n        images_lab_sim_nei = torch.cat(images_lab_sim_nei, dim=0)\n        images_lab_sim_nei1 = torch.cat(images_lab_sim_nei1, dim=0)\n        images_lab_sim_nei2 = torch.cat(images_lab_sim_nei2, dim=0)\n        images_lab_sim = images_lab_sim.view(-1, target_masks.shape[1], images_lab_sim.shape[-3], images_lab_sim.shape[-2], images_lab_sim.shape[-1])\n        images_lab_sim_nei = images_lab_sim_nei.unsqueeze(1)\n        images_lab_sim_nei1 = images_lab_sim_nei1.unsqueeze(1)\n        images_lab_sim_nei2 = images_lab_sim_nei2.unsqueeze(1)\n        if len(src_idx[0].tolist()) > 0:\n            images_lab_sim = torch.cat([images_lab_sim[ind][None] for ind in src_idx[0].tolist()]).flatten(0, 1)\n            images_lab_sim_nei = self.topk_mask(torch.cat([images_lab_sim_nei[ind][None] for ind in src_idx[0].tolist()]).flatten(0, 1))\n            images_lab_sim_nei1 = self.topk_mask(torch.cat([images_lab_sim_nei1[ind][None] for ind in src_idx[0].tolist()]).flatten(0, 1))\n            images_lab_sim_nei2 = self.topk_mask(torch.cat([images_lab_sim_nei2[ind][None] for 
ind in src_idx[0].tolist()]).flatten(0, 1))\n\n        k_size = 3 \n\n        if src_masks.shape[0] > 0:\n            pairwise_losses_neighbor = compute_pairwise_term_neighbor(\n                src_masks[:,:1], src_masks[:,1:2], k_size, 3\n            ) \n            pairwise_losses_neighbor1 = compute_pairwise_term_neighbor(\n                src_masks[:,:1], src_masks[:,2:3], k_size, 3\n            ) \n            pairwise_losses_neighbor2 = compute_pairwise_term_neighbor(\n                src_masks[:,1:2], src_masks[:,2:3], k_size, 3\n            )\n            \n        src_masks = src_masks.flatten(0, 1)[:, None]\n        target_masks = target_masks.flatten(0, 1)[:, None]\n        target_masks = F.interpolate(target_masks, (src_masks.shape[-2], src_masks.shape[-1]), mode='bilinear')        \n        \n        if src_masks.shape[0] > 0:\n            loss_prj_term = compute_project_term(src_masks.sigmoid(), target_masks)  \n            \n            pairwise_losses = compute_pairwise_term(\n                src_masks, 3, 2\n            )\n            weights = (images_lab_sim >= 0.3).float() * target_masks.float()\n            target_masks_sum = target_masks.reshape(pairwise_losses_neighbor.shape[0], 3, target_masks.shape[-2], target_masks.shape[-1]).sum(dim=1, keepdim=True)\n            target_masks_sum = (target_masks_sum >= 1.0).float()\n            \n            weights_neighbor = (images_lab_sim_nei >= 0.05).float() * target_masks_sum # ori is 0.5, 0.01, 0.001, 0.005, 0.0001, 0.02, 0.05, 0.075, 0.1 , dy 0.5\n            weights_neighbor1 = (images_lab_sim_nei1 >= 0.05).float() * target_masks_sum # ori is 0.5, 0.01, 0.001, 0.005, 0.0001, 0.02, 0.05, 0.075, 0.1, dy 0.5\n            weights_neighbor2 = (images_lab_sim_nei2 >= 0.05).float() * target_masks_sum # ori is 0.5, 0.01, 0.001, 0.005, 0.0001, 0.02, 0.05, 0.075, 0.1, dy 0.5\n\n            warmup_factor = min(self._iter.item() / float(self._warmup_iters), 1.0) #1.0\n\n            loss_pairwise = 
(pairwise_losses * weights).sum() / weights.sum().clamp(min=1.0)\n            loss_pairwise_neighbor = (pairwise_losses_neighbor * weights_neighbor).sum() / weights_neighbor.sum().clamp(min=1.0) * warmup_factor\n            loss_pairwise_neighbor1 = (pairwise_losses_neighbor1 * weights_neighbor1).sum() / weights_neighbor1.sum().clamp(min=1.0) * warmup_factor\n            loss_pairwise_neighbor2 = (pairwise_losses_neighbor2 * weights_neighbor2).sum() / weights_neighbor2.sum().clamp(min=1.0) * warmup_factor\n        else:\n            loss_prj_term = src_masks.sum() * 0.\n            loss_pairwise = src_masks.sum() * 0.\n            loss_pairwise_neighbor = src_masks.sum() * 0.\n            loss_pairwise_neighbor1 = src_masks.sum() * 0.\n            loss_pairwise_neighbor2 = src_masks.sum() * 0.\n\n        losses = {\n            \"loss_mask\": src_masks.sum() * 0.,\n            \"loss_mask_proj\": loss_prj_term,\n            \"loss_dice\": src_masks.sum() * 0.,\n            \"loss_bound\": loss_pairwise,\n            \"loss_bound_neighbor\": (loss_pairwise_neighbor + loss_pairwise_neighbor1 + loss_pairwise_neighbor2) * 0.1,\n        }\n\n        del src_masks\n        del target_masks\n        return losses\n\n    def _get_src_permutation_idx(self, indices):\n        # permute predictions following indices\n        batch_idx = torch.cat([torch.full_like(src, i) for i, (src, _) in enumerate(indices)])\n        src_idx = torch.cat([src for (src, _) in indices])\n        return batch_idx, src_idx\n\n    def _get_tgt_permutation_idx(self, indices):\n        # permute targets following indices\n        batch_idx = torch.cat([torch.full_like(tgt, i) for i, (_, tgt) in enumerate(indices)])\n        tgt_idx = torch.cat([tgt for (_, tgt) in indices])\n        return batch_idx, tgt_idx\n\n    def get_loss(self, loss, outputs, targets, indices, num_masks, images_lab_sim, images_lab_sim_nei, images_lab_sim_nei1, images_lab_sim_nei2):\n        loss_map = {\n            'labels': 
self.loss_labels,\n            'masks': self.loss_masks_proj,\n        }\n        assert loss in loss_map, f\"do you really want to compute {loss} loss?\"\n        if loss == 'masks':\n            return loss_map[loss](outputs, targets, indices, num_masks, images_lab_sim, images_lab_sim_nei, images_lab_sim_nei1, images_lab_sim_nei2)\n        else:\n            return loss_map[loss](outputs, targets, indices, num_masks)\n\n    def forward(self, outputs, targets, images_lab_sim, images_lab_sim_nei, images_lab_sim_nei1, images_lab_sim_nei2):\n        \"\"\"This performs the loss computation.\n        Parameters:\n             outputs: dict of tensors, see the output specification of the model for the format\n             targets: list of dicts, such that len(targets) == batch_size.\n                      The expected keys in each dict depend on the losses applied, see each loss' doc\n        \"\"\"\n        outputs_without_aux = {k: v for k, v in outputs.items() if k != \"aux_outputs\"}\n\n        # Retrieve the matching between the outputs of the last layer and the targets\n        indices = self.matcher(outputs_without_aux, targets)\n\n        # Compute the average number of target boxes across all nodes, for normalization purposes\n        num_masks = sum(len(t[\"labels\"]) for t in targets)\n        num_masks = torch.as_tensor(\n            [num_masks], dtype=torch.float, device=next(iter(outputs.values())).device\n        )\n        if is_dist_avail_and_initialized():\n            torch.distributed.all_reduce(num_masks)\n        num_masks = torch.clamp(num_masks / get_world_size(), min=1).item()\n\n        # Compute all the requested losses\n        losses = {}\n        for loss in self.losses:\n            losses.update(self.get_loss(loss, outputs, targets, indices, num_masks, images_lab_sim, images_lab_sim_nei, images_lab_sim_nei1, images_lab_sim_nei2))\n\n        # In case of auxiliary losses, we repeat this process with the output of each intermediate 
layer.\n        if \"aux_outputs\" in outputs:\n            for i, aux_outputs in enumerate(outputs[\"aux_outputs\"]):\n                indices = self.matcher(aux_outputs, targets)\n                for loss in self.losses:\n                    l_dict = self.get_loss(loss, aux_outputs, targets, indices, num_masks, images_lab_sim, images_lab_sim_nei, images_lab_sim_nei1, images_lab_sim_nei2)\n                    l_dict = {k + f\"_{i}\": v for k, v in l_dict.items()}\n                    losses.update(l_dict)\n\n        return losses\n\n    def __repr__(self):\n        head = \"Criterion \" + self.__class__.__name__\n        body = [\n            \"matcher: {}\".format(self.matcher.__repr__(_repr_indent=8)),\n            \"losses: {}\".format(self.losses),\n            \"weight_dict: {}\".format(self.weight_dict),\n            \"num_classes: {}\".format(self.num_classes),\n            \"eos_coef: {}\".format(self.eos_coef),\n            \"num_points: {}\".format(self.num_points),\n            \"oversample_ratio: {}\".format(self.oversample_ratio),\n            \"importance_sample_ratio: {}\".format(self.importance_sample_ratio),\n        ]\n        _repr_indent = 4\n        lines = [head] + [\" \" * _repr_indent + line for line in body]\n        return \"\\n\".join(lines)\n"
  },
  {
    "path": "mask2former_video/modeling/matcher.py",
    "content": "# Modified by Bowen Cheng from https://github.com/facebookresearch/detr/blob/master/models/matcher.py\n\"\"\"\nModules to compute the matching cost and solve the corresponding LSAP.\n\"\"\"\nimport torch\nimport torch.nn.functional as F\nfrom scipy.optimize import linear_sum_assignment\nfrom torch import nn\nfrom torch.cuda.amp import autocast\n\nfrom detectron2.projects.point_rend.point_features import point_sample\nimport cv2\nimport os\n\n# def visualize_masks(masks, output_dir='masks_new'):\n#     \"\"\"\n#     Visualize binary mask tensor with shape (N, H, W) and save them as PNG images in the output directory.\n#     \"\"\"\n#     os.makedirs(output_dir, exist_ok=True)\n#     masks = masks.flatten(0, 1)\n#     print('masks shape:', masks.shape)\n#     n, h, w = masks.shape\n\n#     for i in range(n):\n#         mask = masks[i].cpu().numpy()\n#         mask = (mask * 255).astype('uint8')\n#         # mask = cv2.cvtColor(mask, cv2.COLOR_GRAY2BGR)\n#         filename = os.path.join(output_dir, f'mask_{i}.png')\n#         cv2.imwrite(filename, mask)\n\ndef masks_to_boxes(masks: torch.Tensor) -> torch.Tensor:\n    \"\"\"\n    Compute the bounding boxes around the provided masks.\n\n    Returns a [N, 4] tensor containing bounding boxes. 
The boxes are in ``(x1, y1, x2, y2)`` format with\n    ``0 <= x1 < x2`` and ``0 <= y1 < y2``.\n\n    Args:\n        masks (Tensor[N, H, W]): masks to transform where N is the number of masks\n            and (H, W) are the spatial dimensions.\n\n    Returns:\n        Tensor[N, 4]: bounding boxes\n    \"\"\"\n    if masks.numel() == 0:\n        return masks\n    \n    n = masks.shape[0]\n    masks = masks.flatten(0, 1)\n    for index, mask in enumerate(masks):\n        y, x = torch.where(mask != 0)\n        if len(x) * len(y) == 0:\n            continue\n        \n        masks[index, torch.min(y):torch.max(y)+1, torch.min(x):torch.max(x)+1] = 1.0\n    \n    masks = masks.view(n, -1, masks.shape[-2], masks.shape[-1])\n    return masks\n\ndef masks_to_boxes_new(masks: torch.Tensor) -> torch.Tensor:\n    \"\"\"\n    Compute binary box masks from the provided masks: each instance's axis-aligned\n    bounding-box region is filled with ones and everything outside it with zeros.\n\n    Args:\n        masks (Tensor[N, T, H, W]): masks to transform, where N is the number of masks,\n            T the number of frames and (H, W) the spatial dimensions.\n\n    Returns:\n        Tensor[N, T, H, W]: binary box masks with the same shape as the input\n    \"\"\"\n    if masks.numel() == 0:\n        return masks\n    \n    n, _, h, w = masks.shape\n    masks = masks.flatten(0, 1)\n    y = torch.arange(0, h, dtype=torch.float).to(masks.device)\n    x = torch.arange(0, w, dtype=torch.float).to(masks.device)\n    y, x = torch.meshgrid(y, x)\n\n    x_mask = (masks * x.unsqueeze(0))\n    x_max = x_mask.flatten(1).max(-1)[0] + 1\n    x_min = x_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1)[0]\n\n    y_mask = (masks * y.unsqueeze(0))\n    y_max = y_mask.flatten(1).max(-1)[0] + 1\n    y_min = y_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1)[0]\n\n    boxes = torch.stack([x_min, y_min, x_max, y_max], 1)\n    # print('boxes shape:', boxes.shape)\n\n    mem_mask = torch.zeros_like(masks)\n\n 
   hMask = torch.logical_or(torch.arange(h).unsqueeze(0).to(boxes)<boxes[:, 1, None], torch.arange(h).unsqueeze(0).to(boxes)>=boxes[:, 3, None])  # rows outside the box\n    wMask = torch.logical_or(torch.arange(w).unsqueeze(0).to(boxes)<boxes[:, 0, None], torch.arange(w).unsqueeze(0).to(boxes)>=boxes[:, 2, None])  # cols outside the box\n    \n    mem_mask = torch.logical_or(hMask.unsqueeze(2), wMask.unsqueeze(1)).float()\n    mem_mask = 1.0 - mem_mask.view(n, -1, masks.shape[-2], masks.shape[-1])\n    return mem_mask\n\ndef batch_dice_loss(inputs: torch.Tensor, targets: torch.Tensor):\n    \"\"\"\n    Compute the DICE loss, similar to generalized IOU for masks\n    Args:\n        inputs: A float tensor of arbitrary shape.\n                The predictions for each example.\n        targets: A float tensor with the same shape as inputs. Stores the binary\n                 classification label for each element in inputs\n                (0 for the negative class and 1 for the positive class).\n    \"\"\"\n    inputs = inputs.sigmoid()\n    inputs = inputs.flatten(1)\n    numerator = 2 * torch.einsum(\"nc,mc->nm\", inputs, targets)\n    denominator = inputs.sum(-1)[:, None] + targets.sum(-1)[None, :]\n    loss = 1 - (numerator + 1) / (denominator + 1)\n    return loss\n\ndef batch_dice_loss_nosig(inputs: torch.Tensor, targets: torch.Tensor):\n    \"\"\"\n    Compute the DICE loss on inputs that are already probabilities (no sigmoid applied)\n    Args:\n        inputs: A float tensor of arbitrary shape.\n                The predictions for each example.\n        targets: A float tensor with the same shape as inputs. 
Stores the binary\n                 classification label for each element in inputs\n                (0 for the negative class and 1 for the positive class).\n    \"\"\"\n    # inputs = inputs.sigmoid()\n    inputs = inputs.flatten(1)\n    numerator = 2 * torch.einsum(\"nc,mc->nm\", inputs, targets)\n    denominator = inputs.sum(-1)[:, None] + targets.sum(-1)[None, :]\n    loss = 1 - (numerator + 1) / (denominator + 1)\n    return loss\n\nbatch_dice_loss_jit = torch.jit.script(\n    batch_dice_loss\n)  # type: torch.jit.ScriptModule\n\nbatch_dice_loss_jit_nosig = torch.jit.script(\n    batch_dice_loss_nosig\n)  # type: torch.jit.ScriptModule\n\ndef batch_sigmoid_ce_loss(inputs: torch.Tensor, targets: torch.Tensor):\n    \"\"\"\n    Args:\n        inputs: A float tensor of arbitrary shape.\n                The predictions for each example.\n        targets: A float tensor with the same shape as inputs. Stores the binary\n                 classification label for each element in inputs\n                (0 for the negative class and 1 for the positive class).\n    Returns:\n        Loss tensor\n    \"\"\"\n    hw = inputs.shape[1]\n\n    pos = F.binary_cross_entropy_with_logits(\n        inputs, torch.ones_like(inputs), reduction=\"none\"\n    )\n    neg = F.binary_cross_entropy_with_logits(\n        inputs, torch.zeros_like(inputs), reduction=\"none\"\n    )\n\n    loss = torch.einsum(\"nc,mc->nm\", pos, targets) + torch.einsum(\n        \"nc,mc->nm\", neg, (1 - targets)\n    )\n    return loss / hw\n\ndef batch_sigmoid_ce_loss_nosig(inputs: torch.Tensor, targets: torch.Tensor):\n    \"\"\"\n    Args:\n        inputs: A float tensor of arbitrary shape.\n                The predictions for each example.\n        targets: A float tensor with the same shape as inputs. 
Stores the binary\n                 classification label for each element in inputs\n                (0 for the negative class and 1 for the positive class).\n    Returns:\n        Loss tensor\n    \"\"\"\n    hw = inputs.shape[1]\n\n    pos = F.binary_cross_entropy(\n        inputs, torch.ones_like(inputs), reduction=\"none\"\n    )\n    neg = F.binary_cross_entropy(\n        inputs, torch.zeros_like(inputs), reduction=\"none\"\n    )\n\n    loss = torch.einsum(\"nc,mc->nm\", pos, targets) + torch.einsum(\n        \"nc,mc->nm\", neg, (1 - targets)\n    )\n    return loss / hw\n\nbatch_sigmoid_ce_loss_jit = torch.jit.script(\n    batch_sigmoid_ce_loss\n)  # type: torch.jit.ScriptModule\n\nbatch_sigmoid_ce_loss_jit_nosig = torch.jit.script(\n    batch_sigmoid_ce_loss_nosig\n)  # type: torch.jit.ScriptModule\n\nclass VideoHungarianMatcher(nn.Module):\n    \"\"\"This class computes an assignment between the targets and the predictions of the network\n\n    For efficiency reasons, the targets don't include the no_object. Because of this, in general,\n    there are more predictions than targets. 
In this case, we do a 1-to-1 matching of the best predictions,\n    while the others are unmatched (and thus treated as non-objects).\n    \"\"\"\n\n    def __init__(self, cost_class: float = 1, cost_mask: float = 1, cost_dice: float = 1, num_points: int = 0):\n        \"\"\"Creates the matcher\n\n        Params:\n            cost_class: This is the relative weight of the classification error in the matching cost\n            cost_mask: This is the relative weight of the focal loss of the binary mask in the matching cost\n            cost_dice: This is the relative weight of the dice loss of the binary mask in the matching cost\n        \"\"\"\n        super().__init__()\n        self.cost_class = cost_class\n        self.cost_mask = cost_mask\n        self.cost_dice = cost_dice\n\n        assert cost_class != 0 or cost_mask != 0 or cost_dice != 0, \"all costs can't be 0\"\n\n        self.num_points = num_points\n\n    @torch.no_grad()\n    def memory_efficient_forward(self, outputs, targets):\n        \"\"\"More memory-friendly matching\"\"\"\n        bs, num_queries = outputs[\"pred_logits\"].shape[:2]\n\n        indices = []\n\n        # Iterate over the batch\n        for b in range(bs):\n\n            out_prob = outputs[\"pred_logits\"][b].softmax(-1)  # [num_queries, num_classes]\n            tgt_ids = targets[b][\"labels\"]\n\n            # Compute the classification cost. 
Contrary to the loss, we don't use the NLL,\n            # but approximate it in 1 - proba[target class].\n            # The 1 is a constant that doesn't change the matching, it can be omitted.\n            cost_class = -out_prob[:, tgt_ids]\n\n            out_mask = outputs[\"pred_masks\"][b]  # [num_queries, T, H_pred, W_pred]\n            is_ytvis = (out_mask.shape[1] == 3)  # a temporal dim of 3 frames marks YTVIS video inputs\n\n            if is_ytvis:\n                # convert thresholded predictions to box masks for matching\n                out_mask = masks_to_boxes_new((out_mask.sigmoid() > 0.5).float()).float()\n                \n            # gt masks are already padded when preparing target\n            tgt_mask = targets[b][\"masks\"].to(out_mask)  # [num_gts, T, H_pred, W_pred]\n            \n            if is_ytvis:\n                tgt_mask = masks_to_boxes(tgt_mask).float()  # note: changing this also affects the criterion\n\n            # all masks share the same set of points for efficient matching!\n            point_coords = torch.rand(1, self.num_points, 2, device=out_mask.device)\n            # get gt labels\n            tgt_mask = point_sample(\n                tgt_mask,\n                point_coords.repeat(tgt_mask.shape[0], 1, 1),\n                align_corners=False,\n            ).flatten(1)\n\n            out_mask = point_sample(\n                out_mask,\n                point_coords.repeat(out_mask.shape[0], 1, 1),\n                align_corners=False,\n            ).flatten(1)\n\n            with autocast(enabled=False):\n                out_mask = out_mask.float()\n                tgt_mask = tgt_mask.float()\n                \n                # Compute the dice loss between masks\n                if not is_ytvis:\n                    cost_dice = batch_dice_loss_jit(out_mask, tgt_mask)\n                    cost_mask = batch_sigmoid_ce_loss_jit(out_mask, tgt_mask)\n      
          else:\n                    cost_dice_nosig = batch_dice_loss_jit_nosig(out_mask, tgt_mask)\n\n            # Final cost matrix\n            if not is_ytvis:\n                C = (\n                    self.cost_mask * cost_mask\n                    + self.cost_class * cost_class\n                    + self.cost_dice * cost_dice\n                )\n            else:\n                C = (\n                    self.cost_class * cost_class\n                    + self.cost_dice * cost_dice_nosig\n                )\n\n            C = C.reshape(num_queries, -1).cpu()\n\n            indices.append(linear_sum_assignment(C))\n\n        return [\n            (torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64))\n            for i, j in indices\n        ]\n\n    @torch.no_grad()\n    def forward(self, outputs, targets):\n        \"\"\"Performs the matching\n\n        Params:\n            outputs: This is a dict that contains at least these entries:\n                 \"pred_logits\": Tensor of dim [batch_size, num_queries, num_classes] with the classification logits\n                 \"pred_masks\": Tensor of dim [batch_size, num_queries, H_pred, W_pred] with the predicted masks\n\n            targets: This is a list of targets (len(targets) = batch_size), where each target is a dict containing:\n                 \"labels\": Tensor of dim [num_target_boxes] (where num_target_boxes is the number of ground-truth\n                           objects in the target) containing the class labels\n                 \"masks\": Tensor of dim [num_target_boxes, H_gt, W_gt] containing the target masks\n\n        Returns:\n            A list of size batch_size, containing tuples of (index_i, index_j) where:\n                - index_i is the indices of the selected predictions (in order)\n                - index_j is the indices of the corresponding selected targets (in order)\n            For each batch element, it holds:\n                len(index_i) = 
len(index_j) = min(num_queries, num_target_boxes)\n        \"\"\"\n        return self.memory_efficient_forward(outputs, targets)\n\n    def __repr__(self, _repr_indent=4):\n        head = \"Matcher \" + self.__class__.__name__\n        body = [\n            \"cost_class: {}\".format(self.cost_class),\n            \"cost_mask: {}\".format(self.cost_mask),\n            \"cost_dice: {}\".format(self.cost_dice),\n        ]\n        lines = [head] + [\" \" * _repr_indent + line for line in body]\n        return \"\\n\".join(lines)\n"
  },
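The matcher above relies on coordinate grids to extract boxes without a Python loop: multiplying the binary mask by an `arange` grid and taking `max` yields the far edges, while `masked_fill` with a large constant before `min` yields the near edges. A minimal standalone sketch of that trick on `[N, H, W]` masks (the function name and toy shapes are illustrative, not from the repo):

```python
import torch

def boxes_from_masks(masks: torch.Tensor) -> torch.Tensor:
    """(x1, y1, x2, y2) boxes from binary masks [N, H, W], fully vectorized."""
    n, h, w = masks.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    m = masks.float()
    x_coords = m * xs  # zero outside the mask
    y_coords = m * ys
    x_max = x_coords.flatten(1).max(-1).values + 1  # exclusive right edge
    y_max = y_coords.flatten(1).max(-1).values + 1  # exclusive bottom edge
    # push background to a huge value so min() picks the smallest foreground coord
    x_min = x_coords.masked_fill(~masks.bool(), 1e8).flatten(1).min(-1).values
    y_min = y_coords.masked_fill(~masks.bool(), 1e8).flatten(1).min(-1).values
    return torch.stack([x_min, y_min, x_max, y_max], dim=1)

mask = torch.zeros(1, 6, 6, dtype=torch.bool)
mask[0, 2:4, 1:5] = True
print(boxes_from_masks(mask))  # tensor([[1., 2., 5., 4.]])
```

These are the same extents that `masks_to_boxes_new` rasterizes back into filled box masks for box-based matching.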
  {
    "path": "mask2former_video/modeling/transformer_decoder/__init__.py",
    "content": "from .video_mask2former_transformer_decoder import VideoMultiScaleMaskedTransformerDecoder\n"
  },
  {
    "path": "mask2former_video/modeling/transformer_decoder/position_encoding.py",
    "content": "# # Modified by Bowen Cheng from: https://github.com/facebookresearch/detr/blob/master/models/position_encoding.py\n\"\"\"\nVarious positional encodings for the transformer.\n\"\"\"\nimport math\n\nimport torch\nfrom torch import nn\n\n\nclass PositionEmbeddingSine3D(nn.Module):\n    \"\"\"\n    This is a more standard version of the position embedding, very similar to the one\n    used by the Attention is all you need paper, generalized to work on images.\n    \"\"\"\n\n    def __init__(self, num_pos_feats=64, temperature=10000, normalize=False, scale=None):\n        super().__init__()\n        self.num_pos_feats = num_pos_feats\n        self.temperature = temperature\n        self.normalize = normalize\n        if scale is not None and normalize is False:\n            raise ValueError(\"normalize should be True if scale is passed\")\n        if scale is None:\n            scale = 2 * math.pi\n        self.scale = scale\n\n    def forward(self, x, mask=None):\n        # b, t, c, h, w\n        assert x.dim() == 5, f\"{x.shape} should be a 5-dimensional Tensor, got {x.dim()}-dimensional Tensor instead\"\n        if mask is None:\n            mask = torch.zeros((x.size(0), x.size(1), x.size(3), x.size(4)), device=x.device, dtype=torch.bool)\n        not_mask = ~mask\n        z_embed = not_mask.cumsum(1, dtype=torch.float32)\n        y_embed = not_mask.cumsum(2, dtype=torch.float32)\n        x_embed = not_mask.cumsum(3, dtype=torch.float32)\n        if self.normalize:\n            eps = 1e-6\n            z_embed = z_embed / (z_embed[:, -1:, :, :] + eps) * self.scale\n            y_embed = y_embed / (y_embed[:, :, -1:, :] + eps) * self.scale\n            x_embed = x_embed / (x_embed[:, :, :, -1:] + eps) * self.scale\n\n        dim_t = torch.arange(self.num_pos_feats, dtype=torch.float32, device=x.device)\n        dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)\n\n        dim_t_z = torch.arange((self.num_pos_feats * 2), 
dtype=torch.float32, device=x.device)\n        dim_t_z = self.temperature ** (2 * (dim_t_z // 2) / (self.num_pos_feats * 2))\n\n        pos_x = x_embed[:, :, :, :, None] / dim_t\n        pos_y = y_embed[:, :, :, :, None] / dim_t\n        pos_z = z_embed[:, :, :, :, None] / dim_t_z\n        pos_x = torch.stack((pos_x[:, :, :, :, 0::2].sin(), pos_x[:, :, :, :, 1::2].cos()), dim=5).flatten(4)\n        pos_y = torch.stack((pos_y[:, :, :, :, 0::2].sin(), pos_y[:, :, :, :, 1::2].cos()), dim=5).flatten(4)\n        pos_z = torch.stack((pos_z[:, :, :, :, 0::2].sin(), pos_z[:, :, :, :, 1::2].cos()), dim=5).flatten(4)\n        pos = (torch.cat((pos_y, pos_x), dim=4) + pos_z).permute(0, 1, 4, 2, 3)  # b, t, c, h, w\n        return pos\n"
  },
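`PositionEmbeddingSine3D` follows the "Attention is All You Need" recipe: frequencies come in pairs via `temperature ** (2 * (dim_t // 2) / num_feats)`, with sin on even channels and cos on odd channels interleaved. A minimal 1-D sketch of the same construction (a hypothetical helper, for illustration only):

```python
import torch

def sine_embedding_1d(length: int, num_feats: int, temperature: float = 10000.0) -> torch.Tensor:
    """Fixed sinusoidal embedding of shape [length, num_feats]; num_feats must be even."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)   # [L, 1]
    dim_t = torch.arange(num_feats, dtype=torch.float32)
    dim_t = temperature ** (2 * (dim_t // 2) / num_feats)          # channels come in frequency pairs
    angles = pos / dim_t                                           # [L, num_feats]
    # interleave: even channels get sin, odd channels get cos, as in the module above
    return torch.stack((angles[:, 0::2].sin(), angles[:, 1::2].cos()), dim=2).flatten(1)

emb = sine_embedding_1d(8, 64)
print(emb.shape)  # torch.Size([8, 64])
```

The 3-D module applies this same construction independently along t, y, and x (with a doubled feature count for the temporal term) and sums the temporal embedding onto the concatenated spatial one.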
  {
    "path": "mask2former_video/modeling/transformer_decoder/video_mask2former_transformer_decoder.py",
    "content": "# Modified by Bowen Cheng from: https://github.com/facebookresearch/detr/blob/master/models/detr.py\nimport logging\nimport fvcore.nn.weight_init as weight_init\nfrom typing import Optional\nimport torch\nfrom torch import nn, Tensor\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.layers import Conv2d\n\nfrom mask2former.modeling.transformer_decoder.maskformer_transformer_decoder import TRANSFORMER_DECODER_REGISTRY\n\nfrom .position_encoding import PositionEmbeddingSine3D\n\n\nclass SelfAttentionLayer(nn.Module):\n\n    def __init__(self, d_model, nhead, dropout=0.0,\n                 activation=\"relu\", normalize_before=False):\n        super().__init__()\n        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)\n\n        self.norm = nn.LayerNorm(d_model)\n        self.dropout = nn.Dropout(dropout)\n\n        self.activation = _get_activation_fn(activation)\n        self.normalize_before = normalize_before\n\n        self._reset_parameters()\n    \n    def _reset_parameters(self):\n        for p in self.parameters():\n            if p.dim() > 1:\n                nn.init.xavier_uniform_(p)\n\n    def with_pos_embed(self, tensor, pos: Optional[Tensor]):\n        return tensor if pos is None else tensor + pos\n\n    def forward_post(self, tgt,\n                     tgt_mask: Optional[Tensor] = None,\n                     tgt_key_padding_mask: Optional[Tensor] = None,\n                     query_pos: Optional[Tensor] = None):\n        q = k = self.with_pos_embed(tgt, query_pos)\n        tgt2 = self.self_attn(q, k, value=tgt, attn_mask=tgt_mask,\n                              key_padding_mask=tgt_key_padding_mask)[0]\n        tgt = tgt + self.dropout(tgt2)\n        tgt = self.norm(tgt)\n\n        return tgt\n\n    def forward_pre(self, tgt,\n                    tgt_mask: Optional[Tensor] = None,\n                    tgt_key_padding_mask: Optional[Tensor] = None,\n              
      query_pos: Optional[Tensor] = None):\n        tgt2 = self.norm(tgt)\n        q = k = self.with_pos_embed(tgt2, query_pos)\n        tgt2 = self.self_attn(q, k, value=tgt2, attn_mask=tgt_mask,\n                              key_padding_mask=tgt_key_padding_mask)[0]\n        tgt = tgt + self.dropout(tgt2)\n        \n        return tgt\n\n    def forward(self, tgt,\n                tgt_mask: Optional[Tensor] = None,\n                tgt_key_padding_mask: Optional[Tensor] = None,\n                query_pos: Optional[Tensor] = None):\n        if self.normalize_before:\n            return self.forward_pre(tgt, tgt_mask,\n                                    tgt_key_padding_mask, query_pos)\n        return self.forward_post(tgt, tgt_mask,\n                                 tgt_key_padding_mask, query_pos)\n\n\nclass CrossAttentionLayer(nn.Module):\n\n    def __init__(self, d_model, nhead, dropout=0.0,\n                 activation=\"relu\", normalize_before=False):\n        super().__init__()\n        self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)\n\n        self.norm = nn.LayerNorm(d_model)\n        self.dropout = nn.Dropout(dropout)\n\n        self.activation = _get_activation_fn(activation)\n        self.normalize_before = normalize_before\n\n        self._reset_parameters()\n    \n    def _reset_parameters(self):\n        for p in self.parameters():\n            if p.dim() > 1:\n                nn.init.xavier_uniform_(p)\n\n    def with_pos_embed(self, tensor, pos: Optional[Tensor]):\n        return tensor if pos is None else tensor + pos\n\n    def forward_post(self, tgt, memory,\n                     memory_mask: Optional[Tensor] = None,\n                     memory_key_padding_mask: Optional[Tensor] = None,\n                     pos: Optional[Tensor] = None,\n                     query_pos: Optional[Tensor] = None):\n        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt, query_pos),\n                                   
key=self.with_pos_embed(memory, pos),\n                                   value=memory, attn_mask=memory_mask,\n                                   key_padding_mask=memory_key_padding_mask)[0]\n        tgt = tgt + self.dropout(tgt2)\n        tgt = self.norm(tgt)\n        \n        return tgt\n\n    def forward_pre(self, tgt, memory,\n                    memory_mask: Optional[Tensor] = None,\n                    memory_key_padding_mask: Optional[Tensor] = None,\n                    pos: Optional[Tensor] = None,\n                    query_pos: Optional[Tensor] = None):\n        tgt2 = self.norm(tgt)\n        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt2, query_pos),\n                                   key=self.with_pos_embed(memory, pos),\n                                   value=memory, attn_mask=memory_mask,\n                                   key_padding_mask=memory_key_padding_mask)[0]\n        tgt = tgt + self.dropout(tgt2)\n\n        return tgt\n\n    def forward(self, tgt, memory,\n                memory_mask: Optional[Tensor] = None,\n                memory_key_padding_mask: Optional[Tensor] = None,\n                pos: Optional[Tensor] = None,\n                query_pos: Optional[Tensor] = None):\n        if self.normalize_before:\n            return self.forward_pre(tgt, memory, memory_mask,\n                                    memory_key_padding_mask, pos, query_pos)\n        return self.forward_post(tgt, memory, memory_mask,\n                                 memory_key_padding_mask, pos, query_pos)\n\n\nclass FFNLayer(nn.Module):\n\n    def __init__(self, d_model, dim_feedforward=2048, dropout=0.0,\n                 activation=\"relu\", normalize_before=False):\n        super().__init__()\n        # Implementation of Feedforward model\n        self.linear1 = nn.Linear(d_model, dim_feedforward)\n        self.dropout = nn.Dropout(dropout)\n        self.linear2 = nn.Linear(dim_feedforward, d_model)\n\n        self.norm = nn.LayerNorm(d_model)\n\n 
       self.activation = _get_activation_fn(activation)\n        self.normalize_before = normalize_before\n\n        self._reset_parameters()\n    \n    def _reset_parameters(self):\n        for p in self.parameters():\n            if p.dim() > 1:\n                nn.init.xavier_uniform_(p)\n\n    def with_pos_embed(self, tensor, pos: Optional[Tensor]):\n        return tensor if pos is None else tensor + pos\n\n    def forward_post(self, tgt):\n        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))\n        tgt = tgt + self.dropout(tgt2)\n        tgt = self.norm(tgt)\n        return tgt\n\n    def forward_pre(self, tgt):\n        tgt2 = self.norm(tgt)\n        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt2))))\n        tgt = tgt + self.dropout(tgt2)\n        return tgt\n\n    def forward(self, tgt):\n        if self.normalize_before:\n            return self.forward_pre(tgt)\n        return self.forward_post(tgt)\n\n\ndef _get_activation_fn(activation):\n    \"\"\"Return an activation function given a string\"\"\"\n    if activation == \"relu\":\n        return F.relu\n    if activation == \"gelu\":\n        return F.gelu\n    if activation == \"glu\":\n        return F.glu\n    raise RuntimeError(F\"activation should be relu/gelu, not {activation}.\")\n\n\nclass MLP(nn.Module):\n    \"\"\" Very simple multi-layer perceptron (also called FFN)\"\"\"\n\n    def __init__(self, input_dim, hidden_dim, output_dim, num_layers):\n        super().__init__()\n        self.num_layers = num_layers\n        h = [hidden_dim] * (num_layers - 1)\n        self.layers = nn.ModuleList(nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim]))\n\n    def forward(self, x):\n        for i, layer in enumerate(self.layers):\n            x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x)\n        return x\n\n\n@TRANSFORMER_DECODER_REGISTRY.register()\nclass VideoMultiScaleMaskedTransformerDecoder(nn.Module):\n\n    _version = 
2\n\n    def _load_from_state_dict(\n        self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs\n    ):\n        version = local_metadata.get(\"version\", None)\n        if version is None or version < 2:\n            # Do not warn if train from scratch\n            scratch = True\n            logger = logging.getLogger(__name__)\n            for k in list(state_dict.keys()):\n                newk = k\n                if \"static_query\" in k:\n                    newk = k.replace(\"static_query\", \"query_feat\")\n                if newk != k:\n                    state_dict[newk] = state_dict[k]\n                    del state_dict[k]\n                    scratch = False\n\n            if not scratch:\n                logger.warning(\n                    f\"Weight format of {self.__class__.__name__} have changed! \"\n                    \"Please upgrade your models. Applying automatic conversion now ...\"\n                )\n\n    @configurable\n    def __init__(\n        self,\n        in_channels,\n        mask_classification=True,\n        *,\n        num_classes: int,\n        hidden_dim: int,\n        num_queries: int,\n        nheads: int,\n        dim_feedforward: int,\n        dec_layers: int,\n        pre_norm: bool,\n        mask_dim: int,\n        enforce_input_project: bool,\n        # video related\n        num_frames,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            in_channels: channels of the input features\n            mask_classification: whether to add mask classifier or not\n            num_classes: number of classes\n            hidden_dim: Transformer feature dimension\n            num_queries: number of queries\n            nheads: number of heads\n            dim_feedforward: feature dimension in feedforward network\n            enc_layers: number of Transformer encoder layers\n            dec_layers: number of Transformer decoder layers\n            
pre_norm: whether to use pre-LayerNorm or not\n            mask_dim: mask feature dimension\n            enforce_input_project: add input project 1x1 conv even if input\n                channels and hidden dim is identical\n        \"\"\"\n        super().__init__()\n\n        assert mask_classification, \"Only support mask classification model\"\n        self.mask_classification = mask_classification\n\n        self.num_frames = num_frames\n\n        # positional encoding\n        N_steps = hidden_dim // 2\n        self.pe_layer = PositionEmbeddingSine3D(N_steps, normalize=True)\n        \n        # define Transformer decoder here\n        self.num_heads = nheads\n        self.num_layers = dec_layers\n        self.transformer_self_attention_layers = nn.ModuleList()\n        self.transformer_cross_attention_layers = nn.ModuleList()\n        self.transformer_ffn_layers = nn.ModuleList()\n\n        for _ in range(self.num_layers):\n            self.transformer_self_attention_layers.append(\n                SelfAttentionLayer(\n                    d_model=hidden_dim,\n                    nhead=nheads,\n                    dropout=0.0,\n                    normalize_before=pre_norm,\n                )\n            )\n\n            self.transformer_cross_attention_layers.append(\n                CrossAttentionLayer(\n                    d_model=hidden_dim,\n                    nhead=nheads,\n                    dropout=0.0,\n                    normalize_before=pre_norm,\n                )\n            )\n\n            self.transformer_ffn_layers.append(\n                FFNLayer(\n                    d_model=hidden_dim,\n                    dim_feedforward=dim_feedforward,\n                    dropout=0.0,\n                    normalize_before=pre_norm,\n                )\n            )\n\n        self.decoder_norm = nn.LayerNorm(hidden_dim)\n\n        self.num_queries = num_queries\n        # learnable query features\n        self.query_feat = 
nn.Embedding(num_queries, hidden_dim)\n        # learnable query p.e.\n        self.query_embed = nn.Embedding(num_queries, hidden_dim)\n\n        # level embedding (we always use 3 scales)\n        self.num_feature_levels = 3\n        self.level_embed = nn.Embedding(self.num_feature_levels, hidden_dim)\n        self.input_proj = nn.ModuleList()\n        for _ in range(self.num_feature_levels):\n            if in_channels != hidden_dim or enforce_input_project:\n                self.input_proj.append(Conv2d(in_channels, hidden_dim, kernel_size=1))\n                weight_init.c2_xavier_fill(self.input_proj[-1])\n            else:\n                self.input_proj.append(nn.Sequential())\n\n        # output FFNs\n        if self.mask_classification:\n            self.class_embed = nn.Linear(hidden_dim, num_classes + 1)\n        self.mask_embed = MLP(hidden_dim, hidden_dim, mask_dim, 3)\n\n    @classmethod\n    def from_config(cls, cfg, in_channels, mask_classification):\n        ret = {}\n        ret[\"in_channels\"] = in_channels\n        ret[\"mask_classification\"] = mask_classification\n        \n        ret[\"num_classes\"] = cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES\n        ret[\"hidden_dim\"] = cfg.MODEL.MASK_FORMER.HIDDEN_DIM\n        ret[\"num_queries\"] = cfg.MODEL.MASK_FORMER.NUM_OBJECT_QUERIES\n        # Transformer parameters:\n        ret[\"nheads\"] = cfg.MODEL.MASK_FORMER.NHEADS\n        ret[\"dim_feedforward\"] = cfg.MODEL.MASK_FORMER.DIM_FEEDFORWARD\n\n        # NOTE: because we add learnable query features which requires supervision,\n        # we add minus 1 to decoder layers to be consistent with our loss\n        # implementation: that is, number of auxiliary losses is always\n        # equal to number of decoder layers. 
With learnable query features, the number of\n        # auxiliary losses equals number of decoder layers plus 1.\n        assert cfg.MODEL.MASK_FORMER.DEC_LAYERS >= 1\n        ret[\"dec_layers\"] = cfg.MODEL.MASK_FORMER.DEC_LAYERS - 1\n        ret[\"pre_norm\"] = cfg.MODEL.MASK_FORMER.PRE_NORM\n        ret[\"enforce_input_project\"] = cfg.MODEL.MASK_FORMER.ENFORCE_INPUT_PROJ\n\n        ret[\"mask_dim\"] = cfg.MODEL.SEM_SEG_HEAD.MASK_DIM\n\n        ret[\"num_frames\"] = cfg.INPUT.SAMPLING_FRAME_NUM\n\n        return ret\n\n    def forward(self, x, mask_features, mask = None):\n        bt, c_m, h_m, w_m = mask_features.shape\n        if bt == 6 or bt == 3:  # bt == 3 occurs for Swin-L, which cannot afford batch size 2\n            bs = bt // self.num_frames if self.training else 1\n        else:\n            bs = bt // 4 if self.training else 1  # fallback: assumes 4 frames per clip\n            \n        t = bt // bs\n        mask_features = mask_features.view(bs, t, c_m, h_m, w_m)\n\n        # x is a list of multi-scale features\n        assert len(x) == self.num_feature_levels\n        src = []\n        pos = []\n        size_list = []\n\n        # disable mask, it does not affect performance\n        del mask\n\n        for i in range(self.num_feature_levels):\n            size_list.append(x[i].shape[-2:])\n            pos.append(self.pe_layer(x[i].view(bs, t, -1, size_list[-1][0], size_list[-1][1]), None).flatten(3))\n            src.append(self.input_proj[i](x[i]).flatten(2) + self.level_embed.weight[i][None, :, None])\n\n            # NTxCxHW => NxTxCxHW => (TxHW)xNxC\n            _, c, hw = src[-1].shape\n            pos[-1] = pos[-1].view(bs, t, c, hw).permute(1, 3, 0, 2).flatten(0, 1)\n            src[-1] = src[-1].view(bs, t, c, hw).permute(1, 3, 0, 2).flatten(0, 1)\n\n        # QxNxC\n        query_embed = self.query_embed.weight.unsqueeze(1).repeat(1, bs, 1)\n        output = self.query_feat.weight.unsqueeze(1).repeat(1, bs, 1)\n\n        predictions_class = []\n        predictions_mask = []\n\n    
    # prediction heads on learnable query features\n        outputs_class, outputs_mask, attn_mask = self.forward_prediction_heads(output, mask_features, attn_mask_target_size=size_list[0])\n        predictions_class.append(outputs_class)\n        predictions_mask.append(outputs_mask)\n\n        for i in range(self.num_layers):\n            level_index = i % self.num_feature_levels\n            attn_mask[torch.where(attn_mask.sum(-1) == attn_mask.shape[-1])] = False\n            # attention: cross-attention first\n            output = self.transformer_cross_attention_layers[i](\n                output, src[level_index],\n                memory_mask=attn_mask,\n                memory_key_padding_mask=None,  # here we do not apply masking on padded region\n                pos=pos[level_index], query_pos=query_embed\n            )\n\n            output = self.transformer_self_attention_layers[i](\n                output, tgt_mask=None,\n                tgt_key_padding_mask=None,\n                query_pos=query_embed\n            )\n            \n            # FFN\n            output = self.transformer_ffn_layers[i](\n                output\n            )\n\n            outputs_class, outputs_mask, attn_mask = self.forward_prediction_heads(output, mask_features, attn_mask_target_size=size_list[(i + 1) % self.num_feature_levels])\n            predictions_class.append(outputs_class)\n            predictions_mask.append(outputs_mask)\n\n        assert len(predictions_class) == self.num_layers + 1\n\n        out = {\n            'pred_logits': predictions_class[-1],\n            'pred_masks': predictions_mask[-1],\n            'aux_outputs': self._set_aux_loss(\n                predictions_class if self.mask_classification else None, predictions_mask\n            )\n        }\n        return out\n\n    def forward_prediction_heads(self, output, mask_features, attn_mask_target_size):\n        decoder_output = self.decoder_norm(output)\n        decoder_output = 
decoder_output.transpose(0, 1)\n        outputs_class = self.class_embed(decoder_output)\n        mask_embed = self.mask_embed(decoder_output)\n        outputs_mask = torch.einsum(\"bqc,btchw->bqthw\", mask_embed, mask_features)\n        b, q, t, _, _ = outputs_mask.shape\n\n        # NOTE: prediction is of higher-resolution\n        # [B, Q, T, H, W] -> [B, Q, T*H*W] -> [B, h, Q, T*H*W] -> [B*h, Q, T*HW]\n        attn_mask = F.interpolate(outputs_mask.flatten(0, 1), size=attn_mask_target_size, mode=\"bilinear\", align_corners=False).view(\n            b, q, t, attn_mask_target_size[0], attn_mask_target_size[1])\n        # must use bool type\n        # If a BoolTensor is provided, positions with ``True`` are not allowed to attend while ``False`` values will be unchanged.\n        attn_mask = (attn_mask.sigmoid().flatten(2).unsqueeze(1).repeat(1, self.num_heads, 1, 1).flatten(0, 1) < 0.5).bool()\n        attn_mask = attn_mask.detach()\n\n        return outputs_class, outputs_mask, attn_mask\n\n    @torch.jit.unused\n    def _set_aux_loss(self, outputs_class, outputs_seg_masks):\n        # this is a workaround to make torchscript happy, as torchscript\n        # doesn't support dictionary with non-homogeneous values, such\n        # as a dict having both a Tensor and a list.\n        if self.mask_classification:\n            return [\n                {\"pred_logits\": a, \"pred_masks\": b}\n                for a, b in zip(outputs_class[:-1], outputs_seg_masks[:-1])\n            ]\n        else:\n            return [{\"pred_masks\": b} for b in outputs_seg_masks[:-1]]\n"
  },
  {
    "path": "mask2former_video/utils/__init__.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n"
  },
  {
    "path": "mask2former_video/utils/memory.py",
    "content": "\nimport logging\nfrom contextlib import contextmanager\nfrom functools import wraps\nimport torch\nfrom torch.cuda.amp import autocast\n\n__all__ = [\"retry_if_cuda_oom\"]\n\n\n@contextmanager\ndef _ignore_torch_cuda_oom():\n    \"\"\"\n    A context which ignores CUDA OOM exception from pytorch.\n    \"\"\"\n    try:\n        yield\n    except RuntimeError as e:\n        # NOTE: the string may change?\n        if \"CUDA out of memory. \" in str(e):\n            pass\n        else:\n            raise\n\n\ndef retry_if_cuda_oom(func):\n    \"\"\"\n    Makes a function retry itself after encountering\n    pytorch's CUDA OOM error.\n    It will first retry after calling `torch.cuda.empty_cache()`.\n    If that still fails, it will then retry by trying to convert inputs to CPUs.\n    In this case, it expects the function to dispatch to CPU implementation.\n    The return values may become CPU tensors as well and it's user's\n    responsibility to convert it back to CUDA tensor if needed.\n    Args:\n        func: a stateless callable that takes tensor-like objects as arguments\n    Returns:\n        a callable which retries `func` if OOM is encountered.\n    Examples:\n    ::\n        output = retry_if_cuda_oom(some_torch_function)(input1, input2)\n        # output may be on CPU even if inputs are on GPU\n    Note:\n        1. When converting inputs to CPU, it will only look at each argument and check\n           if it has `.device` and `.to` for conversion. Nested structures of tensors\n           are not supported.\n        2. 
Since the function might be called more than once, it has to be\n           stateless.\n    \"\"\"\n\n    def maybe_to_cpu(x):\n        try:\n            like_gpu_tensor = x.device.type == \"cuda\" and hasattr(x, \"to\")\n        except AttributeError:\n            like_gpu_tensor = False\n        if like_gpu_tensor:\n            return x.to(device=\"cpu\").to(torch.float32)\n        else:\n            return x\n\n    @wraps(func)\n    def wrapped(*args, **kwargs):\n        with _ignore_torch_cuda_oom():\n            return func(*args, **kwargs)\n\n        # Clear cache and retry\n        torch.cuda.empty_cache()\n        with _ignore_torch_cuda_oom():\n            return func(*args, **kwargs)\n\n        # Try on CPU. This slows down the code significantly, therefore print a notice.\n        logger = logging.getLogger(__name__)\n        logger.info(\"Attempting to copy inputs to CPU due to CUDA OOM\")\n        new_args = (maybe_to_cpu(x) for x in args)\n        new_kwargs = {k: maybe_to_cpu(v) for k, v in kwargs.items()}\n        with autocast(enabled=False):\n            return func(*new_args, **new_kwargs)\n\n    return wrapped\n"
  },
  {
    "path": "mask2former_video/video_maskformer_model.py",
    "content": "import logging\nimport math\nfrom typing import Tuple\n\nimport torch\nfrom torch import nn\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.data import MetadataCatalog\nfrom detectron2.modeling import META_ARCH_REGISTRY, build_backbone, build_sem_seg_head\nfrom detectron2.modeling.backbone import Backbone\nfrom detectron2.modeling.postprocessing import sem_seg_postprocess\nfrom detectron2.structures import Boxes, ImageList, Instances, BitMasks\n\nfrom .modeling.criterion import VideoSetCriterion\nfrom .modeling.matcher import VideoHungarianMatcher\nfrom .utils.memory import retry_if_cuda_oom\n\nfrom skimage import color\nimport cv2\nimport numpy as np\n\ndef unfold_wo_center(x, kernel_size, dilation):\n    assert x.dim() == 4\n    assert kernel_size % 2 == 1\n\n    # using SAME padding\n    padding = (kernel_size + (dilation - 1) * (kernel_size - 1)) // 2\n    unfolded_x = F.unfold(\n        x, kernel_size=kernel_size,\n        padding=padding,\n        dilation=dilation\n    )\n\n    unfolded_x = unfolded_x.reshape(\n        x.size(0), x.size(1), -1, x.size(2), x.size(3)\n    )\n\n    # remove the center pixels\n    size = kernel_size ** 2\n    unfolded_x = torch.cat((\n        unfolded_x[:, :, :size // 2],\n        unfolded_x[:, :, size // 2 + 1:]\n    ), dim=2)\n\n    return unfolded_x\n\ndef unfold_w_center(x, kernel_size, dilation):\n    assert x.dim() == 4\n    assert kernel_size % 2 == 1\n\n    # using SAME padding\n    padding = (kernel_size + (dilation - 1) * (kernel_size - 1)) // 2\n    unfolded_x = F.unfold(\n        x, kernel_size=kernel_size,\n        padding=padding,\n        dilation=dilation\n    )\n\n    unfolded_x = unfolded_x.reshape(\n        x.size(0), x.size(1), -1, x.size(2), x.size(3)\n    )\n\n\n    return unfolded_x\n\ndef get_images_color_similarity(images, kernel_size, dilation):\n    assert images.dim() == 4\n    assert images.size(0) == 1\n\n    unfolded_images = 
unfold_wo_center(\n        images, kernel_size=kernel_size, dilation=dilation\n    )\n\n    diff = images[:, :, None] - unfolded_images\n    similarity = torch.exp(-torch.norm(diff, dim=1) * 0.5)\n    return similarity\n\ndef get_neighbor_images_color_similarity(images, images_neighbor, kernel_size, dilation):\n    assert images.dim() == 4\n    assert images.size(0) == 1\n\n    unfolded_images = unfold_w_center(\n        images, kernel_size=kernel_size, dilation=dilation\n    )\n\n    diff = images_neighbor[:, :, None] - unfolded_images\n    similarity = torch.exp(-torch.norm(diff, dim=1) * 0.5)\n\n    return similarity\n\n\ndef get_neighbor_images_patch_color_similarity(images, images_neighbor, kernel_size, dilation):\n    assert images.dim() == 4\n    assert images.size(0) == 1\n\n    # NOTE: the `dilation` argument is intentionally ignored here; patches are extracted densely,\n    # and dilation is applied by the pixel-level similarity call below\n    unfolded_images = unfold_w_center(\n        images, kernel_size=kernel_size, dilation=1\n    )\n    unfolded_images_neighbor = unfold_w_center(\n        images_neighbor, kernel_size=kernel_size, dilation=1\n    )\n    unfolded_images = unfolded_images.flatten(1, 2)\n    unfolded_images_neighbor = unfolded_images_neighbor.flatten(1, 2)\n    similarity = get_neighbor_images_color_similarity(unfolded_images, unfolded_images_neighbor, 3, 3)\n\n    return similarity\n\nlogger = logging.getLogger(__name__)\n\n\n@META_ARCH_REGISTRY.register()\nclass VideoMaskFormer(nn.Module):\n    \"\"\"\n    Main class for mask classification semantic segmentation architectures.\n    \"\"\"\n\n    @configurable\n    def __init__(\n        self,\n        *,\n        backbone: Backbone,\n        sem_seg_head: nn.Module,\n        criterion: nn.Module,\n        num_queries: int,\n        object_mask_threshold: float,\n        overlap_threshold: float,\n        metadata,\n        size_divisibility: int,\n        sem_seg_postprocess_before_inference: bool,\n        pixel_mean: Tuple[float],\n        pixel_std: Tuple[float],\n        # video\n        num_frames,\n    ):\n        \"\"\"\n        Args:\n            backbone: a backbone module, must follow detectron2's backbone interface\n            sem_seg_head: a module that predicts semantic segmentation from backbone features\n            criterion: a module that defines the loss\n            num_queries: int, number of queries\n            object_mask_threshold: float, threshold to filter query based on classification score\n                for panoptic segmentation inference\n            overlap_threshold: overlap threshold used in general inference for panoptic segmentation\n            metadata: dataset meta, get `thing` and `stuff` category names for panoptic\n                segmentation inference\n            size_divisibility: Some backbones require the input height and width to be divisible by a\n                specific integer. We can use this to override that requirement.\n            sem_seg_postprocess_before_inference: whether to resize the prediction back\n                to the original input size before semantic segmentation inference or after.\n                For high-resolution datasets like Mapillary, resizing predictions before\n                inference will cause OOM errors.\n            pixel_mean, pixel_std: list or tuple with #channels elements, representing\n                the per-channel mean and std used to normalize the input image\n            num_frames: int, number of frames sampled per video clip\n        \"\"\"\n        super().__init__()\n        self.backbone = backbone\n        self.sem_seg_head = sem_seg_head\n        self.criterion = criterion\n        self.num_queries = num_queries\n        self.overlap_threshold = overlap_threshold\n        self.object_mask_threshold = 
object_mask_threshold\n        self.metadata = metadata\n        if size_divisibility < 0:\n            # use backbone size_divisibility if not set\n            size_divisibility = self.backbone.size_divisibility\n        self.size_divisibility = size_divisibility\n        self.sem_seg_postprocess_before_inference = sem_seg_postprocess_before_inference\n        self.register_buffer(\"pixel_mean\", torch.Tensor(pixel_mean).view(-1, 1, 1), False)\n        self.register_buffer(\"pixel_std\", torch.Tensor(pixel_std).view(-1, 1, 1), False)\n\n        self.num_frames = num_frames\n        #self.structure_fc = nn.Conv2d(27, 256, 1)\n\n    @classmethod\n    def from_config(cls, cfg):\n        backbone = build_backbone(cfg)\n        sem_seg_head = build_sem_seg_head(cfg, backbone.output_shape())\n\n        # Loss parameters:\n        deep_supervision = cfg.MODEL.MASK_FORMER.DEEP_SUPERVISION\n        no_object_weight = cfg.MODEL.MASK_FORMER.NO_OBJECT_WEIGHT\n\n        # loss weights\n        class_weight = cfg.MODEL.MASK_FORMER.CLASS_WEIGHT\n        dice_weight = cfg.MODEL.MASK_FORMER.DICE_WEIGHT\n        mask_weight = cfg.MODEL.MASK_FORMER.MASK_WEIGHT\n\n        # building criterion\n        matcher = VideoHungarianMatcher(\n            cost_class=class_weight,\n            cost_mask=mask_weight,\n            cost_dice=dice_weight,\n            num_points=cfg.MODEL.MASK_FORMER.TRAIN_NUM_POINTS,\n        )\n\n        weight_dict = {\"loss_ce\": class_weight, \"loss_mask\": mask_weight, \"loss_mask_proj\": mask_weight, \"loss_dice\": dice_weight, \"loss_bound\": mask_weight, \"loss_bound_neighbor\": mask_weight, \"loss_out_box\": mask_weight}\n\n        if deep_supervision:\n            dec_layers = cfg.MODEL.MASK_FORMER.DEC_LAYERS\n            aux_weight_dict = {}\n            for i in range(dec_layers - 1):\n                aux_weight_dict.update({k + f\"_{i}\": v for k, v in weight_dict.items()})\n            weight_dict.update(aux_weight_dict)\n\n        losses = 
[\"labels\", \"masks\"]\n\n        criterion = VideoSetCriterion(\n            sem_seg_head.num_classes,\n            matcher=matcher,\n            weight_dict=weight_dict,\n            eos_coef=no_object_weight,\n            losses=losses,\n            num_points=cfg.MODEL.MASK_FORMER.TRAIN_NUM_POINTS,\n            oversample_ratio=cfg.MODEL.MASK_FORMER.OVERSAMPLE_RATIO,\n            importance_sample_ratio=cfg.MODEL.MASK_FORMER.IMPORTANCE_SAMPLE_RATIO,\n        )\n\n        return {\n            \"backbone\": backbone,\n            \"sem_seg_head\": sem_seg_head,\n            \"criterion\": criterion,\n            \"num_queries\": cfg.MODEL.MASK_FORMER.NUM_OBJECT_QUERIES,\n            \"object_mask_threshold\": cfg.MODEL.MASK_FORMER.TEST.OBJECT_MASK_THRESHOLD,\n            \"overlap_threshold\": cfg.MODEL.MASK_FORMER.TEST.OVERLAP_THRESHOLD,\n            \"metadata\": MetadataCatalog.get(cfg.DATASETS.TRAIN[0]),\n            \"size_divisibility\": cfg.MODEL.MASK_FORMER.SIZE_DIVISIBILITY,\n            \"sem_seg_postprocess_before_inference\": True,\n            \"pixel_mean\": cfg.MODEL.PIXEL_MEAN,\n            \"pixel_std\": cfg.MODEL.PIXEL_STD,\n            # video\n            \"num_frames\": cfg.INPUT.SAMPLING_FRAME_NUM,\n        }\n\n    @property\n    def device(self):\n        return self.pixel_mean.device\n    def forward(self, batched_inputs):\n        \"\"\"\n        Args:\n            batched_inputs: a list, batched outputs of :class:`DatasetMapper`.\n                Each item in the list contains the inputs for one image.\n                For now, each item in the list is a dict that contains:\n                   * \"image\": Tensor, image in (C, H, W) format.\n                   * \"instances\": per-region ground truth\n                   * Other information that's included in the original dicts, such as:\n                     \"height\", \"width\" (int): the output resolution of the model (may be different\n                     from input resolution), 
used in inference.\n        Returns:\n            list[dict]:\n                each dict has the results for one image. The dict contains the following keys:\n\n                * \"sem_seg\":\n                    A Tensor that represents the\n                    per-pixel segmentation predicted by the head.\n                    The prediction has shape KxHxW that represents the logits of\n                    each class for each pixel.\n                * \"panoptic_seg\":\n                    A tuple that represents the panoptic output\n                    panoptic_seg (Tensor): of shape (height, width) where the values are ids for each segment.\n                    segments_info (list[dict]): Describe each segment in `panoptic_seg`.\n                        Each dict contains keys \"id\", \"category_id\", \"isthing\".\n        \"\"\"\n        images = []\n\n        for video in batched_inputs:\n            for frame in video[\"image\"]:\n                images.append(frame.to(self.device))\n\n        # heuristic: COCO pseudo-video clips arrive as 8 images per batch (4 for Swin-L, which cannot afford batch size 2)\n        is_coco = (len(images) == 8) or (len(images) == 4)\n        if self.training and not is_coco:\n            k_size = 3\n            rs_images = ImageList.from_tensors(images, self.size_divisibility)\n            downsampled_images = F.avg_pool2d(rs_images.tensor.float(), kernel_size=4, stride=4, padding=0)\n            images_lab = [torch.as_tensor(color.rgb2lab(ds_image[[2, 1, 0]].byte().permute(1, 2, 0).cpu().numpy()), device=ds_image.device, dtype=torch.float32).permute(2, 0, 1) for ds_image in downsampled_images]\n            images_lab_sim = [get_images_color_similarity(img_lab.unsqueeze(0), k_size, 2) for img_lab in images_lab]\n            images_lab_sim_nei = [get_neighbor_images_patch_color_similarity(images_lab[ii].unsqueeze(0), images_lab[ii+1].unsqueeze(0), 3, 3) for ii in range(0, len(images_lab), 3)]\n            images_lab_sim_nei1 = [get_neighbor_images_patch_color_similarity(images_lab[ii].unsqueeze(0), images_lab[ii+2].unsqueeze(0), 3, 3) for ii in range(0, len(images_lab), 3)]\n            images_lab_sim_nei2 = [get_neighbor_images_patch_color_similarity(images_lab[ii+1].unsqueeze(0), images_lab[ii+2].unsqueeze(0), 3, 3) for ii in range(0, len(images_lab), 3)]\n\n        images = [(x - self.pixel_mean) / self.pixel_std for x in images]\n        images = ImageList.from_tensors(images, self.size_divisibility)\n\n        features = self.backbone(images.tensor)\n        outputs = self.sem_seg_head(features)\n\n        if self.training:\n            # mask classification target\n            targets = self.prepare_targets(batched_inputs, images, is_coco)\n            if not is_coco:\n                # bipartite matching-based loss\n                losses = self.criterion(outputs, targets, images_lab_sim, images_lab_sim_nei, images_lab_sim_nei1, images_lab_sim_nei2)\n            else:\n                losses = self.criterion(outputs, targets, None, None, None, None)\n\n            for k in list(losses.keys()):\n                if k in self.criterion.weight_dict:\n                    losses[k] *= self.criterion.weight_dict[k]\n                else:\n                    # remove this loss if not specified in `weight_dict`\n                    losses.pop(k)\n            return losses\n        else:\n            mask_cls_results = outputs[\"pred_logits\"]\n            mask_pred_results = outputs[\"pred_masks\"]\n\n            mask_cls_result = mask_cls_results[0]\n            # upsample masks\n            mask_pred_result = retry_if_cuda_oom(F.interpolate)(\n                mask_pred_results[0],\n                size=(images.tensor.shape[-2], images.tensor.shape[-1]),\n                mode=\"bilinear\",\n                align_corners=False,\n            )\n\n            del outputs\n\n            input_per_image = batched_inputs[0]\n     
       image_size = images.image_sizes[0]  # image size without padding after data augmentation\n\n            height = input_per_image.get(\"height\", image_size[0])  # raw image size before data augmentation\n            width = input_per_image.get(\"width\", image_size[1])\n\n            return retry_if_cuda_oom(self.inference_video)(mask_cls_result, mask_pred_result, image_size, height, width)\n\n    def prepare_targets(self, targets, images, is_coco):\n        h_pad, w_pad = images.tensor.shape[-2:]\n        gt_instances = []\n        for targets_per_video in targets:\n            _num_instance = len(targets_per_video[\"instances\"][0])\n            if is_coco:\n                mask_shape = [_num_instance, 4, h_pad, w_pad] #change here\n            else:\n                mask_shape = [_num_instance, self.num_frames, h_pad, w_pad]\n\n            gt_masks_per_video = torch.zeros(mask_shape, dtype=torch.bool, device=self.device)\n            gt_classes_per_video = targets_per_video[\"instances\"][0].gt_classes.to(self.device)\n\n            gt_ids_per_video = []\n            for f_i, targets_per_frame in enumerate(targets_per_video[\"instances\"]):\n                targets_per_frame = targets_per_frame.to(self.device)\n                h, w = targets_per_frame.image_size\n                \n                _update_cls = gt_classes_per_video == -1\n                gt_classes_per_video[_update_cls] = targets_per_frame.gt_classes[_update_cls]\n                gt_ids_per_video.append(targets_per_frame.gt_ids[:, None])\n                if isinstance(targets_per_frame.gt_masks, BitMasks):\n                    gt_masks_per_video[:, f_i, :h, :w] = targets_per_frame.gt_masks.tensor\n                else: #polygon\n                    gt_masks_per_video[:, f_i, :h, :w] = targets_per_frame.gt_masks\n\n\n            gt_ids_per_video = torch.cat(gt_ids_per_video, dim=1)\n            gt_ids_per_video[gt_masks_per_video.sum(dim=(2,3)) == 0] = -1\n            valid_bool_frame = 
(gt_ids_per_video != -1)\n            valid_bool_clip = valid_bool_frame.any(dim=-1)\n            # valid_idx = (gt_ids_per_video != -1).any(dim=-1)\n\n            gt_classes_per_video = gt_classes_per_video[valid_bool_clip].long() #targets_per_frame.gt_classes[valid_idx]          # N,\n            gt_ids_per_video = gt_ids_per_video[valid_bool_clip].long()                          # N, num_frames\n            valid_bool_frame = valid_bool_frame[valid_bool_clip]\n\n            if len(gt_ids_per_video) > 0:\n                min_id = max(gt_ids_per_video[valid_bool_frame].min(), 0)\n                gt_ids_per_video[valid_bool_frame] -= min_id\n\n            gt_instances.append({\"labels\": gt_classes_per_video, \"ids\": gt_ids_per_video})\n            gt_masks_per_video = gt_masks_per_video[valid_bool_clip].float()          # N, num_frames, H, W\n            gt_instances[-1].update({\"masks\": gt_masks_per_video})\n\n        return gt_instances\n\n    def inference_video(self, pred_cls, pred_masks, img_size, output_height, output_width):\n        if len(pred_cls) > 0:\n            scores = F.softmax(pred_cls, dim=-1)[:, :-1]\n            labels = torch.arange(self.sem_seg_head.num_classes, device=self.device).unsqueeze(0).repeat(self.num_queries, 1).flatten(0, 1)\n            # keep top-10 predictions\n            scores_per_image, topk_indices = scores.flatten(0, 1).topk(10, sorted=False)\n            labels_per_image = labels[topk_indices]\n            topk_indices = topk_indices // self.sem_seg_head.num_classes\n            pred_masks = pred_masks[topk_indices]\n\n            pred_masks = pred_masks[:, :, : img_size[0], : img_size[1]]\n            pred_masks = F.interpolate(\n                pred_masks, size=(output_height, output_width), mode=\"bilinear\", align_corners=False\n            )\n\n            masks = pred_masks > 0.\n\n            out_scores = scores_per_image.tolist()\n            out_labels = labels_per_image.tolist()\n            out_masks = [m 
for m in masks.cpu()]\n        else:\n            out_scores = []\n            out_labels = []\n            out_masks = []\n\n        video_output = {\n            \"image_size\": (output_height, output_width),\n            \"pred_scores\": out_scores,\n            \"pred_labels\": out_labels,\n            \"pred_masks\": out_masks,\n        }\n\n        return video_output\n"
  },
  {
    "path": "mfvis_nococo/__init__.py",
    "content": "from . import modeling\n\n# config\nfrom .config import add_maskformer2_video_config\n\n# models\nfrom .video_maskformer_model import VideoMaskFormer\n\n# video\nfrom .data_video import (\n    YTVISDatasetMapper,\n    YTVISEvaluator,\n    build_detection_train_loader,\n    build_detection_test_loader,\n    get_detection_dataset_dicts,\n)\n"
  },
  {
    "path": "mfvis_nococo/configs/youtubevis_2019/Base-YouTubeVIS-VideoInstanceSegmentation.yaml",
    "content": "MODEL:\n  BACKBONE:\n    FREEZE_AT: 0\n    NAME: \"build_resnet_backbone\"\n  WEIGHTS: \"detectron2://ImageNetPretrained/torchvision/R-50.pkl\"\n  PIXEL_MEAN: [123.675, 116.280, 103.530]\n  PIXEL_STD: [58.395, 57.120, 57.375]\n  MASK_ON: True\n  RESNETS:\n    DEPTH: 50\n    STEM_TYPE: \"basic\"  # not used\n    STEM_OUT_CHANNELS: 64\n    STRIDE_IN_1X1: False\n    OUT_FEATURES: [\"res2\", \"res3\", \"res4\", \"res5\"]\n    # NORM: \"SyncBN\"\n    RES5_MULTI_GRID: [1, 1, 1]  # not used\nDATASETS:\n  TRAIN: (\"ytvis_2019_train\",)\n  TEST: (\"ytvis_2019_val\",)\nSOLVER:\n  IMS_PER_BATCH: 16\n  BASE_LR: 0.0001\n  STEPS: (4000,)\n  MAX_ITER: 6000\n  WARMUP_FACTOR: 1.0\n  WARMUP_ITERS: 10\n  WEIGHT_DECAY: 0.05\n  OPTIMIZER: \"ADAMW\"\n  BACKBONE_MULTIPLIER: 0.1\n  CLIP_GRADIENTS:\n    ENABLED: True\n    CLIP_TYPE: \"full_model\"\n    CLIP_VALUE: 0.01\n    NORM_TYPE: 2.0\n  AMP:\n    ENABLED: True\nINPUT:\n  MIN_SIZE_TRAIN_SAMPLING: \"choice_by_clip\"\n  RANDOM_FLIP: \"flip_by_clip\"\n  AUGMENTATIONS: []\n  MIN_SIZE_TRAIN: (360, 480)\n  MIN_SIZE_TEST: 360\n  CROP:\n    ENABLED: False\n    TYPE: \"absolute_range\"\n    SIZE: (600, 720)\n  FORMAT: \"RGB\"\nTEST:\n  EVAL_PERIOD: 0\nDATALOADER:\n  FILTER_EMPTY_ANNOTATIONS: False\n  NUM_WORKERS: 4\nVERSION: 2\n"
  },
  {
    "path": "mfvis_nococo/configs/youtubevis_2019/video_maskformer2_R101_bs16_8ep_coco.yaml",
    "content": "_BASE_: video_maskformer2_R50_bs16_8ep.yaml\nOUTPUT_DIR: 'box_patch_newknn_t5s5_spretrained1_r101_correct'\nMODEL:\n  WEIGHTS: \"./pretrained_model/model_final_eba159.pkl\"\n  RESNETS:\n    DEPTH: 101\n    STEM_TYPE: \"basic\"  # not used\n    STEM_OUT_CHANNELS: 64\n    STRIDE_IN_1X1: False\n    OUT_FEATURES: [\"res2\", \"res3\", \"res4\", \"res5\"]\n    # NORM: \"SyncBN\"\n    RES5_MULTI_GRID: [1, 1, 1]  # not used\n"
  },
  {
    "path": "mfvis_nococo/configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep.yaml",
    "content": "_BASE_: Base-YouTubeVIS-VideoInstanceSegmentation.yaml\nOUTPUT_DIR: 'box_patch_newknn_t5s5_spretrained3_correct1'\nSEED: 29118357\nMODEL:\n  WEIGHTS: \"./model_final_proj.pth\"\n  META_ARCHITECTURE: \"VideoMaskFormer\"\n  SEM_SEG_HEAD:\n    NAME: \"MaskFormerHead\"\n    IGNORE_VALUE: 255\n    NUM_CLASSES: 40\n    LOSS_WEIGHT: 1.0\n    CONVS_DIM: 256\n    MASK_DIM: 256\n    NORM: \"GN\"\n    # pixel decoder\n    PIXEL_DECODER_NAME: \"MSDeformAttnPixelDecoder\"\n    IN_FEATURES: [\"res2\", \"res3\", \"res4\", \"res5\"]\n    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: [\"res3\", \"res4\", \"res5\"]\n    COMMON_STRIDE: 4\n    TRANSFORMER_ENC_LAYERS: 6\n  MASK_FORMER:\n    TRANSFORMER_DECODER_NAME: \"VideoMultiScaleMaskedTransformerDecoder\"\n    TRANSFORMER_IN_FEATURE: \"multi_scale_pixel_decoder\"\n    DEEP_SUPERVISION: True\n    NO_OBJECT_WEIGHT: 0.1\n    CLASS_WEIGHT: 2.0\n    MASK_WEIGHT: 5.0\n    DICE_WEIGHT: 5.0\n    HIDDEN_DIM: 256\n    NUM_OBJECT_QUERIES: 100\n    NHEADS: 8\n    DROPOUT: 0.0\n    DIM_FEEDFORWARD: 2048\n    ENC_LAYERS: 0\n    PRE_NORM: False\n    ENFORCE_INPUT_PROJ: False\n    SIZE_DIVISIBILITY: 32\n    DEC_LAYERS: 10  # 9 decoder layers, add one for the loss on learnable query\n    TRAIN_NUM_POINTS: 20000 #20000 #12544\n    OVERSAMPLE_RATIO: 3.0\n    IMPORTANCE_SAMPLE_RATIO: 0.75\n    TEST:\n      SEMANTIC_ON: False\n      INSTANCE_ON: True\n      PANOPTIC_ON: False\n      OVERLAP_THRESHOLD: 0.8\n      OBJECT_MASK_THRESHOLD: 0.8\n"
  },
  {
    "path": "mfvis_nococo/configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep_coco.yaml",
    "content": "_BASE_: Base-YouTubeVIS-VideoInstanceSegmentation.yaml\nOUTPUT_DIR: 'box_patch_newknn_t5s5_spretrained3_coco_correct1'\nSEED: 29118357\nMODEL:\n  WEIGHTS: \"./pretrained_model/model_final_3c8ec9.pkl\"\n  META_ARCHITECTURE: \"VideoMaskFormer\"\n  SEM_SEG_HEAD:\n    NAME: \"MaskFormerHead\"\n    IGNORE_VALUE: 255\n    NUM_CLASSES: 40\n    LOSS_WEIGHT: 1.0\n    CONVS_DIM: 256\n    MASK_DIM: 256\n    NORM: \"GN\"\n    # pixel decoder\n    PIXEL_DECODER_NAME: \"MSDeformAttnPixelDecoder\"\n    IN_FEATURES: [\"res2\", \"res3\", \"res4\", \"res5\"]\n    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: [\"res3\", \"res4\", \"res5\"]\n    COMMON_STRIDE: 4\n    TRANSFORMER_ENC_LAYERS: 6\n  MASK_FORMER:\n    TRANSFORMER_DECODER_NAME: \"VideoMultiScaleMaskedTransformerDecoder\"\n    TRANSFORMER_IN_FEATURE: \"multi_scale_pixel_decoder\"\n    DEEP_SUPERVISION: True\n    NO_OBJECT_WEIGHT: 0.1\n    CLASS_WEIGHT: 2.0\n    MASK_WEIGHT: 5.0\n    DICE_WEIGHT: 5.0\n    HIDDEN_DIM: 256\n    NUM_OBJECT_QUERIES: 100\n    NHEADS: 8\n    DROPOUT: 0.0\n    DIM_FEEDFORWARD: 2048\n    ENC_LAYERS: 0\n    PRE_NORM: False\n    ENFORCE_INPUT_PROJ: False\n    SIZE_DIVISIBILITY: 32\n    DEC_LAYERS: 10  # 9 decoder layers, add one for the loss on learnable query\n    TRAIN_NUM_POINTS: 20000 #20000 #12544\n    OVERSAMPLE_RATIO: 3.0\n    IMPORTANCE_SAMPLE_RATIO: 0.75\n    TEST:\n      SEMANTIC_ON: False\n      INSTANCE_ON: True\n      PANOPTIC_ON: False\n      OVERLAP_THRESHOLD: 0.8\n      OBJECT_MASK_THRESHOLD: 0.8\n"
  },
  {
    "path": "mfvis_nococo/mask2former/__init__.py",
    "content": "from . import data  # register all new datasets\nfrom . import modeling\n\n# config\nfrom .config import add_maskformer2_config\n\n# dataset loading\nfrom .data.dataset_mappers.coco_instance_new_baseline_dataset_mapper import COCOInstanceNewBaselineDatasetMapper\nfrom .data.dataset_mappers.coco_panoptic_new_baseline_dataset_mapper import COCOPanopticNewBaselineDatasetMapper\nfrom .data.dataset_mappers.mask_former_instance_dataset_mapper import (\n    MaskFormerInstanceDatasetMapper,\n)\nfrom .data.dataset_mappers.mask_former_panoptic_dataset_mapper import (\n    MaskFormerPanopticDatasetMapper,\n)\nfrom .data.dataset_mappers.mask_former_semantic_dataset_mapper import (\n    MaskFormerSemanticDatasetMapper,\n)\n\n# models\nfrom .maskformer_model import MaskFormer\nfrom .test_time_augmentation import SemanticSegmentorWithTTA\n\n# evaluation\nfrom .evaluation.instance_evaluation import InstanceSegEvaluator\n"
  },
  {
    "path": "mfvis_nococo/mask2former/config.py",
    "content": "# -*- coding: utf-8 -*-\nfrom detectron2.config import CfgNode as CN\n\n\ndef add_maskformer2_config(cfg):\n    \"\"\"\n    Add config for MASK_FORMER.\n    \"\"\"\n    # NOTE: configs from original maskformer\n    # data config\n    # select the dataset mapper\n    cfg.INPUT.DATASET_MAPPER_NAME = \"mask_former_semantic\"\n    # Color augmentation\n    cfg.INPUT.COLOR_AUG_SSD = False\n    # We retry random cropping until no single category in semantic segmentation GT occupies more\n    # than `SINGLE_CATEGORY_MAX_AREA` part of the crop.\n    cfg.INPUT.CROP.SINGLE_CATEGORY_MAX_AREA = 1.0\n    # Pad image and segmentation GT in dataset mapper.\n    cfg.INPUT.SIZE_DIVISIBILITY = -1\n\n    # solver config\n    # weight decay on embedding\n    cfg.SOLVER.WEIGHT_DECAY_EMBED = 0.0\n    # optimizer\n    cfg.SOLVER.OPTIMIZER = \"ADAMW\"\n    cfg.SOLVER.BACKBONE_MULTIPLIER = 0.1\n\n    # mask_former model config\n    cfg.MODEL.MASK_FORMER = CN()\n\n    # loss\n    cfg.MODEL.MASK_FORMER.DEEP_SUPERVISION = True\n    cfg.MODEL.MASK_FORMER.NO_OBJECT_WEIGHT = 0.1\n    cfg.MODEL.MASK_FORMER.CLASS_WEIGHT = 1.0\n    cfg.MODEL.MASK_FORMER.DICE_WEIGHT = 1.0\n    cfg.MODEL.MASK_FORMER.MASK_WEIGHT = 20.0\n\n    # transformer config\n    cfg.MODEL.MASK_FORMER.NHEADS = 8\n    cfg.MODEL.MASK_FORMER.DROPOUT = 0.1\n    cfg.MODEL.MASK_FORMER.DIM_FEEDFORWARD = 2048\n    cfg.MODEL.MASK_FORMER.ENC_LAYERS = 0\n    cfg.MODEL.MASK_FORMER.DEC_LAYERS = 6\n    cfg.MODEL.MASK_FORMER.PRE_NORM = False\n\n    cfg.MODEL.MASK_FORMER.HIDDEN_DIM = 256\n    cfg.MODEL.MASK_FORMER.NUM_OBJECT_QUERIES = 100\n\n    cfg.MODEL.MASK_FORMER.TRANSFORMER_IN_FEATURE = \"res5\"\n    cfg.MODEL.MASK_FORMER.ENFORCE_INPUT_PROJ = False\n\n    # mask_former inference config\n    cfg.MODEL.MASK_FORMER.TEST = CN()\n    cfg.MODEL.MASK_FORMER.TEST.SEMANTIC_ON = True\n    cfg.MODEL.MASK_FORMER.TEST.INSTANCE_ON = False\n    cfg.MODEL.MASK_FORMER.TEST.PANOPTIC_ON = False\n    
cfg.MODEL.MASK_FORMER.TEST.OBJECT_MASK_THRESHOLD = 0.0\n    cfg.MODEL.MASK_FORMER.TEST.OVERLAP_THRESHOLD = 0.0\n    cfg.MODEL.MASK_FORMER.TEST.SEM_SEG_POSTPROCESSING_BEFORE_INFERENCE = False\n\n    # Sometimes `backbone.size_divisibility` is set to 0 for some backbone (e.g. ResNet)\n    # you can use this config to override\n    cfg.MODEL.MASK_FORMER.SIZE_DIVISIBILITY = 32\n\n    # pixel decoder config\n    cfg.MODEL.SEM_SEG_HEAD.MASK_DIM = 256\n    # adding transformer in pixel decoder\n    cfg.MODEL.SEM_SEG_HEAD.TRANSFORMER_ENC_LAYERS = 0\n    # pixel decoder\n    cfg.MODEL.SEM_SEG_HEAD.PIXEL_DECODER_NAME = \"BasePixelDecoder\"\n\n    # swin transformer backbone\n    cfg.MODEL.SWIN = CN()\n    cfg.MODEL.SWIN.PRETRAIN_IMG_SIZE = 224\n    cfg.MODEL.SWIN.PATCH_SIZE = 4\n    cfg.MODEL.SWIN.EMBED_DIM = 96\n    cfg.MODEL.SWIN.DEPTHS = [2, 2, 6, 2]\n    cfg.MODEL.SWIN.NUM_HEADS = [3, 6, 12, 24]\n    cfg.MODEL.SWIN.WINDOW_SIZE = 7\n    cfg.MODEL.SWIN.MLP_RATIO = 4.0\n    cfg.MODEL.SWIN.QKV_BIAS = True\n    cfg.MODEL.SWIN.QK_SCALE = None\n    cfg.MODEL.SWIN.DROP_RATE = 0.0\n    cfg.MODEL.SWIN.ATTN_DROP_RATE = 0.0\n    cfg.MODEL.SWIN.DROP_PATH_RATE = 0.3\n    cfg.MODEL.SWIN.APE = False\n    cfg.MODEL.SWIN.PATCH_NORM = True\n    cfg.MODEL.SWIN.OUT_FEATURES = [\"res2\", \"res3\", \"res4\", \"res5\"]\n    cfg.MODEL.SWIN.USE_CHECKPOINT = False\n\n    # NOTE: maskformer2 extra configs\n    # transformer module\n    cfg.MODEL.MASK_FORMER.TRANSFORMER_DECODER_NAME = \"MultiScaleMaskedTransformerDecoder\"\n\n    # LSJ aug\n    cfg.INPUT.IMAGE_SIZE = 1024\n    cfg.INPUT.MIN_SCALE = 0.1\n    cfg.INPUT.MAX_SCALE = 2.0\n\n    # MSDeformAttn encoder configs\n    cfg.MODEL.SEM_SEG_HEAD.DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES = [\"res3\", \"res4\", \"res5\"]\n    cfg.MODEL.SEM_SEG_HEAD.DEFORMABLE_TRANSFORMER_ENCODER_N_POINTS = 4\n    cfg.MODEL.SEM_SEG_HEAD.DEFORMABLE_TRANSFORMER_ENCODER_N_HEADS = 8\n\n    # point loss configs\n    # Number of points sampled during training for a mask 
point head.\n    cfg.MODEL.MASK_FORMER.TRAIN_NUM_POINTS = 112 * 112\n    # Oversampling parameter for PointRend point sampling during training. Parameter `k` in the\n    # original paper.\n    cfg.MODEL.MASK_FORMER.OVERSAMPLE_RATIO = 3.0\n    # Importance sampling parameter for PointRend point sampling during training. Parameter `beta` in\n    # the original paper.\n    cfg.MODEL.MASK_FORMER.IMPORTANCE_SAMPLE_RATIO = 0.75\n"
  },
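The point-loss settings above follow PointRend's sampling scheme: `TRAIN_NUM_POINTS` points per mask, with `OVERSAMPLE_RATIO` (`k` in the paper) candidate points drawn and `IMPORTANCE_SAMPLE_RATIO` (`beta`) of the final budget taken from the most uncertain candidates. A minimal sketch of how these three numbers interact (the function name is illustrative, not part of detectron2):

```python
def sample_point_counts(num_points, oversample_ratio, importance_sample_ratio):
    """Split PointRend's point budget (a sketch, not the detectron2 API).

    num_points * oversample_ratio candidates are drawn; the most uncertain
    importance_sample_ratio * num_points of them are kept, and the remainder
    of the budget is filled with uniformly random points.
    """
    num_sampled = int(num_points * oversample_ratio)          # candidates drawn
    num_uncertain = int(importance_sample_ratio * num_points)  # kept by uncertainty
    num_random = num_points - num_uncertain                    # filled uniformly
    return num_sampled, num_uncertain, num_random

# Values from the config above: 112*112 points, k = 3.0, beta = 0.75.
counts = sample_point_counts(112 * 112, 3.0, 0.75)
```

With the config defaults this draws 37632 candidates and keeps 9408 uncertain plus 3136 random points, summing to the 12544-point budget.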
  {
    "path": "mfvis_nococo/mask2former/data/__init__.py",
    "content": "from . import datasets\n"
  },
  {
    "path": "mfvis_nococo/mask2former/data/dataset_mappers/__init__.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n"
  },
  {
    "path": "mfvis_nococo/mask2former/data/dataset_mappers/coco_instance_new_baseline_dataset_mapper.py",
"content": "# Modified by Bowen Cheng from https://github.com/facebookresearch/detr/blob/master/d2/detr/dataset_mapper.py\nimport copy\nimport logging\n\nimport numpy as np\nimport torch\n\nfrom detectron2.config import configurable\nfrom detectron2.data import detection_utils as utils\nfrom detectron2.data import transforms as T\nfrom detectron2.data.transforms import TransformGen\nfrom detectron2.structures import BitMasks, Instances\n\nfrom pycocotools import mask as coco_mask\n\n__all__ = [\"COCOInstanceNewBaselineDatasetMapper\"]\n\ndef masks_to_boxes(masks: torch.Tensor) -> torch.Tensor:\n    \"\"\"\n    Fill each mask with the axis-aligned bounding box of its foreground pixels.\n\n    Despite the name, this does not return box coordinates: it converts every\n    mask into a box-shaped mask in place, which is how box-level supervision is\n    derived from the instance annotations.\n\n    Args:\n        masks (Tensor[N, H, W]): masks to transform where N is the number of masks\n            and (H, W) are the spatial dimensions.\n\n    Returns:\n        Tensor[N, H, W]: the box-filled masks\n    \"\"\"\n    if masks.numel() == 0:\n        return masks\n\n    for index, mask in enumerate(masks):\n        y, x = torch.where(mask != 0)\n        if len(x) * len(y) == 0:\n            # empty mask: leave it unchanged\n            continue\n\n        # +1 so the slice includes the max row/column of the tight box\n        masks[index, torch.min(y):torch.max(y) + 1, torch.min(x):torch.max(x) + 1] = 1.0\n\n    return masks\n\ndef convert_coco_poly_to_mask(segmentations, height, width):\n    masks = []\n    for polygons in segmentations:\n        rles = coco_mask.frPyObjects(polygons, height, width)\n        mask = coco_mask.decode(rles)\n        if len(mask.shape) < 3:\n            mask = mask[..., None]\n        mask = torch.as_tensor(mask, dtype=torch.uint8)\n        mask = mask.any(dim=2)\n        masks.append(mask)\n    if masks:\n        masks = torch.stack(masks, dim=0)\n        masks = masks_to_boxes(masks)\n    else:\n        masks = torch.zeros((0, 
height, width), dtype=torch.uint8)\n\n    return masks\n\n\ndef build_transform_gen(cfg, is_train):\n    \"\"\"\n    Create a list of default :class:`Augmentation` from config.\n    Now it includes resizing and flipping.\n    Returns:\n        list[Augmentation]\n    \"\"\"\n    assert is_train, \"Only support training augmentation\"\n    image_size = cfg.INPUT.IMAGE_SIZE\n    min_scale = cfg.INPUT.MIN_SCALE\n    max_scale = cfg.INPUT.MAX_SCALE\n\n    augmentation = []\n\n    if cfg.INPUT.RANDOM_FLIP != \"none\":\n        augmentation.append(\n            T.RandomFlip(\n                horizontal=cfg.INPUT.RANDOM_FLIP == \"horizontal\",\n                vertical=cfg.INPUT.RANDOM_FLIP == \"vertical\",\n            )\n        )\n\n    augmentation.extend([\n        T.ResizeScale(\n            min_scale=min_scale, max_scale=max_scale, target_height=image_size, target_width=image_size\n        ),\n        T.FixedSizeCrop(crop_size=(image_size, image_size)),\n    ])\n\n    return augmentation\n\n\n# This is specifically designed for the COCO dataset.\nclass COCOInstanceNewBaselineDatasetMapper:\n    \"\"\"\n    A callable which takes a dataset dict in Detectron2 Dataset format,\n    and map it into a format used by MaskFormer.\n\n    This dataset mapper applies the same transformation as DETR for COCO panoptic segmentation.\n\n    The callable currently does the following:\n\n    1. Read the image from \"file_name\"\n    2. Applies geometric transforms to the image and annotation\n    3. Find and applies suitable cropping to the image and annotation\n    4. 
Prepare image and annotation to Tensors\n    \"\"\"\n\n    @configurable\n    def __init__(\n        self,\n        is_train=True,\n        *,\n        tfm_gens,\n        image_format,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            is_train: for training or inference\n            augmentations: a list of augmentations or deterministic transforms to apply\n            tfm_gens: data augmentation\n            image_format: an image format supported by :func:`detection_utils.read_image`.\n        \"\"\"\n        self.tfm_gens = tfm_gens\n        logging.getLogger(__name__).info(\n            \"[COCOInstanceNewBaselineDatasetMapper] Full TransformGens used in training: {}\".format(str(self.tfm_gens))\n        )\n\n        self.img_format = image_format\n        self.is_train = is_train\n    \n    @classmethod\n    def from_config(cls, cfg, is_train=True):\n        # Build augmentation\n        tfm_gens = build_transform_gen(cfg, is_train)\n\n        ret = {\n            \"is_train\": is_train,\n            \"tfm_gens\": tfm_gens,\n            \"image_format\": cfg.INPUT.FORMAT,\n        }\n        return ret\n\n    def __call__(self, dataset_dict):\n        \"\"\"\n        Args:\n            dataset_dict (dict): Metadata of one image, in Detectron2 Dataset format.\n\n        Returns:\n            dict: a format that builtin models in detectron2 accept\n        \"\"\"\n        dataset_dict = copy.deepcopy(dataset_dict)  # it will be modified by code below\n        image = utils.read_image(dataset_dict[\"file_name\"], format=self.img_format)\n        utils.check_image_size(dataset_dict, image)\n\n        # TODO: get padding mask\n        # by feeding a \"segmentation mask\" to the same transforms\n        padding_mask = np.ones(image.shape[:2])\n\n        image, transforms = T.apply_transform_gens(self.tfm_gens, image)\n        # the crop transformation has default padding value 0 for segmentation\n        padding_mask 
= transforms.apply_segmentation(padding_mask)\n        padding_mask = ~ padding_mask.astype(bool)\n\n        image_shape = image.shape[:2]  # h, w\n\n        # Pytorch's dataloader is efficient on torch.Tensor due to shared-memory,\n        # but not efficient on large generic data structures due to the use of pickle & mp.Queue.\n        # Therefore it's important to use torch.Tensor.\n        dataset_dict[\"image\"] = torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1)))\n        dataset_dict[\"padding_mask\"] = torch.as_tensor(np.ascontiguousarray(padding_mask))\n\n        if not self.is_train:\n            # USER: Modify this if you want to keep them for some reason.\n            dataset_dict.pop(\"annotations\", None)\n            return dataset_dict\n\n        if \"annotations\" in dataset_dict:\n            # USER: Modify this if you want to keep them for some reason.\n            for anno in dataset_dict[\"annotations\"]:\n                # Let's always keep mask\n                # if not self.mask_on:\n                #     anno.pop(\"segmentation\", None)\n                anno.pop(\"keypoints\", None)\n\n            # USER: Implement additional transformations if you have other types of data\n            annos = [\n                utils.transform_instance_annotations(obj, transforms, image_shape)\n                for obj in dataset_dict.pop(\"annotations\")\n                if obj.get(\"iscrowd\", 0) == 0\n            ]\n            # NOTE: does not support BitMask due to augmentation\n            # Current BitMask cannot handle empty objects\n            instances = utils.annotations_to_instances(annos, image_shape)\n            # After transforms such as cropping are applied, the bounding box may no longer\n            # tightly bound the object. As an example, imagine a triangle object\n            # [(0,0), (2,0), (0,2)] cropped by a box [(1,0),(2,2)] (XYXY format). 
The tight\n            # bounding box of the cropped triangle should be [(1,0),(2,1)], which is not equal to\n            # the intersection of original bounding box and the cropping box.\n            instances.gt_boxes = instances.gt_masks.get_bounding_boxes()\n            # Need to filter empty instances first (due to augmentation)\n            instances = utils.filter_empty_instances(instances)\n            # Generate masks from polygon\n            h, w = instances.image_size\n            # image_size_xyxy = torch.as_tensor([w, h, w, h], dtype=torch.float)\n            if hasattr(instances, 'gt_masks'):\n                gt_masks = instances.gt_masks\n                gt_masks_box = convert_coco_poly_to_mask(gt_masks.polygons, h, w)\n                instances.gt_masks = gt_masks_box\n            dataset_dict[\"instances\"] = instances\n\n        return dataset_dict\n"
  },
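The `masks_to_boxes` helper in the mapper above does not emit `[N, 4]` box coordinates; it fills each instance mask out to its tight bounding box, so downstream losses see box-shaped masks. The same idea as a small self-contained numpy sketch (names are illustrative, not from the repo):

```python
import numpy as np

def fill_mask_to_box(mask):
    """Return a copy of a binary HxW mask filled to its tight bounding box.

    A numpy sketch of the mask-to-box-mask conversion used by the dataset
    mapper above; empty masks are returned unchanged.
    """
    out = mask.copy()
    ys, xs = np.where(mask != 0)
    if len(ys) == 0:
        return out  # no foreground pixels: nothing to fill
    # inclusive bounds, hence the +1 on the slice ends
    out[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 1
    return out

m = np.zeros((5, 5), dtype=np.uint8)
m[1, 1] = 1
m[3, 3] = 1  # a sparse instance spanning rows/cols 1..3
box_mask = fill_mask_to_box(m)
# box_mask is 1 on the whole 3x3 region covering rows 1..3, cols 1..3
```

This is why the training signal stays box-level: any interior structure of the original polygon mask is erased by the fill.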
  {
    "path": "mfvis_nococo/mask2former/data/dataset_mappers/coco_panoptic_new_baseline_dataset_mapper.py",
    "content": "# Modified by Bowen Cheng from https://github.com/facebookresearch/detr/blob/master/d2/detr/dataset_mapper.py\nimport copy\nimport logging\n\nimport numpy as np\nimport torch\n\nfrom detectron2.config import configurable\nfrom detectron2.data import detection_utils as utils\nfrom detectron2.data import transforms as T\nfrom detectron2.data.transforms import TransformGen\nfrom detectron2.structures import BitMasks, Boxes, Instances\n\n__all__ = [\"COCOPanopticNewBaselineDatasetMapper\"]\n\n\ndef build_transform_gen(cfg, is_train):\n    \"\"\"\n    Create a list of default :class:`Augmentation` from config.\n    Now it includes resizing and flipping.\n    Returns:\n        list[Augmentation]\n    \"\"\"\n    assert is_train, \"Only support training augmentation\"\n    image_size = cfg.INPUT.IMAGE_SIZE\n    min_scale = cfg.INPUT.MIN_SCALE\n    max_scale = cfg.INPUT.MAX_SCALE\n\n    augmentation = []\n\n    if cfg.INPUT.RANDOM_FLIP != \"none\":\n        augmentation.append(\n            T.RandomFlip(\n                horizontal=cfg.INPUT.RANDOM_FLIP == \"horizontal\",\n                vertical=cfg.INPUT.RANDOM_FLIP == \"vertical\",\n            )\n        )\n\n    augmentation.extend([\n        T.ResizeScale(\n            min_scale=min_scale, max_scale=max_scale, target_height=image_size, target_width=image_size\n        ),\n        T.FixedSizeCrop(crop_size=(image_size, image_size)),\n    ])\n\n    return augmentation\n\n\n# This is specifically designed for the COCO dataset.\nclass COCOPanopticNewBaselineDatasetMapper:\n    \"\"\"\n    A callable which takes a dataset dict in Detectron2 Dataset format,\n    and map it into a format used by MaskFormer.\n\n    This dataset mapper applies the same transformation as DETR for COCO panoptic segmentation.\n\n    The callable currently does the following:\n\n    1. Read the image from \"file_name\"\n    2. Applies geometric transforms to the image and annotation\n    3. 
Find and applies suitable cropping to the image and annotation\n    4. Prepare image and annotation to Tensors\n    \"\"\"\n\n    @configurable\n    def __init__(\n        self,\n        is_train=True,\n        *,\n        tfm_gens,\n        image_format,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            is_train: for training or inference\n            augmentations: a list of augmentations or deterministic transforms to apply\n            crop_gen: crop augmentation\n            tfm_gens: data augmentation\n            image_format: an image format supported by :func:`detection_utils.read_image`.\n        \"\"\"\n        self.tfm_gens = tfm_gens\n        logging.getLogger(__name__).info(\n            \"[COCOPanopticNewBaselineDatasetMapper] Full TransformGens used in training: {}\".format(\n                str(self.tfm_gens)\n            )\n        )\n\n        self.img_format = image_format\n        self.is_train = is_train\n\n    @classmethod\n    def from_config(cls, cfg, is_train=True):\n        # Build augmentation\n        tfm_gens = build_transform_gen(cfg, is_train)\n\n        ret = {\n            \"is_train\": is_train,\n            \"tfm_gens\": tfm_gens,\n            \"image_format\": cfg.INPUT.FORMAT,\n        }\n        return ret\n\n    def __call__(self, dataset_dict):\n        \"\"\"\n        Args:\n            dataset_dict (dict): Metadata of one image, in Detectron2 Dataset format.\n\n        Returns:\n            dict: a format that builtin models in detectron2 accept\n        \"\"\"\n        dataset_dict = copy.deepcopy(dataset_dict)  # it will be modified by code below\n        image = utils.read_image(dataset_dict[\"file_name\"], format=self.img_format)\n        utils.check_image_size(dataset_dict, image)\n\n        image, transforms = T.apply_transform_gens(self.tfm_gens, image)\n        image_shape = image.shape[:2]  # h, w\n\n        # Pytorch's dataloader is efficient on torch.Tensor due to 
shared-memory,\n        # but not efficient on large generic data structures due to the use of pickle & mp.Queue.\n        # Therefore it's important to use torch.Tensor.\n        dataset_dict[\"image\"] = torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1)))\n\n        if not self.is_train:\n            # USER: Modify this if you want to keep them for some reason.\n            dataset_dict.pop(\"annotations\", None)\n            return dataset_dict\n\n        if \"pan_seg_file_name\" in dataset_dict:\n            pan_seg_gt = utils.read_image(dataset_dict.pop(\"pan_seg_file_name\"), \"RGB\")\n            segments_info = dataset_dict[\"segments_info\"]\n\n            # apply the same transformation to panoptic segmentation\n            pan_seg_gt = transforms.apply_segmentation(pan_seg_gt)\n\n            from panopticapi.utils import rgb2id\n\n            pan_seg_gt = rgb2id(pan_seg_gt)\n\n            instances = Instances(image_shape)\n            classes = []\n            masks = []\n            for segment_info in segments_info:\n                class_id = segment_info[\"category_id\"]\n                if not segment_info[\"iscrowd\"]:\n                    classes.append(class_id)\n                    masks.append(pan_seg_gt == segment_info[\"id\"])\n\n            classes = np.array(classes)\n            instances.gt_classes = torch.tensor(classes, dtype=torch.int64)\n            if len(masks) == 0:\n                # Some image does not have annotation (all ignored)\n                instances.gt_masks = torch.zeros((0, pan_seg_gt.shape[-2], pan_seg_gt.shape[-1]))\n                instances.gt_boxes = Boxes(torch.zeros((0, 4)))\n            else:\n                masks = BitMasks(\n                    torch.stack([torch.from_numpy(np.ascontiguousarray(x.copy())) for x in masks])\n                )\n                instances.gt_masks = masks.tensor\n                instances.gt_boxes = masks.get_bounding_boxes()\n\n            
dataset_dict[\"instances\"] = instances\n\n        return dataset_dict\n"
  },
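The panoptic mapper above decodes segment ids from the panoptic PNG with `panopticapi.utils.rgb2id`, which packs the id into the RGB channels as `id = R + 256*G + 256**2*B`. A minimal re-derivation for illustration (use the real `rgb2id` in the mappers; this helper name is hypothetical):

```python
import numpy as np

def rgb2id_sketch(color):
    """Decode a panoptic PNG color into a segment id.

    Sketch of panopticapi's encoding: the id is stored little-endian across
    the R, G, B channels, i.e. id = R + 256*G + 256**2*B.
    """
    color = np.asarray(color, dtype=np.uint32)
    return color[..., 0] + 256 * color[..., 1] + 256 * 256 * color[..., 2]

pixel_id = rgb2id_sketch([10, 2, 1])  # 10 + 2*256 + 1*65536
```

Applied to the full `(H, W, 3)` ground-truth image this yields an `(H, W)` id map, which the mapper then compares against each `segments_info[\"id\"]` to build per-segment binary masks.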
  {
    "path": "mfvis_nococo/mask2former/data/dataset_mappers/mask_former_instance_dataset_mapper.py",
    "content": "import copy\nimport logging\n\nimport numpy as np\nimport pycocotools.mask as mask_util\nimport torch\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.data import detection_utils as utils\nfrom detectron2.data import transforms as T\nfrom detectron2.projects.point_rend import ColorAugSSDTransform\nfrom detectron2.structures import BitMasks, Instances, polygons_to_bitmask\n\n__all__ = [\"MaskFormerInstanceDatasetMapper\"]\n\n\nclass MaskFormerInstanceDatasetMapper:\n    \"\"\"\n    A callable which takes a dataset dict in Detectron2 Dataset format,\n    and map it into a format used by MaskFormer for instance segmentation.\n\n    The callable currently does the following:\n\n    1. Read the image from \"file_name\"\n    2. Applies geometric transforms to the image and annotation\n    3. Find and applies suitable cropping to the image and annotation\n    4. Prepare image and annotation to Tensors\n    \"\"\"\n\n    @configurable\n    def __init__(\n        self,\n        is_train=True,\n        *,\n        augmentations,\n        image_format,\n        size_divisibility,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            is_train: for training or inference\n            augmentations: a list of augmentations or deterministic transforms to apply\n            image_format: an image format supported by :func:`detection_utils.read_image`.\n            size_divisibility: pad image size to be divisible by this value\n        \"\"\"\n        self.is_train = is_train\n        self.tfm_gens = augmentations\n        self.img_format = image_format\n        self.size_divisibility = size_divisibility\n\n        logger = logging.getLogger(__name__)\n        mode = \"training\" if is_train else \"inference\"\n        logger.info(f\"[{self.__class__.__name__}] Augmentations used in {mode}: {augmentations}\")\n\n    @classmethod\n    def from_config(cls, cfg, 
is_train=True):\n        # Build augmentation\n        augs = [\n            T.ResizeShortestEdge(\n                cfg.INPUT.MIN_SIZE_TRAIN,\n                cfg.INPUT.MAX_SIZE_TRAIN,\n                cfg.INPUT.MIN_SIZE_TRAIN_SAMPLING,\n            )\n        ]\n        if cfg.INPUT.CROP.ENABLED:\n            augs.append(\n                T.RandomCrop(\n                    cfg.INPUT.CROP.TYPE,\n                    cfg.INPUT.CROP.SIZE,\n                )\n            )\n        if cfg.INPUT.COLOR_AUG_SSD:\n            augs.append(ColorAugSSDTransform(img_format=cfg.INPUT.FORMAT))\n        augs.append(T.RandomFlip())\n\n        ret = {\n            \"is_train\": is_train,\n            \"augmentations\": augs,\n            \"image_format\": cfg.INPUT.FORMAT,\n            \"size_divisibility\": cfg.INPUT.SIZE_DIVISIBILITY,\n        }\n        return ret\n\n    def __call__(self, dataset_dict):\n        \"\"\"\n        Args:\n            dataset_dict (dict): Metadata of one image, in Detectron2 Dataset format.\n\n        Returns:\n            dict: a format that builtin models in detectron2 accept\n        \"\"\"\n        assert self.is_train, \"MaskFormerInstanceDatasetMapper should only be used for training!\"\n\n        dataset_dict = copy.deepcopy(dataset_dict)  # it will be modified by code below\n        image = utils.read_image(dataset_dict[\"file_name\"], format=self.img_format)\n        utils.check_image_size(dataset_dict, image)\n\n        aug_input = T.AugInput(image)\n        aug_input, transforms = T.apply_transform_gens(self.tfm_gens, aug_input)\n        image = aug_input.image\n\n        # transform instance masks\n        assert \"annotations\" in dataset_dict\n        for anno in dataset_dict[\"annotations\"]:\n            anno.pop(\"keypoints\", None)\n\n        annos = [\n            utils.transform_instance_annotations(obj, transforms, image.shape[:2])\n            for obj in dataset_dict.pop(\"annotations\")\n            if obj.get(\"iscrowd\", 0) 
== 0\n        ]\n\n        if len(annos):\n            assert \"segmentation\" in annos[0]\n        segms = [obj[\"segmentation\"] for obj in annos]\n        masks = []\n        for segm in segms:\n            if isinstance(segm, list):\n                # polygon\n                masks.append(polygons_to_bitmask(segm, *image.shape[:2]))\n            elif isinstance(segm, dict):\n                # COCO RLE\n                masks.append(mask_util.decode(segm))\n            elif isinstance(segm, np.ndarray):\n                assert segm.ndim == 2, \"Expect segmentation of 2 dimensions, got {}.\".format(\n                    segm.ndim\n                )\n                # mask array\n                masks.append(segm)\n            else:\n                raise ValueError(\n                    \"Cannot convert segmentation of type '{}' to BitMasks!\"\n                    \"Supported types are: polygons as list[list[float] or ndarray],\"\n                    \" COCO-style RLE as a dict, or a binary segmentation mask \"\n                    \" in a 2D numpy array of shape HxW.\".format(type(segm))\n                )\n\n        # Pad image and segmentation label here!\n        image = torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1)))\n        masks = [torch.from_numpy(np.ascontiguousarray(x)) for x in masks]\n\n        classes = [int(obj[\"category_id\"]) for obj in annos]\n        classes = torch.tensor(classes, dtype=torch.int64)\n\n        if self.size_divisibility > 0:\n            image_size = (image.shape[-2], image.shape[-1])\n            padding_size = [\n                0,\n                self.size_divisibility - image_size[1],\n                0,\n                self.size_divisibility - image_size[0],\n            ]\n            # pad image\n            image = F.pad(image, padding_size, value=128).contiguous()\n            # pad mask\n            masks = [F.pad(x, padding_size, value=0).contiguous() for x in masks]\n\n        image_shape = 
(image.shape[-2], image.shape[-1])  # h, w\n\n        # Pytorch's dataloader is efficient on torch.Tensor due to shared-memory,\n        # but not efficient on large generic data structures due to the use of pickle & mp.Queue.\n        # Therefore it's important to use torch.Tensor.\n        dataset_dict[\"image\"] = image\n\n        # Prepare per-category binary masks\n        instances = Instances(image_shape)\n        instances.gt_classes = classes\n        if len(masks) == 0:\n            # Some image does not have annotation (all ignored)\n            instances.gt_masks = torch.zeros((0, image.shape[-2], image.shape[-1]))\n        else:\n            masks = BitMasks(torch.stack(masks))\n            instances.gt_masks = masks.tensor\n\n        dataset_dict[\"instances\"] = instances\n\n        return dataset_dict\n"
  },
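The mappers above pad with `F.pad(image, [0, div - w, 0, div - h])`, i.e. only on the right and bottom, using PyTorch's `(left, right, top, bottom)` pad order. Note that `div - w` is only non-negative when the image already fits inside `size_divisibility` (in the released configs it is set to the crop size). A sketch of the pad-amount arithmetic, generalized to round up to the next multiple so it also covers larger inputs (the helper name is illustrative):

```python
def pad_amounts(height, width, size_divisibility):
    """Compute F.pad-style [left, right, top, bottom] amounts.

    As in the mappers above, padding is applied only on the right and bottom.
    Unlike the mapper's `div - size` arithmetic, this rounds up to the next
    multiple of size_divisibility, so it never produces negative amounts.
    """
    target_h = -(-height // size_divisibility) * size_divisibility  # ceil division
    target_w = -(-width // size_divisibility) * size_divisibility
    return [0, target_w - width, 0, target_h - height]

amounts = pad_amounts(517, 600, 32)  # pad width 600->608, height 517->544
```

The padded image is filled with value 128 and the masks with 0, matching the calls in `MaskFormerInstanceDatasetMapper.__call__`.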
  {
    "path": "mfvis_nococo/mask2former/data/dataset_mappers/mask_former_panoptic_dataset_mapper.py",
    "content": "import copy\nimport logging\n\nimport numpy as np\nimport torch\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.data import detection_utils as utils\nfrom detectron2.data import transforms as T\nfrom detectron2.structures import BitMasks, Instances\n\nfrom .mask_former_semantic_dataset_mapper import MaskFormerSemanticDatasetMapper\n\n__all__ = [\"MaskFormerPanopticDatasetMapper\"]\n\n\nclass MaskFormerPanopticDatasetMapper(MaskFormerSemanticDatasetMapper):\n    \"\"\"\n    A callable which takes a dataset dict in Detectron2 Dataset format,\n    and map it into a format used by MaskFormer for panoptic segmentation.\n\n    The callable currently does the following:\n\n    1. Read the image from \"file_name\"\n    2. Applies geometric transforms to the image and annotation\n    3. Find and applies suitable cropping to the image and annotation\n    4. Prepare image and annotation to Tensors\n    \"\"\"\n\n    @configurable\n    def __init__(\n        self,\n        is_train=True,\n        *,\n        augmentations,\n        image_format,\n        ignore_label,\n        size_divisibility,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            is_train: for training or inference\n            augmentations: a list of augmentations or deterministic transforms to apply\n            image_format: an image format supported by :func:`detection_utils.read_image`.\n            ignore_label: the label that is ignored to evaluation\n            size_divisibility: pad image size to be divisible by this value\n        \"\"\"\n        super().__init__(\n            is_train,\n            augmentations=augmentations,\n            image_format=image_format,\n            ignore_label=ignore_label,\n            size_divisibility=size_divisibility,\n        )\n\n    def __call__(self, dataset_dict):\n        \"\"\"\n        Args:\n            dataset_dict (dict): Metadata of 
one image, in Detectron2 Dataset format.\n\n        Returns:\n            dict: a format that builtin models in detectron2 accept\n        \"\"\"\n        assert self.is_train, \"MaskFormerPanopticDatasetMapper should only be used for training!\"\n\n        dataset_dict = copy.deepcopy(dataset_dict)  # it will be modified by code below\n        image = utils.read_image(dataset_dict[\"file_name\"], format=self.img_format)\n        utils.check_image_size(dataset_dict, image)\n\n        # semantic segmentation\n        if \"sem_seg_file_name\" in dataset_dict:\n            # PyTorch transformation not implemented for uint16, so converting it to double first\n            sem_seg_gt = utils.read_image(dataset_dict.pop(\"sem_seg_file_name\")).astype(\"double\")\n        else:\n            sem_seg_gt = None\n\n        # panoptic segmentation\n        if \"pan_seg_file_name\" in dataset_dict:\n            pan_seg_gt = utils.read_image(dataset_dict.pop(\"pan_seg_file_name\"), \"RGB\")\n            segments_info = dataset_dict[\"segments_info\"]\n        else:\n            pan_seg_gt = None\n            segments_info = None\n\n        if pan_seg_gt is None:\n            raise ValueError(\n                \"Cannot find 'pan_seg_file_name' for panoptic segmentation dataset {}.\".format(\n                    dataset_dict[\"file_name\"]\n                )\n            )\n\n        aug_input = T.AugInput(image, sem_seg=sem_seg_gt)\n        aug_input, transforms = T.apply_transform_gens(self.tfm_gens, aug_input)\n        image = aug_input.image\n        if sem_seg_gt is not None:\n            sem_seg_gt = aug_input.sem_seg\n\n        # apply the same transformation to panoptic segmentation\n        pan_seg_gt = transforms.apply_segmentation(pan_seg_gt)\n\n        from panopticapi.utils import rgb2id\n\n        pan_seg_gt = rgb2id(pan_seg_gt)\n\n        # Pad image and segmentation label here!\n        image = torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1)))\n        
if sem_seg_gt is not None:\n            sem_seg_gt = torch.as_tensor(sem_seg_gt.astype(\"long\"))\n        pan_seg_gt = torch.as_tensor(pan_seg_gt.astype(\"long\"))\n\n        if self.size_divisibility > 0:\n            image_size = (image.shape[-2], image.shape[-1])\n            padding_size = [\n                0,\n                self.size_divisibility - image_size[1],\n                0,\n                self.size_divisibility - image_size[0],\n            ]\n            image = F.pad(image, padding_size, value=128).contiguous()\n            if sem_seg_gt is not None:\n                sem_seg_gt = F.pad(sem_seg_gt, padding_size, value=self.ignore_label).contiguous()\n            pan_seg_gt = F.pad(\n                pan_seg_gt, padding_size, value=0\n            ).contiguous()  # 0 is the VOID panoptic label\n\n        image_shape = (image.shape[-2], image.shape[-1])  # h, w\n\n        # Pytorch's dataloader is efficient on torch.Tensor due to shared-memory,\n        # but not efficient on large generic data structures due to the use of pickle & mp.Queue.\n        # Therefore it's important to use torch.Tensor.\n        dataset_dict[\"image\"] = image\n        if sem_seg_gt is not None:\n            dataset_dict[\"sem_seg\"] = sem_seg_gt.long()\n\n        if \"annotations\" in dataset_dict:\n            raise ValueError(\"Panoptic segmentation dataset should not have 'annotations'.\")\n\n        # Prepare per-category binary masks\n        pan_seg_gt = pan_seg_gt.numpy()\n        instances = Instances(image_shape)\n        classes = []\n        masks = []\n        for segment_info in segments_info:\n            class_id = segment_info[\"category_id\"]\n            if not segment_info[\"iscrowd\"]:\n                classes.append(class_id)\n                masks.append(pan_seg_gt == segment_info[\"id\"])\n\n        classes = np.array(classes)\n        instances.gt_classes = torch.tensor(classes, dtype=torch.int64)\n        if len(masks) == 0:\n            # Some 
image does not have annotation (all ignored)\n            instances.gt_masks = torch.zeros((0, pan_seg_gt.shape[-2], pan_seg_gt.shape[-1]))\n        else:\n            masks = BitMasks(\n                torch.stack([torch.from_numpy(np.ascontiguousarray(x.copy())) for x in masks])\n            )\n            instances.gt_masks = masks.tensor\n\n        dataset_dict[\"instances\"] = instances\n\n        return dataset_dict\n"
  },
  {
    "path": "mfvis_nococo/mask2former/data/dataset_mappers/mask_former_semantic_dataset_mapper.py",
    "content": "import copy\nimport logging\n\nimport numpy as np\nimport torch\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.data import MetadataCatalog\nfrom detectron2.data import detection_utils as utils\nfrom detectron2.data import transforms as T\nfrom detectron2.projects.point_rend import ColorAugSSDTransform\nfrom detectron2.structures import BitMasks, Instances\n\n__all__ = [\"MaskFormerSemanticDatasetMapper\"]\n\n\nclass MaskFormerSemanticDatasetMapper:\n    \"\"\"\n    A callable which takes a dataset dict in Detectron2 Dataset format,\n    and map it into a format used by MaskFormer for semantic segmentation.\n\n    The callable currently does the following:\n\n    1. Read the image from \"file_name\"\n    2. Applies geometric transforms to the image and annotation\n    3. Find and applies suitable cropping to the image and annotation\n    4. Prepare image and annotation to Tensors\n    \"\"\"\n\n    @configurable\n    def __init__(\n        self,\n        is_train=True,\n        *,\n        augmentations,\n        image_format,\n        ignore_label,\n        size_divisibility,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            is_train: for training or inference\n            augmentations: a list of augmentations or deterministic transforms to apply\n            image_format: an image format supported by :func:`detection_utils.read_image`.\n            ignore_label: the label that is ignored to evaluation\n            size_divisibility: pad image size to be divisible by this value\n        \"\"\"\n        self.is_train = is_train\n        self.tfm_gens = augmentations\n        self.img_format = image_format\n        self.ignore_label = ignore_label\n        self.size_divisibility = size_divisibility\n\n        logger = logging.getLogger(__name__)\n        mode = \"training\" if is_train else \"inference\"\n        
logger.info(f\"[{self.__class__.__name__}] Augmentations used in {mode}: {augmentations}\")\n\n    @classmethod\n    def from_config(cls, cfg, is_train=True):\n        # Build augmentation\n        augs = [\n            T.ResizeShortestEdge(\n                cfg.INPUT.MIN_SIZE_TRAIN,\n                cfg.INPUT.MAX_SIZE_TRAIN,\n                cfg.INPUT.MIN_SIZE_TRAIN_SAMPLING,\n            )\n        ]\n        if cfg.INPUT.CROP.ENABLED:\n            augs.append(\n                T.RandomCrop_CategoryAreaConstraint(\n                    cfg.INPUT.CROP.TYPE,\n                    cfg.INPUT.CROP.SIZE,\n                    cfg.INPUT.CROP.SINGLE_CATEGORY_MAX_AREA,\n                    cfg.MODEL.SEM_SEG_HEAD.IGNORE_VALUE,\n                )\n            )\n        if cfg.INPUT.COLOR_AUG_SSD:\n            augs.append(ColorAugSSDTransform(img_format=cfg.INPUT.FORMAT))\n        augs.append(T.RandomFlip())\n\n        # Assume always applies to the training set.\n        dataset_names = cfg.DATASETS.TRAIN\n        meta = MetadataCatalog.get(dataset_names[0])\n        ignore_label = meta.ignore_label\n\n        ret = {\n            \"is_train\": is_train,\n            \"augmentations\": augs,\n            \"image_format\": cfg.INPUT.FORMAT,\n            \"ignore_label\": ignore_label,\n            \"size_divisibility\": cfg.INPUT.SIZE_DIVISIBILITY,\n        }\n        return ret\n\n    def __call__(self, dataset_dict):\n        \"\"\"\n        Args:\n            dataset_dict (dict): Metadata of one image, in Detectron2 Dataset format.\n\n        Returns:\n            dict: a format that builtin models in detectron2 accept\n        \"\"\"\n        assert self.is_train, \"MaskFormerSemanticDatasetMapper should only be used for training!\"\n\n        dataset_dict = copy.deepcopy(dataset_dict)  # it will be modified by code below\n        image = utils.read_image(dataset_dict[\"file_name\"], format=self.img_format)\n        utils.check_image_size(dataset_dict, image)\n\n        if 
\"sem_seg_file_name\" in dataset_dict:\n            # PyTorch transformation not implemented for uint16, so converting it to double first\n            sem_seg_gt = utils.read_image(dataset_dict.pop(\"sem_seg_file_name\")).astype(\"double\")\n        else:\n            sem_seg_gt = None\n\n        if sem_seg_gt is None:\n            raise ValueError(\n                \"Cannot find 'sem_seg_file_name' for semantic segmentation dataset {}.\".format(\n                    dataset_dict[\"file_name\"]\n                )\n            )\n\n        aug_input = T.AugInput(image, sem_seg=sem_seg_gt)\n        aug_input, transforms = T.apply_transform_gens(self.tfm_gens, aug_input)\n        image = aug_input.image\n        sem_seg_gt = aug_input.sem_seg\n\n        # Pad image and segmentation label here!\n        image = torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1)))\n        if sem_seg_gt is not None:\n            sem_seg_gt = torch.as_tensor(sem_seg_gt.astype(\"long\"))\n\n        if self.size_divisibility > 0:\n            image_size = (image.shape[-2], image.shape[-1])\n            padding_size = [\n                0,\n                self.size_divisibility - image_size[1],\n                0,\n                self.size_divisibility - image_size[0],\n            ]\n            image = F.pad(image, padding_size, value=128).contiguous()\n            if sem_seg_gt is not None:\n                sem_seg_gt = F.pad(sem_seg_gt, padding_size, value=self.ignore_label).contiguous()\n\n        image_shape = (image.shape[-2], image.shape[-1])  # h, w\n\n        # Pytorch's dataloader is efficient on torch.Tensor due to shared-memory,\n        # but not efficient on large generic data structures due to the use of pickle & mp.Queue.\n        # Therefore it's important to use torch.Tensor.\n        dataset_dict[\"image\"] = image\n\n        if sem_seg_gt is not None:\n            dataset_dict[\"sem_seg\"] = sem_seg_gt.long()\n\n        if \"annotations\" in dataset_dict:\n 
           raise ValueError(\"Semantic segmentation dataset should not have 'annotations'.\")\n\n        # Prepare per-category binary masks\n        if sem_seg_gt is not None:\n            sem_seg_gt = sem_seg_gt.numpy()\n            instances = Instances(image_shape)\n            classes = np.unique(sem_seg_gt)\n            # remove ignored region\n            classes = classes[classes != self.ignore_label]\n            instances.gt_classes = torch.tensor(classes, dtype=torch.int64)\n\n            masks = []\n            for class_id in classes:\n                masks.append(sem_seg_gt == class_id)\n\n            if len(masks) == 0:\n                # Some image does not have annotation (all ignored)\n                instances.gt_masks = torch.zeros((0, sem_seg_gt.shape[-2], sem_seg_gt.shape[-1]))\n            else:\n                masks = BitMasks(\n                    torch.stack([torch.from_numpy(np.ascontiguousarray(x.copy())) for x in masks])\n                )\n                instances.gt_masks = masks.tensor\n\n            dataset_dict[\"instances\"] = instances\n\n        return dataset_dict\n"
  },
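The mapper file above ends by converting a semantic label map into per-category binary masks (`gt_classes` plus a stacked mask tensor, with the ignore label dropped). A minimal standalone sketch of that conversion, using numpy only and a hypothetical toy label map rather than detectron2's `BitMasks`:

```python
import numpy as np

def semseg_to_binary_masks(sem_seg, ignore_label=255):
    """Split an (H, W) semantic label map into per-class binary masks,
    mirroring the per-category mask construction in the mapper above."""
    classes = np.unique(sem_seg)
    classes = classes[classes != ignore_label]  # drop the ignored region
    if len(classes) == 0:
        # image is fully ignored: empty (0, H, W) mask stack
        masks = np.zeros((0,) + sem_seg.shape, dtype=bool)
    else:
        masks = np.stack([sem_seg == c for c in classes])
    return classes, masks

# toy 2x3 label map with classes {0, 7} and one ignored pixel
sem_seg = np.array([[0, 0, 255],
                    [7, 7, 0]])
classes, masks = semseg_to_binary_masks(sem_seg)
# classes -> [0, 7]; masks has shape (2, 2, 3), one boolean mask per class
```

In the actual mapper the stacked boolean array is wrapped in `BitMasks` and stored as `instances.gt_masks`; the sketch keeps only the numpy logic.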
  {
    "path": "mfvis_nococo/mask2former/data/datasets/__init__.py",
    "content": "from . import (\n    register_ade20k_full,\n    register_ade20k_panoptic,\n    register_coco_stuff_10k,\n    register_mapillary_vistas,\n    register_coco_panoptic_annos_semseg,\n    register_ade20k_instance,\n    register_mapillary_vistas_panoptic,\n)\n"
  },
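Each `register_*` module imported above follows the same pattern: it registers a loader function per dataset split with detectron2's `DatasetCatalog` (and attaches class metadata via `MetadataCatalog`). A minimal sketch of that registration pattern with a stand-in catalog and hypothetical names, so it runs without detectron2; real code should use `detectron2.data.DatasetCatalog`:

```python
# Stand-in for detectron2's DatasetCatalog registration pattern (sketch only).
class _Catalog:
    def __init__(self):
        self._registry = {}

    def register(self, name, func):
        # loaders are stored lazily and only evaluated on get()
        assert name not in self._registry, f"{name} already registered"
        self._registry[name] = func

    def get(self, name):
        return self._registry[name]()

DatasetCatalog = _Catalog()

# Each register_* module does roughly this for each of its splits
# (names and records here are hypothetical):
DatasetCatalog.register(
    "ade20k_full_sem_seg_train",
    lambda: [{"file_name": "img.jpg", "sem_seg_file_name": "img.png"}],
)
records = DatasetCatalog.get("ade20k_full_sem_seg_train")
```

Registering a lazy loader (rather than the loaded records) keeps import of the `datasets` package cheap; the annotation files are only read when a split is actually used.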
  {
    "path": "mfvis_nococo/mask2former/data/datasets/register_ade20k_full.py",
    "content": "import os\n\nfrom detectron2.data import DatasetCatalog, MetadataCatalog\nfrom detectron2.data.datasets import load_sem_seg\n\nADE20K_SEM_SEG_FULL_CATEGORIES = [\n    {\"name\": \"wall\", \"id\": 2978, \"trainId\": 0},\n    {\"name\": \"building, edifice\", \"id\": 312, \"trainId\": 1},\n    {\"name\": \"sky\", \"id\": 2420, \"trainId\": 2},\n    {\"name\": \"tree\", \"id\": 2855, \"trainId\": 3},\n    {\"name\": \"road, route\", \"id\": 2131, \"trainId\": 4},\n    {\"name\": \"floor, flooring\", \"id\": 976, \"trainId\": 5},\n    {\"name\": \"ceiling\", \"id\": 447, \"trainId\": 6},\n    {\"name\": \"bed\", \"id\": 165, \"trainId\": 7},\n    {\"name\": \"sidewalk, pavement\", \"id\": 2377, \"trainId\": 8},\n    {\"name\": \"earth, ground\", \"id\": 838, \"trainId\": 9},\n    {\"name\": \"cabinet\", \"id\": 350, \"trainId\": 10},\n    {\"name\": \"person, individual, someone, somebody, mortal, soul\", \"id\": 1831, \"trainId\": 11},\n    {\"name\": \"grass\", \"id\": 1125, \"trainId\": 12},\n    {\"name\": \"windowpane, window\", \"id\": 3055, \"trainId\": 13},\n    {\"name\": \"car, auto, automobile, machine, motorcar\", \"id\": 401, \"trainId\": 14},\n    {\"name\": \"mountain, mount\", \"id\": 1610, \"trainId\": 15},\n    {\"name\": \"plant, flora, plant life\", \"id\": 1910, \"trainId\": 16},\n    {\"name\": \"table\", \"id\": 2684, \"trainId\": 17},\n    {\"name\": \"chair\", \"id\": 471, \"trainId\": 18},\n    {\"name\": \"curtain, drape, drapery, mantle, pall\", \"id\": 687, \"trainId\": 19},\n    {\"name\": \"door\", \"id\": 774, \"trainId\": 20},\n    {\"name\": \"sofa, couch, lounge\", \"id\": 2473, \"trainId\": 21},\n    {\"name\": \"sea\", \"id\": 2264, \"trainId\": 22},\n    {\"name\": \"painting, picture\", \"id\": 1735, \"trainId\": 23},\n    {\"name\": \"water\", \"id\": 2994, \"trainId\": 24},\n    {\"name\": \"mirror\", \"id\": 1564, \"trainId\": 25},\n    {\"name\": \"house\", \"id\": 1276, \"trainId\": 26},\n    {\"name\": \"rug, 
carpet, carpeting\", \"id\": 2178, \"trainId\": 27},\n    {\"name\": \"shelf\", \"id\": 2329, \"trainId\": 28},\n    {\"name\": \"armchair\", \"id\": 57, \"trainId\": 29},\n    {\"name\": \"fence, fencing\", \"id\": 907, \"trainId\": 30},\n    {\"name\": \"field\", \"id\": 913, \"trainId\": 31},\n    {\"name\": \"lamp\", \"id\": 1395, \"trainId\": 32},\n    {\"name\": \"rock, stone\", \"id\": 2138, \"trainId\": 33},\n    {\"name\": \"seat\", \"id\": 2272, \"trainId\": 34},\n    {\"name\": \"river\", \"id\": 2128, \"trainId\": 35},\n    {\"name\": \"desk\", \"id\": 724, \"trainId\": 36},\n    {\"name\": \"bathtub, bathing tub, bath, tub\", \"id\": 155, \"trainId\": 37},\n    {\"name\": \"railing, rail\", \"id\": 2053, \"trainId\": 38},\n    {\"name\": \"signboard, sign\", \"id\": 2380, \"trainId\": 39},\n    {\"name\": \"cushion\", \"id\": 689, \"trainId\": 40},\n    {\"name\": \"path\", \"id\": 1788, \"trainId\": 41},\n    {\"name\": \"work surface\", \"id\": 3087, \"trainId\": 42},\n    {\"name\": \"stairs, steps\", \"id\": 2530, \"trainId\": 43},\n    {\"name\": \"column, pillar\", \"id\": 581, \"trainId\": 44},\n    {\"name\": \"sink\", \"id\": 2388, \"trainId\": 45},\n    {\"name\": \"wardrobe, closet, press\", \"id\": 2985, \"trainId\": 46},\n    {\"name\": \"snow\", \"id\": 2454, \"trainId\": 47},\n    {\"name\": \"refrigerator, icebox\", \"id\": 2096, \"trainId\": 48},\n    {\"name\": \"base, pedestal, stand\", \"id\": 137, \"trainId\": 49},\n    {\"name\": \"bridge, span\", \"id\": 294, \"trainId\": 50},\n    {\"name\": \"blind, screen\", \"id\": 212, \"trainId\": 51},\n    {\"name\": \"runway\", \"id\": 2185, \"trainId\": 52},\n    {\"name\": \"cliff, drop, drop-off\", \"id\": 524, \"trainId\": 53},\n    {\"name\": \"sand\", \"id\": 2212, \"trainId\": 54},\n    {\"name\": \"fireplace, hearth, open fireplace\", \"id\": 943, \"trainId\": 55},\n    {\"name\": \"pillow\", \"id\": 1869, \"trainId\": 56},\n    {\"name\": \"screen door, screen\", \"id\": 2251, 
\"trainId\": 57},\n    {\"name\": \"toilet, can, commode, crapper, pot, potty, stool, throne\", \"id\": 2793, \"trainId\": 58},\n    {\"name\": \"skyscraper\", \"id\": 2423, \"trainId\": 59},\n    {\"name\": \"grandstand, covered stand\", \"id\": 1121, \"trainId\": 60},\n    {\"name\": \"box\", \"id\": 266, \"trainId\": 61},\n    {\"name\": \"pool table, billiard table, snooker table\", \"id\": 1948, \"trainId\": 62},\n    {\"name\": \"palm, palm tree\", \"id\": 1744, \"trainId\": 63},\n    {\"name\": \"double door\", \"id\": 783, \"trainId\": 64},\n    {\"name\": \"coffee table, cocktail table\", \"id\": 571, \"trainId\": 65},\n    {\"name\": \"counter\", \"id\": 627, \"trainId\": 66},\n    {\"name\": \"countertop\", \"id\": 629, \"trainId\": 67},\n    {\"name\": \"chest of drawers, chest, bureau, dresser\", \"id\": 491, \"trainId\": 68},\n    {\"name\": \"kitchen island\", \"id\": 1374, \"trainId\": 69},\n    {\"name\": \"boat\", \"id\": 223, \"trainId\": 70},\n    {\"name\": \"waterfall, falls\", \"id\": 3016, \"trainId\": 71},\n    {\n        \"name\": \"stove, kitchen stove, range, kitchen range, cooking stove\",\n        \"id\": 2598,\n        \"trainId\": 72,\n    },\n    {\"name\": \"flower\", \"id\": 978, \"trainId\": 73},\n    {\"name\": \"bookcase\", \"id\": 239, \"trainId\": 74},\n    {\"name\": \"controls\", \"id\": 608, \"trainId\": 75},\n    {\"name\": \"book\", \"id\": 236, \"trainId\": 76},\n    {\"name\": \"stairway, staircase\", \"id\": 2531, \"trainId\": 77},\n    {\"name\": \"streetlight, street lamp\", \"id\": 2616, \"trainId\": 78},\n    {\n        \"name\": \"computer, computing machine, computing device, data processor, electronic computer, information processing system\",\n        \"id\": 591,\n        \"trainId\": 79,\n    },\n    {\n        \"name\": \"bus, autobus, coach, charabanc, double-decker, jitney, motorbus, motorcoach, omnibus, passenger vehicle\",\n        \"id\": 327,\n        \"trainId\": 80,\n    },\n    {\"name\": \"swivel 
chair\", \"id\": 2679, \"trainId\": 81},\n    {\"name\": \"light, light source\", \"id\": 1451, \"trainId\": 82},\n    {\"name\": \"bench\", \"id\": 181, \"trainId\": 83},\n    {\"name\": \"case, display case, showcase, vitrine\", \"id\": 420, \"trainId\": 84},\n    {\"name\": \"towel\", \"id\": 2821, \"trainId\": 85},\n    {\"name\": \"fountain\", \"id\": 1023, \"trainId\": 86},\n    {\"name\": \"embankment\", \"id\": 855, \"trainId\": 87},\n    {\n        \"name\": \"television receiver, television, television set, tv, tv set, idiot box, boob tube, telly, goggle box\",\n        \"id\": 2733,\n        \"trainId\": 88,\n    },\n    {\"name\": \"van\", \"id\": 2928, \"trainId\": 89},\n    {\"name\": \"hill\", \"id\": 1240, \"trainId\": 90},\n    {\"name\": \"awning, sunshade, sunblind\", \"id\": 77, \"trainId\": 91},\n    {\"name\": \"poster, posting, placard, notice, bill, card\", \"id\": 1969, \"trainId\": 92},\n    {\"name\": \"truck, motortruck\", \"id\": 2880, \"trainId\": 93},\n    {\"name\": \"airplane, aeroplane, plane\", \"id\": 14, \"trainId\": 94},\n    {\"name\": \"pole\", \"id\": 1936, \"trainId\": 95},\n    {\"name\": \"tower\", \"id\": 2828, \"trainId\": 96},\n    {\"name\": \"court\", \"id\": 631, \"trainId\": 97},\n    {\"name\": \"ball\", \"id\": 103, \"trainId\": 98},\n    {\n        \"name\": \"aircraft carrier, carrier, flattop, attack aircraft carrier\",\n        \"id\": 3144,\n        \"trainId\": 99,\n    },\n    {\"name\": \"buffet, counter, sideboard\", \"id\": 308, \"trainId\": 100},\n    {\"name\": \"hovel, hut, hutch, shack, shanty\", \"id\": 1282, \"trainId\": 101},\n    {\"name\": \"apparel, wearing apparel, dress, clothes\", \"id\": 38, \"trainId\": 102},\n    {\"name\": \"minibike, motorbike\", \"id\": 1563, \"trainId\": 103},\n    {\"name\": \"animal, animate being, beast, brute, creature, fauna\", \"id\": 29, \"trainId\": 104},\n    {\"name\": \"chandelier, pendant, pendent\", \"id\": 480, \"trainId\": 105},\n    {\"name\": \"step, 
stair\", \"id\": 2569, \"trainId\": 106},\n    {\"name\": \"booth, cubicle, stall, kiosk\", \"id\": 247, \"trainId\": 107},\n    {\"name\": \"bicycle, bike, wheel, cycle\", \"id\": 187, \"trainId\": 108},\n    {\"name\": \"doorframe, doorcase\", \"id\": 778, \"trainId\": 109},\n    {\"name\": \"sconce\", \"id\": 2243, \"trainId\": 110},\n    {\"name\": \"pond\", \"id\": 1941, \"trainId\": 111},\n    {\"name\": \"trade name, brand name, brand, marque\", \"id\": 2833, \"trainId\": 112},\n    {\"name\": \"bannister, banister, balustrade, balusters, handrail\", \"id\": 120, \"trainId\": 113},\n    {\"name\": \"bag\", \"id\": 95, \"trainId\": 114},\n    {\"name\": \"traffic light, traffic signal, stoplight\", \"id\": 2836, \"trainId\": 115},\n    {\"name\": \"gazebo\", \"id\": 1087, \"trainId\": 116},\n    {\"name\": \"escalator, moving staircase, moving stairway\", \"id\": 868, \"trainId\": 117},\n    {\"name\": \"land, ground, soil\", \"id\": 1401, \"trainId\": 118},\n    {\"name\": \"board, plank\", \"id\": 220, \"trainId\": 119},\n    {\"name\": \"arcade machine\", \"id\": 47, \"trainId\": 120},\n    {\"name\": \"eiderdown, duvet, continental quilt\", \"id\": 843, \"trainId\": 121},\n    {\"name\": \"bar\", \"id\": 123, \"trainId\": 122},\n    {\"name\": \"stall, stand, sales booth\", \"id\": 2537, \"trainId\": 123},\n    {\"name\": \"playground\", \"id\": 1927, \"trainId\": 124},\n    {\"name\": \"ship\", \"id\": 2337, \"trainId\": 125},\n    {\"name\": \"ottoman, pouf, pouffe, puff, hassock\", \"id\": 1702, \"trainId\": 126},\n    {\n        \"name\": \"ashcan, trash can, garbage can, wastebin, ash bin, ash-bin, ashbin, dustbin, trash barrel, trash bin\",\n        \"id\": 64,\n        \"trainId\": 127,\n    },\n    {\"name\": \"bottle\", \"id\": 249, \"trainId\": 128},\n    {\"name\": \"cradle\", \"id\": 642, \"trainId\": 129},\n    {\"name\": \"pot, flowerpot\", \"id\": 1981, \"trainId\": 130},\n    {\n        \"name\": \"conveyer belt, conveyor belt, conveyer, 
conveyor, transporter\",\n        \"id\": 609,\n        \"trainId\": 131,\n    },\n    {\"name\": \"train, railroad train\", \"id\": 2840, \"trainId\": 132},\n    {\"name\": \"stool\", \"id\": 2586, \"trainId\": 133},\n    {\"name\": \"lake\", \"id\": 1393, \"trainId\": 134},\n    {\"name\": \"tank, storage tank\", \"id\": 2704, \"trainId\": 135},\n    {\"name\": \"ice, water ice\", \"id\": 1304, \"trainId\": 136},\n    {\"name\": \"basket, handbasket\", \"id\": 146, \"trainId\": 137},\n    {\"name\": \"manhole\", \"id\": 1494, \"trainId\": 138},\n    {\"name\": \"tent, collapsible shelter\", \"id\": 2739, \"trainId\": 139},\n    {\"name\": \"canopy\", \"id\": 389, \"trainId\": 140},\n    {\"name\": \"microwave, microwave oven\", \"id\": 1551, \"trainId\": 141},\n    {\"name\": \"barrel, cask\", \"id\": 131, \"trainId\": 142},\n    {\"name\": \"dirt track\", \"id\": 738, \"trainId\": 143},\n    {\"name\": \"beam\", \"id\": 161, \"trainId\": 144},\n    {\"name\": \"dishwasher, dish washer, dishwashing machine\", \"id\": 747, \"trainId\": 145},\n    {\"name\": \"plate\", \"id\": 1919, \"trainId\": 146},\n    {\"name\": \"screen, crt screen\", \"id\": 3109, \"trainId\": 147},\n    {\"name\": \"ruins\", \"id\": 2179, \"trainId\": 148},\n    {\"name\": \"washer, automatic washer, washing machine\", \"id\": 2989, \"trainId\": 149},\n    {\"name\": \"blanket, cover\", \"id\": 206, \"trainId\": 150},\n    {\"name\": \"plaything, toy\", \"id\": 1930, \"trainId\": 151},\n    {\"name\": \"food, solid food\", \"id\": 1002, \"trainId\": 152},\n    {\"name\": \"screen, silver screen, projection screen\", \"id\": 2254, \"trainId\": 153},\n    {\"name\": \"oven\", \"id\": 1708, \"trainId\": 154},\n    {\"name\": \"stage\", \"id\": 2526, \"trainId\": 155},\n    {\"name\": \"beacon, lighthouse, beacon light, pharos\", \"id\": 160, \"trainId\": 156},\n    {\"name\": \"umbrella\", \"id\": 2901, \"trainId\": 157},\n    {\"name\": \"sculpture\", \"id\": 2262, \"trainId\": 158},\n    
{\"name\": \"aqueduct\", \"id\": 44, \"trainId\": 159},\n    {\"name\": \"container\", \"id\": 597, \"trainId\": 160},\n    {\"name\": \"scaffolding, staging\", \"id\": 2235, \"trainId\": 161},\n    {\"name\": \"hood, exhaust hood\", \"id\": 1260, \"trainId\": 162},\n    {\"name\": \"curb, curbing, kerb\", \"id\": 682, \"trainId\": 163},\n    {\"name\": \"roller coaster\", \"id\": 2151, \"trainId\": 164},\n    {\"name\": \"horse, equus caballus\", \"id\": 3107, \"trainId\": 165},\n    {\"name\": \"catwalk\", \"id\": 432, \"trainId\": 166},\n    {\"name\": \"glass, drinking glass\", \"id\": 1098, \"trainId\": 167},\n    {\"name\": \"vase\", \"id\": 2932, \"trainId\": 168},\n    {\"name\": \"central reservation\", \"id\": 461, \"trainId\": 169},\n    {\"name\": \"carousel\", \"id\": 410, \"trainId\": 170},\n    {\"name\": \"radiator\", \"id\": 2046, \"trainId\": 171},\n    {\"name\": \"closet\", \"id\": 533, \"trainId\": 172},\n    {\"name\": \"machine\", \"id\": 1481, \"trainId\": 173},\n    {\"name\": \"pier, wharf, wharfage, dock\", \"id\": 1858, \"trainId\": 174},\n    {\"name\": \"fan\", \"id\": 894, \"trainId\": 175},\n    {\"name\": \"inflatable bounce game\", \"id\": 1322, \"trainId\": 176},\n    {\"name\": \"pitch\", \"id\": 1891, \"trainId\": 177},\n    {\"name\": \"paper\", \"id\": 1756, \"trainId\": 178},\n    {\"name\": \"arcade, colonnade\", \"id\": 49, \"trainId\": 179},\n    {\"name\": \"hot tub\", \"id\": 1272, \"trainId\": 180},\n    {\"name\": \"helicopter\", \"id\": 1229, \"trainId\": 181},\n    {\"name\": \"tray\", \"id\": 2850, \"trainId\": 182},\n    {\"name\": \"partition, divider\", \"id\": 1784, \"trainId\": 183},\n    {\"name\": \"vineyard\", \"id\": 2962, \"trainId\": 184},\n    {\"name\": \"bowl\", \"id\": 259, \"trainId\": 185},\n    {\"name\": \"bullring\", \"id\": 319, \"trainId\": 186},\n    {\"name\": \"flag\", \"id\": 954, \"trainId\": 187},\n    {\"name\": \"pot\", \"id\": 1974, \"trainId\": 188},\n    {\"name\": \"footbridge, 
overcrossing, pedestrian bridge\", \"id\": 1013, \"trainId\": 189},\n    {\"name\": \"shower\", \"id\": 2356, \"trainId\": 190},\n    {\"name\": \"bag, traveling bag, travelling bag, grip, suitcase\", \"id\": 97, \"trainId\": 191},\n    {\"name\": \"bulletin board, notice board\", \"id\": 318, \"trainId\": 192},\n    {\"name\": \"confessional booth\", \"id\": 592, \"trainId\": 193},\n    {\"name\": \"trunk, tree trunk, bole\", \"id\": 2885, \"trainId\": 194},\n    {\"name\": \"forest\", \"id\": 1017, \"trainId\": 195},\n    {\"name\": \"elevator door\", \"id\": 851, \"trainId\": 196},\n    {\"name\": \"laptop, laptop computer\", \"id\": 1407, \"trainId\": 197},\n    {\"name\": \"instrument panel\", \"id\": 1332, \"trainId\": 198},\n    {\"name\": \"bucket, pail\", \"id\": 303, \"trainId\": 199},\n    {\"name\": \"tapestry, tapis\", \"id\": 2714, \"trainId\": 200},\n    {\"name\": \"platform\", \"id\": 1924, \"trainId\": 201},\n    {\"name\": \"jacket\", \"id\": 1346, \"trainId\": 202},\n    {\"name\": \"gate\", \"id\": 1081, \"trainId\": 203},\n    {\"name\": \"monitor, monitoring device\", \"id\": 1583, \"trainId\": 204},\n    {\n        \"name\": \"telephone booth, phone booth, call box, telephone box, telephone kiosk\",\n        \"id\": 2727,\n        \"trainId\": 205,\n    },\n    {\"name\": \"spotlight, spot\", \"id\": 2509, \"trainId\": 206},\n    {\"name\": \"ring\", \"id\": 2123, \"trainId\": 207},\n    {\"name\": \"control panel\", \"id\": 602, \"trainId\": 208},\n    {\"name\": \"blackboard, chalkboard\", \"id\": 202, \"trainId\": 209},\n    {\"name\": \"air conditioner, air conditioning\", \"id\": 10, \"trainId\": 210},\n    {\"name\": \"chest\", \"id\": 490, \"trainId\": 211},\n    {\"name\": \"clock\", \"id\": 530, \"trainId\": 212},\n    {\"name\": \"sand dune\", \"id\": 2213, \"trainId\": 213},\n    {\"name\": \"pipe, pipage, piping\", \"id\": 1884, \"trainId\": 214},\n    {\"name\": \"vault\", \"id\": 2934, \"trainId\": 215},\n    {\"name\": \"table 
football\", \"id\": 2687, \"trainId\": 216},\n    {\"name\": \"cannon\", \"id\": 387, \"trainId\": 217},\n    {\"name\": \"swimming pool, swimming bath, natatorium\", \"id\": 2668, \"trainId\": 218},\n    {\"name\": \"fluorescent, fluorescent fixture\", \"id\": 982, \"trainId\": 219},\n    {\"name\": \"statue\", \"id\": 2547, \"trainId\": 220},\n    {\n        \"name\": \"loudspeaker, speaker, speaker unit, loudspeaker system, speaker system\",\n        \"id\": 1474,\n        \"trainId\": 221,\n    },\n    {\"name\": \"exhibitor\", \"id\": 877, \"trainId\": 222},\n    {\"name\": \"ladder\", \"id\": 1391, \"trainId\": 223},\n    {\"name\": \"carport\", \"id\": 414, \"trainId\": 224},\n    {\"name\": \"dam\", \"id\": 698, \"trainId\": 225},\n    {\"name\": \"pulpit\", \"id\": 2019, \"trainId\": 226},\n    {\"name\": \"skylight, fanlight\", \"id\": 2422, \"trainId\": 227},\n    {\"name\": \"water tower\", \"id\": 3010, \"trainId\": 228},\n    {\"name\": \"grill, grille, grillwork\", \"id\": 1139, \"trainId\": 229},\n    {\"name\": \"display board\", \"id\": 753, \"trainId\": 230},\n    {\"name\": \"pane, pane of glass, window glass\", \"id\": 1747, \"trainId\": 231},\n    {\"name\": \"rubbish, trash, scrap\", \"id\": 2175, \"trainId\": 232},\n    {\"name\": \"ice rink\", \"id\": 1301, \"trainId\": 233},\n    {\"name\": \"fruit\", \"id\": 1033, \"trainId\": 234},\n    {\"name\": \"patio\", \"id\": 1789, \"trainId\": 235},\n    {\"name\": \"vending machine\", \"id\": 2939, \"trainId\": 236},\n    {\"name\": \"telephone, phone, telephone set\", \"id\": 2730, \"trainId\": 237},\n    {\"name\": \"net\", \"id\": 1652, \"trainId\": 238},\n    {\n        \"name\": \"backpack, back pack, knapsack, packsack, rucksack, haversack\",\n        \"id\": 90,\n        \"trainId\": 239,\n    },\n    {\"name\": \"jar\", \"id\": 1349, \"trainId\": 240},\n    {\"name\": \"track\", \"id\": 2830, \"trainId\": 241},\n    {\"name\": \"magazine\", \"id\": 1485, \"trainId\": 242},\n    
{\"name\": \"shutter\", \"id\": 2370, \"trainId\": 243},\n    {\"name\": \"roof\", \"id\": 2155, \"trainId\": 244},\n    {\"name\": \"banner, streamer\", \"id\": 118, \"trainId\": 245},\n    {\"name\": \"landfill\", \"id\": 1402, \"trainId\": 246},\n    {\"name\": \"post\", \"id\": 1957, \"trainId\": 247},\n    {\"name\": \"altarpiece, reredos\", \"id\": 3130, \"trainId\": 248},\n    {\"name\": \"hat, chapeau, lid\", \"id\": 1197, \"trainId\": 249},\n    {\"name\": \"arch, archway\", \"id\": 52, \"trainId\": 250},\n    {\"name\": \"table game\", \"id\": 2688, \"trainId\": 251},\n    {\"name\": \"bag, handbag, pocketbook, purse\", \"id\": 96, \"trainId\": 252},\n    {\"name\": \"document, written document, papers\", \"id\": 762, \"trainId\": 253},\n    {\"name\": \"dome\", \"id\": 772, \"trainId\": 254},\n    {\"name\": \"pier\", \"id\": 1857, \"trainId\": 255},\n    {\"name\": \"shanties\", \"id\": 2315, \"trainId\": 256},\n    {\"name\": \"forecourt\", \"id\": 1016, \"trainId\": 257},\n    {\"name\": \"crane\", \"id\": 643, \"trainId\": 258},\n    {\"name\": \"dog, domestic dog, canis familiaris\", \"id\": 3105, \"trainId\": 259},\n    {\"name\": \"piano, pianoforte, forte-piano\", \"id\": 1849, \"trainId\": 260},\n    {\"name\": \"drawing\", \"id\": 791, \"trainId\": 261},\n    {\"name\": \"cabin\", \"id\": 349, \"trainId\": 262},\n    {\n        \"name\": \"ad, advertisement, advertizement, advertising, advertizing, advert\",\n        \"id\": 6,\n        \"trainId\": 263,\n    },\n    {\"name\": \"amphitheater, amphitheatre, coliseum\", \"id\": 3114, \"trainId\": 264},\n    {\"name\": \"monument\", \"id\": 1587, \"trainId\": 265},\n    {\"name\": \"henhouse\", \"id\": 1233, \"trainId\": 266},\n    {\"name\": \"cockpit\", \"id\": 559, \"trainId\": 267},\n    {\"name\": \"heater, warmer\", \"id\": 1223, \"trainId\": 268},\n    {\"name\": \"windmill, aerogenerator, wind generator\", \"id\": 3049, \"trainId\": 269},\n    {\"name\": \"pool\", \"id\": 1943, 
\"trainId\": 270},\n    {\"name\": \"elevator, lift\", \"id\": 853, \"trainId\": 271},\n    {\"name\": \"decoration, ornament, ornamentation\", \"id\": 709, \"trainId\": 272},\n    {\"name\": \"labyrinth\", \"id\": 1390, \"trainId\": 273},\n    {\"name\": \"text, textual matter\", \"id\": 2748, \"trainId\": 274},\n    {\"name\": \"printer\", \"id\": 2007, \"trainId\": 275},\n    {\"name\": \"mezzanine, first balcony\", \"id\": 1546, \"trainId\": 276},\n    {\"name\": \"mattress\", \"id\": 1513, \"trainId\": 277},\n    {\"name\": \"straw\", \"id\": 2600, \"trainId\": 278},\n    {\"name\": \"stalls\", \"id\": 2538, \"trainId\": 279},\n    {\"name\": \"patio, terrace\", \"id\": 1790, \"trainId\": 280},\n    {\"name\": \"billboard, hoarding\", \"id\": 194, \"trainId\": 281},\n    {\"name\": \"bus stop\", \"id\": 326, \"trainId\": 282},\n    {\"name\": \"trouser, pant\", \"id\": 2877, \"trainId\": 283},\n    {\"name\": \"console table, console\", \"id\": 594, \"trainId\": 284},\n    {\"name\": \"rack\", \"id\": 2036, \"trainId\": 285},\n    {\"name\": \"notebook\", \"id\": 1662, \"trainId\": 286},\n    {\"name\": \"shrine\", \"id\": 2366, \"trainId\": 287},\n    {\"name\": \"pantry\", \"id\": 1754, \"trainId\": 288},\n    {\"name\": \"cart\", \"id\": 418, \"trainId\": 289},\n    {\"name\": \"steam shovel\", \"id\": 2553, \"trainId\": 290},\n    {\"name\": \"porch\", \"id\": 1951, \"trainId\": 291},\n    {\"name\": \"postbox, mailbox, letter box\", \"id\": 1963, \"trainId\": 292},\n    {\"name\": \"figurine, statuette\", \"id\": 918, \"trainId\": 293},\n    {\"name\": \"recycling bin\", \"id\": 2086, \"trainId\": 294},\n    {\"name\": \"folding screen\", \"id\": 997, \"trainId\": 295},\n    {\"name\": \"telescope\", \"id\": 2731, \"trainId\": 296},\n    {\"name\": \"deck chair, beach chair\", \"id\": 704, \"trainId\": 297},\n    {\"name\": \"kennel\", \"id\": 1365, \"trainId\": 298},\n    {\"name\": \"coffee maker\", \"id\": 569, \"trainId\": 299},\n    {\"name\": 
\"altar, communion table, lord's table\", \"id\": 3108, \"trainId\": 300},\n    {\"name\": \"fish\", \"id\": 948, \"trainId\": 301},\n    {\"name\": \"easel\", \"id\": 839, \"trainId\": 302},\n    {\"name\": \"artificial golf green\", \"id\": 63, \"trainId\": 303},\n    {\"name\": \"iceberg\", \"id\": 1305, \"trainId\": 304},\n    {\"name\": \"candlestick, candle holder\", \"id\": 378, \"trainId\": 305},\n    {\"name\": \"shower stall, shower bath\", \"id\": 2362, \"trainId\": 306},\n    {\"name\": \"television stand\", \"id\": 2734, \"trainId\": 307},\n    {\n        \"name\": \"wall socket, wall plug, electric outlet, electrical outlet, outlet, electric receptacle\",\n        \"id\": 2982,\n        \"trainId\": 308,\n    },\n    {\"name\": \"skeleton\", \"id\": 2398, \"trainId\": 309},\n    {\"name\": \"grand piano, grand\", \"id\": 1119, \"trainId\": 310},\n    {\"name\": \"candy, confect\", \"id\": 382, \"trainId\": 311},\n    {\"name\": \"grille door\", \"id\": 1141, \"trainId\": 312},\n    {\"name\": \"pedestal, plinth, footstall\", \"id\": 1805, \"trainId\": 313},\n    {\"name\": \"jersey, t-shirt, tee shirt\", \"id\": 3102, \"trainId\": 314},\n    {\"name\": \"shoe\", \"id\": 2341, \"trainId\": 315},\n    {\"name\": \"gravestone, headstone, tombstone\", \"id\": 1131, \"trainId\": 316},\n    {\"name\": \"shanty\", \"id\": 2316, \"trainId\": 317},\n    {\"name\": \"structure\", \"id\": 2626, \"trainId\": 318},\n    {\"name\": \"rocking chair, rocker\", \"id\": 3104, \"trainId\": 319},\n    {\"name\": \"bird\", \"id\": 198, \"trainId\": 320},\n    {\"name\": \"place mat\", \"id\": 1896, \"trainId\": 321},\n    {\"name\": \"tomb\", \"id\": 2800, \"trainId\": 322},\n    {\"name\": \"big top\", \"id\": 190, \"trainId\": 323},\n    {\"name\": \"gas pump, gasoline pump, petrol pump, island dispenser\", \"id\": 3131, \"trainId\": 324},\n    {\"name\": \"lockers\", \"id\": 1463, \"trainId\": 325},\n    {\"name\": \"cage\", \"id\": 357, \"trainId\": 326},\n    
{\"name\": \"finger\", \"id\": 929, \"trainId\": 327},\n    {\"name\": \"bleachers\", \"id\": 209, \"trainId\": 328},\n    {\"name\": \"ferris wheel\", \"id\": 912, \"trainId\": 329},\n    {\"name\": \"hairdresser chair\", \"id\": 1164, \"trainId\": 330},\n    {\"name\": \"mat\", \"id\": 1509, \"trainId\": 331},\n    {\"name\": \"stands\", \"id\": 2539, \"trainId\": 332},\n    {\"name\": \"aquarium, fish tank, marine museum\", \"id\": 3116, \"trainId\": 333},\n    {\"name\": \"streetcar, tram, tramcar, trolley, trolley car\", \"id\": 2615, \"trainId\": 334},\n    {\"name\": \"napkin, table napkin, serviette\", \"id\": 1644, \"trainId\": 335},\n    {\"name\": \"dummy\", \"id\": 818, \"trainId\": 336},\n    {\"name\": \"booklet, brochure, folder, leaflet, pamphlet\", \"id\": 242, \"trainId\": 337},\n    {\"name\": \"sand trap\", \"id\": 2217, \"trainId\": 338},\n    {\"name\": \"shop, store\", \"id\": 2347, \"trainId\": 339},\n    {\"name\": \"table cloth\", \"id\": 2686, \"trainId\": 340},\n    {\"name\": \"service station\", \"id\": 2300, \"trainId\": 341},\n    {\"name\": \"coffin\", \"id\": 572, \"trainId\": 342},\n    {\"name\": \"drawer\", \"id\": 789, \"trainId\": 343},\n    {\"name\": \"cages\", \"id\": 358, \"trainId\": 344},\n    {\"name\": \"slot machine, coin machine\", \"id\": 2443, \"trainId\": 345},\n    {\"name\": \"balcony\", \"id\": 101, \"trainId\": 346},\n    {\"name\": \"volleyball court\", \"id\": 2969, \"trainId\": 347},\n    {\"name\": \"table tennis\", \"id\": 2692, \"trainId\": 348},\n    {\"name\": \"control table\", \"id\": 606, \"trainId\": 349},\n    {\"name\": \"shirt\", \"id\": 2339, \"trainId\": 350},\n    {\"name\": \"merchandise, ware, product\", \"id\": 1533, \"trainId\": 351},\n    {\"name\": \"railway\", \"id\": 2060, \"trainId\": 352},\n    {\"name\": \"parterre\", \"id\": 1782, \"trainId\": 353},\n    {\"name\": \"chimney\", \"id\": 495, \"trainId\": 354},\n    {\"name\": \"can, tin, tin can\", \"id\": 371, \"trainId\": 355},\n 
   {\"name\": \"tanks\", \"id\": 2707, \"trainId\": 356},\n    {\"name\": \"fabric, cloth, material, textile\", \"id\": 889, \"trainId\": 357},\n    {\"name\": \"alga, algae\", \"id\": 3156, \"trainId\": 358},\n    {\"name\": \"system\", \"id\": 2683, \"trainId\": 359},\n    {\"name\": \"map\", \"id\": 1499, \"trainId\": 360},\n    {\"name\": \"greenhouse\", \"id\": 1135, \"trainId\": 361},\n    {\"name\": \"mug\", \"id\": 1619, \"trainId\": 362},\n    {\"name\": \"barbecue\", \"id\": 125, \"trainId\": 363},\n    {\"name\": \"trailer\", \"id\": 2838, \"trainId\": 364},\n    {\"name\": \"toilet tissue, toilet paper, bathroom tissue\", \"id\": 2792, \"trainId\": 365},\n    {\"name\": \"organ\", \"id\": 1695, \"trainId\": 366},\n    {\"name\": \"dishrag, dishcloth\", \"id\": 746, \"trainId\": 367},\n    {\"name\": \"island\", \"id\": 1343, \"trainId\": 368},\n    {\"name\": \"keyboard\", \"id\": 1370, \"trainId\": 369},\n    {\"name\": \"trench\", \"id\": 2858, \"trainId\": 370},\n    {\"name\": \"basket, basketball hoop, hoop\", \"id\": 145, \"trainId\": 371},\n    {\"name\": \"steering wheel, wheel\", \"id\": 2565, \"trainId\": 372},\n    {\"name\": \"pitcher, ewer\", \"id\": 1892, \"trainId\": 373},\n    {\"name\": \"goal\", \"id\": 1103, \"trainId\": 374},\n    {\"name\": \"bread, breadstuff, staff of life\", \"id\": 286, \"trainId\": 375},\n    {\"name\": \"beds\", \"id\": 170, \"trainId\": 376},\n    {\"name\": \"wood\", \"id\": 3073, \"trainId\": 377},\n    {\"name\": \"file cabinet\", \"id\": 922, \"trainId\": 378},\n    {\"name\": \"newspaper, paper\", \"id\": 1655, \"trainId\": 379},\n    {\"name\": \"motorboat\", \"id\": 1602, \"trainId\": 380},\n    {\"name\": \"rope\", \"id\": 2160, \"trainId\": 381},\n    {\"name\": \"guitar\", \"id\": 1151, \"trainId\": 382},\n    {\"name\": \"rubble\", \"id\": 2176, \"trainId\": 383},\n    {\"name\": \"scarf\", \"id\": 2239, \"trainId\": 384},\n    {\"name\": \"barrels\", \"id\": 132, \"trainId\": 385},\n    {\"name\": 
\"cap\", \"id\": 394, \"trainId\": 386},\n    {\"name\": \"leaves\", \"id\": 1424, \"trainId\": 387},\n    {\"name\": \"control tower\", \"id\": 607, \"trainId\": 388},\n    {\"name\": \"dashboard\", \"id\": 700, \"trainId\": 389},\n    {\"name\": \"bandstand\", \"id\": 116, \"trainId\": 390},\n    {\"name\": \"lectern\", \"id\": 1425, \"trainId\": 391},\n    {\"name\": \"switch, electric switch, electrical switch\", \"id\": 2676, \"trainId\": 392},\n    {\"name\": \"baseboard, mopboard, skirting board\", \"id\": 141, \"trainId\": 393},\n    {\"name\": \"shower room\", \"id\": 2360, \"trainId\": 394},\n    {\"name\": \"smoke\", \"id\": 2449, \"trainId\": 395},\n    {\"name\": \"faucet, spigot\", \"id\": 897, \"trainId\": 396},\n    {\"name\": \"bulldozer\", \"id\": 317, \"trainId\": 397},\n    {\"name\": \"saucepan\", \"id\": 2228, \"trainId\": 398},\n    {\"name\": \"shops\", \"id\": 2351, \"trainId\": 399},\n    {\"name\": \"meter\", \"id\": 1543, \"trainId\": 400},\n    {\"name\": \"crevasse\", \"id\": 656, \"trainId\": 401},\n    {\"name\": \"gear\", \"id\": 1088, \"trainId\": 402},\n    {\"name\": \"candelabrum, candelabra\", \"id\": 373, \"trainId\": 403},\n    {\"name\": \"sofa bed\", \"id\": 2472, \"trainId\": 404},\n    {\"name\": \"tunnel\", \"id\": 2892, \"trainId\": 405},\n    {\"name\": \"pallet\", \"id\": 1740, \"trainId\": 406},\n    {\"name\": \"wire, conducting wire\", \"id\": 3067, \"trainId\": 407},\n    {\"name\": \"kettle, boiler\", \"id\": 1367, \"trainId\": 408},\n    {\"name\": \"bidet\", \"id\": 188, \"trainId\": 409},\n    {\n        \"name\": \"baby buggy, baby carriage, carriage, perambulator, pram, stroller, go-cart, pushchair, pusher\",\n        \"id\": 79,\n        \"trainId\": 410,\n    },\n    {\"name\": \"music stand\", \"id\": 1633, \"trainId\": 411},\n    {\"name\": \"pipe, tube\", \"id\": 1885, \"trainId\": 412},\n    {\"name\": \"cup\", \"id\": 677, \"trainId\": 413},\n    {\"name\": \"parking meter\", \"id\": 1779, 
\"trainId\": 414},\n    {\"name\": \"ice hockey rink\", \"id\": 1297, \"trainId\": 415},\n    {\"name\": \"shelter\", \"id\": 2334, \"trainId\": 416},\n    {\"name\": \"weeds\", \"id\": 3027, \"trainId\": 417},\n    {\"name\": \"temple\", \"id\": 2735, \"trainId\": 418},\n    {\"name\": \"patty, cake\", \"id\": 1791, \"trainId\": 419},\n    {\"name\": \"ski slope\", \"id\": 2405, \"trainId\": 420},\n    {\"name\": \"panel\", \"id\": 1748, \"trainId\": 421},\n    {\"name\": \"wallet\", \"id\": 2983, \"trainId\": 422},\n    {\"name\": \"wheel\", \"id\": 3035, \"trainId\": 423},\n    {\"name\": \"towel rack, towel horse\", \"id\": 2824, \"trainId\": 424},\n    {\"name\": \"roundabout\", \"id\": 2168, \"trainId\": 425},\n    {\"name\": \"canister, cannister, tin\", \"id\": 385, \"trainId\": 426},\n    {\"name\": \"rod\", \"id\": 2148, \"trainId\": 427},\n    {\"name\": \"soap dispenser\", \"id\": 2465, \"trainId\": 428},\n    {\"name\": \"bell\", \"id\": 175, \"trainId\": 429},\n    {\"name\": \"canvas\", \"id\": 390, \"trainId\": 430},\n    {\"name\": \"box office, ticket office, ticket booth\", \"id\": 268, \"trainId\": 431},\n    {\"name\": \"teacup\", \"id\": 2722, \"trainId\": 432},\n    {\"name\": \"trellis\", \"id\": 2857, \"trainId\": 433},\n    {\"name\": \"workbench\", \"id\": 3088, \"trainId\": 434},\n    {\"name\": \"valley, vale\", \"id\": 2926, \"trainId\": 435},\n    {\"name\": \"toaster\", \"id\": 2782, \"trainId\": 436},\n    {\"name\": \"knife\", \"id\": 1378, \"trainId\": 437},\n    {\"name\": \"podium\", \"id\": 1934, \"trainId\": 438},\n    {\"name\": \"ramp\", \"id\": 2072, \"trainId\": 439},\n    {\"name\": \"tumble dryer\", \"id\": 2889, \"trainId\": 440},\n    {\"name\": \"fireplug, fire hydrant, plug\", \"id\": 944, \"trainId\": 441},\n    {\"name\": \"gym shoe, sneaker, tennis shoe\", \"id\": 1158, \"trainId\": 442},\n    {\"name\": \"lab bench\", \"id\": 1383, \"trainId\": 443},\n    {\"name\": \"equipment\", \"id\": 867, \"trainId\": 
444},\n    {\"name\": \"rocky formation\", \"id\": 2145, \"trainId\": 445},\n    {\"name\": \"plastic\", \"id\": 1915, \"trainId\": 446},\n    {\"name\": \"calendar\", \"id\": 361, \"trainId\": 447},\n    {\"name\": \"caravan\", \"id\": 402, \"trainId\": 448},\n    {\"name\": \"check-in-desk\", \"id\": 482, \"trainId\": 449},\n    {\"name\": \"ticket counter\", \"id\": 2761, \"trainId\": 450},\n    {\"name\": \"brush\", \"id\": 300, \"trainId\": 451},\n    {\"name\": \"mill\", \"id\": 1554, \"trainId\": 452},\n    {\"name\": \"covered bridge\", \"id\": 636, \"trainId\": 453},\n    {\"name\": \"bowling alley\", \"id\": 260, \"trainId\": 454},\n    {\"name\": \"hanger\", \"id\": 1186, \"trainId\": 455},\n    {\"name\": \"excavator\", \"id\": 871, \"trainId\": 456},\n    {\"name\": \"trestle\", \"id\": 2859, \"trainId\": 457},\n    {\"name\": \"revolving door\", \"id\": 2103, \"trainId\": 458},\n    {\"name\": \"blast furnace\", \"id\": 208, \"trainId\": 459},\n    {\"name\": \"scale, weighing machine\", \"id\": 2236, \"trainId\": 460},\n    {\"name\": \"projector\", \"id\": 2012, \"trainId\": 461},\n    {\"name\": \"soap\", \"id\": 2462, \"trainId\": 462},\n    {\"name\": \"locker\", \"id\": 1462, \"trainId\": 463},\n    {\"name\": \"tractor\", \"id\": 2832, \"trainId\": 464},\n    {\"name\": \"stretcher\", \"id\": 2617, \"trainId\": 465},\n    {\"name\": \"frame\", \"id\": 1024, \"trainId\": 466},\n    {\"name\": \"grating\", \"id\": 1129, \"trainId\": 467},\n    {\"name\": \"alembic\", \"id\": 18, \"trainId\": 468},\n    {\"name\": \"candle, taper, wax light\", \"id\": 376, \"trainId\": 469},\n    {\"name\": \"barrier\", \"id\": 134, \"trainId\": 470},\n    {\"name\": \"cardboard\", \"id\": 407, \"trainId\": 471},\n    {\"name\": \"cave\", \"id\": 434, \"trainId\": 472},\n    {\"name\": \"puddle\", \"id\": 2017, \"trainId\": 473},\n    {\"name\": \"tarp\", \"id\": 2717, \"trainId\": 474},\n    {\"name\": \"price tag\", \"id\": 2005, \"trainId\": 475},\n    
{\"name\": \"watchtower\", \"id\": 2993, \"trainId\": 476},\n    {\"name\": \"meters\", \"id\": 1545, \"trainId\": 477},\n    {\n        \"name\": \"light bulb, lightbulb, bulb, incandescent lamp, electric light, electric-light bulb\",\n        \"id\": 1445,\n        \"trainId\": 478,\n    },\n    {\"name\": \"tracks\", \"id\": 2831, \"trainId\": 479},\n    {\"name\": \"hair dryer\", \"id\": 1161, \"trainId\": 480},\n    {\"name\": \"skirt\", \"id\": 2411, \"trainId\": 481},\n    {\"name\": \"viaduct\", \"id\": 2949, \"trainId\": 482},\n    {\"name\": \"paper towel\", \"id\": 1769, \"trainId\": 483},\n    {\"name\": \"coat\", \"id\": 552, \"trainId\": 484},\n    {\"name\": \"sheet\", \"id\": 2327, \"trainId\": 485},\n    {\"name\": \"fire extinguisher, extinguisher, asphyxiator\", \"id\": 939, \"trainId\": 486},\n    {\"name\": \"water wheel\", \"id\": 3013, \"trainId\": 487},\n    {\"name\": \"pottery, clayware\", \"id\": 1986, \"trainId\": 488},\n    {\"name\": \"magazine rack\", \"id\": 1486, \"trainId\": 489},\n    {\"name\": \"teapot\", \"id\": 2723, \"trainId\": 490},\n    {\"name\": \"microphone, mike\", \"id\": 1549, \"trainId\": 491},\n    {\"name\": \"support\", \"id\": 2649, \"trainId\": 492},\n    {\"name\": \"forklift\", \"id\": 1020, \"trainId\": 493},\n    {\"name\": \"canyon\", \"id\": 392, \"trainId\": 494},\n    {\"name\": \"cash register, register\", \"id\": 422, \"trainId\": 495},\n    {\"name\": \"leaf, leafage, foliage\", \"id\": 1419, \"trainId\": 496},\n    {\"name\": \"remote control, remote\", \"id\": 2099, \"trainId\": 497},\n    {\"name\": \"soap dish\", \"id\": 2464, \"trainId\": 498},\n    {\"name\": \"windshield, windscreen\", \"id\": 3058, \"trainId\": 499},\n    {\"name\": \"cat\", \"id\": 430, \"trainId\": 500},\n    {\"name\": \"cue, cue stick, pool cue, pool stick\", \"id\": 675, \"trainId\": 501},\n    {\"name\": \"vent, venthole, vent-hole, blowhole\", \"id\": 2941, \"trainId\": 502},\n    {\"name\": \"videos\", \"id\": 2955, 
\"trainId\": 503},\n    {\"name\": \"shovel\", \"id\": 2355, \"trainId\": 504},\n    {\"name\": \"eaves\", \"id\": 840, \"trainId\": 505},\n    {\"name\": \"antenna, aerial, transmitting aerial\", \"id\": 32, \"trainId\": 506},\n    {\"name\": \"shipyard\", \"id\": 2338, \"trainId\": 507},\n    {\"name\": \"hen, biddy\", \"id\": 1232, \"trainId\": 508},\n    {\"name\": \"traffic cone\", \"id\": 2834, \"trainId\": 509},\n    {\"name\": \"washing machines\", \"id\": 2991, \"trainId\": 510},\n    {\"name\": \"truck crane\", \"id\": 2879, \"trainId\": 511},\n    {\"name\": \"cds\", \"id\": 444, \"trainId\": 512},\n    {\"name\": \"niche\", \"id\": 1657, \"trainId\": 513},\n    {\"name\": \"scoreboard\", \"id\": 2246, \"trainId\": 514},\n    {\"name\": \"briefcase\", \"id\": 296, \"trainId\": 515},\n    {\"name\": \"boot\", \"id\": 245, \"trainId\": 516},\n    {\"name\": \"sweater, jumper\", \"id\": 2661, \"trainId\": 517},\n    {\"name\": \"hay\", \"id\": 1202, \"trainId\": 518},\n    {\"name\": \"pack\", \"id\": 1714, \"trainId\": 519},\n    {\"name\": \"bottle rack\", \"id\": 251, \"trainId\": 520},\n    {\"name\": \"glacier\", \"id\": 1095, \"trainId\": 521},\n    {\"name\": \"pergola\", \"id\": 1828, \"trainId\": 522},\n    {\"name\": \"building materials\", \"id\": 311, \"trainId\": 523},\n    {\"name\": \"television camera\", \"id\": 2732, \"trainId\": 524},\n    {\"name\": \"first floor\", \"id\": 947, \"trainId\": 525},\n    {\"name\": \"rifle\", \"id\": 2115, \"trainId\": 526},\n    {\"name\": \"tennis table\", \"id\": 2738, \"trainId\": 527},\n    {\"name\": \"stadium\", \"id\": 2525, \"trainId\": 528},\n    {\"name\": \"safety belt\", \"id\": 2194, \"trainId\": 529},\n    {\"name\": \"cover\", \"id\": 634, \"trainId\": 530},\n    {\"name\": \"dish rack\", \"id\": 740, \"trainId\": 531},\n    {\"name\": \"synthesizer\", \"id\": 2682, \"trainId\": 532},\n    {\"name\": \"pumpkin\", \"id\": 2020, \"trainId\": 533},\n    {\"name\": \"gutter\", \"id\": 1156, 
\"trainId\": 534},\n    {\"name\": \"fruit stand\", \"id\": 1036, \"trainId\": 535},\n    {\"name\": \"ice floe, floe\", \"id\": 1295, \"trainId\": 536},\n    {\"name\": \"handle, grip, handgrip, hold\", \"id\": 1181, \"trainId\": 537},\n    {\"name\": \"wheelchair\", \"id\": 3037, \"trainId\": 538},\n    {\"name\": \"mousepad, mouse mat\", \"id\": 1614, \"trainId\": 539},\n    {\"name\": \"diploma\", \"id\": 736, \"trainId\": 540},\n    {\"name\": \"fairground ride\", \"id\": 893, \"trainId\": 541},\n    {\"name\": \"radio\", \"id\": 2047, \"trainId\": 542},\n    {\"name\": \"hotplate\", \"id\": 1274, \"trainId\": 543},\n    {\"name\": \"junk\", \"id\": 1361, \"trainId\": 544},\n    {\"name\": \"wheelbarrow\", \"id\": 3036, \"trainId\": 545},\n    {\"name\": \"stream\", \"id\": 2606, \"trainId\": 546},\n    {\"name\": \"toll plaza\", \"id\": 2797, \"trainId\": 547},\n    {\"name\": \"punching bag\", \"id\": 2022, \"trainId\": 548},\n    {\"name\": \"trough\", \"id\": 2876, \"trainId\": 549},\n    {\"name\": \"throne\", \"id\": 2758, \"trainId\": 550},\n    {\"name\": \"chair desk\", \"id\": 472, \"trainId\": 551},\n    {\"name\": \"weighbridge\", \"id\": 3028, \"trainId\": 552},\n    {\"name\": \"extractor fan\", \"id\": 882, \"trainId\": 553},\n    {\"name\": \"hanging clothes\", \"id\": 1189, \"trainId\": 554},\n    {\"name\": \"dish, dish aerial, dish antenna, saucer\", \"id\": 743, \"trainId\": 555},\n    {\"name\": \"alarm clock, alarm\", \"id\": 3122, \"trainId\": 556},\n    {\"name\": \"ski lift\", \"id\": 2401, \"trainId\": 557},\n    {\"name\": \"chain\", \"id\": 468, \"trainId\": 558},\n    {\"name\": \"garage\", \"id\": 1061, \"trainId\": 559},\n    {\"name\": \"mechanical shovel\", \"id\": 1523, \"trainId\": 560},\n    {\"name\": \"wine rack\", \"id\": 3059, \"trainId\": 561},\n    {\"name\": \"tramway\", \"id\": 2843, \"trainId\": 562},\n    {\"name\": \"treadmill\", \"id\": 2853, \"trainId\": 563},\n    {\"name\": \"menu\", \"id\": 1529, \"trainId\": 
564},\n    {\"name\": \"block\", \"id\": 214, \"trainId\": 565},\n    {\"name\": \"well\", \"id\": 3032, \"trainId\": 566},\n    {\"name\": \"witness stand\", \"id\": 3071, \"trainId\": 567},\n    {\"name\": \"branch\", \"id\": 277, \"trainId\": 568},\n    {\"name\": \"duck\", \"id\": 813, \"trainId\": 569},\n    {\"name\": \"casserole\", \"id\": 426, \"trainId\": 570},\n    {\"name\": \"frying pan\", \"id\": 1039, \"trainId\": 571},\n    {\"name\": \"desk organizer\", \"id\": 727, \"trainId\": 572},\n    {\"name\": \"mast\", \"id\": 1508, \"trainId\": 573},\n    {\"name\": \"spectacles, specs, eyeglasses, glasses\", \"id\": 2490, \"trainId\": 574},\n    {\"name\": \"service elevator\", \"id\": 2299, \"trainId\": 575},\n    {\"name\": \"dollhouse\", \"id\": 768, \"trainId\": 576},\n    {\"name\": \"hammock\", \"id\": 1172, \"trainId\": 577},\n    {\"name\": \"clothes hanging\", \"id\": 537, \"trainId\": 578},\n    {\"name\": \"photocopier\", \"id\": 1847, \"trainId\": 579},\n    {\"name\": \"notepad\", \"id\": 1664, \"trainId\": 580},\n    {\"name\": \"golf cart\", \"id\": 1110, \"trainId\": 581},\n    {\"name\": \"footpath\", \"id\": 1014, \"trainId\": 582},\n    {\"name\": \"cross\", \"id\": 662, \"trainId\": 583},\n    {\"name\": \"baptismal font\", \"id\": 121, \"trainId\": 584},\n    {\"name\": \"boiler\", \"id\": 227, \"trainId\": 585},\n    {\"name\": \"skip\", \"id\": 2410, \"trainId\": 586},\n    {\"name\": \"rotisserie\", \"id\": 2165, \"trainId\": 587},\n    {\"name\": \"tables\", \"id\": 2696, \"trainId\": 588},\n    {\"name\": \"water mill\", \"id\": 3005, \"trainId\": 589},\n    {\"name\": \"helmet\", \"id\": 1231, \"trainId\": 590},\n    {\"name\": \"cover curtain\", \"id\": 635, \"trainId\": 591},\n    {\"name\": \"brick\", \"id\": 292, \"trainId\": 592},\n    {\"name\": \"table runner\", \"id\": 2690, \"trainId\": 593},\n    {\"name\": \"ashtray\", \"id\": 65, \"trainId\": 594},\n    {\"name\": \"street box\", \"id\": 2607, \"trainId\": 595},\n    
{\"name\": \"stick\", \"id\": 2574, \"trainId\": 596},\n    {\"name\": \"hangers\", \"id\": 1188, \"trainId\": 597},\n    {\"name\": \"cells\", \"id\": 456, \"trainId\": 598},\n    {\"name\": \"urinal\", \"id\": 2913, \"trainId\": 599},\n    {\"name\": \"centerpiece\", \"id\": 459, \"trainId\": 600},\n    {\"name\": \"portable fridge\", \"id\": 1955, \"trainId\": 601},\n    {\"name\": \"dvds\", \"id\": 827, \"trainId\": 602},\n    {\"name\": \"golf club\", \"id\": 1111, \"trainId\": 603},\n    {\"name\": \"skirting board\", \"id\": 2412, \"trainId\": 604},\n    {\"name\": \"water cooler\", \"id\": 2997, \"trainId\": 605},\n    {\"name\": \"clipboard\", \"id\": 528, \"trainId\": 606},\n    {\"name\": \"camera, photographic camera\", \"id\": 366, \"trainId\": 607},\n    {\"name\": \"pigeonhole\", \"id\": 1863, \"trainId\": 608},\n    {\"name\": \"chips\", \"id\": 500, \"trainId\": 609},\n    {\"name\": \"food processor\", \"id\": 1001, \"trainId\": 610},\n    {\"name\": \"post box\", \"id\": 1958, \"trainId\": 611},\n    {\"name\": \"lid\", \"id\": 1441, \"trainId\": 612},\n    {\"name\": \"drum\", \"id\": 809, \"trainId\": 613},\n    {\"name\": \"blender\", \"id\": 210, \"trainId\": 614},\n    {\"name\": \"cave entrance\", \"id\": 435, \"trainId\": 615},\n    {\"name\": \"dental chair\", \"id\": 718, \"trainId\": 616},\n    {\"name\": \"obelisk\", \"id\": 1674, \"trainId\": 617},\n    {\"name\": \"canoe\", \"id\": 388, \"trainId\": 618},\n    {\"name\": \"mobile\", \"id\": 1572, \"trainId\": 619},\n    {\"name\": \"monitors\", \"id\": 1584, \"trainId\": 620},\n    {\"name\": \"pool ball\", \"id\": 1944, \"trainId\": 621},\n    {\"name\": \"cue rack\", \"id\": 674, \"trainId\": 622},\n    {\"name\": \"baggage carts\", \"id\": 99, \"trainId\": 623},\n    {\"name\": \"shore\", \"id\": 2352, \"trainId\": 624},\n    {\"name\": \"fork\", \"id\": 1019, \"trainId\": 625},\n    {\"name\": \"paper filer\", \"id\": 1763, \"trainId\": 626},\n    {\"name\": \"bicycle rack\", 
\"id\": 185, \"trainId\": 627},\n    {\"name\": \"coat rack\", \"id\": 554, \"trainId\": 628},\n    {\"name\": \"garland\", \"id\": 1066, \"trainId\": 629},\n    {\"name\": \"sports bag\", \"id\": 2508, \"trainId\": 630},\n    {\"name\": \"fish tank\", \"id\": 951, \"trainId\": 631},\n    {\"name\": \"towel dispenser\", \"id\": 2822, \"trainId\": 632},\n    {\"name\": \"carriage\", \"id\": 415, \"trainId\": 633},\n    {\"name\": \"brochure\", \"id\": 297, \"trainId\": 634},\n    {\"name\": \"plaque\", \"id\": 1914, \"trainId\": 635},\n    {\"name\": \"stringer\", \"id\": 2619, \"trainId\": 636},\n    {\"name\": \"iron\", \"id\": 1338, \"trainId\": 637},\n    {\"name\": \"spoon\", \"id\": 2505, \"trainId\": 638},\n    {\"name\": \"flag pole\", \"id\": 955, \"trainId\": 639},\n    {\"name\": \"toilet brush\", \"id\": 2786, \"trainId\": 640},\n    {\"name\": \"book stand\", \"id\": 238, \"trainId\": 641},\n    {\"name\": \"water faucet, water tap, tap, hydrant\", \"id\": 3000, \"trainId\": 642},\n    {\"name\": \"ticket office\", \"id\": 2763, \"trainId\": 643},\n    {\"name\": \"broom\", \"id\": 299, \"trainId\": 644},\n    {\"name\": \"dvd\", \"id\": 822, \"trainId\": 645},\n    {\"name\": \"ice bucket\", \"id\": 1288, \"trainId\": 646},\n    {\"name\": \"carapace, shell, cuticle, shield\", \"id\": 3101, \"trainId\": 647},\n    {\"name\": \"tureen\", \"id\": 2894, \"trainId\": 648},\n    {\"name\": \"folders\", \"id\": 992, \"trainId\": 649},\n    {\"name\": \"chess\", \"id\": 489, \"trainId\": 650},\n    {\"name\": \"root\", \"id\": 2157, \"trainId\": 651},\n    {\"name\": \"sewing machine\", \"id\": 2309, \"trainId\": 652},\n    {\"name\": \"model\", \"id\": 1576, \"trainId\": 653},\n    {\"name\": \"pen\", \"id\": 1810, \"trainId\": 654},\n    {\"name\": \"violin\", \"id\": 2964, \"trainId\": 655},\n    {\"name\": \"sweatshirt\", \"id\": 2662, \"trainId\": 656},\n    {\"name\": \"recycling materials\", \"id\": 2087, \"trainId\": 657},\n    {\"name\": \"mitten\", 
\"id\": 1569, \"trainId\": 658},\n    {\"name\": \"chopping board, cutting board\", \"id\": 503, \"trainId\": 659},\n    {\"name\": \"mask\", \"id\": 1505, \"trainId\": 660},\n    {\"name\": \"log\", \"id\": 1468, \"trainId\": 661},\n    {\"name\": \"mouse, computer mouse\", \"id\": 1613, \"trainId\": 662},\n    {\"name\": \"grill\", \"id\": 1138, \"trainId\": 663},\n    {\"name\": \"hole\", \"id\": 1256, \"trainId\": 664},\n    {\"name\": \"target\", \"id\": 2715, \"trainId\": 665},\n    {\"name\": \"trash bag\", \"id\": 2846, \"trainId\": 666},\n    {\"name\": \"chalk\", \"id\": 477, \"trainId\": 667},\n    {\"name\": \"sticks\", \"id\": 2576, \"trainId\": 668},\n    {\"name\": \"balloon\", \"id\": 108, \"trainId\": 669},\n    {\"name\": \"score\", \"id\": 2245, \"trainId\": 670},\n    {\"name\": \"hair spray\", \"id\": 1162, \"trainId\": 671},\n    {\"name\": \"roll\", \"id\": 2149, \"trainId\": 672},\n    {\"name\": \"runner\", \"id\": 2183, \"trainId\": 673},\n    {\"name\": \"engine\", \"id\": 858, \"trainId\": 674},\n    {\"name\": \"inflatable glove\", \"id\": 1324, \"trainId\": 675},\n    {\"name\": \"games\", \"id\": 1055, \"trainId\": 676},\n    {\"name\": \"pallets\", \"id\": 1741, \"trainId\": 677},\n    {\"name\": \"baskets\", \"id\": 149, \"trainId\": 678},\n    {\"name\": \"coop\", \"id\": 615, \"trainId\": 679},\n    {\"name\": \"dvd player\", \"id\": 825, \"trainId\": 680},\n    {\"name\": \"rocking horse\", \"id\": 2143, \"trainId\": 681},\n    {\"name\": \"buckets\", \"id\": 304, \"trainId\": 682},\n    {\"name\": \"bread rolls\", \"id\": 283, \"trainId\": 683},\n    {\"name\": \"shawl\", \"id\": 2322, \"trainId\": 684},\n    {\"name\": \"watering can\", \"id\": 3017, \"trainId\": 685},\n    {\"name\": \"spotlights\", \"id\": 2510, \"trainId\": 686},\n    {\"name\": \"post-it\", \"id\": 1960, \"trainId\": 687},\n    {\"name\": \"bowls\", \"id\": 265, \"trainId\": 688},\n    {\"name\": \"security camera\", \"id\": 2282, \"trainId\": 689},\n    
{\"name\": \"runner cloth\", \"id\": 2184, \"trainId\": 690},\n    {\"name\": \"lock\", \"id\": 1461, \"trainId\": 691},\n    {\"name\": \"alarm, warning device, alarm system\", \"id\": 3113, \"trainId\": 692},\n    {\"name\": \"side\", \"id\": 2372, \"trainId\": 693},\n    {\"name\": \"roulette\", \"id\": 2166, \"trainId\": 694},\n    {\"name\": \"bone\", \"id\": 232, \"trainId\": 695},\n    {\"name\": \"cutlery\", \"id\": 693, \"trainId\": 696},\n    {\"name\": \"pool balls\", \"id\": 1945, \"trainId\": 697},\n    {\"name\": \"wheels\", \"id\": 3039, \"trainId\": 698},\n    {\"name\": \"spice rack\", \"id\": 2494, \"trainId\": 699},\n    {\"name\": \"plant pots\", \"id\": 1908, \"trainId\": 700},\n    {\"name\": \"towel ring\", \"id\": 2827, \"trainId\": 701},\n    {\"name\": \"bread box\", \"id\": 280, \"trainId\": 702},\n    {\"name\": \"video\", \"id\": 2950, \"trainId\": 703},\n    {\"name\": \"funfair\", \"id\": 1044, \"trainId\": 704},\n    {\"name\": \"breads\", \"id\": 288, \"trainId\": 705},\n    {\"name\": \"tripod\", \"id\": 2863, \"trainId\": 706},\n    {\"name\": \"ironing board\", \"id\": 1342, \"trainId\": 707},\n    {\"name\": \"skimmer\", \"id\": 2409, \"trainId\": 708},\n    {\"name\": \"hollow\", \"id\": 1258, \"trainId\": 709},\n    {\"name\": \"scratching post\", \"id\": 2249, \"trainId\": 710},\n    {\"name\": \"tricycle\", \"id\": 2862, \"trainId\": 711},\n    {\"name\": \"file box\", \"id\": 920, \"trainId\": 712},\n    {\"name\": \"mountain pass\", \"id\": 1607, \"trainId\": 713},\n    {\"name\": \"tombstones\", \"id\": 2802, \"trainId\": 714},\n    {\"name\": \"cooker\", \"id\": 610, \"trainId\": 715},\n    {\"name\": \"card game, cards\", \"id\": 3129, \"trainId\": 716},\n    {\"name\": \"golf bag\", \"id\": 1108, \"trainId\": 717},\n    {\"name\": \"towel paper\", \"id\": 2823, \"trainId\": 718},\n    {\"name\": \"chaise lounge\", \"id\": 476, \"trainId\": 719},\n    {\"name\": \"sun\", \"id\": 2641, \"trainId\": 720},\n    {\"name\": 
\"toilet paper holder\", \"id\": 2788, \"trainId\": 721},\n    {\"name\": \"rake\", \"id\": 2070, \"trainId\": 722},\n    {\"name\": \"key\", \"id\": 1368, \"trainId\": 723},\n    {\"name\": \"umbrella stand\", \"id\": 2903, \"trainId\": 724},\n    {\"name\": \"dartboard\", \"id\": 699, \"trainId\": 725},\n    {\"name\": \"transformer\", \"id\": 2844, \"trainId\": 726},\n    {\"name\": \"fireplace utensils\", \"id\": 942, \"trainId\": 727},\n    {\"name\": \"sweatshirts\", \"id\": 2663, \"trainId\": 728},\n    {\n        \"name\": \"cellular telephone, cellular phone, cellphone, cell, mobile phone\",\n        \"id\": 457,\n        \"trainId\": 729,\n    },\n    {\"name\": \"tallboy\", \"id\": 2701, \"trainId\": 730},\n    {\"name\": \"stapler\", \"id\": 2540, \"trainId\": 731},\n    {\"name\": \"sauna\", \"id\": 2231, \"trainId\": 732},\n    {\"name\": \"test tube\", \"id\": 2746, \"trainId\": 733},\n    {\"name\": \"palette\", \"id\": 1738, \"trainId\": 734},\n    {\"name\": \"shopping carts\", \"id\": 2350, \"trainId\": 735},\n    {\"name\": \"tools\", \"id\": 2808, \"trainId\": 736},\n    {\"name\": \"push button, push, button\", \"id\": 2025, \"trainId\": 737},\n    {\"name\": \"star\", \"id\": 2541, \"trainId\": 738},\n    {\"name\": \"roof rack\", \"id\": 2156, \"trainId\": 739},\n    {\"name\": \"barbed wire\", \"id\": 126, \"trainId\": 740},\n    {\"name\": \"spray\", \"id\": 2512, \"trainId\": 741},\n    {\"name\": \"ear\", \"id\": 831, \"trainId\": 742},\n    {\"name\": \"sponge\", \"id\": 2503, \"trainId\": 743},\n    {\"name\": \"racket\", \"id\": 2039, \"trainId\": 744},\n    {\"name\": \"tins\", \"id\": 2774, \"trainId\": 745},\n    {\"name\": \"eyeglasses\", \"id\": 886, \"trainId\": 746},\n    {\"name\": \"file\", \"id\": 919, \"trainId\": 747},\n    {\"name\": \"scarfs\", \"id\": 2240, \"trainId\": 748},\n    {\"name\": \"sugar bowl\", \"id\": 2636, \"trainId\": 749},\n    {\"name\": \"flip flop\", \"id\": 963, \"trainId\": 750},\n    {\"name\": 
\"headstones\", \"id\": 1218, \"trainId\": 751},\n    {\"name\": \"laptop bag\", \"id\": 1406, \"trainId\": 752},\n    {\"name\": \"leash\", \"id\": 1420, \"trainId\": 753},\n    {\"name\": \"climbing frame\", \"id\": 526, \"trainId\": 754},\n    {\"name\": \"suit hanger\", \"id\": 2639, \"trainId\": 755},\n    {\"name\": \"floor spotlight\", \"id\": 975, \"trainId\": 756},\n    {\"name\": \"plate rack\", \"id\": 1921, \"trainId\": 757},\n    {\"name\": \"sewer\", \"id\": 2305, \"trainId\": 758},\n    {\"name\": \"hard drive\", \"id\": 1193, \"trainId\": 759},\n    {\"name\": \"sprinkler\", \"id\": 2517, \"trainId\": 760},\n    {\"name\": \"tools box\", \"id\": 2809, \"trainId\": 761},\n    {\"name\": \"necklace\", \"id\": 1647, \"trainId\": 762},\n    {\"name\": \"bulbs\", \"id\": 314, \"trainId\": 763},\n    {\"name\": \"steel industry\", \"id\": 2560, \"trainId\": 764},\n    {\"name\": \"club\", \"id\": 545, \"trainId\": 765},\n    {\"name\": \"jack\", \"id\": 1345, \"trainId\": 766},\n    {\"name\": \"door bars\", \"id\": 775, \"trainId\": 767},\n    {\n        \"name\": \"control panel, instrument panel, control board, board, panel\",\n        \"id\": 603,\n        \"trainId\": 768,\n    },\n    {\"name\": \"hairbrush\", \"id\": 1163, \"trainId\": 769},\n    {\"name\": \"napkin holder\", \"id\": 1641, \"trainId\": 770},\n    {\"name\": \"office\", \"id\": 1678, \"trainId\": 771},\n    {\"name\": \"smoke detector\", \"id\": 2450, \"trainId\": 772},\n    {\"name\": \"utensils\", \"id\": 2915, \"trainId\": 773},\n    {\"name\": \"apron\", \"id\": 42, \"trainId\": 774},\n    {\"name\": \"scissors\", \"id\": 2242, \"trainId\": 775},\n    {\"name\": \"terminal\", \"id\": 2741, \"trainId\": 776},\n    {\"name\": \"grinder\", \"id\": 1143, \"trainId\": 777},\n    {\"name\": \"entry phone\", \"id\": 862, \"trainId\": 778},\n    {\"name\": \"newspaper stand\", \"id\": 1654, \"trainId\": 779},\n    {\"name\": \"pepper shaker\", \"id\": 1826, \"trainId\": 780},\n    
{\"name\": \"onions\", \"id\": 1689, \"trainId\": 781},\n    {\n        \"name\": \"central processing unit, cpu, c p u , central processor, processor, mainframe\",\n        \"id\": 3124,\n        \"trainId\": 782,\n    },\n    {\"name\": \"tape\", \"id\": 2710, \"trainId\": 783},\n    {\"name\": \"bat\", \"id\": 152, \"trainId\": 784},\n    {\"name\": \"coaster\", \"id\": 549, \"trainId\": 785},\n    {\"name\": \"calculator\", \"id\": 360, \"trainId\": 786},\n    {\"name\": \"potatoes\", \"id\": 1982, \"trainId\": 787},\n    {\"name\": \"luggage rack\", \"id\": 1478, \"trainId\": 788},\n    {\"name\": \"salt\", \"id\": 2203, \"trainId\": 789},\n    {\"name\": \"street number\", \"id\": 2612, \"trainId\": 790},\n    {\"name\": \"viewpoint\", \"id\": 2956, \"trainId\": 791},\n    {\"name\": \"sword\", \"id\": 2681, \"trainId\": 792},\n    {\"name\": \"cd\", \"id\": 437, \"trainId\": 793},\n    {\"name\": \"rowing machine\", \"id\": 2171, \"trainId\": 794},\n    {\"name\": \"plug\", \"id\": 1933, \"trainId\": 795},\n    {\"name\": \"andiron, firedog, dog, dog-iron\", \"id\": 3110, \"trainId\": 796},\n    {\"name\": \"pepper\", \"id\": 1824, \"trainId\": 797},\n    {\"name\": \"tongs\", \"id\": 2803, \"trainId\": 798},\n    {\"name\": \"bonfire\", \"id\": 234, \"trainId\": 799},\n    {\"name\": \"dog dish\", \"id\": 764, \"trainId\": 800},\n    {\"name\": \"belt\", \"id\": 177, \"trainId\": 801},\n    {\"name\": \"dumbbells\", \"id\": 817, \"trainId\": 802},\n    {\"name\": \"videocassette recorder, vcr\", \"id\": 3145, \"trainId\": 803},\n    {\"name\": \"hook\", \"id\": 1262, \"trainId\": 804},\n    {\"name\": \"envelopes\", \"id\": 864, \"trainId\": 805},\n    {\"name\": \"shower faucet\", \"id\": 2359, \"trainId\": 806},\n    {\"name\": \"watch\", \"id\": 2992, \"trainId\": 807},\n    {\"name\": \"padlock\", \"id\": 1725, \"trainId\": 808},\n    {\"name\": \"swimming pool ladder\", \"id\": 2667, \"trainId\": 809},\n    {\"name\": \"spanners\", \"id\": 2484, 
\"trainId\": 810},\n    {\"name\": \"gravy boat\", \"id\": 1133, \"trainId\": 811},\n    {\"name\": \"notice board\", \"id\": 1667, \"trainId\": 812},\n    {\"name\": \"trash bags\", \"id\": 2847, \"trainId\": 813},\n    {\"name\": \"fire alarm\", \"id\": 932, \"trainId\": 814},\n    {\"name\": \"ladle\", \"id\": 1392, \"trainId\": 815},\n    {\"name\": \"stethoscope\", \"id\": 2573, \"trainId\": 816},\n    {\"name\": \"rocket\", \"id\": 2140, \"trainId\": 817},\n    {\"name\": \"funnel\", \"id\": 1046, \"trainId\": 818},\n    {\"name\": \"bowling pins\", \"id\": 264, \"trainId\": 819},\n    {\"name\": \"valve\", \"id\": 2927, \"trainId\": 820},\n    {\"name\": \"thermometer\", \"id\": 2752, \"trainId\": 821},\n    {\"name\": \"cups\", \"id\": 679, \"trainId\": 822},\n    {\"name\": \"spice jar\", \"id\": 2493, \"trainId\": 823},\n    {\"name\": \"night light\", \"id\": 1658, \"trainId\": 824},\n    {\"name\": \"soaps\", \"id\": 2466, \"trainId\": 825},\n    {\"name\": \"games table\", \"id\": 1057, \"trainId\": 826},\n    {\"name\": \"slotted spoon\", \"id\": 2444, \"trainId\": 827},\n    {\"name\": \"reel\", \"id\": 2093, \"trainId\": 828},\n    {\"name\": \"scourer\", \"id\": 2248, \"trainId\": 829},\n    {\"name\": \"sleeping robe\", \"id\": 2432, \"trainId\": 830},\n    {\"name\": \"desk mat\", \"id\": 726, \"trainId\": 831},\n    {\"name\": \"dumbbell\", \"id\": 816, \"trainId\": 832},\n    {\"name\": \"hammer\", \"id\": 1171, \"trainId\": 833},\n    {\"name\": \"tie\", \"id\": 2766, \"trainId\": 834},\n    {\"name\": \"typewriter\", \"id\": 2900, \"trainId\": 835},\n    {\"name\": \"shaker\", \"id\": 2313, \"trainId\": 836},\n    {\"name\": \"cheese dish\", \"id\": 488, \"trainId\": 837},\n    {\"name\": \"sea star\", \"id\": 2265, \"trainId\": 838},\n    {\"name\": \"racquet\", \"id\": 2043, \"trainId\": 839},\n    {\"name\": \"butane gas cylinder\", \"id\": 332, \"trainId\": 840},\n    {\"name\": \"paper weight\", \"id\": 1771, \"trainId\": 841},\n    
{\"name\": \"shaving brush\", \"id\": 2320, \"trainId\": 842},\n    {\"name\": \"sunglasses\", \"id\": 2646, \"trainId\": 843},\n    {\"name\": \"gear shift\", \"id\": 1089, \"trainId\": 844},\n    {\"name\": \"towel rail\", \"id\": 2826, \"trainId\": 845},\n    {\"name\": \"adding machine, totalizer, totaliser\", \"id\": 3148, \"trainId\": 846},\n]\n\n\ndef _get_ade20k_full_meta():\n    # Id 0 is reserved for ignore_label, we change ignore_label for 0\n    # to 255 in our pre-processing, so all ids are shifted by 1.\n    stuff_ids = [k[\"id\"] for k in ADE20K_SEM_SEG_FULL_CATEGORIES]\n    assert len(stuff_ids) == 847, len(stuff_ids)\n\n    # For semantic segmentation, this mapping maps from contiguous stuff id\n    # (in [0, 91], used in models) to ids in the dataset (used for processing results)\n    stuff_dataset_id_to_contiguous_id = {k: i for i, k in enumerate(stuff_ids)}\n    stuff_classes = [k[\"name\"] for k in ADE20K_SEM_SEG_FULL_CATEGORIES]\n\n    ret = {\n        \"stuff_dataset_id_to_contiguous_id\": stuff_dataset_id_to_contiguous_id,\n        \"stuff_classes\": stuff_classes,\n    }\n    return ret\n\n\ndef register_all_ade20k_full(root):\n    root = os.path.join(root, \"ADE20K_2021_17_01\")\n    meta = _get_ade20k_full_meta()\n    for name, dirname in [(\"train\", \"training\"), (\"val\", \"validation\")]:\n        image_dir = os.path.join(root, \"images_detectron2\", dirname)\n        gt_dir = os.path.join(root, \"annotations_detectron2\", dirname)\n        name = f\"ade20k_full_sem_seg_{name}\"\n        DatasetCatalog.register(\n            name, lambda x=image_dir, y=gt_dir: load_sem_seg(y, x, gt_ext=\"tif\", image_ext=\"jpg\")\n        )\n        MetadataCatalog.get(name).set(\n            stuff_classes=meta[\"stuff_classes\"][:],\n            image_root=image_dir,\n            sem_seg_root=gt_dir,\n            evaluator_type=\"sem_seg\",\n            ignore_label=65535,  # NOTE: gt is saved in 16-bit TIFF images\n        )\n\n\n_root = 
os.getenv(\"DETECTRON2_DATASETS\", \"datasets\")\nregister_all_ade20k_full(_root)\n"
  },
  {
    "path": "mfvis_nococo/mask2former/data/datasets/register_ade20k_instance.py",
    "content": "import json\nimport logging\nimport numpy as np\nimport os\nfrom PIL import Image\n\nfrom detectron2.data import DatasetCatalog, MetadataCatalog\nfrom detectron2.data.datasets.coco import load_coco_json, register_coco_instances\nfrom detectron2.utils.file_io import PathManager\n\nADE_CATEGORIES = [{'id': 7, 'name': 'bed'}, {'id': 8, 'name': 'windowpane'}, {'id': 10, 'name': 'cabinet'}, {'id': 12, 'name': 'person'}, {'id': 14, 'name': 'door'}, {'id': 15, 'name': 'table'}, {'id': 18, 'name': 'curtain'}, {'id': 19, 'name': 'chair'}, {'id': 20, 'name': 'car'}, {'id': 22, 'name': 'painting'}, {'id': 23, 'name': 'sofa'}, {'id': 24, 'name': 'shelf'}, {'id': 27, 'name': 'mirror'}, {'id': 30, 'name': 'armchair'}, {'id': 31, 'name': 'seat'}, {'id': 32, 'name': 'fence'}, {'id': 33, 'name': 'desk'}, {'id': 35, 'name': 'wardrobe'}, {'id': 36, 'name': 'lamp'}, {'id': 37, 'name': 'bathtub'}, {'id': 38, 'name': 'railing'}, {'id': 39, 'name': 'cushion'}, {'id': 41, 'name': 'box'}, {'id': 42, 'name': 'column'}, {'id': 43, 'name': 'signboard'}, {'id': 44, 'name': 'chest of drawers'}, {'id': 45, 'name': 'counter'}, {'id': 47, 'name': 'sink'}, {'id': 49, 'name': 'fireplace'}, {'id': 50, 'name': 'refrigerator'}, {'id': 53, 'name': 'stairs'}, {'id': 55, 'name': 'case'}, {'id': 56, 'name': 'pool table'}, {'id': 57, 'name': 'pillow'}, {'id': 58, 'name': 'screen door'}, {'id': 62, 'name': 'bookcase'}, {'id': 64, 'name': 'coffee table'}, {'id': 65, 'name': 'toilet'}, {'id': 66, 'name': 'flower'}, {'id': 67, 'name': 'book'}, {'id': 69, 'name': 'bench'}, {'id': 70, 'name': 'countertop'}, {'id': 71, 'name': 'stove'}, {'id': 72, 'name': 'palm'}, {'id': 73, 'name': 'kitchen island'}, {'id': 74, 'name': 'computer'}, {'id': 75, 'name': 'swivel chair'}, {'id': 76, 'name': 'boat'}, {'id': 78, 'name': 'arcade machine'}, {'id': 80, 'name': 'bus'}, {'id': 81, 'name': 'towel'}, {'id': 82, 'name': 'light'}, {'id': 83, 'name': 'truck'}, {'id': 85, 'name': 'chandelier'}, {'id': 86, 'name': 
'awning'}, {'id': 87, 'name': 'streetlight'}, {'id': 88, 'name': 'booth'}, {'id': 89, 'name': 'television receiver'}, {'id': 90, 'name': 'airplane'}, {'id': 92, 'name': 'apparel'}, {'id': 93, 'name': 'pole'}, {'id': 95, 'name': 'bannister'}, {'id': 97, 'name': 'ottoman'}, {'id': 98, 'name': 'bottle'}, {'id': 102, 'name': 'van'}, {'id': 103, 'name': 'ship'}, {'id': 104, 'name': 'fountain'}, {'id': 107, 'name': 'washer'}, {'id': 108, 'name': 'plaything'}, {'id': 110, 'name': 'stool'}, {'id': 111, 'name': 'barrel'}, {'id': 112, 'name': 'basket'}, {'id': 115, 'name': 'bag'}, {'id': 116, 'name': 'minibike'}, {'id': 118, 'name': 'oven'}, {'id': 119, 'name': 'ball'}, {'id': 120, 'name': 'food'}, {'id': 121, 'name': 'step'}, {'id': 123, 'name': 'trade name'}, {'id': 124, 'name': 'microwave'}, {'id': 125, 'name': 'pot'}, {'id': 126, 'name': 'animal'}, {'id': 127, 'name': 'bicycle'}, {'id': 129, 'name': 'dishwasher'}, {'id': 130, 'name': 'screen'}, {'id': 132, 'name': 'sculpture'}, {'id': 133, 'name': 'hood'}, {'id': 134, 'name': 'sconce'}, {'id': 135, 'name': 'vase'}, {'id': 136, 'name': 'traffic light'}, {'id': 137, 'name': 'tray'}, {'id': 138, 'name': 'ashcan'}, {'id': 139, 'name': 'fan'}, {'id': 142, 'name': 'plate'}, {'id': 143, 'name': 'monitor'}, {'id': 144, 'name': 'bulletin board'}, {'id': 146, 'name': 'radiator'}, {'id': 147, 'name': 'glass'}, {'id': 148, 'name': 'clock'}, {'id': 149, 'name': 'flag'}]\n\n\n_PREDEFINED_SPLITS = {\n    # point annotations without masks\n    \"ade20k_instance_train\": (\n        \"ADEChallengeData2016/images/training\",\n        \"ADEChallengeData2016/ade20k_instance_train.json\",\n    ),\n    \"ade20k_instance_val\": (\n        \"ADEChallengeData2016/images/validation\",\n        \"ADEChallengeData2016/ade20k_instance_val.json\",\n    ),\n}\n\n\ndef _get_ade_instances_meta():\n    thing_ids = [k[\"id\"] for k in ADE_CATEGORIES]\n    assert len(thing_ids) == 100, len(thing_ids)\n    # Mapping from the incontiguous ADE category id to 
an id in [0, 99]\n    thing_dataset_id_to_contiguous_id = {k: i for i, k in enumerate(thing_ids)}\n    thing_classes = [k[\"name\"] for k in ADE_CATEGORIES]\n    ret = {\n        \"thing_dataset_id_to_contiguous_id\": thing_dataset_id_to_contiguous_id,\n        \"thing_classes\": thing_classes,\n    }\n    return ret\n\n\ndef register_all_ade20k_instance(root):\n    for key, (image_root, json_file) in _PREDEFINED_SPLITS.items():\n        # Assume pre-defined datasets live in `./datasets`.\n        register_coco_instances(\n            key,\n            _get_ade_instances_meta(),\n            os.path.join(root, json_file) if \"://\" not in json_file else json_file,\n            os.path.join(root, image_root),\n        )\n\n\n_root = os.getenv(\"DETECTRON2_DATASETS\", \"datasets\")\nregister_all_ade20k_instance(_root)\n"
  },
  {
    "path": "mfvis_nococo/mask2former/data/datasets/register_ade20k_panoptic.py",
    "content": "import json\nimport os\n\nfrom detectron2.data import DatasetCatalog, MetadataCatalog\nfrom detectron2.utils.file_io import PathManager\n\nADE20K_150_CATEGORIES = [\n    {\"color\": [120, 120, 120], \"id\": 0, \"isthing\": 0, \"name\": \"wall\"},\n    {\"color\": [180, 120, 120], \"id\": 1, \"isthing\": 0, \"name\": \"building\"},\n    {\"color\": [6, 230, 230], \"id\": 2, \"isthing\": 0, \"name\": \"sky\"},\n    {\"color\": [80, 50, 50], \"id\": 3, \"isthing\": 0, \"name\": \"floor\"},\n    {\"color\": [4, 200, 3], \"id\": 4, \"isthing\": 0, \"name\": \"tree\"},\n    {\"color\": [120, 120, 80], \"id\": 5, \"isthing\": 0, \"name\": \"ceiling\"},\n    {\"color\": [140, 140, 140], \"id\": 6, \"isthing\": 0, \"name\": \"road, route\"},\n    {\"color\": [204, 5, 255], \"id\": 7, \"isthing\": 1, \"name\": \"bed\"},\n    {\"color\": [230, 230, 230], \"id\": 8, \"isthing\": 1, \"name\": \"window \"},\n    {\"color\": [4, 250, 7], \"id\": 9, \"isthing\": 0, \"name\": \"grass\"},\n    {\"color\": [224, 5, 255], \"id\": 10, \"isthing\": 1, \"name\": \"cabinet\"},\n    {\"color\": [235, 255, 7], \"id\": 11, \"isthing\": 0, \"name\": \"sidewalk, pavement\"},\n    {\"color\": [150, 5, 61], \"id\": 12, \"isthing\": 1, \"name\": \"person\"},\n    {\"color\": [120, 120, 70], \"id\": 13, \"isthing\": 0, \"name\": \"earth, ground\"},\n    {\"color\": [8, 255, 51], \"id\": 14, \"isthing\": 1, \"name\": \"door\"},\n    {\"color\": [255, 6, 82], \"id\": 15, \"isthing\": 1, \"name\": \"table\"},\n    {\"color\": [143, 255, 140], \"id\": 16, \"isthing\": 0, \"name\": \"mountain, mount\"},\n    {\"color\": [204, 255, 4], \"id\": 17, \"isthing\": 0, \"name\": \"plant\"},\n    {\"color\": [255, 51, 7], \"id\": 18, \"isthing\": 1, \"name\": \"curtain\"},\n    {\"color\": [204, 70, 3], \"id\": 19, \"isthing\": 1, \"name\": \"chair\"},\n    {\"color\": [0, 102, 200], \"id\": 20, \"isthing\": 1, \"name\": \"car\"},\n    {\"color\": [61, 230, 250], \"id\": 21, \"isthing\": 0, 
\"name\": \"water\"},\n    {\"color\": [255, 6, 51], \"id\": 22, \"isthing\": 1, \"name\": \"painting, picture\"},\n    {\"color\": [11, 102, 255], \"id\": 23, \"isthing\": 1, \"name\": \"sofa\"},\n    {\"color\": [255, 7, 71], \"id\": 24, \"isthing\": 1, \"name\": \"shelf\"},\n    {\"color\": [255, 9, 224], \"id\": 25, \"isthing\": 0, \"name\": \"house\"},\n    {\"color\": [9, 7, 230], \"id\": 26, \"isthing\": 0, \"name\": \"sea\"},\n    {\"color\": [220, 220, 220], \"id\": 27, \"isthing\": 1, \"name\": \"mirror\"},\n    {\"color\": [255, 9, 92], \"id\": 28, \"isthing\": 0, \"name\": \"rug\"},\n    {\"color\": [112, 9, 255], \"id\": 29, \"isthing\": 0, \"name\": \"field\"},\n    {\"color\": [8, 255, 214], \"id\": 30, \"isthing\": 1, \"name\": \"armchair\"},\n    {\"color\": [7, 255, 224], \"id\": 31, \"isthing\": 1, \"name\": \"seat\"},\n    {\"color\": [255, 184, 6], \"id\": 32, \"isthing\": 1, \"name\": \"fence\"},\n    {\"color\": [10, 255, 71], \"id\": 33, \"isthing\": 1, \"name\": \"desk\"},\n    {\"color\": [255, 41, 10], \"id\": 34, \"isthing\": 0, \"name\": \"rock, stone\"},\n    {\"color\": [7, 255, 255], \"id\": 35, \"isthing\": 1, \"name\": \"wardrobe, closet, press\"},\n    {\"color\": [224, 255, 8], \"id\": 36, \"isthing\": 1, \"name\": \"lamp\"},\n    {\"color\": [102, 8, 255], \"id\": 37, \"isthing\": 1, \"name\": \"tub\"},\n    {\"color\": [255, 61, 6], \"id\": 38, \"isthing\": 1, \"name\": \"rail\"},\n    {\"color\": [255, 194, 7], \"id\": 39, \"isthing\": 1, \"name\": \"cushion\"},\n    {\"color\": [255, 122, 8], \"id\": 40, \"isthing\": 0, \"name\": \"base, pedestal, stand\"},\n    {\"color\": [0, 255, 20], \"id\": 41, \"isthing\": 1, \"name\": \"box\"},\n    {\"color\": [255, 8, 41], \"id\": 42, \"isthing\": 1, \"name\": \"column, pillar\"},\n    {\"color\": [255, 5, 153], \"id\": 43, \"isthing\": 1, \"name\": \"signboard, sign\"},\n    {\n        \"color\": [6, 51, 255],\n        \"id\": 44,\n        \"isthing\": 1,\n        \"name\": \"chest 
of drawers, chest, bureau, dresser\",\n    },\n    {\"color\": [235, 12, 255], \"id\": 45, \"isthing\": 1, \"name\": \"counter\"},\n    {\"color\": [160, 150, 20], \"id\": 46, \"isthing\": 0, \"name\": \"sand\"},\n    {\"color\": [0, 163, 255], \"id\": 47, \"isthing\": 1, \"name\": \"sink\"},\n    {\"color\": [140, 140, 140], \"id\": 48, \"isthing\": 0, \"name\": \"skyscraper\"},\n    {\"color\": [250, 10, 15], \"id\": 49, \"isthing\": 1, \"name\": \"fireplace\"},\n    {\"color\": [20, 255, 0], \"id\": 50, \"isthing\": 1, \"name\": \"refrigerator, icebox\"},\n    {\"color\": [31, 255, 0], \"id\": 51, \"isthing\": 0, \"name\": \"grandstand, covered stand\"},\n    {\"color\": [255, 31, 0], \"id\": 52, \"isthing\": 0, \"name\": \"path\"},\n    {\"color\": [255, 224, 0], \"id\": 53, \"isthing\": 1, \"name\": \"stairs\"},\n    {\"color\": [153, 255, 0], \"id\": 54, \"isthing\": 0, \"name\": \"runway\"},\n    {\"color\": [0, 0, 255], \"id\": 55, \"isthing\": 1, \"name\": \"case, display case, showcase, vitrine\"},\n    {\n        \"color\": [255, 71, 0],\n        \"id\": 56,\n        \"isthing\": 1,\n        \"name\": \"pool table, billiard table, snooker table\",\n    },\n    {\"color\": [0, 235, 255], \"id\": 57, \"isthing\": 1, \"name\": \"pillow\"},\n    {\"color\": [0, 173, 255], \"id\": 58, \"isthing\": 1, \"name\": \"screen door, screen\"},\n    {\"color\": [31, 0, 255], \"id\": 59, \"isthing\": 0, \"name\": \"stairway, staircase\"},\n    {\"color\": [11, 200, 200], \"id\": 60, \"isthing\": 0, \"name\": \"river\"},\n    {\"color\": [255, 82, 0], \"id\": 61, \"isthing\": 0, \"name\": \"bridge, span\"},\n    {\"color\": [0, 255, 245], \"id\": 62, \"isthing\": 1, \"name\": \"bookcase\"},\n    {\"color\": [0, 61, 255], \"id\": 63, \"isthing\": 0, \"name\": \"blind, screen\"},\n    {\"color\": [0, 255, 112], \"id\": 64, \"isthing\": 1, \"name\": \"coffee table\"},\n    {\n        \"color\": [0, 255, 133],\n        \"id\": 65,\n        \"isthing\": 1,\n        \"name\": 
\"toilet, can, commode, crapper, pot, potty, stool, throne\",\n    },\n    {\"color\": [255, 0, 0], \"id\": 66, \"isthing\": 1, \"name\": \"flower\"},\n    {\"color\": [255, 163, 0], \"id\": 67, \"isthing\": 1, \"name\": \"book\"},\n    {\"color\": [255, 102, 0], \"id\": 68, \"isthing\": 0, \"name\": \"hill\"},\n    {\"color\": [194, 255, 0], \"id\": 69, \"isthing\": 1, \"name\": \"bench\"},\n    {\"color\": [0, 143, 255], \"id\": 70, \"isthing\": 1, \"name\": \"countertop\"},\n    {\"color\": [51, 255, 0], \"id\": 71, \"isthing\": 1, \"name\": \"stove\"},\n    {\"color\": [0, 82, 255], \"id\": 72, \"isthing\": 1, \"name\": \"palm, palm tree\"},\n    {\"color\": [0, 255, 41], \"id\": 73, \"isthing\": 1, \"name\": \"kitchen island\"},\n    {\"color\": [0, 255, 173], \"id\": 74, \"isthing\": 1, \"name\": \"computer\"},\n    {\"color\": [10, 0, 255], \"id\": 75, \"isthing\": 1, \"name\": \"swivel chair\"},\n    {\"color\": [173, 255, 0], \"id\": 76, \"isthing\": 1, \"name\": \"boat\"},\n    {\"color\": [0, 255, 153], \"id\": 77, \"isthing\": 0, \"name\": \"bar\"},\n    {\"color\": [255, 92, 0], \"id\": 78, \"isthing\": 1, \"name\": \"arcade machine\"},\n    {\"color\": [255, 0, 255], \"id\": 79, \"isthing\": 0, \"name\": \"hovel, hut, hutch, shack, shanty\"},\n    {\"color\": [255, 0, 245], \"id\": 80, \"isthing\": 1, \"name\": \"bus\"},\n    {\"color\": [255, 0, 102], \"id\": 81, \"isthing\": 1, \"name\": \"towel\"},\n    {\"color\": [255, 173, 0], \"id\": 82, \"isthing\": 1, \"name\": \"light\"},\n    {\"color\": [255, 0, 20], \"id\": 83, \"isthing\": 1, \"name\": \"truck\"},\n    {\"color\": [255, 184, 184], \"id\": 84, \"isthing\": 0, \"name\": \"tower\"},\n    {\"color\": [0, 31, 255], \"id\": 85, \"isthing\": 1, \"name\": \"chandelier\"},\n    {\"color\": [0, 255, 61], \"id\": 86, \"isthing\": 1, \"name\": \"awning, sunshade, sunblind\"},\n    {\"color\": [0, 71, 255], \"id\": 87, \"isthing\": 1, \"name\": \"street lamp\"},\n    {\"color\": [255, 0, 204], 
\"id\": 88, \"isthing\": 1, \"name\": \"booth\"},\n    {\"color\": [0, 255, 194], \"id\": 89, \"isthing\": 1, \"name\": \"tv\"},\n    {\"color\": [0, 255, 82], \"id\": 90, \"isthing\": 1, \"name\": \"plane\"},\n    {\"color\": [0, 10, 255], \"id\": 91, \"isthing\": 0, \"name\": \"dirt track\"},\n    {\"color\": [0, 112, 255], \"id\": 92, \"isthing\": 1, \"name\": \"clothes\"},\n    {\"color\": [51, 0, 255], \"id\": 93, \"isthing\": 1, \"name\": \"pole\"},\n    {\"color\": [0, 194, 255], \"id\": 94, \"isthing\": 0, \"name\": \"land, ground, soil\"},\n    {\n        \"color\": [0, 122, 255],\n        \"id\": 95,\n        \"isthing\": 1,\n        \"name\": \"bannister, banister, balustrade, balusters, handrail\",\n    },\n    {\n        \"color\": [0, 255, 163],\n        \"id\": 96,\n        \"isthing\": 0,\n        \"name\": \"escalator, moving staircase, moving stairway\",\n    },\n    {\n        \"color\": [255, 153, 0],\n        \"id\": 97,\n        \"isthing\": 1,\n        \"name\": \"ottoman, pouf, pouffe, puff, hassock\",\n    },\n    {\"color\": [0, 255, 10], \"id\": 98, \"isthing\": 1, \"name\": \"bottle\"},\n    {\"color\": [255, 112, 0], \"id\": 99, \"isthing\": 0, \"name\": \"buffet, counter, sideboard\"},\n    {\n        \"color\": [143, 255, 0],\n        \"id\": 100,\n        \"isthing\": 0,\n        \"name\": \"poster, posting, placard, notice, bill, card\",\n    },\n    {\"color\": [82, 0, 255], \"id\": 101, \"isthing\": 0, \"name\": \"stage\"},\n    {\"color\": [163, 255, 0], \"id\": 102, \"isthing\": 1, \"name\": \"van\"},\n    {\"color\": [255, 235, 0], \"id\": 103, \"isthing\": 1, \"name\": \"ship\"},\n    {\"color\": [8, 184, 170], \"id\": 104, \"isthing\": 1, \"name\": \"fountain\"},\n    {\n        \"color\": [133, 0, 255],\n        \"id\": 105,\n        \"isthing\": 0,\n        \"name\": \"conveyer belt, conveyor belt, conveyer, conveyor, transporter\",\n    },\n    {\"color\": [0, 255, 92], \"id\": 106, \"isthing\": 0, \"name\": \"canopy\"},\n 
   {\n        \"color\": [184, 0, 255],\n        \"id\": 107,\n        \"isthing\": 1,\n        \"name\": \"washer, automatic washer, washing machine\",\n    },\n    {\"color\": [255, 0, 31], \"id\": 108, \"isthing\": 1, \"name\": \"plaything, toy\"},\n    {\"color\": [0, 184, 255], \"id\": 109, \"isthing\": 0, \"name\": \"pool\"},\n    {\"color\": [0, 214, 255], \"id\": 110, \"isthing\": 1, \"name\": \"stool\"},\n    {\"color\": [255, 0, 112], \"id\": 111, \"isthing\": 1, \"name\": \"barrel, cask\"},\n    {\"color\": [92, 255, 0], \"id\": 112, \"isthing\": 1, \"name\": \"basket, handbasket\"},\n    {\"color\": [0, 224, 255], \"id\": 113, \"isthing\": 0, \"name\": \"falls\"},\n    {\"color\": [112, 224, 255], \"id\": 114, \"isthing\": 0, \"name\": \"tent\"},\n    {\"color\": [70, 184, 160], \"id\": 115, \"isthing\": 1, \"name\": \"bag\"},\n    {\"color\": [163, 0, 255], \"id\": 116, \"isthing\": 1, \"name\": \"minibike, motorbike\"},\n    {\"color\": [153, 0, 255], \"id\": 117, \"isthing\": 0, \"name\": \"cradle\"},\n    {\"color\": [71, 255, 0], \"id\": 118, \"isthing\": 1, \"name\": \"oven\"},\n    {\"color\": [255, 0, 163], \"id\": 119, \"isthing\": 1, \"name\": \"ball\"},\n    {\"color\": [255, 204, 0], \"id\": 120, \"isthing\": 1, \"name\": \"food, solid food\"},\n    {\"color\": [255, 0, 143], \"id\": 121, \"isthing\": 1, \"name\": \"step, stair\"},\n    {\"color\": [0, 255, 235], \"id\": 122, \"isthing\": 0, \"name\": \"tank, storage tank\"},\n    {\"color\": [133, 255, 0], \"id\": 123, \"isthing\": 1, \"name\": \"trade name\"},\n    {\"color\": [255, 0, 235], \"id\": 124, \"isthing\": 1, \"name\": \"microwave\"},\n    {\"color\": [245, 0, 255], \"id\": 125, \"isthing\": 1, \"name\": \"pot\"},\n    {\"color\": [255, 0, 122], \"id\": 126, \"isthing\": 1, \"name\": \"animal\"},\n    {\"color\": [255, 245, 0], \"id\": 127, \"isthing\": 1, \"name\": \"bicycle\"},\n    {\"color\": [10, 190, 212], \"id\": 128, \"isthing\": 0, \"name\": \"lake\"},\n    {\"color\": 
[214, 255, 0], \"id\": 129, \"isthing\": 1, \"name\": \"dishwasher\"},\n    {\"color\": [0, 204, 255], \"id\": 130, \"isthing\": 1, \"name\": \"screen\"},\n    {\"color\": [20, 0, 255], \"id\": 131, \"isthing\": 0, \"name\": \"blanket, cover\"},\n    {\"color\": [255, 255, 0], \"id\": 132, \"isthing\": 1, \"name\": \"sculpture\"},\n    {\"color\": [0, 153, 255], \"id\": 133, \"isthing\": 1, \"name\": \"hood, exhaust hood\"},\n    {\"color\": [0, 41, 255], \"id\": 134, \"isthing\": 1, \"name\": \"sconce\"},\n    {\"color\": [0, 255, 204], \"id\": 135, \"isthing\": 1, \"name\": \"vase\"},\n    {\"color\": [41, 0, 255], \"id\": 136, \"isthing\": 1, \"name\": \"traffic light\"},\n    {\"color\": [41, 255, 0], \"id\": 137, \"isthing\": 1, \"name\": \"tray\"},\n    {\"color\": [173, 0, 255], \"id\": 138, \"isthing\": 1, \"name\": \"trash can\"},\n    {\"color\": [0, 245, 255], \"id\": 139, \"isthing\": 1, \"name\": \"fan\"},\n    {\"color\": [71, 0, 255], \"id\": 140, \"isthing\": 0, \"name\": \"pier\"},\n    {\"color\": [122, 0, 255], \"id\": 141, \"isthing\": 0, \"name\": \"crt screen\"},\n    {\"color\": [0, 255, 184], \"id\": 142, \"isthing\": 1, \"name\": \"plate\"},\n    {\"color\": [0, 92, 255], \"id\": 143, \"isthing\": 1, \"name\": \"monitor\"},\n    {\"color\": [184, 255, 0], \"id\": 144, \"isthing\": 1, \"name\": \"bulletin board\"},\n    {\"color\": [0, 133, 255], \"id\": 145, \"isthing\": 0, \"name\": \"shower\"},\n    {\"color\": [255, 214, 0], \"id\": 146, \"isthing\": 1, \"name\": \"radiator\"},\n    {\"color\": [25, 194, 194], \"id\": 147, \"isthing\": 1, \"name\": \"glass, drinking glass\"},\n    {\"color\": [102, 255, 0], \"id\": 148, \"isthing\": 1, \"name\": \"clock\"},\n    {\"color\": [92, 0, 255], \"id\": 149, \"isthing\": 1, \"name\": \"flag\"},\n]\n\nADE20k_COLORS = [k[\"color\"] for k in ADE20K_150_CATEGORIES]\n\nMetadataCatalog.get(\"ade20k_sem_seg_train\").set(\n    
stuff_colors=ADE20k_COLORS[:],\n)\n\nMetadataCatalog.get(\"ade20k_sem_seg_val\").set(\n    stuff_colors=ADE20k_COLORS[:],\n)\n\n\ndef load_ade20k_panoptic_json(json_file, image_dir, gt_dir, semseg_dir, meta):\n    \"\"\"\n    Args:\n        image_dir (str): path to the raw dataset. e.g., \"~/coco/train2017\".\n        gt_dir (str): path to the raw annotations. e.g., \"~/coco/panoptic_train2017\".\n        json_file (str): path to the json file. e.g., \"~/coco/annotations/panoptic_train2017.json\".\n    Returns:\n        list[dict]: a list of dicts in Detectron2 standard format. (See\n        `Using Custom Datasets </tutorials/datasets.html>`_ )\n    \"\"\"\n\n    def _convert_category_id(segment_info, meta):\n        if segment_info[\"category_id\"] in meta[\"thing_dataset_id_to_contiguous_id\"]:\n            segment_info[\"category_id\"] = meta[\"thing_dataset_id_to_contiguous_id\"][\n                segment_info[\"category_id\"]\n            ]\n            segment_info[\"isthing\"] = True\n        else:\n            segment_info[\"category_id\"] = meta[\"stuff_dataset_id_to_contiguous_id\"][\n                segment_info[\"category_id\"]\n            ]\n            segment_info[\"isthing\"] = False\n        return segment_info\n\n    with PathManager.open(json_file) as f:\n        json_info = json.load(f)\n\n    ret = []\n    for ann in json_info[\"annotations\"]:\n        image_id = ann[\"image_id\"]\n        # TODO: currently we assume image and label have the same filename but\n        # different extension, and images have extension \".jpg\" for COCO. 
Need\n        # to make image extension a user-provided argument if we extend this\n        # function to support other COCO-like datasets.\n        image_file = os.path.join(image_dir, os.path.splitext(ann[\"file_name\"])[0] + \".jpg\")\n        label_file = os.path.join(gt_dir, ann[\"file_name\"])\n        sem_label_file = os.path.join(semseg_dir, ann[\"file_name\"])\n        segments_info = [_convert_category_id(x, meta) for x in ann[\"segments_info\"]]\n        ret.append(\n            {\n                \"file_name\": image_file,\n                \"image_id\": image_id,\n                \"pan_seg_file_name\": label_file,\n                \"sem_seg_file_name\": sem_label_file,\n                \"segments_info\": segments_info,\n            }\n        )\n    assert len(ret), f\"No images found in {image_dir}!\"\n    assert PathManager.isfile(ret[0][\"file_name\"]), ret[0][\"file_name\"]\n    assert PathManager.isfile(ret[0][\"pan_seg_file_name\"]), ret[0][\"pan_seg_file_name\"]\n    assert PathManager.isfile(ret[0][\"sem_seg_file_name\"]), ret[0][\"sem_seg_file_name\"]\n    return ret\n\n\ndef register_ade20k_panoptic(\n    name, metadata, image_root, panoptic_root, semantic_root, panoptic_json, instances_json=None\n):\n    \"\"\"\n    Register a \"standard\" version of ADE20k panoptic segmentation dataset named `name`.\n    The dictionaries in this registered dataset follow detectron2's standard format.\n    Hence it's called \"standard\".\n    Args:\n        name (str): the name that identifies a dataset,\n            e.g. 
\"ade20k_panoptic_train\"\n        metadata (dict): extra metadata associated with this dataset.\n        image_root (str): directory which contains all the images\n        panoptic_root (str): directory which contains panoptic annotation images in COCO format\n        semantic_root (str): directory which contains the semantic annotation images\n        panoptic_json (str): path to the json panoptic annotation file in COCO format\n        instances_json (str): path to the json instance annotation file\n    \"\"\"\n    panoptic_name = name\n    DatasetCatalog.register(\n        panoptic_name,\n        lambda: load_ade20k_panoptic_json(\n            panoptic_json, image_root, panoptic_root, semantic_root, metadata\n        ),\n    )\n    MetadataCatalog.get(panoptic_name).set(\n        panoptic_root=panoptic_root,\n        image_root=image_root,\n        panoptic_json=panoptic_json,\n        json_file=instances_json,\n        evaluator_type=\"ade20k_panoptic_seg\",\n        ignore_label=255,\n        label_divisor=1000,\n        **metadata,\n    )\n\n\n_PREDEFINED_SPLITS_ADE20K_PANOPTIC = {\n    \"ade20k_panoptic_train\": (\n        \"ADEChallengeData2016/images/training\",\n        \"ADEChallengeData2016/ade20k_panoptic_train\",\n        \"ADEChallengeData2016/ade20k_panoptic_train.json\",\n        \"ADEChallengeData2016/annotations_detectron2/training\",\n        \"ADEChallengeData2016/ade20k_instance_train.json\",\n    ),\n    \"ade20k_panoptic_val\": (\n        \"ADEChallengeData2016/images/validation\",\n        \"ADEChallengeData2016/ade20k_panoptic_val\",\n        \"ADEChallengeData2016/ade20k_panoptic_val.json\",\n        \"ADEChallengeData2016/annotations_detectron2/validation\",\n        \"ADEChallengeData2016/ade20k_instance_val.json\",\n    ),\n}\n\n\ndef get_metadata():\n    meta = {}\n    # The following metadata maps contiguous id from [0, #thing categories +\n    # #stuff categories) to their names and colors. 
We have two copies of the\n    # same name and color under \"thing_*\" and \"stuff_*\" because the current\n    # visualization function in D2 handles thing and stuff classes differently\n    # due to some heuristic used in Panoptic FPN. We keep the same naming to\n    # enable reusing existing visualization functions.\n    thing_classes = [k[\"name\"] for k in ADE20K_150_CATEGORIES if k[\"isthing\"] == 1]\n    thing_colors = [k[\"color\"] for k in ADE20K_150_CATEGORIES if k[\"isthing\"] == 1]\n    stuff_classes = [k[\"name\"] for k in ADE20K_150_CATEGORIES]\n    stuff_colors = [k[\"color\"] for k in ADE20K_150_CATEGORIES]\n\n    meta[\"thing_classes\"] = thing_classes\n    meta[\"thing_colors\"] = thing_colors\n    meta[\"stuff_classes\"] = stuff_classes\n    meta[\"stuff_colors\"] = stuff_colors\n\n    # Convert category id for training:\n    #   category id: like semantic segmentation, it is the class id for each\n    #   pixel. Since there are some classes not used in evaluation, the category\n    #   id is not always contiguous and thus we have two sets of category ids:\n    #       - original category id: category id in the original dataset, mainly\n    #           used for evaluation.\n    #       - contiguous category id: [0, #classes), in order to train the linear\n    #           softmax classifier.\n    thing_dataset_id_to_contiguous_id = {}\n    stuff_dataset_id_to_contiguous_id = {}\n\n    for i, cat in enumerate(ADE20K_150_CATEGORIES):\n        if cat[\"isthing\"]:\n            thing_dataset_id_to_contiguous_id[cat[\"id\"]] = i\n        # else:\n        #     stuff_dataset_id_to_contiguous_id[cat[\"id\"]] = i\n\n        # in order to use sem_seg evaluator\n        stuff_dataset_id_to_contiguous_id[cat[\"id\"]] = i\n\n    meta[\"thing_dataset_id_to_contiguous_id\"] = thing_dataset_id_to_contiguous_id\n    meta[\"stuff_dataset_id_to_contiguous_id\"] = stuff_dataset_id_to_contiguous_id\n\n    return meta\n\n\ndef register_all_ade20k_panoptic(root):\n    metadata = get_metadata()\n    for (\n        prefix,\n        (image_root, panoptic_root, panoptic_json, semantic_root, instance_json),\n    ) in _PREDEFINED_SPLITS_ADE20K_PANOPTIC.items():\n        # The \"standard\" version of ADE20k panoptic segmentation dataset,\n        # e.g. used by Panoptic-DeepLab\n        register_ade20k_panoptic(\n            prefix,\n            metadata,\n            os.path.join(root, image_root),\n            os.path.join(root, panoptic_root),\n            os.path.join(root, semantic_root),\n            os.path.join(root, panoptic_json),\n            os.path.join(root, instance_json),\n        )\n\n\n_root = os.getenv(\"DETECTRON2_DATASETS\", \"datasets\")\nregister_all_ade20k_panoptic(_root)\n"
  },
  {
    "path": "mfvis_nococo/mask2former/data/datasets/register_coco_panoptic_annos_semseg.py",
    "content": "import json\nimport os\n\nfrom detectron2.data import DatasetCatalog, MetadataCatalog\nfrom detectron2.data.datasets import load_sem_seg\nfrom detectron2.data.datasets.builtin_meta import COCO_CATEGORIES\nfrom detectron2.utils.file_io import PathManager\n\n\n_PREDEFINED_SPLITS_COCO_PANOPTIC = {\n    \"coco_2017_train_panoptic\": (\n        # This is the original panoptic annotation directory\n        \"coco/panoptic_train2017\",\n        \"coco/annotations/panoptic_train2017.json\",\n        # This directory contains semantic annotations that are\n        # converted from panoptic annotations.\n        # It is used by PanopticFPN.\n        # You can use the script at detectron2/datasets/prepare_panoptic_fpn.py\n        # to create these directories.\n        \"coco/panoptic_semseg_train2017\",\n    ),\n    \"coco_2017_val_panoptic\": (\n        \"coco/panoptic_val2017\",\n        \"coco/annotations/panoptic_val2017.json\",\n        \"coco/panoptic_semseg_val2017\",\n    ),\n}\n\n\ndef get_metadata():\n    meta = {}\n    # The following metadata maps contiguous id from [0, #thing categories +\n    # #stuff categories) to their names and colors. We have two copies of the\n    # same name and color under \"thing_*\" and \"stuff_*\" because the current\n    # visualization function in D2 handles thing and stuff classes differently\n    # due to some heuristic used in Panoptic FPN. 
We keep the same naming to\n    # enable reusing existing visualization functions.\n    thing_classes = [k[\"name\"] for k in COCO_CATEGORIES if k[\"isthing\"] == 1]\n    thing_colors = [k[\"color\"] for k in COCO_CATEGORIES if k[\"isthing\"] == 1]\n    stuff_classes = [k[\"name\"] for k in COCO_CATEGORIES]\n    stuff_colors = [k[\"color\"] for k in COCO_CATEGORIES]\n\n    meta[\"thing_classes\"] = thing_classes\n    meta[\"thing_colors\"] = thing_colors\n    meta[\"stuff_classes\"] = stuff_classes\n    meta[\"stuff_colors\"] = stuff_colors\n\n    # Convert category id for training:\n    #   category id: like semantic segmentation, it is the class id for each\n    #   pixel. Since there are some classes not used in evaluation, the category\n    #   id is not always contiguous and thus we have two sets of category ids:\n    #       - original category id: category id in the original dataset, mainly\n    #           used for evaluation.\n    #       - contiguous category id: [0, #classes), in order to train the linear\n    #           softmax classifier.\n    thing_dataset_id_to_contiguous_id = {}\n    stuff_dataset_id_to_contiguous_id = {}\n\n    for i, cat in enumerate(COCO_CATEGORIES):\n        if cat[\"isthing\"]:\n            thing_dataset_id_to_contiguous_id[cat[\"id\"]] = i\n        # else:\n        #     stuff_dataset_id_to_contiguous_id[cat[\"id\"]] = i\n\n        # in order to use sem_seg evaluator\n        stuff_dataset_id_to_contiguous_id[cat[\"id\"]] = i\n\n    meta[\"thing_dataset_id_to_contiguous_id\"] = thing_dataset_id_to_contiguous_id\n    meta[\"stuff_dataset_id_to_contiguous_id\"] = stuff_dataset_id_to_contiguous_id\n\n    return meta\n\n\ndef load_coco_panoptic_json(json_file, image_dir, gt_dir, semseg_dir, meta):\n    \"\"\"\n    Args:\n        image_dir (str): path to the raw dataset. e.g., \"~/coco/train2017\".\n        gt_dir (str): path to the raw annotations. 
e.g., \"~/coco/panoptic_train2017\".\n        json_file (str): path to the json file. e.g., \"~/coco/annotations/panoptic_train2017.json\".\n    Returns:\n        list[dict]: a list of dicts in Detectron2 standard format. (See\n        `Using Custom Datasets </tutorials/datasets.html>`_ )\n    \"\"\"\n\n    def _convert_category_id(segment_info, meta):\n        if segment_info[\"category_id\"] in meta[\"thing_dataset_id_to_contiguous_id\"]:\n            segment_info[\"category_id\"] = meta[\"thing_dataset_id_to_contiguous_id\"][\n                segment_info[\"category_id\"]\n            ]\n            segment_info[\"isthing\"] = True\n        else:\n            segment_info[\"category_id\"] = meta[\"stuff_dataset_id_to_contiguous_id\"][\n                segment_info[\"category_id\"]\n            ]\n            segment_info[\"isthing\"] = False\n        return segment_info\n\n    with PathManager.open(json_file) as f:\n        json_info = json.load(f)\n\n    ret = []\n    for ann in json_info[\"annotations\"]:\n        image_id = int(ann[\"image_id\"])\n        # TODO: currently we assume image and label have the same filename but\n        # different extension, and images have extension \".jpg\" for COCO. 
Need\n        # to make image extension a user-provided argument if we extend this\n        # function to support other COCO-like datasets.\n        image_file = os.path.join(image_dir, os.path.splitext(ann[\"file_name\"])[0] + \".jpg\")\n        label_file = os.path.join(gt_dir, ann[\"file_name\"])\n        sem_label_file = os.path.join(semseg_dir, ann[\"file_name\"])\n        segments_info = [_convert_category_id(x, meta) for x in ann[\"segments_info\"]]\n        ret.append(\n            {\n                \"file_name\": image_file,\n                \"image_id\": image_id,\n                \"pan_seg_file_name\": label_file,\n                \"sem_seg_file_name\": sem_label_file,\n                \"segments_info\": segments_info,\n            }\n        )\n    assert len(ret), f\"No images found in {image_dir}!\"\n    assert PathManager.isfile(ret[0][\"file_name\"]), ret[0][\"file_name\"]\n    assert PathManager.isfile(ret[0][\"pan_seg_file_name\"]), ret[0][\"pan_seg_file_name\"]\n    assert PathManager.isfile(ret[0][\"sem_seg_file_name\"]), ret[0][\"sem_seg_file_name\"]\n    return ret\n\n\ndef register_coco_panoptic_annos_sem_seg(\n    name, metadata, image_root, panoptic_root, panoptic_json, sem_seg_root, instances_json\n):\n    panoptic_name = name\n    delattr(MetadataCatalog.get(panoptic_name), \"thing_classes\")\n    delattr(MetadataCatalog.get(panoptic_name), \"thing_colors\")\n    MetadataCatalog.get(panoptic_name).set(\n        thing_classes=metadata[\"thing_classes\"],\n        thing_colors=metadata[\"thing_colors\"],\n        # thing_dataset_id_to_contiguous_id=metadata[\"thing_dataset_id_to_contiguous_id\"],\n    )\n\n    # the name is \"coco_2017_train_panoptic_with_sem_seg\" and \"coco_2017_val_panoptic_with_sem_seg\"\n    semantic_name = name + \"_with_sem_seg\"\n    DatasetCatalog.register(\n        semantic_name,\n        lambda: load_coco_panoptic_json(panoptic_json, image_root, panoptic_root, sem_seg_root, metadata),\n    )\n    
MetadataCatalog.get(semantic_name).set(\n        sem_seg_root=sem_seg_root,\n        panoptic_root=panoptic_root,\n        image_root=image_root,\n        panoptic_json=panoptic_json,\n        json_file=instances_json,\n        evaluator_type=\"coco_panoptic_seg\",\n        ignore_label=255,\n        label_divisor=1000,\n        **metadata,\n    )\n\n\ndef register_all_coco_panoptic_annos_sem_seg(root):\n    for (\n        prefix,\n        (panoptic_root, panoptic_json, semantic_root),\n    ) in _PREDEFINED_SPLITS_COCO_PANOPTIC.items():\n        prefix_instances = prefix[: -len(\"_panoptic\")]\n        instances_meta = MetadataCatalog.get(prefix_instances)\n        image_root, instances_json = instances_meta.image_root, instances_meta.json_file\n\n        register_coco_panoptic_annos_sem_seg(\n            prefix,\n            get_metadata(),\n            image_root,\n            os.path.join(root, panoptic_root),\n            os.path.join(root, panoptic_json),\n            os.path.join(root, semantic_root),\n            instances_json,\n        )\n\n\n_root = os.getenv(\"DETECTRON2_DATASETS\", \"datasets\")\nregister_all_coco_panoptic_annos_sem_seg(_root)\n"
  },
  {
    "path": "mfvis_nococo/mask2former/data/datasets/register_coco_stuff_10k.py",
    "content": "import os\n\nfrom detectron2.data import DatasetCatalog, MetadataCatalog\nfrom detectron2.data.datasets import load_sem_seg\n\nCOCO_CATEGORIES = [\n    {\"color\": [220, 20, 60], \"isthing\": 1, \"id\": 1, \"name\": \"person\"},\n    {\"color\": [119, 11, 32], \"isthing\": 1, \"id\": 2, \"name\": \"bicycle\"},\n    {\"color\": [0, 0, 142], \"isthing\": 1, \"id\": 3, \"name\": \"car\"},\n    {\"color\": [0, 0, 230], \"isthing\": 1, \"id\": 4, \"name\": \"motorcycle\"},\n    {\"color\": [106, 0, 228], \"isthing\": 1, \"id\": 5, \"name\": \"airplane\"},\n    {\"color\": [0, 60, 100], \"isthing\": 1, \"id\": 6, \"name\": \"bus\"},\n    {\"color\": [0, 80, 100], \"isthing\": 1, \"id\": 7, \"name\": \"train\"},\n    {\"color\": [0, 0, 70], \"isthing\": 1, \"id\": 8, \"name\": \"truck\"},\n    {\"color\": [0, 0, 192], \"isthing\": 1, \"id\": 9, \"name\": \"boat\"},\n    {\"color\": [250, 170, 30], \"isthing\": 1, \"id\": 10, \"name\": \"traffic light\"},\n    {\"color\": [100, 170, 30], \"isthing\": 1, \"id\": 11, \"name\": \"fire hydrant\"},\n    {\"color\": [220, 220, 0], \"isthing\": 1, \"id\": 13, \"name\": \"stop sign\"},\n    {\"color\": [175, 116, 175], \"isthing\": 1, \"id\": 14, \"name\": \"parking meter\"},\n    {\"color\": [250, 0, 30], \"isthing\": 1, \"id\": 15, \"name\": \"bench\"},\n    {\"color\": [165, 42, 42], \"isthing\": 1, \"id\": 16, \"name\": \"bird\"},\n    {\"color\": [255, 77, 255], \"isthing\": 1, \"id\": 17, \"name\": \"cat\"},\n    {\"color\": [0, 226, 252], \"isthing\": 1, \"id\": 18, \"name\": \"dog\"},\n    {\"color\": [182, 182, 255], \"isthing\": 1, \"id\": 19, \"name\": \"horse\"},\n    {\"color\": [0, 82, 0], \"isthing\": 1, \"id\": 20, \"name\": \"sheep\"},\n    {\"color\": [120, 166, 157], \"isthing\": 1, \"id\": 21, \"name\": \"cow\"},\n    {\"color\": [110, 76, 0], \"isthing\": 1, \"id\": 22, \"name\": \"elephant\"},\n    {\"color\": [174, 57, 255], \"isthing\": 1, \"id\": 23, \"name\": \"bear\"},\n    {\"color\": 
[199, 100, 0], \"isthing\": 1, \"id\": 24, \"name\": \"zebra\"},\n    {\"color\": [72, 0, 118], \"isthing\": 1, \"id\": 25, \"name\": \"giraffe\"},\n    {\"color\": [255, 179, 240], \"isthing\": 1, \"id\": 27, \"name\": \"backpack\"},\n    {\"color\": [0, 125, 92], \"isthing\": 1, \"id\": 28, \"name\": \"umbrella\"},\n    {\"color\": [209, 0, 151], \"isthing\": 1, \"id\": 31, \"name\": \"handbag\"},\n    {\"color\": [188, 208, 182], \"isthing\": 1, \"id\": 32, \"name\": \"tie\"},\n    {\"color\": [0, 220, 176], \"isthing\": 1, \"id\": 33, \"name\": \"suitcase\"},\n    {\"color\": [255, 99, 164], \"isthing\": 1, \"id\": 34, \"name\": \"frisbee\"},\n    {\"color\": [92, 0, 73], \"isthing\": 1, \"id\": 35, \"name\": \"skis\"},\n    {\"color\": [133, 129, 255], \"isthing\": 1, \"id\": 36, \"name\": \"snowboard\"},\n    {\"color\": [78, 180, 255], \"isthing\": 1, \"id\": 37, \"name\": \"sports ball\"},\n    {\"color\": [0, 228, 0], \"isthing\": 1, \"id\": 38, \"name\": \"kite\"},\n    {\"color\": [174, 255, 243], \"isthing\": 1, \"id\": 39, \"name\": \"baseball bat\"},\n    {\"color\": [45, 89, 255], \"isthing\": 1, \"id\": 40, \"name\": \"baseball glove\"},\n    {\"color\": [134, 134, 103], \"isthing\": 1, \"id\": 41, \"name\": \"skateboard\"},\n    {\"color\": [145, 148, 174], \"isthing\": 1, \"id\": 42, \"name\": \"surfboard\"},\n    {\"color\": [255, 208, 186], \"isthing\": 1, \"id\": 43, \"name\": \"tennis racket\"},\n    {\"color\": [197, 226, 255], \"isthing\": 1, \"id\": 44, \"name\": \"bottle\"},\n    {\"color\": [171, 134, 1], \"isthing\": 1, \"id\": 46, \"name\": \"wine glass\"},\n    {\"color\": [109, 63, 54], \"isthing\": 1, \"id\": 47, \"name\": \"cup\"},\n    {\"color\": [207, 138, 255], \"isthing\": 1, \"id\": 48, \"name\": \"fork\"},\n    {\"color\": [151, 0, 95], \"isthing\": 1, \"id\": 49, \"name\": \"knife\"},\n    {\"color\": [9, 80, 61], \"isthing\": 1, \"id\": 50, \"name\": \"spoon\"},\n    {\"color\": [84, 105, 51], \"isthing\": 1, \"id\": 51, 
\"name\": \"bowl\"},\n    {\"color\": [74, 65, 105], \"isthing\": 1, \"id\": 52, \"name\": \"banana\"},\n    {\"color\": [166, 196, 102], \"isthing\": 1, \"id\": 53, \"name\": \"apple\"},\n    {\"color\": [208, 195, 210], \"isthing\": 1, \"id\": 54, \"name\": \"sandwich\"},\n    {\"color\": [255, 109, 65], \"isthing\": 1, \"id\": 55, \"name\": \"orange\"},\n    {\"color\": [0, 143, 149], \"isthing\": 1, \"id\": 56, \"name\": \"broccoli\"},\n    {\"color\": [179, 0, 194], \"isthing\": 1, \"id\": 57, \"name\": \"carrot\"},\n    {\"color\": [209, 99, 106], \"isthing\": 1, \"id\": 58, \"name\": \"hot dog\"},\n    {\"color\": [5, 121, 0], \"isthing\": 1, \"id\": 59, \"name\": \"pizza\"},\n    {\"color\": [227, 255, 205], \"isthing\": 1, \"id\": 60, \"name\": \"donut\"},\n    {\"color\": [147, 186, 208], \"isthing\": 1, \"id\": 61, \"name\": \"cake\"},\n    {\"color\": [153, 69, 1], \"isthing\": 1, \"id\": 62, \"name\": \"chair\"},\n    {\"color\": [3, 95, 161], \"isthing\": 1, \"id\": 63, \"name\": \"couch\"},\n    {\"color\": [163, 255, 0], \"isthing\": 1, \"id\": 64, \"name\": \"potted plant\"},\n    {\"color\": [119, 0, 170], \"isthing\": 1, \"id\": 65, \"name\": \"bed\"},\n    {\"color\": [0, 182, 199], \"isthing\": 1, \"id\": 67, \"name\": \"dining table\"},\n    {\"color\": [0, 165, 120], \"isthing\": 1, \"id\": 70, \"name\": \"toilet\"},\n    {\"color\": [183, 130, 88], \"isthing\": 1, \"id\": 72, \"name\": \"tv\"},\n    {\"color\": [95, 32, 0], \"isthing\": 1, \"id\": 73, \"name\": \"laptop\"},\n    {\"color\": [130, 114, 135], \"isthing\": 1, \"id\": 74, \"name\": \"mouse\"},\n    {\"color\": [110, 129, 133], \"isthing\": 1, \"id\": 75, \"name\": \"remote\"},\n    {\"color\": [166, 74, 118], \"isthing\": 1, \"id\": 76, \"name\": \"keyboard\"},\n    {\"color\": [219, 142, 185], \"isthing\": 1, \"id\": 77, \"name\": \"cell phone\"},\n    {\"color\": [79, 210, 114], \"isthing\": 1, \"id\": 78, \"name\": \"microwave\"},\n    {\"color\": [178, 90, 62], \"isthing\": 
1, \"id\": 79, \"name\": \"oven\"},\n    {\"color\": [65, 70, 15], \"isthing\": 1, \"id\": 80, \"name\": \"toaster\"},\n    {\"color\": [127, 167, 115], \"isthing\": 1, \"id\": 81, \"name\": \"sink\"},\n    {\"color\": [59, 105, 106], \"isthing\": 1, \"id\": 82, \"name\": \"refrigerator\"},\n    {\"color\": [142, 108, 45], \"isthing\": 1, \"id\": 84, \"name\": \"book\"},\n    {\"color\": [196, 172, 0], \"isthing\": 1, \"id\": 85, \"name\": \"clock\"},\n    {\"color\": [95, 54, 80], \"isthing\": 1, \"id\": 86, \"name\": \"vase\"},\n    {\"color\": [128, 76, 255], \"isthing\": 1, \"id\": 87, \"name\": \"scissors\"},\n    {\"color\": [201, 57, 1], \"isthing\": 1, \"id\": 88, \"name\": \"teddy bear\"},\n    {\"color\": [246, 0, 122], \"isthing\": 1, \"id\": 89, \"name\": \"hair drier\"},\n    {\"color\": [191, 162, 208], \"isthing\": 1, \"id\": 90, \"name\": \"toothbrush\"},\n    {\"id\": 92, \"name\": \"banner\", \"supercategory\": \"textile\"},\n    {\"id\": 93, \"name\": \"blanket\", \"supercategory\": \"textile\"},\n    {\"id\": 94, \"name\": \"branch\", \"supercategory\": \"plant\"},\n    {\"id\": 95, \"name\": \"bridge\", \"supercategory\": \"building\"},\n    {\"id\": 96, \"name\": \"building-other\", \"supercategory\": \"building\"},\n    {\"id\": 97, \"name\": \"bush\", \"supercategory\": \"plant\"},\n    {\"id\": 98, \"name\": \"cabinet\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 99, \"name\": \"cage\", \"supercategory\": \"structural\"},\n    {\"id\": 100, \"name\": \"cardboard\", \"supercategory\": \"raw-material\"},\n    {\"id\": 101, \"name\": \"carpet\", \"supercategory\": \"floor\"},\n    {\"id\": 102, \"name\": \"ceiling-other\", \"supercategory\": \"ceiling\"},\n    {\"id\": 103, \"name\": \"ceiling-tile\", \"supercategory\": \"ceiling\"},\n    {\"id\": 104, \"name\": \"cloth\", \"supercategory\": \"textile\"},\n    {\"id\": 105, \"name\": \"clothes\", \"supercategory\": \"textile\"},\n    {\"id\": 106, \"name\": \"clouds\", 
\"supercategory\": \"sky\"},\n    {\"id\": 107, \"name\": \"counter\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 108, \"name\": \"cupboard\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 109, \"name\": \"curtain\", \"supercategory\": \"textile\"},\n    {\"id\": 110, \"name\": \"desk-stuff\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 111, \"name\": \"dirt\", \"supercategory\": \"ground\"},\n    {\"id\": 112, \"name\": \"door-stuff\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 113, \"name\": \"fence\", \"supercategory\": \"structural\"},\n    {\"id\": 114, \"name\": \"floor-marble\", \"supercategory\": \"floor\"},\n    {\"id\": 115, \"name\": \"floor-other\", \"supercategory\": \"floor\"},\n    {\"id\": 116, \"name\": \"floor-stone\", \"supercategory\": \"floor\"},\n    {\"id\": 117, \"name\": \"floor-tile\", \"supercategory\": \"floor\"},\n    {\"id\": 118, \"name\": \"floor-wood\", \"supercategory\": \"floor\"},\n    {\"id\": 119, \"name\": \"flower\", \"supercategory\": \"plant\"},\n    {\"id\": 120, \"name\": \"fog\", \"supercategory\": \"water\"},\n    {\"id\": 121, \"name\": \"food-other\", \"supercategory\": \"food-stuff\"},\n    {\"id\": 122, \"name\": \"fruit\", \"supercategory\": \"food-stuff\"},\n    {\"id\": 123, \"name\": \"furniture-other\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 124, \"name\": \"grass\", \"supercategory\": \"plant\"},\n    {\"id\": 125, \"name\": \"gravel\", \"supercategory\": \"ground\"},\n    {\"id\": 126, \"name\": \"ground-other\", \"supercategory\": \"ground\"},\n    {\"id\": 127, \"name\": \"hill\", \"supercategory\": \"solid\"},\n    {\"id\": 128, \"name\": \"house\", \"supercategory\": \"building\"},\n    {\"id\": 129, \"name\": \"leaves\", \"supercategory\": \"plant\"},\n    {\"id\": 130, \"name\": \"light\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 131, \"name\": \"mat\", \"supercategory\": \"textile\"},\n    {\"id\": 132, \"name\": \"metal\", 
\"supercategory\": \"raw-material\"},\n    {\"id\": 133, \"name\": \"mirror-stuff\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 134, \"name\": \"moss\", \"supercategory\": \"plant\"},\n    {\"id\": 135, \"name\": \"mountain\", \"supercategory\": \"solid\"},\n    {\"id\": 136, \"name\": \"mud\", \"supercategory\": \"ground\"},\n    {\"id\": 137, \"name\": \"napkin\", \"supercategory\": \"textile\"},\n    {\"id\": 138, \"name\": \"net\", \"supercategory\": \"structural\"},\n    {\"id\": 139, \"name\": \"paper\", \"supercategory\": \"raw-material\"},\n    {\"id\": 140, \"name\": \"pavement\", \"supercategory\": \"ground\"},\n    {\"id\": 141, \"name\": \"pillow\", \"supercategory\": \"textile\"},\n    {\"id\": 142, \"name\": \"plant-other\", \"supercategory\": \"plant\"},\n    {\"id\": 143, \"name\": \"plastic\", \"supercategory\": \"raw-material\"},\n    {\"id\": 144, \"name\": \"platform\", \"supercategory\": \"ground\"},\n    {\"id\": 145, \"name\": \"playingfield\", \"supercategory\": \"ground\"},\n    {\"id\": 146, \"name\": \"railing\", \"supercategory\": \"structural\"},\n    {\"id\": 147, \"name\": \"railroad\", \"supercategory\": \"ground\"},\n    {\"id\": 148, \"name\": \"river\", \"supercategory\": \"water\"},\n    {\"id\": 149, \"name\": \"road\", \"supercategory\": \"ground\"},\n    {\"id\": 150, \"name\": \"rock\", \"supercategory\": \"solid\"},\n    {\"id\": 151, \"name\": \"roof\", \"supercategory\": \"building\"},\n    {\"id\": 152, \"name\": \"rug\", \"supercategory\": \"textile\"},\n    {\"id\": 153, \"name\": \"salad\", \"supercategory\": \"food-stuff\"},\n    {\"id\": 154, \"name\": \"sand\", \"supercategory\": \"ground\"},\n    {\"id\": 155, \"name\": \"sea\", \"supercategory\": \"water\"},\n    {\"id\": 156, \"name\": \"shelf\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 157, \"name\": \"sky-other\", \"supercategory\": \"sky\"},\n    {\"id\": 158, \"name\": \"skyscraper\", \"supercategory\": \"building\"},\n    {\"id\": 
159, \"name\": \"snow\", \"supercategory\": \"ground\"},\n    {\"id\": 160, \"name\": \"solid-other\", \"supercategory\": \"solid\"},\n    {\"id\": 161, \"name\": \"stairs\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 162, \"name\": \"stone\", \"supercategory\": \"solid\"},\n    {\"id\": 163, \"name\": \"straw\", \"supercategory\": \"plant\"},\n    {\"id\": 164, \"name\": \"structural-other\", \"supercategory\": \"structural\"},\n    {\"id\": 165, \"name\": \"table\", \"supercategory\": \"furniture-stuff\"},\n    {\"id\": 166, \"name\": \"tent\", \"supercategory\": \"building\"},\n    {\"id\": 167, \"name\": \"textile-other\", \"supercategory\": \"textile\"},\n    {\"id\": 168, \"name\": \"towel\", \"supercategory\": \"textile\"},\n    {\"id\": 169, \"name\": \"tree\", \"supercategory\": \"plant\"},\n    {\"id\": 170, \"name\": \"vegetable\", \"supercategory\": \"food-stuff\"},\n    {\"id\": 171, \"name\": \"wall-brick\", \"supercategory\": \"wall\"},\n    {\"id\": 172, \"name\": \"wall-concrete\", \"supercategory\": \"wall\"},\n    {\"id\": 173, \"name\": \"wall-other\", \"supercategory\": \"wall\"},\n    {\"id\": 174, \"name\": \"wall-panel\", \"supercategory\": \"wall\"},\n    {\"id\": 175, \"name\": \"wall-stone\", \"supercategory\": \"wall\"},\n    {\"id\": 176, \"name\": \"wall-tile\", \"supercategory\": \"wall\"},\n    {\"id\": 177, \"name\": \"wall-wood\", \"supercategory\": \"wall\"},\n    {\"id\": 178, \"name\": \"water-other\", \"supercategory\": \"water\"},\n    {\"id\": 179, \"name\": \"waterdrops\", \"supercategory\": \"water\"},\n    {\"id\": 180, \"name\": \"window-blind\", \"supercategory\": \"window\"},\n    {\"id\": 181, \"name\": \"window-other\", \"supercategory\": \"window\"},\n    {\"id\": 182, \"name\": \"wood\", \"supercategory\": \"solid\"},\n]\n\n\ndef _get_coco_stuff_meta():\n    # Id 0 is reserved for ignore_label; we change ignore_label from 0\n    # to 255 in our pre-processing.\n    stuff_ids = [k[\"id\"] for k in 
COCO_CATEGORIES]\n    assert len(stuff_ids) == 171, len(stuff_ids)\n\n    # For semantic segmentation, this mapping maps from contiguous stuff id\n    # (in [0, 170], used in models) to ids in the dataset (used for processing results)\n    stuff_dataset_id_to_contiguous_id = {k: i for i, k in enumerate(stuff_ids)}\n    stuff_classes = [k[\"name\"] for k in COCO_CATEGORIES]\n\n    ret = {\n        \"stuff_dataset_id_to_contiguous_id\": stuff_dataset_id_to_contiguous_id,\n        \"stuff_classes\": stuff_classes,\n    }\n    return ret\n\n\ndef register_all_coco_stuff_10k(root):\n    root = os.path.join(root, \"coco\", \"coco_stuff_10k\")\n    meta = _get_coco_stuff_meta()\n    for name, image_dirname, sem_seg_dirname in [\n        (\"train\", \"images_detectron2/train\", \"annotations_detectron2/train\"),\n        (\"test\", \"images_detectron2/test\", \"annotations_detectron2/test\"),\n    ]:\n        image_dir = os.path.join(root, image_dirname)\n        gt_dir = os.path.join(root, sem_seg_dirname)\n        name = f\"coco_2017_{name}_stuff_10k_sem_seg\"\n        DatasetCatalog.register(\n            name, lambda x=image_dir, y=gt_dir: load_sem_seg(y, x, gt_ext=\"png\", image_ext=\"jpg\")\n        )\n        MetadataCatalog.get(name).set(\n            image_root=image_dir,\n            sem_seg_root=gt_dir,\n            evaluator_type=\"sem_seg\",\n            ignore_label=255,\n            **meta,\n        )\n\n\n_root = os.getenv(\"DETECTRON2_DATASETS\", \"datasets\")\nregister_all_coco_stuff_10k(_root)\n"
  },
  {
    "path": "mfvis_nococo/mask2former/data/datasets/register_mapillary_vistas.py",
    "content": "import os\n\nfrom detectron2.data import DatasetCatalog, MetadataCatalog\nfrom detectron2.data.datasets import load_sem_seg\n\nMAPILLARY_VISTAS_SEM_SEG_CATEGORIES = [\n    {\n        \"color\": [165, 42, 42],\n        \"instances\": True,\n        \"readable\": \"Bird\",\n        \"name\": \"animal--bird\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 192, 0],\n        \"instances\": True,\n        \"readable\": \"Ground Animal\",\n        \"name\": \"animal--ground-animal\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [196, 196, 196],\n        \"instances\": False,\n        \"readable\": \"Curb\",\n        \"name\": \"construction--barrier--curb\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [190, 153, 153],\n        \"instances\": False,\n        \"readable\": \"Fence\",\n        \"name\": \"construction--barrier--fence\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [180, 165, 180],\n        \"instances\": False,\n        \"readable\": \"Guard Rail\",\n        \"name\": \"construction--barrier--guard-rail\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [90, 120, 150],\n        \"instances\": False,\n        \"readable\": \"Barrier\",\n        \"name\": \"construction--barrier--other-barrier\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [102, 102, 156],\n        \"instances\": False,\n        \"readable\": \"Wall\",\n        \"name\": \"construction--barrier--wall\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [128, 64, 255],\n        \"instances\": False,\n        \"readable\": \"Bike Lane\",\n        \"name\": \"construction--flat--bike-lane\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [140, 140, 200],\n        \"instances\": True,\n        \"readable\": \"Crosswalk - Plain\",\n        \"name\": \"construction--flat--crosswalk-plain\",\n        \"evaluate\": True,\n    },\n    {\n     
   \"color\": [170, 170, 170],\n        \"instances\": False,\n        \"readable\": \"Curb Cut\",\n        \"name\": \"construction--flat--curb-cut\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [250, 170, 160],\n        \"instances\": False,\n        \"readable\": \"Parking\",\n        \"name\": \"construction--flat--parking\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [96, 96, 96],\n        \"instances\": False,\n        \"readable\": \"Pedestrian Area\",\n        \"name\": \"construction--flat--pedestrian-area\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [230, 150, 140],\n        \"instances\": False,\n        \"readable\": \"Rail Track\",\n        \"name\": \"construction--flat--rail-track\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [128, 64, 128],\n        \"instances\": False,\n        \"readable\": \"Road\",\n        \"name\": \"construction--flat--road\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [110, 110, 110],\n        \"instances\": False,\n        \"readable\": \"Service Lane\",\n        \"name\": \"construction--flat--service-lane\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [244, 35, 232],\n        \"instances\": False,\n        \"readable\": \"Sidewalk\",\n        \"name\": \"construction--flat--sidewalk\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [150, 100, 100],\n        \"instances\": False,\n        \"readable\": \"Bridge\",\n        \"name\": \"construction--structure--bridge\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [70, 70, 70],\n        \"instances\": False,\n        \"readable\": \"Building\",\n        \"name\": \"construction--structure--building\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [150, 120, 90],\n        \"instances\": False,\n        \"readable\": \"Tunnel\",\n        \"name\": \"construction--structure--tunnel\",\n        
\"evaluate\": True,\n    },\n    {\n        \"color\": [220, 20, 60],\n        \"instances\": True,\n        \"readable\": \"Person\",\n        \"name\": \"human--person\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [255, 0, 0],\n        \"instances\": True,\n        \"readable\": \"Bicyclist\",\n        \"name\": \"human--rider--bicyclist\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [255, 0, 100],\n        \"instances\": True,\n        \"readable\": \"Motorcyclist\",\n        \"name\": \"human--rider--motorcyclist\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [255, 0, 200],\n        \"instances\": True,\n        \"readable\": \"Other Rider\",\n        \"name\": \"human--rider--other-rider\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [200, 128, 128],\n        \"instances\": True,\n        \"readable\": \"Lane Marking - Crosswalk\",\n        \"name\": \"marking--crosswalk-zebra\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [255, 255, 255],\n        \"instances\": False,\n        \"readable\": \"Lane Marking - General\",\n        \"name\": \"marking--general\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [64, 170, 64],\n        \"instances\": False,\n        \"readable\": \"Mountain\",\n        \"name\": \"nature--mountain\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [230, 160, 50],\n        \"instances\": False,\n        \"readable\": \"Sand\",\n        \"name\": \"nature--sand\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [70, 130, 180],\n        \"instances\": False,\n        \"readable\": \"Sky\",\n        \"name\": \"nature--sky\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [190, 255, 255],\n        \"instances\": False,\n        \"readable\": \"Snow\",\n        \"name\": \"nature--snow\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [152, 251, 152],\n     
   \"instances\": False,\n        \"readable\": \"Terrain\",\n        \"name\": \"nature--terrain\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [107, 142, 35],\n        \"instances\": False,\n        \"readable\": \"Vegetation\",\n        \"name\": \"nature--vegetation\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 170, 30],\n        \"instances\": False,\n        \"readable\": \"Water\",\n        \"name\": \"nature--water\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [255, 255, 128],\n        \"instances\": True,\n        \"readable\": \"Banner\",\n        \"name\": \"object--banner\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [250, 0, 30],\n        \"instances\": True,\n        \"readable\": \"Bench\",\n        \"name\": \"object--bench\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [100, 140, 180],\n        \"instances\": True,\n        \"readable\": \"Bike Rack\",\n        \"name\": \"object--bike-rack\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [220, 220, 220],\n        \"instances\": True,\n        \"readable\": \"Billboard\",\n        \"name\": \"object--billboard\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [220, 128, 128],\n        \"instances\": True,\n        \"readable\": \"Catch Basin\",\n        \"name\": \"object--catch-basin\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [222, 40, 40],\n        \"instances\": True,\n        \"readable\": \"CCTV Camera\",\n        \"name\": \"object--cctv-camera\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [100, 170, 30],\n        \"instances\": True,\n        \"readable\": \"Fire Hydrant\",\n        \"name\": \"object--fire-hydrant\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [40, 40, 40],\n        \"instances\": True,\n        \"readable\": \"Junction Box\",\n        \"name\": 
\"object--junction-box\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [33, 33, 33],\n        \"instances\": True,\n        \"readable\": \"Mailbox\",\n        \"name\": \"object--mailbox\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [100, 128, 160],\n        \"instances\": True,\n        \"readable\": \"Manhole\",\n        \"name\": \"object--manhole\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [142, 0, 0],\n        \"instances\": True,\n        \"readable\": \"Phone Booth\",\n        \"name\": \"object--phone-booth\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [70, 100, 150],\n        \"instances\": False,\n        \"readable\": \"Pothole\",\n        \"name\": \"object--pothole\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [210, 170, 100],\n        \"instances\": True,\n        \"readable\": \"Street Light\",\n        \"name\": \"object--street-light\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [153, 153, 153],\n        \"instances\": True,\n        \"readable\": \"Pole\",\n        \"name\": \"object--support--pole\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [128, 128, 128],\n        \"instances\": True,\n        \"readable\": \"Traffic Sign Frame\",\n        \"name\": \"object--support--traffic-sign-frame\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 0, 80],\n        \"instances\": True,\n        \"readable\": \"Utility Pole\",\n        \"name\": \"object--support--utility-pole\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [250, 170, 30],\n        \"instances\": True,\n        \"readable\": \"Traffic Light\",\n        \"name\": \"object--traffic-light\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [192, 192, 192],\n        \"instances\": True,\n        \"readable\": \"Traffic Sign (Back)\",\n        \"name\": \"object--traffic-sign--back\",\n        
\"evaluate\": True,\n    },\n    {\n        \"color\": [220, 220, 0],\n        \"instances\": True,\n        \"readable\": \"Traffic Sign (Front)\",\n        \"name\": \"object--traffic-sign--front\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [140, 140, 20],\n        \"instances\": True,\n        \"readable\": \"Trash Can\",\n        \"name\": \"object--trash-can\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [119, 11, 32],\n        \"instances\": True,\n        \"readable\": \"Bicycle\",\n        \"name\": \"object--vehicle--bicycle\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [150, 0, 255],\n        \"instances\": True,\n        \"readable\": \"Boat\",\n        \"name\": \"object--vehicle--boat\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 60, 100],\n        \"instances\": True,\n        \"readable\": \"Bus\",\n        \"name\": \"object--vehicle--bus\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 0, 142],\n        \"instances\": True,\n        \"readable\": \"Car\",\n        \"name\": \"object--vehicle--car\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 0, 90],\n        \"instances\": True,\n        \"readable\": \"Caravan\",\n        \"name\": \"object--vehicle--caravan\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 0, 230],\n        \"instances\": True,\n        \"readable\": \"Motorcycle\",\n        \"name\": \"object--vehicle--motorcycle\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 80, 100],\n        \"instances\": False,\n        \"readable\": \"On Rails\",\n        \"name\": \"object--vehicle--on-rails\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [128, 64, 64],\n        \"instances\": True,\n        \"readable\": \"Other Vehicle\",\n        \"name\": \"object--vehicle--other-vehicle\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": 
[0, 0, 110],\n        \"instances\": True,\n        \"readable\": \"Trailer\",\n        \"name\": \"object--vehicle--trailer\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 0, 70],\n        \"instances\": True,\n        \"readable\": \"Truck\",\n        \"name\": \"object--vehicle--truck\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 0, 192],\n        \"instances\": True,\n        \"readable\": \"Wheeled Slow\",\n        \"name\": \"object--vehicle--wheeled-slow\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [32, 32, 32],\n        \"instances\": False,\n        \"readable\": \"Car Mount\",\n        \"name\": \"void--car-mount\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [120, 10, 10],\n        \"instances\": False,\n        \"readable\": \"Ego Vehicle\",\n        \"name\": \"void--ego-vehicle\",\n        \"evaluate\": True,\n    },\n    {\n        \"color\": [0, 0, 0],\n        \"instances\": False,\n        \"readable\": \"Unlabeled\",\n        \"name\": \"void--unlabeled\",\n        \"evaluate\": False,\n    },\n]\n\n\ndef _get_mapillary_vistas_meta():\n    stuff_classes = [k[\"readable\"] for k in MAPILLARY_VISTAS_SEM_SEG_CATEGORIES if k[\"evaluate\"]]\n    assert len(stuff_classes) == 65\n\n    stuff_colors = [k[\"color\"] for k in MAPILLARY_VISTAS_SEM_SEG_CATEGORIES if k[\"evaluate\"]]\n    assert len(stuff_colors) == 65\n\n    ret = {\n        \"stuff_classes\": stuff_classes,\n        \"stuff_colors\": stuff_colors,\n    }\n    return ret\n\n\ndef register_all_mapillary_vistas(root):\n    root = os.path.join(root, \"mapillary_vistas\")\n    meta = _get_mapillary_vistas_meta()\n    for name, dirname in [(\"train\", \"training\"), (\"val\", \"validation\")]:\n        image_dir = os.path.join(root, dirname, \"images\")\n        gt_dir = os.path.join(root, dirname, \"labels\")\n        name = f\"mapillary_vistas_sem_seg_{name}\"\n        DatasetCatalog.register(\n            
name, lambda x=image_dir, y=gt_dir: load_sem_seg(y, x, gt_ext=\"png\", image_ext=\"jpg\")\n        )\n        MetadataCatalog.get(name).set(\n            image_root=image_dir,\n            sem_seg_root=gt_dir,\n            evaluator_type=\"sem_seg\",\n            ignore_label=65,  # different from other datasets, Mapillary Vistas sets ignore_label to 65\n            **meta,\n        )\n\n\n_root = os.getenv(\"DETECTRON2_DATASETS\", \"datasets\")\nregister_all_mapillary_vistas(_root)\n"
  },
  {
    "path": "mfvis_nococo/mask2former/data/datasets/register_mapillary_vistas_panoptic.py",
    "content": "import json\nimport os\n\nfrom detectron2.data import DatasetCatalog, MetadataCatalog\nfrom detectron2.utils.file_io import PathManager\n\n\nMAPILLARY_VISTAS_SEM_SEG_CATEGORIES = [\n    {'color': [165, 42, 42],\n    'id': 1,\n    'isthing': 1,\n    'name': 'Bird',\n    'supercategory': 'animal--bird'},\n    {'color': [0, 192, 0],\n    'id': 2,\n    'isthing': 1,\n    'name': 'Ground Animal',\n    'supercategory': 'animal--ground-animal'},\n    {'color': [196, 196, 196],\n    'id': 3,\n    'isthing': 0,\n    'name': 'Curb',\n    'supercategory': 'construction--barrier--curb'},\n    {'color': [190, 153, 153],\n    'id': 4,\n    'isthing': 0,\n    'name': 'Fence',\n    'supercategory': 'construction--barrier--fence'},\n    {'color': [180, 165, 180],\n    'id': 5,\n    'isthing': 0,\n    'name': 'Guard Rail',\n    'supercategory': 'construction--barrier--guard-rail'},\n    {'color': [90, 120, 150],\n    'id': 6,\n    'isthing': 0,\n    'name': 'Barrier',\n    'supercategory': 'construction--barrier--other-barrier'},\n    {'color': [102, 102, 156],\n    'id': 7,\n    'isthing': 0,\n    'name': 'Wall',\n    'supercategory': 'construction--barrier--wall'},\n    {'color': [128, 64, 255],\n    'id': 8,\n    'isthing': 0,\n    'name': 'Bike Lane',\n    'supercategory': 'construction--flat--bike-lane'},\n    {'color': [140, 140, 200],\n    'id': 9,\n    'isthing': 1,\n    'name': 'Crosswalk - Plain',\n    'supercategory': 'construction--flat--crosswalk-plain'},\n    {'color': [170, 170, 170],\n    'id': 10,\n    'isthing': 0,\n    'name': 'Curb Cut',\n    'supercategory': 'construction--flat--curb-cut'},\n    {'color': [250, 170, 160],\n    'id': 11,\n    'isthing': 0,\n    'name': 'Parking',\n    'supercategory': 'construction--flat--parking'},\n    {'color': [96, 96, 96],\n    'id': 12,\n    'isthing': 0,\n    'name': 'Pedestrian Area',\n    'supercategory': 'construction--flat--pedestrian-area'},\n    {'color': [230, 150, 140],\n    'id': 13,\n    
'isthing': 0,\n    'name': 'Rail Track',\n    'supercategory': 'construction--flat--rail-track'},\n    {'color': [128, 64, 128],\n    'id': 14,\n    'isthing': 0,\n    'name': 'Road',\n    'supercategory': 'construction--flat--road'},\n    {'color': [110, 110, 110],\n    'id': 15,\n    'isthing': 0,\n    'name': 'Service Lane',\n    'supercategory': 'construction--flat--service-lane'},\n    {'color': [244, 35, 232],\n    'id': 16,\n    'isthing': 0,\n    'name': 'Sidewalk',\n    'supercategory': 'construction--flat--sidewalk'},\n    {'color': [150, 100, 100],\n    'id': 17,\n    'isthing': 0,\n    'name': 'Bridge',\n    'supercategory': 'construction--structure--bridge'},\n    {'color': [70, 70, 70],\n    'id': 18,\n    'isthing': 0,\n    'name': 'Building',\n    'supercategory': 'construction--structure--building'},\n    {'color': [150, 120, 90],\n    'id': 19,\n    'isthing': 0,\n    'name': 'Tunnel',\n    'supercategory': 'construction--structure--tunnel'},\n    {'color': [220, 20, 60],\n    'id': 20,\n    'isthing': 1,\n    'name': 'Person',\n    'supercategory': 'human--person'},\n    {'color': [255, 0, 0],\n    'id': 21,\n    'isthing': 1,\n    'name': 'Bicyclist',\n    'supercategory': 'human--rider--bicyclist'},\n    {'color': [255, 0, 100],\n    'id': 22,\n    'isthing': 1,\n    'name': 'Motorcyclist',\n    'supercategory': 'human--rider--motorcyclist'},\n    {'color': [255, 0, 200],\n    'id': 23,\n    'isthing': 1,\n    'name': 'Other Rider',\n    'supercategory': 'human--rider--other-rider'},\n    {'color': [200, 128, 128],\n    'id': 24,\n    'isthing': 1,\n    'name': 'Lane Marking - Crosswalk',\n    'supercategory': 'marking--crosswalk-zebra'},\n    {'color': [255, 255, 255],\n    'id': 25,\n    'isthing': 0,\n    'name': 'Lane Marking - General',\n    'supercategory': 'marking--general'},\n    {'color': [64, 170, 64],\n    'id': 26,\n    'isthing': 0,\n    'name': 'Mountain',\n    'supercategory': 'nature--mountain'},\n    {'color': [230, 160, 
50],\n    'id': 27,\n    'isthing': 0,\n    'name': 'Sand',\n    'supercategory': 'nature--sand'},\n    {'color': [70, 130, 180],\n    'id': 28,\n    'isthing': 0,\n    'name': 'Sky',\n    'supercategory': 'nature--sky'},\n    {'color': [190, 255, 255],\n    'id': 29,\n    'isthing': 0,\n    'name': 'Snow',\n    'supercategory': 'nature--snow'},\n    {'color': [152, 251, 152],\n    'id': 30,\n    'isthing': 0,\n    'name': 'Terrain',\n    'supercategory': 'nature--terrain'},\n    {'color': [107, 142, 35],\n    'id': 31,\n    'isthing': 0,\n    'name': 'Vegetation',\n    'supercategory': 'nature--vegetation'},\n    {'color': [0, 170, 30],\n    'id': 32,\n    'isthing': 0,\n    'name': 'Water',\n    'supercategory': 'nature--water'},\n    {'color': [255, 255, 128],\n    'id': 33,\n    'isthing': 1,\n    'name': 'Banner',\n    'supercategory': 'object--banner'},\n    {'color': [250, 0, 30],\n    'id': 34,\n    'isthing': 1,\n    'name': 'Bench',\n    'supercategory': 'object--bench'},\n    {'color': [100, 140, 180],\n    'id': 35,\n    'isthing': 1,\n    'name': 'Bike Rack',\n    'supercategory': 'object--bike-rack'},\n    {'color': [220, 220, 220],\n    'id': 36,\n    'isthing': 1,\n    'name': 'Billboard',\n    'supercategory': 'object--billboard'},\n    {'color': [220, 128, 128],\n    'id': 37,\n    'isthing': 1,\n    'name': 'Catch Basin',\n    'supercategory': 'object--catch-basin'},\n    {'color': [222, 40, 40],\n    'id': 38,\n    'isthing': 1,\n    'name': 'CCTV Camera',\n    'supercategory': 'object--cctv-camera'},\n    {'color': [100, 170, 30],\n    'id': 39,\n    'isthing': 1,\n    'name': 'Fire Hydrant',\n    'supercategory': 'object--fire-hydrant'},\n    {'color': [40, 40, 40],\n    'id': 40,\n    'isthing': 1,\n    'name': 'Junction Box',\n    'supercategory': 'object--junction-box'},\n    {'color': [33, 33, 33],\n    'id': 41,\n    'isthing': 1,\n    'name': 'Mailbox',\n    'supercategory': 'object--mailbox'},\n    {'color': [100, 128, 160],\n    'id': 
42,\n    'isthing': 1,\n    'name': 'Manhole',\n    'supercategory': 'object--manhole'},\n    {'color': [142, 0, 0],\n    'id': 43,\n    'isthing': 1,\n    'name': 'Phone Booth',\n    'supercategory': 'object--phone-booth'},\n    {'color': [70, 100, 150],\n    'id': 44,\n    'isthing': 0,\n    'name': 'Pothole',\n    'supercategory': 'object--pothole'},\n    {'color': [210, 170, 100],\n    'id': 45,\n    'isthing': 1,\n    'name': 'Street Light',\n    'supercategory': 'object--street-light'},\n    {'color': [153, 153, 153],\n    'id': 46,\n    'isthing': 1,\n    'name': 'Pole',\n    'supercategory': 'object--support--pole'},\n    {'color': [128, 128, 128],\n    'id': 47,\n    'isthing': 1,\n    'name': 'Traffic Sign Frame',\n    'supercategory': 'object--support--traffic-sign-frame'},\n    {'color': [0, 0, 80],\n    'id': 48,\n    'isthing': 1,\n    'name': 'Utility Pole',\n    'supercategory': 'object--support--utility-pole'},\n    {'color': [250, 170, 30],\n    'id': 49,\n    'isthing': 1,\n    'name': 'Traffic Light',\n    'supercategory': 'object--traffic-light'},\n    {'color': [192, 192, 192],\n    'id': 50,\n    'isthing': 1,\n    'name': 'Traffic Sign (Back)',\n    'supercategory': 'object--traffic-sign--back'},\n    {'color': [220, 220, 0],\n    'id': 51,\n    'isthing': 1,\n    'name': 'Traffic Sign (Front)',\n    'supercategory': 'object--traffic-sign--front'},\n    {'color': [140, 140, 20],\n    'id': 52,\n    'isthing': 1,\n    'name': 'Trash Can',\n    'supercategory': 'object--trash-can'},\n    {'color': [119, 11, 32],\n    'id': 53,\n    'isthing': 1,\n    'name': 'Bicycle',\n    'supercategory': 'object--vehicle--bicycle'},\n    {'color': [150, 0, 255],\n    'id': 54,\n    'isthing': 1,\n    'name': 'Boat',\n    'supercategory': 'object--vehicle--boat'},\n    {'color': [0, 60, 100],\n    'id': 55,\n    'isthing': 1,\n    'name': 'Bus',\n    'supercategory': 'object--vehicle--bus'},\n    {'color': [0, 0, 142],\n    'id': 56,\n    'isthing': 1,\n    
'name': 'Car',\n    'supercategory': 'object--vehicle--car'},\n    {'color': [0, 0, 90],\n    'id': 57,\n    'isthing': 1,\n    'name': 'Caravan',\n    'supercategory': 'object--vehicle--caravan'},\n    {'color': [0, 0, 230],\n    'id': 58,\n    'isthing': 1,\n    'name': 'Motorcycle',\n    'supercategory': 'object--vehicle--motorcycle'},\n    {'color': [0, 80, 100],\n    'id': 59,\n    'isthing': 0,\n    'name': 'On Rails',\n    'supercategory': 'object--vehicle--on-rails'},\n    {'color': [128, 64, 64],\n    'id': 60,\n    'isthing': 1,\n    'name': 'Other Vehicle',\n    'supercategory': 'object--vehicle--other-vehicle'},\n    {'color': [0, 0, 110],\n    'id': 61,\n    'isthing': 1,\n    'name': 'Trailer',\n    'supercategory': 'object--vehicle--trailer'},\n    {'color': [0, 0, 70],\n    'id': 62,\n    'isthing': 1,\n    'name': 'Truck',\n    'supercategory': 'object--vehicle--truck'},\n    {'color': [0, 0, 192],\n    'id': 63,\n    'isthing': 1,\n    'name': 'Wheeled Slow',\n    'supercategory': 'object--vehicle--wheeled-slow'},\n    {'color': [32, 32, 32],\n    'id': 64,\n    'isthing': 0,\n    'name': 'Car Mount',\n    'supercategory': 'void--car-mount'},\n    {'color': [120, 10, 10],\n    'id': 65,\n    'isthing': 0,\n    'name': 'Ego Vehicle',\n    'supercategory': 'void--ego-vehicle'}\n]\n\n\ndef load_mapillary_vistas_panoptic_json(json_file, image_dir, gt_dir, semseg_dir, meta):\n    \"\"\"\n    Args:\n        image_dir (str): path to the raw dataset. e.g., \"~/coco/train2017\".\n        gt_dir (str): path to the raw annotations. e.g., \"~/coco/panoptic_train2017\".\n        json_file (str): path to the json file. e.g., \"~/coco/annotations/panoptic_train2017.json\".\n    Returns:\n        list[dict]: a list of dicts in Detectron2 standard format. 
(See\n        `Using Custom Datasets </tutorials/datasets.html>`_ )\n    \"\"\"\n\n    def _convert_category_id(segment_info, meta):\n        if segment_info[\"category_id\"] in meta[\"thing_dataset_id_to_contiguous_id\"]:\n            segment_info[\"category_id\"] = meta[\"thing_dataset_id_to_contiguous_id\"][\n                segment_info[\"category_id\"]\n            ]\n            segment_info[\"isthing\"] = True\n        else:\n            segment_info[\"category_id\"] = meta[\"stuff_dataset_id_to_contiguous_id\"][\n                segment_info[\"category_id\"]\n            ]\n            segment_info[\"isthing\"] = False\n        return segment_info\n\n    with PathManager.open(json_file) as f:\n        json_info = json.load(f)\n\n    ret = []\n    for ann in json_info[\"annotations\"]:\n        image_id = ann[\"image_id\"]\n        # TODO: currently we assume image and label has the same filename but\n        # different extension, and images have extension \".jpg\" for COCO. Need\n        # to make image extension a user-provided argument if we extend this\n        # function to support other COCO-like datasets.\n        image_file = os.path.join(image_dir, os.path.splitext(ann[\"file_name\"])[0] + \".jpg\")\n        label_file = os.path.join(gt_dir, ann[\"file_name\"])\n        sem_label_file = os.path.join(semseg_dir, ann[\"file_name\"])\n        segments_info = [_convert_category_id(x, meta) for x in ann[\"segments_info\"]]\n        ret.append(\n            {\n                \"file_name\": image_file,\n                \"image_id\": image_id,\n                \"pan_seg_file_name\": label_file,\n                \"sem_seg_file_name\": sem_label_file,\n                \"segments_info\": segments_info,\n            }\n        )\n    assert len(ret), f\"No images found in {image_dir}!\"\n    assert PathManager.isfile(ret[0][\"file_name\"]), ret[0][\"file_name\"]\n    assert PathManager.isfile(ret[0][\"pan_seg_file_name\"]), ret[0][\"pan_seg_file_name\"]\n    
assert PathManager.isfile(ret[0][\"sem_seg_file_name\"]), ret[0][\"sem_seg_file_name\"]\n    return ret\n\n\ndef register_mapillary_vistas_panoptic(\n    name, metadata, image_root, panoptic_root, semantic_root, panoptic_json, instances_json=None\n):\n    \"\"\"\n    Register a \"standard\" version of the Mapillary Vistas panoptic segmentation dataset named `name`.\n    The dictionaries in this registered dataset follow detectron2's standard format.\n    Hence it's called \"standard\".\n    Args:\n        name (str): the name that identifies a dataset,\n            e.g. \"mapillary_vistas_panoptic_train\"\n        metadata (dict): extra metadata associated with this dataset.\n        image_root (str): directory which contains all the images\n        panoptic_root (str): directory which contains panoptic annotation images in COCO format\n        semantic_root (str): directory which contains the semantic annotation images\n        panoptic_json (str): path to the json panoptic annotation file in COCO format\n        instances_json (str): path to the json instance annotation file\n    \"\"\"\n    panoptic_name = name\n    DatasetCatalog.register(\n        panoptic_name,\n        lambda: load_mapillary_vistas_panoptic_json(\n            panoptic_json, image_root, panoptic_root, semantic_root, metadata\n        ),\n    )\n    MetadataCatalog.get(panoptic_name).set(\n        panoptic_root=panoptic_root,\n        image_root=image_root,\n        panoptic_json=panoptic_json,\n        json_file=instances_json,\n        evaluator_type=\"mapillary_vistas_panoptic_seg\",\n        ignore_label=65,  # different from other datasets, Mapillary Vistas sets ignore_label to 65\n        label_divisor=1000,\n        **metadata,\n    )\n\n\n_PREDEFINED_SPLITS_MAPILLARY_VISTAS_PANOPTIC = {\n    \"mapillary_vistas_panoptic_train\": (\n        \"mapillary_vistas/training/images\",\n        \"mapillary_vistas/training/panoptic\",\n        \"mapillary_vistas/training/panoptic/panoptic_2018.json\",\n
        \"mapillary_vistas/training/labels\",\n    ),\n    \"mapillary_vistas_panoptic_val\": (\n        \"mapillary_vistas/validation/images\",\n        \"mapillary_vistas/validation/panoptic\",\n        \"mapillary_vistas/validation/panoptic/panoptic_2018.json\",\n        \"mapillary_vistas/validation/labels\",\n    ),\n}\n\n\ndef get_metadata():\n    meta = {}\n    # The following metadata maps contiguous id from [0, #thing categories +\n    # #stuff categories) to their names and colors. We keep replicas of the\n    # same names and colors under \"thing_*\" and \"stuff_*\" because the current\n    # visualization function in D2 handles thing and stuff classes differently\n    # due to some heuristic used in Panoptic FPN. We keep the same naming to\n    # enable reusing existing visualization functions.\n    thing_classes = [k[\"name\"] for k in MAPILLARY_VISTAS_SEM_SEG_CATEGORIES]\n    thing_colors = [k[\"color\"] for k in MAPILLARY_VISTAS_SEM_SEG_CATEGORIES]\n    stuff_classes = [k[\"name\"] for k in MAPILLARY_VISTAS_SEM_SEG_CATEGORIES]\n    stuff_colors = [k[\"color\"] for k in MAPILLARY_VISTAS_SEM_SEG_CATEGORIES]\n\n    meta[\"thing_classes\"] = thing_classes\n    meta[\"thing_colors\"] = thing_colors\n    meta[\"stuff_classes\"] = stuff_classes\n    meta[\"stuff_colors\"] = stuff_colors\n\n    # Convert category id for training:\n    #   category id: like semantic segmentation, it is the class id for each\n    #   pixel. Since there are some classes not used in evaluation, the category\n
    #   id is not always contiguous and thus we have two sets of category ids:\n    #       - original category id: category id in the original dataset, mainly\n    #           used for evaluation.\n    #       - contiguous category id: [0, #classes), in order to train the linear\n    #           softmax classifier.\n    thing_dataset_id_to_contiguous_id = {}\n    stuff_dataset_id_to_contiguous_id = {}\n\n    for i, cat in enumerate(MAPILLARY_VISTAS_SEM_SEG_CATEGORIES):\n        if cat[\"isthing\"]:\n            thing_dataset_id_to_contiguous_id[cat[\"id\"]] = i\n        # else:\n        #     stuff_dataset_id_to_contiguous_id[cat[\"id\"]] = i\n\n        # in order to use sem_seg evaluator\n        stuff_dataset_id_to_contiguous_id[cat[\"id\"]] = i\n\n    meta[\"thing_dataset_id_to_contiguous_id\"] = thing_dataset_id_to_contiguous_id\n    meta[\"stuff_dataset_id_to_contiguous_id\"] = stuff_dataset_id_to_contiguous_id\n\n    return meta\n\n\ndef register_all_mapillary_vistas_panoptic(root):\n    metadata = get_metadata()\n    for (\n        prefix,\n        (image_root, panoptic_root, panoptic_json, semantic_root),\n    ) in _PREDEFINED_SPLITS_MAPILLARY_VISTAS_PANOPTIC.items():\n        # The \"standard\" version of the Mapillary Vistas panoptic segmentation dataset,\n        # e.g. used by Panoptic-DeepLab\n        register_mapillary_vistas_panoptic(\n            prefix,\n            metadata,\n            os.path.join(root, image_root),\n            os.path.join(root, panoptic_root),\n            os.path.join(root, semantic_root),\n            os.path.join(root, panoptic_json),\n        )\n\n\n_root = os.getenv(\"DETECTRON2_DATASETS\", \"datasets\")\nregister_all_mapillary_vistas_panoptic(_root)\n"
  },
  {
    "path": "mfvis_nococo/mask2former/evaluation/__init__.py",
    "content": ""
  },
  {
    "path": "mfvis_nococo/mask2former/evaluation/__init__.py.new",
    "content": ""
  },
  {
    "path": "mfvis_nococo/mask2former/evaluation/instance_evaluation.py",
    "content": "import contextlib\nimport copy\nimport io\nimport itertools\nimport json\nimport logging\nimport numpy as np\nimport os\nimport pickle\nfrom collections import OrderedDict\nimport pycocotools.mask as mask_util\nimport torch\nfrom pycocotools.coco import COCO\nfrom pycocotools.cocoeval import COCOeval\nfrom tabulate import tabulate\n\nimport detectron2.utils.comm as comm\nfrom detectron2.config import CfgNode\nfrom detectron2.data import MetadataCatalog\nfrom detectron2.data.datasets.coco import convert_to_coco_json\nfrom detectron2.evaluation.coco_evaluation import COCOEvaluator, _evaluate_predictions_on_coco\nfrom detectron2.evaluation.fast_eval_api import COCOeval_opt\nfrom detectron2.structures import Boxes, BoxMode, pairwise_iou\nfrom detectron2.utils.file_io import PathManager\nfrom detectron2.utils.logger import create_small_table\n\n\n# modified from COCOEvaluator for instance segmetnat\nclass InstanceSegEvaluator(COCOEvaluator):\n    \"\"\"\n    Evaluate AR for object proposals, AP for instance detection/segmentation, AP\n    for keypoint detection outputs using COCO's metrics.\n    See http://cocodataset.org/#detection-eval and\n    http://cocodataset.org/#keypoints-eval to understand its metrics.\n    The metrics range from 0 to 100 (instead of 0 to 1), where a -1 or NaN means\n    the metric cannot be computed (e.g. due to no predictions made).\n\n    In addition to COCO, this evaluator is able to support any bounding box detection,\n    instance segmentation, or keypoint detection dataset.\n    \"\"\"\n\n    def _eval_predictions(self, predictions, img_ids=None):\n        \"\"\"\n        Evaluate predictions. 
Fill self._results with the metrics of the tasks.\n        \"\"\"\n        self._logger.info(\"Preparing results for COCO format ...\")\n        coco_results = list(itertools.chain(*[x[\"instances\"] for x in predictions]))\n        tasks = self._tasks or self._tasks_from_predictions(coco_results)\n\n        # unmap the category ids for COCO\n        if hasattr(self._metadata, \"thing_dataset_id_to_contiguous_id\"):\n            dataset_id_to_contiguous_id = self._metadata.thing_dataset_id_to_contiguous_id\n            # all_contiguous_ids = list(dataset_id_to_contiguous_id.values())\n            # num_classes = len(all_contiguous_ids)\n            # assert min(all_contiguous_ids) == 0 and max(all_contiguous_ids) == num_classes - 1\n\n            reverse_id_mapping = {v: k for k, v in dataset_id_to_contiguous_id.items()}\n            for result in coco_results:\n                category_id = result[\"category_id\"]\n                # assert category_id < num_classes, (\n                #     f\"A prediction has class={category_id}, \"\n                #     f\"but the dataset only has {num_classes} classes and \"\n                #     f\"predicted class id should be in [0, {num_classes - 1}].\"\n                # )\n                assert category_id in reverse_id_mapping, (\n                    f\"A prediction has class={category_id}, \"\n                    f\"but the dataset only has class ids in {dataset_id_to_contiguous_id}.\"\n                )\n                result[\"category_id\"] = reverse_id_mapping[category_id]\n\n        if self._output_dir:\n            file_path = os.path.join(self._output_dir, \"coco_instances_results.json\")\n            self._logger.info(\"Saving results to {}\".format(file_path))\n            with PathManager.open(file_path, \"w\") as f:\n                f.write(json.dumps(coco_results))\n                f.flush()\n\n        if not self._do_evaluation:\n            self._logger.info(\"Annotations are not available for 
evaluation.\")\n            return\n\n        self._logger.info(\n            \"Evaluating predictions with {} COCO API...\".format(\n                \"unofficial\" if self._use_fast_impl else \"official\"\n            )\n        )\n        for task in sorted(tasks):\n            assert task in {\"bbox\", \"segm\", \"keypoints\"}, f\"Got unknown task: {task}!\"\n            coco_eval = (\n                _evaluate_predictions_on_coco(\n                    self._coco_api,\n                    coco_results,\n                    task,\n                    kpt_oks_sigmas=self._kpt_oks_sigmas,\n                    use_fast_impl=self._use_fast_impl,\n                    img_ids=img_ids,\n                    max_dets_per_image=self._max_dets_per_image,\n                )\n                if len(coco_results) > 0\n                else None  # cocoapi does not handle empty results very well\n            )\n\n            res = self._derive_coco_results(\n                coco_eval, task, class_names=self._metadata.get(\"thing_classes\")\n            )\n            self._results[task] = res\n"
  },
  {
    "path": "mfvis_nococo/mask2former/maskformer_model.py",
    "content": "from typing import Tuple\n\nimport torch\nfrom torch import nn\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.data import MetadataCatalog\nfrom detectron2.modeling import META_ARCH_REGISTRY, build_backbone, build_sem_seg_head\nfrom detectron2.modeling.backbone import Backbone\nfrom detectron2.modeling.postprocessing import sem_seg_postprocess\nfrom detectron2.structures import Boxes, ImageList, Instances, BitMasks\nfrom detectron2.utils.memory import retry_if_cuda_oom\n\nfrom .modeling.criterion import SetCriterion\nfrom .modeling.matcher import HungarianMatcher\n\nfrom skimage import color\nimport cv2\nimport numpy as np\n\ndef unfold_wo_center(x, kernel_size, dilation):\n    assert x.dim() == 4\n    assert kernel_size % 2 == 1\n\n    # using SAME padding\n    padding = (kernel_size + (dilation - 1) * (kernel_size - 1)) // 2\n    unfolded_x = F.unfold(\n        x, kernel_size=kernel_size,\n        padding=padding,\n        dilation=dilation\n    )\n\n    unfolded_x = unfolded_x.reshape(\n        x.size(0), x.size(1), -1, x.size(2), x.size(3)\n    )\n\n    # remove the center pixels\n    size = kernel_size ** 2\n    unfolded_x = torch.cat((\n        unfolded_x[:, :, :size // 2],\n        unfolded_x[:, :, size // 2 + 1:]\n    ), dim=2)\n\n    return unfolded_x\n\ndef get_images_color_similarity(images, kernel_size, dilation):\n    assert images.dim() == 4\n    assert images.size(0) == 1\n\n    unfolded_images = unfold_wo_center(\n        images, kernel_size=kernel_size, dilation=dilation\n    )\n\n    diff = images[:, :, None] - unfolded_images\n    similarity = torch.exp(-torch.norm(diff, dim=1) * 0.5)\n\n    return similarity\n\n\n@META_ARCH_REGISTRY.register()\nclass MaskFormer(nn.Module):\n    \"\"\"\n    Main class for mask classification semantic segmentation architectures.\n    \"\"\"\n\n    @configurable\n    def __init__(\n        self,\n        *,\n        backbone: Backbone,\n        
sem_seg_head: nn.Module,\n        criterion: nn.Module,\n        num_queries: int,\n        object_mask_threshold: float,\n        overlap_threshold: float,\n        metadata,\n        size_divisibility: int,\n        sem_seg_postprocess_before_inference: bool,\n        pixel_mean: Tuple[float],\n        pixel_std: Tuple[float],\n        # inference\n        semantic_on: bool,\n        panoptic_on: bool,\n        instance_on: bool,\n        test_topk_per_image: int,\n    ):\n        \"\"\"\n        Args:\n            backbone: a backbone module, must follow detectron2's backbone interface\n            sem_seg_head: a module that predicts semantic segmentation from backbone features\n            criterion: a module that defines the loss\n            num_queries: int, number of queries\n            object_mask_threshold: float, threshold to filter query based on classification score\n                for panoptic segmentation inference\n            overlap_threshold: overlap threshold used in general inference for panoptic segmentation\n            metadata: dataset meta, get `thing` and `stuff` category names for panoptic\n                segmentation inference\n            size_divisibility: Some backbones require the input height and width to be divisible by a\n                specific integer. 
We can use this to override such requirement.\n            sem_seg_postprocess_before_inference: whether to resize the prediction back\n                to original input size before semantic segmentation inference or after.\n                For high-resolution dataset like Mapillary, resizing predictions before\n                inference will cause OOM error.\n            pixel_mean, pixel_std: list or tuple with #channels element, representing\n                the per-channel mean and std to be used to normalize the input image\n            semantic_on: bool, whether to output semantic segmentation prediction\n            instance_on: bool, whether to output instance segmentation prediction\n            panoptic_on: bool, whether to output panoptic segmentation prediction\n            test_topk_per_image: int, instance segmentation parameter, keep topk instances per image\n        \"\"\"\n        super().__init__()\n        self.backbone = backbone\n        self.sem_seg_head = sem_seg_head\n        self.criterion = criterion\n        self.num_queries = num_queries\n        self.overlap_threshold = overlap_threshold\n        self.object_mask_threshold = object_mask_threshold\n        self.metadata = metadata\n        if size_divisibility < 0:\n            # use backbone size_divisibility if not set\n            size_divisibility = self.backbone.size_divisibility\n        self.size_divisibility = size_divisibility\n        self.sem_seg_postprocess_before_inference = sem_seg_postprocess_before_inference\n        self.register_buffer(\"pixel_mean\", torch.Tensor(pixel_mean).view(-1, 1, 1), False)\n        self.register_buffer(\"pixel_std\", torch.Tensor(pixel_std).view(-1, 1, 1), False)\n\n        # additional args\n        self.semantic_on = semantic_on\n        self.instance_on = instance_on\n        self.panoptic_on = panoptic_on\n        self.test_topk_per_image = test_topk_per_image\n\n        if not self.semantic_on:\n            assert 
self.sem_seg_postprocess_before_inference\n\n    @classmethod\n    def from_config(cls, cfg):\n        backbone = build_backbone(cfg)\n        sem_seg_head = build_sem_seg_head(cfg, backbone.output_shape())\n\n        # Loss parameters:\n        deep_supervision = cfg.MODEL.MASK_FORMER.DEEP_SUPERVISION\n        no_object_weight = cfg.MODEL.MASK_FORMER.NO_OBJECT_WEIGHT\n\n        # loss weights\n        class_weight = cfg.MODEL.MASK_FORMER.CLASS_WEIGHT\n        dice_weight = cfg.MODEL.MASK_FORMER.DICE_WEIGHT\n        mask_weight = cfg.MODEL.MASK_FORMER.MASK_WEIGHT\n\n        # building criterion\n        matcher = HungarianMatcher(\n            cost_class=class_weight,\n            cost_mask=mask_weight,\n            cost_dice=dice_weight,\n            num_points=cfg.MODEL.MASK_FORMER.TRAIN_NUM_POINTS,\n        )\n\n        weight_dict = {\"loss_ce\": class_weight, \"loss_mask\": mask_weight, \"loss_dice\": dice_weight, \"loss_bound\": mask_weight}\n\n        if deep_supervision:\n            dec_layers = cfg.MODEL.MASK_FORMER.DEC_LAYERS\n            aux_weight_dict = {}\n            for i in range(dec_layers - 1):\n                aux_weight_dict.update({k + f\"_{i}\": v for k, v in weight_dict.items()})\n            weight_dict.update(aux_weight_dict)\n\n        losses = [\"labels\", \"masks\"]\n\n        criterion = SetCriterion(\n            sem_seg_head.num_classes,\n            matcher=matcher,\n            weight_dict=weight_dict,\n            eos_coef=no_object_weight,\n            losses=losses,\n            num_points=cfg.MODEL.MASK_FORMER.TRAIN_NUM_POINTS,\n            oversample_ratio=cfg.MODEL.MASK_FORMER.OVERSAMPLE_RATIO,\n            importance_sample_ratio=cfg.MODEL.MASK_FORMER.IMPORTANCE_SAMPLE_RATIO,\n        )\n\n        return {\n            \"backbone\": backbone,\n            \"sem_seg_head\": sem_seg_head,\n            \"criterion\": criterion,\n            \"num_queries\": cfg.MODEL.MASK_FORMER.NUM_OBJECT_QUERIES,\n            
\"object_mask_threshold\": cfg.MODEL.MASK_FORMER.TEST.OBJECT_MASK_THRESHOLD,\n            \"overlap_threshold\": cfg.MODEL.MASK_FORMER.TEST.OVERLAP_THRESHOLD,\n            \"metadata\": MetadataCatalog.get(cfg.DATASETS.TRAIN[0]),\n            \"size_divisibility\": cfg.MODEL.MASK_FORMER.SIZE_DIVISIBILITY,\n            \"sem_seg_postprocess_before_inference\": (\n                cfg.MODEL.MASK_FORMER.TEST.SEM_SEG_POSTPROCESSING_BEFORE_INFERENCE\n                or cfg.MODEL.MASK_FORMER.TEST.PANOPTIC_ON\n                or cfg.MODEL.MASK_FORMER.TEST.INSTANCE_ON\n            ),\n            \"pixel_mean\": cfg.MODEL.PIXEL_MEAN,\n            \"pixel_std\": cfg.MODEL.PIXEL_STD,\n            # inference\n            \"semantic_on\": cfg.MODEL.MASK_FORMER.TEST.SEMANTIC_ON,\n            \"instance_on\": cfg.MODEL.MASK_FORMER.TEST.INSTANCE_ON,\n            \"panoptic_on\": cfg.MODEL.MASK_FORMER.TEST.PANOPTIC_ON,\n            \"test_topk_per_image\": cfg.TEST.DETECTIONS_PER_IMAGE,\n        }\n\n    @property\n    def device(self):\n        return self.pixel_mean.device\n\n    def forward(self, batched_inputs):\n        \"\"\"\n        Args:\n            batched_inputs: a list, batched outputs of :class:`DatasetMapper`.\n                Each item in the list contains the inputs for one image.\n                For now, each item in the list is a dict that contains:\n                   * \"image\": Tensor, image in (C, H, W) format.\n                   * \"instances\": per-region ground truth\n                   * Other information that's included in the original dicts, such as:\n                     \"height\", \"width\" (int): the output resolution of the model (may be different\n                     from input resolution), used in inference.\n        Returns:\n            list[dict]:\n                each dict has the results for one image. 
The dict contains the following keys:\n\n                * \"sem_seg\":\n                    A Tensor that represents the\n                    per-pixel segmentation predicted by the head.\n                    The prediction has shape KxHxW that represents the logits of\n                    each class for each pixel.\n                * \"panoptic_seg\":\n                    A tuple that represents the panoptic output\n                    panoptic_seg (Tensor): of shape (height, width) where the values are ids for each segment.\n                    segments_info (list[dict]): Describe each segment in `panoptic_seg`.\n                        Each dict contains keys \"id\", \"category_id\", \"isthing\".\n        \"\"\"\n        images = [x[\"image\"].to(self.device) for x in batched_inputs]\n        \n        if self.training:\n            rs_images = ImageList.from_tensors(images, self.size_divisibility)\n            image_masks = [~x[\"padding_mask\"].to(self.device) for x in batched_inputs]\n            image_masks_back = [x[\"padding_mask\"].to(self.device) for x in batched_inputs]\n            image_masks_bool = [((m.sum() / (m.shape[0] * m.shape[1])) > 0.25).float() * ((m_b.sum() / (m.shape[0] * m.shape[1])) > 0.25).float() for m, m_b in zip(image_masks, image_masks_back)]\n            downsampled_images = F.avg_pool2d(rs_images.tensor.float(), kernel_size=4, stride=4, padding=0)\n            images_lab = [torch.as_tensor(color.rgb2lab(ds_image[[2, 1, 0]].byte().permute(1, 2, 0).cpu().numpy()), device=ds_image.device, dtype=torch.float32).permute(2, 0, 1) for ds_image in downsampled_images]\n            images_lab_sim = [get_images_color_similarity(img_lab.unsqueeze(0), 3, 2) * float(img_m_bool) for img_lab, img_m_bool in zip(images_lab, image_masks_bool)]\n\n        images = [(x - self.pixel_mean) / self.pixel_std for x in images]\n        images = ImageList.from_tensors(images, self.size_divisibility)\n\n        features = 
self.backbone(images.tensor)\n        outputs = self.sem_seg_head(features)\n\n        if self.training:\n            # mask classification target\n            if \"instances\" in batched_inputs[0]:\n                gt_instances = [x[\"instances\"].to(self.device) for x in batched_inputs]\n                targets = self.prepare_targets(gt_instances, images)\n            else:\n                targets = None\n\n            # bipartite matching-based loss\n            losses = self.criterion(outputs, targets, images_lab_sim)\n\n            for k in list(losses.keys()):\n                if k in self.criterion.weight_dict:\n                    losses[k] *= self.criterion.weight_dict[k]\n                else:\n                    # remove this loss if not specified in `weight_dict`\n                    losses.pop(k)\n            return losses\n        else:\n            mask_cls_results = outputs[\"pred_logits\"]\n            mask_pred_results = outputs[\"pred_masks\"]\n            # upsample masks\n            mask_pred_results = F.interpolate(\n                mask_pred_results,\n                size=(images.tensor.shape[-2], images.tensor.shape[-1]),\n                mode=\"bilinear\",\n                align_corners=False,\n            )\n\n            del outputs\n\n            processed_results = []\n            for mask_cls_result, mask_pred_result, input_per_image, image_size in zip(\n                mask_cls_results, mask_pred_results, batched_inputs, images.image_sizes\n            ):\n                height = input_per_image.get(\"height\", image_size[0])\n                width = input_per_image.get(\"width\", image_size[1])\n                processed_results.append({})\n\n                if self.sem_seg_postprocess_before_inference:\n                    mask_pred_result = retry_if_cuda_oom(sem_seg_postprocess)(\n                        mask_pred_result, image_size, height, width\n                    )\n                    mask_cls_result = 
mask_cls_result.to(mask_pred_result)\n\n                # semantic segmentation inference\n                if self.semantic_on:\n                    r = retry_if_cuda_oom(self.semantic_inference)(mask_cls_result, mask_pred_result)\n                    if not self.sem_seg_postprocess_before_inference:\n                        r = retry_if_cuda_oom(sem_seg_postprocess)(r, image_size, height, width)\n                    processed_results[-1][\"sem_seg\"] = r\n\n                # panoptic segmentation inference\n                if self.panoptic_on:\n                    panoptic_r = retry_if_cuda_oom(self.panoptic_inference)(mask_cls_result, mask_pred_result)\n                    processed_results[-1][\"panoptic_seg\"] = panoptic_r\n                \n                # instance segmentation inference\n                if self.instance_on:\n                    instance_r = retry_if_cuda_oom(self.instance_inference)(mask_cls_result, mask_pred_result)\n                    processed_results[-1][\"instances\"] = instance_r\n\n            return processed_results\n\n    def prepare_targets(self, targets, images):\n        h_pad, w_pad = images.tensor.shape[-2:]\n        new_targets = []\n        for targets_per_image in targets:\n            # pad gt\n            gt_masks = targets_per_image.gt_masks\n            padded_masks = torch.zeros((gt_masks.shape[0], h_pad, w_pad), dtype=gt_masks.dtype, device=gt_masks.device)\n            padded_masks[:, : gt_masks.shape[1], : gt_masks.shape[2]] = gt_masks\n            new_targets.append(\n                {\n                    \"labels\": targets_per_image.gt_classes,\n                    \"masks\": padded_masks,\n                }\n            )\n        return new_targets\n\n    def semantic_inference(self, mask_cls, mask_pred):\n        mask_cls = F.softmax(mask_cls, dim=-1)[..., :-1]\n        mask_pred = mask_pred.sigmoid()\n        semseg = torch.einsum(\"qc,qhw->chw\", mask_cls, mask_pred)\n        return semseg\n\n    def 
panoptic_inference(self, mask_cls, mask_pred):\n        scores, labels = F.softmax(mask_cls, dim=-1).max(-1)\n        mask_pred = mask_pred.sigmoid()\n\n        keep = labels.ne(self.sem_seg_head.num_classes) & (scores > self.object_mask_threshold)\n        cur_scores = scores[keep]\n        cur_classes = labels[keep]\n        cur_masks = mask_pred[keep]\n        cur_mask_cls = mask_cls[keep]\n        cur_mask_cls = cur_mask_cls[:, :-1]\n\n        cur_prob_masks = cur_scores.view(-1, 1, 1) * cur_masks\n\n        h, w = cur_masks.shape[-2:]\n        panoptic_seg = torch.zeros((h, w), dtype=torch.int32, device=cur_masks.device)\n        segments_info = []\n\n        current_segment_id = 0\n\n        if cur_masks.shape[0] == 0:\n            # We didn't detect any mask :(\n            return panoptic_seg, segments_info\n        else:\n            # take argmax\n            cur_mask_ids = cur_prob_masks.argmax(0)\n            stuff_memory_list = {}\n            for k in range(cur_classes.shape[0]):\n                pred_class = cur_classes[k].item()\n                isthing = pred_class in self.metadata.thing_dataset_id_to_contiguous_id.values()\n                mask_area = (cur_mask_ids == k).sum().item()\n                original_area = (cur_masks[k] >= 0.5).sum().item()\n                mask = (cur_mask_ids == k) & (cur_masks[k] >= 0.5)\n\n                if mask_area > 0 and original_area > 0 and mask.sum().item() > 0:\n                    if mask_area / original_area < self.overlap_threshold:\n                        continue\n\n                    # merge stuff regions\n                    if not isthing:\n                        if int(pred_class) in stuff_memory_list.keys():\n                            panoptic_seg[mask] = stuff_memory_list[int(pred_class)]\n                            continue\n                        else:\n                            stuff_memory_list[int(pred_class)] = current_segment_id + 1\n\n                    current_segment_id += 1\n  
                  panoptic_seg[mask] = current_segment_id\n\n                    segments_info.append(\n                        {\n                            \"id\": current_segment_id,\n                            \"isthing\": bool(isthing),\n                            \"category_id\": int(pred_class),\n                        }\n                    )\n\n            return panoptic_seg, segments_info\n\n    def instance_inference(self, mask_cls, mask_pred):\n        # mask_pred is already processed to have the same shape as original input\n        image_size = mask_pred.shape[-2:]\n\n        # [Q, K]\n        scores = F.softmax(mask_cls, dim=-1)[:, :-1]\n        labels = torch.arange(self.sem_seg_head.num_classes, device=self.device).unsqueeze(0).repeat(self.num_queries, 1).flatten(0, 1)\n        # scores_per_image, topk_indices = scores.flatten(0, 1).topk(self.num_queries, sorted=False)\n        scores_per_image, topk_indices = scores.flatten(0, 1).topk(self.test_topk_per_image, sorted=False)\n        labels_per_image = labels[topk_indices]\n\n        topk_indices = topk_indices // self.sem_seg_head.num_classes\n        # mask_pred = mask_pred.unsqueeze(1).repeat(1, self.sem_seg_head.num_classes, 1).flatten(0, 1)\n        mask_pred = mask_pred[topk_indices]\n\n        # if this is panoptic segmentation, we only keep the \"thing\" classes\n        if self.panoptic_on:\n            keep = torch.zeros_like(scores_per_image).bool()\n            for i, lab in enumerate(labels_per_image):\n                keep[i] = lab in self.metadata.thing_dataset_id_to_contiguous_id.values()\n\n            scores_per_image = scores_per_image[keep]\n            labels_per_image = labels_per_image[keep]\n            mask_pred = mask_pred[keep]\n\n        result = Instances(image_size)\n        result.pred_masks = (mask_pred > 0).float()\n        result.pred_boxes = BitMasks(mask_pred > 0).get_bounding_boxes()\n\n        # calculate average mask prob\n        mask_scores_per_image = 
(mask_pred.sigmoid().flatten(1) * result.pred_masks.flatten(1)).sum(1) / (result.pred_masks.flatten(1).sum(1) + 1e-6)\n        result.scores = scores_per_image * mask_scores_per_image\n        result.pred_classes = labels_per_image\n        return result\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/__init__.py",
    "content": "from .backbone.swin import D2SwinTransformer\nfrom .pixel_decoder.fpn import BasePixelDecoder\nfrom .pixel_decoder.msdeformattn import MSDeformAttnPixelDecoder\nfrom .meta_arch.mask_former_head import MaskFormerHead\nfrom .meta_arch.per_pixel_baseline import PerPixelBaselineHead, PerPixelBaselinePlusHead\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/backbone/__init__.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/backbone/swin.py",
    "content": "# --------------------------------------------------------\n# Swin Transformer\n# Copyright (c) 2021 Microsoft\n# Licensed under The MIT License [see LICENSE for details]\n# Written by Ze Liu, Yutong Lin, Yixuan Wei\n# --------------------------------------------------------\n\n# Modified by Bowen Cheng from https://github.com/SwinTransformer/Swin-Transformer-Semantic-Segmentation/blob/main/mmseg/models/backbones/swin_transformer.py\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport torch.utils.checkpoint as checkpoint\nfrom timm.models.layers import DropPath, to_2tuple, trunc_normal_\n\nfrom detectron2.modeling import BACKBONE_REGISTRY, Backbone, ShapeSpec\n\n\nclass Mlp(nn.Module):\n    \"\"\"Multilayer perceptron.\"\"\"\n\n    def __init__(\n        self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.0\n    ):\n        super().__init__()\n        out_features = out_features or in_features\n        hidden_features = hidden_features or in_features\n        self.fc1 = nn.Linear(in_features, hidden_features)\n        self.act = act_layer()\n        self.fc2 = nn.Linear(hidden_features, out_features)\n        self.drop = nn.Dropout(drop)\n\n    def forward(self, x):\n        x = self.fc1(x)\n        x = self.act(x)\n        x = self.drop(x)\n        x = self.fc2(x)\n        x = self.drop(x)\n        return x\n\n\ndef window_partition(x, window_size):\n    \"\"\"\n    Args:\n        x: (B, H, W, C)\n        window_size (int): window size\n    Returns:\n        windows: (num_windows*B, window_size, window_size, C)\n    \"\"\"\n    B, H, W, C = x.shape\n    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)\n    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)\n    return windows\n\n\ndef window_reverse(windows, window_size, H, W):\n    \"\"\"\n    Args:\n        windows: (num_windows*B, window_size, 
window_size, C)\n        window_size (int): Window size\n        H (int): Height of image\n        W (int): Width of image\n    Returns:\n        x: (B, H, W, C)\n    \"\"\"\n    B = int(windows.shape[0] / (H * W / window_size / window_size))\n    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)\n    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)\n    return x\n\n\nclass WindowAttention(nn.Module):\n    \"\"\"Window based multi-head self attention (W-MSA) module with relative position bias.\n    It supports both shifted and non-shifted windows.\n    Args:\n        dim (int): Number of input channels.\n        window_size (tuple[int]): The height and width of the window.\n        num_heads (int): Number of attention heads.\n        qkv_bias (bool, optional):  If True, add a learnable bias to query, key, value. Default: True\n        qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set\n        attn_drop (float, optional): Dropout ratio of attention weight. Default: 0.0\n        proj_drop (float, optional): Dropout ratio of output. 
Default: 0.0\n    \"\"\"\n\n    def __init__(\n        self,\n        dim,\n        window_size,\n        num_heads,\n        qkv_bias=True,\n        qk_scale=None,\n        attn_drop=0.0,\n        proj_drop=0.0,\n    ):\n\n        super().__init__()\n        self.dim = dim\n        self.window_size = window_size  # Wh, Ww\n        self.num_heads = num_heads\n        head_dim = dim // num_heads\n        self.scale = qk_scale or head_dim ** -0.5\n\n        # define a parameter table of relative position bias\n        self.relative_position_bias_table = nn.Parameter(\n            torch.zeros((2 * window_size[0] - 1) * (2 * window_size[1] - 1), num_heads)\n        )  # 2*Wh-1 * 2*Ww-1, nH\n\n        # get pair-wise relative position index for each token inside the window\n        coords_h = torch.arange(self.window_size[0])\n        coords_w = torch.arange(self.window_size[1])\n        coords = torch.stack(torch.meshgrid([coords_h, coords_w]))  # 2, Wh, Ww\n        coords_flatten = torch.flatten(coords, 1)  # 2, Wh*Ww\n        relative_coords = coords_flatten[:, :, None] - coords_flatten[:, None, :]  # 2, Wh*Ww, Wh*Ww\n        relative_coords = relative_coords.permute(1, 2, 0).contiguous()  # Wh*Ww, Wh*Ww, 2\n        relative_coords[:, :, 0] += self.window_size[0] - 1  # shift to start from 0\n        relative_coords[:, :, 1] += self.window_size[1] - 1\n        relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1\n        relative_position_index = relative_coords.sum(-1)  # Wh*Ww, Wh*Ww\n        self.register_buffer(\"relative_position_index\", relative_position_index)\n\n        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)\n        self.attn_drop = nn.Dropout(attn_drop)\n        self.proj = nn.Linear(dim, dim)\n        self.proj_drop = nn.Dropout(proj_drop)\n\n        trunc_normal_(self.relative_position_bias_table, std=0.02)\n        self.softmax = nn.Softmax(dim=-1)\n\n    def forward(self, x, mask=None):\n        \"\"\"Forward function.\n        Args:\n    
        x: input features with shape of (num_windows*B, N, C)\n            mask: (0/-inf) mask with shape of (num_windows, Wh*Ww, Wh*Ww) or None\n        \"\"\"\n        B_, N, C = x.shape\n        qkv = (\n            self.qkv(x)\n            .reshape(B_, N, 3, self.num_heads, C // self.num_heads)\n            .permute(2, 0, 3, 1, 4)\n        )\n        q, k, v = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)\n\n        q = q * self.scale\n        attn = q @ k.transpose(-2, -1)\n\n        relative_position_bias = self.relative_position_bias_table[\n            self.relative_position_index.view(-1)\n        ].view(\n            self.window_size[0] * self.window_size[1], self.window_size[0] * self.window_size[1], -1\n        )  # Wh*Ww,Wh*Ww,nH\n        relative_position_bias = relative_position_bias.permute(\n            2, 0, 1\n        ).contiguous()  # nH, Wh*Ww, Wh*Ww\n        attn = attn + relative_position_bias.unsqueeze(0)\n\n        if mask is not None:\n            nW = mask.shape[0]\n            attn = attn.view(B_ // nW, nW, self.num_heads, N, N) + mask.unsqueeze(1).unsqueeze(0)\n            attn = attn.view(-1, self.num_heads, N, N)\n            attn = self.softmax(attn)\n        else:\n            attn = self.softmax(attn)\n\n        attn = self.attn_drop(attn)\n\n        x = (attn @ v).transpose(1, 2).reshape(B_, N, C)\n        x = self.proj(x)\n        x = self.proj_drop(x)\n        return x\n\n\nclass SwinTransformerBlock(nn.Module):\n    \"\"\"Swin Transformer Block.\n    Args:\n        dim (int): Number of input channels.\n        num_heads (int): Number of attention heads.\n        window_size (int): Window size.\n        shift_size (int): Shift size for SW-MSA.\n        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.\n        qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. 
Default: True\n        qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set.\n        drop (float, optional): Dropout rate. Default: 0.0\n        attn_drop (float, optional): Attention dropout rate. Default: 0.0\n        drop_path (float, optional): Stochastic depth rate. Default: 0.0\n        act_layer (nn.Module, optional): Activation layer. Default: nn.GELU\n        norm_layer (nn.Module, optional): Normalization layer.  Default: nn.LayerNorm\n    \"\"\"\n\n    def __init__(\n        self,\n        dim,\n        num_heads,\n        window_size=7,\n        shift_size=0,\n        mlp_ratio=4.0,\n        qkv_bias=True,\n        qk_scale=None,\n        drop=0.0,\n        attn_drop=0.0,\n        drop_path=0.0,\n        act_layer=nn.GELU,\n        norm_layer=nn.LayerNorm,\n    ):\n        super().__init__()\n        self.dim = dim\n        self.num_heads = num_heads\n        self.window_size = window_size\n        self.shift_size = shift_size\n        self.mlp_ratio = mlp_ratio\n        assert 0 <= self.shift_size < self.window_size, \"shift_size must be in [0, window_size)\"\n\n        self.norm1 = norm_layer(dim)\n        self.attn = WindowAttention(\n            dim,\n            window_size=to_2tuple(self.window_size),\n            num_heads=num_heads,\n            qkv_bias=qkv_bias,\n            qk_scale=qk_scale,\n            attn_drop=attn_drop,\n            proj_drop=drop,\n        )\n\n        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()\n        self.norm2 = norm_layer(dim)\n        mlp_hidden_dim = int(dim * mlp_ratio)\n        self.mlp = Mlp(\n            in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop\n        )\n\n        self.H = None\n        self.W = None\n\n    def forward(self, x, mask_matrix):\n        \"\"\"Forward function.\n        Args:\n            x: Input feature, tensor size (B, H*W, C).\n            H, W: Spatial resolution of the input feature.\n    
        mask_matrix: Attention mask for cyclic shift.\n        \"\"\"\n        B, L, C = x.shape\n        H, W = self.H, self.W\n        assert L == H * W, \"input feature has wrong size\"\n\n        shortcut = x\n        x = self.norm1(x)\n        x = x.view(B, H, W, C)\n\n        # pad feature maps to multiples of window size\n        pad_l = pad_t = 0\n        pad_r = (self.window_size - W % self.window_size) % self.window_size\n        pad_b = (self.window_size - H % self.window_size) % self.window_size\n        x = F.pad(x, (0, 0, pad_l, pad_r, pad_t, pad_b))\n        _, Hp, Wp, _ = x.shape\n\n        # cyclic shift\n        if self.shift_size > 0:\n            shifted_x = torch.roll(x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))\n            attn_mask = mask_matrix\n        else:\n            shifted_x = x\n            attn_mask = None\n\n        # partition windows\n        x_windows = window_partition(\n            shifted_x, self.window_size\n        )  # nW*B, window_size, window_size, C\n        x_windows = x_windows.view(\n            -1, self.window_size * self.window_size, C\n        )  # nW*B, window_size*window_size, C\n\n        # W-MSA/SW-MSA\n        attn_windows = self.attn(x_windows, mask=attn_mask)  # nW*B, window_size*window_size, C\n\n        # merge windows\n        attn_windows = attn_windows.view(-1, self.window_size, self.window_size, C)\n        shifted_x = window_reverse(attn_windows, self.window_size, Hp, Wp)  # B H' W' C\n\n        # reverse cyclic shift\n        if self.shift_size > 0:\n            x = torch.roll(shifted_x, shifts=(self.shift_size, self.shift_size), dims=(1, 2))\n        else:\n            x = shifted_x\n\n        if pad_r > 0 or pad_b > 0:\n            x = x[:, :H, :W, :].contiguous()\n\n        x = x.view(B, H * W, C)\n\n        # FFN\n        x = shortcut + self.drop_path(x)\n        x = x + self.drop_path(self.mlp(self.norm2(x)))\n\n        return x\n\n\nclass PatchMerging(nn.Module):\n    
\"\"\"Patch Merging Layer\n    Args:\n        dim (int): Number of input channels.\n        norm_layer (nn.Module, optional): Normalization layer.  Default: nn.LayerNorm\n    \"\"\"\n\n    def __init__(self, dim, norm_layer=nn.LayerNorm):\n        super().__init__()\n        self.dim = dim\n        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)\n        self.norm = norm_layer(4 * dim)\n\n    def forward(self, x, H, W):\n        \"\"\"Forward function.\n        Args:\n            x: Input feature, tensor size (B, H*W, C).\n            H, W: Spatial resolution of the input feature.\n        \"\"\"\n        B, L, C = x.shape\n        assert L == H * W, \"input feature has wrong size\"\n\n        x = x.view(B, H, W, C)\n\n        # padding\n        pad_input = (H % 2 == 1) or (W % 2 == 1)\n        if pad_input:\n            x = F.pad(x, (0, 0, 0, W % 2, 0, H % 2))\n\n        x0 = x[:, 0::2, 0::2, :]  # B H/2 W/2 C\n        x1 = x[:, 1::2, 0::2, :]  # B H/2 W/2 C\n        x2 = x[:, 0::2, 1::2, :]  # B H/2 W/2 C\n        x3 = x[:, 1::2, 1::2, :]  # B H/2 W/2 C\n        x = torch.cat([x0, x1, x2, x3], -1)  # B H/2 W/2 4*C\n        x = x.view(B, -1, 4 * C)  # B H/2*W/2 4*C\n\n        x = self.norm(x)\n        x = self.reduction(x)\n\n        return x\n\n\nclass BasicLayer(nn.Module):\n    \"\"\"A basic Swin Transformer layer for one stage.\n    Args:\n        dim (int): Number of feature channels\n        depth (int): Depths of this stage.\n        num_heads (int): Number of attention head.\n        window_size (int): Local window size. Default: 7.\n        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4.\n        qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True\n        qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set.\n        drop (float, optional): Dropout rate. Default: 0.0\n        attn_drop (float, optional): Attention dropout rate. 
Default: 0.0\n        drop_path (float | tuple[float], optional): Stochastic depth rate. Default: 0.0\n        norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm\n        downsample (nn.Module | None, optional): Downsample layer at the end of the layer. Default: None\n        use_checkpoint (bool): Whether to use checkpointing to save memory. Default: False.\n    \"\"\"\n\n    def __init__(\n        self,\n        dim,\n        depth,\n        num_heads,\n        window_size=7,\n        mlp_ratio=4.0,\n        qkv_bias=True,\n        qk_scale=None,\n        drop=0.0,\n        attn_drop=0.0,\n        drop_path=0.0,\n        norm_layer=nn.LayerNorm,\n        downsample=None,\n        use_checkpoint=False,\n    ):\n        super().__init__()\n        self.window_size = window_size\n        self.shift_size = window_size // 2\n        self.depth = depth\n        self.use_checkpoint = use_checkpoint\n\n        # build blocks\n        self.blocks = nn.ModuleList(\n            [\n                SwinTransformerBlock(\n                    dim=dim,\n                    num_heads=num_heads,\n                    window_size=window_size,\n                    shift_size=0 if (i % 2 == 0) else window_size // 2,\n                    mlp_ratio=mlp_ratio,\n                    qkv_bias=qkv_bias,\n                    qk_scale=qk_scale,\n                    drop=drop,\n                    attn_drop=attn_drop,\n                    drop_path=drop_path[i] if isinstance(drop_path, list) else drop_path,\n                    norm_layer=norm_layer,\n                )\n                for i in range(depth)\n            ]\n        )\n\n        # patch merging layer\n        if downsample is not None:\n            self.downsample = downsample(dim=dim, norm_layer=norm_layer)\n        else:\n            self.downsample = None\n\n    def forward(self, x, H, W):\n        \"\"\"Forward function.\n        Args:\n            x: Input feature, tensor size (B, H*W, C).\n         
   H, W: Spatial resolution of the input feature.\n        \"\"\"\n\n        # calculate attention mask for SW-MSA\n        Hp = int(np.ceil(H / self.window_size)) * self.window_size\n        Wp = int(np.ceil(W / self.window_size)) * self.window_size\n        img_mask = torch.zeros((1, Hp, Wp, 1), device=x.device)  # 1 Hp Wp 1\n        h_slices = (\n            slice(0, -self.window_size),\n            slice(-self.window_size, -self.shift_size),\n            slice(-self.shift_size, None),\n        )\n        w_slices = (\n            slice(0, -self.window_size),\n            slice(-self.window_size, -self.shift_size),\n            slice(-self.shift_size, None),\n        )\n        cnt = 0\n        for h in h_slices:\n            for w in w_slices:\n                img_mask[:, h, w, :] = cnt\n                cnt += 1\n\n        mask_windows = window_partition(\n            img_mask, self.window_size\n        )  # nW, window_size, window_size, 1\n        mask_windows = mask_windows.view(-1, self.window_size * self.window_size)\n        attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)\n        attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0)).masked_fill(\n            attn_mask == 0, float(0.0)\n        )\n\n        for blk in self.blocks:\n            blk.H, blk.W = H, W\n            if self.use_checkpoint:\n                x = checkpoint.checkpoint(blk, x, attn_mask)\n            else:\n                x = blk(x, attn_mask)\n        if self.downsample is not None:\n            x_down = self.downsample(x, H, W)\n            Wh, Ww = (H + 1) // 2, (W + 1) // 2\n            return x, H, W, x_down, Wh, Ww\n        else:\n            return x, H, W, x, H, W\n\n\nclass PatchEmbed(nn.Module):\n    \"\"\"Image to Patch Embedding\n    Args:\n        patch_size (int): Patch token size. Default: 4.\n        in_chans (int): Number of input image channels. Default: 3.\n        embed_dim (int): Number of linear projection output channels. 
Default: 96.\n        norm_layer (nn.Module, optional): Normalization layer. Default: None\n    \"\"\"\n\n    def __init__(self, patch_size=4, in_chans=3, embed_dim=96, norm_layer=None):\n        super().__init__()\n        patch_size = to_2tuple(patch_size)\n        self.patch_size = patch_size\n\n        self.in_chans = in_chans\n        self.embed_dim = embed_dim\n\n        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)\n        if norm_layer is not None:\n            self.norm = norm_layer(embed_dim)\n        else:\n            self.norm = None\n\n    def forward(self, x):\n        \"\"\"Forward function.\"\"\"\n        # padding\n        _, _, H, W = x.size()\n        if W % self.patch_size[1] != 0:\n            x = F.pad(x, (0, self.patch_size[1] - W % self.patch_size[1]))\n        if H % self.patch_size[0] != 0:\n            x = F.pad(x, (0, 0, 0, self.patch_size[0] - H % self.patch_size[0]))\n\n        x = self.proj(x)  # B C Wh Ww\n        if self.norm is not None:\n            Wh, Ww = x.size(2), x.size(3)\n            x = x.flatten(2).transpose(1, 2)\n            x = self.norm(x)\n            x = x.transpose(1, 2).view(-1, self.embed_dim, Wh, Ww)\n\n        return x\n\n\nclass SwinTransformer(nn.Module):\n    \"\"\"Swin Transformer backbone.\n        A PyTorch impl of: `Swin Transformer: Hierarchical Vision Transformer using Shifted Windows`  -\n          https://arxiv.org/pdf/2103.14030\n    Args:\n        pretrain_img_size (int): Input image size for training the pretrained model,\n            used in absolute position embedding. Default: 224.\n        patch_size (int | tuple(int)): Patch size. Default: 4.\n        in_chans (int): Number of input image channels. Default: 3.\n        embed_dim (int): Number of linear projection output channels. 
Default: 96.\n        depths (tuple[int]): Depths of each Swin Transformer stage.\n        num_heads (tuple[int]): Number of attention heads of each stage.\n        window_size (int): Window size. Default: 7.\n        mlp_ratio (float): Ratio of mlp hidden dim to embedding dim. Default: 4.\n        qkv_bias (bool): If True, add a learnable bias to query, key, value. Default: True\n        qk_scale (float): Override default qk scale of head_dim ** -0.5 if set.\n        drop_rate (float): Dropout rate.\n        attn_drop_rate (float): Attention dropout rate. Default: 0.\n        drop_path_rate (float): Stochastic depth rate. Default: 0.2.\n        norm_layer (nn.Module): Normalization layer. Default: nn.LayerNorm.\n        ape (bool): If True, add absolute position embedding to the patch embedding. Default: False.\n        patch_norm (bool): If True, add normalization after patch embedding. Default: True.\n        out_indices (Sequence[int]): Output from which stages.\n        frozen_stages (int): Stages to be frozen (stop grad and set eval mode).\n            -1 means not freezing any parameters.\n        use_checkpoint (bool): Whether to use checkpointing to save memory. 
Default: False.\n    \"\"\"\n\n    def __init__(\n        self,\n        pretrain_img_size=224,\n        patch_size=4,\n        in_chans=3,\n        embed_dim=96,\n        depths=[2, 2, 6, 2],\n        num_heads=[3, 6, 12, 24],\n        window_size=7,\n        mlp_ratio=4.0,\n        qkv_bias=True,\n        qk_scale=None,\n        drop_rate=0.0,\n        attn_drop_rate=0.0,\n        drop_path_rate=0.2,\n        norm_layer=nn.LayerNorm,\n        ape=False,\n        patch_norm=True,\n        out_indices=(0, 1, 2, 3),\n        frozen_stages=-1,\n        use_checkpoint=False,\n    ):\n        super().__init__()\n\n        self.pretrain_img_size = pretrain_img_size\n        self.num_layers = len(depths)\n        self.embed_dim = embed_dim\n        self.ape = ape\n        self.patch_norm = patch_norm\n        self.out_indices = out_indices\n        self.frozen_stages = frozen_stages\n\n        # split image into non-overlapping patches\n        self.patch_embed = PatchEmbed(\n            patch_size=patch_size,\n            in_chans=in_chans,\n            embed_dim=embed_dim,\n            norm_layer=norm_layer if self.patch_norm else None,\n        )\n\n        # absolute position embedding\n        if self.ape:\n            pretrain_img_size = to_2tuple(pretrain_img_size)\n            patch_size = to_2tuple(patch_size)\n            patches_resolution = [\n                pretrain_img_size[0] // patch_size[0],\n                pretrain_img_size[1] // patch_size[1],\n            ]\n\n            self.absolute_pos_embed = nn.Parameter(\n                torch.zeros(1, embed_dim, patches_resolution[0], patches_resolution[1])\n            )\n            trunc_normal_(self.absolute_pos_embed, std=0.02)\n\n        self.pos_drop = nn.Dropout(p=drop_rate)\n\n        # stochastic depth\n        dpr = [\n            x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))\n        ]  # stochastic depth decay rule\n\n        # build layers\n        self.layers = 
nn.ModuleList()\n        for i_layer in range(self.num_layers):\n            layer = BasicLayer(\n                dim=int(embed_dim * 2 ** i_layer),\n                depth=depths[i_layer],\n                num_heads=num_heads[i_layer],\n                window_size=window_size,\n                mlp_ratio=mlp_ratio,\n                qkv_bias=qkv_bias,\n                qk_scale=qk_scale,\n                drop=drop_rate,\n                attn_drop=attn_drop_rate,\n                drop_path=dpr[sum(depths[:i_layer]) : sum(depths[: i_layer + 1])],\n                norm_layer=norm_layer,\n                downsample=PatchMerging if (i_layer < self.num_layers - 1) else None,\n                use_checkpoint=use_checkpoint,\n            )\n            self.layers.append(layer)\n\n        num_features = [int(embed_dim * 2 ** i) for i in range(self.num_layers)]\n        self.num_features = num_features\n\n        # add a norm layer for each output\n        for i_layer in out_indices:\n            layer = norm_layer(num_features[i_layer])\n            layer_name = f\"norm{i_layer}\"\n            self.add_module(layer_name, layer)\n\n        self._freeze_stages()\n\n    def _freeze_stages(self):\n        if self.frozen_stages >= 0:\n            self.patch_embed.eval()\n            for param in self.patch_embed.parameters():\n                param.requires_grad = False\n\n        if self.frozen_stages >= 1 and self.ape:\n            self.absolute_pos_embed.requires_grad = False\n\n        if self.frozen_stages >= 2:\n            self.pos_drop.eval()\n            for i in range(0, self.frozen_stages - 1):\n                m = self.layers[i]\n                m.eval()\n                for param in m.parameters():\n                    param.requires_grad = False\n\n    def init_weights(self, pretrained=None):\n        \"\"\"Initialize the weights in backbone.\n        Args:\n            pretrained (str, optional): Path to pre-trained weights.\n                Defaults to None.\n       
 \"\"\"\n\n        def _init_weights(m):\n            if isinstance(m, nn.Linear):\n                trunc_normal_(m.weight, std=0.02)\n                if isinstance(m, nn.Linear) and m.bias is not None:\n                    nn.init.constant_(m.bias, 0)\n            elif isinstance(m, nn.LayerNorm):\n                nn.init.constant_(m.bias, 0)\n                nn.init.constant_(m.weight, 1.0)\n\n    def forward(self, x):\n        \"\"\"Forward function.\"\"\"\n        x = self.patch_embed(x)\n\n        Wh, Ww = x.size(2), x.size(3)\n        if self.ape:\n            # interpolate the position embedding to the corresponding size\n            absolute_pos_embed = F.interpolate(\n                self.absolute_pos_embed, size=(Wh, Ww), mode=\"bicubic\"\n            )\n            x = (x + absolute_pos_embed).flatten(2).transpose(1, 2)  # B Wh*Ww C\n        else:\n            x = x.flatten(2).transpose(1, 2)\n        x = self.pos_drop(x)\n\n        outs = {}\n        for i in range(self.num_layers):\n            layer = self.layers[i]\n            x_out, H, W, x, Wh, Ww = layer(x, Wh, Ww)\n\n            if i in self.out_indices:\n                norm_layer = getattr(self, f\"norm{i}\")\n                x_out = norm_layer(x_out)\n\n                out = x_out.view(-1, H, W, self.num_features[i]).permute(0, 3, 1, 2).contiguous()\n                outs[\"res{}\".format(i + 2)] = out\n\n        return outs\n\n    def train(self, mode=True):\n        \"\"\"Convert the model into training mode while keep layers freezed.\"\"\"\n        super(SwinTransformer, self).train(mode)\n        self._freeze_stages()\n\n\n@BACKBONE_REGISTRY.register()\nclass D2SwinTransformer(SwinTransformer, Backbone):\n    def __init__(self, cfg, input_shape):\n\n        pretrain_img_size = cfg.MODEL.SWIN.PRETRAIN_IMG_SIZE\n        patch_size = cfg.MODEL.SWIN.PATCH_SIZE\n        in_chans = 3\n        embed_dim = cfg.MODEL.SWIN.EMBED_DIM\n        depths = cfg.MODEL.SWIN.DEPTHS\n        num_heads = 
cfg.MODEL.SWIN.NUM_HEADS\n        window_size = cfg.MODEL.SWIN.WINDOW_SIZE\n        mlp_ratio = cfg.MODEL.SWIN.MLP_RATIO\n        qkv_bias = cfg.MODEL.SWIN.QKV_BIAS\n        qk_scale = cfg.MODEL.SWIN.QK_SCALE\n        drop_rate = cfg.MODEL.SWIN.DROP_RATE\n        attn_drop_rate = cfg.MODEL.SWIN.ATTN_DROP_RATE\n        drop_path_rate = cfg.MODEL.SWIN.DROP_PATH_RATE\n        norm_layer = nn.LayerNorm\n        ape = cfg.MODEL.SWIN.APE\n        patch_norm = cfg.MODEL.SWIN.PATCH_NORM\n        use_checkpoint = cfg.MODEL.SWIN.USE_CHECKPOINT\n\n        super().__init__(\n            pretrain_img_size,\n            patch_size,\n            in_chans,\n            embed_dim,\n            depths,\n            num_heads,\n            window_size,\n            mlp_ratio,\n            qkv_bias,\n            qk_scale,\n            drop_rate,\n            attn_drop_rate,\n            drop_path_rate,\n            norm_layer,\n            ape,\n            patch_norm,\n            use_checkpoint=use_checkpoint,\n        )\n\n        self._out_features = cfg.MODEL.SWIN.OUT_FEATURES\n\n        self._out_feature_strides = {\n            \"res2\": 4,\n            \"res3\": 8,\n            \"res4\": 16,\n            \"res5\": 32,\n        }\n        self._out_feature_channels = {\n            \"res2\": self.num_features[0],\n            \"res3\": self.num_features[1],\n            \"res4\": self.num_features[2],\n            \"res5\": self.num_features[3],\n        }\n\n    def forward(self, x):\n        \"\"\"\n        Args:\n            x: Tensor of shape (N,C,H,W). H, W must be a multiple of ``self.size_divisibility``.\n        Returns:\n            dict[str->Tensor]: names and the corresponding features\n        \"\"\"\n        assert (\n            x.dim() == 4\n        ), f\"SwinTransformer takes an input of shape (N, C, H, W). 
Got {x.shape} instead!\"\n        outputs = {}\n        y = super().forward(x)\n        for k in y.keys():\n            if k in self._out_features:\n                outputs[k] = y[k]\n        return outputs\n\n    def output_shape(self):\n        return {\n            name: ShapeSpec(\n                channels=self._out_feature_channels[name], stride=self._out_feature_strides[name]\n            )\n            for name in self._out_features\n        }\n\n    @property\n    def size_divisibility(self):\n        return 32\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/criterion.py",
    "content": "# Modified by Bowen Cheng from https://github.com/facebookresearch/detr/blob/master/models/detr.py\n\"\"\"\nMaskFormer criterion.\n\"\"\"\nimport logging\n\nimport torch\nimport torch.nn.functional as F\nfrom torch import nn\n\nfrom detectron2.utils.comm import get_world_size\nfrom detectron2.projects.point_rend.point_features import (\n    get_uncertain_point_coords_with_randomness,\n    point_sample,\n)\n\nfrom ..utils.misc import is_dist_avail_and_initialized, nested_tensor_from_tensor_list\n\ndef unfold_wo_center(x, kernel_size, dilation):\n    assert x.dim() == 4\n    assert kernel_size % 2 == 1\n\n    # using SAME padding\n    padding = (kernel_size + (dilation - 1) * (kernel_size - 1)) // 2\n    unfolded_x = F.unfold(\n        x, kernel_size=kernel_size,\n        padding=padding,\n        dilation=dilation\n    )\n\n    unfolded_x = unfolded_x.reshape(\n        x.size(0), x.size(1), -1, x.size(2), x.size(3)\n    )\n\n    # remove the center pixels\n    size = kernel_size ** 2\n    unfolded_x = torch.cat((\n        unfolded_x[:, :, :size // 2],\n        unfolded_x[:, :, size // 2 + 1:]\n    ), dim=2)\n\n    return unfolded_x\n\ndef compute_pairwise_term(mask_logits, pairwise_size, pairwise_dilation):\n    assert mask_logits.dim() == 4\n\n    log_fg_prob = F.logsigmoid(mask_logits)\n    log_bg_prob = F.logsigmoid(-mask_logits)\n\n    log_fg_prob_unfold = unfold_wo_center(\n        log_fg_prob, kernel_size=pairwise_size,\n        dilation=pairwise_dilation\n    )\n    log_bg_prob_unfold = unfold_wo_center(\n        log_bg_prob, kernel_size=pairwise_size,\n        dilation=pairwise_dilation\n    )\n\n    # the probability of making the same prediction = p_i * p_j + (1 - p_i) * (1 - p_j)\n    # we compute the probability in log space to avoid numerical instability\n    log_same_fg_prob = log_fg_prob[:, :, None] + log_fg_prob_unfold\n    log_same_bg_prob = log_bg_prob[:, :, None] + log_bg_prob_unfold\n\n    max_ = torch.max(log_same_fg_prob, 
log_same_bg_prob)\n    log_same_prob = torch.log(\n        torch.exp(log_same_fg_prob - max_) +\n        torch.exp(log_same_bg_prob - max_)\n    ) + max_\n\n    # loss = -log(prob)\n    return -log_same_prob[:, 0]\n\ndef get_incoherent_mask(input_masks, sfact):\n    mask = input_masks.float()\n    w = input_masks.shape[-1]\n    h = input_masks.shape[-2]\n    mask_small = F.interpolate(mask, (h//sfact, w//sfact), mode='bilinear')\n    mask_recover = F.interpolate(mask_small, (h, w), mode='bilinear')\n    mask_uncertain = (mask - mask_recover).abs()\n    \n    mask_uncertain = (mask_uncertain > 0.01).float()\n    return mask_uncertain\n\ndef dice_coefficient(x, target):\n    eps = 1e-5\n    n_inst = x.size(0)\n    x = x.reshape(n_inst, -1)\n    target = target.reshape(n_inst, -1)\n    intersection = (x * target).sum(dim=1)\n    union = (x ** 2.0).sum(dim=1) + (target ** 2.0).sum(dim=1) + eps\n    loss = 1. - (2 * intersection / union)\n    return loss\n\ndef compute_project_term(mask_scores, gt_bitmasks):\n    mask_losses_y = dice_coefficient(\n        mask_scores.max(dim=2, keepdim=True)[0],\n        gt_bitmasks.max(dim=2, keepdim=True)[0]\n    )\n    mask_losses_x = dice_coefficient(\n        mask_scores.max(dim=3, keepdim=True)[0],\n        gt_bitmasks.max(dim=3, keepdim=True)[0]\n    )\n    return (mask_losses_x + mask_losses_y).mean()\n\ndef dice_loss(\n        inputs: torch.Tensor,\n        targets: torch.Tensor,\n        num_masks: float,\n    ):\n    \"\"\"\n    Compute the DICE loss, similar to generalized IOU for masks\n    Args:\n        inputs: A float tensor of arbitrary shape.\n                The predictions for each example.\n        targets: A float tensor with the same shape as inputs. 
Stores the binary\n                 classification label for each element in inputs\n                (0 for the negative class and 1 for the positive class).\n    \"\"\"\n    inputs = inputs.sigmoid()\n    inputs = inputs.flatten(1)\n    numerator = 2 * (inputs * targets).sum(-1)\n    denominator = inputs.sum(-1) + targets.sum(-1)\n    loss = 1 - (numerator + 1) / (denominator + 1)\n    return loss.sum() / num_masks\n\n\ndice_loss_jit = torch.jit.script(\n    dice_loss\n)  # type: torch.jit.ScriptModule\n\n\ndef sigmoid_ce_loss(\n        inputs: torch.Tensor,\n        targets: torch.Tensor,\n        num_masks: float,\n    ):\n    \"\"\"\n    Args:\n        inputs: A float tensor of arbitrary shape.\n                The predictions for each example.\n        targets: A float tensor with the same shape as inputs. Stores the binary\n                 classification label for each element in inputs\n                (0 for the negative class and 1 for the positive class).\n    Returns:\n        Loss tensor\n    \"\"\"\n    loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction=\"none\")\n\n    return loss.mean(1).sum() / num_masks\n\n\nsigmoid_ce_loss_jit = torch.jit.script(\n    sigmoid_ce_loss\n)  # type: torch.jit.ScriptModule\n\n\ndef calculate_uncertainty(logits):\n    \"\"\"\n    We estimate uncertainty as L1 distance between 0.0 and the logit prediction in 'logits' for the\n        foreground class in `classes`.\n    Args:\n        logits (Tensor): A tensor of shape (R, 1, ...) for class-specific or\n            class-agnostic, where R is the total number of predicted masks in all images and C is\n            the number of foreground classes. The values are logits.\n    Returns:\n        scores (Tensor): A tensor of shape (R, 1, ...) 
that contains uncertainty scores with\n            the most uncertain locations having the highest uncertainty score.\n    \"\"\"\n    assert logits.shape[1] == 1\n    gt_class_logits = logits.clone()\n    return -(torch.abs(gt_class_logits))\n\n\nclass SetCriterion(nn.Module):\n    \"\"\"This class computes the loss for DETR.\n    The process happens in two steps:\n        1) we compute hungarian assignment between ground truth boxes and the outputs of the model\n        2) we supervise each pair of matched ground-truth / prediction (supervise class and box)\n    \"\"\"\n\n    def __init__(self, num_classes, matcher, weight_dict, eos_coef, losses,\n                 num_points, oversample_ratio, importance_sample_ratio):\n        \"\"\"Create the criterion.\n        Parameters:\n            num_classes: number of object categories, omitting the special no-object category\n            matcher: module able to compute a matching between targets and proposals\n            weight_dict: dict containing as key the names of the losses and as values their relative weight.\n            eos_coef: relative classification weight applied to the no-object category\n            losses: list of all the losses to be applied. 
See get_loss for list of available losses.\n        \"\"\"\n        super().__init__()\n        self.num_classes = num_classes\n        self.matcher = matcher\n        self.weight_dict = weight_dict\n        self.eos_coef = eos_coef\n        self.losses = losses\n        empty_weight = torch.ones(self.num_classes + 1)\n        empty_weight[-1] = self.eos_coef\n        self.register_buffer(\"empty_weight\", empty_weight)\n\n        # pointwise mask loss parameters\n        self.num_points = num_points\n        self.oversample_ratio = oversample_ratio\n        self.importance_sample_ratio = importance_sample_ratio\n        self.laplacian_kernel = torch.tensor([-1, -1, -1, -1, 8, -1, -1, -1, -1], dtype=torch.float32).reshape(1, 1, 3, 3).requires_grad_(False)\n\n        self.register_buffer(\"_iter\", torch.zeros([1]))\n        self._warmup_iters = 1000 #20000\n\n    def loss_labels(self, outputs, targets, indices, num_masks):\n        \"\"\"Classification loss (NLL)\n        targets dicts must contain the key \"labels\" containing a tensor of dim [nb_target_boxes]\n        \"\"\"\n        assert \"pred_logits\" in outputs\n        src_logits = outputs[\"pred_logits\"].float()\n\n        idx = self._get_src_permutation_idx(indices)\n        target_classes_o = torch.cat([t[\"labels\"][J] for t, (_, J) in zip(targets, indices)])\n        target_classes = torch.full(\n            src_logits.shape[:2], self.num_classes, dtype=torch.int64, device=src_logits.device\n        )\n        target_classes[idx] = target_classes_o\n\n        loss_ce = F.cross_entropy(src_logits.transpose(1, 2), target_classes, self.empty_weight)\n        losses = {\"loss_ce\": loss_ce}\n        return losses\n\n    \n    def loss_masks_proj(self, outputs, targets, indices, num_masks, images_lab_sim):\n        assert \"pred_masks\" in outputs\n\n        self._iter += 1\n\n        src_idx = self._get_src_permutation_idx(indices)\n        tgt_idx = self._get_tgt_permutation_idx(indices)\n        
src_masks = outputs[\"pred_masks\"]\n        src_masks = src_masks[src_idx]\n        masks = [t[\"masks\"] for t in targets]\n        # TODO use valid to mask invalid areas due to padding in loss\n        target_masks, valid = nested_tensor_from_tensor_list(masks).decompose()\n        target_masks = target_masks.to(src_masks)\n        target_masks = target_masks[tgt_idx]\n        \n        if len(src_idx[0].tolist()) > 0:\n            images_lab_sim = torch.cat([images_lab_sim[ind] for ind in src_idx[0].tolist()])\n        \n\n        # No need to upsample predictions as we are using normalized coordinates :)\n        # N x 1 x H x W\n        src_masks = src_masks[:, None]\n        target_masks = target_masks[:, None]\n        target_masks = F.interpolate(target_masks, (src_masks.shape[-2], src_masks.shape[-1]), mode='bilinear')\n        # print('src masks shape:', src_masks.shape)\n        # print('target masks shape:', target_masks.shape)\n        if src_masks.shape[0] > 0:\n            loss_prj_term = compute_project_term(src_masks.sigmoid(), target_masks)  \n            # print('src_masks shape before:', src_masks.shape)\n            pairwise_losses = compute_pairwise_term(\n                src_masks, 3, 2\n            )\n            \n            inc_mask = get_incoherent_mask(src_masks.detach().sigmoid() > 0.5, 2) #* images_lab_sim).bool()\n            inc_mask = F.conv2d(inc_mask, self.laplacian_kernel.to(inc_mask.device), padding=1).abs()\n            inc_mask = (inc_mask > 0.5).float()\n            \n            weights = (images_lab_sim >= 0.3).float() * target_masks.float() #* inc_mask\n            loss_pairwise = ((pairwise_losses * weights).sum() / weights.sum().clamp(min=1.0)) * 0.25\n            warmup_factor = min(self._iter.item() / float(self._warmup_iters), 1.0)\n            loss_pairwise = loss_pairwise * warmup_factor #* 0.\n        else:\n            loss_prj_term = src_masks.sum() * 0.\n            loss_pairwise = src_masks.sum() * 0.\n\n     
  \n\n        losses = {\n            \"loss_mask\": loss_prj_term,\n            \"loss_bound\": loss_pairwise,\n        }\n\n        del src_masks\n        del target_masks\n        return losses\n\n\n\n    def loss_masks(self, outputs, targets, indices, num_masks):\n        \"\"\"Compute the losses related to the masks: the focal loss and the dice loss.\n        targets dicts must contain the key \"masks\" containing a tensor of dim [nb_target_boxes, h, w]\n        \"\"\"\n        assert \"pred_masks\" in outputs\n\n        src_idx = self._get_src_permutation_idx(indices)\n        tgt_idx = self._get_tgt_permutation_idx(indices)\n        src_masks = outputs[\"pred_masks\"]\n        src_masks = src_masks[src_idx]\n        masks = [t[\"masks\"] for t in targets]\n        # TODO use valid to mask invalid areas due to padding in loss\n        target_masks, valid = nested_tensor_from_tensor_list(masks).decompose()\n        target_masks = target_masks.to(src_masks)\n        target_masks = target_masks[tgt_idx]\n\n        # No need to upsample predictions as we are using normalized coordinates :)\n        # N x 1 x H x W\n        src_masks = src_masks[:, None]\n        target_masks = target_masks[:, None]\n\n        with torch.no_grad():\n            # sample point_coords\n            point_coords = get_uncertain_point_coords_with_randomness(\n                src_masks,\n                lambda logits: calculate_uncertainty(logits),\n                self.num_points,\n                self.oversample_ratio,\n                self.importance_sample_ratio,\n            )\n            # get gt labels\n            point_labels = point_sample(\n                target_masks,\n                point_coords,\n                align_corners=False,\n            ).squeeze(1)\n\n        point_logits = point_sample(\n            src_masks,\n            point_coords,\n            align_corners=False,\n        ).squeeze(1)\n\n        losses = {\n            \"loss_mask\": 
sigmoid_ce_loss_jit(point_logits, point_labels, num_masks),\n            \"loss_dice\": dice_loss_jit(point_logits, point_labels, num_masks),\n        }\n\n        del src_masks\n        del target_masks\n        return losses\n\n    def _get_src_permutation_idx(self, indices):\n        # permute predictions following indices\n        batch_idx = torch.cat([torch.full_like(src, i) for i, (src, _) in enumerate(indices)])\n        src_idx = torch.cat([src for (src, _) in indices])\n        return batch_idx, src_idx\n\n    def _get_tgt_permutation_idx(self, indices):\n        # permute targets following indices\n        batch_idx = torch.cat([torch.full_like(tgt, i) for i, (_, tgt) in enumerate(indices)])\n        tgt_idx = torch.cat([tgt for (_, tgt) in indices])\n        return batch_idx, tgt_idx\n\n    def get_loss(self, loss, outputs, targets, indices, num_masks, images_lab_sim):\n        loss_map = {\n            'labels': self.loss_labels,\n            'masks': self.loss_masks_proj,\n        }\n        assert loss in loss_map, f\"do you really want to compute {loss} loss?\"\n        if loss == 'masks':\n            return loss_map[loss](outputs, targets, indices, num_masks, images_lab_sim)\n        else:\n            return loss_map[loss](outputs, targets, indices, num_masks)\n\n    def forward(self, outputs, targets, images_lab_sim):\n        \"\"\"This performs the loss computation.\n        Parameters:\n             outputs: dict of tensors, see the output specification of the model for the format\n             targets: list of dicts, such that len(targets) == batch_size.\n                      The expected keys in each dict depend on the losses applied, see each loss' doc\n        \"\"\"\n        outputs_without_aux = {k: v for k, v in outputs.items() if k != \"aux_outputs\"}\n\n        # Retrieve the matching between the outputs of the last layer and the targets\n        indices = self.matcher(outputs_without_aux, targets)\n\n        # Compute the average 
number of target boxes across all nodes, for normalization purposes\n        num_masks = sum(len(t[\"labels\"]) for t in targets)\n        num_masks = torch.as_tensor(\n            [num_masks], dtype=torch.float, device=next(iter(outputs.values())).device\n        )\n        if is_dist_avail_and_initialized():\n            torch.distributed.all_reduce(num_masks)\n        num_masks = torch.clamp(num_masks / get_world_size(), min=1).item()\n\n        # Compute all the requested losses\n        losses = {}\n        for loss in self.losses:\n            losses.update(self.get_loss(loss, outputs, targets, indices, num_masks, images_lab_sim))\n\n        # In case of auxiliary losses, we repeat this process with the output of each intermediate layer.\n        if \"aux_outputs\" in outputs:\n            for i, aux_outputs in enumerate(outputs[\"aux_outputs\"]):\n                indices = self.matcher(aux_outputs, targets)\n                for loss in self.losses:\n                    l_dict = self.get_loss(loss, aux_outputs, targets, indices, num_masks, images_lab_sim)\n                    l_dict = {k + f\"_{i}\": v for k, v in l_dict.items()}\n                    losses.update(l_dict)\n\n        return losses\n\n    def __repr__(self):\n        head = \"Criterion \" + self.__class__.__name__\n        body = [\n            \"matcher: {}\".format(self.matcher.__repr__(_repr_indent=8)),\n            \"losses: {}\".format(self.losses),\n            \"weight_dict: {}\".format(self.weight_dict),\n            \"num_classes: {}\".format(self.num_classes),\n            \"eos_coef: {}\".format(self.eos_coef),\n            \"num_points: {}\".format(self.num_points),\n            \"oversample_ratio: {}\".format(self.oversample_ratio),\n            \"importance_sample_ratio: {}\".format(self.importance_sample_ratio),\n        ]\n        _repr_indent = 4\n        lines = [head] + [\" \" * _repr_indent + line for line in body]\n        return \"\\n\".join(lines)\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/matcher.py",
    "content": "# Modified by Bowen Cheng from https://github.com/facebookresearch/detr/blob/master/models/matcher.py\n\"\"\"\nModules to compute the matching cost and solve the corresponding LSAP.\n\"\"\"\nimport torch\nimport torch.nn.functional as F\nfrom scipy.optimize import linear_sum_assignment\nfrom torch import nn\nfrom torch.cuda.amp import autocast\n\nfrom detectron2.projects.point_rend.point_features import point_sample\nfrom util.box_ops import box_cxcywh_to_xyxy, generalized_box_iou, generalized_multi_box_iou\n\ndef batch_dice_loss(inputs: torch.Tensor, targets: torch.Tensor):\n    \"\"\"\n    Compute the DICE loss, similar to generalized IOU for masks\n    Args:\n        inputs: A float tensor of arbitrary shape.\n                The predictions for each example.\n        targets: A float tensor with the same shape as inputs. Stores the binary\n                 classification label for each element in inputs\n                (0 for the negative class and 1 for the positive class).\n    \"\"\"\n    inputs = inputs #.sigmoid()\n    inputs = inputs.flatten(1)\n    numerator = 2 * torch.einsum(\"nc,mc->nm\", inputs, targets)\n    denominator = inputs.sum(-1)[:, None] + targets.sum(-1)[None, :]\n    loss = 1 - (numerator + 1) / (denominator + 1)\n    return loss\n\n\nbatch_dice_loss_jit = torch.jit.script(\n    batch_dice_loss\n)  # type: torch.jit.ScriptModule\n\n\ndef batch_sigmoid_ce_loss(inputs: torch.Tensor, targets: torch.Tensor):\n    \"\"\"\n    Args:\n        inputs: A float tensor of arbitrary shape.\n                The predictions for each example.\n        targets: A float tensor with the same shape as inputs. 
Stores the binary\n                 classification label for each element in inputs\n                (0 for the negative class and 1 for the positive class).\n    Returns:\n        Loss tensor\n    \"\"\"\n    hw = inputs.shape[1]\n\n    pos = F.binary_cross_entropy(\n        inputs, torch.ones_like(inputs), reduction=\"none\"\n    )\n    neg = F.binary_cross_entropy(\n        inputs, torch.zeros_like(inputs), reduction=\"none\"\n    )\n\n    loss = torch.einsum(\"nc,mc->nm\", pos, targets) + torch.einsum(\n        \"nc,mc->nm\", neg, (1 - targets)\n    )\n\n    return loss / hw\n\n\nbatch_sigmoid_ce_loss_jit = torch.jit.script(\n    batch_sigmoid_ce_loss\n)  # type: torch.jit.ScriptModule\n\ndef masks_to_boxes(masks: torch.Tensor) -> torch.Tensor:\n    \"\"\"\n    Fill the axis-aligned bounding-box region of each mask with ones, turning every\n    mask into a box-shaped mask.\n\n    Note: despite its name, this function returns the (modified) masks of shape\n    [N, H, W], not box coordinates; use ``masks_to_boxes_cc`` for [N, 4] boxes.\n\n    Args:\n        masks (Tensor[N, H, W]): masks to transform where N is the number of masks\n            and (H, W) are the spatial dimensions.\n\n    Returns:\n        Tensor[N, H, W]: box-filled masks (modified in place)\n    \"\"\"\n    if masks.numel() == 0:\n        return masks\n\n    for index, mask in enumerate(masks):\n        y, x = torch.where(mask != 0)\n        if len(x) * len(y) == 0:\n            continue\n\n        masks[index, torch.min(y):torch.max(y), torch.min(x):torch.max(x)] = 1.0\n\n    return masks\n\ndef masks_to_boxes_cc(masks: torch.Tensor) -> torch.Tensor:\n    \"\"\"\n    Compute the bounding boxes around the provided masks.\n\n    Returns a [N, 4] tensor containing bounding boxes. 
The boxes are in ``(x1, y1, x2, y2)`` format with\n    ``0 <= x1 < x2`` and ``0 <= y1 < y2``.\n\n    Args:\n        masks (Tensor[N, H, W]): masks to transform where N is the number of masks\n            and (H, W) are the spatial dimensions.\n\n    Returns:\n        Tensor[N, 4]: bounding boxes\n    \"\"\"\n    if masks.numel() == 0:\n        return torch.zeros((0, 4), device=masks.device, dtype=torch.float)\n\n    n = masks.shape[0]\n    h = masks.shape[1]\n    w = masks.shape[2]\n\n    bounding_boxes = torch.zeros((n, 4), device=masks.device, dtype=torch.float)\n\n    for index, mask in enumerate(masks):\n        y, x = torch.where(mask != 0)\n        if len(x) * len(y) == 0:\n            continue\n\n        bounding_boxes[index, 0] = torch.min(x) / float(w)\n        bounding_boxes[index, 1] = torch.min(y) / float(h)\n        bounding_boxes[index, 2] = torch.max(x) / float(w)\n        bounding_boxes[index, 3] = torch.max(y) / float(h)\n\n    return bounding_boxes\n\n\nclass HungarianMatcher(nn.Module):\n    \"\"\"This class computes an assignment between the targets and the predictions of the network\n\n    For efficiency reasons, the targets don't include the no_object. Because of this, in general,\n    there are more predictions than targets. 
In this case, we do a 1-to-1 matching of the best predictions,\n    while the others are un-matched (and thus treated as non-objects).\n    \"\"\"\n\n    def __init__(self, cost_class: float = 1, cost_mask: float = 1, cost_dice: float = 1, num_points: int = 0):\n        \"\"\"Creates the matcher\n\n        Params:\n            cost_class: This is the relative weight of the classification error in the matching cost\n            cost_mask: This is the relative weight of the focal loss of the binary mask in the matching cost\n            cost_dice: This is the relative weight of the dice loss of the binary mask in the matching cost\n        \"\"\"\n        super().__init__()\n        self.cost_class = cost_class\n        self.cost_mask = cost_mask\n        self.cost_dice = cost_dice\n\n        self.cost_giou = 2.0\n        self.cost_bbox = 5.0\n\n        assert cost_class != 0 or cost_mask != 0 or cost_dice != 0, \"all costs can't be 0\"\n\n        self.num_points = num_points\n\n    @torch.no_grad()\n    def memory_efficient_forward(self, outputs, targets):\n        \"\"\"More memory-friendly matching\"\"\"\n        bs, num_queries = outputs[\"pred_logits\"].shape[:2]\n\n        indices = []\n\n        # Iterate through batch size\n        for b in range(bs):\n\n            out_prob = outputs[\"pred_logits\"][b].softmax(-1)  # [num_queries, num_classes]\n            tgt_ids = targets[b][\"labels\"]\n\n            # Compute the classification cost. 
Contrary to the loss, we don't use the NLL,\n            # but approximate it by 1 - proba[target class].\n            # The 1 is a constant that doesn't change the matching, so it can be omitted.\n            cost_class = -out_prob[:, tgt_ids]\n\n            out_mask = outputs[\"pred_masks\"][b]  # [num_queries, H_pred, W_pred]\n            out_mask_box = masks_to_boxes_cc((out_mask.sigmoid() > 0.5).float())\n            # gt masks are already padded when preparing target\n            tgt_mask = targets[b][\"masks\"].to(out_mask)\n            tgt_mask_box = masks_to_boxes_cc(tgt_mask)\n            # print('tgt_mask_box shape:', tgt_mask_box.shape)\n\n            with autocast(enabled=False):\n                cost_bbox = torch.cdist(out_mask_box, tgt_mask_box)\n                cost_giou = -generalized_box_iou(out_mask_box, tgt_mask_box)\n                if torch.isnan(cost_bbox).any():\n                    print('cost_bbox:', cost_bbox)\n                if torch.isnan(cost_giou).any():\n                    print('cost_giou:', cost_giou)\n\n            C = (\n                self.cost_bbox * cost_bbox\n                + self.cost_class * cost_class\n                + self.cost_giou * cost_giou\n            )\n\n            C = C.reshape(num_queries, -1).cpu()\n\n            indices.append(linear_sum_assignment(C))\n\n        return [\n            (torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64))\n            for i, j in indices\n        ]\n\n    @torch.no_grad()\n    def forward(self, outputs, targets):\n        \"\"\"Performs the matching\n\n        Params:\n            outputs: This is a dict that contains at least these entries:\n                 \"pred_logits\": Tensor of dim [batch_size, num_queries, num_classes] with the classification logits\n                 \"pred_masks\": Tensor of dim [batch_size, num_queries, H_pred, W_pred] with the predicted masks\n\n            targets: This is a list of targets 
(len(targets) = batch_size), where each target is a dict containing:\n                 \"labels\": Tensor of dim [num_target_boxes] (where num_target_boxes is the number of ground-truth\n                           objects in the target) containing the class labels\n                 \"masks\": Tensor of dim [num_target_boxes, H_gt, W_gt] containing the target masks\n\n        Returns:\n            A list of size batch_size, containing tuples of (index_i, index_j) where:\n                - index_i is the indices of the selected predictions (in order)\n                - index_j is the indices of the corresponding selected targets (in order)\n            For each batch element, it holds:\n                len(index_i) = len(index_j) = min(num_queries, num_target_boxes)\n        \"\"\"\n        return self.memory_efficient_forward(outputs, targets)\n\n    def __repr__(self, _repr_indent=4):\n        head = \"Matcher \" + self.__class__.__name__\n        body = [\n            \"cost_class: {}\".format(self.cost_class),\n            \"cost_mask: {}\".format(self.cost_mask),\n            \"cost_dice: {}\".format(self.cost_dice),\n        ]\n        lines = [head] + [\" \" * _repr_indent + line for line in body]\n        return \"\\n\".join(lines)\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/meta_arch/__init__.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/meta_arch/__init__.py.new",
    "content": ""
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/meta_arch/mask_former_head.py",
    "content": "import logging\nfrom copy import deepcopy\nfrom typing import Callable, Dict, List, Optional, Tuple, Union\n\nimport fvcore.nn.weight_init as weight_init\nfrom torch import nn\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.layers import Conv2d, ShapeSpec, get_norm\nfrom detectron2.modeling import SEM_SEG_HEADS_REGISTRY\n\nfrom ..transformer_decoder.maskformer_transformer_decoder import build_transformer_decoder\nfrom ..pixel_decoder.fpn import build_pixel_decoder\n\n\n@SEM_SEG_HEADS_REGISTRY.register()\nclass MaskFormerHead(nn.Module):\n\n    _version = 2\n\n    def _load_from_state_dict(\n        self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs\n    ):\n        version = local_metadata.get(\"version\", None)\n        if version is None or version < 2:\n            # Do not warn if training from scratch\n            scratch = True\n            logger = logging.getLogger(__name__)\n            for k in list(state_dict.keys()):\n                newk = k\n                '''\n                if \"sem_seg_head\" in k and not k.startswith(prefix + \"predictor\"):\n                    newk = k.replace(prefix, prefix + \"pixel_decoder.\")\n                    # logger.debug(f\"{k} ==> {newk}\")\n                '''\n                if newk != k:\n                    state_dict[newk] = state_dict[k]\n                    del state_dict[k]\n                    scratch = False\n\n            if not scratch:\n                logger.warning(\n                    f\"Weight format of {self.__class__.__name__} has changed! \"\n                    \"Please upgrade your models. 
Applying automatic conversion now ...\"\n                )\n\n    @configurable\n    def __init__(\n        self,\n        input_shape: Dict[str, ShapeSpec],\n        *,\n        num_classes: int,\n        pixel_decoder: nn.Module,\n        loss_weight: float = 1.0,\n        ignore_value: int = -1,\n        # extra parameters\n        transformer_predictor: nn.Module,\n        transformer_in_feature: str,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            input_shape: shapes (channels and stride) of the input features\n            num_classes: number of classes to predict\n            pixel_decoder: the pixel decoder module\n            loss_weight: loss weight\n            ignore_value: category id to be ignored during training.\n            transformer_predictor: the transformer decoder that makes prediction\n            transformer_in_feature: input feature name to the transformer_predictor\n        \"\"\"\n        super().__init__()\n        input_shape = sorted(input_shape.items(), key=lambda x: x[1].stride)\n        self.in_features = [k for k, v in input_shape]\n        feature_strides = [v.stride for k, v in input_shape]\n        feature_channels = [v.channels for k, v in input_shape]\n\n        self.ignore_value = ignore_value\n        self.common_stride = 4\n        self.loss_weight = loss_weight\n\n        self.pixel_decoder = pixel_decoder\n        self.predictor = transformer_predictor\n        self.transformer_in_feature = transformer_in_feature\n\n        self.num_classes = num_classes\n\n    @classmethod\n    def from_config(cls, cfg, input_shape: Dict[str, ShapeSpec]):\n        # figure out in_channels to transformer predictor\n        if cfg.MODEL.MASK_FORMER.TRANSFORMER_IN_FEATURE == \"transformer_encoder\":\n            transformer_predictor_in_channels = cfg.MODEL.SEM_SEG_HEAD.CONVS_DIM\n        elif cfg.MODEL.MASK_FORMER.TRANSFORMER_IN_FEATURE == \"pixel_embedding\":\n            
transformer_predictor_in_channels = cfg.MODEL.SEM_SEG_HEAD.MASK_DIM\n        elif cfg.MODEL.MASK_FORMER.TRANSFORMER_IN_FEATURE == \"multi_scale_pixel_decoder\":  # for maskformer2\n            transformer_predictor_in_channels = cfg.MODEL.SEM_SEG_HEAD.CONVS_DIM\n        else:\n            transformer_predictor_in_channels = input_shape[cfg.MODEL.MASK_FORMER.TRANSFORMER_IN_FEATURE].channels\n\n        return {\n            \"input_shape\": {\n                k: v for k, v in input_shape.items() if k in cfg.MODEL.SEM_SEG_HEAD.IN_FEATURES\n            },\n            \"ignore_value\": cfg.MODEL.SEM_SEG_HEAD.IGNORE_VALUE,\n            \"num_classes\": cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES,\n            \"pixel_decoder\": build_pixel_decoder(cfg, input_shape),\n            \"loss_weight\": cfg.MODEL.SEM_SEG_HEAD.LOSS_WEIGHT,\n            \"transformer_in_feature\": cfg.MODEL.MASK_FORMER.TRANSFORMER_IN_FEATURE,\n            \"transformer_predictor\": build_transformer_decoder(\n                cfg,\n                transformer_predictor_in_channels,\n                mask_classification=True,\n            ),\n        }\n\n    def forward(self, features, mask=None):\n        return self.layers(features, mask)\n\n    def layers(self, features, mask=None):\n        mask_features, transformer_encoder_features, multi_scale_features = self.pixel_decoder.forward_features(features)\n        if self.transformer_in_feature == \"multi_scale_pixel_decoder\":\n            predictions = self.predictor(multi_scale_features, mask_features, mask)\n        else:\n            if self.transformer_in_feature == \"transformer_encoder\":\n                assert (\n                    transformer_encoder_features is not None\n                ), \"Please use the TransformerEncoderPixelDecoder.\"\n                predictions = self.predictor(transformer_encoder_features, mask_features, mask)\n            elif self.transformer_in_feature == \"pixel_embedding\":\n                predictions = 
self.predictor(mask_features, mask_features, mask)\n            else:\n                predictions = self.predictor(features[self.transformer_in_feature], mask_features, mask)\n        return predictions\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/meta_arch/per_pixel_baseline.py",
    "content": "import logging\nfrom typing import Callable, Dict, List, Optional, Tuple, Union\n\nimport fvcore.nn.weight_init as weight_init\nfrom torch import nn\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.layers import Conv2d, ShapeSpec, get_norm\nfrom detectron2.modeling import SEM_SEG_HEADS_REGISTRY\n\nfrom ..transformer_decoder.maskformer_transformer_decoder import StandardTransformerDecoder\nfrom ..pixel_decoder.fpn import build_pixel_decoder\n\n\n@SEM_SEG_HEADS_REGISTRY.register()\nclass PerPixelBaselineHead(nn.Module):\n\n    _version = 2\n\n    def _load_from_state_dict(\n        self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs\n    ):\n        version = local_metadata.get(\"version\", None)\n        if version is None or version < 2:\n            # Do not warn if train from scratch\n            scratch = True\n            logger = logging.getLogger(__name__)\n            for k in list(state_dict.keys()):\n                newk = k\n                if \"sem_seg_head\" in k and not k.startswith(prefix + \"predictor\"):\n                    newk = k.replace(prefix, prefix + \"pixel_decoder.\")\n                    # logger.warning(f\"{k} ==> {newk}\")\n                if newk != k:\n                    state_dict[newk] = state_dict[k]\n                    del state_dict[k]\n                    scratch = False\n\n            if not scratch:\n                logger.warning(\n                    f\"Weight format of {self.__class__.__name__} has changed! \"\n                    \"Please upgrade your models. 
Applying automatic conversion now ...\"\n                )\n\n    @configurable\n    def __init__(\n        self,\n        input_shape: Dict[str, ShapeSpec],\n        *,\n        num_classes: int,\n        pixel_decoder: nn.Module,\n        loss_weight: float = 1.0,\n        ignore_value: int = -1,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            input_shape: shapes (channels and stride) of the input features\n            num_classes: number of classes to predict\n            pixel_decoder: the pixel decoder module\n            loss_weight: loss weight\n            ignore_value: category id to be ignored during training.\n        \"\"\"\n        super().__init__()\n        input_shape = sorted(input_shape.items(), key=lambda x: x[1].stride)\n        self.in_features = [k for k, v in input_shape]\n        feature_strides = [v.stride for k, v in input_shape]\n        feature_channels = [v.channels for k, v in input_shape]\n\n        self.ignore_value = ignore_value\n        self.common_stride = 4\n        self.loss_weight = loss_weight\n\n        self.pixel_decoder = pixel_decoder\n        self.predictor = Conv2d(\n            self.pixel_decoder.mask_dim, num_classes, kernel_size=1, stride=1, padding=0\n        )\n        weight_init.c2_msra_fill(self.predictor)\n\n    @classmethod\n    def from_config(cls, cfg, input_shape: Dict[str, ShapeSpec]):\n        return {\n            \"input_shape\": {\n                k: v for k, v in input_shape.items() if k in cfg.MODEL.SEM_SEG_HEAD.IN_FEATURES\n            },\n            \"ignore_value\": cfg.MODEL.SEM_SEG_HEAD.IGNORE_VALUE,\n            \"num_classes\": cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES,\n            \"pixel_decoder\": build_pixel_decoder(cfg, input_shape),\n            \"loss_weight\": cfg.MODEL.SEM_SEG_HEAD.LOSS_WEIGHT,\n        }\n\n    def forward(self, features, targets=None):\n        \"\"\"\n        Returns:\n            In training, returns (None, dict of 
losses)\n            In inference, returns (CxHxW logits, {})\n        \"\"\"\n        x = self.layers(features)\n        if self.training:\n            return None, self.losses(x, targets)\n        else:\n            x = F.interpolate(\n                x, scale_factor=self.common_stride, mode=\"bilinear\", align_corners=False\n            )\n            return x, {}\n\n    def layers(self, features):\n        x, _, _ = self.pixel_decoder.forward_features(features)\n        x = self.predictor(x)\n        return x\n\n    def losses(self, predictions, targets):\n        predictions = predictions.float()  # https://github.com/pytorch/pytorch/issues/48163\n        predictions = F.interpolate(\n            predictions, scale_factor=self.common_stride, mode=\"bilinear\", align_corners=False\n        )\n        loss = F.cross_entropy(\n            predictions, targets, reduction=\"mean\", ignore_index=self.ignore_value\n        )\n        losses = {\"loss_sem_seg\": loss * self.loss_weight}\n        return losses\n\n\n@SEM_SEG_HEADS_REGISTRY.register()\nclass PerPixelBaselinePlusHead(PerPixelBaselineHead):\n    def _load_from_state_dict(\n        self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs\n    ):\n        version = local_metadata.get(\"version\", None)\n        if version is None or version < 2:\n            # Do not warn if train from scratch\n            scratch = True\n            logger = logging.getLogger(__name__)\n            for k in list(state_dict.keys()):\n                newk = k\n                if \"sem_seg_head\" in k and not k.startswith(prefix + \"predictor\"):\n                    newk = k.replace(prefix, prefix + \"pixel_decoder.\")\n                    logger.debug(f\"{k} ==> {newk}\")\n                if newk != k:\n                    state_dict[newk] = state_dict[k]\n                    del state_dict[k]\n                    scratch = False\n\n            if not scratch:\n                
logger.warning(\n                    f\"Weight format of {self.__class__.__name__} has changed! \"\n                    \"Please upgrade your models. Applying automatic conversion now ...\"\n                )\n\n    @configurable\n    def __init__(\n        self,\n        input_shape: Dict[str, ShapeSpec],\n        *,\n        # extra parameters\n        transformer_predictor: nn.Module,\n        transformer_in_feature: str,\n        deep_supervision: bool,\n        # inherit parameters\n        num_classes: int,\n        pixel_decoder: nn.Module,\n        loss_weight: float = 1.0,\n        ignore_value: int = -1,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            input_shape: shapes (channels and stride) of the input features\n            transformer_predictor: the transformer decoder that makes prediction\n            transformer_in_feature: input feature name to the transformer_predictor\n            deep_supervision: whether or not to add supervision to the output of\n                every transformer decoder layer\n            num_classes: number of classes to predict\n            pixel_decoder: the pixel decoder module\n            loss_weight: loss weight\n            ignore_value: category id to be ignored during training.\n        \"\"\"\n        super().__init__(\n            input_shape,\n            num_classes=num_classes,\n            pixel_decoder=pixel_decoder,\n            loss_weight=loss_weight,\n            ignore_value=ignore_value,\n        )\n\n        del self.predictor\n\n        self.predictor = transformer_predictor\n        self.transformer_in_feature = transformer_in_feature\n        self.deep_supervision = deep_supervision\n\n    @classmethod\n    def from_config(cls, cfg, input_shape: Dict[str, ShapeSpec]):\n        ret = super().from_config(cfg, input_shape)\n        ret[\"transformer_in_feature\"] = cfg.MODEL.MASK_FORMER.TRANSFORMER_IN_FEATURE\n        if 
cfg.MODEL.MASK_FORMER.TRANSFORMER_IN_FEATURE == \"transformer_encoder\":\n            in_channels = cfg.MODEL.SEM_SEG_HEAD.CONVS_DIM\n        else:\n            in_channels = input_shape[ret[\"transformer_in_feature\"]].channels\n        ret[\"transformer_predictor\"] = StandardTransformerDecoder(\n            cfg, in_channels, mask_classification=False\n        )\n        ret[\"deep_supervision\"] = cfg.MODEL.MASK_FORMER.DEEP_SUPERVISION\n        return ret\n\n    def forward(self, features, targets=None):\n        \"\"\"\n        Returns:\n            In training, returns (None, dict of losses)\n            In inference, returns (CxHxW logits, {})\n        \"\"\"\n        x, aux_outputs = self.layers(features)\n        if self.training:\n            if self.deep_supervision:\n                losses = self.losses(x, targets)\n                for i, aux_output in enumerate(aux_outputs):\n                    losses[\"loss_sem_seg\" + f\"_{i}\"] = self.losses(\n                        aux_output[\"pred_masks\"], targets\n                    )[\"loss_sem_seg\"]\n                return None, losses\n            else:\n                return None, self.losses(x, targets)\n        else:\n            x = F.interpolate(\n                x, scale_factor=self.common_stride, mode=\"bilinear\", align_corners=False\n            )\n            return x, {}\n\n    def layers(self, features):\n        mask_features, transformer_encoder_features, _ = self.pixel_decoder.forward_features(features)\n        if self.transformer_in_feature == \"transformer_encoder\":\n            assert (\n                transformer_encoder_features is not None\n            ), \"Please use the TransformerEncoderPixelDecoder.\"\n            predictions = self.predictor(transformer_encoder_features, mask_features)\n        else:\n            predictions = self.predictor(features[self.transformer_in_feature], mask_features)\n        if self.deep_supervision:\n            return 
predictions[\"pred_masks\"], predictions[\"aux_outputs\"]\n        else:\n            return predictions[\"pred_masks\"], None\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/pixel_decoder/__init__.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/pixel_decoder/fpn.py",
    "content": "import logging\nimport numpy as np\nfrom typing import Callable, Dict, List, Optional, Tuple, Union\n\nimport fvcore.nn.weight_init as weight_init\nimport torch\nfrom torch import nn\nfrom torch.nn import functional as F\nfrom torch.nn.init import xavier_uniform_, constant_, uniform_, normal_\nfrom torch.cuda.amp import autocast\n\nfrom detectron2.config import configurable\nfrom detectron2.layers import Conv2d, DeformConv, ShapeSpec, get_norm\nfrom detectron2.modeling import SEM_SEG_HEADS_REGISTRY\n\nfrom ..transformer_decoder.position_encoding import PositionEmbeddingSine\nfrom ..transformer_decoder.transformer import TransformerEncoder, TransformerEncoderLayer, _get_clones, _get_activation_fn\n\n\ndef build_pixel_decoder(cfg, input_shape):\n    \"\"\"\n    Build a pixel decoder from `cfg.MODEL.MASK_FORMER.PIXEL_DECODER_NAME`.\n    \"\"\"\n    name = cfg.MODEL.SEM_SEG_HEAD.PIXEL_DECODER_NAME\n    model = SEM_SEG_HEADS_REGISTRY.get(name)(cfg, input_shape)\n    forward_features = getattr(model, \"forward_features\", None)\n    if not callable(forward_features):\n        raise ValueError(\n            \"Only SEM_SEG_HEADS with forward_features method can be used as pixel decoder. 
\"\n            f\"Please implement forward_features for {name} to only return mask features.\"\n        )\n    return model\n\n\n# This is a modified FPN decoder.\n@SEM_SEG_HEADS_REGISTRY.register()\nclass BasePixelDecoder(nn.Module):\n    @configurable\n    def __init__(\n        self,\n        input_shape: Dict[str, ShapeSpec],\n        *,\n        conv_dim: int,\n        mask_dim: int,\n        norm: Optional[Union[str, Callable]] = None,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            input_shape: shapes (channels and stride) of the input features\n            conv_dims: number of output channels for the intermediate conv layers.\n            mask_dim: number of output channels for the final conv layer.\n            norm (str or callable): normalization for all conv layers\n        \"\"\"\n        super().__init__()\n\n        input_shape = sorted(input_shape.items(), key=lambda x: x[1].stride)\n        self.in_features = [k for k, v in input_shape]  # starting from \"res2\" to \"res5\"\n        feature_channels = [v.channels for k, v in input_shape]\n\n        lateral_convs = []\n        output_convs = []\n\n        use_bias = norm == \"\"\n        for idx, in_channels in enumerate(feature_channels):\n            if idx == len(self.in_features) - 1:\n                output_norm = get_norm(norm, conv_dim)\n                output_conv = Conv2d(\n                    in_channels,\n                    conv_dim,\n                    kernel_size=3,\n                    stride=1,\n                    padding=1,\n                    bias=use_bias,\n                    norm=output_norm,\n                    activation=F.relu,\n                )\n                weight_init.c2_xavier_fill(output_conv)\n                self.add_module(\"layer_{}\".format(idx + 1), output_conv)\n\n                lateral_convs.append(None)\n                output_convs.append(output_conv)\n            else:\n                lateral_norm = 
get_norm(norm, conv_dim)\n                output_norm = get_norm(norm, conv_dim)\n\n                lateral_conv = Conv2d(\n                    in_channels, conv_dim, kernel_size=1, bias=use_bias, norm=lateral_norm\n                )\n                output_conv = Conv2d(\n                    conv_dim,\n                    conv_dim,\n                    kernel_size=3,\n                    stride=1,\n                    padding=1,\n                    bias=use_bias,\n                    norm=output_norm,\n                    activation=F.relu,\n                )\n                weight_init.c2_xavier_fill(lateral_conv)\n                weight_init.c2_xavier_fill(output_conv)\n                self.add_module(\"adapter_{}\".format(idx + 1), lateral_conv)\n                self.add_module(\"layer_{}\".format(idx + 1), output_conv)\n\n                lateral_convs.append(lateral_conv)\n                output_convs.append(output_conv)\n        # Place convs into top-down order (from low to high resolution)\n        # to make the top-down computation in forward clearer.\n        self.lateral_convs = lateral_convs[::-1]\n        self.output_convs = output_convs[::-1]\n\n        self.mask_dim = mask_dim\n        self.mask_features = Conv2d(\n            conv_dim,\n            mask_dim,\n            kernel_size=3,\n            stride=1,\n            padding=1,\n        )\n        weight_init.c2_xavier_fill(self.mask_features)\n\n        self.maskformer_num_feature_levels = 3  # always use 3 scales\n\n    @classmethod\n    def from_config(cls, cfg, input_shape: Dict[str, ShapeSpec]):\n        ret = {}\n        ret[\"input_shape\"] = {\n            k: v for k, v in input_shape.items() if k in cfg.MODEL.SEM_SEG_HEAD.IN_FEATURES\n        }\n        ret[\"conv_dim\"] = cfg.MODEL.SEM_SEG_HEAD.CONVS_DIM\n        ret[\"mask_dim\"] = cfg.MODEL.SEM_SEG_HEAD.MASK_DIM\n        ret[\"norm\"] = cfg.MODEL.SEM_SEG_HEAD.NORM\n        return ret\n\n    def forward_features(self, features):\n   
     multi_scale_features = []\n        num_cur_levels = 0\n        # Reverse feature maps into top-down order (from low to high resolution)\n        for idx, f in enumerate(self.in_features[::-1]):\n            x = features[f]\n            lateral_conv = self.lateral_convs[idx]\n            output_conv = self.output_convs[idx]\n            if lateral_conv is None:\n                y = output_conv(x)\n            else:\n                cur_fpn = lateral_conv(x)\n                # Following FPN implementation, we use nearest upsampling here\n                y = cur_fpn + F.interpolate(y, size=cur_fpn.shape[-2:], mode=\"nearest\")\n                y = output_conv(y)\n            if num_cur_levels < self.maskformer_num_feature_levels:\n                multi_scale_features.append(y)\n                num_cur_levels += 1\n        return self.mask_features(y), None, multi_scale_features\n\n    def forward(self, features, targets=None):\n        logger = logging.getLogger(__name__)\n        logger.warning(\"Calling forward() may cause unpredicted behavior of PixelDecoder module.\")\n        return self.forward_features(features)\n\n\nclass TransformerEncoderOnly(nn.Module):\n    def __init__(\n        self,\n        d_model=512,\n        nhead=8,\n        num_encoder_layers=6,\n        dim_feedforward=2048,\n        dropout=0.1,\n        activation=\"relu\",\n        normalize_before=False,\n    ):\n        super().__init__()\n\n        encoder_layer = TransformerEncoderLayer(\n            d_model, nhead, dim_feedforward, dropout, activation, normalize_before\n        )\n        encoder_norm = nn.LayerNorm(d_model) if normalize_before else None\n        self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)\n\n        self._reset_parameters()\n\n        self.d_model = d_model\n        self.nhead = nhead\n\n    def _reset_parameters(self):\n        for p in self.parameters():\n            if p.dim() > 1:\n                
nn.init.xavier_uniform_(p)\n\n    def forward(self, src, mask, pos_embed):\n        # flatten NxCxHxW to HWxNxC\n        bs, c, h, w = src.shape\n        src = src.flatten(2).permute(2, 0, 1)\n        pos_embed = pos_embed.flatten(2).permute(2, 0, 1)\n        if mask is not None:\n            mask = mask.flatten(1)\n\n        memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)\n        return memory.permute(1, 2, 0).view(bs, c, h, w)\n\n\n# This is a modified FPN decoder with extra Transformer encoder that processes the lowest-resolution feature map.\n@SEM_SEG_HEADS_REGISTRY.register()\nclass TransformerEncoderPixelDecoder(BasePixelDecoder):\n    @configurable\n    def __init__(\n        self,\n        input_shape: Dict[str, ShapeSpec],\n        *,\n        transformer_dropout: float,\n        transformer_nheads: int,\n        transformer_dim_feedforward: int,\n        transformer_enc_layers: int,\n        transformer_pre_norm: bool,\n        conv_dim: int,\n        mask_dim: int,\n        norm: Optional[Union[str, Callable]] = None,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            input_shape: shapes (channels and stride) of the input features\n            transformer_dropout: dropout probability in transformer\n            transformer_nheads: number of heads in transformer\n            transformer_dim_feedforward: dimension of feedforward network\n            transformer_enc_layers: number of transformer encoder layers\n            transformer_pre_norm: whether to use pre-layernorm or not\n            conv_dims: number of output channels for the intermediate conv layers.\n            mask_dim: number of output channels for the final conv layer.\n            norm (str or callable): normalization for all conv layers\n        \"\"\"\n        super().__init__(input_shape, conv_dim=conv_dim, mask_dim=mask_dim, norm=norm)\n\n        input_shape = sorted(input_shape.items(), key=lambda x: x[1].stride)\n  
      self.in_features = [k for k, v in input_shape]  # starting from \"res2\" to \"res5\"\n        feature_strides = [v.stride for k, v in input_shape]\n        feature_channels = [v.channels for k, v in input_shape]\n\n        in_channels = feature_channels[len(self.in_features) - 1]\n        self.input_proj = Conv2d(in_channels, conv_dim, kernel_size=1)\n        weight_init.c2_xavier_fill(self.input_proj)\n        self.transformer = TransformerEncoderOnly(\n            d_model=conv_dim,\n            dropout=transformer_dropout,\n            nhead=transformer_nheads,\n            dim_feedforward=transformer_dim_feedforward,\n            num_encoder_layers=transformer_enc_layers,\n            normalize_before=transformer_pre_norm,\n        )\n        N_steps = conv_dim // 2\n        self.pe_layer = PositionEmbeddingSine(N_steps, normalize=True)\n\n        # update layer\n        use_bias = norm == \"\"\n        output_norm = get_norm(norm, conv_dim)\n        output_conv = Conv2d(\n            conv_dim,\n            conv_dim,\n            kernel_size=3,\n            stride=1,\n            padding=1,\n            bias=use_bias,\n            norm=output_norm,\n            activation=F.relu,\n        )\n        weight_init.c2_xavier_fill(output_conv)\n        delattr(self, \"layer_{}\".format(len(self.in_features)))\n        self.add_module(\"layer_{}\".format(len(self.in_features)), output_conv)\n        self.output_convs[0] = output_conv\n\n    @classmethod\n    def from_config(cls, cfg, input_shape: Dict[str, ShapeSpec]):\n        ret = super().from_config(cfg, input_shape)\n        ret[\"transformer_dropout\"] = cfg.MODEL.MASK_FORMER.DROPOUT\n        ret[\"transformer_nheads\"] = cfg.MODEL.MASK_FORMER.NHEADS\n        ret[\"transformer_dim_feedforward\"] = cfg.MODEL.MASK_FORMER.DIM_FEEDFORWARD\n        ret[\n            \"transformer_enc_layers\"\n        ] = cfg.MODEL.SEM_SEG_HEAD.TRANSFORMER_ENC_LAYERS  # a separate config\n        ret[\"transformer_pre_norm\"] = 
cfg.MODEL.MASK_FORMER.PRE_NORM\n        return ret\n\n    def forward_features(self, features):\n        multi_scale_features = []\n        num_cur_levels = 0\n        # Reverse feature maps into top-down order (from low to high resolution)\n        for idx, f in enumerate(self.in_features[::-1]):\n            x = features[f]\n            lateral_conv = self.lateral_convs[idx]\n            output_conv = self.output_convs[idx]\n            if lateral_conv is None:\n                transformer = self.input_proj(x)\n                pos = self.pe_layer(x)\n                transformer = self.transformer(transformer, None, pos)\n                y = output_conv(transformer)\n                # save intermediate feature as input to Transformer decoder\n                transformer_encoder_features = transformer\n            else:\n                cur_fpn = lateral_conv(x)\n                # Following FPN implementation, we use nearest upsampling here\n                y = cur_fpn + F.interpolate(y, size=cur_fpn.shape[-2:], mode=\"nearest\")\n                y = output_conv(y)\n            if num_cur_levels < self.maskformer_num_feature_levels:\n                multi_scale_features.append(y)\n                num_cur_levels += 1\n        return self.mask_features(y), transformer_encoder_features, multi_scale_features\n\n    def forward(self, features, targets=None):\n        logger = logging.getLogger(__name__)\n        logger.warning(\"Calling forward() may cause unpredicted behavior of PixelDecoder module.\")\n        return self.forward_features(features)\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/pixel_decoder/msdeformattn.py",
    "content": "import logging\nimport numpy as np\nfrom typing import Callable, Dict, List, Optional, Tuple, Union\n\nimport fvcore.nn.weight_init as weight_init\nimport torch\nfrom torch import nn\nfrom torch.nn import functional as F\nfrom torch.nn.init import xavier_uniform_, constant_, uniform_, normal_\nfrom torch.cuda.amp import autocast\n\nfrom detectron2.config import configurable\nfrom detectron2.layers import Conv2d, ShapeSpec, get_norm\nfrom detectron2.modeling import SEM_SEG_HEADS_REGISTRY\n\nfrom ..transformer_decoder.position_encoding import PositionEmbeddingSine\nfrom ..transformer_decoder.transformer import _get_clones, _get_activation_fn\nfrom .ops.modules import MSDeformAttn\n\n\n# MSDeformAttn Transformer encoder in deformable detr\nclass MSDeformAttnTransformerEncoderOnly(nn.Module):\n    def __init__(self, d_model=256, nhead=8,\n                 num_encoder_layers=6, dim_feedforward=1024, dropout=0.1,\n                 activation=\"relu\",\n                 num_feature_levels=4, enc_n_points=4,\n        ):\n        super().__init__()\n\n        self.d_model = d_model\n        self.nhead = nhead\n\n        encoder_layer = MSDeformAttnTransformerEncoderLayer(d_model, dim_feedforward,\n                                                            dropout, activation,\n                                                            num_feature_levels, nhead, enc_n_points)\n        self.encoder = MSDeformAttnTransformerEncoder(encoder_layer, num_encoder_layers)\n\n        self.level_embed = nn.Parameter(torch.Tensor(num_feature_levels, d_model))\n\n        self._reset_parameters()\n\n    def _reset_parameters(self):\n        for p in self.parameters():\n            if p.dim() > 1:\n                nn.init.xavier_uniform_(p)\n        for m in self.modules():\n            if isinstance(m, MSDeformAttn):\n                m._reset_parameters()\n        normal_(self.level_embed)\n\n    def get_valid_ratio(self, mask):\n        _, H, W = mask.shape\n        
valid_H = torch.sum(~mask[:, :, 0], 1)\n        valid_W = torch.sum(~mask[:, 0, :], 1)\n        valid_ratio_h = valid_H.float() / H\n        valid_ratio_w = valid_W.float() / W\n        valid_ratio = torch.stack([valid_ratio_w, valid_ratio_h], -1)\n        return valid_ratio\n\n    def forward(self, srcs, pos_embeds):\n        masks = [torch.zeros((x.size(0), x.size(2), x.size(3)), device=x.device, dtype=torch.bool) for x in srcs]\n        # prepare input for encoder\n        src_flatten = []\n        mask_flatten = []\n        lvl_pos_embed_flatten = []\n        spatial_shapes = []\n        for lvl, (src, mask, pos_embed) in enumerate(zip(srcs, masks, pos_embeds)):\n            bs, c, h, w = src.shape\n            spatial_shape = (h, w)\n            spatial_shapes.append(spatial_shape)\n            src = src.flatten(2).transpose(1, 2)\n            mask = mask.flatten(1)\n            pos_embed = pos_embed.flatten(2).transpose(1, 2)\n            lvl_pos_embed = pos_embed + self.level_embed[lvl].view(1, 1, -1)\n            lvl_pos_embed_flatten.append(lvl_pos_embed)\n            src_flatten.append(src)\n            mask_flatten.append(mask)\n        src_flatten = torch.cat(src_flatten, 1)\n        mask_flatten = torch.cat(mask_flatten, 1)\n        lvl_pos_embed_flatten = torch.cat(lvl_pos_embed_flatten, 1)\n        spatial_shapes = torch.as_tensor(spatial_shapes, dtype=torch.long, device=src_flatten.device)\n        level_start_index = torch.cat((spatial_shapes.new_zeros((1, )), spatial_shapes.prod(1).cumsum(0)[:-1]))\n        valid_ratios = torch.stack([self.get_valid_ratio(m) for m in masks], 1)\n\n        # encoder\n        memory = self.encoder(src_flatten, spatial_shapes, level_start_index, valid_ratios, lvl_pos_embed_flatten, mask_flatten)\n\n        return memory, spatial_shapes, level_start_index\n\n\nclass MSDeformAttnTransformerEncoderLayer(nn.Module):\n    def __init__(self,\n                 d_model=256, d_ffn=1024,\n                 dropout=0.1, 
activation=\"relu\",\n                 n_levels=4, n_heads=8, n_points=4):\n        super().__init__()\n\n        # self attention\n        self.self_attn = MSDeformAttn(d_model, n_levels, n_heads, n_points)\n        self.dropout1 = nn.Dropout(dropout)\n        self.norm1 = nn.LayerNorm(d_model)\n\n        # ffn\n        self.linear1 = nn.Linear(d_model, d_ffn)\n        self.activation = _get_activation_fn(activation)\n        self.dropout2 = nn.Dropout(dropout)\n        self.linear2 = nn.Linear(d_ffn, d_model)\n        self.dropout3 = nn.Dropout(dropout)\n        self.norm2 = nn.LayerNorm(d_model)\n\n    @staticmethod\n    def with_pos_embed(tensor, pos):\n        return tensor if pos is None else tensor + pos\n\n    def forward_ffn(self, src):\n        src2 = self.linear2(self.dropout2(self.activation(self.linear1(src))))\n        src = src + self.dropout3(src2)\n        src = self.norm2(src)\n        return src\n\n    def forward(self, src, pos, reference_points, spatial_shapes, level_start_index, padding_mask=None):\n        # self attention\n        src2 = self.self_attn(self.with_pos_embed(src, pos), reference_points, src, spatial_shapes, level_start_index, padding_mask)\n        src = src + self.dropout1(src2)\n        src = self.norm1(src)\n\n        # ffn\n        src = self.forward_ffn(src)\n\n        return src\n\n\nclass MSDeformAttnTransformerEncoder(nn.Module):\n    def __init__(self, encoder_layer, num_layers):\n        super().__init__()\n        self.layers = _get_clones(encoder_layer, num_layers)\n        self.num_layers = num_layers\n\n    @staticmethod\n    def get_reference_points(spatial_shapes, valid_ratios, device):\n        reference_points_list = []\n        for lvl, (H_, W_) in enumerate(spatial_shapes):\n\n            ref_y, ref_x = torch.meshgrid(torch.linspace(0.5, H_ - 0.5, H_, dtype=torch.float32, device=device),\n                                          torch.linspace(0.5, W_ - 0.5, W_, dtype=torch.float32, device=device))\n        
    ref_y = ref_y.reshape(-1)[None] / (valid_ratios[:, None, lvl, 1] * H_)\n            ref_x = ref_x.reshape(-1)[None] / (valid_ratios[:, None, lvl, 0] * W_)\n            ref = torch.stack((ref_x, ref_y), -1)\n            reference_points_list.append(ref)\n        reference_points = torch.cat(reference_points_list, 1)\n        reference_points = reference_points[:, :, None] * valid_ratios[:, None]\n        return reference_points\n\n    def forward(self, src, spatial_shapes, level_start_index, valid_ratios, pos=None, padding_mask=None):\n        output = src\n        reference_points = self.get_reference_points(spatial_shapes, valid_ratios, device=src.device)\n        for _, layer in enumerate(self.layers):\n            output = layer(output, pos, reference_points, spatial_shapes, level_start_index, padding_mask)\n\n        return output\n\n\n@SEM_SEG_HEADS_REGISTRY.register()\nclass MSDeformAttnPixelDecoder(nn.Module):\n    @configurable\n    def __init__(\n        self,\n        input_shape: Dict[str, ShapeSpec],\n        *,\n        transformer_dropout: float,\n        transformer_nheads: int,\n        transformer_dim_feedforward: int,\n        transformer_enc_layers: int,\n        conv_dim: int,\n        mask_dim: int,\n        norm: Optional[Union[str, Callable]] = None,\n        # deformable transformer encoder args\n        transformer_in_features: List[str],\n        common_stride: int,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            input_shape: shapes (channels and stride) of the input features\n            transformer_dropout: dropout probability in transformer\n            transformer_nheads: number of heads in transformer\n            transformer_dim_feedforward: dimension of feedforward network\n            transformer_enc_layers: number of transformer encoder layers\n            conv_dims: number of output channels for the intermediate conv layers.\n            mask_dim: number of output channels 
for the final conv layer.\n            norm (str or callable): normalization for all conv layers\n        \"\"\"\n        super().__init__()\n        transformer_input_shape = {\n            k: v for k, v in input_shape.items() if k in transformer_in_features\n        }\n\n        # this is the input shape of pixel decoder\n        input_shape = sorted(input_shape.items(), key=lambda x: x[1].stride)\n        self.in_features = [k for k, v in input_shape]  # starting from \"res2\" to \"res5\"\n        self.feature_strides = [v.stride for k, v in input_shape]\n        self.feature_channels = [v.channels for k, v in input_shape]\n        \n        # this is the input shape of transformer encoder (could use less features than pixel decoder\n        transformer_input_shape = sorted(transformer_input_shape.items(), key=lambda x: x[1].stride)\n        self.transformer_in_features = [k for k, v in transformer_input_shape]  # starting from \"res2\" to \"res5\"\n        transformer_in_channels = [v.channels for k, v in transformer_input_shape]\n        self.transformer_feature_strides = [v.stride for k, v in transformer_input_shape]  # to decide extra FPN layers\n\n        self.transformer_num_feature_levels = len(self.transformer_in_features)\n        if self.transformer_num_feature_levels > 1:\n            input_proj_list = []\n            # from low resolution to high resolution (res5 -> res2)\n            for in_channels in transformer_in_channels[::-1]:\n                input_proj_list.append(nn.Sequential(\n                    nn.Conv2d(in_channels, conv_dim, kernel_size=1),\n                    nn.GroupNorm(32, conv_dim),\n                ))\n            self.input_proj = nn.ModuleList(input_proj_list)\n        else:\n            self.input_proj = nn.ModuleList([\n                nn.Sequential(\n                    nn.Conv2d(transformer_in_channels[-1], conv_dim, kernel_size=1),\n                    nn.GroupNorm(32, conv_dim),\n                )])\n\n        for proj 
in self.input_proj:\n            nn.init.xavier_uniform_(proj[0].weight, gain=1)\n            nn.init.constant_(proj[0].bias, 0)\n\n        self.transformer = MSDeformAttnTransformerEncoderOnly(\n            d_model=conv_dim,\n            dropout=transformer_dropout,\n            nhead=transformer_nheads,\n            dim_feedforward=transformer_dim_feedforward,\n            num_encoder_layers=transformer_enc_layers,\n            num_feature_levels=self.transformer_num_feature_levels,\n        )\n        N_steps = conv_dim // 2\n        self.pe_layer = PositionEmbeddingSine(N_steps, normalize=True)\n\n        self.mask_dim = mask_dim\n        # use 1x1 conv instead\n        self.mask_features = Conv2d(\n            conv_dim,\n            mask_dim,\n            kernel_size=1,\n            stride=1,\n            padding=0,\n        )\n        weight_init.c2_xavier_fill(self.mask_features)\n        \n        self.maskformer_num_feature_levels = 3  # always use 3 scales\n        self.common_stride = common_stride\n\n        # extra fpn levels\n        stride = min(self.transformer_feature_strides)\n        self.num_fpn_levels = int(np.log2(stride) - np.log2(self.common_stride))\n\n        lateral_convs = []\n        output_convs = []\n\n        use_bias = norm == \"\"\n        for idx, in_channels in enumerate(self.feature_channels[:self.num_fpn_levels]):\n            lateral_norm = get_norm(norm, conv_dim)\n            output_norm = get_norm(norm, conv_dim)\n\n            lateral_conv = Conv2d(\n                in_channels, conv_dim, kernel_size=1, bias=use_bias, norm=lateral_norm\n            )\n            output_conv = Conv2d(\n                conv_dim,\n                conv_dim,\n                kernel_size=3,\n                stride=1,\n                padding=1,\n                bias=use_bias,\n                norm=output_norm,\n                activation=F.relu,\n            )\n            weight_init.c2_xavier_fill(lateral_conv)\n            
weight_init.c2_xavier_fill(output_conv)\n            self.add_module(\"adapter_{}\".format(idx + 1), lateral_conv)\n            self.add_module(\"layer_{}\".format(idx + 1), output_conv)\n\n            lateral_convs.append(lateral_conv)\n            output_convs.append(output_conv)\n        # Place convs into top-down order (from low to high resolution)\n        # to make the top-down computation in forward clearer.\n        self.lateral_convs = lateral_convs[::-1]\n        self.output_convs = output_convs[::-1]\n\n    @classmethod\n    def from_config(cls, cfg, input_shape: Dict[str, ShapeSpec]):\n        ret = {}\n        ret[\"input_shape\"] = {\n            k: v for k, v in input_shape.items() if k in cfg.MODEL.SEM_SEG_HEAD.IN_FEATURES\n        }\n        ret[\"conv_dim\"] = cfg.MODEL.SEM_SEG_HEAD.CONVS_DIM\n        ret[\"mask_dim\"] = cfg.MODEL.SEM_SEG_HEAD.MASK_DIM\n        ret[\"norm\"] = cfg.MODEL.SEM_SEG_HEAD.NORM\n        ret[\"transformer_dropout\"] = cfg.MODEL.MASK_FORMER.DROPOUT\n        ret[\"transformer_nheads\"] = cfg.MODEL.MASK_FORMER.NHEADS\n        # ret[\"transformer_dim_feedforward\"] = cfg.MODEL.MASK_FORMER.DIM_FEEDFORWARD\n        ret[\"transformer_dim_feedforward\"] = 1024  # use 1024 for deformable transformer encoder\n        ret[\n            \"transformer_enc_layers\"\n        ] = cfg.MODEL.SEM_SEG_HEAD.TRANSFORMER_ENC_LAYERS  # a separate config\n        ret[\"transformer_in_features\"] = cfg.MODEL.SEM_SEG_HEAD.DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES\n        ret[\"common_stride\"] = cfg.MODEL.SEM_SEG_HEAD.COMMON_STRIDE\n        return ret\n\n    @autocast(enabled=False)\n    def forward_features(self, features):\n        srcs = []\n        pos = []\n        # Reverse feature maps into top-down order (from low to high resolution)\n        for idx, f in enumerate(self.transformer_in_features[::-1]):\n            x = features[f].float()  # deformable detr does not support half precision\n            
srcs.append(self.input_proj[idx](x))\n            pos.append(self.pe_layer(x))\n\n        y, spatial_shapes, level_start_index = self.transformer(srcs, pos)\n        bs = y.shape[0]\n\n        split_size_or_sections = [None] * self.transformer_num_feature_levels\n        for i in range(self.transformer_num_feature_levels):\n            if i < self.transformer_num_feature_levels - 1:\n                split_size_or_sections[i] = level_start_index[i + 1] - level_start_index[i]\n            else:\n                split_size_or_sections[i] = y.shape[1] - level_start_index[i]\n        y = torch.split(y, split_size_or_sections, dim=1)\n\n        out = []\n        multi_scale_features = []\n        num_cur_levels = 0\n        for i, z in enumerate(y):\n            out.append(z.transpose(1, 2).view(bs, -1, spatial_shapes[i][0], spatial_shapes[i][1]))\n\n        # append `out` with extra FPN levels\n        # Reverse feature maps into top-down order (from low to high resolution)\n        for idx, f in enumerate(self.in_features[:self.num_fpn_levels][::-1]):\n            x = features[f].float()\n            lateral_conv = self.lateral_convs[idx]\n            output_conv = self.output_convs[idx]\n            cur_fpn = lateral_conv(x)\n            # Unlike the standard FPN (nearest upsampling), bilinear upsampling is used here\n            y = cur_fpn + F.interpolate(out[-1], size=cur_fpn.shape[-2:], mode=\"bilinear\", align_corners=False)\n            y = output_conv(y)\n            out.append(y)\n\n        for o in out:\n            if num_cur_levels < self.maskformer_num_feature_levels:\n                multi_scale_features.append(o)\n                num_cur_levels += 1\n\n        return self.mask_features(out[-1]), out[0], multi_scale_features\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/pixel_decoder/ops/functions/__init__.py",
    "content": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\nfrom .ms_deform_attn_func import MSDeformAttnFunction\n\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/pixel_decoder/ops/functions/ms_deform_attn_func.py",
    "content": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\nfrom __future__ import absolute_import\nfrom __future__ import print_function\nfrom __future__ import division\n\nimport torch\nimport torch.nn.functional as F\nfrom torch.autograd import Function\nfrom torch.autograd.function import once_differentiable\n\ntry:\n    import MultiScaleDeformableAttention as MSDA\nexcept ModuleNotFoundError as e:\n    info_string = (\n        \"\\n\\nPlease compile MultiScaleDeformableAttention CUDA op with the following commands:\\n\"\n        \"\\t`cd mask2former/modeling/pixel_decoder/ops`\\n\"\n        \"\\t`sh make.sh`\\n\"\n    )\n    raise ModuleNotFoundError(info_string)\n\n\nclass MSDeformAttnFunction(Function):\n    @staticmethod\n    def forward(ctx, value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights, im2col_step):\n        ctx.im2col_step = im2col_step\n        output = MSDA.ms_deform_attn_forward(\n            value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights, ctx.im2col_step)\n        ctx.save_for_backward(value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights)\n        return output\n\n    @staticmethod\n    @once_differentiable\n    def backward(ctx, grad_output):\n        value, value_spatial_shapes, value_level_start_index, sampling_locations, 
attention_weights = ctx.saved_tensors\n        grad_value, grad_sampling_loc, grad_attn_weight = \\\n            MSDA.ms_deform_attn_backward(\n                value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights, grad_output, ctx.im2col_step)\n\n        return grad_value, None, None, grad_sampling_loc, grad_attn_weight, None\n\n\ndef ms_deform_attn_core_pytorch(value, value_spatial_shapes, sampling_locations, attention_weights):\n    # for debug and test only,\n    # need to use cuda version instead\n    N_, S_, M_, D_ = value.shape\n    _, Lq_, M_, L_, P_, _ = sampling_locations.shape\n    value_list = value.split([H_ * W_ for H_, W_ in value_spatial_shapes], dim=1)\n    sampling_grids = 2 * sampling_locations - 1\n    sampling_value_list = []\n    for lid_, (H_, W_) in enumerate(value_spatial_shapes):\n        # N_, H_*W_, M_, D_ -> N_, H_*W_, M_*D_ -> N_, M_*D_, H_*W_ -> N_*M_, D_, H_, W_\n        value_l_ = value_list[lid_].flatten(2).transpose(1, 2).reshape(N_*M_, D_, H_, W_)\n        # N_, Lq_, M_, P_, 2 -> N_, M_, Lq_, P_, 2 -> N_*M_, Lq_, P_, 2\n        sampling_grid_l_ = sampling_grids[:, :, :, lid_].transpose(1, 2).flatten(0, 1)\n        # N_*M_, D_, Lq_, P_\n        sampling_value_l_ = F.grid_sample(value_l_, sampling_grid_l_,\n                                          mode='bilinear', padding_mode='zeros', align_corners=False)\n        sampling_value_list.append(sampling_value_l_)\n    # (N_, Lq_, M_, L_, P_) -> (N_, M_, Lq_, L_, P_) -> (N_, M_, 1, Lq_, L_*P_)\n    attention_weights = attention_weights.transpose(1, 2).reshape(N_*M_, 1, Lq_, L_*P_)\n    output = (torch.stack(sampling_value_list, dim=-2).flatten(-2) * attention_weights).sum(-1).view(N_, M_*D_, Lq_)\n    return output.transpose(1, 2).contiguous()\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/pixel_decoder/ops/make.sh",
    "content": "#!/usr/bin/env bash\n# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\npython setup.py build install\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/pixel_decoder/ops/modules/__init__.py",
    "content": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\nfrom .ms_deform_attn import MSDeformAttn\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/pixel_decoder/ops/modules/ms_deform_attn.py",
    "content": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\nfrom __future__ import absolute_import\nfrom __future__ import print_function\nfrom __future__ import division\n\nimport warnings\nimport math\n\nimport torch\nfrom torch import nn\nimport torch.nn.functional as F\nfrom torch.nn.init import xavier_uniform_, constant_\n\nfrom ..functions import MSDeformAttnFunction\nfrom ..functions.ms_deform_attn_func import ms_deform_attn_core_pytorch\n\n\ndef _is_power_of_2(n):\n    if (not isinstance(n, int)) or (n < 0):\n        raise ValueError(\"invalid input for _is_power_of_2: {} (type: {})\".format(n, type(n)))\n    return (n & (n-1) == 0) and n != 0\n\n\nclass MSDeformAttn(nn.Module):\n    def __init__(self, d_model=256, n_levels=4, n_heads=8, n_points=4):\n        \"\"\"\n        Multi-Scale Deformable Attention Module\n        :param d_model      hidden dimension\n        :param n_levels     number of feature levels\n        :param n_heads      number of attention heads\n        :param n_points     number of sampling points per attention head per feature level\n        \"\"\"\n        super().__init__()\n        if d_model % n_heads != 0:\n            raise ValueError('d_model must be divisible by n_heads, but got {} and {}'.format(d_model, n_heads))\n        _d_per_head = d_model // n_heads\n        # you'd better set _d_per_head to a power of 2 which is more efficient in our 
CUDA implementation\n        if not _is_power_of_2(_d_per_head):\n            warnings.warn(\"You'd better set d_model in MSDeformAttn to make the dimension of each attention head a power of 2 \"\n                          \"which is more efficient in our CUDA implementation.\")\n\n        self.im2col_step = 128\n\n        self.d_model = d_model\n        self.n_levels = n_levels\n        self.n_heads = n_heads\n        self.n_points = n_points\n\n        self.sampling_offsets = nn.Linear(d_model, n_heads * n_levels * n_points * 2)\n        self.attention_weights = nn.Linear(d_model, n_heads * n_levels * n_points)\n        self.value_proj = nn.Linear(d_model, d_model)\n        self.output_proj = nn.Linear(d_model, d_model)\n\n        self._reset_parameters()\n\n    def _reset_parameters(self):\n        constant_(self.sampling_offsets.weight.data, 0.)\n        thetas = torch.arange(self.n_heads, dtype=torch.float32) * (2.0 * math.pi / self.n_heads)\n        grid_init = torch.stack([thetas.cos(), thetas.sin()], -1)\n        grid_init = (grid_init / grid_init.abs().max(-1, keepdim=True)[0]).view(self.n_heads, 1, 1, 2).repeat(1, self.n_levels, self.n_points, 1)\n        for i in range(self.n_points):\n            grid_init[:, :, i, :] *= i + 1\n        with torch.no_grad():\n            self.sampling_offsets.bias = nn.Parameter(grid_init.view(-1))\n        constant_(self.attention_weights.weight.data, 0.)\n        constant_(self.attention_weights.bias.data, 0.)\n        xavier_uniform_(self.value_proj.weight.data)\n        constant_(self.value_proj.bias.data, 0.)\n        xavier_uniform_(self.output_proj.weight.data)\n        constant_(self.output_proj.bias.data, 0.)\n\n    def forward(self, query, reference_points, input_flatten, input_spatial_shapes, input_level_start_index, input_padding_mask=None):\n        \"\"\"\n        :param query                       (N, Length_{query}, C)\n        :param reference_points            (N, Length_{query}, n_levels, 2), range in 
[0, 1], top-left (0,0), bottom-right (1, 1), including padding area\n                                        or (N, Length_{query}, n_levels, 4), add additional (w, h) to form reference boxes\n        :param input_flatten               (N, \\sum_{l=0}^{L-1} H_l \\cdot W_l, C)\n        :param input_spatial_shapes        (n_levels, 2), [(H_0, W_0), (H_1, W_1), ..., (H_{L-1}, W_{L-1})]\n        :param input_level_start_index     (n_levels, ), [0, H_0*W_0, H_0*W_0+H_1*W_1, H_0*W_0+H_1*W_1+H_2*W_2, ..., H_0*W_0+H_1*W_1+...+H_{L-1}*W_{L-1}]\n        :param input_padding_mask          (N, \\sum_{l=0}^{L-1} H_l \\cdot W_l), True for padding elements, False for non-padding elements\n\n        :return output                     (N, Length_{query}, C)\n        \"\"\"\n        N, Len_q, _ = query.shape\n        N, Len_in, _ = input_flatten.shape\n        assert (input_spatial_shapes[:, 0] * input_spatial_shapes[:, 1]).sum() == Len_in\n\n        value = self.value_proj(input_flatten)\n        if input_padding_mask is not None:\n            value = value.masked_fill(input_padding_mask[..., None], float(0))\n        value = value.view(N, Len_in, self.n_heads, self.d_model // self.n_heads)\n        sampling_offsets = self.sampling_offsets(query).view(N, Len_q, self.n_heads, self.n_levels, self.n_points, 2)\n        attention_weights = self.attention_weights(query).view(N, Len_q, self.n_heads, self.n_levels * self.n_points)\n        attention_weights = F.softmax(attention_weights, -1).view(N, Len_q, self.n_heads, self.n_levels, self.n_points)\n        # N, Len_q, n_heads, n_levels, n_points, 2\n        if reference_points.shape[-1] == 2:\n            offset_normalizer = torch.stack([input_spatial_shapes[..., 1], input_spatial_shapes[..., 0]], -1)\n            sampling_locations = reference_points[:, :, None, :, None, :] \\\n                                 + sampling_offsets / offset_normalizer[None, None, None, :, None, :]\n        elif reference_points.shape[-1] == 4:\n           
 sampling_locations = reference_points[:, :, None, :, None, :2] \\\n                                 + sampling_offsets / self.n_points * reference_points[:, :, None, :, None, 2:] * 0.5\n        else:\n            raise ValueError(\n                'Last dim of reference_points must be 2 or 4, but got {} instead.'.format(reference_points.shape[-1]))\n        try:\n            output = MSDeformAttnFunction.apply(\n                value, input_spatial_shapes, input_level_start_index, sampling_locations, attention_weights, self.im2col_step)\n        except Exception:\n            # fall back to the pure-PyTorch implementation (e.g. on CPU, or when the compiled CUDA op is unavailable)\n            output = ms_deform_attn_core_pytorch(value, input_spatial_shapes, sampling_locations, attention_weights)\n        # # For FLOPs calculation only\n        # output = ms_deform_attn_core_pytorch(value, input_spatial_shapes, sampling_locations, attention_weights)\n        output = self.output_proj(output)\n        return output\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/pixel_decoder/ops/setup.py",
    "content": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\nimport os\nimport glob\n\nimport torch\n\nfrom torch.utils.cpp_extension import CUDA_HOME\nfrom torch.utils.cpp_extension import CppExtension\nfrom torch.utils.cpp_extension import CUDAExtension\n\nfrom setuptools import find_packages\nfrom setuptools import setup\n\nrequirements = [\"torch\", \"torchvision\"]\n\ndef get_extensions():\n    this_dir = os.path.dirname(os.path.abspath(__file__))\n    extensions_dir = os.path.join(this_dir, \"src\")\n\n    main_file = glob.glob(os.path.join(extensions_dir, \"*.cpp\"))\n    source_cpu = glob.glob(os.path.join(extensions_dir, \"cpu\", \"*.cpp\"))\n    source_cuda = glob.glob(os.path.join(extensions_dir, \"cuda\", \"*.cu\"))\n\n    sources = main_file + source_cpu\n    extension = CppExtension\n    extra_compile_args = {\"cxx\": []}\n    define_macros = []\n\n    # Force cuda since torch ask for a device, not if cuda is in fact available.\n    if (os.environ.get('FORCE_CUDA') or torch.cuda.is_available()) and CUDA_HOME is not None:\n        extension = CUDAExtension\n        sources += source_cuda\n        define_macros += [(\"WITH_CUDA\", None)]\n        extra_compile_args[\"nvcc\"] = [\n            \"-DCUDA_HAS_FP16=1\",\n            \"-D__CUDA_NO_HALF_OPERATORS__\",\n            \"-D__CUDA_NO_HALF_CONVERSIONS__\",\n            \"-D__CUDA_NO_HALF2_OPERATORS__\",\n        ]\n    
else:\n        if CUDA_HOME is None:\n            raise NotImplementedError('CUDA_HOME is None. Please set the CUDA_HOME environment variable.')\n        else:\n            raise NotImplementedError('No CUDA runtime found. Please set FORCE_CUDA=1, or verify your setup with torch.cuda.is_available().')\n\n    sources = [os.path.join(extensions_dir, s) for s in sources]\n    include_dirs = [extensions_dir]\n    ext_modules = [\n        extension(\n            \"MultiScaleDeformableAttention\",\n            sources,\n            include_dirs=include_dirs,\n            define_macros=define_macros,\n            extra_compile_args=extra_compile_args,\n        )\n    ]\n    return ext_modules\n\nsetup(\n    name=\"MultiScaleDeformableAttention\",\n    version=\"1.0\",\n    author=\"Weijie Su\",\n    url=\"https://github.com/fundamentalvision/Deformable-DETR\",\n    description=\"PyTorch Wrapper for CUDA Functions of Multi-Scale Deformable Attention\",\n    packages=find_packages(exclude=(\"configs\", \"tests\",)),\n    ext_modules=get_extensions(),\n    cmdclass={\"build_ext\": torch.utils.cpp_extension.BuildExtension},\n)\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/pixel_decoder/ops/src/cpu/ms_deform_attn_cpu.cpp",
    "content": "/*!\n**************************************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************************************\n* Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n**************************************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#include <vector>\n\n#include <ATen/ATen.h>\n#include <ATen/cuda/CUDAContext.h>\n\n\nat::Tensor\nms_deform_attn_cpu_forward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const int im2col_step)\n{\n    AT_ERROR(\"Not implement on cpu\");\n}\n\nstd::vector<at::Tensor>\nms_deform_attn_cpu_backward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const at::Tensor &grad_output,\n    const int im2col_step)\n{\n    AT_ERROR(\"Not implement on cpu\");\n}\n\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/pixel_decoder/ops/src/cpu/ms_deform_attn_cpu.h",
    "content": "/*!\n**************************************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************************************\n* Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n**************************************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#pragma once\n#include <torch/extension.h>\n\nat::Tensor\nms_deform_attn_cpu_forward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const int im2col_step);\n\nstd::vector<at::Tensor>\nms_deform_attn_cpu_backward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const at::Tensor &grad_output,\n    const int im2col_step);\n\n\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/pixel_decoder/ops/src/cuda/ms_deform_attn_cuda.cu",
    "content": "/*!\n**************************************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************************************\n* Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n**************************************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#include <vector>\n#include \"cuda/ms_deform_im2col_cuda.cuh\"\n\n#include <ATen/ATen.h>\n#include <ATen/cuda/CUDAContext.h>\n#include <cuda.h>\n#include <cuda_runtime.h>\n\n\nat::Tensor ms_deform_attn_cuda_forward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const int im2col_step)\n{\n    AT_ASSERTM(value.is_contiguous(), \"value tensor has to be contiguous\");\n    AT_ASSERTM(spatial_shapes.is_contiguous(), \"spatial_shapes tensor has to be contiguous\");\n    AT_ASSERTM(level_start_index.is_contiguous(), \"level_start_index tensor has to be contiguous\");\n    AT_ASSERTM(sampling_loc.is_contiguous(), \"sampling_loc tensor has to be contiguous\");\n    AT_ASSERTM(attn_weight.is_contiguous(), \"attn_weight tensor has to be contiguous\");\n\n    AT_ASSERTM(value.type().is_cuda(), \"value must be a CUDA tensor\");\n    AT_ASSERTM(spatial_shapes.type().is_cuda(), \"spatial_shapes must be a CUDA tensor\");\n    AT_ASSERTM(level_start_index.type().is_cuda(), \"level_start_index must be a CUDA tensor\");\n    AT_ASSERTM(sampling_loc.type().is_cuda(), \"sampling_loc must be a CUDA tensor\");\n    
AT_ASSERTM(attn_weight.type().is_cuda(), \"attn_weight must be a CUDA tensor\");\n\n    const int batch = value.size(0);\n    const int spatial_size = value.size(1);\n    const int num_heads = value.size(2);\n    const int channels = value.size(3);\n\n    const int num_levels = spatial_shapes.size(0);\n\n    const int num_query = sampling_loc.size(1);\n    const int num_point = sampling_loc.size(4);\n\n    const int im2col_step_ = std::min(batch, im2col_step);\n\n    AT_ASSERTM(batch % im2col_step_ == 0, \"batch(%d) must divide im2col_step(%d)\", batch, im2col_step_);\n    \n    auto output = at::zeros({batch, num_query, num_heads, channels}, value.options());\n\n    const int batch_n = im2col_step_;\n    auto output_n = output.view({batch/im2col_step_, batch_n, num_query, num_heads, channels});\n    auto per_value_size = spatial_size * num_heads * channels;\n    auto per_sample_loc_size = num_query * num_heads * num_levels * num_point * 2;\n    auto per_attn_weight_size = num_query * num_heads * num_levels * num_point;\n    for (int n = 0; n < batch/im2col_step_; ++n)\n    {\n        auto columns = output_n.select(0, n);\n        AT_DISPATCH_FLOATING_TYPES(value.type(), \"ms_deform_attn_forward_cuda\", ([&] {\n            ms_deformable_im2col_cuda(at::cuda::getCurrentCUDAStream(),\n                value.data<scalar_t>() + n * im2col_step_ * per_value_size,\n                spatial_shapes.data<int64_t>(),\n                level_start_index.data<int64_t>(),\n                sampling_loc.data<scalar_t>() + n * im2col_step_ * per_sample_loc_size,\n                attn_weight.data<scalar_t>() + n * im2col_step_ * per_attn_weight_size,\n                batch_n, spatial_size, num_heads, channels, num_levels, num_query, num_point,\n                columns.data<scalar_t>());\n\n        }));\n    }\n\n    output = output.view({batch, num_query, num_heads*channels});\n\n    return output;\n}\n\n\nstd::vector<at::Tensor> ms_deform_attn_cuda_backward(\n    const at::Tensor 
&value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const at::Tensor &grad_output,\n    const int im2col_step)\n{\n\n    AT_ASSERTM(value.is_contiguous(), \"value tensor has to be contiguous\");\n    AT_ASSERTM(spatial_shapes.is_contiguous(), \"spatial_shapes tensor has to be contiguous\");\n    AT_ASSERTM(level_start_index.is_contiguous(), \"level_start_index tensor has to be contiguous\");\n    AT_ASSERTM(sampling_loc.is_contiguous(), \"sampling_loc tensor has to be contiguous\");\n    AT_ASSERTM(attn_weight.is_contiguous(), \"attn_weight tensor has to be contiguous\");\n    AT_ASSERTM(grad_output.is_contiguous(), \"grad_output tensor has to be contiguous\");\n\n    AT_ASSERTM(value.type().is_cuda(), \"value must be a CUDA tensor\");\n    AT_ASSERTM(spatial_shapes.type().is_cuda(), \"spatial_shapes must be a CUDA tensor\");\n    AT_ASSERTM(level_start_index.type().is_cuda(), \"level_start_index must be a CUDA tensor\");\n    AT_ASSERTM(sampling_loc.type().is_cuda(), \"sampling_loc must be a CUDA tensor\");\n    AT_ASSERTM(attn_weight.type().is_cuda(), \"attn_weight must be a CUDA tensor\");\n    AT_ASSERTM(grad_output.type().is_cuda(), \"grad_output must be a CUDA tensor\");\n\n    const int batch = value.size(0);\n    const int spatial_size = value.size(1);\n    const int num_heads = value.size(2);\n    const int channels = value.size(3);\n\n    const int num_levels = spatial_shapes.size(0);\n\n    const int num_query = sampling_loc.size(1);\n    const int num_point = sampling_loc.size(4);\n\n    const int im2col_step_ = std::min(batch, im2col_step);\n\n    AT_ASSERTM(batch % im2col_step_ == 0, \"batch(%d) must divide im2col_step(%d)\", batch, im2col_step_);\n\n    auto grad_value = at::zeros_like(value);\n    auto grad_sampling_loc = at::zeros_like(sampling_loc);\n    auto grad_attn_weight = at::zeros_like(attn_weight);\n\n    const int batch_n = 
im2col_step_;\n    auto per_value_size = spatial_size * num_heads * channels;\n    auto per_sample_loc_size = num_query * num_heads * num_levels * num_point * 2;\n    auto per_attn_weight_size = num_query * num_heads * num_levels * num_point;\n    auto grad_output_n = grad_output.view({batch/im2col_step_, batch_n, num_query, num_heads, channels});\n    \n    for (int n = 0; n < batch/im2col_step_; ++n)\n    {\n        auto grad_output_g = grad_output_n.select(0, n);\n        AT_DISPATCH_FLOATING_TYPES(value.type(), \"ms_deform_attn_backward_cuda\", ([&] {\n            ms_deformable_col2im_cuda(at::cuda::getCurrentCUDAStream(),\n                                    grad_output_g.data<scalar_t>(),\n                                    value.data<scalar_t>() + n * im2col_step_ * per_value_size,\n                                    spatial_shapes.data<int64_t>(),\n                                    level_start_index.data<int64_t>(),\n                                    sampling_loc.data<scalar_t>() + n * im2col_step_ * per_sample_loc_size,\n                                    attn_weight.data<scalar_t>() + n * im2col_step_ * per_attn_weight_size,\n                                    batch_n, spatial_size, num_heads, channels, num_levels, num_query, num_point,\n                                    grad_value.data<scalar_t>() +  n * im2col_step_ * per_value_size,\n                                    grad_sampling_loc.data<scalar_t>() + n * im2col_step_ * per_sample_loc_size,\n                                    grad_attn_weight.data<scalar_t>() + n * im2col_step_ * per_attn_weight_size);\n\n        }));\n    }\n\n    return {\n        grad_value, grad_sampling_loc, grad_attn_weight\n    };\n}"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/pixel_decoder/ops/src/cuda/ms_deform_attn_cuda.h",
    "content": "/*!\n**************************************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************************************\n* Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n**************************************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#pragma once\n#include <torch/extension.h>\n\nat::Tensor ms_deform_attn_cuda_forward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const int im2col_step);\n\nstd::vector<at::Tensor> ms_deform_attn_cuda_backward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const at::Tensor &grad_output,\n    const int im2col_step);\n\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/pixel_decoder/ops/src/cuda/ms_deform_im2col_cuda.cuh",
    "content": "/*!\n**************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************\n* Modified from DCN (https://github.com/msracver/Deformable-ConvNets)\n* Copyright (c) 2018 Microsoft\n**************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#include <cstdio>\n#include <algorithm>\n#include <cstring>\n\n#include <ATen/ATen.h>\n#include <ATen/cuda/CUDAContext.h>\n\n#include <THC/THCAtomics.cuh>\n\n#define CUDA_KERNEL_LOOP(i, n)                          \\\n  for (int i = blockIdx.x * blockDim.x + threadIdx.x;   \\\n      i < (n);                                          \\\n      i += blockDim.x * gridDim.x)\n\nconst int CUDA_NUM_THREADS = 1024;\ninline int GET_BLOCKS(const int N, const int num_threads)\n{\n  return (N + num_threads - 1) / num_threads;\n}\n\n\ntemplate <typename scalar_t>\n__device__ scalar_t ms_deform_attn_im2col_bilinear(const scalar_t* &bottom_data, \n                                                   const int &height, const int &width, const int &nheads, const int &channels,\n                                                   const scalar_t &h, const scalar_t &w, const int &m, const int &c)\n{\n  const int h_low = floor(h);\n  const int w_low = floor(w);\n  const int h_high = h_low + 1;\n  const int w_high = w_low + 1;\n\n  const scalar_t lh = h - h_low;\n  const scalar_t lw = w - w_low;\n  const scalar_t hh = 1 - lh, hw = 1 - lw;\n\n  const int w_stride = nheads * channels;\n  const int h_stride = width * w_stride;\n  const int h_low_ptr_offset = h_low * h_stride;\n  const int h_high_ptr_offset = h_low_ptr_offset + h_stride;\n  const int 
w_low_ptr_offset = w_low * w_stride;\n  const int w_high_ptr_offset = w_low_ptr_offset + w_stride;\n  const int base_ptr = m * channels + c;\n\n  scalar_t v1 = 0;\n  if (h_low >= 0 && w_low >= 0)\n  {\n    const int ptr1 = h_low_ptr_offset + w_low_ptr_offset + base_ptr;\n    v1 = bottom_data[ptr1];\n  }\n  scalar_t v2 = 0;\n  if (h_low >= 0 && w_high <= width - 1)\n  {\n    const int ptr2 = h_low_ptr_offset + w_high_ptr_offset + base_ptr;\n    v2 = bottom_data[ptr2];\n  }\n  scalar_t v3 = 0;\n  if (h_high <= height - 1 && w_low >= 0)\n  {\n    const int ptr3 = h_high_ptr_offset + w_low_ptr_offset + base_ptr;\n    v3 = bottom_data[ptr3];\n  }\n  scalar_t v4 = 0;\n  if (h_high <= height - 1 && w_high <= width - 1)\n  {\n    const int ptr4 = h_high_ptr_offset + w_high_ptr_offset + base_ptr;\n    v4 = bottom_data[ptr4];\n  }\n\n  const scalar_t w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;\n\n  const scalar_t val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);\n  return val;\n}\n\n\ntemplate <typename scalar_t>\n__device__ void ms_deform_attn_col2im_bilinear(const scalar_t* &bottom_data, \n                                                   const int &height, const int &width, const int &nheads, const int &channels,\n                                                   const scalar_t &h, const scalar_t &w, const int &m, const int &c,\n                                                   const scalar_t &top_grad,\n                                                   const scalar_t &attn_weight,\n                                                   scalar_t* &grad_value, \n                                                   scalar_t* grad_sampling_loc,\n                                                   scalar_t* grad_attn_weight)\n{\n  const int h_low = floor(h);\n  const int w_low = floor(w);\n  const int h_high = h_low + 1;\n  const int w_high = w_low + 1;\n\n  const scalar_t lh = h - h_low;\n  const scalar_t lw = w - w_low;\n  const scalar_t hh = 1 - lh, hw = 1 - lw;\n\n  
const int w_stride = nheads * channels;\n  const int h_stride = width * w_stride;\n  const int h_low_ptr_offset = h_low * h_stride;\n  const int h_high_ptr_offset = h_low_ptr_offset + h_stride;\n  const int w_low_ptr_offset = w_low * w_stride;\n  const int w_high_ptr_offset = w_low_ptr_offset + w_stride;\n  const int base_ptr = m * channels + c;\n\n  const scalar_t w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;\n  const scalar_t top_grad_value = top_grad * attn_weight;\n  scalar_t grad_h_weight = 0, grad_w_weight = 0;\n\n  scalar_t v1 = 0;\n  if (h_low >= 0 && w_low >= 0)\n  {\n    const int ptr1 = h_low_ptr_offset + w_low_ptr_offset + base_ptr;\n    v1 = bottom_data[ptr1];\n    grad_h_weight -= hw * v1;\n    grad_w_weight -= hh * v1;\n    atomicAdd(grad_value+ptr1, w1*top_grad_value);\n  }\n  scalar_t v2 = 0;\n  if (h_low >= 0 && w_high <= width - 1)\n  {\n    const int ptr2 = h_low_ptr_offset + w_high_ptr_offset + base_ptr;\n    v2 = bottom_data[ptr2];\n    grad_h_weight -= lw * v2;\n    grad_w_weight += hh * v2;\n    atomicAdd(grad_value+ptr2, w2*top_grad_value);\n  }\n  scalar_t v3 = 0;\n  if (h_high <= height - 1 && w_low >= 0)\n  {\n    const int ptr3 = h_high_ptr_offset + w_low_ptr_offset + base_ptr;\n    v3 = bottom_data[ptr3];\n    grad_h_weight += hw * v3;\n    grad_w_weight -= lh * v3;\n    atomicAdd(grad_value+ptr3, w3*top_grad_value); \n  }\n  scalar_t v4 = 0;\n  if (h_high <= height - 1 && w_high <= width - 1)\n  {\n    const int ptr4 = h_high_ptr_offset + w_high_ptr_offset + base_ptr;\n    v4 = bottom_data[ptr4];\n    grad_h_weight += lw * v4;\n    grad_w_weight += lh * v4;\n    atomicAdd(grad_value+ptr4, w4*top_grad_value);\n  }\n\n  const scalar_t val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);\n  *grad_attn_weight = top_grad * val;\n  *grad_sampling_loc = width * grad_w_weight * top_grad_value;\n  *(grad_sampling_loc + 1) = height * grad_h_weight * top_grad_value;\n}\n\n\ntemplate <typename scalar_t>\n__device__ void 
ms_deform_attn_col2im_bilinear_gm(const scalar_t* &bottom_data, \n                                                   const int &height, const int &width, const int &nheads, const int &channels,\n                                                   const scalar_t &h, const scalar_t &w, const int &m, const int &c,\n                                                   const scalar_t &top_grad,\n                                                   const scalar_t &attn_weight,\n                                                   scalar_t* &grad_value, \n                                                   scalar_t* grad_sampling_loc,\n                                                   scalar_t* grad_attn_weight)\n{\n  const int h_low = floor(h);\n  const int w_low = floor(w);\n  const int h_high = h_low + 1;\n  const int w_high = w_low + 1;\n\n  const scalar_t lh = h - h_low;\n  const scalar_t lw = w - w_low;\n  const scalar_t hh = 1 - lh, hw = 1 - lw;\n\n  const int w_stride = nheads * channels;\n  const int h_stride = width * w_stride;\n  const int h_low_ptr_offset = h_low * h_stride;\n  const int h_high_ptr_offset = h_low_ptr_offset + h_stride;\n  const int w_low_ptr_offset = w_low * w_stride;\n  const int w_high_ptr_offset = w_low_ptr_offset + w_stride;\n  const int base_ptr = m * channels + c;\n\n  const scalar_t w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;\n  const scalar_t top_grad_value = top_grad * attn_weight;\n  scalar_t grad_h_weight = 0, grad_w_weight = 0;\n\n  scalar_t v1 = 0;\n  if (h_low >= 0 && w_low >= 0)\n  {\n    const int ptr1 = h_low_ptr_offset + w_low_ptr_offset + base_ptr;\n    v1 = bottom_data[ptr1];\n    grad_h_weight -= hw * v1;\n    grad_w_weight -= hh * v1;\n    atomicAdd(grad_value+ptr1, w1*top_grad_value);\n  }\n  scalar_t v2 = 0;\n  if (h_low >= 0 && w_high <= width - 1)\n  {\n    const int ptr2 = h_low_ptr_offset + w_high_ptr_offset + base_ptr;\n    v2 = bottom_data[ptr2];\n    grad_h_weight -= lw * v2;\n    grad_w_weight += hh * v2;\n 
   atomicAdd(grad_value+ptr2, w2*top_grad_value);\n  }\n  scalar_t v3 = 0;\n  if (h_high <= height - 1 && w_low >= 0)\n  {\n    const int ptr3 = h_high_ptr_offset + w_low_ptr_offset + base_ptr;\n    v3 = bottom_data[ptr3];\n    grad_h_weight += hw * v3;\n    grad_w_weight -= lh * v3;\n    atomicAdd(grad_value+ptr3, w3*top_grad_value); \n  }\n  scalar_t v4 = 0;\n  if (h_high <= height - 1 && w_high <= width - 1)\n  {\n    const int ptr4 = h_high_ptr_offset + w_high_ptr_offset + base_ptr;\n    v4 = bottom_data[ptr4];\n    grad_h_weight += lw * v4;\n    grad_w_weight += lh * v4;\n    atomicAdd(grad_value+ptr4, w4*top_grad_value);\n  }\n\n  const scalar_t val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);\n  atomicAdd(grad_attn_weight, top_grad * val); \n  atomicAdd(grad_sampling_loc, width * grad_w_weight * top_grad_value);\n  atomicAdd(grad_sampling_loc + 1, height * grad_h_weight * top_grad_value);\n}\n\n\ntemplate <typename scalar_t>\n__global__ void ms_deformable_im2col_gpu_kernel(const int n,\n                                                const scalar_t *data_value, \n                                                const int64_t *data_spatial_shapes,\n                                                const int64_t *data_level_start_index, \n                                                const scalar_t *data_sampling_loc,\n                                                const scalar_t *data_attn_weight,\n                                                const int batch_size, \n                                                const int spatial_size, \n                                                const int num_heads,\n                                                const int channels, \n                                                const int num_levels,\n                                                const int num_query,\n                                                const int num_point,\n                                                scalar_t *data_col)\n{\n  
CUDA_KERNEL_LOOP(index, n)\n  {\n    int _temp = index;\n    const int c_col = _temp % channels;\n    _temp /= channels;\n    const int sampling_index = _temp; \n    const int m_col = _temp % num_heads;\n    _temp /= num_heads;\n    const int q_col = _temp % num_query;\n    _temp /= num_query;\n    const int b_col = _temp;\n\n    scalar_t *data_col_ptr = data_col + index;\n    int data_weight_ptr = sampling_index * num_levels * num_point;\n    int data_loc_w_ptr = data_weight_ptr << 1;\n    const int qid_stride = num_heads * channels;\n    const int data_value_ptr_init_offset = b_col * spatial_size * qid_stride;\n    scalar_t col = 0;\n    \n    for (int l_col=0; l_col < num_levels; ++l_col)\n    {\n      const int level_start_id = data_level_start_index[l_col];\n      const int spatial_h_ptr = l_col << 1;\n      const int spatial_h = data_spatial_shapes[spatial_h_ptr];\n      const int spatial_w = data_spatial_shapes[spatial_h_ptr + 1];\n      const scalar_t *data_value_ptr = data_value + (data_value_ptr_init_offset + level_start_id * qid_stride);\n      for (int p_col=0; p_col < num_point; ++p_col)\n      {\n        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];\n        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];\n        const scalar_t weight = data_attn_weight[data_weight_ptr];\n\n        const scalar_t h_im = loc_h * spatial_h - 0.5;\n        const scalar_t w_im = loc_w * spatial_w - 0.5;\n\n        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < spatial_w)\n        {\n          col += ms_deform_attn_im2col_bilinear(data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col) * weight;\n        }\n\n        data_weight_ptr += 1;\n        data_loc_w_ptr += 2;\n      }\n    }\n    *data_col_ptr = col;\n  }\n}\n\ntemplate <typename scalar_t, unsigned int blockSize>\n__global__ void ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1(const int n,\n                                              
  const scalar_t *grad_col,\n                                                const scalar_t *data_value,\n                                                const int64_t *data_spatial_shapes,\n                                                const int64_t *data_level_start_index, \n                                                const scalar_t *data_sampling_loc,\n                                                const scalar_t *data_attn_weight,\n                                                const int batch_size, \n                                                const int spatial_size, \n                                                const int num_heads,\n                                                const int channels, \n                                                const int num_levels,\n                                                const int num_query,\n                                                const int num_point,\n                                                scalar_t *grad_value,\n                                                scalar_t *grad_sampling_loc,\n                                                scalar_t *grad_attn_weight)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    __shared__ scalar_t cache_grad_sampling_loc[blockSize * 2];\n    __shared__ scalar_t cache_grad_attn_weight[blockSize];\n    unsigned int tid = threadIdx.x;\n    int _temp = index;\n    const int c_col = _temp % channels;\n    _temp /= channels;\n    const int sampling_index = _temp; \n    const int m_col = _temp % num_heads;\n    _temp /= num_heads;\n    const int q_col = _temp % num_query;\n    _temp /= num_query;\n    const int b_col = _temp;\n\n    const scalar_t top_grad = grad_col[index];\n\n    int data_weight_ptr = sampling_index * num_levels * num_point;\n    int data_loc_w_ptr = data_weight_ptr << 1;\n    const int grad_sampling_ptr = data_weight_ptr;\n    grad_sampling_loc += grad_sampling_ptr << 1;\n    grad_attn_weight += grad_sampling_ptr;\n    const int 
grad_weight_stride = 1;\n    const int grad_loc_stride = 2;\n    const int qid_stride = num_heads * channels;\n    const int data_value_ptr_init_offset = b_col * spatial_size * qid_stride;\n\n    for (int l_col=0; l_col < num_levels; ++l_col)\n    {\n      const int level_start_id = data_level_start_index[l_col];\n      const int spatial_h_ptr = l_col << 1;\n      const int spatial_h = data_spatial_shapes[spatial_h_ptr];\n      const int spatial_w = data_spatial_shapes[spatial_h_ptr + 1];\n      const int value_ptr_offset = data_value_ptr_init_offset + level_start_id * qid_stride;\n      const scalar_t *data_value_ptr = data_value + value_ptr_offset;\n      scalar_t *grad_value_ptr = grad_value + value_ptr_offset;\n\n      for (int p_col=0; p_col < num_point; ++p_col)\n      {\n        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];\n        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];\n        const scalar_t weight = data_attn_weight[data_weight_ptr];\n\n        const scalar_t h_im = loc_h * spatial_h - 0.5;\n        const scalar_t w_im = loc_w * spatial_w - 0.5;\n        *(cache_grad_sampling_loc+(threadIdx.x << 1)) = 0;\n        *(cache_grad_sampling_loc+((threadIdx.x << 1) + 1)) = 0;\n        *(cache_grad_attn_weight+threadIdx.x)=0;\n        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < spatial_w)\n        {\n          ms_deform_attn_col2im_bilinear(\n            data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col,\n            top_grad, weight, grad_value_ptr, \n            cache_grad_sampling_loc+(threadIdx.x << 1), cache_grad_attn_weight+threadIdx.x);\n        }\n        \n        __syncthreads();\n        if (tid == 0)\n        {\n          scalar_t _grad_w=cache_grad_sampling_loc[0], _grad_h=cache_grad_sampling_loc[1], _grad_a=cache_grad_attn_weight[0];\n          int sid=2;\n          for (unsigned int tid = 1; tid < blockSize; ++tid)\n          {\n            _grad_w += 
cache_grad_sampling_loc[sid];\n            _grad_h += cache_grad_sampling_loc[sid + 1];\n            _grad_a += cache_grad_attn_weight[tid];\n            sid += 2;\n          }\n          \n          \n          *grad_sampling_loc = _grad_w;\n          *(grad_sampling_loc + 1) = _grad_h;\n          *grad_attn_weight = _grad_a;\n        }\n        __syncthreads();\n\n        data_weight_ptr += 1;\n        data_loc_w_ptr += 2;\n        grad_attn_weight += grad_weight_stride;\n        grad_sampling_loc += grad_loc_stride;\n      }\n    }\n  }\n}\n\n\ntemplate <typename scalar_t, unsigned int blockSize>\n__global__ void ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v2(const int n,\n                                                const scalar_t *grad_col,\n                                                const scalar_t *data_value,\n                                                const int64_t *data_spatial_shapes,\n                                                const int64_t *data_level_start_index, \n                                                const scalar_t *data_sampling_loc,\n                                                const scalar_t *data_attn_weight,\n                                                const int batch_size, \n                                                const int spatial_size, \n                                                const int num_heads,\n                                                const int channels, \n                                                const int num_levels,\n                                                const int num_query,\n                                                const int num_point,\n                                                scalar_t *grad_value,\n                                                scalar_t *grad_sampling_loc,\n                                                scalar_t *grad_attn_weight)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    __shared__ scalar_t 
cache_grad_sampling_loc[blockSize * 2];\n    __shared__ scalar_t cache_grad_attn_weight[blockSize];\n    unsigned int tid = threadIdx.x;\n    int _temp = index;\n    const int c_col = _temp % channels;\n    _temp /= channels;\n    const int sampling_index = _temp; \n    const int m_col = _temp % num_heads;\n    _temp /= num_heads;\n    const int q_col = _temp % num_query;\n    _temp /= num_query;\n    const int b_col = _temp;\n\n    const scalar_t top_grad = grad_col[index];\n\n    int data_weight_ptr = sampling_index * num_levels * num_point;\n    int data_loc_w_ptr = data_weight_ptr << 1;\n    const int grad_sampling_ptr = data_weight_ptr;\n    grad_sampling_loc += grad_sampling_ptr << 1;\n    grad_attn_weight += grad_sampling_ptr;\n    const int grad_weight_stride = 1;\n    const int grad_loc_stride = 2;\n    const int qid_stride = num_heads * channels;\n    const int data_value_ptr_init_offset = b_col * spatial_size * qid_stride;\n\n    for (int l_col=0; l_col < num_levels; ++l_col)\n    {\n      const int level_start_id = data_level_start_index[l_col];\n      const int spatial_h_ptr = l_col << 1;\n      const int spatial_h = data_spatial_shapes[spatial_h_ptr];\n      const int spatial_w = data_spatial_shapes[spatial_h_ptr + 1];\n      const int value_ptr_offset = data_value_ptr_init_offset + level_start_id * qid_stride;\n      const scalar_t *data_value_ptr = data_value + value_ptr_offset;\n      scalar_t *grad_value_ptr = grad_value + value_ptr_offset;\n\n      for (int p_col=0; p_col < num_point; ++p_col)\n      {\n        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];\n        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];\n        const scalar_t weight = data_attn_weight[data_weight_ptr];\n\n        const scalar_t h_im = loc_h * spatial_h - 0.5;\n        const scalar_t w_im = loc_w * spatial_w - 0.5;\n        *(cache_grad_sampling_loc+(threadIdx.x << 1)) = 0;\n        *(cache_grad_sampling_loc+((threadIdx.x << 1) + 1)) = 0;\n      
  *(cache_grad_attn_weight+threadIdx.x)=0;\n        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < spatial_w)\n        {\n          ms_deform_attn_col2im_bilinear(\n            data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col,\n            top_grad, weight, grad_value_ptr, \n            cache_grad_sampling_loc+(threadIdx.x << 1), cache_grad_attn_weight+threadIdx.x);\n        }\n        \n        __syncthreads();\n\n        for (unsigned int s=blockSize/2; s>0; s>>=1)\n        {\n          if (tid < s) {\n            const unsigned int xid1 = tid << 1;\n            const unsigned int xid2 = (tid + s) << 1;\n            cache_grad_attn_weight[tid] += cache_grad_attn_weight[tid + s];\n            cache_grad_sampling_loc[xid1] += cache_grad_sampling_loc[xid2];\n            cache_grad_sampling_loc[xid1 + 1] += cache_grad_sampling_loc[xid2 + 1];\n          }\n          __syncthreads();\n        }\n\n        if (tid == 0)\n        { \n          *grad_sampling_loc = cache_grad_sampling_loc[0];\n          *(grad_sampling_loc + 1) = cache_grad_sampling_loc[1];\n          *grad_attn_weight = cache_grad_attn_weight[0];\n        }\n        __syncthreads();\n\n        data_weight_ptr += 1;\n        data_loc_w_ptr += 2;\n        grad_attn_weight += grad_weight_stride;\n        grad_sampling_loc += grad_loc_stride;\n      }\n    }\n  }\n}\n\n\ntemplate <typename scalar_t>\n__global__ void ms_deformable_col2im_gpu_kernel_shm_reduce_v1(const int n,\n                                                const scalar_t *grad_col,\n                                                const scalar_t *data_value,\n                                                const int64_t *data_spatial_shapes,\n                                                const int64_t *data_level_start_index, \n                                                const scalar_t *data_sampling_loc,\n                                                const scalar_t *data_attn_weight,\n      
                                          const int batch_size, \n                                                const int spatial_size, \n                                                const int num_heads,\n                                                const int channels, \n                                                const int num_levels,\n                                                const int num_query,\n                                                const int num_point,\n                                                scalar_t *grad_value,\n                                                scalar_t *grad_sampling_loc,\n                                                scalar_t *grad_attn_weight)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    extern __shared__ int _s[];\n    scalar_t* cache_grad_sampling_loc = (scalar_t*)_s;\n    scalar_t* cache_grad_attn_weight = cache_grad_sampling_loc + 2 * blockDim.x;\n    unsigned int tid = threadIdx.x;\n    int _temp = index;\n    const int c_col = _temp % channels;\n    _temp /= channels;\n    const int sampling_index = _temp; \n    const int m_col = _temp % num_heads;\n    _temp /= num_heads;\n    const int q_col = _temp % num_query;\n    _temp /= num_query;\n    const int b_col = _temp;\n\n    const scalar_t top_grad = grad_col[index];\n\n    int data_weight_ptr = sampling_index * num_levels * num_point;\n    int data_loc_w_ptr = data_weight_ptr << 1;\n    const int grad_sampling_ptr = data_weight_ptr;\n    grad_sampling_loc += grad_sampling_ptr << 1;\n    grad_attn_weight += grad_sampling_ptr;\n    const int grad_weight_stride = 1;\n    const int grad_loc_stride = 2;\n    const int qid_stride = num_heads * channels;\n    const int data_value_ptr_init_offset = b_col * spatial_size * qid_stride;\n\n    for (int l_col=0; l_col < num_levels; ++l_col)\n    {\n      const int level_start_id = data_level_start_index[l_col];\n      const int spatial_h_ptr = l_col << 1;\n      const int spatial_h = 
data_spatial_shapes[spatial_h_ptr];\n      const int spatial_w = data_spatial_shapes[spatial_h_ptr + 1];\n      const int value_ptr_offset = data_value_ptr_init_offset + level_start_id * qid_stride;\n      const scalar_t *data_value_ptr = data_value + value_ptr_offset;\n      scalar_t *grad_value_ptr = grad_value + value_ptr_offset;\n\n      for (int p_col=0; p_col < num_point; ++p_col)\n      {\n        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];\n        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];\n        const scalar_t weight = data_attn_weight[data_weight_ptr];\n\n        const scalar_t h_im = loc_h * spatial_h - 0.5;\n        const scalar_t w_im = loc_w * spatial_w - 0.5;\n        *(cache_grad_sampling_loc+(threadIdx.x << 1)) = 0;\n        *(cache_grad_sampling_loc+((threadIdx.x << 1) + 1)) = 0;\n        *(cache_grad_attn_weight+threadIdx.x)=0;\n        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < spatial_w)\n        {\n          ms_deform_attn_col2im_bilinear(\n            data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col,\n            top_grad, weight, grad_value_ptr, \n            cache_grad_sampling_loc+(threadIdx.x << 1), cache_grad_attn_weight+threadIdx.x);\n        }\n        \n        __syncthreads();\n        if (tid == 0)\n        {\n          scalar_t _grad_w=cache_grad_sampling_loc[0], _grad_h=cache_grad_sampling_loc[1], _grad_a=cache_grad_attn_weight[0];\n          int sid=2;\n          for (unsigned int tid = 1; tid < blockDim.x; ++tid)\n          {\n            _grad_w += cache_grad_sampling_loc[sid];\n            _grad_h += cache_grad_sampling_loc[sid + 1];\n            _grad_a += cache_grad_attn_weight[tid];\n            sid += 2;\n          }\n          \n          \n          *grad_sampling_loc = _grad_w;\n          *(grad_sampling_loc + 1) = _grad_h;\n          *grad_attn_weight = _grad_a;\n        }\n        __syncthreads();\n\n        data_weight_ptr += 1;\n    
    data_loc_w_ptr += 2;\n        grad_attn_weight += grad_weight_stride;\n        grad_sampling_loc += grad_loc_stride;\n      }\n    }\n  }\n}\n\ntemplate <typename scalar_t>\n__global__ void ms_deformable_col2im_gpu_kernel_shm_reduce_v2(const int n,\n                                                const scalar_t *grad_col,\n                                                const scalar_t *data_value,\n                                                const int64_t *data_spatial_shapes,\n                                                const int64_t *data_level_start_index, \n                                                const scalar_t *data_sampling_loc,\n                                                const scalar_t *data_attn_weight,\n                                                const int batch_size, \n                                                const int spatial_size, \n                                                const int num_heads,\n                                                const int channels, \n                                                const int num_levels,\n                                                const int num_query,\n                                                const int num_point,\n                                                scalar_t *grad_value,\n                                                scalar_t *grad_sampling_loc,\n                                                scalar_t *grad_attn_weight)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    extern __shared__ int _s[];\n    scalar_t* cache_grad_sampling_loc = (scalar_t*)_s;\n    scalar_t* cache_grad_attn_weight = cache_grad_sampling_loc + 2 * blockDim.x;\n    unsigned int tid = threadIdx.x;\n    int _temp = index;\n    const int c_col = _temp % channels;\n    _temp /= channels;\n    const int sampling_index = _temp; \n    const int m_col = _temp % num_heads;\n    _temp /= num_heads;\n    const int q_col = _temp % num_query;\n    _temp /= num_query;\n    const int b_col 
= _temp;\n\n    const scalar_t top_grad = grad_col[index];\n\n    int data_weight_ptr = sampling_index * num_levels * num_point;\n    int data_loc_w_ptr = data_weight_ptr << 1;\n    const int grad_sampling_ptr = data_weight_ptr;\n    grad_sampling_loc += grad_sampling_ptr << 1;\n    grad_attn_weight += grad_sampling_ptr;\n    const int grad_weight_stride = 1;\n    const int grad_loc_stride = 2;\n    const int qid_stride = num_heads * channels;\n    const int data_value_ptr_init_offset = b_col * spatial_size * qid_stride;\n\n    for (int l_col=0; l_col < num_levels; ++l_col)\n    {\n      const int level_start_id = data_level_start_index[l_col];\n      const int spatial_h_ptr = l_col << 1;\n      const int spatial_h = data_spatial_shapes[spatial_h_ptr];\n      const int spatial_w = data_spatial_shapes[spatial_h_ptr + 1];\n      const int value_ptr_offset = data_value_ptr_init_offset + level_start_id * qid_stride;\n      const scalar_t *data_value_ptr = data_value + value_ptr_offset;\n      scalar_t *grad_value_ptr = grad_value + value_ptr_offset;\n\n      for (int p_col=0; p_col < num_point; ++p_col)\n      {\n        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];\n        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];\n        const scalar_t weight = data_attn_weight[data_weight_ptr];\n\n        const scalar_t h_im = loc_h * spatial_h - 0.5;\n        const scalar_t w_im = loc_w * spatial_w - 0.5;\n        *(cache_grad_sampling_loc+(threadIdx.x << 1)) = 0;\n        *(cache_grad_sampling_loc+((threadIdx.x << 1) + 1)) = 0;\n        *(cache_grad_attn_weight+threadIdx.x)=0;\n        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < spatial_w)\n        {\n          ms_deform_attn_col2im_bilinear(\n            data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col,\n            top_grad, weight, grad_value_ptr, \n            cache_grad_sampling_loc+(threadIdx.x << 1), cache_grad_attn_weight+threadIdx.x);\n      
  }\n        \n        __syncthreads();\n\n        for (unsigned int s=blockDim.x/2, spre=blockDim.x; s>0; s>>=1, spre>>=1)\n        {\n          if (tid < s) {\n            const unsigned int xid1 = tid << 1;\n            const unsigned int xid2 = (tid + s) << 1;\n            cache_grad_attn_weight[tid] += cache_grad_attn_weight[tid + s];\n            cache_grad_sampling_loc[xid1] += cache_grad_sampling_loc[xid2];\n            cache_grad_sampling_loc[xid1 + 1] += cache_grad_sampling_loc[xid2 + 1];\n            if (tid + (s << 1) < spre)\n            {\n              cache_grad_attn_weight[tid] += cache_grad_attn_weight[tid + (s << 1)];\n              cache_grad_sampling_loc[xid1] += cache_grad_sampling_loc[xid2 + (s << 1)];\n              cache_grad_sampling_loc[xid1 + 1] += cache_grad_sampling_loc[xid2 + 1 + (s << 1)];\n            } \n          }\n          __syncthreads();\n        }\n\n        if (tid == 0)\n        {\n          *grad_sampling_loc = cache_grad_sampling_loc[0];\n          *(grad_sampling_loc + 1) = cache_grad_sampling_loc[1];\n          *grad_attn_weight = cache_grad_attn_weight[0];\n        }\n        __syncthreads();\n\n        data_weight_ptr += 1;\n        data_loc_w_ptr += 2;\n        grad_attn_weight += grad_weight_stride;\n        grad_sampling_loc += grad_loc_stride;\n      }\n    }\n  }\n}\n\ntemplate <typename scalar_t>\n__global__ void ms_deformable_col2im_gpu_kernel_shm_reduce_v2_multi_blocks(const int n,\n                                                const scalar_t *grad_col,\n                                                const scalar_t *data_value,\n                                                const int64_t *data_spatial_shapes,\n                                                const int64_t *data_level_start_index, \n                                                const scalar_t *data_sampling_loc,\n                                                const scalar_t *data_attn_weight,\n                                            
    const int batch_size, \n                                                const int spatial_size, \n                                                const int num_heads,\n                                                const int channels, \n                                                const int num_levels,\n                                                const int num_query,\n                                                const int num_point,\n                                                scalar_t *grad_value,\n                                                scalar_t *grad_sampling_loc,\n                                                scalar_t *grad_attn_weight)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    extern __shared__ int _s[];\n    scalar_t* cache_grad_sampling_loc = (scalar_t*)_s;\n    scalar_t* cache_grad_attn_weight = cache_grad_sampling_loc + 2 * blockDim.x;\n    unsigned int tid = threadIdx.x;\n    int _temp = index;\n    const int c_col = _temp % channels;\n    _temp /= channels;\n    const int sampling_index = _temp; \n    const int m_col = _temp % num_heads;\n    _temp /= num_heads;\n    const int q_col = _temp % num_query;\n    _temp /= num_query;\n    const int b_col = _temp;\n\n    const scalar_t top_grad = grad_col[index];\n\n    int data_weight_ptr = sampling_index * num_levels * num_point;\n    int data_loc_w_ptr = data_weight_ptr << 1;\n    const int grad_sampling_ptr = data_weight_ptr;\n    grad_sampling_loc += grad_sampling_ptr << 1;\n    grad_attn_weight += grad_sampling_ptr;\n    const int grad_weight_stride = 1;\n    const int grad_loc_stride = 2;\n    const int qid_stride = num_heads * channels;\n    const int data_value_ptr_init_offset = b_col * spatial_size * qid_stride;\n\n    for (int l_col=0; l_col < num_levels; ++l_col)\n    {\n      const int level_start_id = data_level_start_index[l_col];\n      const int spatial_h_ptr = l_col << 1;\n      const int spatial_h = data_spatial_shapes[spatial_h_ptr];\n      const int spatial_w = 
data_spatial_shapes[spatial_h_ptr + 1];\n      const int value_ptr_offset = data_value_ptr_init_offset + level_start_id * qid_stride;\n      const scalar_t *data_value_ptr = data_value + value_ptr_offset;\n      scalar_t *grad_value_ptr = grad_value + value_ptr_offset;\n\n      for (int p_col=0; p_col < num_point; ++p_col)\n      {\n        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];\n        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];\n        const scalar_t weight = data_attn_weight[data_weight_ptr];\n\n        const scalar_t h_im = loc_h * spatial_h - 0.5;\n        const scalar_t w_im = loc_w * spatial_w - 0.5;\n        *(cache_grad_sampling_loc+(threadIdx.x << 1)) = 0;\n        *(cache_grad_sampling_loc+((threadIdx.x << 1) + 1)) = 0;\n        *(cache_grad_attn_weight+threadIdx.x)=0;\n        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < spatial_w)\n        {\n          ms_deform_attn_col2im_bilinear(\n            data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col,\n            top_grad, weight, grad_value_ptr, \n            cache_grad_sampling_loc+(threadIdx.x << 1), cache_grad_attn_weight+threadIdx.x);\n        }\n        \n        __syncthreads();\n\n        for (unsigned int s=blockDim.x/2, spre=blockDim.x; s>0; s>>=1, spre>>=1)\n        {\n          if (tid < s) {\n            const unsigned int xid1 = tid << 1;\n            const unsigned int xid2 = (tid + s) << 1;\n            cache_grad_attn_weight[tid] += cache_grad_attn_weight[tid + s];\n            cache_grad_sampling_loc[xid1] += cache_grad_sampling_loc[xid2];\n            cache_grad_sampling_loc[xid1 + 1] += cache_grad_sampling_loc[xid2 + 1];\n            if (tid + (s << 1) < spre)\n            {\n              cache_grad_attn_weight[tid] += cache_grad_attn_weight[tid + (s << 1)];\n              cache_grad_sampling_loc[xid1] += cache_grad_sampling_loc[xid2 + (s << 1)];\n              cache_grad_sampling_loc[xid1 + 1] += 
cache_grad_sampling_loc[xid2 + 1 + (s << 1)];\n            }\n          }\n          __syncthreads();\n        }\n\n        if (tid == 0)\n        {\n          atomicAdd(grad_sampling_loc, cache_grad_sampling_loc[0]);\n          atomicAdd(grad_sampling_loc + 1, cache_grad_sampling_loc[1]);\n          atomicAdd(grad_attn_weight, cache_grad_attn_weight[0]);\n        }\n        __syncthreads();\n\n        data_weight_ptr += 1;\n        data_loc_w_ptr += 2;\n        grad_attn_weight += grad_weight_stride;\n        grad_sampling_loc += grad_loc_stride;\n      }\n    }\n  }\n}\n\n\ntemplate <typename scalar_t>\n__global__ void ms_deformable_col2im_gpu_kernel_gm(const int n,\n                                                const scalar_t *grad_col,\n                                                const scalar_t *data_value,\n                                                const int64_t *data_spatial_shapes,\n                                                const int64_t *data_level_start_index, \n                                                const scalar_t *data_sampling_loc,\n                                                const scalar_t *data_attn_weight,\n                                                const int batch_size, \n                                                const int spatial_size, \n                                                const int num_heads,\n                                                const int channels, \n                                                const int num_levels,\n                                                const int num_query,\n                                                const int num_point,\n                                                scalar_t *grad_value,\n                                                scalar_t *grad_sampling_loc,\n                                                scalar_t *grad_attn_weight)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    int _temp = index;\n    const int c_col = _temp % channels;\n   
 _temp /= channels;\n    const int sampling_index = _temp; \n    const int m_col = _temp % num_heads;\n    _temp /= num_heads;\n    const int q_col = _temp % num_query;\n    _temp /= num_query;\n    const int b_col = _temp;\n\n    const scalar_t top_grad = grad_col[index];\n\n    int data_weight_ptr = sampling_index * num_levels * num_point;\n    int data_loc_w_ptr = data_weight_ptr << 1;\n    const int grad_sampling_ptr = data_weight_ptr;\n    grad_sampling_loc += grad_sampling_ptr << 1;\n    grad_attn_weight += grad_sampling_ptr;\n    const int grad_weight_stride = 1;\n    const int grad_loc_stride = 2;\n    const int qid_stride = num_heads * channels;\n    const int data_value_ptr_init_offset = b_col * spatial_size * qid_stride;\n\n    for (int l_col=0; l_col < num_levels; ++l_col)\n    {\n      const int level_start_id = data_level_start_index[l_col];\n      const int spatial_h_ptr = l_col << 1;\n      const int spatial_h = data_spatial_shapes[spatial_h_ptr];\n      const int spatial_w = data_spatial_shapes[spatial_h_ptr + 1];\n      const int value_ptr_offset = data_value_ptr_init_offset + level_start_id * qid_stride;\n      const scalar_t *data_value_ptr = data_value + value_ptr_offset;\n      scalar_t *grad_value_ptr = grad_value + value_ptr_offset;\n\n      for (int p_col=0; p_col < num_point; ++p_col)\n      {\n        const scalar_t loc_w = data_sampling_loc[data_loc_w_ptr];\n        const scalar_t loc_h = data_sampling_loc[data_loc_w_ptr + 1];\n        const scalar_t weight = data_attn_weight[data_weight_ptr];\n\n        const scalar_t h_im = loc_h * spatial_h - 0.5;\n        const scalar_t w_im = loc_w * spatial_w - 0.5;\n        if (h_im > -1 && w_im > -1 && h_im < spatial_h && w_im < spatial_w)\n        {\n          ms_deform_attn_col2im_bilinear_gm(\n            data_value_ptr, spatial_h, spatial_w, num_heads, channels, h_im, w_im, m_col, c_col,\n            top_grad, weight, grad_value_ptr, \n            grad_sampling_loc, grad_attn_weight);\n       
 }\n        data_weight_ptr += 1;\n        data_loc_w_ptr += 2;\n        grad_attn_weight += grad_weight_stride;\n        grad_sampling_loc += grad_loc_stride;\n      }\n    }\n  }\n}\n\n\ntemplate <typename scalar_t>\nvoid ms_deformable_im2col_cuda(cudaStream_t stream,\n                              const scalar_t* data_value,\n                              const int64_t* data_spatial_shapes, \n                              const int64_t* data_level_start_index, \n                              const scalar_t* data_sampling_loc,\n                              const scalar_t* data_attn_weight,\n                              const int batch_size,\n                              const int spatial_size, \n                              const int num_heads, \n                              const int channels, \n                              const int num_levels, \n                              const int num_query,\n                              const int num_point,\n                              scalar_t* data_col)\n{\n  const int num_kernels = batch_size * num_query * num_heads * channels;\n  const int num_actual_kernels = batch_size * num_query * num_heads * channels;\n  const int num_threads = CUDA_NUM_THREADS;\n  ms_deformable_im2col_gpu_kernel<scalar_t>\n      <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n          0, stream>>>(\n      num_kernels, data_value, data_spatial_shapes, data_level_start_index, data_sampling_loc, data_attn_weight, \n      batch_size, spatial_size, num_heads, channels, num_levels, num_query, num_point, data_col);\n  \n  cudaError_t err = cudaGetLastError();\n  if (err != cudaSuccess)\n  {\n    printf(\"error in ms_deformable_im2col_cuda: %s\\n\", cudaGetErrorString(err));\n  }\n\n}\n\ntemplate <typename scalar_t>\nvoid ms_deformable_col2im_cuda(cudaStream_t stream,\n                              const scalar_t* grad_col,\n                              const scalar_t* data_value,\n                              const int64_t * 
data_spatial_shapes,\n                              const int64_t * data_level_start_index,\n                              const scalar_t * data_sampling_loc,\n                              const scalar_t * data_attn_weight,\n                              const int batch_size, \n                              const int spatial_size, \n                              const int num_heads,\n                              const int channels, \n                              const int num_levels,\n                              const int num_query,\n                              const int num_point, \n                              scalar_t* grad_value,\n                              scalar_t* grad_sampling_loc,\n                              scalar_t* grad_attn_weight)\n{\n  const int num_threads = (channels > CUDA_NUM_THREADS)?CUDA_NUM_THREADS:channels;\n  const int num_kernels = batch_size * num_query * num_heads * channels;\n  const int num_actual_kernels = batch_size * num_query * num_heads * channels;\n  if (channels > 1024)\n  {\n    if ((channels & 1023) == 0)\n    {\n      ms_deformable_col2im_gpu_kernel_shm_reduce_v2_multi_blocks<scalar_t>\n          <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n              num_threads*3*sizeof(scalar_t), stream>>>(\n                        num_kernels, \n                        grad_col,\n                        data_value,\n                        data_spatial_shapes,\n                        data_level_start_index, \n                        data_sampling_loc,\n                        data_attn_weight,\n                        batch_size, \n                        spatial_size, \n                        num_heads,\n                        channels, \n                        num_levels,\n                        num_query,\n                        num_point,\n                        grad_value,\n                        grad_sampling_loc,\n                        grad_attn_weight);\n    }\n    else\n    {\n      
ms_deformable_col2im_gpu_kernel_gm<scalar_t>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n    }\n  }\n  else{\n    switch(channels)\n    {\n      case 1:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1<scalar_t, 1>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 2:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1<scalar_t, 2>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n      
                data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 4:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1<scalar_t, 4>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 8:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1<scalar_t, 8>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n           
           channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 16:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1<scalar_t, 16>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 32:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v1<scalar_t, 32>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 64:\n       
 ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v2<scalar_t, 64>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 128:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v2<scalar_t, 128>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 256:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v2<scalar_t, 256>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      
data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 512:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v2<scalar_t, 512>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      case 1024:\n        ms_deformable_col2im_gpu_kernel_shm_blocksize_aware_reduce_v2<scalar_t, 1024>\n        <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n            0, stream>>>(\n                      num_kernels, \n                      grad_col,\n                      data_value,\n                      data_spatial_shapes,\n                      data_level_start_index, \n                      data_sampling_loc,\n                      data_attn_weight,\n                      batch_size, \n                      spatial_size, \n                
      num_heads,\n                      channels, \n                      num_levels,\n                      num_query,\n                      num_point,\n                      grad_value,\n                      grad_sampling_loc,\n                      grad_attn_weight);\n        break;\n      default:\n        if (channels < 64)\n        {\n          ms_deformable_col2im_gpu_kernel_shm_reduce_v1<scalar_t>\n          <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n              num_threads*3*sizeof(scalar_t), stream>>>(\n                        num_kernels, \n                        grad_col,\n                        data_value,\n                        data_spatial_shapes,\n                        data_level_start_index, \n                        data_sampling_loc,\n                        data_attn_weight,\n                        batch_size, \n                        spatial_size, \n                        num_heads,\n                        channels, \n                        num_levels,\n                        num_query,\n                        num_point,\n                        grad_value,\n                        grad_sampling_loc,\n                        grad_attn_weight);\n        }\n        else\n        {\n          ms_deformable_col2im_gpu_kernel_shm_reduce_v2<scalar_t>\n          <<<GET_BLOCKS(num_actual_kernels, num_threads), num_threads,\n              num_threads*3*sizeof(scalar_t), stream>>>(\n                        num_kernels, \n                        grad_col,\n                        data_value,\n                        data_spatial_shapes,\n                        data_level_start_index, \n                        data_sampling_loc,\n                        data_attn_weight,\n                        batch_size, \n                        spatial_size, \n                        num_heads,\n                        channels, \n                        num_levels,\n                        num_query,\n                        
num_point,\n                        grad_value,\n                        grad_sampling_loc,\n                        grad_attn_weight);\n        }\n    }\n  }\n  cudaError_t err = cudaGetLastError();\n  if (err != cudaSuccess)\n  {\n    printf(\"error in ms_deformable_col2im_cuda: %s\\n\", cudaGetErrorString(err));\n  }\n\n}"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/pixel_decoder/ops/src/ms_deform_attn.h",
    "content": "/*!\n**************************************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************************************\n* Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n**************************************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#pragma once\n\n#include \"cpu/ms_deform_attn_cpu.h\"\n\n#ifdef WITH_CUDA\n#include \"cuda/ms_deform_attn_cuda.h\"\n#endif\n\n\nat::Tensor\nms_deform_attn_forward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const int im2col_step)\n{\n    if (value.type().is_cuda())\n    {\n#ifdef WITH_CUDA\n        return ms_deform_attn_cuda_forward(\n            value, spatial_shapes, level_start_index, sampling_loc, attn_weight, im2col_step);\n#else\n        AT_ERROR(\"Not compiled with GPU support\");\n#endif\n    }\n    AT_ERROR(\"Not implemented on the CPU\");\n}\n\nstd::vector<at::Tensor>\nms_deform_attn_backward(\n    const at::Tensor &value, \n    const at::Tensor &spatial_shapes,\n    const at::Tensor &level_start_index,\n    const at::Tensor &sampling_loc,\n    const at::Tensor &attn_weight,\n    const at::Tensor &grad_output,\n    const int im2col_step)\n{\n    if (value.type().is_cuda())\n    {\n#ifdef WITH_CUDA\n        return ms_deform_attn_cuda_backward(\n            value, spatial_shapes, level_start_index, sampling_loc, attn_weight, grad_output, im2col_step);\n#else\n        AT_ERROR(\"Not compiled with GPU 
support\");\n#endif\n    }\n    AT_ERROR(\"Not implemented on the CPU\");\n}\n\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/pixel_decoder/ops/src/vision.cpp",
    "content": "/*!\n**************************************************************************************************\n* Deformable DETR\n* Copyright (c) 2020 SenseTime. All Rights Reserved.\n* Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n**************************************************************************************************\n* Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n**************************************************************************************************\n*/\n\n/*!\n* Copyright (c) Facebook, Inc. and its affiliates.\n* Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n*/\n\n#include \"ms_deform_attn.h\"\n\nPYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {\n  m.def(\"ms_deform_attn_forward\", &ms_deform_attn_forward, \"ms_deform_attn_forward\");\n  m.def(\"ms_deform_attn_backward\", &ms_deform_attn_backward, \"ms_deform_attn_backward\");\n}\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/pixel_decoder/ops/test.py",
    "content": "# ------------------------------------------------------------------------------------------------\n# Deformable DETR\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# Licensed under the Apache License, Version 2.0 [see LICENSE for details]\n# ------------------------------------------------------------------------------------------------\n# Modified from https://github.com/chengdazhi/Deformable-Convolution-V2-PyTorch/tree/pytorch_1.0.0\n# ------------------------------------------------------------------------------------------------\n\n# Modified by Bowen Cheng from https://github.com/fundamentalvision/Deformable-DETR\n\nfrom __future__ import absolute_import\nfrom __future__ import print_function\nfrom __future__ import division\n\nimport time\nimport torch\nimport torch.nn as nn\nfrom torch.autograd import gradcheck\n\nfrom functions.ms_deform_attn_func import MSDeformAttnFunction, ms_deform_attn_core_pytorch\n\n\nN, M, D = 1, 2, 2\nLq, L, P = 2, 2, 2\nshapes = torch.as_tensor([(6, 4), (3, 2)], dtype=torch.long).cuda()\nlevel_start_index = torch.cat((shapes.new_zeros((1, )), shapes.prod(1).cumsum(0)[:-1]))\nS = sum([(H*W).item() for H, W in shapes])\n\n\ntorch.manual_seed(3)\n\n\n@torch.no_grad()\ndef check_forward_equal_with_pytorch_double():\n    value = torch.rand(N, S, M, D).cuda() * 0.01\n    sampling_locations = torch.rand(N, Lq, M, L, P, 2).cuda()\n    attention_weights = torch.rand(N, Lq, M, L, P).cuda() + 1e-5\n    attention_weights /= attention_weights.sum(-1, keepdim=True).sum(-2, keepdim=True)\n    im2col_step = 2\n    output_pytorch = ms_deform_attn_core_pytorch(value.double(), shapes, sampling_locations.double(), attention_weights.double()).detach().cpu()\n    output_cuda = MSDeformAttnFunction.apply(value.double(), shapes, level_start_index, sampling_locations.double(), attention_weights.double(), im2col_step).detach().cpu()\n    fwdok = torch.allclose(output_cuda, output_pytorch)\n    max_abs_err = (output_cuda - 
output_pytorch).abs().max()\n    max_rel_err = ((output_cuda - output_pytorch).abs() / output_pytorch.abs()).max()\n\n    print(f'* {fwdok} check_forward_equal_with_pytorch_double: max_abs_err {max_abs_err:.2e} max_rel_err {max_rel_err:.2e}')\n\n\n@torch.no_grad()\ndef check_forward_equal_with_pytorch_float():\n    value = torch.rand(N, S, M, D).cuda() * 0.01\n    sampling_locations = torch.rand(N, Lq, M, L, P, 2).cuda()\n    attention_weights = torch.rand(N, Lq, M, L, P).cuda() + 1e-5\n    attention_weights /= attention_weights.sum(-1, keepdim=True).sum(-2, keepdim=True)\n    im2col_step = 2\n    output_pytorch = ms_deform_attn_core_pytorch(value, shapes, sampling_locations, attention_weights).detach().cpu()\n    output_cuda = MSDeformAttnFunction.apply(value, shapes, level_start_index, sampling_locations, attention_weights, im2col_step).detach().cpu()\n    fwdok = torch.allclose(output_cuda, output_pytorch, rtol=1e-2, atol=1e-3)\n    max_abs_err = (output_cuda - output_pytorch).abs().max()\n    max_rel_err = ((output_cuda - output_pytorch).abs() / output_pytorch.abs()).max()\n\n    print(f'* {fwdok} check_forward_equal_with_pytorch_float: max_abs_err {max_abs_err:.2e} max_rel_err {max_rel_err:.2e}')\n\n\ndef check_gradient_numerical(channels=4, grad_value=True, grad_sampling_loc=True, grad_attn_weight=True):\n\n    value = torch.rand(N, S, M, channels).cuda() * 0.01\n    sampling_locations = torch.rand(N, Lq, M, L, P, 2).cuda()\n    attention_weights = torch.rand(N, Lq, M, L, P).cuda() + 1e-5\n    attention_weights /= attention_weights.sum(-1, keepdim=True).sum(-2, keepdim=True)\n    im2col_step = 2\n    func = MSDeformAttnFunction.apply\n\n    value.requires_grad = grad_value\n    sampling_locations.requires_grad = grad_sampling_loc\n    attention_weights.requires_grad = grad_attn_weight\n\n    gradok = gradcheck(func, (value.double(), shapes, level_start_index, sampling_locations.double(), attention_weights.double(), im2col_step))\n\n    print(f'* {gradok} 
check_gradient_numerical(D={channels})')\n\n\nif __name__ == '__main__':\n    check_forward_equal_with_pytorch_double()\n    check_forward_equal_with_pytorch_float()\n\n    for channels in [30, 32, 64, 71, 1025, 2048, 3096]:\n        check_gradient_numerical(channels, True, True, True)\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/transformer_decoder/__init__.py",
    "content": "from .maskformer_transformer_decoder import StandardTransformerDecoder\nfrom .mask2former_transformer_decoder import MultiScaleMaskedTransformerDecoder\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/transformer_decoder/mask2former_transformer_decoder.py",
    "content": "# Modified by Bowen Cheng from: https://github.com/facebookresearch/detr/blob/master/models/detr.py\nimport logging\nimport fvcore.nn.weight_init as weight_init\nfrom typing import Optional\nimport torch\nfrom torch import nn, Tensor\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.layers import Conv2d\n\nfrom .position_encoding import PositionEmbeddingSine\nfrom .maskformer_transformer_decoder import TRANSFORMER_DECODER_REGISTRY\n\n\nclass SelfAttentionLayer(nn.Module):\n\n    def __init__(self, d_model, nhead, dropout=0.0,\n                 activation=\"relu\", normalize_before=False):\n        super().__init__()\n        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)\n\n        self.norm = nn.LayerNorm(d_model)\n        self.dropout = nn.Dropout(dropout)\n\n        self.activation = _get_activation_fn(activation)\n        self.normalize_before = normalize_before\n\n        self._reset_parameters()\n    \n    def _reset_parameters(self):\n        for p in self.parameters():\n            if p.dim() > 1:\n                nn.init.xavier_uniform_(p)\n\n    def with_pos_embed(self, tensor, pos: Optional[Tensor]):\n        return tensor if pos is None else tensor + pos\n\n    def forward_post(self, tgt,\n                     tgt_mask: Optional[Tensor] = None,\n                     tgt_key_padding_mask: Optional[Tensor] = None,\n                     query_pos: Optional[Tensor] = None):\n        q = k = self.with_pos_embed(tgt, query_pos)\n        tgt2 = self.self_attn(q, k, value=tgt, attn_mask=tgt_mask,\n                              key_padding_mask=tgt_key_padding_mask)[0]\n        tgt = tgt + self.dropout(tgt2)\n        tgt = self.norm(tgt)\n\n        return tgt\n\n    def forward_pre(self, tgt,\n                    tgt_mask: Optional[Tensor] = None,\n                    tgt_key_padding_mask: Optional[Tensor] = None,\n                    query_pos: Optional[Tensor] = 
None):\n        tgt2 = self.norm(tgt)\n        q = k = self.with_pos_embed(tgt2, query_pos)\n        tgt2 = self.self_attn(q, k, value=tgt2, attn_mask=tgt_mask,\n                              key_padding_mask=tgt_key_padding_mask)[0]\n        tgt = tgt + self.dropout(tgt2)\n        \n        return tgt\n\n    def forward(self, tgt,\n                tgt_mask: Optional[Tensor] = None,\n                tgt_key_padding_mask: Optional[Tensor] = None,\n                query_pos: Optional[Tensor] = None):\n        if self.normalize_before:\n            return self.forward_pre(tgt, tgt_mask,\n                                    tgt_key_padding_mask, query_pos)\n        return self.forward_post(tgt, tgt_mask,\n                                 tgt_key_padding_mask, query_pos)\n\n\nclass CrossAttentionLayer(nn.Module):\n\n    def __init__(self, d_model, nhead, dropout=0.0,\n                 activation=\"relu\", normalize_before=False):\n        super().__init__()\n        self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)\n\n        self.norm = nn.LayerNorm(d_model)\n        self.dropout = nn.Dropout(dropout)\n\n        self.activation = _get_activation_fn(activation)\n        self.normalize_before = normalize_before\n\n        self._reset_parameters()\n    \n    def _reset_parameters(self):\n        for p in self.parameters():\n            if p.dim() > 1:\n                nn.init.xavier_uniform_(p)\n\n    def with_pos_embed(self, tensor, pos: Optional[Tensor]):\n        return tensor if pos is None else tensor + pos\n\n    def forward_post(self, tgt, memory,\n                     memory_mask: Optional[Tensor] = None,\n                     memory_key_padding_mask: Optional[Tensor] = None,\n                     pos: Optional[Tensor] = None,\n                     query_pos: Optional[Tensor] = None):\n        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt, query_pos),\n                                   key=self.with_pos_embed(memory, pos),\n  
                                 value=memory, attn_mask=memory_mask,\n                                   key_padding_mask=memory_key_padding_mask)[0]\n        tgt = tgt + self.dropout(tgt2)\n        tgt = self.norm(tgt)\n        \n        return tgt\n\n    def forward_pre(self, tgt, memory,\n                    memory_mask: Optional[Tensor] = None,\n                    memory_key_padding_mask: Optional[Tensor] = None,\n                    pos: Optional[Tensor] = None,\n                    query_pos: Optional[Tensor] = None):\n        tgt2 = self.norm(tgt)\n        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt2, query_pos),\n                                   key=self.with_pos_embed(memory, pos),\n                                   value=memory, attn_mask=memory_mask,\n                                   key_padding_mask=memory_key_padding_mask)[0]\n        tgt = tgt + self.dropout(tgt2)\n\n        return tgt\n\n    def forward(self, tgt, memory,\n                memory_mask: Optional[Tensor] = None,\n                memory_key_padding_mask: Optional[Tensor] = None,\n                pos: Optional[Tensor] = None,\n                query_pos: Optional[Tensor] = None):\n        if self.normalize_before:\n            return self.forward_pre(tgt, memory, memory_mask,\n                                    memory_key_padding_mask, pos, query_pos)\n        return self.forward_post(tgt, memory, memory_mask,\n                                 memory_key_padding_mask, pos, query_pos)\n\n\nclass FFNLayer(nn.Module):\n\n    def __init__(self, d_model, dim_feedforward=2048, dropout=0.0,\n                 activation=\"relu\", normalize_before=False):\n        super().__init__()\n        # Implementation of Feedforward model\n        self.linear1 = nn.Linear(d_model, dim_feedforward)\n        self.dropout = nn.Dropout(dropout)\n        self.linear2 = nn.Linear(dim_feedforward, d_model)\n\n        self.norm = nn.LayerNorm(d_model)\n\n        self.activation = 
_get_activation_fn(activation)\n        self.normalize_before = normalize_before\n\n        self._reset_parameters()\n\n    def _reset_parameters(self):\n        for p in self.parameters():\n            if p.dim() > 1:\n                nn.init.xavier_uniform_(p)\n\n    def with_pos_embed(self, tensor, pos: Optional[Tensor]):\n        return tensor if pos is None else tensor + pos\n\n    def forward_post(self, tgt):\n        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))\n        tgt = tgt + self.dropout(tgt2)\n        tgt = self.norm(tgt)\n        return tgt\n\n    def forward_pre(self, tgt):\n        tgt2 = self.norm(tgt)\n        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt2))))\n        tgt = tgt + self.dropout(tgt2)\n        return tgt\n\n    def forward(self, tgt):\n        if self.normalize_before:\n            return self.forward_pre(tgt)\n        return self.forward_post(tgt)\n\n\ndef _get_activation_fn(activation):\n    \"\"\"Return an activation function given a string\"\"\"\n    if activation == \"relu\":\n        return F.relu\n    if activation == \"gelu\":\n        return F.gelu\n    if activation == \"glu\":\n        return F.glu\n    raise RuntimeError(f\"activation should be relu/gelu/glu, not {activation}.\")\n\n\nclass MLP(nn.Module):\n    \"\"\"Very simple multi-layer perceptron (also called FFN)\"\"\"\n\n    def __init__(self, input_dim, hidden_dim, output_dim, num_layers):\n        super().__init__()\n        self.num_layers = num_layers\n        h = [hidden_dim] * (num_layers - 1)\n        self.layers = nn.ModuleList(nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim]))\n\n    def forward(self, x):\n        for i, layer in enumerate(self.layers):\n            x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x)\n        return x\n\n\n@TRANSFORMER_DECODER_REGISTRY.register()\nclass MultiScaleMaskedTransformerDecoder(nn.Module):\n\n    _version = 2\n\n    def _load_from_state_dict(\n        self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs\n    ):\n        version = local_metadata.get(\"version\", None)\n        if version is None or version < 2:\n            # Do not warn when training from scratch\n            scratch = True\n            logger = logging.getLogger(__name__)\n            for k in list(state_dict.keys()):\n                newk = k\n                if \"static_query\" in k:\n                    newk = k.replace(\"static_query\", \"query_feat\")\n                if newk != k:\n                    state_dict[newk] = state_dict[k]\n                    del state_dict[k]\n                    scratch = False\n\n            if not scratch:\n                logger.warning(\n                    f\"Weight format of {self.__class__.__name__} has changed! \"\n                    \"Please upgrade your models. Applying automatic conversion now ...\"\n                )\n\n    @configurable\n    def __init__(\n        self,\n        in_channels,\n        mask_classification=True,\n        *,\n        num_classes: int,\n        hidden_dim: int,\n        num_queries: int,\n        nheads: int,\n        dim_feedforward: int,\n        dec_layers: int,\n        pre_norm: bool,\n        mask_dim: int,\n        enforce_input_project: bool,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            in_channels: channels of the input features\n            mask_classification: whether to add mask classifier or not\n            num_classes: number of classes\n            hidden_dim: Transformer feature dimension\n            num_queries: number of queries\n            nheads: number of heads\n            dim_feedforward: feature dimension in feedforward network\n            dec_layers: number of Transformer decoder layers\n            pre_norm: whether to use pre-LayerNorm or not\n            
mask_dim: mask feature dimension\n            enforce_input_project: add input project 1x1 conv even if input\n                channels and hidden dim is identical\n        \"\"\"\n        super().__init__()\n\n        assert mask_classification, \"Only support mask classification model\"\n        self.mask_classification = mask_classification\n\n        # positional encoding\n        N_steps = hidden_dim // 2\n        self.pe_layer = PositionEmbeddingSine(N_steps, normalize=True)\n        \n        # define Transformer decoder here\n        self.num_heads = nheads\n        self.num_layers = dec_layers\n        self.transformer_self_attention_layers = nn.ModuleList()\n        self.transformer_cross_attention_layers = nn.ModuleList()\n        self.transformer_ffn_layers = nn.ModuleList()\n\n        for _ in range(self.num_layers):\n            self.transformer_self_attention_layers.append(\n                SelfAttentionLayer(\n                    d_model=hidden_dim,\n                    nhead=nheads,\n                    dropout=0.0,\n                    normalize_before=pre_norm,\n                )\n            )\n\n            self.transformer_cross_attention_layers.append(\n                CrossAttentionLayer(\n                    d_model=hidden_dim,\n                    nhead=nheads,\n                    dropout=0.0,\n                    normalize_before=pre_norm,\n                )\n            )\n\n            self.transformer_ffn_layers.append(\n                FFNLayer(\n                    d_model=hidden_dim,\n                    dim_feedforward=dim_feedforward,\n                    dropout=0.0,\n                    normalize_before=pre_norm,\n                )\n            )\n\n        self.decoder_norm = nn.LayerNorm(hidden_dim)\n\n        self.num_queries = num_queries\n        # learnable query features\n        self.query_feat = nn.Embedding(num_queries, hidden_dim)\n        # learnable query p.e.\n        self.query_embed = nn.Embedding(num_queries, 
hidden_dim)\n\n        # level embedding (we always use 3 scales)\n        self.num_feature_levels = 3\n        self.level_embed = nn.Embedding(self.num_feature_levels, hidden_dim)\n        self.input_proj = nn.ModuleList()\n        for _ in range(self.num_feature_levels):\n            if in_channels != hidden_dim or enforce_input_project:\n                self.input_proj.append(Conv2d(in_channels, hidden_dim, kernel_size=1))\n                weight_init.c2_xavier_fill(self.input_proj[-1])\n            else:\n                self.input_proj.append(nn.Sequential())\n\n        # output FFNs\n        if self.mask_classification:\n            self.class_embed = nn.Linear(hidden_dim, num_classes + 1)\n        self.mask_embed = MLP(hidden_dim, hidden_dim, mask_dim, 3)\n\n    @classmethod\n    def from_config(cls, cfg, in_channels, mask_classification):\n        ret = {}\n        ret[\"in_channels\"] = in_channels\n        ret[\"mask_classification\"] = mask_classification\n        \n        ret[\"num_classes\"] = cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES\n        ret[\"hidden_dim\"] = cfg.MODEL.MASK_FORMER.HIDDEN_DIM\n        ret[\"num_queries\"] = cfg.MODEL.MASK_FORMER.NUM_OBJECT_QUERIES\n        # Transformer parameters:\n        ret[\"nheads\"] = cfg.MODEL.MASK_FORMER.NHEADS\n        ret[\"dim_feedforward\"] = cfg.MODEL.MASK_FORMER.DIM_FEEDFORWARD\n\n        # NOTE: because we add learnable query features which requires supervision,\n        # we add minus 1 to decoder layers to be consistent with our loss\n        # implementation: that is, number of auxiliary losses is always\n        # equal to number of decoder layers. 
With learnable query features, the number of\n        # auxiliary losses equals number of decoders plus 1.\n        assert cfg.MODEL.MASK_FORMER.DEC_LAYERS >= 1\n        ret[\"dec_layers\"] = cfg.MODEL.MASK_FORMER.DEC_LAYERS - 1\n        ret[\"pre_norm\"] = cfg.MODEL.MASK_FORMER.PRE_NORM\n        ret[\"enforce_input_project\"] = cfg.MODEL.MASK_FORMER.ENFORCE_INPUT_PROJ\n\n        ret[\"mask_dim\"] = cfg.MODEL.SEM_SEG_HEAD.MASK_DIM\n\n        return ret\n\n    def forward(self, x, mask_features, mask=None):\n        # x is a list of multi-scale features\n        assert len(x) == self.num_feature_levels\n        src = []\n        pos = []\n        size_list = []\n\n        # disable mask, it does not affect performance\n        del mask\n\n        for i in range(self.num_feature_levels):\n            size_list.append(x[i].shape[-2:])\n            pos.append(self.pe_layer(x[i], None).flatten(2))\n            src.append(self.input_proj[i](x[i]).flatten(2) + self.level_embed.weight[i][None, :, None])\n\n            # flatten NxCxHxW to HWxNxC\n            pos[-1] = pos[-1].permute(2, 0, 1)\n            src[-1] = src[-1].permute(2, 0, 1)\n\n        _, bs, _ = src[0].shape\n\n        # QxNxC\n        query_embed = self.query_embed.weight.unsqueeze(1).repeat(1, bs, 1)\n        output = self.query_feat.weight.unsqueeze(1).repeat(1, bs, 1)\n\n        predictions_class = []\n        predictions_mask = []\n\n        # prediction heads on learnable query features\n        outputs_class, outputs_mask, attn_mask = self.forward_prediction_heads(output, mask_features, attn_mask_target_size=size_list[0])\n        predictions_class.append(outputs_class)\n        predictions_mask.append(outputs_mask)\n\n        for i in range(self.num_layers):\n            level_index = i % self.num_feature_levels\n            attn_mask[torch.where(attn_mask.sum(-1) == attn_mask.shape[-1])] = False\n            # attention: 
cross-attention first\n            output = self.transformer_cross_attention_layers[i](\n                output, src[level_index],\n                memory_mask=attn_mask,\n                memory_key_padding_mask=None,  # here we do not apply masking on padded region\n                pos=pos[level_index], query_pos=query_embed\n            )\n\n            output = self.transformer_self_attention_layers[i](\n                output, tgt_mask=None,\n                tgt_key_padding_mask=None,\n                query_pos=query_embed\n            )\n\n            # FFN\n            output = self.transformer_ffn_layers[i](\n                output\n            )\n\n            outputs_class, outputs_mask, attn_mask = self.forward_prediction_heads(output, mask_features, attn_mask_target_size=size_list[(i + 1) % self.num_feature_levels])\n            predictions_class.append(outputs_class)\n            predictions_mask.append(outputs_mask)\n\n        assert len(predictions_class) == self.num_layers + 1\n        out = {\n            'pred_logits': predictions_class[-1],\n            'pred_masks': predictions_mask[-1],\n            'aux_outputs': self._set_aux_loss(\n                predictions_class if self.mask_classification else None, predictions_mask\n            )\n        }\n        return out\n\n    def forward_prediction_heads(self, output, mask_features, attn_mask_target_size):\n        decoder_output = self.decoder_norm(output)\n        decoder_output = decoder_output.transpose(0, 1)\n        outputs_class = self.class_embed(decoder_output)\n        mask_embed = self.mask_embed(decoder_output)\n        outputs_mask = torch.einsum(\"bqc,bchw->bqhw\", mask_embed, mask_features)\n\n        # NOTE: prediction is of higher-resolution\n        # [B, Q, H, W] -> [B, Q, H*W] -> [B, h, Q, H*W] -> [B*h, Q, HW]\n        attn_mask = F.interpolate(outputs_mask, size=attn_mask_target_size, 
mode=\"bilinear\", align_corners=False)\n        # must use bool type\n        # If a BoolTensor is provided, positions with ``True`` are not allowed to attend while ``False`` values will be unchanged.\n        attn_mask = (attn_mask.sigmoid().flatten(2).unsqueeze(1).repeat(1, self.num_heads, 1, 1).flatten(0, 1) < 0.5).bool()\n        attn_mask = attn_mask.detach()\n\n        return outputs_class, outputs_mask, attn_mask\n\n    @torch.jit.unused\n    def _set_aux_loss(self, outputs_class, outputs_seg_masks):\n        # this is a workaround to make torchscript happy, as torchscript\n        # doesn't support dictionary with non-homogeneous values, such\n        # as a dict having both a Tensor and a list.\n        if self.mask_classification:\n            return [\n                {\"pred_logits\": a, \"pred_masks\": b}\n                for a, b in zip(outputs_class[:-1], outputs_seg_masks[:-1])\n            ]\n        else:\n            return [{\"pred_masks\": b} for b in outputs_seg_masks[:-1]]\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/transformer_decoder/maskformer_transformer_decoder.py",
    "content": "# Modified by Bowen Cheng from: https://github.com/facebookresearch/detr/blob/master/models/detr.py\nimport fvcore.nn.weight_init as weight_init\nimport torch\nfrom torch import nn\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.layers import Conv2d\nfrom detectron2.utils.registry import Registry\n\nfrom .position_encoding import PositionEmbeddingSine\nfrom .transformer import Transformer\n\n\nTRANSFORMER_DECODER_REGISTRY = Registry(\"TRANSFORMER_MODULE\")\nTRANSFORMER_DECODER_REGISTRY.__doc__ = \"\"\"\nRegistry for transformer module in MaskFormer.\n\"\"\"\n\n\ndef build_transformer_decoder(cfg, in_channels, mask_classification=True):\n    \"\"\"\n    Build a instance embedding branch from `cfg.MODEL.INS_EMBED_HEAD.NAME`.\n    \"\"\"\n    name = cfg.MODEL.MASK_FORMER.TRANSFORMER_DECODER_NAME\n    return TRANSFORMER_DECODER_REGISTRY.get(name)(cfg, in_channels, mask_classification)\n\n\n@TRANSFORMER_DECODER_REGISTRY.register()\nclass StandardTransformerDecoder(nn.Module):\n    @configurable\n    def __init__(\n        self,\n        in_channels,\n        mask_classification=True,\n        *,\n        num_classes: int,\n        hidden_dim: int,\n        num_queries: int,\n        nheads: int,\n        dropout: float,\n        dim_feedforward: int,\n        enc_layers: int,\n        dec_layers: int,\n        pre_norm: bool,\n        deep_supervision: bool,\n        mask_dim: int,\n        enforce_input_project: bool,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            in_channels: channels of the input features\n            mask_classification: whether to add mask classifier or not\n            num_classes: number of classes\n            hidden_dim: Transformer feature dimension\n            num_queries: number of queries\n            nheads: number of heads\n            dropout: dropout in Transformer\n            dim_feedforward: feature dimension in 
feedforward network\n            enc_layers: number of Transformer encoder layers\n            dec_layers: number of Transformer decoder layers\n            pre_norm: whether to use pre-LayerNorm or not\n            deep_supervision: whether to add supervision to every decoder layers\n            mask_dim: mask feature dimension\n            enforce_input_project: add input project 1x1 conv even if input\n                channels and hidden dim is identical\n        \"\"\"\n        super().__init__()\n\n        self.mask_classification = mask_classification\n\n        # positional encoding\n        N_steps = hidden_dim // 2\n        self.pe_layer = PositionEmbeddingSine(N_steps, normalize=True)\n\n        transformer = Transformer(\n            d_model=hidden_dim,\n            dropout=dropout,\n            nhead=nheads,\n            dim_feedforward=dim_feedforward,\n            num_encoder_layers=enc_layers,\n            num_decoder_layers=dec_layers,\n            normalize_before=pre_norm,\n            return_intermediate_dec=deep_supervision,\n        )\n\n        self.num_queries = num_queries\n        self.transformer = transformer\n        hidden_dim = transformer.d_model\n\n        self.query_embed = nn.Embedding(num_queries, hidden_dim)\n\n        if in_channels != hidden_dim or enforce_input_project:\n            self.input_proj = Conv2d(in_channels, hidden_dim, kernel_size=1)\n            weight_init.c2_xavier_fill(self.input_proj)\n        else:\n            self.input_proj = nn.Sequential()\n        self.aux_loss = deep_supervision\n\n        # output FFNs\n        if self.mask_classification:\n            self.class_embed = nn.Linear(hidden_dim, num_classes + 1)\n        self.mask_embed = MLP(hidden_dim, hidden_dim, mask_dim, 3)\n\n    @classmethod\n    def from_config(cls, cfg, in_channels, mask_classification):\n        ret = {}\n        ret[\"in_channels\"] = in_channels\n        ret[\"mask_classification\"] = mask_classification\n\n        
ret[\"num_classes\"] = cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES\n        ret[\"hidden_dim\"] = cfg.MODEL.MASK_FORMER.HIDDEN_DIM\n        ret[\"num_queries\"] = cfg.MODEL.MASK_FORMER.NUM_OBJECT_QUERIES\n        # Transformer parameters:\n        ret[\"nheads\"] = cfg.MODEL.MASK_FORMER.NHEADS\n        ret[\"dropout\"] = cfg.MODEL.MASK_FORMER.DROPOUT\n        ret[\"dim_feedforward\"] = cfg.MODEL.MASK_FORMER.DIM_FEEDFORWARD\n        ret[\"enc_layers\"] = cfg.MODEL.MASK_FORMER.ENC_LAYERS\n        ret[\"dec_layers\"] = cfg.MODEL.MASK_FORMER.DEC_LAYERS\n        ret[\"pre_norm\"] = cfg.MODEL.MASK_FORMER.PRE_NORM\n        ret[\"deep_supervision\"] = cfg.MODEL.MASK_FORMER.DEEP_SUPERVISION\n        ret[\"enforce_input_project\"] = cfg.MODEL.MASK_FORMER.ENFORCE_INPUT_PROJ\n\n        ret[\"mask_dim\"] = cfg.MODEL.SEM_SEG_HEAD.MASK_DIM\n\n        return ret\n\n    def forward(self, x, mask_features, mask=None):\n        if mask is not None:\n            mask = F.interpolate(mask[None].float(), size=x.shape[-2:]).to(torch.bool)[0]\n        pos = self.pe_layer(x, mask)\n\n        src = x\n        hs, memory = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos)\n\n        if self.mask_classification:\n            outputs_class = self.class_embed(hs)\n            out = {\"pred_logits\": outputs_class[-1]}\n        else:\n            out = {}\n\n        if self.aux_loss:\n            # [l, bs, queries, embed]\n            mask_embed = self.mask_embed(hs)\n            outputs_seg_masks = torch.einsum(\"lbqc,bchw->lbqhw\", mask_embed, mask_features)\n            out[\"pred_masks\"] = outputs_seg_masks[-1]\n            out[\"aux_outputs\"] = self._set_aux_loss(\n                outputs_class if self.mask_classification else None, outputs_seg_masks\n            )\n        else:\n            # FIXME h_boxes takes the last one computed, keep this in mind\n            # [bs, queries, embed]\n            mask_embed = self.mask_embed(hs[-1])\n            outputs_seg_masks = 
torch.einsum(\"bqc,bchw->bqhw\", mask_embed, mask_features)\n            out[\"pred_masks\"] = outputs_seg_masks\n        return out\n\n    @torch.jit.unused\n    def _set_aux_loss(self, outputs_class, outputs_seg_masks):\n        # this is a workaround to make torchscript happy, as torchscript\n        # doesn't support dictionary with non-homogeneous values, such\n        # as a dict having both a Tensor and a list.\n        if self.mask_classification:\n            return [\n                {\"pred_logits\": a, \"pred_masks\": b}\n                for a, b in zip(outputs_class[:-1], outputs_seg_masks[:-1])\n            ]\n        else:\n            return [{\"pred_masks\": b} for b in outputs_seg_masks[:-1]]\n\n\nclass MLP(nn.Module):\n    \"\"\"Very simple multi-layer perceptron (also called FFN)\"\"\"\n\n    def __init__(self, input_dim, hidden_dim, output_dim, num_layers):\n        super().__init__()\n        self.num_layers = num_layers\n        h = [hidden_dim] * (num_layers - 1)\n        self.layers = nn.ModuleList(\n            nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim])\n        )\n\n    def forward(self, x):\n        for i, layer in enumerate(self.layers):\n            x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x)\n        return x\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/transformer_decoder/position_encoding.py",
    "content": "# # Modified by Bowen Cheng from: https://github.com/facebookresearch/detr/blob/master/models/position_encoding.py\n\"\"\"\nVarious positional encodings for the transformer.\n\"\"\"\nimport math\n\nimport torch\nfrom torch import nn\n\n\nclass PositionEmbeddingSine(nn.Module):\n    \"\"\"\n    This is a more standard version of the position embedding, very similar to the one\n    used by the Attention is all you need paper, generalized to work on images.\n    \"\"\"\n\n    def __init__(self, num_pos_feats=64, temperature=10000, normalize=False, scale=None):\n        super().__init__()\n        self.num_pos_feats = num_pos_feats\n        self.temperature = temperature\n        self.normalize = normalize\n        if scale is not None and normalize is False:\n            raise ValueError(\"normalize should be True if scale is passed\")\n        if scale is None:\n            scale = 2 * math.pi\n        self.scale = scale\n\n    def forward(self, x, mask=None):\n        if mask is None:\n            mask = torch.zeros((x.size(0), x.size(2), x.size(3)), device=x.device, dtype=torch.bool)\n        not_mask = ~mask\n        y_embed = not_mask.cumsum(1, dtype=torch.float32)\n        x_embed = not_mask.cumsum(2, dtype=torch.float32)\n        if self.normalize:\n            eps = 1e-6\n            y_embed = y_embed / (y_embed[:, -1:, :] + eps) * self.scale\n            x_embed = x_embed / (x_embed[:, :, -1:] + eps) * self.scale\n\n        dim_t = torch.arange(self.num_pos_feats, dtype=torch.float32, device=x.device)\n        dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)\n\n        pos_x = x_embed[:, :, :, None] / dim_t\n        pos_y = y_embed[:, :, :, None] / dim_t\n        pos_x = torch.stack(\n            (pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4\n        ).flatten(3)\n        pos_y = torch.stack(\n            (pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4\n        ).flatten(3)\n        pos = 
torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)\n        return pos\n    \n    def __repr__(self, _repr_indent=4):\n        head = \"Positional encoding \" + self.__class__.__name__\n        body = [\n            \"num_pos_feats: {}\".format(self.num_pos_feats),\n            \"temperature: {}\".format(self.temperature),\n            \"normalize: {}\".format(self.normalize),\n            \"scale: {}\".format(self.scale),\n        ]\n        # _repr_indent = 4\n        lines = [head] + [\" \" * _repr_indent + line for line in body]\n        return \"\\n\".join(lines)\n"
  },
  {
    "path": "mfvis_nococo/mask2former/modeling/transformer_decoder/transformer.py",
    "content": "# Modified by Bowen Cheng from: https://github.com/facebookresearch/detr/blob/master/models/transformer.py\n\"\"\"\nTransformer class.\n\nCopy-paste from torch.nn.Transformer with modifications:\n    * positional encodings are passed in MHattention\n    * extra LN at the end of encoder is removed\n    * decoder returns a stack of activations from all decoding layers\n\"\"\"\nimport copy\nfrom typing import List, Optional\n\nimport torch\nimport torch.nn.functional as F\nfrom torch import Tensor, nn\n\n\nclass Transformer(nn.Module):\n    def __init__(\n        self,\n        d_model=512,\n        nhead=8,\n        num_encoder_layers=6,\n        num_decoder_layers=6,\n        dim_feedforward=2048,\n        dropout=0.1,\n        activation=\"relu\",\n        normalize_before=False,\n        return_intermediate_dec=False,\n    ):\n        super().__init__()\n\n        encoder_layer = TransformerEncoderLayer(\n            d_model, nhead, dim_feedforward, dropout, activation, normalize_before\n        )\n        encoder_norm = nn.LayerNorm(d_model) if normalize_before else None\n        self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)\n\n        decoder_layer = TransformerDecoderLayer(\n            d_model, nhead, dim_feedforward, dropout, activation, normalize_before\n        )\n        decoder_norm = nn.LayerNorm(d_model)\n        self.decoder = TransformerDecoder(\n            decoder_layer,\n            num_decoder_layers,\n            decoder_norm,\n            return_intermediate=return_intermediate_dec,\n        )\n\n        self._reset_parameters()\n\n        self.d_model = d_model\n        self.nhead = nhead\n\n    def _reset_parameters(self):\n        for p in self.parameters():\n            if p.dim() > 1:\n                nn.init.xavier_uniform_(p)\n\n    def forward(self, src, mask, query_embed, pos_embed):\n        # flatten NxCxHxW to HWxNxC\n        bs, c, h, w = src.shape\n        src = 
src.flatten(2).permute(2, 0, 1)\n        pos_embed = pos_embed.flatten(2).permute(2, 0, 1)\n        query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)\n        if mask is not None:\n            mask = mask.flatten(1)\n\n        tgt = torch.zeros_like(query_embed)\n        memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)\n        hs = self.decoder(\n            tgt, memory, memory_key_padding_mask=mask, pos=pos_embed, query_pos=query_embed\n        )\n        return hs.transpose(1, 2), memory.permute(1, 2, 0).view(bs, c, h, w)\n\n\nclass TransformerEncoder(nn.Module):\n    def __init__(self, encoder_layer, num_layers, norm=None):\n        super().__init__()\n        self.layers = _get_clones(encoder_layer, num_layers)\n        self.num_layers = num_layers\n        self.norm = norm\n\n    def forward(\n        self,\n        src,\n        mask: Optional[Tensor] = None,\n        src_key_padding_mask: Optional[Tensor] = None,\n        pos: Optional[Tensor] = None,\n    ):\n        output = src\n\n        for layer in self.layers:\n            output = layer(\n                output, src_mask=mask, src_key_padding_mask=src_key_padding_mask, pos=pos\n            )\n\n        if self.norm is not None:\n            output = self.norm(output)\n\n        return output\n\n\nclass TransformerDecoder(nn.Module):\n    def __init__(self, decoder_layer, num_layers, norm=None, return_intermediate=False):\n        super().__init__()\n        self.layers = _get_clones(decoder_layer, num_layers)\n        self.num_layers = num_layers\n        self.norm = norm\n        self.return_intermediate = return_intermediate\n\n    def forward(\n        self,\n        tgt,\n        memory,\n        tgt_mask: Optional[Tensor] = None,\n        memory_mask: Optional[Tensor] = None,\n        tgt_key_padding_mask: Optional[Tensor] = None,\n        memory_key_padding_mask: Optional[Tensor] = None,\n        pos: Optional[Tensor] = None,\n        query_pos: Optional[Tensor] = 
None,\n    ):\n        output = tgt\n\n        intermediate = []\n\n        for layer in self.layers:\n            output = layer(\n                output,\n                memory,\n                tgt_mask=tgt_mask,\n                memory_mask=memory_mask,\n                tgt_key_padding_mask=tgt_key_padding_mask,\n                memory_key_padding_mask=memory_key_padding_mask,\n                pos=pos,\n                query_pos=query_pos,\n            )\n            if self.return_intermediate:\n                intermediate.append(self.norm(output))\n\n        if self.norm is not None:\n            output = self.norm(output)\n            if self.return_intermediate:\n                intermediate.pop()\n                intermediate.append(output)\n\n        if self.return_intermediate:\n            return torch.stack(intermediate)\n\n        return output.unsqueeze(0)\n\n\nclass TransformerEncoderLayer(nn.Module):\n    def __init__(\n        self,\n        d_model,\n        nhead,\n        dim_feedforward=2048,\n        dropout=0.1,\n        activation=\"relu\",\n        normalize_before=False,\n    ):\n        super().__init__()\n        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)\n        # Implementation of Feedforward model\n        self.linear1 = nn.Linear(d_model, dim_feedforward)\n        self.dropout = nn.Dropout(dropout)\n        self.linear2 = nn.Linear(dim_feedforward, d_model)\n\n        self.norm1 = nn.LayerNorm(d_model)\n        self.norm2 = nn.LayerNorm(d_model)\n        self.dropout1 = nn.Dropout(dropout)\n        self.dropout2 = nn.Dropout(dropout)\n\n        self.activation = _get_activation_fn(activation)\n        self.normalize_before = normalize_before\n\n    def with_pos_embed(self, tensor, pos: Optional[Tensor]):\n        return tensor if pos is None else tensor + pos\n\n    def forward_post(\n        self,\n        src,\n        src_mask: Optional[Tensor] = None,\n        src_key_padding_mask: 
Optional[Tensor] = None,\n        pos: Optional[Tensor] = None,\n    ):\n        q = k = self.with_pos_embed(src, pos)\n        src2 = self.self_attn(\n            q, k, value=src, attn_mask=src_mask, key_padding_mask=src_key_padding_mask\n        )[0]\n        src = src + self.dropout1(src2)\n        src = self.norm1(src)\n        src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))\n        src = src + self.dropout2(src2)\n        src = self.norm2(src)\n        return src\n\n    def forward_pre(\n        self,\n        src,\n        src_mask: Optional[Tensor] = None,\n        src_key_padding_mask: Optional[Tensor] = None,\n        pos: Optional[Tensor] = None,\n    ):\n        src2 = self.norm1(src)\n        q = k = self.with_pos_embed(src2, pos)\n        src2 = self.self_attn(\n            q, k, value=src2, attn_mask=src_mask, key_padding_mask=src_key_padding_mask\n        )[0]\n        src = src + self.dropout1(src2)\n        src2 = self.norm2(src)\n        src2 = self.linear2(self.dropout(self.activation(self.linear1(src2))))\n        src = src + self.dropout2(src2)\n        return src\n\n    def forward(\n        self,\n        src,\n        src_mask: Optional[Tensor] = None,\n        src_key_padding_mask: Optional[Tensor] = None,\n        pos: Optional[Tensor] = None,\n    ):\n        if self.normalize_before:\n            return self.forward_pre(src, src_mask, src_key_padding_mask, pos)\n        return self.forward_post(src, src_mask, src_key_padding_mask, pos)\n\n\nclass TransformerDecoderLayer(nn.Module):\n    def __init__(\n        self,\n        d_model,\n        nhead,\n        dim_feedforward=2048,\n        dropout=0.1,\n        activation=\"relu\",\n        normalize_before=False,\n    ):\n        super().__init__()\n        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)\n        self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)\n        # Implementation of Feedforward model\n       
 self.linear1 = nn.Linear(d_model, dim_feedforward)\n        self.dropout = nn.Dropout(dropout)\n        self.linear2 = nn.Linear(dim_feedforward, d_model)\n\n        self.norm1 = nn.LayerNorm(d_model)\n        self.norm2 = nn.LayerNorm(d_model)\n        self.norm3 = nn.LayerNorm(d_model)\n        self.dropout1 = nn.Dropout(dropout)\n        self.dropout2 = nn.Dropout(dropout)\n        self.dropout3 = nn.Dropout(dropout)\n\n        self.activation = _get_activation_fn(activation)\n        self.normalize_before = normalize_before\n\n    def with_pos_embed(self, tensor, pos: Optional[Tensor]):\n        return tensor if pos is None else tensor + pos\n\n    def forward_post(\n        self,\n        tgt,\n        memory,\n        tgt_mask: Optional[Tensor] = None,\n        memory_mask: Optional[Tensor] = None,\n        tgt_key_padding_mask: Optional[Tensor] = None,\n        memory_key_padding_mask: Optional[Tensor] = None,\n        pos: Optional[Tensor] = None,\n        query_pos: Optional[Tensor] = None,\n    ):\n        q = k = self.with_pos_embed(tgt, query_pos)\n        tgt2 = self.self_attn(\n            q, k, value=tgt, attn_mask=tgt_mask, key_padding_mask=tgt_key_padding_mask\n        )[0]\n        tgt = tgt + self.dropout1(tgt2)\n        tgt = self.norm1(tgt)\n        tgt2 = self.multihead_attn(\n            query=self.with_pos_embed(tgt, query_pos),\n            key=self.with_pos_embed(memory, pos),\n            value=memory,\n            attn_mask=memory_mask,\n            key_padding_mask=memory_key_padding_mask,\n        )[0]\n        tgt = tgt + self.dropout2(tgt2)\n        tgt = self.norm2(tgt)\n        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))\n        tgt = tgt + self.dropout3(tgt2)\n        tgt = self.norm3(tgt)\n        return tgt\n\n    def forward_pre(\n        self,\n        tgt,\n        memory,\n        tgt_mask: Optional[Tensor] = None,\n        memory_mask: Optional[Tensor] = None,\n        tgt_key_padding_mask: 
Optional[Tensor] = None,\n        memory_key_padding_mask: Optional[Tensor] = None,\n        pos: Optional[Tensor] = None,\n        query_pos: Optional[Tensor] = None,\n    ):\n        tgt2 = self.norm1(tgt)\n        q = k = self.with_pos_embed(tgt2, query_pos)\n        tgt2 = self.self_attn(\n            q, k, value=tgt2, attn_mask=tgt_mask, key_padding_mask=tgt_key_padding_mask\n        )[0]\n        tgt = tgt + self.dropout1(tgt2)\n        tgt2 = self.norm2(tgt)\n        tgt2 = self.multihead_attn(\n            query=self.with_pos_embed(tgt2, query_pos),\n            key=self.with_pos_embed(memory, pos),\n            value=memory,\n            attn_mask=memory_mask,\n            key_padding_mask=memory_key_padding_mask,\n        )[0]\n        tgt = tgt + self.dropout2(tgt2)\n        tgt2 = self.norm3(tgt)\n        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt2))))\n        tgt = tgt + self.dropout3(tgt2)\n        return tgt\n\n    def forward(\n        self,\n        tgt,\n        memory,\n        tgt_mask: Optional[Tensor] = None,\n        memory_mask: Optional[Tensor] = None,\n        tgt_key_padding_mask: Optional[Tensor] = None,\n        memory_key_padding_mask: Optional[Tensor] = None,\n        pos: Optional[Tensor] = None,\n        query_pos: Optional[Tensor] = None,\n    ):\n        if self.normalize_before:\n            return self.forward_pre(\n                tgt,\n                memory,\n                tgt_mask,\n                memory_mask,\n                tgt_key_padding_mask,\n                memory_key_padding_mask,\n                pos,\n                query_pos,\n            )\n        return self.forward_post(\n            tgt,\n            memory,\n            tgt_mask,\n            memory_mask,\n            tgt_key_padding_mask,\n            memory_key_padding_mask,\n            pos,\n            query_pos,\n        )\n\n\ndef _get_clones(module, N):\n    return nn.ModuleList([copy.deepcopy(module) for i in 
range(N)])\n\n\ndef _get_activation_fn(activation):\n    \"\"\"Return an activation function given a string\"\"\"\n    if activation == \"relu\":\n        return F.relu\n    if activation == \"gelu\":\n        return F.gelu\n    if activation == \"glu\":\n        return F.glu\n    raise RuntimeError(f\"activation should be relu/gelu/glu, not {activation}.\")\n"
  },
  {
    "path": "mfvis_nococo/mask2former/test_time_augmentation.py",
    "content": "import copy\nimport logging\nfrom itertools import count\n\nimport numpy as np\nimport torch\nfrom fvcore.transforms import HFlipTransform\nfrom torch import nn\nfrom torch.nn.parallel import DistributedDataParallel\n\nfrom detectron2.data.detection_utils import read_image\nfrom detectron2.modeling import DatasetMapperTTA\n\n\n__all__ = [\n    \"SemanticSegmentorWithTTA\",\n]\n\n\nclass SemanticSegmentorWithTTA(nn.Module):\n    \"\"\"\n    A SemanticSegmentor with test-time augmentation enabled.\n    Its :meth:`__call__` method has the same interface as :meth:`SemanticSegmentor.forward`.\n    \"\"\"\n\n    def __init__(self, cfg, model, tta_mapper=None, batch_size=1):\n        \"\"\"\n        Args:\n            cfg (CfgNode):\n            model (SemanticSegmentor): a SemanticSegmentor to apply TTA on.\n            tta_mapper (callable): takes a dataset dict and returns a list of\n                augmented versions of the dataset dict. Defaults to\n                `DatasetMapperTTA(cfg)`.\n            batch_size (int): batch the augmented images into this batch size for inference.\n        \"\"\"\n        super().__init__()\n        if isinstance(model, DistributedDataParallel):\n            model = model.module\n        self.cfg = cfg.clone()\n\n        self.model = model\n\n        if tta_mapper is None:\n            tta_mapper = DatasetMapperTTA(cfg)\n        self.tta_mapper = tta_mapper\n        self.batch_size = batch_size\n\n    def __call__(self, batched_inputs):\n        \"\"\"\n        Same input/output format as :meth:`SemanticSegmentor.forward`\n        \"\"\"\n\n        def _maybe_read_image(dataset_dict):\n            ret = copy.copy(dataset_dict)\n            if \"image\" not in ret:\n                image = read_image(ret.pop(\"file_name\"), self.model.input_format)\n                image = torch.from_numpy(np.ascontiguousarray(image.transpose(2, 0, 1)))  # CHW\n                ret[\"image\"] = image\n            if \"height\" not in 
ret and \"width\" not in ret:\n                # use ret[\"image\"] rather than the local `image`, which is only\n                # bound when the image was read from file above (avoids a NameError)\n                ret[\"height\"] = ret[\"image\"].shape[1]\n                ret[\"width\"] = ret[\"image\"].shape[2]\n            return ret\n\n        processed_results = []\n        for x in batched_inputs:\n            result = self._inference_one_image(_maybe_read_image(x))\n            processed_results.append(result)\n        return processed_results\n\n    def _inference_one_image(self, input):\n        \"\"\"\n        Args:\n            input (dict): one dataset dict with \"image\" field being a CHW tensor\n        Returns:\n            dict: one output dict\n        \"\"\"\n        orig_shape = (input[\"height\"], input[\"width\"])\n        augmented_inputs, tfms = self._get_augmented_inputs(input)\n\n        final_predictions = None\n        count_predictions = 0\n        for input, tfm in zip(augmented_inputs, tfms):\n            count_predictions += 1\n            with torch.no_grad():\n                if final_predictions is None:\n                    if any(isinstance(t, HFlipTransform) for t in tfm.transforms):\n                        final_predictions = self.model([input])[0].pop(\"sem_seg\").flip(dims=[2])\n                    else:\n                        final_predictions = self.model([input])[0].pop(\"sem_seg\")\n                else:\n                    if any(isinstance(t, HFlipTransform) for t in tfm.transforms):\n                        final_predictions += self.model([input])[0].pop(\"sem_seg\").flip(dims=[2])\n                    else:\n                        final_predictions += self.model([input])[0].pop(\"sem_seg\")\n\n        final_predictions = final_predictions / count_predictions\n        return {\"sem_seg\": final_predictions}\n\n    def _get_augmented_inputs(self, input):\n        augmented_inputs = self.tta_mapper(input)\n        tfms = [x.pop(\"transforms\") for x in augmented_inputs]\n        return augmented_inputs, tfms\n"
  },
  {
    "path": "mfvis_nococo/mask2former/utils/__init__.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n"
  },
  {
    "path": "mfvis_nococo/mask2former/utils/misc.py",
    "content": "# Modified by Bowen Cheng from https://github.com/facebookresearch/detr/blob/master/util/misc.py\n\"\"\"\nMisc functions, including distributed helpers.\n\nMostly copy-paste from torchvision references.\n\"\"\"\nfrom typing import List, Optional\n\nimport torch\nimport torch.distributed as dist\nimport torchvision\nfrom torch import Tensor\n\n\ndef _max_by_axis(the_list):\n    # type: (List[List[int]]) -> List[int]\n    maxes = the_list[0]\n    for sublist in the_list[1:]:\n        for index, item in enumerate(sublist):\n            maxes[index] = max(maxes[index], item)\n    return maxes\n\n\nclass NestedTensor(object):\n    def __init__(self, tensors, mask: Optional[Tensor]):\n        self.tensors = tensors\n        self.mask = mask\n\n    def to(self, device):\n        # type: (Device) -> NestedTensor # noqa\n        cast_tensor = self.tensors.to(device)\n        mask = self.mask\n        if mask is not None:\n            assert mask is not None\n            cast_mask = mask.to(device)\n        else:\n            cast_mask = None\n        return NestedTensor(cast_tensor, cast_mask)\n\n    def decompose(self):\n        return self.tensors, self.mask\n\n    def __repr__(self):\n        return str(self.tensors)\n\n\ndef nested_tensor_from_tensor_list(tensor_list: List[Tensor]):\n    # TODO make this more general\n    if tensor_list[0].ndim == 3:\n        if torchvision._is_tracing():\n            # nested_tensor_from_tensor_list() does not export well to ONNX\n            # call _onnx_nested_tensor_from_tensor_list() instead\n            return _onnx_nested_tensor_from_tensor_list(tensor_list)\n\n        # TODO make it support different-sized images\n        max_size = _max_by_axis([list(img.shape) for img in tensor_list])\n        # min_size = tuple(min(s) for s in zip(*[img.shape for img in tensor_list]))\n        batch_shape = [len(tensor_list)] + max_size\n        b, c, h, w = batch_shape\n        dtype = tensor_list[0].dtype\n        device = 
tensor_list[0].device\n        tensor = torch.zeros(batch_shape, dtype=dtype, device=device)\n        mask = torch.ones((b, h, w), dtype=torch.bool, device=device)\n        for img, pad_img, m in zip(tensor_list, tensor, mask):\n            pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)\n            m[: img.shape[1], : img.shape[2]] = False\n    else:\n        raise ValueError(\"not supported\")\n    return NestedTensor(tensor, mask)\n\n\n# _onnx_nested_tensor_from_tensor_list() is an implementation of\n# nested_tensor_from_tensor_list() that is supported by ONNX tracing.\n@torch.jit.unused\ndef _onnx_nested_tensor_from_tensor_list(tensor_list: List[Tensor]) -> NestedTensor:\n    max_size = []\n    for i in range(tensor_list[0].dim()):\n        max_size_i = torch.max(\n            torch.stack([img.shape[i] for img in tensor_list]).to(torch.float32)\n        ).to(torch.int64)\n        max_size.append(max_size_i)\n    max_size = tuple(max_size)\n\n    # work around for\n    # pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)\n    # m[: img.shape[1], :img.shape[2]] = False\n    # which is not yet supported in onnx\n    padded_imgs = []\n    padded_masks = []\n    for img in tensor_list:\n        padding = [(s1 - s2) for s1, s2 in zip(max_size, tuple(img.shape))]\n        padded_img = torch.nn.functional.pad(img, (0, padding[2], 0, padding[1], 0, padding[0]))\n        padded_imgs.append(padded_img)\n\n        m = torch.zeros_like(img[0], dtype=torch.int, device=img.device)\n        padded_mask = torch.nn.functional.pad(m, (0, padding[2], 0, padding[1]), \"constant\", 1)\n        padded_masks.append(padded_mask.to(torch.bool))\n\n    tensor = torch.stack(padded_imgs)\n    mask = torch.stack(padded_masks)\n\n    return NestedTensor(tensor, mask=mask)\n\n\ndef is_dist_avail_and_initialized():\n    if not dist.is_available():\n        return False\n    if not dist.is_initialized():\n        return False\n    return True\n"
  },
  {
    "path": "mfvis_nococo/mask2former_video/__init__.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\nfrom . import modeling\n\n# config\nfrom .config import add_maskformer2_video_config\n\n# models\nfrom .video_maskformer_model import VideoMaskFormer\n\n# video\nfrom .data_video import (\n    YTVISDatasetMapper,\n    YTVISEvaluator,\n    build_detection_train_loader,\n    build_detection_test_loader,\n    get_detection_dataset_dicts,\n)\n"
  },
  {
    "path": "mfvis_nococo/mask2former_video/config.py",
    "content": "# -*- coding: utf-8 -*-\n# Copyright (c) Facebook, Inc. and its affiliates.\nfrom detectron2.config import CfgNode as CN\n\n\ndef add_maskformer2_video_config(cfg):\n    # video data\n    # DataLoader\n    cfg.INPUT.SAMPLING_FRAME_NUM = 5 \n    cfg.INPUT.SAMPLING_FRAME_RANGE = 5\n    cfg.INPUT.SAMPLING_FRAME_SHUFFLE = True \n    cfg.INPUT.AUGMENTATIONS = [] # \"brightness\", \"contrast\", \"saturation\", \"rotation\"\n"
  },
  {
    "path": "mfvis_nococo/mask2former_video/data_video/__init__.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/sukjunhwang/IFC\n\nfrom .dataset_mapper import YTVISDatasetMapper, CocoClipDatasetMapper\nfrom .build import *\n\nfrom .datasets import *\nfrom .ytvis_eval import YTVISEvaluator\n"
  },
  {
    "path": "mfvis_nococo/mask2former_video/data_video/augmentation.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/sukjunhwang/IFC\n\nimport numpy as np\nimport logging\nimport sys\nfrom fvcore.transforms.transform import (\n    HFlipTransform,\n    NoOpTransform,\n    VFlipTransform,\n)\nfrom PIL import Image\n\nfrom detectron2.data import transforms as T\n\n\nclass ResizeShortestEdge(T.Augmentation):\n    \"\"\"\n    Scale the shorter edge to the given size, with a limit of `max_size` on the longer edge.\n    If `max_size` is reached, then downscale so that the longer edge does not exceed max_size.\n    \"\"\"\n\n    def __init__(\n        self, short_edge_length, max_size=sys.maxsize, sample_style=\"range\", interp=Image.BILINEAR, clip_frame_cnt=1\n    ):\n        \"\"\"\n        Args:\n            short_edge_length (list[int]): If ``sample_style==\"range\"``,\n                a [min, max] interval from which to sample the shortest edge length.\n                If ``sample_style==\"choice\"``, a list of shortest edge lengths to sample from.\n            max_size (int): maximum allowed longest edge length.\n            sample_style (str): either \"range\" or \"choice\".\n        \"\"\"\n        super().__init__()\n        assert sample_style in [\"range\", \"choice\", \"range_by_clip\", \"choice_by_clip\"], sample_style\n\n        self.is_range = (\"range\" in sample_style)\n        if isinstance(short_edge_length, int):\n            short_edge_length = (short_edge_length, short_edge_length)\n        if self.is_range:\n            assert len(short_edge_length) == 2, (\n                \"short_edge_length must be two values using 'range' sample style.\"\n                f\" Got {short_edge_length}!\"\n            )\n        self._cnt = 0\n        self._init(locals())\n\n    def get_transform(self, image):\n        if self._cnt % self.clip_frame_cnt == 0:\n            if self.is_range:\n                self.size = np.random.randint(self.short_edge_length[0], 
self.short_edge_length[1] + 1)\n            else:\n                self.size = np.random.choice(self.short_edge_length)\n            if self.size == 0:\n                return NoOpTransform()\n\n            self._cnt = 0   # avoiding overflow\n        self._cnt += 1\n\n        h, w = image.shape[:2]\n\n        scale = self.size * 1.0 / min(h, w)\n        if h < w:\n            newh, neww = self.size, scale * w\n        else:\n            newh, neww = scale * h, self.size\n        if max(newh, neww) > self.max_size:\n            scale = self.max_size * 1.0 / max(newh, neww)\n            newh = newh * scale\n            neww = neww * scale\n        neww = int(neww + 0.5)\n        newh = int(newh + 0.5)\n        return T.ResizeTransform(h, w, newh, neww, self.interp)\n\n\nclass RandomFlip(T.Augmentation):\n    \"\"\"\n    Flip the image horizontally or vertically with the given probability.\n    \"\"\"\n\n    def __init__(self, prob=0.5, *, horizontal=True, vertical=False, clip_frame_cnt=1):\n        \"\"\"\n        Args:\n            prob (float): probability of flip.\n            horizontal (boolean): whether to apply horizontal flipping\n            vertical (boolean): whether to apply vertical flipping\n        \"\"\"\n        super().__init__()\n\n        if horizontal and vertical:\n            raise ValueError(\"Cannot do both horiz and vert. 
Please use two Flip instead.\")\n        if not horizontal and not vertical:\n            raise ValueError(\"At least one of horiz or vert has to be True!\")\n        self._cnt = 0\n\n        self._init(locals())\n\n    def get_transform(self, image):\n        if self._cnt % self.clip_frame_cnt == 0:\n            self.do = self._rand_range() < self.prob\n            self._cnt = 0   # avoiding overflow\n        self._cnt += 1\n\n        h, w = image.shape[:2]\n\n        if self.do:\n            if self.horizontal:\n                return HFlipTransform(w)\n            elif self.vertical:\n                return VFlipTransform(h)\n        else:\n            return NoOpTransform()\n\n\ndef build_augmentation(cfg, is_train):\n    logger = logging.getLogger(__name__)\n    aug_list = []\n    if is_train:\n        # Crop\n        if cfg.INPUT.CROP.ENABLED:\n            aug_list.append(T.RandomCrop(cfg.INPUT.CROP.TYPE, cfg.INPUT.CROP.SIZE))\n\n        # Resize\n        min_size = cfg.INPUT.MIN_SIZE_TRAIN\n        max_size = cfg.INPUT.MAX_SIZE_TRAIN\n        sample_style = cfg.INPUT.MIN_SIZE_TRAIN_SAMPLING\n        ms_clip_frame_cnt = cfg.INPUT.SAMPLING_FRAME_NUM if \"by_clip\" in cfg.INPUT.MIN_SIZE_TRAIN_SAMPLING else 1\n        aug_list.append(ResizeShortestEdge(min_size, max_size, sample_style, clip_frame_cnt=ms_clip_frame_cnt))\n\n        # Flip\n        if cfg.INPUT.RANDOM_FLIP != \"none\":\n            if cfg.INPUT.RANDOM_FLIP == \"flip_by_clip\":\n                flip_clip_frame_cnt = cfg.INPUT.SAMPLING_FRAME_NUM\n            else:\n                flip_clip_frame_cnt = 1\n\n            aug_list.append(\n                # NOTE using RandomFlip modified for the support of flip maintenance\n                RandomFlip(\n                    horizontal=(cfg.INPUT.RANDOM_FLIP == \"horizontal\") or (cfg.INPUT.RANDOM_FLIP == \"flip_by_clip\"),\n                    vertical=cfg.INPUT.RANDOM_FLIP == \"vertical\",\n                    clip_frame_cnt=flip_clip_frame_cnt,\n       
         )\n            )\n\n        # Additional augmentations : brightness, contrast, saturation, rotation\n        augmentations = cfg.INPUT.AUGMENTATIONS\n        if \"brightness\" in augmentations:\n            aug_list.append(T.RandomBrightness(0.9, 1.1))\n        if \"contrast\" in augmentations:\n            aug_list.append(T.RandomContrast(0.9, 1.1))\n        if \"saturation\" in augmentations:\n            aug_list.append(T.RandomSaturation(0.9, 1.1))\n        if \"rotation\" in augmentations:\n            aug_list.append(\n                T.RandomRotation(\n                    [-15, 15], expand=False, center=[(0.4, 0.4), (0.6, 0.6)], sample_style=\"range\"\n                )\n            )\n    else:\n        # Resize\n        min_size = cfg.INPUT.MIN_SIZE_TEST\n        max_size = cfg.INPUT.MAX_SIZE_TEST\n        sample_style = \"choice\"\n        aug_list.append(T.ResizeShortestEdge(min_size, max_size, sample_style))\n\n    return aug_list\n"
  },
  {
    "path": "mfvis_nococo/mask2former_video/data_video/build.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/sukjunhwang/IFC\n\nimport itertools\nimport logging\nimport torch.utils.data\n\nfrom detectron2.config import CfgNode, configurable\nfrom detectron2.data.build import (\n    build_batch_data_loader,\n    load_proposals_into_dataset,\n    trivial_batch_collator,\n)\nfrom detectron2.data.catalog import DatasetCatalog\nfrom detectron2.data.common import DatasetFromList, MapDataset\nfrom detectron2.data.dataset_mapper import DatasetMapper\nfrom detectron2.data.samplers import InferenceSampler, TrainingSampler\nfrom detectron2.utils.comm import get_world_size\n\n\ndef _compute_num_images_per_worker(cfg: CfgNode):\n    num_workers = get_world_size()\n    images_per_batch = cfg.SOLVER.IMS_PER_BATCH\n    assert (\n        images_per_batch % num_workers == 0\n    ), \"SOLVER.IMS_PER_BATCH ({}) must be divisible by the number of workers ({}).\".format(\n        images_per_batch, num_workers\n    )\n    assert (\n        images_per_batch >= num_workers\n    ), \"SOLVER.IMS_PER_BATCH ({}) must be larger than the number of workers ({}).\".format(\n        images_per_batch, num_workers\n    )\n    images_per_worker = images_per_batch // num_workers\n    return images_per_worker\n\n\ndef filter_images_with_only_crowd_annotations(dataset_dicts, dataset_names):\n    \"\"\"\n    Filter out images with none annotations or only crowd annotations\n    (i.e., images without non-crowd annotations).\n    A common training-time preprocessing on COCO dataset.\n\n    Args:\n        dataset_dicts (list[dict]): annotations in Detectron2 Dataset format.\n\n    Returns:\n        list[dict]: the same format, but filtered.\n    \"\"\"\n    num_before = len(dataset_dicts)\n\n    def valid(anns):\n        for ann in anns:\n            if isinstance(ann, list):\n                for instance in ann:\n                    if instance.get(\"iscrowd\", 0) == 0:\n                        
return True\n            else:\n                if ann.get(\"iscrowd\", 0) == 0:\n                    return True\n        return False\n\n    dataset_dicts = [x for x in dataset_dicts if valid(x[\"annotations\"])]\n    num_after = len(dataset_dicts)\n    logger = logging.getLogger(__name__)\n    logger.info(\n        \"Removed {} images with no usable annotations. {} images left.\".format(\n            num_before - num_after, num_after\n        )\n    )\n    return dataset_dicts\n\n\ndef get_detection_dataset_dicts(\n    dataset_names, filter_empty=True, proposal_files=None\n):\n    \"\"\"\n    Load and prepare dataset dicts for instance detection/segmentation and semantic segmentation.\n\n    Args:\n        dataset_names (str or list[str]): a dataset name or a list of dataset names\n        filter_empty (bool): whether to filter out images without instance annotations\n        proposal_files (list[str]): if given, a list of object proposal files\n            that match each dataset in `dataset_names`.\n\n    Returns:\n        list[dict]: a list of dicts following the standard dataset dict format.\n    \"\"\"\n    if isinstance(dataset_names, str):\n        dataset_names = [dataset_names]\n    assert len(dataset_names)\n    dataset_dicts = [DatasetCatalog.get(dataset_name) for dataset_name in dataset_names]\n    for dataset_name, dicts in zip(dataset_names, dataset_dicts):\n        assert len(dicts), \"Dataset '{}' is empty!\".format(dataset_name)\n\n    if proposal_files is not None:\n        assert len(dataset_names) == len(proposal_files)\n        # load precomputed proposals from proposal files\n        dataset_dicts = [\n            load_proposals_into_dataset(dataset_i_dicts, proposal_file)\n            for dataset_i_dicts, proposal_file in zip(dataset_dicts, proposal_files)\n        ]\n\n    dataset_dicts = list(itertools.chain.from_iterable(dataset_dicts))\n\n    has_instances = \"annotations\" in dataset_dicts[0]\n    if filter_empty and has_instances:\n  
      dataset_dicts = filter_images_with_only_crowd_annotations(dataset_dicts, dataset_names)\n\n    assert len(dataset_dicts), \"No valid data found in {}.\".format(\",\".join(dataset_names))\n    return dataset_dicts\n\n\ndef _train_loader_from_config(cfg, mapper, *, dataset=None, sampler=None):\n    if dataset is None:\n        dataset = get_detection_dataset_dicts(\n            cfg.DATASETS.TRAIN,\n            filter_empty=cfg.DATALOADER.FILTER_EMPTY_ANNOTATIONS,\n            proposal_files=cfg.DATASETS.PROPOSAL_FILES_TRAIN if cfg.MODEL.LOAD_PROPOSALS else None,\n        )\n\n    if mapper is None:\n        mapper = DatasetMapper(cfg, True)\n\n    if sampler is None:\n        sampler_name = cfg.DATALOADER.SAMPLER_TRAIN\n        logger = logging.getLogger(__name__)\n        logger.info(\"Using training sampler {}\".format(sampler_name))\n        sampler = TrainingSampler(len(dataset))\n\n    return {\n        \"dataset\": dataset,\n        \"sampler\": sampler,\n        \"mapper\": mapper,\n        \"total_batch_size\": cfg.SOLVER.IMS_PER_BATCH,\n        \"aspect_ratio_grouping\": cfg.DATALOADER.ASPECT_RATIO_GROUPING,\n        \"num_workers\": cfg.DATALOADER.NUM_WORKERS,\n    }\n\n\n# TODO can allow dataset as an iterable or IterableDataset to make this function more general\n@configurable(from_config=_train_loader_from_config)\ndef build_detection_train_loader(\n    dataset, *, mapper, sampler=None, total_batch_size, aspect_ratio_grouping=True, num_workers=0\n):\n    \"\"\"\n    Build a dataloader for object detection with some default features.\n    This interface is experimental.\n\n    Args:\n        dataset (list or torch.utils.data.Dataset): a list of dataset dicts,\n            or a map-style pytorch dataset. 
They can be obtained by using\n            :func:`DatasetCatalog.get` or :func:`get_detection_dataset_dicts`.\n        mapper (callable): a callable which takes a sample (dict) from dataset and\n            returns the format to be consumed by the model.\n            When using cfg, the default choice is ``DatasetMapper(cfg, is_train=True)``.\n        sampler (torch.utils.data.sampler.Sampler or None): a sampler that\n            produces indices to be applied on ``dataset``.\n            Default to :class:`TrainingSampler`, which coordinates a random shuffle\n            sequence across all workers.\n        total_batch_size (int): total batch size across all workers. Batching\n            simply puts data into a list.\n        aspect_ratio_grouping (bool): whether to group images with similar\n            aspect ratio for efficiency. When enabled, it requires each\n            element in dataset be a dict with keys \"width\" and \"height\".\n        num_workers (int): number of parallel data loading workers\n\n    Returns:\n        torch.utils.data.DataLoader: a dataloader. 
Each output from it is a\n            ``list[mapped_element]`` of length ``total_batch_size / num_workers``,\n            where ``mapped_element`` is produced by the ``mapper``.\n    \"\"\"\n    if isinstance(dataset, list):\n        dataset = DatasetFromList(dataset, copy=False)\n    if mapper is not None:\n        dataset = MapDataset(dataset, mapper)\n    if sampler is None:\n        sampler = TrainingSampler(len(dataset))\n    assert isinstance(sampler, torch.utils.data.sampler.Sampler)\n    return build_batch_data_loader(\n        dataset,\n        sampler,\n        total_batch_size,\n        aspect_ratio_grouping=aspect_ratio_grouping,\n        num_workers=num_workers,\n    )\n\n\ndef _test_loader_from_config(cfg, dataset_name, mapper=None):\n    \"\"\"\n    Uses the given `dataset_name` argument (instead of the names in cfg), because the\n    standard practice is to evaluate each test set individually (not combining them).\n    \"\"\"\n    dataset = get_detection_dataset_dicts(\n        [dataset_name],\n        filter_empty=False,\n        proposal_files=[\n            cfg.DATASETS.PROPOSAL_FILES_TEST[list(cfg.DATASETS.TEST).index(dataset_name)]\n        ]\n        if cfg.MODEL.LOAD_PROPOSALS\n        else None,\n    )\n    if mapper is None:\n        mapper = DatasetMapper(cfg, False)\n    return {\"dataset\": dataset, \"mapper\": mapper, \"num_workers\": cfg.DATALOADER.NUM_WORKERS}\n\n\n@configurable(from_config=_test_loader_from_config)\ndef build_detection_test_loader(dataset, *, mapper, num_workers=0):\n    \"\"\"\n    Similar to `build_detection_train_loader`, but uses a batch size of 1.\n    This interface is experimental.\n\n    Args:\n        dataset (list or torch.utils.data.Dataset): a list of dataset dicts,\n            or a map-style pytorch dataset. 
They can be obtained by using\n            :func:`DatasetCatalog.get` or :func:`get_detection_dataset_dicts`.\n        mapper (callable): a callable which takes a sample (dict) from dataset\n           and returns the format to be consumed by the model.\n           When using cfg, the default choice is ``DatasetMapper(cfg, is_train=False)``.\n        num_workers (int): number of parallel data loading workers\n\n    Returns:\n        DataLoader: a torch DataLoader that loads the given detection\n        dataset, with test-time transformation and batching.\n\n    Examples:\n    ::\n        data_loader = build_detection_test_loader(\n            DatasetCatalog.get(\"my_test\"),\n            mapper=DatasetMapper(...))\n\n        # or, instantiate with a CfgNode:\n        data_loader = build_detection_test_loader(cfg, \"my_test\")\n    \"\"\"\n    if isinstance(dataset, list):\n        dataset = DatasetFromList(dataset, copy=False)\n    if mapper is not None:\n        dataset = MapDataset(dataset, mapper)\n    sampler = InferenceSampler(len(dataset))\n    # Always use 1 image per worker during inference since this is the\n    # standard when reporting inference time in papers.\n    batch_sampler = torch.utils.data.sampler.BatchSampler(sampler, 1, drop_last=False)\n    data_loader = torch.utils.data.DataLoader(\n        dataset,\n        num_workers=num_workers,\n        batch_sampler=batch_sampler,\n        collate_fn=trivial_batch_collator,\n    )\n    return data_loader\n"
  },
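The test loader above fixes the batch size at 1 per worker and uses `InferenceSampler` to split the dataset across distributed workers. As a rough illustration of that idea (this is not detectron2's actual implementation, and the helper name `shard_indices` is made up), the indices can be partitioned into contiguous, nearly equal shards:

```python
# Illustrative sketch only: split dataset indices into contiguous,
# nearly equal shards, one per distributed worker, the way an
# inference-time sampler conceptually does.
def shard_indices(total_size: int, world_size: int, rank: int) -> list:
    shard_sizes = [total_size // world_size] * world_size
    for r in range(total_size % world_size):
        shard_sizes[r] += 1  # spread the remainder over the first ranks
    begin = sum(shard_sizes[:rank])
    return list(range(begin, begin + shard_sizes[rank]))
```

Each rank then iterates its shard one image at a time (batch size 1), which is the convention used when reporting inference time.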
  {
    "path": "mfvis_nococo/mask2former_video/data_video/dataset_mapper.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/sukjunhwang/IFC\n\nimport copy\nimport logging\nimport random\nimport numpy as np\nfrom typing import List, Union\nimport torch\n\nfrom detectron2.config import configurable\nfrom detectron2.structures import (\n    BitMasks,\n    Boxes,\n    BoxMode,\n    Instances,\n)\n\nfrom detectron2.data import detection_utils as utils\nfrom detectron2.data import transforms as T\n\nfrom .augmentation import build_augmentation\nimport os\n\n__all__ = [\"YTVISDatasetMapper\", \"CocoClipDatasetMapper\"]\n\ndef seed_everything(seed):\n    random.seed(seed)\n    os.environ['PYTHONHASHSEED'] = str(seed)\n    np.random.seed(seed)\n    torch.manual_seed(seed)\n    # torch.cuda.manual_seed(seed)\n    # torch.backends.cudnn.deterministic = True\n    # torch.backends.cudnn.benchmark = True\n\ndef filter_empty_instances(instances, by_box=True, by_mask=True, box_threshold=1e-5):\n    \"\"\"\n    Filter out empty instances in an `Instances` object.\n\n    Args:\n        instances (Instances):\n        by_box (bool): whether to filter out instances with empty boxes\n        by_mask (bool): whether to filter out instances with empty masks\n        box_threshold (float): minimum width and height to be considered non-empty\n\n    Returns:\n        Instances: the filtered instances.\n    \"\"\"\n    assert by_box or by_mask\n    r = []\n    if by_box:\n        r.append(instances.gt_boxes.nonempty(threshold=box_threshold))\n    if instances.has(\"gt_masks\") and by_mask:\n        r.append(instances.gt_masks.nonempty())\n\n    if not r:\n        return instances\n    m = r[0]\n    for x in r[1:]:\n        m = m & x\n\n    instances.gt_ids[~m] = -1\n    return instances\n\n\ndef _get_dummy_anno(num_classes):\n    return {\n        \"iscrowd\": 0,\n        \"category_id\": num_classes,\n        \"id\": -1,\n        \"bbox\": np.array([0, 0, 0, 0]),\n        \"bbox_mode\": 
BoxMode.XYXY_ABS,\n        \"segmentation\": [np.array([0.0] * 6)]\n    }\n\n\ndef ytvis_annotations_to_instances(annos, image_size):\n    \"\"\"\n    Create an :class:`Instances` object used by the models,\n    from instance annotations in the dataset dict.\n\n    Args:\n        annos (list[dict]): a list of instance annotations in one image, each\n            element for one instance.\n        image_size (tuple): height, width\n\n    Returns:\n        Instances:\n            It will contain fields \"gt_boxes\", \"gt_classes\", \"gt_ids\",\n            \"gt_masks\", if they can be obtained from `annos`.\n            This is the format that builtin models expect.\n    \"\"\"\n    boxes = [BoxMode.convert(obj[\"bbox\"], obj[\"bbox_mode\"], BoxMode.XYXY_ABS) for obj in annos]\n    target = Instances(image_size)\n    target.gt_boxes = Boxes(boxes)\n\n    classes = [int(obj[\"category_id\"]) for obj in annos]\n    classes = torch.tensor(classes, dtype=torch.int64)\n    target.gt_classes = classes\n\n    ids = [int(obj[\"id\"]) for obj in annos]\n    ids = torch.tensor(ids, dtype=torch.int64)\n    target.gt_ids = ids\n\n    if len(annos) and \"segmentation\" in annos[0]:\n        segms = [obj[\"segmentation\"] for obj in annos]\n        masks = []\n        for segm in segms:\n            assert segm.ndim == 2, \"Expect segmentation of 2 dimensions, got {}.\".format(\n                segm.ndim\n            )\n            # mask array\n            masks.append(segm)\n        # torch.from_numpy does not support array with negative stride.\n        masks = BitMasks(\n            torch.stack([torch.from_numpy(np.ascontiguousarray(x)) for x in masks])\n        )\n        target.gt_masks = masks\n\n    return target\n\n\nclass YTVISDatasetMapper:\n    \"\"\"\n    A callable which takes a dataset dict in YouTube-VIS Dataset format,\n    and map it into a format used by the model.\n    \"\"\"\n\n    @configurable\n    def __init__(\n        self,\n        is_train: bool,\n       
 *,\n        augmentations: List[Union[T.Augmentation, T.Transform]],\n        image_format: str,\n        use_instance_mask: bool = False,\n        sampling_frame_num: int = 2,\n        sampling_frame_range: int = 5,\n        sampling_frame_shuffle: bool = False,\n        num_classes: int = 40,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            is_train: whether it's used in training or inference\n            augmentations: a list of augmentations or deterministic transforms to apply\n            image_format: an image format supported by :func:`detection_utils.read_image`.\n            use_instance_mask: whether to process instance segmentation annotations, if available\n        \"\"\"\n        # fmt: off\n        self.is_train               = is_train\n        self.augmentations          = T.AugmentationList(augmentations)\n        self.image_format           = image_format\n        self.use_instance_mask      = use_instance_mask\n        self.sampling_frame_num     = sampling_frame_num\n        self.sampling_frame_range   = sampling_frame_range\n        self.sampling_frame_shuffle = sampling_frame_shuffle\n        self.num_classes            = num_classes\n        # fmt: on\n        logger = logging.getLogger(__name__)\n        mode = \"training\" if is_train else \"inference\"\n        logger.info(f\"[DatasetMapper] Augmentations used in {mode}: {augmentations}\")\n        seed_everything(29118357)\n\n    @classmethod\n    def from_config(cls, cfg, is_train: bool = True):\n        augs = build_augmentation(cfg, is_train)\n\n        sampling_frame_num = cfg.INPUT.SAMPLING_FRAME_NUM\n        sampling_frame_range = cfg.INPUT.SAMPLING_FRAME_RANGE\n        sampling_frame_shuffle = cfg.INPUT.SAMPLING_FRAME_SHUFFLE\n\n        ret = {\n            \"is_train\": is_train,\n            \"augmentations\": augs,\n            \"image_format\": cfg.INPUT.FORMAT,\n            \"use_instance_mask\": cfg.MODEL.MASK_ON,\n          
  \"sampling_frame_num\": sampling_frame_num,\n            \"sampling_frame_range\": sampling_frame_range,\n            \"sampling_frame_shuffle\": sampling_frame_shuffle,\n            \"num_classes\": cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES,\n        }\n\n        return ret\n\n    def __call__(self, dataset_dict):\n        \"\"\"\n        Args:\n            dataset_dict (dict): Metadata of one video, in YTVIS Dataset format.\n\n        Returns:\n            dict: a format that builtin models in detectron2 accept\n        \"\"\"\n        # TODO consider examining below deepcopy as it costs huge amount of computations.\n        dataset_dict = copy.deepcopy(dataset_dict)  # it will be modified by code below\n\n        video_length = dataset_dict[\"length\"]\n        if self.is_train:\n            ref_frame = random.randrange(video_length)\n\n            start_idx = max(0, ref_frame-self.sampling_frame_range)\n            end_idx = min(video_length, ref_frame+self.sampling_frame_range + 1)\n\n            selected_idx = np.random.choice(\n                np.array(list(range(start_idx, ref_frame)) + list(range(ref_frame+1, end_idx))),\n                self.sampling_frame_num - 1,\n            )\n            selected_idx = selected_idx.tolist() + [ref_frame]\n            selected_idx = sorted(selected_idx)\n            # print('selected_idx:', selected_idx)\n            if self.sampling_frame_shuffle:\n                random.shuffle(selected_idx)\n        else:\n            selected_idx = range(video_length)\n\n        video_annos = dataset_dict.pop(\"annotations\", None)\n        file_names = dataset_dict.pop(\"file_names\", None)\n\n        if self.is_train:\n            _ids = set()\n            for frame_idx in selected_idx:\n                _ids.update([anno[\"id\"] for anno in video_annos[frame_idx]])\n            ids = dict()\n            for i, _id in enumerate(_ids):\n                ids[_id] = i\n\n        dataset_dict[\"image\"] = []\n        
dataset_dict[\"instances\"] = []\n        dataset_dict[\"file_names\"] = []\n        for frame_idx in selected_idx:\n            dataset_dict[\"file_names\"].append(file_names[frame_idx])\n\n            # Read image\n            image = utils.read_image(file_names[frame_idx], format=self.image_format)\n            utils.check_image_size(dataset_dict, image)\n\n            aug_input = T.AugInput(image)\n            transforms = self.augmentations(aug_input)\n            image = aug_input.image\n\n            image_shape = image.shape[:2]  # h, w\n            # Pytorch's dataloader is efficient on torch.Tensor due to shared-memory,\n            # but not efficient on large generic data structures due to the use of pickle & mp.Queue.\n            # Therefore it's important to use torch.Tensor.\n            dataset_dict[\"image\"].append(torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1))))\n\n            if (video_annos is None) or (not self.is_train):\n                continue\n\n            # NOTE copy() is to prevent annotations getting changed from applying augmentations\n            _frame_annos = []\n            for anno in video_annos[frame_idx]:\n                _anno = {}\n                for k, v in anno.items():\n                    _anno[k] = copy.deepcopy(v)\n                _frame_annos.append(_anno)\n\n            # USER: Implement additional transformations if you have other types of data\n            annos = [\n                utils.transform_instance_annotations(obj, transforms, image_shape)\n                for obj in _frame_annos\n                if obj.get(\"iscrowd\", 0) == 0\n            ]\n            sorted_annos = [_get_dummy_anno(self.num_classes) for _ in range(len(ids))]\n\n            for _anno in annos:\n                idx = ids[_anno[\"id\"]]\n                sorted_annos[idx] = _anno\n            _gt_ids = [_anno[\"id\"] for _anno in sorted_annos]\n\n            instances = utils.annotations_to_instances(sorted_annos, 
image_shape, mask_format=\"bitmask\")\n            instances.gt_ids = torch.tensor(_gt_ids)\n            if instances.has(\"gt_masks\"):\n                instances.gt_boxes = instances.gt_masks.get_bounding_boxes()\n                instances = filter_empty_instances(instances)\n            else:\n                instances.gt_masks = BitMasks(torch.empty((0, *image_shape)))\n            dataset_dict[\"instances\"].append(instances)\n\n        return dataset_dict\n\n\nclass CocoClipDatasetMapper:\n    \"\"\"\n    A callable which takes a COCO image which converts into multiple frames,\n    and map it into a format used by the model.\n    \"\"\"\n\n    @configurable\n    def __init__(\n        self,\n        is_train: bool,\n        *,\n        augmentations: List[Union[T.Augmentation, T.Transform]],\n        image_format: str,\n        use_instance_mask: bool = False,\n        sampling_frame_num: int = 2,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            is_train: whether it's used in training or inference\n            augmentations: a list of augmentations or deterministic transforms to apply\n            image_format: an image format supported by :func:`detection_utils.read_image`.\n            use_instance_mask: whether to process instance segmentation annotations, if available\n        \"\"\"\n        # fmt: off\n        self.is_train               = is_train\n        self.augmentations          = T.AugmentationList(augmentations)\n        self.image_format           = image_format\n        self.use_instance_mask      = use_instance_mask\n        self.sampling_frame_num     = sampling_frame_num\n        # fmt: on\n        logger = logging.getLogger(__name__)\n        mode = \"training\" if is_train else \"inference\"\n        logger.info(f\"[DatasetMapper] Augmentations used in {mode}: {augmentations}\")\n\n    @classmethod\n    def from_config(cls, cfg, is_train: bool = True):\n        augs = 
build_augmentation(cfg, is_train)\n\n        sampling_frame_num = cfg.INPUT.SAMPLING_FRAME_NUM\n\n        ret = {\n            \"is_train\": is_train,\n            \"augmentations\": augs,\n            \"image_format\": cfg.INPUT.FORMAT,\n            \"use_instance_mask\": cfg.MODEL.MASK_ON,\n            \"sampling_frame_num\": sampling_frame_num,\n        }\n\n        return ret\n\n    def __call__(self, dataset_dict):\n        \"\"\"\n        Args:\n            dataset_dict (dict): Metadata of one image, in Detectron2 Dataset format.\n\n        Returns:\n            dict: a format that builtin models in detectron2 accept\n        \"\"\"\n        dataset_dict = copy.deepcopy(dataset_dict)  # it will be modified by code below\n\n        img_annos = dataset_dict.pop(\"annotations\", None)\n        file_name = dataset_dict.pop(\"file_name\", None)\n        original_image = utils.read_image(file_name, format=self.image_format)\n\n        dataset_dict[\"image\"] = []\n        dataset_dict[\"instances\"] = []\n        dataset_dict[\"file_names\"] = [file_name] * self.sampling_frame_num\n        for _ in range(self.sampling_frame_num):\n            utils.check_image_size(dataset_dict, original_image)\n\n            aug_input = T.AugInput(original_image)\n            transforms = self.augmentations(aug_input)\n            image = aug_input.image\n\n            image_shape = image.shape[:2]  # h, w\n            # Pytorch's dataloader is efficient on torch.Tensor due to shared-memory,\n            # but not efficient on large generic data structures due to the use of pickle & mp.Queue.\n            # Therefore it's important to use torch.Tensor.\n            dataset_dict[\"image\"].append(torch.as_tensor(np.ascontiguousarray(image.transpose(2, 0, 1))))\n\n            if (img_annos is None) or (not self.is_train):\n                continue\n\n            _img_annos = []\n            for anno in img_annos:\n                _anno = {}\n                for k, v in 
anno.items():\n                    _anno[k] = copy.deepcopy(v)\n                _img_annos.append(_anno)\n\n            # USER: Implement additional transformations if you have other types of data\n            annos = [\n                utils.transform_instance_annotations(obj, transforms, image_shape)\n                for obj in _img_annos\n                if obj.get(\"iscrowd\", 0) == 0\n            ]\n            _gt_ids = list(range(len(annos)))\n            for idx in range(len(annos)):\n                if len(annos[idx][\"segmentation\"]) == 0:\n                    annos[idx][\"segmentation\"] = [np.array([0.0] * 6)]\n\n            instances = utils.annotations_to_instances(annos, image_shape, mask_format=\"bitmask\")\n            instances.gt_ids = torch.tensor(_gt_ids)\n            if instances.has(\"gt_masks\"):\n                instances.gt_boxes = instances.gt_masks.get_bounding_boxes()\n                instances = filter_empty_instances(instances)\n            else:\n                instances.gt_masks = BitMasks(torch.empty((0, *image_shape)))\n            dataset_dict[\"instances\"].append(instances)\n\n        return dataset_dict\n"
  },
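`YTVISDatasetMapper.__call__` builds each training clip by picking a random reference frame and drawing the remaining frames from a window of +/- `sampling_frame_range` around it. A standalone sketch of that sampling step (the function name `sample_clip_indices` is ours, not part of the codebase):

```python
import random

import numpy as np

# Sketch of the training-time frame sampling in YTVISDatasetMapper:
# choose a reference frame, then draw sampling_frame_num - 1 extra
# frames from the surrounding window (reference excluded), and return
# the sorted clip indices. Assumes video_length >= 2.
def sample_clip_indices(video_length, sampling_frame_num=2, sampling_frame_range=5):
    ref_frame = random.randrange(video_length)
    start_idx = max(0, ref_frame - sampling_frame_range)
    end_idx = min(video_length, ref_frame + sampling_frame_range + 1)
    candidates = list(range(start_idx, ref_frame)) + list(range(ref_frame + 1, end_idx))
    # NOTE: like the mapper, this samples with replacement
    # (np.random.choice default), so clips can repeat a frame index.
    extra = np.random.choice(np.array(candidates), sampling_frame_num - 1).tolist()
    return sorted(extra + [ref_frame])
```

With the defaults, every sampled frame lies within `sampling_frame_range` of the reference, so a two-frame clip spans at most 5 frames.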
  {
    "path": "mfvis_nococo/mask2former_video/data_video/datasets/__init__.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/sukjunhwang/IFC\n\nfrom . import builtin  # ensure the builtin datasets are registered\n\n__all__ = [k for k in globals().keys() if \"builtin\" not in k and not k.startswith(\"_\")]\n"
  },
  {
    "path": "mfvis_nococo/mask2former_video/data_video/datasets/builtin.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/sukjunhwang/IFC\n\nimport os\n\nfrom .ytvis import (\n    register_ytvis_instances,\n    _get_ytvis_2019_instances_meta,\n    _get_ytvis_2021_instances_meta,\n)\n\n# ==== Predefined splits for YTVIS 2019 ===========\n_PREDEFINED_SPLITS_YTVIS_2019 = {\n    \"ytvis_2019_train\": (\"ytvis_2019/train/JPEGImages\",\n                         \"ytvis_2019/train.json\"),\n    \"ytvis_2019_val\": (\"ytvis_2019/valid/JPEGImages\",\n                       \"ytvis_2019/valid.json\"),\n    \"ytvis_2019_test\": (\"ytvis_2019/test/JPEGImages\",\n                        \"ytvis_2019/test.json\"),\n}\n\n\n# ==== Predefined splits for YTVIS 2021 ===========\n_PREDEFINED_SPLITS_YTVIS_2021 = {\n    \"ytvis_2021_train\": (\"ytvis_2021/train/JPEGImages\",\n                         \"ytvis_2021/train.json\"),\n    \"ytvis_2021_val\": (\"ytvis_2021/valid/JPEGImages\",\n                       \"ytvis_2021/valid.json\"),\n    \"ytvis_2021_test\": (\"ytvis_2021/test/JPEGImages\",\n                        \"ytvis_2021/test.json\"),\n}\n\n\ndef register_all_ytvis_2019(root):\n    for key, (image_root, json_file) in _PREDEFINED_SPLITS_YTVIS_2019.items():\n        # Assume pre-defined datasets live in `./datasets`.\n        register_ytvis_instances(\n            key,\n            _get_ytvis_2019_instances_meta(),\n            os.path.join(root, json_file) if \"://\" not in json_file else json_file,\n            os.path.join(root, image_root),\n        )\n\n\ndef register_all_ytvis_2021(root):\n    for key, (image_root, json_file) in _PREDEFINED_SPLITS_YTVIS_2021.items():\n        # Assume pre-defined datasets live in `./datasets`.\n        register_ytvis_instances(\n            key,\n            _get_ytvis_2021_instances_meta(),\n            os.path.join(root, json_file) if \"://\" not in json_file else json_file,\n            os.path.join(root, image_root),\n        )\n\n\nif 
__name__.endswith(\".builtin\"):\n    # Assume pre-defined datasets live in `./datasets`.\n    _root = os.getenv(\"DETECTRON2_DATASETS\", \"datasets\")\n    register_all_ytvis_2019(_root)\n    register_all_ytvis_2021(_root)\n"
  },
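In `register_all_ytvis_2019` and `register_all_ytvis_2021`, annotation paths are only joined onto the `DETECTRON2_DATASETS` root when they are local; anything that already looks like a remote URI (contains `://`) is passed through untouched. A minimal sketch of that resolution rule (the helper name `resolve_json_path` is ours):

```python
import os

# Sketch of the path handling in register_all_ytvis_*: keep URI-style
# entries as-is, join plain relative paths onto the datasets root.
def resolve_json_path(root: str, json_file: str) -> str:
    return json_file if "://" in json_file else os.path.join(root, json_file)
```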
  {
    "path": "mfvis_nococo/mask2former_video/data_video/datasets/ytvis.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/sukjunhwang/IFC\n\nimport contextlib\nimport io\nimport json\nimport logging\nimport numpy as np\nimport os\nimport pycocotools.mask as mask_util\nfrom fvcore.common.file_io import PathManager\nfrom fvcore.common.timer import Timer\n\nfrom detectron2.structures import Boxes, BoxMode, PolygonMasks\nfrom detectron2.data import DatasetCatalog, MetadataCatalog\n\n\"\"\"\nThis file contains functions to parse YTVIS dataset of\nCOCO-format annotations into dicts in \"Detectron2 format\".\n\"\"\"\n\nlogger = logging.getLogger(__name__)\n\n__all__ = [\"load_ytvis_json\", \"register_ytvis_instances\"]\n\n\nYTVIS_CATEGORIES_2019 = [\n    {\"color\": [220, 20, 60], \"isthing\": 1, \"id\": 1, \"name\": \"person\"},\n    {\"color\": [0, 82, 0], \"isthing\": 1, \"id\": 2, \"name\": \"giant_panda\"},\n    {\"color\": [119, 11, 32], \"isthing\": 1, \"id\": 3, \"name\": \"lizard\"},\n    {\"color\": [165, 42, 42], \"isthing\": 1, \"id\": 4, \"name\": \"parrot\"},\n    {\"color\": [134, 134, 103], \"isthing\": 1, \"id\": 5, \"name\": \"skateboard\"},\n    {\"color\": [0, 0, 142], \"isthing\": 1, \"id\": 6, \"name\": \"sedan\"},\n    {\"color\": [255, 109, 65], \"isthing\": 1, \"id\": 7, \"name\": \"ape\"},\n    {\"color\": [0, 226, 252], \"isthing\": 1, \"id\": 8, \"name\": \"dog\"},\n    {\"color\": [5, 121, 0], \"isthing\": 1, \"id\": 9, \"name\": \"snake\"},\n    {\"color\": [0, 60, 100], \"isthing\": 1, \"id\": 10, \"name\": \"monkey\"},\n    {\"color\": [250, 170, 30], \"isthing\": 1, \"id\": 11, \"name\": \"hand\"},\n    {\"color\": [100, 170, 30], \"isthing\": 1, \"id\": 12, \"name\": \"rabbit\"},\n    {\"color\": [179, 0, 194], \"isthing\": 1, \"id\": 13, \"name\": \"duck\"},\n    {\"color\": [255, 77, 255], \"isthing\": 1, \"id\": 14, \"name\": \"cat\"},\n    {\"color\": [120, 166, 157], \"isthing\": 1, \"id\": 15, \"name\": \"cow\"},\n    {\"color\": [73, 
77, 174], \"isthing\": 1, \"id\": 16, \"name\": \"fish\"},\n    {\"color\": [0, 80, 100], \"isthing\": 1, \"id\": 17, \"name\": \"train\"},\n    {\"color\": [182, 182, 255], \"isthing\": 1, \"id\": 18, \"name\": \"horse\"},\n    {\"color\": [0, 143, 149], \"isthing\": 1, \"id\": 19, \"name\": \"turtle\"},\n    {\"color\": [174, 57, 255], \"isthing\": 1, \"id\": 20, \"name\": \"bear\"},\n    {\"color\": [0, 0, 230], \"isthing\": 1, \"id\": 21, \"name\": \"motorbike\"},\n    {\"color\": [72, 0, 118], \"isthing\": 1, \"id\": 22, \"name\": \"giraffe\"},\n    {\"color\": [255, 179, 240], \"isthing\": 1, \"id\": 23, \"name\": \"leopard\"},\n    {\"color\": [0, 125, 92], \"isthing\": 1, \"id\": 24, \"name\": \"fox\"},\n    {\"color\": [209, 0, 151], \"isthing\": 1, \"id\": 25, \"name\": \"deer\"},\n    {\"color\": [188, 208, 182], \"isthing\": 1, \"id\": 26, \"name\": \"owl\"},\n    {\"color\": [145, 148, 174], \"isthing\": 1, \"id\": 27, \"name\": \"surfboard\"},\n    {\"color\": [106, 0, 228], \"isthing\": 1, \"id\": 28, \"name\": \"airplane\"},\n    {\"color\": [0, 0, 70], \"isthing\": 1, \"id\": 29, \"name\": \"truck\"},\n    {\"color\": [199, 100, 0], \"isthing\": 1, \"id\": 30, \"name\": \"zebra\"},\n    {\"color\": [166, 196, 102], \"isthing\": 1, \"id\": 31, \"name\": \"tiger\"},\n    {\"color\": [110, 76, 0], \"isthing\": 1, \"id\": 32, \"name\": \"elephant\"},\n    {\"color\": [133, 129, 255], \"isthing\": 1, \"id\": 33, \"name\": \"snowboard\"},\n    {\"color\": [0, 0, 192], \"isthing\": 1, \"id\": 34, \"name\": \"boat\"},\n    {\"color\": [183, 130, 88], \"isthing\": 1, \"id\": 35, \"name\": \"shark\"},\n    {\"color\": [130, 114, 135], \"isthing\": 1, \"id\": 36, \"name\": \"mouse\"},\n    {\"color\": [107, 142, 35], \"isthing\": 1, \"id\": 37, \"name\": \"frog\"},\n    {\"color\": [0, 228, 0], \"isthing\": 1, \"id\": 38, \"name\": \"eagle\"},\n    {\"color\": [174, 255, 243], \"isthing\": 1, \"id\": 39, \"name\": \"earless_seal\"},\n    {\"color\": [255, 
208, 186], \"isthing\": 1, \"id\": 40, \"name\": \"tennis_racket\"},\n]\n\n\nYTVIS_CATEGORIES_2021 = [\n    {\"color\": [106, 0, 228], \"isthing\": 1, \"id\": 1, \"name\": \"airplane\"},\n    {\"color\": [174, 57, 255], \"isthing\": 1, \"id\": 2, \"name\": \"bear\"},\n    {\"color\": [255, 109, 65], \"isthing\": 1, \"id\": 3, \"name\": \"bird\"},\n    {\"color\": [0, 0, 192], \"isthing\": 1, \"id\": 4, \"name\": \"boat\"},\n    {\"color\": [0, 0, 142], \"isthing\": 1, \"id\": 5, \"name\": \"car\"},\n    {\"color\": [255, 77, 255], \"isthing\": 1, \"id\": 6, \"name\": \"cat\"},\n    {\"color\": [120, 166, 157], \"isthing\": 1, \"id\": 7, \"name\": \"cow\"},\n    {\"color\": [209, 0, 151], \"isthing\": 1, \"id\": 8, \"name\": \"deer\"},\n    {\"color\": [0, 226, 252], \"isthing\": 1, \"id\": 9, \"name\": \"dog\"},\n    {\"color\": [179, 0, 194], \"isthing\": 1, \"id\": 10, \"name\": \"duck\"},\n    {\"color\": [174, 255, 243], \"isthing\": 1, \"id\": 11, \"name\": \"earless_seal\"},\n    {\"color\": [110, 76, 0], \"isthing\": 1, \"id\": 12, \"name\": \"elephant\"},\n    {\"color\": [73, 77, 174], \"isthing\": 1, \"id\": 13, \"name\": \"fish\"},\n    {\"color\": [250, 170, 30], \"isthing\": 1, \"id\": 14, \"name\": \"flying_disc\"},\n    {\"color\": [0, 125, 92], \"isthing\": 1, \"id\": 15, \"name\": \"fox\"},\n    {\"color\": [107, 142, 35], \"isthing\": 1, \"id\": 16, \"name\": \"frog\"},\n    {\"color\": [0, 82, 0], \"isthing\": 1, \"id\": 17, \"name\": \"giant_panda\"},\n    {\"color\": [72, 0, 118], \"isthing\": 1, \"id\": 18, \"name\": \"giraffe\"},\n    {\"color\": [182, 182, 255], \"isthing\": 1, \"id\": 19, \"name\": \"horse\"},\n    {\"color\": [255, 179, 240], \"isthing\": 1, \"id\": 20, \"name\": \"leopard\"},\n    {\"color\": [119, 11, 32], \"isthing\": 1, \"id\": 21, \"name\": \"lizard\"},\n    {\"color\": [0, 60, 100], \"isthing\": 1, \"id\": 22, \"name\": \"monkey\"},\n    {\"color\": [0, 0, 230], \"isthing\": 1, \"id\": 23, \"name\": \"motorbike\"},\n 
   {\"color\": [130, 114, 135], \"isthing\": 1, \"id\": 24, \"name\": \"mouse\"},\n    {\"color\": [165, 42, 42], \"isthing\": 1, \"id\": 25, \"name\": \"parrot\"},\n    {\"color\": [220, 20, 60], \"isthing\": 1, \"id\": 26, \"name\": \"person\"},\n    {\"color\": [100, 170, 30], \"isthing\": 1, \"id\": 27, \"name\": \"rabbit\"},\n    {\"color\": [183, 130, 88], \"isthing\": 1, \"id\": 28, \"name\": \"shark\"},\n    {\"color\": [134, 134, 103], \"isthing\": 1, \"id\": 29, \"name\": \"skateboard\"},\n    {\"color\": [5, 121, 0], \"isthing\": 1, \"id\": 30, \"name\": \"snake\"},\n    {\"color\": [133, 129, 255], \"isthing\": 1, \"id\": 31, \"name\": \"snowboard\"},\n    {\"color\": [188, 208, 182], \"isthing\": 1, \"id\": 32, \"name\": \"squirrel\"},\n    {\"color\": [145, 148, 174], \"isthing\": 1, \"id\": 33, \"name\": \"surfboard\"},\n    {\"color\": [255, 208, 186], \"isthing\": 1, \"id\": 34, \"name\": \"tennis_racket\"},\n    {\"color\": [166, 196, 102], \"isthing\": 1, \"id\": 35, \"name\": \"tiger\"},\n    {\"color\": [0, 80, 100], \"isthing\": 1, \"id\": 36, \"name\": \"train\"},\n    {\"color\": [0, 0, 70], \"isthing\": 1, \"id\": 37, \"name\": \"truck\"},\n    {\"color\": [0, 143, 149], \"isthing\": 1, \"id\": 38, \"name\": \"turtle\"},\n    {\"color\": [0, 228, 0], \"isthing\": 1, \"id\": 39, \"name\": \"whale\"},\n    {\"color\": [199, 100, 0], \"isthing\": 1, \"id\": 40, \"name\": \"zebra\"},\n]\n\n\ndef _get_ytvis_2019_instances_meta():\n    thing_ids = [k[\"id\"] for k in YTVIS_CATEGORIES_2019 if k[\"isthing\"] == 1]\n    thing_colors = [k[\"color\"] for k in YTVIS_CATEGORIES_2019 if k[\"isthing\"] == 1]\n    assert len(thing_ids) == 40, len(thing_ids)\n    # Mapping from the incontiguous YTVIS category id to an id in [0, 39]\n    thing_dataset_id_to_contiguous_id = {k: i for i, k in enumerate(thing_ids)}\n    thing_classes = [k[\"name\"] for k in YTVIS_CATEGORIES_2019 if k[\"isthing\"] == 1]\n    ret = {\n        
\"thing_dataset_id_to_contiguous_id\": thing_dataset_id_to_contiguous_id,\n        \"thing_classes\": thing_classes,\n        \"thing_colors\": thing_colors,\n    }\n    return ret\n\n\ndef _get_ytvis_2021_instances_meta():\n    thing_ids = [k[\"id\"] for k in YTVIS_CATEGORIES_2021 if k[\"isthing\"] == 1]\n    thing_colors = [k[\"color\"] for k in YTVIS_CATEGORIES_2021 if k[\"isthing\"] == 1]\n    assert len(thing_ids) == 40, len(thing_ids)\n    # Mapping from the incontiguous YTVIS category id to an id in [0, 39]\n    thing_dataset_id_to_contiguous_id = {k: i for i, k in enumerate(thing_ids)}\n    thing_classes = [k[\"name\"] for k in YTVIS_CATEGORIES_2021 if k[\"isthing\"] == 1]\n    ret = {\n        \"thing_dataset_id_to_contiguous_id\": thing_dataset_id_to_contiguous_id,\n        \"thing_classes\": thing_classes,\n        \"thing_colors\": thing_colors,\n    }\n    return ret\n\n\ndef load_ytvis_json(json_file, image_root, dataset_name=None, extra_annotation_keys=None):\n    from .ytvis_api.ytvos import YTVOS\n\n    timer = Timer()\n    json_file = PathManager.get_local_path(json_file)\n    with contextlib.redirect_stdout(io.StringIO()):\n        ytvis_api = YTVOS(json_file)\n    if timer.seconds() > 1:\n        logger.info(\"Loading {} takes {:.2f} seconds.\".format(json_file, timer.seconds()))\n\n    id_map = None\n    if dataset_name is not None:\n        meta = MetadataCatalog.get(dataset_name)\n        cat_ids = sorted(ytvis_api.getCatIds())\n        cats = ytvis_api.loadCats(cat_ids)\n        # The categories in a custom json file may not be sorted.\n        thing_classes = [c[\"name\"] for c in sorted(cats, key=lambda x: x[\"id\"])]\n        meta.thing_classes = thing_classes\n\n        # In COCO, certain category ids are artificially removed,\n        # and by convention they are always ignored.\n        # We deal with COCO's id issue and translate\n        # the category ids to contiguous ids in [0, 80).\n\n        # It works by looking at the 
\"categories\" field in the json, therefore\n        # if users' own json also have incontiguous ids, we'll\n        # apply this mapping as well but print a warning.\n        if not (min(cat_ids) == 1 and max(cat_ids) == len(cat_ids)):\n            if \"coco\" not in dataset_name:\n                logger.warning(\n                    \"\"\"\nCategory ids in annotations are not in [1, #categories]! We'll apply a mapping for you.\n\"\"\"\n                )\n        id_map = {v: i for i, v in enumerate(cat_ids)}\n        meta.thing_dataset_id_to_contiguous_id = id_map\n\n    # sort indices for reproducible results\n    vid_ids = sorted(ytvis_api.vids.keys())\n    # vids is a list of dicts, each looks something like:\n    # {'license': 1,\n    #  'flickr_url': ' ',\n    #  'file_names': ['ff25f55852/00000.jpg', 'ff25f55852/00005.jpg', ..., 'ff25f55852/00175.jpg'],\n    #  'height': 720,\n    #  'width': 1280,\n    #  'length': 36,\n    #  'date_captured': '2019-04-11 00:55:41.903902',\n    #  'id': 2232}\n    vids = ytvis_api.loadVids(vid_ids)\n\n    anns = [ytvis_api.vidToAnns[vid_id] for vid_id in vid_ids]\n    total_num_valid_anns = sum([len(x) for x in anns])\n    total_num_anns = len(ytvis_api.anns)\n    if total_num_valid_anns < total_num_anns:\n        logger.warning(\n            f\"{json_file} contains {total_num_anns} annotations, but only \"\n            f\"{total_num_valid_anns} of them match to images in the file.\"\n        )\n\n    vids_anns = list(zip(vids, anns))\n    logger.info(\"Loaded {} videos in YTVIS format from {}\".format(len(vids_anns), json_file))\n\n    dataset_dicts = []\n\n    ann_keys = [\"iscrowd\", \"category_id\", \"id\"] + (extra_annotation_keys or [])\n\n    num_instances_without_valid_segmentation = 0\n\n    for (vid_dict, anno_dict_list) in vids_anns:\n        record = {}\n        record[\"file_names\"] = [os.path.join(image_root, vid_dict[\"file_names\"][i]) for i in range(vid_dict[\"length\"])]\n        record[\"height\"] = 
vid_dict[\"height\"]\n        record[\"width\"] = vid_dict[\"width\"]\n        record[\"length\"] = vid_dict[\"length\"]\n        video_id = record[\"video_id\"] = vid_dict[\"id\"]\n\n        video_objs = []\n        for frame_idx in range(record[\"length\"]):\n            frame_objs = []\n            for anno in anno_dict_list:\n                assert anno[\"video_id\"] == video_id\n\n                obj = {key: anno[key] for key in ann_keys if key in anno}\n\n                _bboxes = anno.get(\"bboxes\", None)\n                _segm = anno.get(\"segmentations\", None)\n\n                if not (_bboxes and _segm and _bboxes[frame_idx] and _segm[frame_idx]):\n                    continue\n\n                bbox = _bboxes[frame_idx]\n                segm = _segm[frame_idx]\n\n                obj[\"bbox\"] = bbox\n                obj[\"bbox_mode\"] = BoxMode.XYWH_ABS\n\n                if isinstance(segm, dict):\n                    if isinstance(segm[\"counts\"], list):\n                        # convert to compressed RLE\n                        segm = mask_util.frPyObjects(segm, *segm[\"size\"])\n                elif segm:\n                    # filter out invalid polygons (< 3 points)\n                    segm = [poly for poly in segm if len(poly) % 2 == 0 and len(poly) >= 6]\n                    if len(segm) == 0:\n                        num_instances_without_valid_segmentation += 1\n                        continue  # ignore this instance\n                obj[\"segmentation\"] = segm\n\n                if id_map:\n                    obj[\"category_id\"] = id_map[obj[\"category_id\"]]\n                frame_objs.append(obj)\n            video_objs.append(frame_objs)\n        record[\"annotations\"] = video_objs\n        dataset_dicts.append(record)\n\n    if num_instances_without_valid_segmentation > 0:\n        logger.warning(\n            \"Filtered out {} instances without valid segmentation. 
\".format(\n                num_instances_without_valid_segmentation\n            )\n            + \"There might be issues in your dataset generation process. \"\n            \"A valid polygon should be a list[float] with even length >= 6.\"\n        )\n    return dataset_dicts\n\n\ndef register_ytvis_instances(name, metadata, json_file, image_root):\n    \"\"\"\n    Register a dataset in YTVIS's json annotation format for\n    instance tracking.\n\n    Args:\n        name (str): the name that identifies a dataset, e.g. \"ytvis_train\".\n        metadata (dict): extra metadata associated with this dataset.  You can\n            leave it as an empty dict.\n        json_file (str): path to the json instance annotation file.\n        image_root (str or path-like): directory which contains all the images.\n    \"\"\"\n    assert isinstance(name, str), name\n    assert isinstance(json_file, (str, os.PathLike)), json_file\n    assert isinstance(image_root, (str, os.PathLike)), image_root\n    # 1. register a function which returns dicts\n    DatasetCatalog.register(name, lambda: load_ytvis_json(json_file, image_root, name))\n\n    # 2. 
Optionally, add metadata about this dataset,\n    # since they might be useful in evaluation, visualization or logging\n    MetadataCatalog.get(name).set(\n        json_file=json_file, image_root=image_root, evaluator_type=\"ytvis\", **metadata\n    )\n\n\nif __name__ == \"__main__\":\n    \"\"\"\n    Test the YTVIS json dataset loader.\n    \"\"\"\n    from detectron2.utils.logger import setup_logger\n    from detectron2.utils.visualizer import Visualizer\n    import detectron2.data.datasets  # noqa # add pre-defined metadata\n    import sys\n    from PIL import Image\n\n    logger = setup_logger(name=__name__)\n    #assert sys.argv[3] in DatasetCatalog.list()\n    meta = MetadataCatalog.get(\"ytvis_2019_train\")\n\n    json_file = \"./datasets/ytvis/instances_train_sub.json\"\n    image_root = \"./datasets/ytvis/train/JPEGImages\"\n    dicts = load_ytvis_json(json_file, image_root, dataset_name=\"ytvis_2019_train\")\n    logger.info(\"Done loading {} samples.\".format(len(dicts)))\n\n    dirname = \"ytvis-data-vis\"\n    os.makedirs(dirname, exist_ok=True)\n\n    def extract_frame_dic(dic, frame_idx):\n        import copy\n        frame_dic = copy.deepcopy(dic)\n        annos = frame_dic.get(\"annotations\", None)\n        if annos:\n            frame_dic[\"annotations\"] = annos[frame_idx]\n\n        return frame_dic\n\n    for d in dicts:\n        vid_name = d[\"file_names\"][0].split('/')[-2]\n        os.makedirs(os.path.join(dirname, vid_name), exist_ok=True)\n        for idx, file_name in enumerate(d[\"file_names\"]):\n            img = np.array(Image.open(file_name))\n            visualizer = Visualizer(img, metadata=meta)\n            vis = visualizer.draw_dataset_dict(extract_frame_dic(d, idx))\n            fpath = os.path.join(dirname, vid_name, file_name.split('/')[-1])\n            vis.save(fpath)\n"
  },
  {
    "path": "mfvis_nococo/mask2former_video/data_video/datasets/ytvis_api/__init__.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/youtubevos/cocoapi\n"
  },
  {
    "path": "mfvis_nococo/mask2former_video/data_video/datasets/ytvis_api/ytvos.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/youtubevos/cocoapi\n\n__author__ = 'ychfan'\n# Interface for accessing the YouTubeVIS dataset.\n\n# The following API functions are defined:\n#  YTVOS       - YTVOS api class that loads YouTubeVIS annotation file and prepare data structures.\n#  decodeMask - Decode binary mask M encoded via run-length encoding.\n#  encodeMask - Encode binary mask M using run-length encoding.\n#  getAnnIds  - Get ann ids that satisfy given filter conditions.\n#  getCatIds  - Get cat ids that satisfy given filter conditions.\n#  getImgIds  - Get img ids that satisfy given filter conditions.\n#  loadAnns   - Load anns with the specified ids.\n#  loadCats   - Load cats with the specified ids.\n#  loadImgs   - Load imgs with the specified ids.\n#  annToMask  - Convert segmentation in an annotation to binary mask.\n#  loadRes    - Load algorithm results and create API for accessing them.\n\n# Microsoft COCO Toolbox.      
version 2.0\n# Data, paper, and tutorials available at:  http://mscoco.org/\n# Code written by Piotr Dollar and Tsung-Yi Lin, 2014.\n# Licensed under the Simplified BSD License [see bsd.txt]\n\nimport json\nimport time\nimport matplotlib.pyplot as plt\nfrom matplotlib.collections import PatchCollection\nfrom matplotlib.patches import Polygon\nimport numpy as np\nimport copy\nimport itertools\nfrom pycocotools import mask as maskUtils\nimport os\nfrom collections import defaultdict\nimport sys\nPYTHON_VERSION = sys.version_info[0]\nif PYTHON_VERSION == 2:\n    from urllib import urlretrieve\nelif PYTHON_VERSION == 3:\n    from urllib.request import urlretrieve\n\n\ndef _isArrayLike(obj):\n    return hasattr(obj, '__iter__') and hasattr(obj, '__len__')\n\n\nclass YTVOS:\n    def __init__(self, annotation_file=None):\n        \"\"\"\n        Constructor of Microsoft COCO helper class for reading and visualizing annotations.\n        :param annotation_file (str): location of annotation file\n        :param image_folder (str): location to the folder that hosts images.\n        :return:\n        \"\"\"\n        # load dataset\n        self.dataset,self.anns,self.cats,self.vids = dict(),dict(),dict(),dict()\n        self.vidToAnns, self.catToVids = defaultdict(list), defaultdict(list)\n        if not annotation_file == None:\n            print('loading annotations into memory...')\n            tic = time.time()\n            dataset = json.load(open(annotation_file, 'r'))\n            assert type(dataset)==dict, 'annotation file format {} not supported'.format(type(dataset))\n            print('Done (t={:0.2f}s)'.format(time.time()- tic))\n            self.dataset = dataset\n            self.createIndex()\n\n    def createIndex(self):\n        # create index\n        print('creating index...')\n        anns, cats, vids = {}, {}, {}\n        vidToAnns,catToVids = defaultdict(list),defaultdict(list)\n        if 'annotations' in self.dataset:\n            for ann in 
self.dataset['annotations']:\n                vidToAnns[ann['video_id']].append(ann)\n                anns[ann['id']] = ann\n\n        if 'videos' in self.dataset:\n            for vid in self.dataset['videos']:\n                vids[vid['id']] = vid\n\n        if 'categories' in self.dataset:\n            for cat in self.dataset['categories']:\n                cats[cat['id']] = cat\n\n        if 'annotations' in self.dataset and 'categories' in self.dataset:\n            for ann in self.dataset['annotations']:\n                catToVids[ann['category_id']].append(ann['video_id'])\n\n        print('index created!')\n\n        # create class members\n        self.anns = anns\n        self.vidToAnns = vidToAnns\n        self.catToVids = catToVids\n        self.vids = vids\n        self.cats = cats\n\n    def info(self):\n        \"\"\"\n        Print information about the annotation file.\n        :return:\n        \"\"\"\n        for key, value in self.dataset['info'].items():\n            print('{}: {}'.format(key, value))\n\n    def getAnnIds(self, vidIds=[], catIds=[], areaRng=[], iscrowd=None):\n        \"\"\"\n        Get ann ids that satisfy given filter conditions. default skips that filter\n        :param vidIds  (int array)     : get anns for given vids\n               catIds  (int array)     : get anns for given cats\n               areaRng (float array)   : get anns for given area range (e.g. 
[0 inf])\n               iscrowd (boolean)       : get anns for given crowd label (False or True)\n        :return: ids (int array)       : integer array of ann ids\n        \"\"\"\n        vidIds = vidIds if _isArrayLike(vidIds) else [vidIds]\n        catIds = catIds if _isArrayLike(catIds) else [catIds]\n\n        if len(vidIds) == len(catIds) == len(areaRng) == 0:\n            anns = self.dataset['annotations']\n        else:\n            if not len(vidIds) == 0:\n                lists = [self.vidToAnns[vidId] for vidId in vidIds if vidId in self.vidToAnns]\n                anns = list(itertools.chain.from_iterable(lists))\n            else:\n                anns = self.dataset['annotations']\n            anns = anns if len(catIds)  == 0 else [ann for ann in anns if ann['category_id'] in catIds]\n            anns = anns if len(areaRng) == 0 else [ann for ann in anns if ann['avg_area'] > areaRng[0] and ann['avg_area'] < areaRng[1]]\n        if not iscrowd == None:\n            ids = [ann['id'] for ann in anns if ann['iscrowd'] == iscrowd]\n        else:\n            ids = [ann['id'] for ann in anns]\n        return ids\n\n    def getCatIds(self, catNms=[], supNms=[], catIds=[]):\n        \"\"\"\n        filtering parameters. 
default skips that filter.\n        :param catNms (str array)  : get cats for given cat names\n        :param supNms (str array)  : get cats for given supercategory names\n        :param catIds (int array)  : get cats for given cat ids\n        :return: ids (int array)   : integer array of cat ids\n        \"\"\"\n        catNms = catNms if _isArrayLike(catNms) else [catNms]\n        supNms = supNms if _isArrayLike(supNms) else [supNms]\n        catIds = catIds if _isArrayLike(catIds) else [catIds]\n\n        if len(catNms) == len(supNms) == len(catIds) == 0:\n            cats = self.dataset['categories']\n        else:\n            cats = self.dataset['categories']\n            cats = cats if len(catNms) == 0 else [cat for cat in cats if cat['name']          in catNms]\n            cats = cats if len(supNms) == 0 else [cat for cat in cats if cat['supercategory'] in supNms]\n            cats = cats if len(catIds) == 0 else [cat for cat in cats if cat['id']            in catIds]\n        ids = [cat['id'] for cat in cats]\n        return ids\n\n    def getVidIds(self, vidIds=[], catIds=[]):\n        '''\n        Get vid ids that satisfy given filter conditions.\n        :param vidIds (int array) : get vids for given ids\n        :param catIds (int array) : get vids with all given cats\n        :return: ids (int array)  : integer array of vid ids\n        '''\n        vidIds = vidIds if _isArrayLike(vidIds) else [vidIds]\n        catIds = catIds if _isArrayLike(catIds) else [catIds]\n\n        if len(vidIds) == len(catIds) == 0:\n            ids = self.vids.keys()\n        else:\n            ids = set(vidIds)\n            for i, catId in enumerate(catIds):\n                if i == 0 and len(ids) == 0:\n                    ids = set(self.catToVids[catId])\n                else:\n                    ids &= set(self.catToVids[catId])\n        return list(ids)\n\n    def loadAnns(self, ids=[]):\n        \"\"\"\n        Load anns with the specified ids.\n        :param ids 
(int array)       : integer ids specifying anns\n        :return: anns (object array) : loaded ann objects\n        \"\"\"\n        if _isArrayLike(ids):\n            return [self.anns[id] for id in ids]\n        elif type(ids) == int:\n            return [self.anns[ids]]\n\n    def loadCats(self, ids=[]):\n        \"\"\"\n        Load cats with the specified ids.\n        :param ids (int array)       : integer ids specifying cats\n        :return: cats (object array) : loaded cat objects\n        \"\"\"\n        if _isArrayLike(ids):\n            return [self.cats[id] for id in ids]\n        elif type(ids) == int:\n            return [self.cats[ids]]\n\n    def loadVids(self, ids=[]):\n        \"\"\"\n        Load vids with the specified ids.\n        :param ids (int array)       : integer ids specifying vids\n        :return: vids (object array) : loaded vid objects\n        \"\"\"\n        if _isArrayLike(ids):\n            return [self.vids[id] for id in ids]\n        elif type(ids) == int:\n            return [self.vids[ids]]\n\n\n    def loadRes(self, resFile):\n        \"\"\"\n        Load result file and return a result api object.\n        :param   resFile (str)     : file name of result file\n        :return: res (obj)         : result api object\n        \"\"\"\n        res = YTVOS()\n        res.dataset['videos'] = [img for img in self.dataset['videos']]\n\n        print('Loading and preparing results...')\n        tic = time.time()\n        if type(resFile) == str or (PYTHON_VERSION == 2 and type(resFile) == unicode):\n            anns = json.load(open(resFile))\n        elif type(resFile) == np.ndarray:\n            anns = self.loadNumpyAnnotations(resFile)\n        else:\n            anns = resFile\n        assert type(anns) == list, 'results is not an array of objects'\n        annsVidIds = [ann['video_id'] for ann in anns]\n        assert set(annsVidIds) == (set(annsVidIds) & set(self.getVidIds())), \\\n               'Results do not correspond to 
current coco set'\n        if 'segmentations' in anns[0]:\n            res.dataset['categories'] = copy.deepcopy(self.dataset['categories'])\n            for id, ann in enumerate(anns):\n                ann['areas'] = []\n                if not 'bboxes' in ann:\n                    ann['bboxes'] = []\n                for seg in ann['segmentations']:\n                    # now only support compressed RLE format as segmentation results\n                    if seg:\n                        ann['areas'].append(maskUtils.area(seg))\n                        if len(ann['bboxes']) < len(ann['areas']):\n                            ann['bboxes'].append(maskUtils.toBbox(seg))\n                    else:\n                        ann['areas'].append(None)\n                        if len(ann['bboxes']) < len(ann['areas']):\n                            ann['bboxes'].append(None)\n                ann['id'] = id+1\n                l = [a for a in ann['areas'] if a]\n                if len(l)==0:\n                  ann['avg_area'] = 0\n                else:\n                  ann['avg_area'] = np.array(l).mean() \n                ann['iscrowd'] = 0\n        print('DONE (t={:0.2f}s)'.format(time.time()- tic))\n\n        res.dataset['annotations'] = anns\n        res.createIndex()\n        return res\n\n    def annToRLE(self, ann, frameId):\n        \"\"\"\n        Convert annotation which can be polygons, uncompressed RLE to RLE.\n        :return: binary mask (numpy 2D array)\n        \"\"\"\n        t = self.vids[ann['video_id']]\n        h, w = t['height'], t['width']\n        segm = ann['segmentations'][frameId]\n        if type(segm) == list:\n            # polygon -- a single object might consist of multiple parts\n            # we merge all parts into one mask rle code\n            rles = maskUtils.frPyObjects(segm, h, w)\n            rle = maskUtils.merge(rles)\n        elif type(segm['counts']) == list:\n            # uncompressed RLE\n            rle = 
maskUtils.frPyObjects(segm, h, w)\n        else:\n            # rle\n            rle = segm\n        return rle\n\n    def annToMask(self, ann, frameId):\n        \"\"\"\n        Convert annotation which can be polygons, uncompressed RLE, or RLE to binary mask.\n        :return: binary mask (numpy 2D array)\n        \"\"\"\n        rle = self.annToRLE(ann, frameId)\n        m = maskUtils.decode(rle)\n        return m\n"
  },
  {
    "path": "mfvis_nococo/mask2former_video/data_video/datasets/ytvis_api/ytvoseval.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/youtubevos/cocoapi\n\n__author__ = 'ychfan'\n\nimport numpy as np\nimport datetime\nimport time\nfrom collections import defaultdict\nfrom pycocotools import mask as maskUtils\nimport copy\n\nclass YTVOSeval:\n    # Interface for evaluating video instance segmentation on the YouTubeVIS dataset.\n    #\n    # The usage for YTVOSeval is as follows:\n    #  cocoGt=..., cocoDt=...       # load dataset and results\n    #  E = YTVOSeval(cocoGt,cocoDt); # initialize YTVOSeval object\n    #  E.params.recThrs = ...;      # set parameters as desired\n    #  E.evaluate();                # run per image evaluation\n    #  E.accumulate();              # accumulate per image results\n    #  E.summarize();               # display summary metrics of results\n    # For example usage see evalDemo.m and http://mscoco.org/.\n    #\n    # The evaluation parameters are as follows (defaults in brackets):\n    #  imgIds     - [all] N img ids to use for evaluation\n    #  catIds     - [all] K cat ids to use for evaluation\n    #  iouThrs    - [.5:.05:.95] T=10 IoU thresholds for evaluation\n    #  recThrs    - [0:.01:1] R=101 recall thresholds for evaluation\n    #  areaRng    - [...] 
A=4 object area ranges for evaluation\n    #  maxDets    - [1 10 100] M=3 thresholds on max detections per image\n    #  iouType    - ['segm'] set iouType to 'segm', 'bbox' or 'keypoints'\n    #  iouType replaced the now DEPRECATED useSegm parameter.\n    #  useCats    - [1] if true use category labels for evaluation\n    # Note: if useCats=0 category labels are ignored as in proposal scoring.\n    # Note: multiple areaRngs [Ax2] and maxDets [Mx1] can be specified.\n    #\n    # evaluate(): evaluates detections on every image and every category and\n    # concats the results into the \"evalImgs\" with fields:\n    #  dtIds      - [1xD] id for each of the D detections (dt)\n    #  gtIds      - [1xG] id for each of the G ground truths (gt)\n    #  dtMatches  - [TxD] matching gt id at each IoU or 0\n    #  gtMatches  - [TxG] matching dt id at each IoU or 0\n    #  dtScores   - [1xD] confidence of each dt\n    #  gtIgnore   - [1xG] ignore flag for each gt\n    #  dtIgnore   - [TxD] ignore flag for each dt at each IoU\n    #\n    # accumulate(): accumulates the per-image, per-category evaluation\n    # results in \"evalImgs\" into the dictionary \"eval\" with fields:\n    #  params     - parameters used for evaluation\n    #  date       - date evaluation was performed\n    #  counts     - [T,R,K,A,M] parameter dimensions (see above)\n    #  precision  - [TxRxKxAxM] precision for every evaluation setting\n    #  recall     - [TxKxAxM] max recall for every evaluation setting\n    # Note: precision and recall==-1 for settings with no gt objects.\n    #\n    # See also coco, mask, pycocoDemo, pycocoEvalDemo\n    #\n    # Microsoft COCO Toolbox.      
version 2.0\n    # Data, paper, and tutorials available at:  http://mscoco.org/\n    # Code written by Piotr Dollar and Tsung-Yi Lin, 2015.\n    # Licensed under the Simplified BSD License [see coco/license.txt]\n    def __init__(self, cocoGt=None, cocoDt=None, iouType='segm'):\n        '''\n        Initialize CocoEval using coco APIs for gt and dt\n        :param cocoGt: coco object with ground truth annotations\n        :param cocoDt: coco object with detection results\n        :return: None\n        '''\n        if not iouType:\n            print('iouType not specified. use default iouType segm')\n        self.cocoGt   = cocoGt              # ground truth COCO API\n        self.cocoDt   = cocoDt              # detections COCO API\n        self.params   = {}                  # evaluation parameters\n        self.evalVids = defaultdict(list)   # per-image per-category evaluation results [KxAxI] elements\n        self.eval     = {}                  # accumulated evaluation results\n        self._gts = defaultdict(list)       # gt for evaluation\n        self._dts = defaultdict(list)       # dt for evaluation\n        self.params = Params(iouType=iouType) # parameters\n        self._paramsEval = {}               # parameters for evaluation\n        self.stats = []                     # result summarization\n        self.ious = {}                      # ious between all gts and dts\n        if not cocoGt is None:\n            self.params.vidIds = sorted(cocoGt.getVidIds())\n            self.params.catIds = sorted(cocoGt.getCatIds())\n\n\n    def _prepare(self):\n        '''\n        Prepare ._gts and ._dts for evaluation based on params\n        :return: None\n        '''\n        def _toMask(anns, coco):\n            # modify ann['segmentation'] by reference\n            for ann in anns:\n                for i, a in enumerate(ann['segmentations']):\n                    if a:\n                        rle = coco.annToRLE(ann, i)\n                        
ann['segmentations'][i] = rle\n                l = [a for a in ann['areas'] if a]\n                if len(l)==0:\n                  ann['avg_area'] = 0\n                else:\n                  ann['avg_area'] = np.array(l).mean() \n        p = self.params\n        if p.useCats:\n            gts=self.cocoGt.loadAnns(self.cocoGt.getAnnIds(vidIds=p.vidIds, catIds=p.catIds))\n            dts=self.cocoDt.loadAnns(self.cocoDt.getAnnIds(vidIds=p.vidIds, catIds=p.catIds))\n        else:\n            gts=self.cocoGt.loadAnns(self.cocoGt.getAnnIds(vidIds=p.vidIds))\n            dts=self.cocoDt.loadAnns(self.cocoDt.getAnnIds(vidIds=p.vidIds))\n\n        # convert ground truth to mask if iouType == 'segm'\n        if p.iouType == 'segm':\n            _toMask(gts, self.cocoGt)\n            _toMask(dts, self.cocoDt)\n        # set ignore flag\n        for gt in gts:\n            gt['ignore'] = gt['ignore'] if 'ignore' in gt else 0\n            gt['ignore'] = 'iscrowd' in gt and gt['iscrowd']\n            if p.iouType == 'keypoints':\n                gt['ignore'] = (gt['num_keypoints'] == 0) or gt['ignore']\n        self._gts = defaultdict(list)       # gt for evaluation\n        self._dts = defaultdict(list)       # dt for evaluation\n        for gt in gts:\n            self._gts[gt['video_id'], gt['category_id']].append(gt)\n        for dt in dts:\n            self._dts[dt['video_id'], dt['category_id']].append(dt)\n        self.evalVids = defaultdict(list)   # per-image per-category evaluation results\n        self.eval     = {}                  # accumulated evaluation results\n\n    def evaluate(self):\n        '''\n        Run per image evaluation on given images and store results (a list of dict) in self.evalVids\n        :return: None\n        '''\n        tic = time.time()\n        print('Running per image evaluation...')\n        p = self.params\n        # add backward compatibility if useSegm is specified in params\n        if not p.useSegm is None:\n            
p.iouType = 'segm' if p.useSegm == 1 else 'bbox'\n            print('useSegm (deprecated) is not None. Running {} evaluation'.format(p.iouType))\n        print('Evaluate annotation type *{}*'.format(p.iouType))\n        p.vidIds = list(np.unique(p.vidIds))\n        if p.useCats:\n            p.catIds = list(np.unique(p.catIds))\n        p.maxDets = sorted(p.maxDets)\n        self.params=p\n\n        self._prepare()\n        # loop through images, area range, max detection number\n        catIds = p.catIds if p.useCats else [-1]\n\n        if p.iouType == 'segm' or p.iouType == 'bbox':\n            computeIoU = self.computeIoU\n        elif p.iouType == 'keypoints':\n            computeIoU = self.computeOks\n        self.ious = {(vidId, catId): computeIoU(vidId, catId) \\\n                        for vidId in p.vidIds\n                        for catId in catIds}\n\n        evaluateVid = self.evaluateVid\n        maxDet = p.maxDets[-1]\n        \n        \n        self.evalImgs = [evaluateVid(vidId, catId, areaRng, maxDet)\n                 for catId in catIds\n                 for areaRng in p.areaRng\n                 for vidId in p.vidIds\n             ]\n        self._paramsEval = copy.deepcopy(self.params)\n        toc = time.time()\n        print('DONE (t={:0.2f}s).'.format(toc-tic))\n\n    def computeIoU(self, vidId, catId):\n        p = self.params\n        if p.useCats:\n            gt = self._gts[vidId,catId]\n            dt = self._dts[vidId,catId]\n        else:\n            gt = [_ for cId in p.catIds for _ in self._gts[vidId,cId]]\n            dt = [_ for cId in p.catIds for _ in self._dts[vidId,cId]]\n        if len(gt) == 0 and len(dt) ==0:\n            return []\n        inds = np.argsort([-d['score'] for d in dt], kind='mergesort')\n        dt = [dt[i] for i in inds]\n        if len(dt) > p.maxDets[-1]:\n            dt=dt[0:p.maxDets[-1]]\n\n        if p.iouType == 'segm':\n            g = [g['segmentations'] for g in gt]\n            d = 
[d['segmentations'] for d in dt]\n        elif p.iouType == 'bbox':\n            g = [g['bboxes'] for g in gt]\n            d = [d['bboxes'] for d in dt]\n        else:\n            raise Exception('unknown iouType for iou computation')\n\n        # compute iou between each dt and gt region\n        iscrowd = [int(o['iscrowd']) for o in gt]\n        #ious = maskUtils.iou(d,g,iscrowd)\n        def iou_seq(d_seq, g_seq):\n            i = .0\n            u = .0\n            for d, g in zip(d_seq, g_seq):\n                if d and g:\n                    i += maskUtils.area(maskUtils.merge([d, g], True))\n                    u += maskUtils.area(maskUtils.merge([d, g], False))\n                elif not d and g:\n                    u += maskUtils.area(g)\n                elif d and not g:\n                    u += maskUtils.area(d)\n            if not u > .0:\n                print(\"Mask sizes in video {} and category {} may not match!\".format(vidId, catId))\n            iou = i / u if u > .0 else .0\n            return iou\n        ious = np.zeros([len(d), len(g)])\n        for i, j in np.ndindex(ious.shape):\n            ious[i, j] = iou_seq(d[i], g[j])\n        #print(vidId, catId, ious.shape, ious)\n        return ious\n\n    def computeOks(self, imgId, catId):\n        p = self.params\n        # dimension here should be Nxm\n        gts = self._gts[imgId, catId]\n        dts = self._dts[imgId, catId]\n        inds = np.argsort([-d['score'] for d in dts], kind='mergesort')\n        dts = [dts[i] for i in inds]\n        if len(dts) > p.maxDets[-1]:\n            dts = dts[0:p.maxDets[-1]]\n        # if len(gts) == 0 and len(dts) == 0:\n        if len(gts) == 0 or len(dts) == 0:\n            return []\n        ious = np.zeros((len(dts), len(gts)))\n        sigmas = np.array([.26, .25, .25, .35, .35, .79, .79, .72, .72, .62,.62, 1.07, 1.07, .87, .87, .89, .89])/10.0\n        vars = (sigmas * 2)**2\n        k = len(sigmas)\n        # compute oks between each detection 
and ground truth object\n        for j, gt in enumerate(gts):\n            # create bounds for ignore regions(double the gt bbox)\n            g = np.array(gt['keypoints'])\n            xg = g[0::3]; yg = g[1::3]; vg = g[2::3]\n            k1 = np.count_nonzero(vg > 0)\n            bb = gt['bbox']\n            x0 = bb[0] - bb[2]; x1 = bb[0] + bb[2] * 2\n            y0 = bb[1] - bb[3]; y1 = bb[1] + bb[3] * 2\n            for i, dt in enumerate(dts):\n                d = np.array(dt['keypoints'])\n                xd = d[0::3]; yd = d[1::3]\n                if k1>0:\n                    # measure the per-keypoint distance if keypoints visible\n                    dx = xd - xg\n                    dy = yd - yg\n                else:\n                    # measure minimum distance to keypoints in (x0,y0) & (x1,y1)\n                    z = np.zeros((k))\n                    dx = np.max((z, x0-xd),axis=0)+np.max((z, xd-x1),axis=0)\n                    dy = np.max((z, y0-yd),axis=0)+np.max((z, yd-y1),axis=0)\n                e = (dx**2 + dy**2) / vars / (gt['avg_area']+np.spacing(1)) / 2\n                if k1 > 0:\n                    e=e[vg > 0]\n                ious[i, j] = np.sum(np.exp(-e)) / e.shape[0]\n        return ious\n\n    def evaluateVid(self, vidId, catId, aRng, maxDet):\n        '''\n        perform evaluation for single category and image\n        :return: dict (single image results)\n        '''\n        p = self.params\n        if p.useCats:\n            gt = self._gts[vidId,catId]\n            dt = self._dts[vidId,catId]\n        else:\n            gt = [_ for cId in p.catIds for _ in self._gts[vidId,cId]]\n            dt = [_ for cId in p.catIds for _ in self._dts[vidId,cId]]\n        if len(gt) == 0 and len(dt) ==0:\n            return None\n\n        for g in gt:\n            if g['ignore'] or (g['avg_area']<aRng[0] or g['avg_area']>aRng[1]):\n                g['_ignore'] = 1\n            else:\n                g['_ignore'] = 0\n\n        # sort dt 
highest score first, sort gt ignore last\n        gtind = np.argsort([g['_ignore'] for g in gt], kind='mergesort')\n        gt = [gt[i] for i in gtind]\n        dtind = np.argsort([-d['score'] for d in dt], kind='mergesort')\n        dt = [dt[i] for i in dtind[0:maxDet]]\n        iscrowd = [int(o['iscrowd']) for o in gt]\n        # load computed ious\n        ious = self.ious[vidId, catId][:, gtind] if len(self.ious[vidId, catId]) > 0 else self.ious[vidId, catId]\n\n        T = len(p.iouThrs)\n        G = len(gt)\n        D = len(dt)\n        gtm  = np.zeros((T,G))\n        dtm  = np.zeros((T,D))\n        gtIg = np.array([g['_ignore'] for g in gt])\n        dtIg = np.zeros((T,D))\n        if not len(ious)==0:\n            for tind, t in enumerate(p.iouThrs):\n                for dind, d in enumerate(dt):\n                    # information about best match so far (m=-1 -> unmatched)\n                    iou = min([t,1-1e-10])\n                    m   = -1\n                    for gind, g in enumerate(gt):\n                        # if this gt already matched, and not a crowd, continue\n                        if gtm[tind,gind]>0 and not iscrowd[gind]:\n                            continue\n                        # if dt matched to reg gt, and on ignore gt, stop\n                        if m>-1 and gtIg[m]==0 and gtIg[gind]==1:\n                            break\n                        # continue to next gt unless better match made\n                        if ious[dind,gind] < iou:\n                            continue\n                        # if match successful and best so far, store appropriately\n                        iou=ious[dind,gind]\n                        m=gind\n                    # if match made store id of match for both dt and gt\n                    if m ==-1:\n                        continue\n                    dtIg[tind,dind] = gtIg[m]\n                    dtm[tind,dind]  = gt[m]['id']\n                    gtm[tind,m]     = d['id']\n        
# set unmatched detections outside of area range to ignore\n        a = np.array([d['avg_area']<aRng[0] or d['avg_area']>aRng[1] for d in dt]).reshape((1, len(dt)))\n        dtIg = np.logical_or(dtIg, np.logical_and(dtm==0, np.repeat(a,T,0)))\n        # store results for given image and category\n        return {\n                'video_id':     vidId,\n                'category_id':  catId,\n                'aRng':         aRng,\n                'maxDet':       maxDet,\n                'dtIds':        [d['id'] for d in dt],\n                'gtIds':        [g['id'] for g in gt],\n                'dtMatches':    dtm,\n                'gtMatches':    gtm,\n                'dtScores':     [d['score'] for d in dt],\n                'gtIgnore':     gtIg,\n                'dtIgnore':     dtIg,\n            }\n\n    def accumulate(self, p = None):\n        '''\n        Accumulate per image evaluation results and store the result in self.eval\n        :param p: input params for evaluation\n        :return: None\n        '''\n        print('Accumulating evaluation results...')\n        tic = time.time()\n        if not self.evalImgs:\n            print('Please run evaluate() first')\n        # allows input customized parameters\n        if p is None:\n            p = self.params\n        p.catIds = p.catIds if p.useCats == 1 else [-1]\n        T           = len(p.iouThrs)\n        R           = len(p.recThrs)\n        K           = len(p.catIds) if p.useCats else 1\n        A           = len(p.areaRng)\n        M           = len(p.maxDets)\n        precision   = -np.ones((T,R,K,A,M)) # -1 for the precision of absent categories\n        recall      = -np.ones((T,K,A,M))\n        scores      = -np.ones((T,R,K,A,M))\n\n        # create dictionary for future indexing\n        _pe = self._paramsEval\n        catIds = _pe.catIds if _pe.useCats else [-1]\n        setK = set(catIds)\n        setA = set(map(tuple, _pe.areaRng))\n        setM = set(_pe.maxDets)\n        setI = 
set(_pe.vidIds)\n        # get inds to evaluate\n        k_list = [n for n, k in enumerate(p.catIds)  if k in setK]\n        m_list = [m for n, m in enumerate(p.maxDets) if m in setM]\n        a_list = [n for n, a in enumerate(map(lambda x: tuple(x), p.areaRng)) if a in setA]\n        i_list = [n for n, i in enumerate(p.vidIds)  if i in setI]\n        I0 = len(_pe.vidIds)\n        A0 = len(_pe.areaRng)\n        # retrieve E at each category, area range, and max number of detections\n        for k, k0 in enumerate(k_list):\n            Nk = k0*A0*I0\n            for a, a0 in enumerate(a_list):\n                Na = a0*I0\n                for m, maxDet in enumerate(m_list):\n                    E = [self.evalImgs[Nk + Na + i] for i in i_list]\n                    E = [e for e in E if e is not None]\n                    if len(E) == 0:\n                        continue\n                    dtScores = np.concatenate([e['dtScores'][0:maxDet] for e in E])\n\n                    # different sorting methods generate slightly different results;\n                    # mergesort is used to stay consistent with the Matlab implementation.\n                    inds = np.argsort(-dtScores, kind='mergesort')\n                    dtScoresSorted = dtScores[inds]\n\n                    dtm  = np.concatenate([e['dtMatches'][:,0:maxDet] for e in E], axis=1)[:,inds]\n                    dtIg = np.concatenate([e['dtIgnore'][:,0:maxDet]  for e in E], axis=1)[:,inds]\n                    gtIg = np.concatenate([e['gtIgnore'] for e in E])\n                    npig = np.count_nonzero(gtIg==0)\n                    if npig == 0:\n                        continue\n                    tps = np.logical_and(               dtm,  np.logical_not(dtIg) )\n                    fps = np.logical_and(np.logical_not(dtm), np.logical_not(dtIg) )\n\n                    # np.float was removed in NumPy 1.24; use an explicit float64 dtype\n                    tp_sum = np.cumsum(tps, axis=1).astype(dtype=np.float64)\n                    fp_sum = np.cumsum(fps, axis=1).astype(dtype=np.float64)\n
                    for t, (tp, fp) in enumerate(zip(tp_sum, fp_sum)):\n                        tp = np.array(tp)\n                        fp = np.array(fp)\n                        nd = len(tp)\n                        rc = tp / npig\n                        pr = tp / (fp+tp+np.spacing(1))\n                        q  = np.zeros((R,))\n                        ss = np.zeros((R,))\n\n                        if nd:\n                            recall[t,k,a,m] = rc[-1]\n                        else:\n                            recall[t,k,a,m] = 0\n\n                        # numpy element access is slow without cython optimization;\n                        # using python lists here gives a significant speed improvement\n                        pr = pr.tolist(); q = q.tolist()\n\n                        for i in range(nd-1, 0, -1):\n                            if pr[i] > pr[i-1]:\n                                pr[i-1] = pr[i]\n\n                        inds = np.searchsorted(rc, p.recThrs, side='left')\n                        try:\n                            for ri, pi in enumerate(inds):\n                                q[ri] = pr[pi]\n                                ss[ri] = dtScoresSorted[pi]\n                        except:\n                            pass\n                        precision[t,:,k,a,m] = np.array(q)\n                        scores[t,:,k,a,m] = np.array(ss)\n        self.eval = {\n            'params': p,\n            'counts': [T, R, K, A, M],\n            'date': datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),\n            'precision': precision,\n            'recall':   recall,\n            'scores': scores,\n        }\n        toc = time.time()\n        print('DONE (t={:0.2f}s).'.format( toc-tic))\n\n    def summarize(self):\n        '''\n        Compute and display summary metrics for evaluation results.\n        Note this function can *only* be applied with the default parameter setting\n        '''\n        def _summarize( ap=1, iouThr=None, areaRng='all', 
maxDets=100 ):\n            p = self.params\n            iStr = ' {:<18} {} @[ IoU={:<9} | area={:>6s} | maxDets={:>3d} ] = {:0.3f}'\n            titleStr = 'Average Precision' if ap == 1 else 'Average Recall'\n            typeStr = '(AP)' if ap==1 else '(AR)'\n            iouStr = '{:0.2f}:{:0.2f}'.format(p.iouThrs[0], p.iouThrs[-1]) \\\n                if iouThr is None else '{:0.2f}'.format(iouThr)\n\n            aind = [i for i, aRng in enumerate(p.areaRngLbl) if aRng == areaRng]\n            mind = [i for i, mDet in enumerate(p.maxDets) if mDet == maxDets]\n            if ap == 1:\n                # dimension of precision: [TxRxKxAxM]\n                s = self.eval['precision']\n                # IoU\n                if iouThr is not None:\n                    t = np.where(iouThr == p.iouThrs)[0]\n                    s = s[t]\n                s = s[:,:,:,aind,mind]\n            else:\n                # dimension of recall: [TxKxAxM]\n                s = self.eval['recall']\n                if iouThr is not None:\n                    t = np.where(iouThr == p.iouThrs)[0]\n                    s = s[t]\n                s = s[:,:,aind,mind]\n            if len(s[s>-1])==0:\n                mean_s = -1\n            else:\n                mean_s = np.mean(s[s>-1])\n            print(iStr.format(titleStr, typeStr, iouStr, areaRng, maxDets, mean_s))\n            return mean_s\n        def _summarizeDets():\n            stats = np.zeros((12,))\n            stats[0] = _summarize(1)\n            stats[1] = _summarize(1, iouThr=.5, maxDets=self.params.maxDets[2])\n            stats[2] = _summarize(1, iouThr=.75, maxDets=self.params.maxDets[2])\n            stats[3] = _summarize(1, areaRng='small', maxDets=self.params.maxDets[2])\n            stats[4] = _summarize(1, areaRng='medium', maxDets=self.params.maxDets[2])\n            stats[5] = _summarize(1, areaRng='large', maxDets=self.params.maxDets[2])\n            stats[6] = _summarize(0, maxDets=self.params.maxDets[0])\n
            stats[7] = _summarize(0, maxDets=self.params.maxDets[1])\n            stats[8] = _summarize(0, maxDets=self.params.maxDets[2])\n            stats[9] = _summarize(0, areaRng='small', maxDets=self.params.maxDets[2])\n            stats[10] = _summarize(0, areaRng='medium', maxDets=self.params.maxDets[2])\n            stats[11] = _summarize(0, areaRng='large', maxDets=self.params.maxDets[2])\n            return stats\n        def _summarizeKps():\n            stats = np.zeros((10,))\n            stats[0] = _summarize(1, maxDets=20)\n            stats[1] = _summarize(1, maxDets=20, iouThr=.5)\n            stats[2] = _summarize(1, maxDets=20, iouThr=.75)\n            stats[3] = _summarize(1, maxDets=20, areaRng='medium')\n            stats[4] = _summarize(1, maxDets=20, areaRng='large')\n            stats[5] = _summarize(0, maxDets=20)\n            stats[6] = _summarize(0, maxDets=20, iouThr=.5)\n            stats[7] = _summarize(0, maxDets=20, iouThr=.75)\n            stats[8] = _summarize(0, maxDets=20, areaRng='medium')\n            stats[9] = _summarize(0, maxDets=20, areaRng='large')\n            return stats\n        if not self.eval:\n            raise Exception('Please run accumulate() first')\n        iouType = self.params.iouType\n        if iouType == 'segm' or iouType == 'bbox':\n            summarize = _summarizeDets\n        elif iouType == 'keypoints':\n            summarize = _summarizeKps\n        self.stats = summarize()\n\n    def __str__(self):\n        # summarize() prints the metrics; __str__ must still return a str\n        self.summarize()\n        return ''\n\nclass Params:\n    '''\n    Params for coco evaluation api\n    '''\n    def setDetParams(self):\n        self.vidIds = []\n        self.catIds = []\n        # np.arange causes trouble.  
the data point on arange is slightly larger than the true value\n        #self.iouThrs = np.linspace(.5, 0.95, np.round((0.95 - .5) / .05) + 1, endpoint=True)\n        #self.recThrs = np.linspace(.0, 1.00, np.round((1.00 - .0) / .01) + 1, endpoint=True)\n        self.iouThrs = np.linspace(.5, 0.95, int(np.round((0.95 - .5) / .05)) + 1, endpoint=True)\n        self.recThrs = np.linspace(.0, 1.00, int(np.round((1.00 - .0) / .01)) + 1, endpoint=True)\n        self.maxDets = [1, 10, 100]\n        self.areaRng = [[0 ** 2, 1e5 ** 2], [0 ** 2, 128 ** 2], [ 128 ** 2, 256 ** 2], [256 ** 2, 1e5 ** 2]]\n        self.areaRngLbl = ['all', 'small', 'medium', 'large']\n        self.useCats = 1\n\n    def setKpParams(self):\n        self.vidIds = []\n        self.catIds = []\n        # np.arange causes trouble.  the data point on arange is slightly larger than the true value\n        # wrap np.round in int() so the num argument of np.linspace is an integer\n        self.iouThrs = np.linspace(.5, 0.95, int(np.round((0.95 - .5) / .05)) + 1, endpoint=True)\n        self.recThrs = np.linspace(.0, 1.00, int(np.round((1.00 - .0) / .01)) + 1, endpoint=True)\n        self.maxDets = [20]\n        self.areaRng = [[0 ** 2, 1e5 ** 2], [32 ** 2, 96 ** 2], [96 ** 2, 1e5 ** 2]]\n        self.areaRngLbl = ['all', 'medium', 'large']\n        self.useCats = 1\n\n    def __init__(self, iouType='segm'):\n        if iouType == 'segm' or iouType == 'bbox':\n            self.setDetParams()\n        elif iouType == 'keypoints':\n            self.setKpParams()\n        else:\n            raise Exception('iouType not supported')\n        self.iouType = iouType\n        # useSegm is deprecated\n        self.useSegm = None\n"
  },
  {
    "path": "mfvis_nococo/mask2former_video/data_video/ytvis_eval.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/sukjunhwang/IFC\n\nimport contextlib\nimport copy\nimport io\nimport itertools\nimport json\nimport logging\nimport numpy as np\nimport os\nfrom collections import OrderedDict\nimport pycocotools.mask as mask_util\nimport torch\nfrom .datasets.ytvis_api.ytvos import YTVOS\nfrom .datasets.ytvis_api.ytvoseval import YTVOSeval\nfrom tabulate import tabulate\n\nimport detectron2.utils.comm as comm\nfrom detectron2.config import CfgNode\nfrom detectron2.data import MetadataCatalog\nfrom detectron2.evaluation import DatasetEvaluator\nfrom detectron2.utils.file_io import PathManager\nfrom detectron2.utils.logger import create_small_table\n\n\nclass YTVISEvaluator(DatasetEvaluator):\n    \"\"\"\n    Evaluate AR for object proposals, AP for instance detection/segmentation, AP\n    for keypoint detection outputs using COCO's metrics.\n    See http://cocodataset.org/#detection-eval and\n    http://cocodataset.org/#keypoints-eval to understand its metrics.\n\n    In addition to COCO, this evaluator is able to support any bounding box detection,\n    instance segmentation, or keypoint detection dataset.\n    \"\"\"\n\n    def __init__(\n        self,\n        dataset_name,\n        tasks=None,\n        distributed=True,\n        output_dir=None,\n        *,\n        use_fast_impl=True,\n    ):\n        \"\"\"\n        Args:\n            dataset_name (str): name of the dataset to be evaluated.\n                It must have either the following corresponding metadata:\n\n                    \"json_file\": the path to the COCO format annotation\n\n                Or it must be in detectron2's standard dataset format\n                so it can be converted to COCO format automatically.\n            tasks (tuple[str]): tasks that can be evaluated under the given\n                configuration. 
A task is one of \"bbox\", \"segm\", \"keypoints\".\n                By default, will infer this automatically from predictions.\n            distributed (True): if True, will collect results from all ranks and run evaluation\n                in the main process.\n                Otherwise, will only evaluate the results in the current process.\n            output_dir (str): optional, an output directory to dump all\n                results predicted on the dataset. The dump contains two files:\n\n                1. \"instances_predictions.pth\" a file in torch serialization\n                   format that contains all the raw original predictions.\n                2. \"coco_instances_results.json\" a json file in COCO's result\n                   format.\n            use_fast_impl (bool): use a fast but **unofficial** implementation to compute AP.\n                Although the results should be very close to the official implementation in COCO\n                API, it is still recommended to compute results with the official API for use in\n                papers. 
The faster implementation also uses more RAM.\n        \"\"\"\n        self._logger = logging.getLogger(__name__)\n        self._distributed = distributed\n        self._output_dir = output_dir\n        self._use_fast_impl = use_fast_impl\n\n        if tasks is not None and isinstance(tasks, CfgNode):\n            self._logger.warning(\n                \"COCO Evaluator instantiated using config, this is deprecated behavior.\"\n                \" Please pass in explicit arguments instead.\"\n            )\n            self._tasks = None  # Inferring it from predictions should be better\n        else:\n            self._tasks = tasks\n\n        self._cpu_device = torch.device(\"cpu\")\n\n        self._metadata = MetadataCatalog.get(dataset_name)\n\n        json_file = PathManager.get_local_path(self._metadata.json_file)\n        with contextlib.redirect_stdout(io.StringIO()):\n            self._ytvis_api = YTVOS(json_file)\n\n        # Test set json files do not contain annotations (evaluation must be\n        # performed using the COCO evaluation server).\n        self._do_evaluation = \"annotations\" in self._ytvis_api.dataset\n\n    def reset(self):\n        self._predictions = []\n\n    def process(self, inputs, outputs):\n        \"\"\"\n        Args:\n            inputs: the inputs to a COCO model (e.g., GeneralizedRCNN).\n                It is a list of dict. Each dict corresponds to an image and\n                contains keys like \"height\", \"width\", \"file_name\", \"image_id\".\n            outputs: the outputs of a COCO model. It is a list of dicts with key\n                \"instances\" that contains :class:`Instances`.\n        \"\"\"\n        prediction = instances_to_coco_json_video(inputs, outputs)\n        self._predictions.extend(prediction)\n\n    def evaluate(self):\n        \"\"\"\n        Args:\n            img_ids: a list of image IDs to evaluate on. 
Default to None for the whole dataset\n        \"\"\"\n        if self._distributed:\n            comm.synchronize()\n            predictions = comm.gather(self._predictions, dst=0)\n            predictions = list(itertools.chain(*predictions))\n\n            if not comm.is_main_process():\n                return {}\n        else:\n            predictions = self._predictions\n\n        if len(predictions) == 0:\n            self._logger.warning(\"[COCOEvaluator] Did not receive valid predictions.\")\n            return {}\n\n        if self._output_dir:\n            PathManager.mkdirs(self._output_dir)\n            file_path = os.path.join(self._output_dir, \"instances_predictions.pth\")\n            with PathManager.open(file_path, \"wb\") as f:\n                torch.save(predictions, f)\n\n        self._results = OrderedDict()\n        self._eval_predictions(predictions)\n        # Copy so the caller can do whatever with results\n        return copy.deepcopy(self._results)\n\n    def _eval_predictions(self, predictions):\n        \"\"\"\n        Evaluate predictions. 
Fill self._results with the metrics of the tasks.\n        \"\"\"\n        self._logger.info(\"Preparing results for YTVIS format ...\")\n\n        # unmap the category ids for COCO\n        if hasattr(self._metadata, \"thing_dataset_id_to_contiguous_id\"):\n            dataset_id_to_contiguous_id = self._metadata.thing_dataset_id_to_contiguous_id\n            all_contiguous_ids = list(dataset_id_to_contiguous_id.values())\n            num_classes = len(all_contiguous_ids)\n            assert min(all_contiguous_ids) == 0 and max(all_contiguous_ids) == num_classes - 1\n\n            reverse_id_mapping = {v: k for k, v in dataset_id_to_contiguous_id.items()}\n            for result in predictions:\n                category_id = result[\"category_id\"]\n                assert category_id < num_classes, (\n                    f\"A prediction has class={category_id}, \"\n                    f\"but the dataset only has {num_classes} classes and \"\n                    f\"predicted class id should be in [0, {num_classes - 1}].\"\n                )\n                result[\"category_id\"] = reverse_id_mapping[category_id]\n\n        if self._output_dir:\n            file_path = os.path.join(self._output_dir, \"results.json\")\n            self._logger.info(\"Saving results to {}\".format(file_path))\n            with PathManager.open(file_path, \"w\") as f:\n                f.write(json.dumps(predictions))\n                f.flush()\n\n        if not self._do_evaluation:\n            self._logger.info(\"Annotations are not available for evaluation.\")\n            return\n\n        coco_eval = (\n            _evaluate_predictions_on_coco(\n                self._ytvis_api,\n                predictions,\n            )\n            if len(predictions) > 0\n            else None  # cocoapi does not handle empty results very well\n        )\n\n        res = self._derive_coco_results(\n            coco_eval, class_names=self._metadata.get(\"thing_classes\")\n        )\n        
self._results[\"segm\"] = res\n\n    def _derive_coco_results(self, coco_eval, class_names=None):\n        \"\"\"\n        Derive the desired score numbers from summarized COCOeval.\n        Args:\n            coco_eval (None or COCOEval): None represents no predictions from model.\n            iou_type (str):\n            class_names (None or list[str]): if provided, will use it to predict\n                per-category AP.\n        Returns:\n            a dict of {metric name: score}\n        \"\"\"\n\n        metrics = [\"AP\", \"AP50\", \"AP75\", \"APs\", \"APm\", \"APl\", \"AR1\", \"AR10\"]\n\n        if coco_eval is None:\n            self._logger.warning(\"No predictions from the model!\")\n            return {metric: float(\"nan\") for metric in metrics}\n\n        # the standard metrics\n        results = {\n            metric: float(coco_eval.stats[idx] * 100 if coco_eval.stats[idx] >= 0 else \"nan\")\n            for idx, metric in enumerate(metrics)\n        }\n        self._logger.info(\n            \"Evaluation results for {}: \\n\".format(\"segm\") + create_small_table(results)\n        )\n        if not np.isfinite(sum(results.values())):\n            self._logger.info(\"Some metrics cannot be computed and are shown as NaN.\")\n\n        if class_names is None or len(class_names) <= 1:\n            return results\n        # Compute per-category AP\n        # from https://github.com/facebookresearch/Detectron/blob/a6a835f5b8208c45d0dce217ce9bbda915f44df7/detectron/datasets/json_dataset_evaluator.py#L222-L252 # noqa\n        precisions = coco_eval.eval[\"precision\"]\n        # precision has dims (iou, recall, cls, area range, max dets)\n        assert len(class_names) == precisions.shape[2]\n\n        results_per_category = []\n        for idx, name in enumerate(class_names):\n            # area range index 0: all area ranges\n            # max dets index -1: typically 100 per image\n            precision = precisions[:, :, idx, 0, -1]\n            
precision = precision[precision > -1]\n            ap = np.mean(precision) if precision.size else float(\"nan\")\n            results_per_category.append((\"{}\".format(name), float(ap * 100)))\n\n        # tabulate it\n        N_COLS = min(6, len(results_per_category) * 2)\n        results_flatten = list(itertools.chain(*results_per_category))\n        results_2d = itertools.zip_longest(*[results_flatten[i::N_COLS] for i in range(N_COLS)])\n        table = tabulate(\n            results_2d,\n            tablefmt=\"pipe\",\n            floatfmt=\".3f\",\n            headers=[\"category\", \"AP\"] * (N_COLS // 2),\n            numalign=\"left\",\n        )\n        self._logger.info(\"Per-category {} AP: \\n\".format(\"segm\") + table)\n\n        results.update({\"AP-\" + name: ap for name, ap in results_per_category})\n        return results\n\n\ndef instances_to_coco_json_video(inputs, outputs):\n    \"\"\"\n    Dump an \"Instances\" object to a COCO-format json that's used for evaluation.\n\n    Args:\n        inputs (list[dict]): the inputs to the model; must contain exactly one video.\n        outputs (dict): the model outputs, with keys \"pred_scores\", \"pred_labels\" and \"pred_masks\".\n\n    Returns:\n        list[dict]: list of json annotations in COCO format.\n    \"\"\"\n    assert len(inputs) == 1, \"More than one input is loaded for inference!\"\n\n    video_id = inputs[0][\"video_id\"]\n    video_length = inputs[0][\"length\"]\n\n    scores = outputs[\"pred_scores\"]\n    labels = outputs[\"pred_labels\"]\n    masks = outputs[\"pred_masks\"]\n\n    ytvis_results = []\n    for instance_id, (s, l, m) in enumerate(zip(scores, labels, masks)):\n        segms = [\n            mask_util.encode(np.array(_mask[:, :, None], order=\"F\", dtype=\"uint8\"))[0]\n            for _mask in m\n        ]\n        for rle in segms:\n            rle[\"counts\"] = rle[\"counts\"].decode(\"utf-8\")\n\n        res = {\n            \"video_id\": video_id,\n            \"score\": s,\n            \"category_id\": l,\n            \"segmentations\": segms,\n        }\n        
ytvis_results.append(res)\n\n    return ytvis_results\n\n\ndef _evaluate_predictions_on_coco(\n    coco_gt,\n    coco_results,\n    img_ids=None,\n):\n    \"\"\"\n    Evaluate the coco results using COCOEval API.\n    \"\"\"\n    assert len(coco_results) > 0\n\n    coco_results = copy.deepcopy(coco_results)\n    # When evaluating mask AP, if the results contain bbox, cocoapi will\n    # use the box area as the area of the instance, instead of the mask area.\n    # This leads to a different definition of small/medium/large.\n    # We remove the bbox field to let mask AP use mask area.\n    for c in coco_results:\n        c.pop(\"bbox\", None)\n\n    coco_dt = coco_gt.loadRes(coco_results)\n    coco_eval = YTVOSeval(coco_gt, coco_dt)\n    # For COCO, the default max_dets_per_image is [1, 10, 100].\n    max_dets_per_image = [1, 10, 100]  # Default from COCOEval\n    coco_eval.params.maxDets = max_dets_per_image\n\n    if img_ids is not None:\n        coco_eval.params.imgIds = img_ids\n\n    coco_eval.evaluate()\n    coco_eval.accumulate()\n    coco_eval.summarize()\n\n    return coco_eval\n"
  },
  {
    "path": "mfvis_nococo/mask2former_video/modeling/__init__.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\nfrom .transformer_decoder.video_mask2former_transformer_decoder import VideoMultiScaleMaskedTransformerDecoder\n"
  },
  {
    "path": "mfvis_nococo/mask2former_video/modeling/criterion.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from https://github.com/facebookresearch/detr/blob/master/models/detr.py\n\"\"\"\nMaskFormer criterion.\n\"\"\"\nimport logging\n\nimport torch\nimport torch.nn.functional as F\nfrom torch import nn\n\nfrom detectron2.utils.comm import get_world_size\nfrom detectron2.projects.point_rend.point_features import (\n    get_uncertain_point_coords_with_randomness,\n    point_sample,\n)\n\nfrom mask2former.utils.misc import is_dist_avail_and_initialized\n\ndef unfold_wo_center(x, kernel_size, dilation):\n    assert x.dim() == 4\n    assert kernel_size % 2 == 1\n\n    # using SAME padding\n    padding = (kernel_size + (dilation - 1) * (kernel_size - 1)) // 2\n    unfolded_x = F.unfold(\n        x, kernel_size=kernel_size,\n        padding=padding,\n        dilation=dilation\n    )\n\n    unfolded_x = unfolded_x.reshape(\n        x.size(0), x.size(1), -1, x.size(2), x.size(3)\n    )\n\n    # remove the center pixels\n    size = kernel_size ** 2\n    unfolded_x = torch.cat((\n        unfolded_x[:, :, :size // 2],\n        unfolded_x[:, :, size // 2 + 1:]\n    ), dim=2)\n\n    return unfolded_x\n\ndef unfold_w_center(x, kernel_size, dilation):\n    assert x.dim() == 4\n    assert kernel_size % 2 == 1\n\n    # using SAME padding\n    padding = (kernel_size + (dilation - 1) * (kernel_size - 1)) // 2\n    unfolded_x = F.unfold(\n        x, kernel_size=kernel_size,\n        padding=padding,\n        dilation=dilation\n    )\n\n    unfolded_x = unfolded_x.reshape(\n        x.size(0), x.size(1), -1, x.size(2), x.size(3)\n    )\n\n    return unfolded_x\n\ndef compute_pairwise_term(mask_logits, pairwise_size, pairwise_dilation):\n    assert mask_logits.dim() == 4\n\n    log_fg_prob = F.logsigmoid(mask_logits)\n    log_bg_prob = F.logsigmoid(-mask_logits)\n\n    log_fg_prob_unfold = unfold_wo_center(\n        log_fg_prob, kernel_size=pairwise_size,\n        dilation=pairwise_dilation\n    )\n 
   log_bg_prob_unfold = unfold_wo_center(\n        log_bg_prob, kernel_size=pairwise_size,\n        dilation=pairwise_dilation\n    )\n\n    # the probability of making the same prediction = p_i * p_j + (1 - p_i) * (1 - p_j)\n    # we compute the probability in log space to avoid numerical instability\n    log_same_fg_prob = log_fg_prob[:, :, None] + log_fg_prob_unfold\n    log_same_bg_prob = log_bg_prob[:, :, None] + log_bg_prob_unfold\n\n    max_ = torch.max(log_same_fg_prob, log_same_bg_prob)\n    log_same_prob = torch.log(\n        torch.exp(log_same_fg_prob - max_) +\n        torch.exp(log_same_bg_prob - max_)\n    ) + max_\n\n    # loss = -log(prob)\n    return -log_same_prob[:, 0]\n\ndef compute_pairwise_term_neighbor(mask_logits, mask_logits_neighbor, pairwise_size, pairwise_dilation):\n    assert mask_logits.dim() == 4\n\n    log_fg_prob_neigh = F.logsigmoid(mask_logits_neighbor)\n    log_bg_prob_neigh = F.logsigmoid(-mask_logits_neighbor)\n\n    log_fg_prob = F.logsigmoid(mask_logits)\n    log_bg_prob = F.logsigmoid(-mask_logits)\n\n    log_fg_prob_unfold = unfold_w_center(\n        log_fg_prob, kernel_size=pairwise_size,\n        dilation=pairwise_dilation\n    )\n    log_bg_prob_unfold = unfold_w_center(\n        log_bg_prob, kernel_size=pairwise_size,\n        dilation=pairwise_dilation\n    )\n\n    # the probability of making the same prediction = p_i * p_j + (1 - p_i) * (1 - p_j)\n    # we compute the probability in log space to avoid numerical instability\n    log_same_fg_prob = log_fg_prob_neigh[:, :, None] + log_fg_prob_unfold\n    log_same_bg_prob = log_bg_prob_neigh[:, :, None] + log_bg_prob_unfold\n\n    max_ = torch.max(log_same_fg_prob, log_same_bg_prob)\n    log_same_prob = torch.log(\n        torch.exp(log_same_fg_prob - max_) +\n        torch.exp(log_same_bg_prob - max_)\n    ) + max_\n\n    return -log_same_prob[:, 0]\n\ndef dice_coefficient(x, target):\n    eps = 1e-5\n    n_inst = x.size(0)\n    x = x.reshape(n_inst, -1)\n  
  target = target.reshape(n_inst, -1)\n    intersection = (x * target).sum(dim=1)\n    union = (x ** 2.0).sum(dim=1) + (target ** 2.0).sum(dim=1) + eps\n    loss = 1. - (2 * intersection / union)\n    return loss\n\ndef compute_project_term(mask_scores, gt_bitmasks):\n    mask_losses_y = dice_coefficient(\n        mask_scores.max(dim=2, keepdim=True)[0],\n        gt_bitmasks.max(dim=2, keepdim=True)[0]\n    )\n    mask_losses_x = dice_coefficient(\n        mask_scores.max(dim=3, keepdim=True)[0],\n        gt_bitmasks.max(dim=3, keepdim=True)[0]\n    )\n    return (mask_losses_x + mask_losses_y).mean()\n\ndef dice_loss(\n        inputs: torch.Tensor,\n        targets: torch.Tensor,\n        num_masks: float,\n    ):\n    \"\"\"\n    Compute the DICE loss, similar to generalized IOU for masks\n    Args:\n        inputs: A float tensor of arbitrary shape.\n                The predictions for each example.\n        targets: A float tensor with the same shape as inputs. Stores the binary\n                 classification label for each element in inputs\n                (0 for the negative class and 1 for the positive class).\n    \"\"\"\n    inputs = inputs.sigmoid()\n    inputs = inputs.flatten(1)\n    numerator = 2 * (inputs * targets).sum(-1)\n    denominator = inputs.sum(-1) + targets.sum(-1)\n    loss = 1 - (numerator + 1) / (denominator + 1)\n    return loss.sum() / num_masks\n\n\ndice_loss_jit = torch.jit.script(\n    dice_loss\n)  # type: torch.jit.ScriptModule\n\n\ndef sigmoid_ce_loss(\n        inputs: torch.Tensor,\n        targets: torch.Tensor,\n        num_masks: float,\n    ):\n    \"\"\"\n    Args:\n        inputs: A float tensor of arbitrary shape.\n                The predictions for each example.\n        targets: A float tensor with the same shape as inputs. 
Stores the binary\n                 classification label for each element in inputs\n                (0 for the negative class and 1 for the positive class).\n    Returns:\n        Loss tensor\n    \"\"\"\n    loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction=\"none\")\n\n    return loss.mean(1).sum() / num_masks\n\n\nsigmoid_ce_loss_jit = torch.jit.script(\n    sigmoid_ce_loss\n)  # type: torch.jit.ScriptModule\n\n\ndef calculate_uncertainty(logits):\n    \"\"\"\n    We estimate uncertainty as L1 distance between 0.0 and the logit prediction in 'logits' for the\n        foreground class in `classes`.\n    Args:\n        logits (Tensor): A tensor of shape (R, 1, ...) for class-specific or\n            class-agnostic, where R is the total number of predicted masks in all images and C is\n            the number of foreground classes. The values are logits.\n    Returns:\n        scores (Tensor): A tensor of shape (R, 1, ...) that contains uncertainty scores with\n            the most uncertain locations having the highest uncertainty score.\n    \"\"\"\n    assert logits.shape[1] == 1\n    gt_class_logits = logits.clone()\n    return -(torch.abs(gt_class_logits))\n\n\nclass VideoSetCriterion(nn.Module):\n    \"\"\"This class computes the loss for DETR.\n    The process happens in two steps:\n        1) we compute hungarian assignment between ground truth boxes and the outputs of the model\n        2) we supervise each pair of matched ground-truth / prediction (supervise class and box)\n    \"\"\"\n\n    def __init__(self, num_classes, matcher, weight_dict, eos_coef, losses,\n                 num_points, oversample_ratio, importance_sample_ratio):\n        \"\"\"Create the criterion.\n        Parameters:\n            num_classes: number of object categories, omitting the special no-object category\n            matcher: module able to compute a matching between targets and proposals\n            weight_dict: dict containing as key the names of the 
losses and as values their relative weight.\n            eos_coef: relative classification weight applied to the no-object category\n            losses: list of all the losses to be applied. See get_loss for list of available losses.\n        \"\"\"\n        super().__init__()\n        self.num_classes = num_classes\n        self.matcher = matcher\n        self.weight_dict = weight_dict\n        self.eos_coef = eos_coef\n        self.losses = losses\n        empty_weight = torch.ones(self.num_classes + 1)\n        empty_weight[-1] = self.eos_coef\n        self.register_buffer(\"empty_weight\", empty_weight)\n\n        # pointwise mask loss parameters\n        self.num_points = num_points\n        self.oversample_ratio = oversample_ratio\n        self.importance_sample_ratio = importance_sample_ratio\n\n        self._warmup_iters = 2000\n        self.register_buffer(\"_iter\", torch.zeros([1]))\n\n    def loss_labels(self, outputs, targets, indices, num_masks):\n        \"\"\"Classification loss (NLL)\n        targets dicts must contain the key \"labels\" containing a tensor of dim [nb_target_boxes]\n        \"\"\"\n        assert \"pred_logits\" in outputs\n        src_logits = outputs[\"pred_logits\"].float()\n\n        idx = self._get_src_permutation_idx(indices)\n        target_classes_o = torch.cat([t[\"labels\"][J] for t, (_, J) in zip(targets, indices)])\n        target_classes = torch.full(\n            src_logits.shape[:2], self.num_classes, dtype=torch.int64, device=src_logits.device\n        )\n        target_classes[idx] = target_classes_o\n\n        loss_ce = F.cross_entropy(src_logits.transpose(1, 2), target_classes, self.empty_weight)\n        losses = {\"loss_ce\": loss_ce}\n        return losses\n    \n    def loss_masks(self, outputs, targets, indices, num_masks):\n        \"\"\"Compute the losses related to the masks: the focal loss and the dice loss.\n        targets dicts must contain the key \"masks\" containing a tensor of dim 
[nb_target_boxes, h, w]\n        \"\"\"\n        assert \"pred_masks\" in outputs\n\n        src_idx = self._get_src_permutation_idx(indices)\n        src_masks = outputs[\"pred_masks\"]\n        src_masks = src_masks[src_idx]\n        # Modified to handle video\n        target_masks = torch.cat([t['masks'][i] for t, (_, i) in zip(targets, indices)]).to(src_masks)\n\n        # No need to upsample predictions as we are using normalized coordinates :)\n        # NT x 1 x H x W\n        src_masks = src_masks.flatten(0, 1)[:, None]\n        target_masks = target_masks.flatten(0, 1)[:, None]\n        \n        with torch.no_grad():\n            # sample point_coords\n            point_coords = get_uncertain_point_coords_with_randomness(\n                src_masks,\n                lambda logits: calculate_uncertainty(logits),\n                self.num_points,\n                self.oversample_ratio,\n                self.importance_sample_ratio,\n            )\n            # get gt labels\n            point_labels = point_sample(\n                target_masks,\n                point_coords,\n                align_corners=False,\n            ).squeeze(1)\n\n        point_logits = point_sample(\n            src_masks,\n            point_coords,\n            align_corners=False,\n        ).squeeze(1)\n\n        losses = {\n            \"loss_mask\": sigmoid_ce_loss_jit(point_logits, point_labels, num_masks),\n            \"loss_dice\": dice_loss_jit(point_logits, point_labels, num_masks),\n        }\n\n        del src_masks\n        del target_masks\n        return losses\n    \n    def topk_mask(self, images_lab_sim, k):\n        images_lab_sim_mask = torch.zeros_like(images_lab_sim)\n        topk, indices = torch.topk(images_lab_sim, k, dim =1) # 1, 3, 5, 7\n        images_lab_sim_mask = images_lab_sim_mask.scatter(1, indices, topk)\n        return images_lab_sim_mask\n\n    def loss_masks_proj(self, outputs, targets, indices, num_masks, images_lab_sim, 
images_lab_sim_nei, images_lab_sim_nei1, images_lab_sim_nei2, images_lab_sim_nei3, images_lab_sim_nei4):\n        \"\"\"Compute the losses related to the masks: the focal loss and the dice loss.\n        targets dicts must contain the key \"masks\" containing a tensor of dim [nb_target_boxes, h, w]\n        \"\"\"\n        assert \"pred_masks\" in outputs\n        \n        self._iter += 1\n\n        src_idx = self._get_src_permutation_idx(indices)\n        src_masks = outputs[\"pred_masks\"]\n        src_masks = src_masks[src_idx]\n        # Modified to handle video\n        target_masks = torch.cat([t['masks'][i] for t, (_, i) in zip(targets, indices)]).to(src_masks)\n\n        images_lab_sim = torch.cat(images_lab_sim, dim =0)\n        images_lab_sim_nei = torch.cat(images_lab_sim_nei, dim=0)\n        images_lab_sim_nei1 = torch.cat(images_lab_sim_nei1, dim=0)\n        images_lab_sim_nei2 = torch.cat(images_lab_sim_nei2, dim=0)\n        images_lab_sim_nei3 = torch.cat(images_lab_sim_nei3, dim=0)\n        images_lab_sim_nei4 = torch.cat(images_lab_sim_nei4, dim=0)\n\n        images_lab_sim = images_lab_sim.view(-1, target_masks.shape[1], images_lab_sim.shape[-3], images_lab_sim.shape[-2], images_lab_sim.shape[-1])\n        images_lab_sim_nei = images_lab_sim_nei.unsqueeze(1)\n        images_lab_sim_nei1 = images_lab_sim_nei1.unsqueeze(1)\n        images_lab_sim_nei2 = images_lab_sim_nei2.unsqueeze(1)\n        images_lab_sim_nei3 = images_lab_sim_nei3.unsqueeze(1)\n        images_lab_sim_nei4 = images_lab_sim_nei4.unsqueeze(1)\n\n        if len(src_idx[0].tolist()) > 0:\n            images_lab_sim = torch.cat([images_lab_sim[ind][None] for ind in src_idx[0].tolist()]).flatten(0, 1)\n            images_lab_sim_nei = self.topk_mask(torch.cat([images_lab_sim_nei[ind][None] for ind in src_idx[0].tolist()]).flatten(0, 1), 5)\n            images_lab_sim_nei1 = self.topk_mask(torch.cat([images_lab_sim_nei1[ind][None] for ind in src_idx[0].tolist()]).flatten(0, 1), 5)\n   
         images_lab_sim_nei2 = self.topk_mask(torch.cat([images_lab_sim_nei2[ind][None] for ind in src_idx[0].tolist()]).flatten(0, 1), 5)\n            images_lab_sim_nei3 = self.topk_mask(torch.cat([images_lab_sim_nei3[ind][None] for ind in src_idx[0].tolist()]).flatten(0, 1), 5)\n            images_lab_sim_nei4 = self.topk_mask(torch.cat([images_lab_sim_nei4[ind][None] for ind in src_idx[0].tolist()]).flatten(0, 1), 5)\n\n        k_size = 3 \n\n        if src_masks.shape[0] > 0:\n            pairwise_losses_neighbor = compute_pairwise_term_neighbor(\n                src_masks[:,:1], src_masks[:,1:2], k_size, 3\n            ) \n            pairwise_losses_neighbor1 = compute_pairwise_term_neighbor(\n                src_masks[:,1:2], src_masks[:,2:3], k_size, 3\n            ) \n            pairwise_losses_neighbor2 = compute_pairwise_term_neighbor(\n                src_masks[:,2:3], src_masks[:,3:4], k_size, 3\n            )\n            pairwise_losses_neighbor3 = compute_pairwise_term_neighbor(\n                src_masks[:,3:4], src_masks[:,4:5], k_size, 3\n            )\n            pairwise_losses_neighbor4 = compute_pairwise_term_neighbor(\n                src_masks[:,4:5], src_masks[:,0:1], k_size, 3\n            )\n            \n        # print('pairwise_losses_neighbor:', pairwise_losses_neighbor.shape)\n        src_masks = src_masks.flatten(0, 1)[:, None]\n        target_masks = target_masks.flatten(0, 1)[:, None]\n        target_masks = F.interpolate(target_masks, (src_masks.shape[-2], src_masks.shape[-1]), mode='bilinear')\n        # images_lab_sim = F.interpolate(images_lab_sim, (src_masks.shape[-2], src_masks.shape[-1]), mode='bilinear')\n        \n        \n        if src_masks.shape[0] > 0:\n            loss_prj_term = compute_project_term(src_masks.sigmoid(), target_masks)  \n\n            pairwise_losses = compute_pairwise_term(\n                src_masks, k_size, 2\n            )\n\n            weights = (images_lab_sim >= 0.3).float() * 
target_masks.float()\n            target_masks_sum = target_masks.reshape(pairwise_losses_neighbor.shape[0], 5, target_masks.shape[-2], target_masks.shape[-1]).sum(dim=1, keepdim=True)\n            \n            target_masks_sum = (target_masks_sum >= 1.0).float() # ori is 1.0\n            weights_neighbor = (images_lab_sim_nei >= 0.05).float() * target_masks_sum # ori is 0.5, 0.01, 0.001, 0.005, 0.0001, 0.02, 0.05, 0.075, 0.1 , dy 0.5\n            weights_neighbor1 = (images_lab_sim_nei1 >= 0.05).float() * target_masks_sum # ori is 0.5, 0.01, 0.001, 0.005, 0.0001, 0.02, 0.05, 0.075, 0.1, dy 0.5\n            weights_neighbor2 = (images_lab_sim_nei2 >= 0.05).float() * target_masks_sum # ori is 0.5, 0.01, 0.001, 0.005, 0.0001, 0.02, 0.05, 0.075, 0.1, dy 0.5\n            weights_neighbor3 = (images_lab_sim_nei3 >= 0.05).float() * target_masks_sum\n            weights_neighbor4 = (images_lab_sim_nei4 >= 0.05).float() * target_masks_sum\n\n            warmup_factor = min(self._iter.item() / float(self._warmup_iters), 1.0) #1.0\n\n            loss_pairwise = (pairwise_losses * weights).sum() / weights.sum().clamp(min=1.0)\n            loss_pairwise_neighbor = (pairwise_losses_neighbor * weights_neighbor).sum() / weights_neighbor.sum().clamp(min=1.0) * warmup_factor\n            loss_pairwise_neighbor1 = (pairwise_losses_neighbor1 * weights_neighbor1).sum() / weights_neighbor1.sum().clamp(min=1.0) * warmup_factor\n            loss_pairwise_neighbor2 = (pairwise_losses_neighbor2 * weights_neighbor2).sum() / weights_neighbor2.sum().clamp(min=1.0) * warmup_factor\n            loss_pairwise_neighbor3 = (pairwise_losses_neighbor3 * weights_neighbor3).sum() / weights_neighbor3.sum().clamp(min=1.0) * warmup_factor\n            loss_pairwise_neighbor4 = (pairwise_losses_neighbor4 * weights_neighbor4).sum() / weights_neighbor4.sum().clamp(min=1.0) * warmup_factor\n\n        else:\n            loss_prj_term = src_masks.sum() * 0.\n            loss_pairwise = src_masks.sum() * 0.\n  
          loss_pairwise_neighbor = src_masks.sum() * 0.\n            loss_pairwise_neighbor1 = src_masks.sum() * 0.\n            loss_pairwise_neighbor2 = src_masks.sum() * 0.\n            loss_pairwise_neighbor3 = src_masks.sum() * 0.\n            loss_pairwise_neighbor4 = src_masks.sum() * 0.\n\n        # print('loss_proj term:', loss_prj_term)\n        losses = {\n            \"loss_mask\": loss_prj_term,\n            \"loss_bound\": loss_pairwise,\n            \"loss_bound_neighbor\": (loss_pairwise_neighbor + loss_pairwise_neighbor1 + loss_pairwise_neighbor2 + loss_pairwise_neighbor3 + loss_pairwise_neighbor4) * 0.1, # * 0.33\n        }\n\n        del src_masks\n        del target_masks\n        return losses\n\n    def _get_src_permutation_idx(self, indices):\n        # permute predictions following indices\n        batch_idx = torch.cat([torch.full_like(src, i) for i, (src, _) in enumerate(indices)])\n        src_idx = torch.cat([src for (src, _) in indices])\n        return batch_idx, src_idx\n\n    def _get_tgt_permutation_idx(self, indices):\n        # permute targets following indices\n        batch_idx = torch.cat([torch.full_like(tgt, i) for i, (_, tgt) in enumerate(indices)])\n        tgt_idx = torch.cat([tgt for (_, tgt) in indices])\n        return batch_idx, tgt_idx\n\n    def get_loss(self, loss, outputs, targets, indices, num_masks, images_lab_sim, images_lab_sim_nei, images_lab_sim_nei1, images_lab_sim_nei2, images_lab_sim_nei3, images_lab_sim_nei4):\n        loss_map = {\n            'labels': self.loss_labels,\n            'masks': self.loss_masks_proj,\n        }\n        assert loss in loss_map, f\"do you really want to compute {loss} loss?\"\n        if loss == 'masks':\n            return loss_map[loss](outputs, targets, indices, num_masks, images_lab_sim, images_lab_sim_nei, images_lab_sim_nei1, images_lab_sim_nei2, images_lab_sim_nei3, images_lab_sim_nei4)\n        else:\n            return loss_map[loss](outputs, targets, indices, 
num_masks)\n\n    def forward(self, outputs, targets, images_lab_sim, images_lab_sim_nei, images_lab_sim_nei1, images_lab_sim_nei2, images_lab_sim_nei3, images_lab_sim_nei4):\n        \"\"\"This performs the loss computation.\n        Parameters:\n             outputs: dict of tensors, see the output specification of the model for the format\n             targets: list of dicts, such that len(targets) == batch_size.\n                      The expected keys in each dict depend on the losses applied, see each loss' doc\n        \"\"\"\n        outputs_without_aux = {k: v for k, v in outputs.items() if k != \"aux_outputs\"}\n\n        # Retrieve the matching between the outputs of the last layer and the targets\n        indices = self.matcher(outputs_without_aux, targets)\n\n        # Compute the average number of target boxes across all nodes, for normalization purposes\n        num_masks = sum(len(t[\"labels\"]) for t in targets)\n        num_masks = torch.as_tensor(\n            [num_masks], dtype=torch.float, device=next(iter(outputs.values())).device\n        )\n        if is_dist_avail_and_initialized():\n            torch.distributed.all_reduce(num_masks)\n        num_masks = torch.clamp(num_masks / get_world_size(), min=1).item()\n\n        # Compute all the requested losses\n        losses = {}\n        for loss in self.losses:\n            losses.update(self.get_loss(loss, outputs, targets, indices, num_masks, images_lab_sim, images_lab_sim_nei, images_lab_sim_nei1, images_lab_sim_nei2, images_lab_sim_nei3, images_lab_sim_nei4))\n\n        # In case of auxiliary losses, we repeat this process with the output of each intermediate layer.\n        if \"aux_outputs\" in outputs:\n            for i, aux_outputs in enumerate(outputs[\"aux_outputs\"]):\n                indices = self.matcher(aux_outputs, targets)\n                for loss in self.losses:\n                    l_dict = self.get_loss(loss, aux_outputs, targets, indices, num_masks, images_lab_sim, 
images_lab_sim_nei, images_lab_sim_nei1, images_lab_sim_nei2, images_lab_sim_nei3, images_lab_sim_nei4)\n                    l_dict = {k + f\"_{i}\": v for k, v in l_dict.items()}\n                    losses.update(l_dict)\n\n        return losses\n\n    def __repr__(self):\n        head = \"Criterion \" + self.__class__.__name__\n        body = [\n            \"matcher: {}\".format(self.matcher.__repr__(_repr_indent=8)),\n            \"losses: {}\".format(self.losses),\n            \"weight_dict: {}\".format(self.weight_dict),\n            \"num_classes: {}\".format(self.num_classes),\n            \"eos_coef: {}\".format(self.eos_coef),\n            \"num_points: {}\".format(self.num_points),\n            \"oversample_ratio: {}\".format(self.oversample_ratio),\n            \"importance_sample_ratio: {}\".format(self.importance_sample_ratio),\n        ]\n        _repr_indent = 4\n        lines = [head] + [\" \" * _repr_indent + line for line in body]\n        return \"\\n\".join(lines)\n"
  },
  {
    "path": "mfvis_nococo/mask2former_video/modeling/matcher.py",
    "content": "# Modified by Bowen Cheng from https://github.com/facebookresearch/detr/blob/master/models/matcher.py\n\"\"\"\nModules to compute the matching cost and solve the corresponding LSAP.\n\"\"\"\nimport torch\nimport torch.nn.functional as F\nfrom scipy.optimize import linear_sum_assignment\nfrom torch import nn\nfrom torch.cuda.amp import autocast\n\nfrom detectron2.projects.point_rend.point_features import point_sample\n\ndef masks_to_boxes(masks: torch.Tensor) -> torch.Tensor:\n    \"\"\"\n    Fill each mask with its tight bounding box.\n\n    Args:\n        masks (Tensor[N, T, H, W]): masks to transform, where N is the number of masks,\n            T is the number of frames and (H, W) are the spatial dimensions.\n\n    Returns:\n        Tensor[N, T, H, W]: masks in which every non-empty mask is replaced by the binary\n        mask of its bounding box.\n    \"\"\"\n    if masks.numel() == 0:\n        return masks\n\n    n = masks.shape[0]\n    masks = masks.flatten(0, 1)\n\n    for index, mask in enumerate(masks):\n        y, x = torch.where(mask != 0)\n        if len(x) * len(y) == 0:\n            continue\n        \n        masks[index, torch.min(y):torch.max(y)+1, torch.min(x):torch.max(x)+1] = 1.0\n    \n    masks = masks.view(n, -1, masks.shape[-2], masks.shape[-1])\n    return masks\n\ndef masks_to_boxes_new(masks: torch.Tensor) -> torch.Tensor:\n    \"\"\"\n    Compute box-shaped masks from the provided masks.\n\n    Internally derives boxes in ``(x1, y1, x2, y2)`` format with ``0 <= x1 < x2`` and\n    ``0 <= y1 < y2``, then rasterizes them back into binary masks.\n\n    Args:\n        masks (Tensor[N, T, H, W]): masks to transform, where N is the number of masks,\n            T is the number of frames and (H, W) are the spatial dimensions.\n\n    Returns:\n        Tensor[N, T, H, W]: binary masks of the bounding boxes.\n    \"\"\"\n    if masks.numel() == 0:\n        return masks\n    \n    n, _, h, w = masks.shape\n    masks = masks.flatten(0, 1)\n    y = torch.arange(0, h, dtype=torch.float).to(masks.device)\n    x = torch.arange(0, w, dtype=torch.float).to(masks.device)\n    y, x = torch.meshgrid(y, x)\n\n    x_mask = (masks * x.unsqueeze(0))\n    x_max = x_mask.flatten(1).max(-1)[0] + 1\n    x_min = x_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1)[0]\n\n    y_mask = (masks * y.unsqueeze(0))\n    y_max = y_mask.flatten(1).max(-1)[0] + 1\n    y_min = y_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1)[0]\n\n    boxes = torch.stack([x_min, y_min, x_max, y_max], 1)\n\n    mem_mask = torch.zeros_like(masks)\n\n    hMask = torch.logical_or(torch.arange(h).unsqueeze(0).to(boxes)<boxes[:, 1, None], torch.arange(h).unsqueeze(0).to(boxes)>=boxes[:, 3, None])\n    wMask = torch.logical_or(torch.arange(w).unsqueeze(0).to(boxes)<boxes[:, 0, None], torch.arange(w).unsqueeze(0).to(boxes)>=boxes[:, 2, None])\n    \n    mem_mask = torch.logical_or(hMask.unsqueeze(2), wMask.unsqueeze(1)).float()\n    mem_mask = 1.0 - mem_mask.view(n, -1, masks.shape[-2], masks.shape[-1])\n    return mem_mask\n    \ndef batch_dice_loss(inputs: torch.Tensor, targets: torch.Tensor):\n    \"\"\"\n    Compute the DICE loss, similar to generalized IOU for masks\n    Args:\n        inputs: A float tensor of arbitrary shape.\n                The predictions for each example.\n        targets: A float tensor with the same shape as inputs. 
Stores the binary\n                 classification label for each element in inputs\n                (0 for the negative class and 1 for the positive class).\n    \"\"\"\n    inputs = inputs.sigmoid()\n    inputs = inputs.flatten(1)\n    numerator = 2 * torch.einsum(\"nc,mc->nm\", inputs, targets)\n    denominator = inputs.sum(-1)[:, None] + targets.sum(-1)[None, :]\n    loss = 1 - (numerator + 1) / (denominator + 1)\n    return loss\n\ndef batch_dice_loss_nosig(inputs: torch.Tensor, targets: torch.Tensor):\n    \"\"\"\n    Compute the DICE loss, similar to generalized IOU for masks\n    Args:\n        inputs: A float tensor of arbitrary shape.\n                The predictions for each example.\n        targets: A float tensor with the same shape as inputs. Stores the binary\n                 classification label for each element in inputs\n                (0 for the negative class and 1 for the positive class).\n    \"\"\"\n    # inputs = inputs.sigmoid()\n    inputs = inputs.flatten(1)\n    numerator = 2 * torch.einsum(\"nc,mc->nm\", inputs, targets)\n    denominator = inputs.sum(-1)[:, None] + targets.sum(-1)[None, :]\n    loss = 1 - (numerator + 1) / (denominator + 1)\n    return loss\n\nbatch_dice_loss_jit = torch.jit.script(\n    batch_dice_loss\n)  # type: torch.jit.ScriptModule\n\nbatch_dice_loss_jit_nosig = torch.jit.script(\n    batch_dice_loss_nosig\n)  # type: torch.jit.ScriptModule\n\ndef batch_sigmoid_ce_loss(inputs: torch.Tensor, targets: torch.Tensor):\n    \"\"\"\n    Args:\n        inputs: A float tensor of arbitrary shape.\n                The predictions for each example.\n        targets: A float tensor with the same shape as inputs. 
Stores the binary\n                 classification label for each element in inputs\n                (0 for the negative class and 1 for the positive class).\n    Returns:\n        Loss tensor\n    \"\"\"\n    hw = inputs.shape[1]\n\n    pos = F.binary_cross_entropy_with_logits(\n        inputs, torch.ones_like(inputs), reduction=\"none\"\n    )\n    neg = F.binary_cross_entropy_with_logits(\n        inputs, torch.zeros_like(inputs), reduction=\"none\"\n    )\n\n    loss = torch.einsum(\"nc,mc->nm\", pos, targets) + torch.einsum(\n        \"nc,mc->nm\", neg, (1 - targets)\n    )\n    return loss / hw\n\n\nbatch_sigmoid_ce_loss_jit = torch.jit.script(\n    batch_sigmoid_ce_loss\n)  # type: torch.jit.ScriptModule\n\n\nclass VideoHungarianMatcher(nn.Module):\n    \"\"\"This class computes an assignment between the targets and the predictions of the network\n\n    For efficiency reasons, the targets don't include the no_object. Because of this, in general,\n    there are more predictions than targets. 
In this case, we do a 1-to-1 matching of the best predictions,\n    while the others are un-matched (and thus treated as non-objects).\n    \"\"\"\n\n    def __init__(self, cost_class: float = 1, cost_mask: float = 1, cost_dice: float = 1, num_points: int = 0):\n        \"\"\"Creates the matcher\n\n        Params:\n            cost_class: This is the relative weight of the classification error in the matching cost\n            cost_mask: This is the relative weight of the focal loss of the binary mask in the matching cost\n            cost_dice: This is the relative weight of the dice loss of the binary mask in the matching cost\n        \"\"\"\n        super().__init__()\n        self.cost_class = cost_class\n        self.cost_mask = cost_mask\n        self.cost_dice = cost_dice\n\n        assert cost_class != 0 or cost_mask != 0 or cost_dice != 0, \"all costs can't be 0\"\n\n        self.num_points = num_points\n\n    @torch.no_grad()\n    def memory_efficient_forward(self, outputs, targets):\n        \"\"\"More memory-friendly matching\"\"\"\n        bs, num_queries = outputs[\"pred_logits\"].shape[:2]\n\n        indices = []\n\n        # Iterate through batch size\n        for b in range(bs):\n\n            out_prob = outputs[\"pred_logits\"][b].softmax(-1)  # [num_queries, num_classes]\n            tgt_ids = targets[b][\"labels\"]\n\n            # Compute the classification cost. 
Contrary to the loss, we don't use the NLL,\n            # but approximate it by 1 - proba[target class].\n            # The 1 is a constant that doesn't change the matching, so it can be omitted.\n            cost_class = -out_prob[:, tgt_ids]\n\n            out_mask = outputs[\"pred_masks\"][b]  # [num_queries, T, H_pred, W_pred]\n            out_mask = masks_to_boxes_new((out_mask.sigmoid() > 0.5).float())\n            # gt masks are already padded when preparing target\n            tgt_mask = targets[b][\"masks\"].to(out_mask)  # [num_gts, T, H_pred, W_pred]\n            tgt_mask = masks_to_boxes(tgt_mask)\n            # all masks share the same set of points for efficient matching!\n            point_coords = torch.rand(1, self.num_points, 2, device=out_mask.device)\n            # get gt labels\n            tgt_mask = point_sample(\n                tgt_mask,\n                point_coords.repeat(tgt_mask.shape[0], 1, 1),\n                align_corners=False,\n            ).flatten(1)\n\n            out_mask = point_sample(\n                out_mask,\n                point_coords.repeat(out_mask.shape[0], 1, 1),\n                align_corners=False,\n            ).flatten(1)\n\n            with autocast(enabled=False):\n                out_mask = out_mask.float()\n                tgt_mask = tgt_mask.float()\n\n                cost_dice_nosig = batch_dice_loss_jit_nosig(out_mask, tgt_mask)\n\n            C = (\n                self.cost_class * cost_class\n                + self.cost_dice * cost_dice_nosig\n            )\n\n            C = C.reshape(num_queries, -1).cpu()\n\n            indices.append(linear_sum_assignment(C))\n\n        return [\n            (torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64))\n            for i, j in indices\n        ]\n\n    @torch.no_grad()\n    def forward(self, outputs, targets):\n        \"\"\"Performs the matching\n\n        Params:\n            outputs: This is a dict that contains at least these 
entries:\n                 \"pred_logits\": Tensor of dim [batch_size, num_queries, num_classes] with the classification logits\n                 \"pred_masks\": Tensor of dim [batch_size, num_queries, T, H_pred, W_pred] with the predicted masks\n\n            targets: This is a list of targets (len(targets) = batch_size), where each target is a dict containing:\n                 \"labels\": Tensor of dim [num_target_boxes] (where num_target_boxes is the number of ground-truth\n                           objects in the target) containing the class labels\n                 \"masks\": Tensor of dim [num_target_boxes, T, H_gt, W_gt] containing the target masks\n\n        Returns:\n            A list of size batch_size, containing tuples of (index_i, index_j) where:\n                - index_i is the indices of the selected predictions (in order)\n                - index_j is the indices of the corresponding selected targets (in order)\n            For each batch element, it holds:\n                len(index_i) = len(index_j) = min(num_queries, num_target_boxes)\n        \"\"\"\n        return self.memory_efficient_forward(outputs, targets)\n\n    def __repr__(self, _repr_indent=4):\n        head = \"Matcher \" + self.__class__.__name__\n        body = [\n            \"cost_class: {}\".format(self.cost_class),\n            \"cost_mask: {}\".format(self.cost_mask),\n            \"cost_dice: {}\".format(self.cost_dice),\n        ]\n        lines = [head] + [\" \" * _repr_indent + line for line in body]\n        return \"\\n\".join(lines)\n"
  },
  {
    "path": "mfvis_nococo/mask2former_video/modeling/transformer_decoder/__init__.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\nfrom .video_mask2former_transformer_decoder import VideoMultiScaleMaskedTransformerDecoder\n"
  },
  {
    "path": "mfvis_nococo/mask2former_video/modeling/transformer_decoder/position_encoding.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from: https://github.com/facebookresearch/detr/blob/master/models/position_encoding.py\n\"\"\"\nVarious positional encodings for the transformer.\n\"\"\"\nimport math\n\nimport torch\nfrom torch import nn\n\n\nclass PositionEmbeddingSine3D(nn.Module):\n    \"\"\"\n    This is a more standard version of the position embedding, very similar to the one\n    used by the Attention is all you need paper, generalized to work on videos.\n    \"\"\"\n\n    def __init__(self, num_pos_feats=64, temperature=10000, normalize=False, scale=None):\n        super().__init__()\n        self.num_pos_feats = num_pos_feats\n        self.temperature = temperature\n        self.normalize = normalize\n        if scale is not None and normalize is False:\n            raise ValueError(\"normalize should be True if scale is passed\")\n        if scale is None:\n            scale = 2 * math.pi\n        self.scale = scale\n\n    def forward(self, x, mask=None):\n        # b, t, c, h, w\n        assert x.dim() == 5, f\"{x.shape} should be a 5-dimensional Tensor, got {x.dim()}-dimensional Tensor instead\"\n        if mask is None:\n            mask = torch.zeros((x.size(0), x.size(1), x.size(3), x.size(4)), device=x.device, dtype=torch.bool)\n        not_mask = ~mask\n        z_embed = not_mask.cumsum(1, dtype=torch.float32)\n        y_embed = not_mask.cumsum(2, dtype=torch.float32)\n        x_embed = not_mask.cumsum(3, dtype=torch.float32)\n        if self.normalize:\n            eps = 1e-6\n            z_embed = z_embed / (z_embed[:, -1:, :, :] + eps) * self.scale\n            y_embed = y_embed / (y_embed[:, :, -1:, :] + eps) * self.scale\n            x_embed = x_embed / (x_embed[:, :, :, -1:] + eps) * self.scale\n\n        dim_t = torch.arange(self.num_pos_feats, dtype=torch.float32, device=x.device)\n        dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)\n\n        dim_t_z = 
torch.arange((self.num_pos_feats * 2), dtype=torch.float32, device=x.device)\n        dim_t_z = self.temperature ** (2 * (dim_t_z // 2) / (self.num_pos_feats * 2))\n\n        pos_x = x_embed[:, :, :, :, None] / dim_t\n        pos_y = y_embed[:, :, :, :, None] / dim_t\n        pos_z = z_embed[:, :, :, :, None] / dim_t_z\n        pos_x = torch.stack((pos_x[:, :, :, :, 0::2].sin(), pos_x[:, :, :, :, 1::2].cos()), dim=5).flatten(4)\n        pos_y = torch.stack((pos_y[:, :, :, :, 0::2].sin(), pos_y[:, :, :, :, 1::2].cos()), dim=5).flatten(4)\n        pos_z = torch.stack((pos_z[:, :, :, :, 0::2].sin(), pos_z[:, :, :, :, 1::2].cos()), dim=5).flatten(4)\n        pos = (torch.cat((pos_y, pos_x), dim=4) + pos_z).permute(0, 1, 4, 2, 3)  # b, t, c, h, w\n        return pos\n"
  },
  {
    "path": "mfvis_nococo/mask2former_video/modeling/transformer_decoder/video_mask2former_transformer_decoder.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n# Modified by Bowen Cheng from: https://github.com/facebookresearch/detr/blob/master/models/detr.py\nimport logging\nimport fvcore.nn.weight_init as weight_init\nfrom typing import Optional\nimport torch\nfrom torch import nn, Tensor\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.layers import Conv2d\n\nfrom mask2former.modeling.transformer_decoder.maskformer_transformer_decoder import TRANSFORMER_DECODER_REGISTRY\n\nfrom .position_encoding import PositionEmbeddingSine3D\n\n\nclass SelfAttentionLayer(nn.Module):\n\n    def __init__(self, d_model, nhead, dropout=0.0,\n                 activation=\"relu\", normalize_before=False):\n        super().__init__()\n        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)\n\n        self.norm = nn.LayerNorm(d_model)\n        self.dropout = nn.Dropout(dropout)\n\n        self.activation = _get_activation_fn(activation)\n        self.normalize_before = normalize_before\n\n        self._reset_parameters()\n    \n    def _reset_parameters(self):\n        for p in self.parameters():\n            if p.dim() > 1:\n                nn.init.xavier_uniform_(p)\n\n    def with_pos_embed(self, tensor, pos: Optional[Tensor]):\n        return tensor if pos is None else tensor + pos\n\n    def forward_post(self, tgt,\n                     tgt_mask: Optional[Tensor] = None,\n                     tgt_key_padding_mask: Optional[Tensor] = None,\n                     query_pos: Optional[Tensor] = None):\n        q = k = self.with_pos_embed(tgt, query_pos)\n        tgt2 = self.self_attn(q, k, value=tgt, attn_mask=tgt_mask,\n                              key_padding_mask=tgt_key_padding_mask)[0]\n        tgt = tgt + self.dropout(tgt2)\n        tgt = self.norm(tgt)\n\n        return tgt\n\n    def forward_pre(self, tgt,\n                    tgt_mask: Optional[Tensor] = None,\n                    
tgt_key_padding_mask: Optional[Tensor] = None,\n                    query_pos: Optional[Tensor] = None):\n        tgt2 = self.norm(tgt)\n        q = k = self.with_pos_embed(tgt2, query_pos)\n        tgt2 = self.self_attn(q, k, value=tgt2, attn_mask=tgt_mask,\n                              key_padding_mask=tgt_key_padding_mask)[0]\n        tgt = tgt + self.dropout(tgt2)\n        \n        return tgt\n\n    def forward(self, tgt,\n                tgt_mask: Optional[Tensor] = None,\n                tgt_key_padding_mask: Optional[Tensor] = None,\n                query_pos: Optional[Tensor] = None):\n        if self.normalize_before:\n            return self.forward_pre(tgt, tgt_mask,\n                                    tgt_key_padding_mask, query_pos)\n        return self.forward_post(tgt, tgt_mask,\n                                 tgt_key_padding_mask, query_pos)\n\n\nclass CrossAttentionLayer(nn.Module):\n\n    def __init__(self, d_model, nhead, dropout=0.0,\n                 activation=\"relu\", normalize_before=False):\n        super().__init__()\n        self.multihead_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)\n\n        self.norm = nn.LayerNorm(d_model)\n        self.dropout = nn.Dropout(dropout)\n\n        self.activation = _get_activation_fn(activation)\n        self.normalize_before = normalize_before\n\n        self._reset_parameters()\n    \n    def _reset_parameters(self):\n        for p in self.parameters():\n            if p.dim() > 1:\n                nn.init.xavier_uniform_(p)\n\n    def with_pos_embed(self, tensor, pos: Optional[Tensor]):\n        return tensor if pos is None else tensor + pos\n\n    def forward_post(self, tgt, memory,\n                     memory_mask: Optional[Tensor] = None,\n                     memory_key_padding_mask: Optional[Tensor] = None,\n                     pos: Optional[Tensor] = None,\n                     query_pos: Optional[Tensor] = None):\n        tgt2 = 
self.multihead_attn(query=self.with_pos_embed(tgt, query_pos),\n                                   key=self.with_pos_embed(memory, pos),\n                                   value=memory, attn_mask=memory_mask,\n                                   key_padding_mask=memory_key_padding_mask)[0]\n        tgt = tgt + self.dropout(tgt2)\n        tgt = self.norm(tgt)\n        \n        return tgt\n\n    def forward_pre(self, tgt, memory,\n                    memory_mask: Optional[Tensor] = None,\n                    memory_key_padding_mask: Optional[Tensor] = None,\n                    pos: Optional[Tensor] = None,\n                    query_pos: Optional[Tensor] = None):\n        tgt2 = self.norm(tgt)\n        tgt2 = self.multihead_attn(query=self.with_pos_embed(tgt2, query_pos),\n                                   key=self.with_pos_embed(memory, pos),\n                                   value=memory, attn_mask=memory_mask,\n                                   key_padding_mask=memory_key_padding_mask)[0]\n        tgt = tgt + self.dropout(tgt2)\n\n        return tgt\n\n    def forward(self, tgt, memory,\n                memory_mask: Optional[Tensor] = None,\n                memory_key_padding_mask: Optional[Tensor] = None,\n                pos: Optional[Tensor] = None,\n                query_pos: Optional[Tensor] = None):\n        if self.normalize_before:\n            return self.forward_pre(tgt, memory, memory_mask,\n                                    memory_key_padding_mask, pos, query_pos)\n        return self.forward_post(tgt, memory, memory_mask,\n                                 memory_key_padding_mask, pos, query_pos)\n\n\nclass FFNLayer(nn.Module):\n\n    def __init__(self, d_model, dim_feedforward=2048, dropout=0.0,\n                 activation=\"relu\", normalize_before=False):\n        super().__init__()\n        # Implementation of Feedforward model\n        self.linear1 = nn.Linear(d_model, dim_feedforward)\n        self.dropout = nn.Dropout(dropout)\n        
self.linear2 = nn.Linear(dim_feedforward, d_model)\n\n        self.norm = nn.LayerNorm(d_model)\n\n        self.activation = _get_activation_fn(activation)\n        self.normalize_before = normalize_before\n\n        self._reset_parameters()\n    \n    def _reset_parameters(self):\n        for p in self.parameters():\n            if p.dim() > 1:\n                nn.init.xavier_uniform_(p)\n\n    def with_pos_embed(self, tensor, pos: Optional[Tensor]):\n        return tensor if pos is None else tensor + pos\n\n    def forward_post(self, tgt):\n        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))\n        tgt = tgt + self.dropout(tgt2)\n        tgt = self.norm(tgt)\n        return tgt\n\n    def forward_pre(self, tgt):\n        tgt2 = self.norm(tgt)\n        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt2))))\n        tgt = tgt + self.dropout(tgt2)\n        return tgt\n\n    def forward(self, tgt):\n        if self.normalize_before:\n            return self.forward_pre(tgt)\n        return self.forward_post(tgt)\n\n\ndef _get_activation_fn(activation):\n    \"\"\"Return an activation function given a string\"\"\"\n    if activation == \"relu\":\n        return F.relu\n    if activation == \"gelu\":\n        return F.gelu\n    if activation == \"glu\":\n        return F.glu\n    raise RuntimeError(f\"activation should be relu/gelu/glu, not {activation}.\")\n\n\nclass MLP(nn.Module):\n    \"\"\"Very simple multi-layer perceptron (also called FFN)\"\"\"\n\n    def __init__(self, input_dim, hidden_dim, output_dim, num_layers):\n        super().__init__()\n        self.num_layers = num_layers\n        h = [hidden_dim] * (num_layers - 1)\n        self.layers = nn.ModuleList(nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim]))\n\n    def forward(self, x):\n        for i, layer in enumerate(self.layers):\n            x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x)\n        return 
x\n\n\n@TRANSFORMER_DECODER_REGISTRY.register()\nclass VideoMultiScaleMaskedTransformerDecoder(nn.Module):\n\n    _version = 2\n\n    def _load_from_state_dict(\n        self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs\n    ):\n        version = local_metadata.get(\"version\", None)\n        if version is None or version < 2:\n            # Do not warn if training from scratch\n            scratch = True\n            logger = logging.getLogger(__name__)\n            for k in list(state_dict.keys()):\n                newk = k\n                if \"static_query\" in k:\n                    newk = k.replace(\"static_query\", \"query_feat\")\n                if newk != k:\n                    state_dict[newk] = state_dict[k]\n                    del state_dict[k]\n                    scratch = False\n\n            if not scratch:\n                logger.warning(\n                    f\"Weight format of {self.__class__.__name__} has changed! \"\n                    \"Please upgrade your models. 
Applying automatic conversion now ...\"\n                )\n\n    @configurable\n    def __init__(\n        self,\n        in_channels,\n        mask_classification=True,\n        *,\n        num_classes: int,\n        hidden_dim: int,\n        num_queries: int,\n        nheads: int,\n        dim_feedforward: int,\n        dec_layers: int,\n        pre_norm: bool,\n        mask_dim: int,\n        enforce_input_project: bool,\n        # video related\n        num_frames,\n    ):\n        \"\"\"\n        NOTE: this interface is experimental.\n        Args:\n            in_channels: channels of the input features\n            mask_classification: whether to add mask classifier or not\n            num_classes: number of classes\n            hidden_dim: Transformer feature dimension\n            num_queries: number of queries\n            nheads: number of heads\n            dim_feedforward: feature dimension in feedforward network\n            dec_layers: number of Transformer decoder layers\n            pre_norm: whether to use pre-LayerNorm or not\n            mask_dim: mask feature dimension\n            enforce_input_project: add input project 1x1 conv even if input\n                channels and hidden dim are identical\n        \"\"\"\n        super().__init__()\n\n        assert mask_classification, \"Only support mask classification model\"\n        self.mask_classification = mask_classification\n\n        self.num_frames = num_frames\n\n        # positional encoding\n        N_steps = hidden_dim // 2\n        self.pe_layer = PositionEmbeddingSine3D(N_steps, normalize=True)\n        \n        # define Transformer decoder here\n        self.num_heads = nheads\n        self.num_layers = dec_layers\n        self.transformer_self_attention_layers = nn.ModuleList()\n        self.transformer_cross_attention_layers = nn.ModuleList()\n        self.transformer_ffn_layers = nn.ModuleList()\n\n        for _ in 
range(self.num_layers):\n            self.transformer_self_attention_layers.append(\n                SelfAttentionLayer(\n                    d_model=hidden_dim,\n                    nhead=nheads,\n                    dropout=0.0,\n                    normalize_before=pre_norm,\n                )\n            )\n\n            self.transformer_cross_attention_layers.append(\n                CrossAttentionLayer(\n                    d_model=hidden_dim,\n                    nhead=nheads,\n                    dropout=0.0,\n                    normalize_before=pre_norm,\n                )\n            )\n\n            self.transformer_ffn_layers.append(\n                FFNLayer(\n                    d_model=hidden_dim,\n                    dim_feedforward=dim_feedforward,\n                    dropout=0.0,\n                    normalize_before=pre_norm,\n                )\n            )\n\n        self.decoder_norm = nn.LayerNorm(hidden_dim)\n\n        self.num_queries = num_queries\n        # learnable query features\n        self.query_feat = nn.Embedding(num_queries, hidden_dim)\n        # learnable query p.e.\n        self.query_embed = nn.Embedding(num_queries, hidden_dim)\n\n        # level embedding (we always use 3 scales)\n        self.num_feature_levels = 3\n        self.level_embed = nn.Embedding(self.num_feature_levels, hidden_dim)\n        self.input_proj = nn.ModuleList()\n        for _ in range(self.num_feature_levels):\n            if in_channels != hidden_dim or enforce_input_project:\n                self.input_proj.append(Conv2d(in_channels, hidden_dim, kernel_size=1))\n                weight_init.c2_xavier_fill(self.input_proj[-1])\n            else:\n                self.input_proj.append(nn.Sequential())\n\n        # output FFNs\n        if self.mask_classification:\n            self.class_embed = nn.Linear(hidden_dim, num_classes + 1)\n        self.mask_embed = MLP(hidden_dim, hidden_dim, mask_dim, 3)\n\n    @classmethod\n    def from_config(cls, 
cfg, in_channels, mask_classification):\n        ret = {}\n        ret[\"in_channels\"] = in_channels\n        ret[\"mask_classification\"] = mask_classification\n        \n        ret[\"num_classes\"] = cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES\n        ret[\"hidden_dim\"] = cfg.MODEL.MASK_FORMER.HIDDEN_DIM\n        ret[\"num_queries\"] = cfg.MODEL.MASK_FORMER.NUM_OBJECT_QUERIES\n        # Transformer parameters:\n        ret[\"nheads\"] = cfg.MODEL.MASK_FORMER.NHEADS\n        ret[\"dim_feedforward\"] = cfg.MODEL.MASK_FORMER.DIM_FEEDFORWARD\n\n        # NOTE: because the learnable query features also require supervision,\n        # we subtract 1 from the number of decoder layers to stay consistent with our\n        # loss implementation: the number of auxiliary losses always equals the number\n        # of decoder layers. With learnable query features, the number of auxiliary\n        # losses equals the number of decoder layers plus 1.\n        assert cfg.MODEL.MASK_FORMER.DEC_LAYERS >= 1\n        ret[\"dec_layers\"] = cfg.MODEL.MASK_FORMER.DEC_LAYERS - 1\n        ret[\"pre_norm\"] = cfg.MODEL.MASK_FORMER.PRE_NORM\n        ret[\"enforce_input_project\"] = cfg.MODEL.MASK_FORMER.ENFORCE_INPUT_PROJ\n\n        ret[\"mask_dim\"] = cfg.MODEL.SEM_SEG_HEAD.MASK_DIM\n\n        ret[\"num_frames\"] = cfg.INPUT.SAMPLING_FRAME_NUM\n\n        return ret\n\n    def forward(self, x, mask_features, mask=None):\n        bt, c_m, h_m, w_m = mask_features.shape\n        bs = bt // self.num_frames if self.training else 1\n        t = bt // bs\n        mask_features = mask_features.view(bs, t, c_m, h_m, w_m)\n\n        # x is a list of multi-scale features\n        assert len(x) == self.num_feature_levels\n        src = []\n        pos = []\n        size_list = []\n\n        # disable mask, it does not affect performance\n        del mask\n\n        for i in range(self.num_feature_levels):\n            size_list.append(x[i].shape[-2:])\n            pos.append(self.pe_layer(x[i].view(bs, t, -1, 
size_list[-1][0], size_list[-1][1]), None).flatten(3))\n            src.append(self.input_proj[i](x[i]).flatten(2) + self.level_embed.weight[i][None, :, None])\n\n            # NTxCxHW => NxTxCxHW => (TxHW)xNxC\n            _, c, hw = src[-1].shape\n            pos[-1] = pos[-1].view(bs, t, c, hw).permute(1, 3, 0, 2).flatten(0, 1)\n            src[-1] = src[-1].view(bs, t, c, hw).permute(1, 3, 0, 2).flatten(0, 1)\n\n        # QxNxC\n        query_embed = self.query_embed.weight.unsqueeze(1).repeat(1, bs, 1)\n        output = self.query_feat.weight.unsqueeze(1).repeat(1, bs, 1)\n\n        predictions_class = []\n        predictions_mask = []\n\n        # prediction heads on learnable query features\n        outputs_class, outputs_mask, attn_mask = self.forward_prediction_heads(output, mask_features, attn_mask_target_size=size_list[0])\n        predictions_class.append(outputs_class)\n        predictions_mask.append(outputs_mask)\n\n        for i in range(self.num_layers):\n            level_index = i % self.num_feature_levels\n            attn_mask[torch.where(attn_mask.sum(-1) == attn_mask.shape[-1])] = False\n            # attention: cross-attention first\n            output = self.transformer_cross_attention_layers[i](\n                output, src[level_index],\n                memory_mask=attn_mask,\n                memory_key_padding_mask=None,  # here we do not apply masking on padded region\n                pos=pos[level_index], query_pos=query_embed\n            )\n\n            output = self.transformer_self_attention_layers[i](\n                output, tgt_mask=None,\n                tgt_key_padding_mask=None,\n                query_pos=query_embed\n            )\n            \n            # FFN\n            output = self.transformer_ffn_layers[i](\n                output\n            )\n\n            outputs_class, outputs_mask, attn_mask = self.forward_prediction_heads(output, mask_features, attn_mask_target_size=size_list[(i + 1) % 
self.num_feature_levels])\n            predictions_class.append(outputs_class)\n            predictions_mask.append(outputs_mask)\n\n        assert len(predictions_class) == self.num_layers + 1\n\n        out = {\n            'pred_logits': predictions_class[-1],\n            'pred_masks': predictions_mask[-1],\n            'aux_outputs': self._set_aux_loss(\n                predictions_class if self.mask_classification else None, predictions_mask\n            )\n        }\n        return out\n\n    def forward_prediction_heads(self, output, mask_features, attn_mask_target_size):\n        decoder_output = self.decoder_norm(output)\n        decoder_output = decoder_output.transpose(0, 1)\n        outputs_class = self.class_embed(decoder_output)\n        mask_embed = self.mask_embed(decoder_output)\n        outputs_mask = torch.einsum(\"bqc,btchw->bqthw\", mask_embed, mask_features)\n        b, q, t, _, _ = outputs_mask.shape\n\n        # NOTE: prediction is of higher-resolution\n        # [B, Q, T, H, W] -> [B, Q, T*H*W] -> [B, h, Q, T*H*W] -> [B*h, Q, T*HW]\n        attn_mask = F.interpolate(outputs_mask.flatten(0, 1), size=attn_mask_target_size, mode=\"bilinear\", align_corners=False).view(\n            b, q, t, attn_mask_target_size[0], attn_mask_target_size[1])\n        # must use bool type\n        # If a BoolTensor is provided, positions with ``True`` are not allowed to attend while ``False`` values will be unchanged.\n        attn_mask = (attn_mask.sigmoid().flatten(2).unsqueeze(1).repeat(1, self.num_heads, 1, 1).flatten(0, 1) < 0.5).bool()\n        attn_mask = attn_mask.detach()\n\n        return outputs_class, outputs_mask, attn_mask\n\n    @torch.jit.unused\n    def _set_aux_loss(self, outputs_class, outputs_seg_masks):\n        # this is a workaround to make torchscript happy, as torchscript\n        # doesn't support dictionary with non-homogeneous values, such\n        # as a dict having both a Tensor and a list.\n        if self.mask_classification:\n  
          return [\n                {\"pred_logits\": a, \"pred_masks\": b}\n                for a, b in zip(outputs_class[:-1], outputs_seg_masks[:-1])\n            ]\n        else:\n            return [{\"pred_masks\": b} for b in outputs_seg_masks[:-1]]\n"
  },
  {
    "path": "mfvis_nococo/mask2former_video/utils/__init__.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n"
  },
  {
    "path": "mfvis_nococo/mask2former_video/utils/memory.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n\nimport logging\nfrom contextlib import contextmanager\nfrom functools import wraps\nimport torch\nfrom torch.cuda.amp import autocast\n\n__all__ = [\"retry_if_cuda_oom\"]\n\n\n@contextmanager\ndef _ignore_torch_cuda_oom():\n    \"\"\"\n    A context which ignores CUDA OOM exception from pytorch.\n    \"\"\"\n    try:\n        yield\n    except RuntimeError as e:\n        # NOTE: the string may change?\n        if \"CUDA out of memory. \" in str(e):\n            pass\n        else:\n            raise\n\n\ndef retry_if_cuda_oom(func):\n    \"\"\"\n    Makes a function retry itself after encountering\n    pytorch's CUDA OOM error.\n    It will first retry after calling `torch.cuda.empty_cache()`.\n    If that still fails, it will then retry by trying to convert inputs to CPUs.\n    In this case, it expects the function to dispatch to CPU implementation.\n    The return values may become CPU tensors as well and it's user's\n    responsibility to convert it back to CUDA tensor if needed.\n    Args:\n        func: a stateless callable that takes tensor-like objects as arguments\n    Returns:\n        a callable which retries `func` if OOM is encountered.\n    Examples:\n    ::\n        output = retry_if_cuda_oom(some_torch_function)(input1, input2)\n        # output may be on CPU even if inputs are on GPU\n    Note:\n        1. When converting inputs to CPU, it will only look at each argument and check\n           if it has `.device` and `.to` for conversion. Nested structures of tensors\n           are not supported.\n        2. 
Since the function might be called more than once, it has to be\n           stateless.\n    \"\"\"\n\n    def maybe_to_cpu(x):\n        try:\n            like_gpu_tensor = x.device.type == \"cuda\" and hasattr(x, \"to\")\n        except AttributeError:\n            like_gpu_tensor = False\n        if like_gpu_tensor:\n            return x.to(device=\"cpu\").to(torch.float32)\n        else:\n            return x\n\n    @wraps(func)\n    def wrapped(*args, **kwargs):\n        with _ignore_torch_cuda_oom():\n            return func(*args, **kwargs)\n\n        # Clear cache and retry\n        torch.cuda.empty_cache()\n        with _ignore_torch_cuda_oom():\n            return func(*args, **kwargs)\n\n        # Try on CPU. This slows down the code significantly, therefore print a notice.\n        logger = logging.getLogger(__name__)\n        logger.info(\"Attempting to copy inputs to CPU due to CUDA OOM\")\n        new_args = (maybe_to_cpu(x) for x in args)\n        new_kwargs = {k: maybe_to_cpu(v) for k, v in kwargs.items()}\n        with autocast(enabled=False):\n            return func(*new_args, **new_kwargs)\n\n    return wrapped\n"
  },
  {
    "path": "mfvis_nococo/mask2former_video/video_maskformer_model.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\nimport logging\nimport math\nfrom typing import Tuple\n\nimport torch\nfrom torch import nn\nfrom torch.nn import functional as F\n\nfrom detectron2.config import configurable\nfrom detectron2.data import MetadataCatalog\nfrom detectron2.modeling import META_ARCH_REGISTRY, build_backbone, build_sem_seg_head\nfrom detectron2.modeling.backbone import Backbone\nfrom detectron2.modeling.postprocessing import sem_seg_postprocess\nfrom detectron2.structures import Boxes, ImageList, Instances, BitMasks\n\nfrom .modeling.criterion import VideoSetCriterion\nfrom .modeling.matcher import VideoHungarianMatcher\nfrom .utils.memory import retry_if_cuda_oom\n\nfrom skimage import color\nimport cv2\nimport numpy as np\n\ndef unfold_wo_center(x, kernel_size, dilation):\n    assert x.dim() == 4\n    assert kernel_size % 2 == 1\n\n    # using SAME padding\n    padding = (kernel_size + (dilation - 1) * (kernel_size - 1)) // 2\n    unfolded_x = F.unfold(\n        x, kernel_size=kernel_size,\n        padding=padding,\n        dilation=dilation\n    )\n\n    unfolded_x = unfolded_x.reshape(\n        x.size(0), x.size(1), -1, x.size(2), x.size(3)\n    )\n\n    # remove the center pixels\n    size = kernel_size ** 2\n    unfolded_x = torch.cat((\n        unfolded_x[:, :, :size // 2],\n        unfolded_x[:, :, size // 2 + 1:]\n    ), dim=2)\n\n    return unfolded_x\n\ndef unfold_w_center(x, kernel_size, dilation):\n    assert x.dim() == 4\n    assert kernel_size % 2 == 1\n\n    # using SAME padding\n    padding = (kernel_size + (dilation - 1) * (kernel_size - 1)) // 2\n    unfolded_x = F.unfold(\n        x, kernel_size=kernel_size,\n        padding=padding,\n        dilation=dilation\n    )\n\n    unfolded_x = unfolded_x.reshape(\n        x.size(0), x.size(1), -1, x.size(2), x.size(3)\n    )\n\n    return unfolded_x\n\ndef get_images_color_similarity(images, kernel_size, dilation):\n    assert images.dim() == 4\n    assert 
images.size(0) == 1\n\n    unfolded_images = unfold_wo_center(\n        images, kernel_size=kernel_size, dilation=dilation\n    )\n\n    diff = images[:, :, None] - unfolded_images\n    similarity = torch.exp(-torch.norm(diff, dim=1) * 0.5)\n    \n    return similarity\n\ndef get_neighbor_images_color_similarity(images, images_neighbor, kernel_size, dilation):\n    assert images.dim() == 4\n    assert images.size(0) == 1\n\n    unfolded_images = unfold_w_center(\n        images, kernel_size=kernel_size, dilation=dilation\n    )\n\n    diff = images_neighbor[:, :, None] - unfolded_images\n    similarity = torch.exp(-torch.norm(diff, dim=1) * 0.5)\n\n    return similarity\n\ndef get_neighbor_images_patch_color_similarity(images, images_neighbor, kernel_size, dilation):\n    assert images.dim() == 4\n    assert images.size(0) == 1\n\n    unfolded_images = unfold_w_center(\n        images, kernel_size=kernel_size, dilation= 1 #dilation\n    )\n    unfolded_images_neighbor = unfold_w_center(\n        images_neighbor, kernel_size=kernel_size, dilation= 1 #dilation\n    )\n    \n    unfolded_images = unfolded_images.flatten(1,2)\n    unfolded_images_neighbor = unfolded_images_neighbor.flatten(1,2)\n    \n    similarity = get_neighbor_images_color_similarity(unfolded_images, unfolded_images_neighbor, 3, 3)\n\n    return similarity\n\nlogger = logging.getLogger(__name__)\n\n\n@META_ARCH_REGISTRY.register()\nclass VideoMaskFormer(nn.Module):\n    \"\"\"\n    Main class for mask classification semantic segmentation architectures.\n    \"\"\"\n\n    @configurable\n    def __init__(\n        self,\n        *,\n        backbone: Backbone,\n        sem_seg_head: nn.Module,\n        criterion: nn.Module,\n        num_queries: int,\n        object_mask_threshold: float,\n        overlap_threshold: float,\n        metadata,\n        size_divisibility: int,\n        sem_seg_postprocess_before_inference: bool,\n        pixel_mean: Tuple[float],\n        pixel_std: Tuple[float],\n      
  # video\n        num_frames,\n    ):\n        \"\"\"\n        Args:\n            backbone: a backbone module, must follow detectron2's backbone interface\n            sem_seg_head: a module that predicts semantic segmentation from backbone features\n            criterion: a module that defines the loss\n            num_queries: int, number of queries\n            object_mask_threshold: float, threshold to filter query based on classification score\n                for panoptic segmentation inference\n            overlap_threshold: overlap threshold used in general inference for panoptic segmentation\n            metadata: dataset meta, get `thing` and `stuff` category names for panoptic\n                segmentation inference\n            size_divisibility: Some backbones require the input height and width to be divisible by a\n                specific integer. We can use this to override such requirement.\n            sem_seg_postprocess_before_inference: whether to resize the prediction back\n                to original input size before semantic segmentation inference or after.\n                For high-resolution dataset like Mapillary, resizing predictions before\n                inference will cause OOM error.\n            pixel_mean, pixel_std: list or tuple with #channels element, representing\n                the per-channel mean and std to be used to normalize the input image\n            semantic_on: bool, whether to output semantic segmentation prediction\n            instance_on: bool, whether to output instance segmentation prediction\n            panoptic_on: bool, whether to output panoptic segmentation prediction\n            test_topk_per_image: int, instance segmentation parameter, keep topk instances per image\n        \"\"\"\n        super().__init__()\n        self.backbone = backbone\n        self.sem_seg_head = sem_seg_head\n        self.criterion = criterion\n        self.num_queries = num_queries\n        self.overlap_threshold = 
overlap_threshold\n        self.object_mask_threshold = object_mask_threshold\n        self.metadata = metadata\n        if size_divisibility < 0:\n            # use backbone size_divisibility if not set\n            size_divisibility = self.backbone.size_divisibility\n        self.size_divisibility = size_divisibility\n        self.sem_seg_postprocess_before_inference = sem_seg_postprocess_before_inference\n        self.register_buffer(\"pixel_mean\", torch.Tensor(pixel_mean).view(-1, 1, 1), False)\n        self.register_buffer(\"pixel_std\", torch.Tensor(pixel_std).view(-1, 1, 1), False)\n\n        self.num_frames = num_frames\n        #self.structure_fc = nn.Conv2d(27, 256, 1)\n\n    @classmethod\n    def from_config(cls, cfg):\n        backbone = build_backbone(cfg)\n        sem_seg_head = build_sem_seg_head(cfg, backbone.output_shape())\n\n        # Loss parameters:\n        deep_supervision = cfg.MODEL.MASK_FORMER.DEEP_SUPERVISION\n        no_object_weight = cfg.MODEL.MASK_FORMER.NO_OBJECT_WEIGHT\n\n        # loss weights\n        class_weight = cfg.MODEL.MASK_FORMER.CLASS_WEIGHT\n        dice_weight = cfg.MODEL.MASK_FORMER.DICE_WEIGHT\n        mask_weight = cfg.MODEL.MASK_FORMER.MASK_WEIGHT\n\n        # building criterion\n        matcher = VideoHungarianMatcher(\n            cost_class=class_weight,\n            cost_mask=mask_weight,\n            cost_dice=dice_weight,\n            num_points=cfg.MODEL.MASK_FORMER.TRAIN_NUM_POINTS,\n        )\n\n        weight_dict = {\"loss_ce\": class_weight, \"loss_mask\": mask_weight, \"loss_dice\": dice_weight, \"loss_bound\": mask_weight, \"loss_bound_neighbor\": mask_weight}\n\n        if deep_supervision:\n            dec_layers = cfg.MODEL.MASK_FORMER.DEC_LAYERS\n            aux_weight_dict = {}\n            for i in range(dec_layers - 1):\n                aux_weight_dict.update({k + f\"_{i}\": v for k, v in weight_dict.items()})\n            weight_dict.update(aux_weight_dict)\n\n        losses = [\"labels\", 
\"masks\"]\n\n        criterion = VideoSetCriterion(\n            sem_seg_head.num_classes,\n            matcher=matcher,\n            weight_dict=weight_dict,\n            eos_coef=no_object_weight,\n            losses=losses,\n            num_points=cfg.MODEL.MASK_FORMER.TRAIN_NUM_POINTS,\n            oversample_ratio=cfg.MODEL.MASK_FORMER.OVERSAMPLE_RATIO,\n            importance_sample_ratio=cfg.MODEL.MASK_FORMER.IMPORTANCE_SAMPLE_RATIO,\n        )\n\n        return {\n            \"backbone\": backbone,\n            \"sem_seg_head\": sem_seg_head,\n            \"criterion\": criterion,\n            \"num_queries\": cfg.MODEL.MASK_FORMER.NUM_OBJECT_QUERIES,\n            \"object_mask_threshold\": cfg.MODEL.MASK_FORMER.TEST.OBJECT_MASK_THRESHOLD,\n            \"overlap_threshold\": cfg.MODEL.MASK_FORMER.TEST.OVERLAP_THRESHOLD,\n            \"metadata\": MetadataCatalog.get(cfg.DATASETS.TRAIN[0]),\n            \"size_divisibility\": cfg.MODEL.MASK_FORMER.SIZE_DIVISIBILITY,\n            \"sem_seg_postprocess_before_inference\": True,\n            \"pixel_mean\": cfg.MODEL.PIXEL_MEAN,\n            \"pixel_std\": cfg.MODEL.PIXEL_STD,\n            # video\n            \"num_frames\": cfg.INPUT.SAMPLING_FRAME_NUM,\n        }\n\n    @property\n    def device(self):\n        return self.pixel_mean.device\n    \n    def forward(self, batched_inputs):\n        \"\"\"\n        Args:\n            batched_inputs: a list, batched outputs of :class:`DatasetMapper`.\n                Each item in the list contains the inputs for one image.\n                For now, each item in the list is a dict that contains:\n                   * \"image\": Tensor, image in (C, H, W) format.\n                   * \"instances\": per-region ground truth\n                   * Other information that's included in the original dicts, such as:\n                     \"height\", \"width\" (int): the output resolution of the model (may be different\n                     from input resolution), used in 
inference.\n        Returns:\n            list[dict]:\n                each dict has the results for one image. The dict contains the following keys:\n\n                * \"sem_seg\":\n                    A Tensor that represents the\n                    per-pixel segmentation predicted by the head.\n                    The prediction has shape KxHxW that represents the logits of\n                    each class for each pixel.\n                * \"panoptic_seg\":\n                    A tuple that represents the panoptic output\n                    panoptic_seg (Tensor): of shape (height, width) where the values are ids for each segment.\n                    segments_info (list[dict]): Describe each segment in `panoptic_seg`.\n                        Each dict contains keys \"id\", \"category_id\", \"isthing\".\n        \"\"\"\n        images = []\n        for video in batched_inputs:\n            for frame in video[\"image\"]:\n                images.append(frame.to(self.device))\n        \n        if self.training:\n            k_size = 3\n            rs_images = ImageList.from_tensors(images, self.size_divisibility)\n            downsampled_images = F.avg_pool2d(rs_images.tensor.float(), kernel_size=4, stride=4, padding=0)\n            images_lab = [torch.as_tensor(color.rgb2lab(ds_image[[2, 1, 0]].byte().permute(1, 2, 0).cpu().numpy()), device=ds_image.device, dtype=torch.float32).permute(2, 0, 1) for ds_image in downsampled_images]\n            images_lab_sim = [get_images_color_similarity(img_lab.unsqueeze(0), k_size, 2) for img_lab in images_lab] # ori is 0.3, 0.5, 0.7\n\n            images_lab_sim_nei = [get_neighbor_images_patch_color_similarity(images_lab[ii].unsqueeze(0), images_lab[ii+1].unsqueeze(0), 3, 3) for ii in range(0, len(images_lab), 5)] # ori dilation is 3\n            images_lab_sim_nei1 = [get_neighbor_images_patch_color_similarity(images_lab[ii+1].unsqueeze(0), images_lab[ii+2].unsqueeze(0), 3, 3) for ii in range(0, 
len(images_lab), 5)]\n            images_lab_sim_nei2 = [get_neighbor_images_patch_color_similarity(images_lab[ii+2].unsqueeze(0), images_lab[ii+3].unsqueeze(0), 3, 3) for ii in range(0, len(images_lab), 5)]\n            images_lab_sim_nei3 = [get_neighbor_images_patch_color_similarity(images_lab[ii+3].unsqueeze(0), images_lab[ii+4].unsqueeze(0), 3, 3) for ii in range(0, len(images_lab), 5)]\n            images_lab_sim_nei4 = [get_neighbor_images_patch_color_similarity(images_lab[ii+4].unsqueeze(0), images_lab[ii].unsqueeze(0), 3, 3) for ii in range(0, len(images_lab), 5)]\n\n\n\n        images = [(x - self.pixel_mean) / self.pixel_std for x in images]\n        images = ImageList.from_tensors(images, self.size_divisibility)\n\n        features = self.backbone(images.tensor)\n        outputs = self.sem_seg_head(features)\n\n        if self.training:\n            # mask classification target\n            targets = self.prepare_targets(batched_inputs, images)\n\n            # bipartite matching-based loss\n            losses = self.criterion(outputs, targets, images_lab_sim, images_lab_sim_nei, images_lab_sim_nei1, images_lab_sim_nei2, images_lab_sim_nei3, images_lab_sim_nei4)\n\n            for k in list(losses.keys()):\n                if k in self.criterion.weight_dict:\n                    losses[k] *= self.criterion.weight_dict[k]\n                else:\n                    # remove this loss if not specified in `weight_dict`\n                    losses.pop(k)\n            return losses\n        else:\n            mask_cls_results = outputs[\"pred_logits\"]\n            mask_pred_results = outputs[\"pred_masks\"]\n\n            mask_cls_result = mask_cls_results[0]\n            # upsample masks\n            mask_pred_result = retry_if_cuda_oom(F.interpolate)(\n                mask_pred_results[0],\n                size=(images.tensor.shape[-2], images.tensor.shape[-1]),\n                mode=\"bilinear\",\n                align_corners=False,\n            )\n\n   
         del outputs\n\n            input_per_image = batched_inputs[0]\n            image_size = images.image_sizes[0]  # image size without padding after data augmentation\n\n            height = input_per_image.get(\"height\", image_size[0])  # raw image size before data augmentation\n            width = input_per_image.get(\"width\", image_size[1])\n\n            return retry_if_cuda_oom(self.inference_video)(mask_cls_result, mask_pred_result, image_size, height, width)\n\n    def prepare_targets(self, targets, images):\n        h_pad, w_pad = images.tensor.shape[-2:]\n        gt_instances = []\n        for targets_per_video in targets:\n            _num_instance = len(targets_per_video[\"instances\"][0])\n            mask_shape = [_num_instance, self.num_frames, h_pad, w_pad]\n            gt_masks_per_video = torch.zeros(mask_shape, dtype=torch.bool, device=self.device)\n\n            gt_ids_per_video = []\n            for f_i, targets_per_frame in enumerate(targets_per_video[\"instances\"]):\n                targets_per_frame = targets_per_frame.to(self.device)\n                h, w = targets_per_frame.image_size\n\n                gt_ids_per_video.append(targets_per_frame.gt_ids[:, None])\n                gt_masks_per_video[:, f_i, :h, :w] = targets_per_frame.gt_masks.tensor\n\n            gt_ids_per_video = torch.cat(gt_ids_per_video, dim=1)\n            valid_idx = (gt_ids_per_video != -1).any(dim=-1)\n\n            gt_classes_per_video = targets_per_frame.gt_classes[valid_idx]          # N,\n            gt_ids_per_video = gt_ids_per_video[valid_idx]                          # N, num_frames\n\n            gt_instances.append({\"labels\": gt_classes_per_video, \"ids\": gt_ids_per_video})\n            gt_masks_per_video = gt_masks_per_video[valid_idx].float()          # N, num_frames, H, W\n            gt_instances[-1].update({\"masks\": gt_masks_per_video})\n\n        return gt_instances\n\n    def inference_video(self, pred_cls, pred_masks, img_size, 
output_height, output_width):\n        if len(pred_cls) > 0:\n            scores = F.softmax(pred_cls, dim=-1)[:, :-1]\n            labels = torch.arange(self.sem_seg_head.num_classes, device=self.device).unsqueeze(0).repeat(self.num_queries, 1).flatten(0, 1)\n            # keep top-10 predictions\n            scores_per_image, topk_indices = scores.flatten(0, 1).topk(10, sorted=False)\n            labels_per_image = labels[topk_indices]\n            topk_indices = topk_indices // self.sem_seg_head.num_classes\n            pred_masks = pred_masks[topk_indices]\n\n            pred_masks = pred_masks[:, :, : img_size[0], : img_size[1]]\n            pred_masks = F.interpolate(\n                pred_masks, size=(output_height, output_width), mode=\"bilinear\", align_corners=False\n            )\n\n            masks = pred_masks > 0.\n\n            out_scores = scores_per_image.tolist()\n            out_labels = labels_per_image.tolist()\n            out_masks = [m for m in masks.cpu()]\n        else:\n            out_scores = []\n            out_labels = []\n            out_masks = []\n\n        video_output = {\n            \"image_size\": (output_height, output_width),\n            \"pred_scores\": out_scores,\n            \"pred_labels\": out_labels,\n            \"pred_masks\": out_masks,\n        }\n\n        return video_output\n"
  },
  {
    "path": "mfvis_nococo/scripts/eval_8gpu_mask2former_r101_video.sh",
    "content": "export PYTHONPATH=$PYTHONPATH:`pwd`\n\nID=159\n\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 train_net_video.py --num-gpus 8 --resume --dist-url tcp://0.0.0.0:12349\\\n\t--config-file configs/youtubevis_2019/video_maskformer2_R101_bs16_8ep.yaml\\\n        --eval-only MODEL.WEIGHTS ../mfvis_models/model_final_r101_0473.pth\n\n\n\n"
  },
  {
    "path": "mfvis_nococo/scripts/train_8gpu_mask2former_r101_video_coco.sh",
    "content": "export PYTHONPATH=$PYTHONPATH:`pwd`\n\nID=159\n\n\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 train_net_video.py --num-gpus 8 --resume --dist-url tcp://0.0.0.0:12349\\\n\t--config-file configs/youtubevis_2019/video_maskformer2_R101_bs16_8ep_coco.yaml \n\n\n\n"
  },
  {
    "path": "mfvis_nococo/scripts/train_8gpu_mask2former_r50_video.sh",
    "content": "export PYTHONPATH=$PYTHONPATH:`pwd`\n\nID=159\n\n\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 train_net_video.py --num-gpus 8 --resume --dist-url tcp://0.0.0.0:12349\\\n\t--config-file configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep.yaml \n\n\n\n"
  },
  {
    "path": "mfvis_nococo/scripts/train_8gpu_mask2former_r50_video_coco.sh",
    "content": "export PYTHONPATH=$PYTHONPATH:`pwd`\n\nID=159\n\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 train_net_video.py --num-gpus 8 --resume --dist-url tcp://0.0.0.0:12349\\\n\t--config-file configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep_coco.yaml \n\n\n\n"
  },
  {
    "path": "mfvis_nococo/scripts/visual_video_r101.sh",
    "content": "export PYTHONPATH=$PYTHONPATH:`pwd`\n\nCUDA_VISIBLE_DEVICES=0 python3 demo_video/demo.py --config-file configs/youtubevis_2019/video_maskformer2_R101_bs16_8ep.yaml --save-frames True \\\n  --input './datasets/ytvis_2019/valid/JPEGImages/' \\\n  --output 'box_patch_newknn_r101_vis/' \\\n  --opts MODEL.WEIGHTS ../mfvis_models/model_final_r101_0473.pth\n  "
  },
  {
    "path": "mfvis_nococo/scripts/visual_video_r50.sh",
    "content": "export PYTHONPATH=$PYTHONPATH:`pwd`\n\nCUDA_VISIBLE_DEVICES=0 python3 demo_video/demo.py --config-file configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep.yaml --save-frames True \\\n  --input './datasets/ytvis_2019/valid/JPEGImages/' \\\n  --output 'box_patch_newknn_r50_vis/' \\\n  --opts MODEL.WEIGHTS ./mfvis_models/model_final_r50_0438.pth\n  \n"
  },
  {
    "path": "mfvis_nococo/train_net_video.py",
    "content": "\"\"\"\nThis script is a simplified version of the training script in detectron2/tools.\n\"\"\"\ntry:\n    # ignore ShapelyDeprecationWarning from fvcore\n    from shapely.errors import ShapelyDeprecationWarning\n    import warnings\n    warnings.filterwarnings('ignore', category=ShapelyDeprecationWarning)\nexcept:\n    pass\n\nimport copy\nimport itertools\nimport logging\nimport os\n\nfrom collections import OrderedDict\nfrom typing import Any, Dict, List, Set\n\nimport torch\n\nimport detectron2.utils.comm as comm\nfrom detectron2.checkpoint import DetectionCheckpointer\nfrom detectron2.config import get_cfg\nfrom detectron2.data import MetadataCatalog\nfrom detectron2.engine import (\n    DefaultTrainer,\n    default_argument_parser,\n    default_setup,\n    launch,\n)\nfrom detectron2.evaluation import (\n    DatasetEvaluator,\n    inference_on_dataset,\n    print_csv_format,\n    verify_results,\n)\nfrom detectron2.projects.deeplab import add_deeplab_config, build_lr_scheduler\nfrom detectron2.solver.build import maybe_add_gradient_clipping\nfrom detectron2.utils.logger import setup_logger\n\n# MaskFormer\nfrom mask2former import add_maskformer2_config\nfrom mask2former_video import (\n    YTVISDatasetMapper,\n    YTVISEvaluator,\n    add_maskformer2_video_config,\n    build_detection_train_loader,\n    build_detection_test_loader,\n    get_detection_dataset_dicts,\n)\n\n\nclass Trainer(DefaultTrainer):\n    \"\"\"\n    Extension of the Trainer class adapted to MaskFormer.\n    \"\"\"\n\n    @classmethod\n    def build_evaluator(cls, cfg, dataset_name, output_folder=None):\n        \"\"\"\n        Create evaluator(s) for a given dataset.\n        This uses the special metadata \"evaluator_type\" associated with each builtin dataset.\n        For your own dataset, you can simply create an evaluator manually in your\n        script and do not have to worry about the hacky if-else logic here.\n        \"\"\"\n        if output_folder is None:\n   
         output_folder = os.path.join(cfg.OUTPUT_DIR, \"inference\")\n            os.makedirs(output_folder, exist_ok=True)\n\n        return YTVISEvaluator(dataset_name, cfg, True, output_folder)\n\n    @classmethod\n    def build_train_loader(cls, cfg):\n        dataset_name = cfg.DATASETS.TRAIN[0]\n        mapper = YTVISDatasetMapper(cfg, is_train=True)\n\n        dataset_dict = get_detection_dataset_dicts(\n            dataset_name,\n            filter_empty=cfg.DATALOADER.FILTER_EMPTY_ANNOTATIONS,\n            proposal_files=cfg.DATASETS.PROPOSAL_FILES_TRAIN if cfg.MODEL.LOAD_PROPOSALS else None,\n        )\n\n        return build_detection_train_loader(cfg, mapper=mapper, dataset=dataset_dict)\n\n    @classmethod\n    def build_test_loader(cls, cfg, dataset_name):\n        # build a loader for the dataset_name passed by the caller; do not\n        # override it with cfg.DATASETS.TEST[0], or only the first test\n        # dataset would ever be evaluated\n        mapper = YTVISDatasetMapper(cfg, is_train=False)\n        return build_detection_test_loader(cfg, dataset_name, mapper=mapper)\n\n    @classmethod\n    def build_lr_scheduler(cls, cfg, optimizer):\n        \"\"\"\n        It now calls :func:`detectron2.solver.build_lr_scheduler`.\n        Overwrite it if you'd like a different scheduler.\n        \"\"\"\n        return build_lr_scheduler(cfg, optimizer)\n\n    @classmethod\n    def build_optimizer(cls, cfg, model):\n        weight_decay_norm = cfg.SOLVER.WEIGHT_DECAY_NORM\n        weight_decay_embed = cfg.SOLVER.WEIGHT_DECAY_EMBED\n\n        defaults = {}\n        defaults[\"lr\"] = cfg.SOLVER.BASE_LR\n        defaults[\"weight_decay\"] = cfg.SOLVER.WEIGHT_DECAY\n\n        norm_module_types = (\n            torch.nn.BatchNorm1d,\n            torch.nn.BatchNorm2d,\n            torch.nn.BatchNorm3d,\n            torch.nn.SyncBatchNorm,\n            # NaiveSyncBatchNorm inherits from BatchNorm2d\n            torch.nn.GroupNorm,\n            torch.nn.InstanceNorm1d,\n            torch.nn.InstanceNorm2d,\n            torch.nn.InstanceNorm3d,\n            torch.nn.LayerNorm,\n            
torch.nn.LocalResponseNorm,\n        )\n\n        params: List[Dict[str, Any]] = []\n        memo: Set[torch.nn.parameter.Parameter] = set()\n        for module_name, module in model.named_modules():\n            for module_param_name, value in module.named_parameters(recurse=False):\n                if not value.requires_grad:\n                    continue\n                # Avoid duplicating parameters\n                if value in memo:\n                    continue\n                memo.add(value)\n\n                hyperparams = copy.copy(defaults)\n                if \"backbone\" in module_name:\n                    hyperparams[\"lr\"] = hyperparams[\"lr\"] * cfg.SOLVER.BACKBONE_MULTIPLIER\n                if (\n                    \"relative_position_bias_table\" in module_param_name\n                    or \"absolute_pos_embed\" in module_param_name\n                ):\n                    print(module_param_name)\n                    hyperparams[\"weight_decay\"] = 0.0\n                if isinstance(module, norm_module_types):\n                    hyperparams[\"weight_decay\"] = weight_decay_norm\n                if isinstance(module, torch.nn.Embedding):\n                    hyperparams[\"weight_decay\"] = weight_decay_embed\n                params.append({\"params\": [value], **hyperparams})\n\n        def maybe_add_full_model_gradient_clipping(optim):\n            # detectron2 doesn't have full model gradient clipping now\n            clip_norm_val = cfg.SOLVER.CLIP_GRADIENTS.CLIP_VALUE\n            enable = (\n                cfg.SOLVER.CLIP_GRADIENTS.ENABLED\n                and cfg.SOLVER.CLIP_GRADIENTS.CLIP_TYPE == \"full_model\"\n                and clip_norm_val > 0.0\n            )\n\n            class FullModelGradientClippingOptimizer(optim):\n                def step(self, closure=None):\n                    all_params = itertools.chain(*[x[\"params\"] for x in self.param_groups])\n                    torch.nn.utils.clip_grad_norm_(all_params, 
clip_norm_val)\n                    super().step(closure=closure)\n\n            return FullModelGradientClippingOptimizer if enable else optim\n\n        optimizer_type = cfg.SOLVER.OPTIMIZER\n        if optimizer_type == \"SGD\":\n            optimizer = maybe_add_full_model_gradient_clipping(torch.optim.SGD)(\n                params, cfg.SOLVER.BASE_LR, momentum=cfg.SOLVER.MOMENTUM\n            )\n        elif optimizer_type == \"ADAMW\":\n            optimizer = maybe_add_full_model_gradient_clipping(torch.optim.AdamW)(\n                params, cfg.SOLVER.BASE_LR\n            )\n        else:\n            raise NotImplementedError(f\"no optimizer type {optimizer_type}\")\n        if not cfg.SOLVER.CLIP_GRADIENTS.CLIP_TYPE == \"full_model\":\n            optimizer = maybe_add_gradient_clipping(cfg, optimizer)\n        return optimizer\n\n    @classmethod\n    def test(cls, cfg, model, evaluators=None):\n        \"\"\"\n        Evaluate the given model. The given model is expected to already contain\n        weights to evaluate.\n        Args:\n            cfg (CfgNode):\n            model (nn.Module):\n            evaluators (list[DatasetEvaluator] or None): if None, will call\n                :meth:`build_evaluator`. 
Otherwise, must have the same length as\n                ``cfg.DATASETS.TEST``.\n        Returns:\n            dict: a dict of result metrics\n        \"\"\"\n        from torch.cuda.amp import autocast\n        logger = logging.getLogger(__name__)\n        if isinstance(evaluators, DatasetEvaluator):\n            evaluators = [evaluators]\n        if evaluators is not None:\n            assert len(cfg.DATASETS.TEST) == len(evaluators), \"{} != {}\".format(\n                len(cfg.DATASETS.TEST), len(evaluators)\n            )\n\n        results = OrderedDict()\n        for idx, dataset_name in enumerate(cfg.DATASETS.TEST):\n            data_loader = cls.build_test_loader(cfg, dataset_name)\n            # When evaluators are passed in as arguments,\n            # implicitly assume that evaluators can be created before data_loader.\n            if evaluators is not None:\n                evaluator = evaluators[idx]\n            else:\n                try:\n                    evaluator = cls.build_evaluator(cfg, dataset_name)\n                except NotImplementedError:\n                    logger.warn(\n                        \"No evaluator found. Use `DefaultTrainer.test(evaluators=)`, \"\n                        \"or implement its `build_evaluator` method.\"\n                    )\n                    results[dataset_name] = {}\n                    continue\n            with autocast():\n                results_i = inference_on_dataset(model, data_loader, evaluator)\n            results[dataset_name] = results_i\n            if comm.is_main_process():\n                assert isinstance(\n                    results_i, dict\n                ), \"Evaluator must return a dict on the main process. 
Got {} instead.\".format(\n                    results_i\n                )\n                logger.info(\"Evaluation results for {} in csv format:\".format(dataset_name))\n                print_csv_format(results_i)\n\n        if len(results) == 1:\n            results = list(results.values())[0]\n        return results\n\n\ndef setup(args):\n    \"\"\"\n    Create configs and perform basic setups.\n    \"\"\"\n    cfg = get_cfg()\n    # for poly lr schedule\n    add_deeplab_config(cfg)\n    add_maskformer2_config(cfg)\n    add_maskformer2_video_config(cfg)\n    cfg.merge_from_file(args.config_file)\n    cfg.merge_from_list(args.opts)\n    cfg.freeze()\n    default_setup(cfg, args)\n    # Setup logger for \"mask_former\" module\n    setup_logger(name=\"mask2former\")\n    setup_logger(output=cfg.OUTPUT_DIR, distributed_rank=comm.get_rank(), name=\"mask2former_video\")\n    return cfg\n\n\ndef main(args):\n    cfg = setup(args)\n\n    if args.eval_only:\n        model = Trainer.build_model(cfg)\n        DetectionCheckpointer(model, save_dir=cfg.OUTPUT_DIR).resume_or_load(\n            cfg.MODEL.WEIGHTS, resume=args.resume\n        )\n        res = Trainer.test(cfg, model)\n        if cfg.TEST.AUG.ENABLED:\n            raise NotImplementedError\n        if comm.is_main_process():\n            verify_results(cfg, res)\n        return res\n\n    trainer = Trainer(cfg)\n    trainer.resume_or_load(resume=args.resume)\n    return trainer.train()\n\n\nif __name__ == \"__main__\":\n    args = default_argument_parser().parse_args()\n    print(\"Command Line Args:\", args)\n    launch(\n        main,\n        args.num_gpus,\n        num_machines=args.num_machines,\n        machine_rank=args.machine_rank,\n        dist_url=args.dist_url,\n        args=(args,),\n    )\n"
  },
  {
    "path": "requirements.txt",
    "content": "cython\nscipy\nshapely\ntimm\nh5py\nsubmitit\nscikit-image\n"
  },
  {
    "path": "scripts/eval_8gpu_mask2former_r101_video.sh",
    "content": "export PYTHONPATH=$PYTHONPATH:`pwd`\n\nID=159\n\n\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 train_net_video.py --num-gpus 8 --resume --dist-url tcp://0.0.0.0:12349\\\n\t--config-file configs/youtubevis_2019/video_maskformer2_R101_bs16_8ep.yaml\\\n        --eval-only MODEL.WEIGHTS ./mfvis_models/model_final_r101_0491.pth\n"
  },
  {
    "path": "scripts/eval_8gpu_mask2former_r50_video.sh",
    "content": "export PYTHONPATH=$PYTHONPATH:`pwd`\n\nID=159\n\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 train_net_video.py --num-gpus 8 --resume --dist-url tcp://0.0.0.0:12349\\\n\t--config-file configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep.yaml\\\n        --eval-only MODEL.WEIGHTS ./mfvis_models/model_final_r50_0466.pth\n\n"
  },
  {
    "path": "scripts/eval_8gpu_mask2former_swinl_video.sh",
    "content": "export PYTHONPATH=$PYTHONPATH:`pwd`\n\nID=159\n\n\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 train_net_video.py --num-gpus 8 --resume --dist-url tcp://0.0.0.0:12349\\\n\t--config-file configs/youtubevis_2019/swin/video_maskformer2_swin_large_IN21k_384_bs16_8ep.yaml\\\n        --eval-only MODEL.WEIGHTS ./mfvis_models/model_final_swinl_0560.pth\n"
  },
  {
    "path": "scripts/train_8gpu_mask2former_r101_video.sh",
    "content": "export PYTHONPATH=$PYTHONPATH:`pwd`\n\nID=159\n\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 train_net_video.py --num-gpus 8 --resume --dist-url tcp://0.0.0.0:12349\\\n\t--config-file configs/youtubevis_2019/video_maskformer2_R101_bs16_8ep.yaml \n\n"
  },
  {
    "path": "scripts/train_8gpu_mask2former_r50_video.sh",
    "content": "export PYTHONPATH=$PYTHONPATH:`pwd`\n#export CUDA_LAUNCH_BLOCKING=1 # for debug\n\nID=159\n\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 train_net_video.py --num-gpus 8 --resume --dist-url tcp://0.0.0.0:12349\\\n\t--config-file configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep.yaml \n\n\n\n"
  },
  {
    "path": "scripts/train_8gpu_mask2former_swinl_video.sh",
    "content": "export PYTHONPATH=$PYTHONPATH:`pwd`\n\nID=159\n\n\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 train_net_video.py --num-gpus 8 --resume --dist-url tcp://0.0.0.0:12349\\\n\t--config-file configs/youtubevis_2019/swin/video_maskformer2_swin_large_IN21k_384_bs16_8ep.yaml \n\n\n\n"
  },
  {
    "path": "scripts/visual_video.sh",
    "content": "export PYTHONPATH=$PYTHONPATH:`pwd`\n\nCUDA_VISIBLE_DEVICES=0 python3 demo_video/demo.py --config-file configs/youtubevis_2019/video_maskformer2_R101_bs16_8ep.yaml --save-frames True \\\n  --input './datasets/ytvis_2019/valid/JPEGImages/' \\\n  --output 'r101_vis/' \\\n  --opts MODEL.WEIGHTS ./mfvis_models/model_final_r101_0491.pth\n"
  },
  {
    "path": "tools/README.md",
    "content": "This directory contains few tools for MaskFormer.\n\n* `convert-torchvision-to-d2.py`\n\nTool to convert torchvision pre-trained weights for D2.\n\n```\nwget https://download.pytorch.org/models/resnet101-63fe2227.pth\npython tools/convert-torchvision-to-d2.py resnet101-63fe2227.pth R-101.pkl\n```\n\n* `convert-pretrained-swin-model-to-d2.py`\n\nTool to convert Swin Transformer pre-trained weights for D2.\n\n```\npip install timm\n\nwget https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth\npython tools/convert-pretrained-swin-model-to-d2.py swin_tiny_patch4_window7_224.pth swin_tiny_patch4_window7_224.pkl\n\nwget https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_small_patch4_window7_224.pth\npython tools/convert-pretrained-swin-model-to-d2.py swin_small_patch4_window7_224.pth swin_small_patch4_window7_224.pkl\n\nwget https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window12_384_22k.pth\npython tools/convert-pretrained-swin-model-to-d2.py swin_base_patch4_window12_384_22k.pth swin_base_patch4_window12_384_22k.pkl\n\nwget https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_large_patch4_window12_384_22k.pth\npython tools/convert-pretrained-swin-model-to-d2.py swin_large_patch4_window12_384_22k.pth swin_large_patch4_window12_384_22k.pkl\n```\n\n* `evaluate_pq_for_semantic_segmentation.py`\n\nTool to evaluate PQ (PQ-stuff) for semantic segmentation predictions.\n\nUsage:\n\n```\npython tools/evaluate_pq_for_semantic_segmentation.py --dataset-name ade20k_sem_seg_val --json-file OUTPUT_DIR/inference/sem_seg_predictions.json\n```\n\nwhere `OUTPUT_DIR` is set in the config file.\n\n* `evaluate_coco_boundary_ap.py`\n\nTool to evaluate Boundary AP for instance segmentation predictions.\n\nUsage:\n\n```\npython tools/coco_instance_evaluation.py --gt-json-file COCO_GT_JSON --dt-json-file COCO_DT_JSON\n```\n\nTo install Boundary IoU API, 
run:\n\n```\npip install git+https://github.com/bowenc0221/boundary-iou-api.git\n```\n\n* `analyze_model.py`\n\nTool to analyze model parameters and flops.\n\nUsage for semantic segmentation (ADE20K only, use with caution!):\n\n```\npython tools/analyze_model.py --num-inputs 1 --tasks flop --use-fixed-input-size --config-file CONFIG_FILE\n```\n\nNote that, for semantic segmentation (ADE20K only), we use a dummy image with fixed size that equals to `cfg.INPUT.CROP.SIZE[0] x cfg.INPUT.CROP.SIZE[0]`.\nPlease do not use `--use-fixed-input-size` for calculating FLOPs on other datasets like Cityscapes!\n\nUsage for panoptic and instance segmentation:\n\n```\npython tools/analyze_model.py --num-inputs 100 --tasks flop --config-file CONFIG_FILE\n```\n\nNote that, for panoptic and instance segmentation, we compute the average flops over 100 real validation images.\n"
  },
  {
    "path": "tools/analyze_model.py",
    "content": "# -*- coding: utf-8 -*-\n# Modified by Bowen Cheng from https://github.com/facebookresearch/detectron2/blob/main/tools/analyze_model.py\n\nimport logging\nimport numpy as np\nfrom collections import Counter\nimport tqdm\nfrom fvcore.nn import flop_count_table  # can also try flop_count_str\n\nfrom detectron2.checkpoint import DetectionCheckpointer\nfrom detectron2.config import CfgNode, LazyConfig, get_cfg, instantiate\nfrom detectron2.data import build_detection_test_loader\nfrom detectron2.engine import default_argument_parser\nfrom detectron2.modeling import build_model\nfrom detectron2.projects.deeplab import add_deeplab_config\nfrom detectron2.utils.analysis import (\n    FlopCountAnalysis,\n    activation_count_operators,\n    parameter_count_table,\n)\nfrom detectron2.utils.logger import setup_logger\n\n# fmt: off\nimport os\nimport sys\nsys.path.insert(1, os.path.join(sys.path[0], '..'))\n# fmt: on\n\nfrom mask2former import add_maskformer2_config\n\nlogger = logging.getLogger(\"detectron2\")\n\n\ndef setup(args):\n    if args.config_file.endswith(\".yaml\"):\n        cfg = get_cfg()\n        add_deeplab_config(cfg)\n        add_maskformer2_config(cfg)\n        cfg.merge_from_file(args.config_file)\n        cfg.DATALOADER.NUM_WORKERS = 0\n        cfg.merge_from_list(args.opts)\n        cfg.freeze()\n    else:\n        cfg = LazyConfig.load(args.config_file)\n        cfg = LazyConfig.apply_overrides(cfg, args.opts)\n    setup_logger(name=\"fvcore\")\n    setup_logger()\n    return cfg\n\n\ndef do_flop(cfg):\n    if isinstance(cfg, CfgNode):\n        data_loader = build_detection_test_loader(cfg, cfg.DATASETS.TEST[0])\n        model = build_model(cfg)\n        DetectionCheckpointer(model).load(cfg.MODEL.WEIGHTS)\n    else:\n        data_loader = instantiate(cfg.dataloader.test)\n        model = instantiate(cfg.model)\n        model.to(cfg.train.device)\n        DetectionCheckpointer(model).load(cfg.train.init_checkpoint)\n    model.eval()\n\n  
  counts = Counter()\n    total_flops = []\n    for idx, data in zip(tqdm.trange(args.num_inputs), data_loader):  # noqa\n        if args.use_fixed_input_size and isinstance(cfg, CfgNode):\n            import torch\n            crop_size = cfg.INPUT.CROP.SIZE[0]\n            data[0][\"image\"] = torch.zeros((3, crop_size, crop_size))\n        flops = FlopCountAnalysis(model, data)\n        if idx > 0:\n            flops.unsupported_ops_warnings(False).uncalled_modules_warnings(False)\n        counts += flops.by_operator()\n        total_flops.append(flops.total())\n\n    logger.info(\"Flops table computed from only one input sample:\\n\" + flop_count_table(flops))\n    logger.info(\n        \"Average GFlops for each type of operators:\\n\"\n        + str([(k, v / (idx + 1) / 1e9) for k, v in counts.items()])\n    )\n    logger.info(\n        \"Total GFlops: {:.1f}±{:.1f}\".format(np.mean(total_flops) / 1e9, np.std(total_flops) / 1e9)\n    )\n\n\ndef do_activation(cfg):\n    if isinstance(cfg, CfgNode):\n        data_loader = build_detection_test_loader(cfg, cfg.DATASETS.TEST[0])\n        model = build_model(cfg)\n        DetectionCheckpointer(model).load(cfg.MODEL.WEIGHTS)\n    else:\n        data_loader = instantiate(cfg.dataloader.test)\n        model = instantiate(cfg.model)\n        model.to(cfg.train.device)\n        DetectionCheckpointer(model).load(cfg.train.init_checkpoint)\n    model.eval()\n\n    counts = Counter()\n    total_activations = []\n    for idx, data in zip(tqdm.trange(args.num_inputs), data_loader):  # noqa\n        count = activation_count_operators(model, data)\n        counts += count\n        total_activations.append(sum(count.values()))\n    logger.info(\n        \"(Million) Activations for Each Type of Operators:\\n\"\n        + str([(k, v / idx) for k, v in counts.items()])\n    )\n    logger.info(\n        \"Total (Million) Activations: {}±{}\".format(\n            np.mean(total_activations), np.std(total_activations)\n        )\n    
)\n\n\ndef do_parameter(cfg):\n    if isinstance(cfg, CfgNode):\n        model = build_model(cfg)\n    else:\n        model = instantiate(cfg.model)\n    logger.info(\"Parameter Count:\\n\" + parameter_count_table(model, max_depth=5))\n\n\ndef do_structure(cfg):\n    if isinstance(cfg, CfgNode):\n        model = build_model(cfg)\n    else:\n        model = instantiate(cfg.model)\n    logger.info(\"Model Structure:\\n\" + str(model))\n\n\nif __name__ == \"__main__\":\n    parser = default_argument_parser(\n        epilog=\"\"\"\nExamples:\nTo show parameters of a model:\n$ ./analyze_model.py --tasks parameter \\\\\n    --config-file ../configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml\nFlops and activations are data-dependent, therefore inputs and model weights\nare needed to count them:\n$ ./analyze_model.py --num-inputs 100 --tasks flop \\\\\n    --config-file ../configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml \\\\\n    MODEL.WEIGHTS /path/to/model.pkl\n\"\"\"\n    )\n    parser.add_argument(\n        \"--tasks\",\n        choices=[\"flop\", \"activation\", \"parameter\", \"structure\"],\n        required=True,\n        nargs=\"+\",\n    )\n    parser.add_argument(\n        \"-n\",\n        \"--num-inputs\",\n        default=100,\n        type=int,\n        help=\"number of inputs used to compute statistics for flops/activations, \"\n        \"both are data dependent.\",\n    )\n    parser.add_argument(\n        \"--use-fixed-input-size\",\n        action=\"store_true\",\n        help=\"use fixed input size when calculating flops\",\n    )\n    args = parser.parse_args()\n    assert not args.eval_only\n    assert args.num_gpus == 1\n\n    cfg = setup(args)\n\n    for task in args.tasks:\n        {\n            \"flop\": do_flop,\n            \"activation\": do_activation,\n            \"parameter\": do_parameter,\n            \"structure\": do_structure,\n        }[task](cfg)\n"
  },
  {
    "path": "tools/convert-pretrained-swin-model-to-d2.py",
    "content": "#!/usr/bin/env python\n\nimport pickle as pkl\nimport sys\n\nimport torch\n\n\"\"\"\nUsage:\n  # download pretrained swin model:\n  wget https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth\n  # run the conversion\n  ./convert-pretrained-model-to-d2.py swin_tiny_patch4_window7_224.pth swin_tiny_patch4_window7_224.pkl\n  # Then, use swin_tiny_patch4_window7_224.pkl with the following changes in config:\nMODEL:\n  WEIGHTS: \"/path/to/swin_tiny_patch4_window7_224.pkl\"\nINPUT:\n  FORMAT: \"RGB\"\n\"\"\"\n\nif __name__ == \"__main__\":\n    input = sys.argv[1]\n\n    obj = torch.load(input, map_location=\"cpu\")[\"model\"]\n\n    res = {\"model\": obj, \"__author__\": \"third_party\", \"matching_heuristics\": True}\n\n    with open(sys.argv[2], \"wb\") as f:\n        pkl.dump(res, f)\n"
  },
  {
    "path": "tools/convert-torchvision-to-d2.py",
    "content": "#!/usr/bin/env python\n\nimport pickle as pkl\nimport sys\n\nimport torch\n\n\"\"\"\nUsage:\n  # download one of the ResNet{18,34,50,101,152} models from torchvision:\n  wget https://download.pytorch.org/models/resnet50-19c8e357.pth -O r50.pth\n  # run the conversion\n  ./convert-torchvision-to-d2.py r50.pth r50.pkl\n  # Then, use r50.pkl with the following changes in config:\nMODEL:\n  WEIGHTS: \"/path/to/r50.pkl\"\n  PIXEL_MEAN: [123.675, 116.280, 103.530]\n  PIXEL_STD: [58.395, 57.120, 57.375]\n  RESNETS:\n    DEPTH: 50\n    STRIDE_IN_1X1: False\nINPUT:\n  FORMAT: \"RGB\"\n\"\"\"\n\nif __name__ == \"__main__\":\n    input = sys.argv[1]\n\n    obj = torch.load(input, map_location=\"cpu\")\n\n    newmodel = {}\n    for k in list(obj.keys()):\n        old_k = k\n        if \"layer\" not in k:\n            k = \"stem.\" + k\n        for t in [1, 2, 3, 4]:\n            k = k.replace(\"layer{}\".format(t), \"res{}\".format(t + 1))\n        for t in [1, 2, 3]:\n            k = k.replace(\"bn{}\".format(t), \"conv{}.norm\".format(t))\n        k = k.replace(\"downsample.0\", \"shortcut\")\n        k = k.replace(\"downsample.1\", \"shortcut.norm\")\n        print(old_k, \"->\", k)\n        newmodel[k] = obj.pop(old_k).detach().numpy()\n\n    res = {\"model\": newmodel, \"__author__\": \"torchvision\", \"matching_heuristics\": True}\n\n    with open(sys.argv[2], \"wb\") as f:\n        pkl.dump(res, f)\n    if obj:\n        print(\"Unconverted keys:\", obj.keys())\n"
  },
  {
    "path": "tools/evaluate_coco_boundary_ap.py",
    "content": "#!/usr/bin/env python\n# Modified by Bowen Cheng from: https://github.com/bowenc0221/boundary-iou-api/blob/master/tools/coco_instance_evaluation.py\n\n\"\"\"\nEvaluation for COCO val2017:\npython ./tools/coco_instance_evaluation.py \\\n    --gt-json-file COCO_GT_JSON \\\n    --dt-json-file COCO_DT_JSON\n\"\"\"\nimport argparse\nimport json\n\nfrom boundary_iou.coco_instance_api.coco import COCO\nfrom boundary_iou.coco_instance_api.cocoeval import COCOeval\n\n\ndef main():\n    parser = argparse.ArgumentParser()\n    parser.add_argument(\"--gt-json-file\", default=\"\")\n    parser.add_argument(\"--dt-json-file\", default=\"\")\n    parser.add_argument(\"--iou-type\", default=\"boundary\")\n    parser.add_argument(\"--dilation-ratio\", default=\"0.020\", type=float)\n    args = parser.parse_args()\n    print(args)\n\n    annFile = args.gt_json_file\n    resFile = args.dt_json_file\n    dilation_ratio = args.dilation_ratio\n    if args.iou_type == \"boundary\":\n        get_boundary = True\n    else:\n        get_boundary = False\n    cocoGt = COCO(annFile, get_boundary=get_boundary, dilation_ratio=dilation_ratio)\n    \n    # remove box predictions\n    resFile = json.load(open(resFile))\n    for c in resFile:\n        c.pop(\"bbox\", None)\n\n    cocoDt = cocoGt.loadRes(resFile)\n    cocoEval = COCOeval(cocoGt, cocoDt, iouType=args.iou_type, dilation_ratio=dilation_ratio)\n    cocoEval.evaluate()\n    cocoEval.accumulate()\n    cocoEval.summarize()\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "tools/evaluate_pq_for_semantic_segmentation.py",
    "content": "#!/usr/bin/env python\n\nimport argparse\nimport json\nimport os\nfrom collections import defaultdict\nfrom tqdm import tqdm\n\nimport numpy as np\nimport torch\n\nfrom detectron2.data import MetadataCatalog\nfrom detectron2.data.detection_utils import read_image\nfrom detectron2.utils.file_io import PathManager\nfrom pycocotools import mask as maskUtils\n\nfrom panopticapi.evaluation import PQStat\n\n\ndef default_argument_parser():\n    \"\"\"\n    Creates a parser with some common arguments used by analysis tools.\n    Returns:\n        argparse.ArgumentParser:\n    \"\"\"\n    parser = argparse.ArgumentParser(description=\"Evaluate PQ metric for semantic segmentation.\")\n    # NOTE: currently does not support Cityscapes, you need to convert\n    # Cityscapes prediction format to Detectron2 prediction format.\n    parser.add_argument(\n        \"--dataset-name\",\n        default=\"ade20k_sem_seg_val\",\n        choices=[\"ade20k_sem_seg_val\", \"coco_2017_test_stuff_10k_sem_seg\", \"ade20k_full_sem_seg_val\"],\n        help=\"dataset name you want to evaluate\")\n    parser.add_argument(\"--json-file\", default=\"\", help=\"path to detection json file\")\n\n    return parser\n\n\n# Modified from the official panoptic api: https://github.com/cocodataset/panopticapi/blob/master/panopticapi/evaluation.py\ndef pq_compute_single_image(segm_gt, segm_dt, categories, ignore_label):\n    pq_stat = PQStat()\n    VOID = ignore_label\n    OFFSET = 256 * 256 * 256\n\n    pan_gt = segm_gt\n    pan_pred = segm_dt\n\n    gt_ann = {'segments_info': []}\n    labels, labels_cnt = np.unique(segm_gt, return_counts=True)\n    for cat_id, cnt in zip(labels, labels_cnt):\n        if cat_id == VOID:\n            continue\n        gt_ann['segments_info'].append(\n            {\"id\": cat_id, \"category_id\": cat_id, \"area\": cnt, \"iscrowd\": 0}\n        )\n    \n    pred_ann = {'segments_info': []}\n    for cat_id in np.unique(segm_dt):\n        
pred_ann['segments_info'].append({\"id\": cat_id, \"category_id\": cat_id})\n\n    gt_segms = {el['id']: el for el in gt_ann['segments_info']}\n    pred_segms = {el['id']: el for el in pred_ann['segments_info']}\n\n    # predicted segments area calculation + prediction sanity checks\n    pred_labels_set = set(el['id'] for el in pred_ann['segments_info'])\n    labels, labels_cnt = np.unique(pan_pred, return_counts=True)\n    for label, label_cnt in zip(labels, labels_cnt):\n        if label not in pred_segms:\n            if label == VOID:\n                continue\n            raise KeyError('Segment with ID {} is presented in PNG and not presented in JSON.'.format(label))\n        pred_segms[label]['area'] = label_cnt\n        pred_labels_set.remove(label)\n        if pred_segms[label]['category_id'] not in categories:\n            raise KeyError('Segment with ID {} has unknown category_id {}.'.format(label, pred_segms[label]['category_id']))\n    if len(pred_labels_set) != 0:\n        raise KeyError('The following segment IDs {} are presented in JSON and not presented in PNG.'.format(list(pred_labels_set)))\n\n    # confusion matrix calculation\n    pan_gt_pred = pan_gt.astype(np.uint64) * OFFSET + pan_pred.astype(np.uint64)\n    gt_pred_map = {}\n    labels, labels_cnt = np.unique(pan_gt_pred, return_counts=True)\n    for label, intersection in zip(labels, labels_cnt):\n        gt_id = label // OFFSET\n        pred_id = label % OFFSET\n        gt_pred_map[(gt_id, pred_id)] = intersection\n\n    # count all matched pairs\n    gt_matched = set()\n    pred_matched = set()\n    for label_tuple, intersection in gt_pred_map.items():\n        gt_label, pred_label = label_tuple\n        if gt_label not in gt_segms:\n            continue\n        if pred_label not in pred_segms:\n            continue\n        if gt_segms[gt_label]['iscrowd'] == 1:\n            continue\n 
       if gt_segms[gt_label]['category_id'] != pred_segms[pred_label]['category_id']:\n            continue\n\n        union = pred_segms[pred_label]['area'] + gt_segms[gt_label]['area'] - intersection - gt_pred_map.get((VOID, pred_label), 0)\n        iou = intersection / union\n        if iou > 0.5:\n            pq_stat[gt_segms[gt_label]['category_id']].tp += 1\n            pq_stat[gt_segms[gt_label]['category_id']].iou += iou\n            gt_matched.add(gt_label)\n            pred_matched.add(pred_label)\n\n    # count false negatives\n    crowd_labels_dict = {}\n    for gt_label, gt_info in gt_segms.items():\n        if gt_label in gt_matched:\n            continue\n        # crowd segments are ignored\n        if gt_info['iscrowd'] == 1:\n            crowd_labels_dict[gt_info['category_id']] = gt_label\n            continue\n        pq_stat[gt_info['category_id']].fn += 1\n\n    # count false positives\n    for pred_label, pred_info in pred_segms.items():\n        if pred_label in pred_matched:\n            continue\n        # intersection of the segment with VOID\n        intersection = gt_pred_map.get((VOID, pred_label), 0)\n        # plus intersection with corresponding CROWD region if it exists\n        if pred_info['category_id'] in crowd_labels_dict:\n            intersection += gt_pred_map.get((crowd_labels_dict[pred_info['category_id']], pred_label), 0)\n        # predicted segment is ignored if more than half of the segment corresponds to VOID and CROWD regions\n        if intersection / pred_info['area'] > 0.5:\n            continue\n        pq_stat[pred_info['category_id']].fp += 1\n\n    return pq_stat\n\n\ndef main():\n    parser = default_argument_parser()\n    args = parser.parse_args()\n\n    _root = os.getenv(\"DETECTRON2_DATASETS\", \"datasets\")\n    json_file = args.json_file\n\n    with open(json_file) as f:\n        predictions = json.load(f)\n\n    imgToAnns = defaultdict(list)\n    for pred in predictions:\n        image_id = 
os.path.basename(pred[\"file_name\"]).split(\".\")[0]\n        imgToAnns[image_id].append(\n            {\"category_id\": pred[\"category_id\"], \"segmentation\": pred[\"segmentation\"]}\n        )\n\n    image_ids = list(imgToAnns.keys())\n\n    meta = MetadataCatalog.get(args.dataset_name)\n    class_names = meta.stuff_classes\n    num_classes = len(meta.stuff_classes)\n    ignore_label = meta.ignore_label\n    conf_matrix = np.zeros((num_classes + 1, num_classes + 1), dtype=np.int64)\n\n    categories = {}\n    for i in range(num_classes):\n        categories[i] = {\"id\": i, \"name\": class_names[i], \"isthing\": 0}\n\n    pq_stat = PQStat()\n\n    for image_id in tqdm(image_ids):\n        if args.dataset_name == \"ade20k_sem_seg_val\":\n            gt_dir = os.path.join(_root, \"ADEChallengeData2016\", \"annotations_detectron2\", \"validation\")\n            segm_gt = read_image(os.path.join(gt_dir, image_id + \".png\")).copy().astype(np.int64)\n        elif args.dataset_name == \"coco_2017_test_stuff_10k_sem_seg\":\n            gt_dir = os.path.join(_root, \"coco\", \"coco_stuff_10k\", \"annotations_detectron2\", \"test\")\n            segm_gt = read_image(os.path.join(gt_dir, image_id + \".png\")).copy().astype(np.int64)\n        elif args.dataset_name == \"ade20k_full_sem_seg_val\":\n            gt_dir = os.path.join(_root, \"ADE20K_2021_17_01\", \"annotations_detectron2\", \"validation\")\n            segm_gt = read_image(os.path.join(gt_dir, image_id + \".tif\")).copy().astype(np.int64)\n        else:\n            raise ValueError(f\"Unsupported dataset {args.dataset_name}\")\n\n        # get predictions\n        segm_dt = np.zeros_like(segm_gt)\n        anns = imgToAnns[image_id]\n        for ann in anns:\n            # map back category_id\n            if hasattr(meta, \"stuff_dataset_id_to_contiguous_id\"):\n                if ann[\"category_id\"] not in meta.stuff_dataset_id_to_contiguous_id:\n                    # skip predictions whose category is not part of this dataset\n                    continue\n                category_id = meta.stuff_dataset_id_to_contiguous_id[ann[\"category_id\"]]\n            else:\n                category_id = ann[\"category_id\"]\n            mask = maskUtils.decode(ann[\"segmentation\"])\n            segm_dt[mask > 0] = category_id\n\n        # miou\n        gt = segm_gt.copy()\n        pred = segm_dt.copy()\n        gt[gt == ignore_label] = num_classes\n        conf_matrix += np.bincount(\n            (num_classes + 1) * pred.reshape(-1) + gt.reshape(-1),\n            minlength=conf_matrix.size,\n        ).reshape(conf_matrix.shape)\n\n        # pq\n        pq_stat_single = pq_compute_single_image(segm_gt, segm_dt, categories, meta.ignore_label)\n        pq_stat += pq_stat_single\n\n    metrics = [(\"All\", None), (\"Stuff\", False)]\n    results = {}\n    for name, isthing in metrics:\n        results[name], per_class_results = pq_stat.pq_average(categories, isthing=isthing)\n        if name == 'All':\n            results['per_class'] = per_class_results\n    print(\"{:10s}| {:>5s}  {:>5s}  {:>5s} {:>5s}\".format(\"\", \"PQ\", \"SQ\", \"RQ\", \"N\"))\n    print(\"-\" * (10 + 7 * 4))\n\n    for name, _isthing in metrics:\n        print(\"{:10s}| {:5.1f}  {:5.1f}  {:5.1f} {:5d}\".format(\n            name,\n            100 * results[name]['pq'],\n            100 * results[name]['sq'],\n            100 * results[name]['rq'],\n            results[name]['n'])\n        )\n\n    # calculate miou\n    acc = np.full(num_classes, np.nan, dtype=np.float64)\n    iou = np.full(num_classes, np.nan, dtype=np.float64)\n    tp = conf_matrix.diagonal()[:-1].astype(np.float64)\n    pos_gt = np.sum(conf_matrix[:-1, :-1], axis=0).astype(np.float64)\n    pos_pred = np.sum(conf_matrix[:-1, :-1], axis=1).astype(np.float64)\n    acc_valid = pos_gt > 0\n    acc[acc_valid] = tp[acc_valid] / pos_gt[acc_valid]\n    iou_valid = (pos_gt + pos_pred) > 0\n    union = pos_gt + pos_pred - tp\n    iou[iou_valid] = tp[iou_valid] / union[iou_valid]\n    miou = np.sum(iou[iou_valid]) / 
np.sum(iou_valid)\n\n    print(\"\")\n    print(f\"mIoU: {miou}\")\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "train_net.py",
    "content": "\"\"\"\nMaskFormer Training Script.\n\nThis script is a simplified version of the training script in detectron2/tools.\n\"\"\"\ntry:\n    # ignore ShapelyDeprecationWarning from fvcore\n    from shapely.errors import ShapelyDeprecationWarning\n    import warnings\n    warnings.filterwarnings('ignore', category=ShapelyDeprecationWarning)\nexcept:\n    pass\n\nimport copy\nimport itertools\nimport logging\nimport os\n\nfrom collections import OrderedDict\nfrom typing import Any, Dict, List, Set\n\nimport torch\n\nimport detectron2.utils.comm as comm\nfrom detectron2.checkpoint import DetectionCheckpointer\nfrom detectron2.config import get_cfg\nfrom detectron2.data import MetadataCatalog, build_detection_train_loader\nfrom detectron2.engine import (\n    DefaultTrainer,\n    default_argument_parser,\n    default_setup,\n    launch,\n)\nfrom detectron2.evaluation import (\n    CityscapesInstanceEvaluator,\n    CityscapesSemSegEvaluator,\n    COCOEvaluator,\n    COCOPanopticEvaluator,\n    DatasetEvaluators,\n    LVISEvaluator,\n    SemSegEvaluator,\n    verify_results,\n)\nfrom detectron2.projects.deeplab import add_deeplab_config, build_lr_scheduler\nfrom detectron2.solver.build import maybe_add_gradient_clipping\nfrom detectron2.utils.logger import setup_logger\n\n# MaskFormer\nfrom mask2former import (\n    COCOInstanceNewBaselineDatasetMapper,\n    COCOPanopticNewBaselineDatasetMapper,\n    InstanceSegEvaluator,\n    MaskFormerInstanceDatasetMapper,\n    MaskFormerPanopticDatasetMapper,\n    MaskFormerSemanticDatasetMapper,\n    SemanticSegmentorWithTTA,\n    add_maskformer2_config,\n)\n\n\nclass Trainer(DefaultTrainer):\n    \"\"\"\n    Extension of the Trainer class adapted to MaskFormer.\n    \"\"\"\n\n    @classmethod\n    def build_evaluator(cls, cfg, dataset_name, output_folder=None):\n        \"\"\"\n        Create evaluator(s) for a given dataset.\n        This uses the special metadata \"evaluator_type\" associated with each\n        
builtin dataset. For your own dataset, you can simply create an\n        evaluator manually in your script and do not have to worry about the\n        hacky if-else logic here.\n        \"\"\"\n        if output_folder is None:\n            output_folder = os.path.join(cfg.OUTPUT_DIR, \"inference\")\n        evaluator_list = []\n        evaluator_type = MetadataCatalog.get(dataset_name).evaluator_type\n        # semantic segmentation\n        if evaluator_type in [\"sem_seg\", \"ade20k_panoptic_seg\"]:\n            evaluator_list.append(\n                SemSegEvaluator(\n                    dataset_name,\n                    distributed=True,\n                    output_dir=output_folder,\n                )\n            )\n        # instance segmentation\n        if evaluator_type == \"coco\":\n            evaluator_list.append(COCOEvaluator(dataset_name, output_dir=output_folder))\n        # panoptic segmentation\n        if evaluator_type in [\n            \"coco_panoptic_seg\",\n            \"ade20k_panoptic_seg\",\n            \"cityscapes_panoptic_seg\",\n            \"mapillary_vistas_panoptic_seg\",\n        ]:\n            if cfg.MODEL.MASK_FORMER.TEST.PANOPTIC_ON:\n                evaluator_list.append(COCOPanopticEvaluator(dataset_name, output_folder))\n        # COCO\n        if evaluator_type == \"coco_panoptic_seg\" and cfg.MODEL.MASK_FORMER.TEST.INSTANCE_ON:\n            evaluator_list.append(COCOEvaluator(dataset_name, output_dir=output_folder))\n        if evaluator_type == \"coco_panoptic_seg\" and cfg.MODEL.MASK_FORMER.TEST.SEMANTIC_ON:\n            evaluator_list.append(SemSegEvaluator(dataset_name, distributed=True, output_dir=output_folder))\n        # Mapillary Vistas\n        if evaluator_type == \"mapillary_vistas_panoptic_seg\" and cfg.MODEL.MASK_FORMER.TEST.INSTANCE_ON:\n            evaluator_list.append(InstanceSegEvaluator(dataset_name, output_dir=output_folder))\n        if evaluator_type == \"mapillary_vistas_panoptic_seg\" and 
cfg.MODEL.MASK_FORMER.TEST.SEMANTIC_ON:\n            evaluator_list.append(SemSegEvaluator(dataset_name, distributed=True, output_dir=output_folder))\n        # Cityscapes\n        if evaluator_type == \"cityscapes_instance\":\n            assert (\n                torch.cuda.device_count() > comm.get_rank()\n            ), \"CityscapesEvaluator currently does not work with multiple machines.\"\n            return CityscapesInstanceEvaluator(dataset_name)\n        if evaluator_type == \"cityscapes_sem_seg\":\n            assert (\n                torch.cuda.device_count() > comm.get_rank()\n            ), \"CityscapesEvaluator currently does not work with multiple machines.\"\n            return CityscapesSemSegEvaluator(dataset_name)\n        if evaluator_type == \"cityscapes_panoptic_seg\":\n            if cfg.MODEL.MASK_FORMER.TEST.SEMANTIC_ON:\n                assert (\n                    torch.cuda.device_count() > comm.get_rank()\n                ), \"CityscapesEvaluator currently does not work with multiple machines.\"\n                evaluator_list.append(CityscapesSemSegEvaluator(dataset_name))\n            if cfg.MODEL.MASK_FORMER.TEST.INSTANCE_ON:\n                assert (\n                    torch.cuda.device_count() > comm.get_rank()\n                ), \"CityscapesEvaluator currently does not work with multiple machines.\"\n                evaluator_list.append(CityscapesInstanceEvaluator(dataset_name))\n        # ADE20K\n        if evaluator_type == \"ade20k_panoptic_seg\" and cfg.MODEL.MASK_FORMER.TEST.INSTANCE_ON:\n            evaluator_list.append(InstanceSegEvaluator(dataset_name, output_dir=output_folder))\n        # LVIS\n        if evaluator_type == \"lvis\":\n            return LVISEvaluator(dataset_name, output_dir=output_folder)\n        if len(evaluator_list) == 0:\n            raise NotImplementedError(\n                \"no Evaluator for the dataset {} with the type {}\".format(\n                    dataset_name, evaluator_type\n              
  )\n            )\n        elif len(evaluator_list) == 1:\n            return evaluator_list[0]\n        return DatasetEvaluators(evaluator_list)\n\n    @classmethod\n    def build_train_loader(cls, cfg):\n        # Semantic segmentation dataset mapper\n        if cfg.INPUT.DATASET_MAPPER_NAME == \"mask_former_semantic\":\n            mapper = MaskFormerSemanticDatasetMapper(cfg, True)\n            return build_detection_train_loader(cfg, mapper=mapper)\n        # Panoptic segmentation dataset mapper\n        elif cfg.INPUT.DATASET_MAPPER_NAME == \"mask_former_panoptic\":\n            mapper = MaskFormerPanopticDatasetMapper(cfg, True)\n            return build_detection_train_loader(cfg, mapper=mapper)\n        # Instance segmentation dataset mapper\n        elif cfg.INPUT.DATASET_MAPPER_NAME == \"mask_former_instance\":\n            mapper = MaskFormerInstanceDatasetMapper(cfg, True)\n            return build_detection_train_loader(cfg, mapper=mapper)\n        # coco instance segmentation lsj new baseline\n        elif cfg.INPUT.DATASET_MAPPER_NAME == \"coco_instance_lsj\":\n            mapper = COCOInstanceNewBaselineDatasetMapper(cfg, True)\n            return build_detection_train_loader(cfg, mapper=mapper)\n        # coco panoptic segmentation lsj new baseline\n        elif cfg.INPUT.DATASET_MAPPER_NAME == \"coco_panoptic_lsj\":\n            mapper = COCOPanopticNewBaselineDatasetMapper(cfg, True)\n            return build_detection_train_loader(cfg, mapper=mapper)\n        else:\n            mapper = None\n            return build_detection_train_loader(cfg, mapper=mapper)\n\n    @classmethod\n    def build_lr_scheduler(cls, cfg, optimizer):\n        \"\"\"\n        It now calls :func:`detectron2.solver.build_lr_scheduler`.\n        Overwrite it if you'd like a different scheduler.\n        \"\"\"\n        return build_lr_scheduler(cfg, optimizer)\n\n    @classmethod\n    def build_optimizer(cls, cfg, model):\n        weight_decay_norm = 
cfg.SOLVER.WEIGHT_DECAY_NORM\n        weight_decay_embed = cfg.SOLVER.WEIGHT_DECAY_EMBED\n\n        defaults = {}\n        defaults[\"lr\"] = cfg.SOLVER.BASE_LR\n        defaults[\"weight_decay\"] = cfg.SOLVER.WEIGHT_DECAY\n\n        norm_module_types = (\n            torch.nn.BatchNorm1d,\n            torch.nn.BatchNorm2d,\n            torch.nn.BatchNorm3d,\n            torch.nn.SyncBatchNorm,\n            # NaiveSyncBatchNorm inherits from BatchNorm2d\n            torch.nn.GroupNorm,\n            torch.nn.InstanceNorm1d,\n            torch.nn.InstanceNorm2d,\n            torch.nn.InstanceNorm3d,\n            torch.nn.LayerNorm,\n            torch.nn.LocalResponseNorm,\n        )\n\n        params: List[Dict[str, Any]] = []\n        memo: Set[torch.nn.parameter.Parameter] = set()\n        for module_name, module in model.named_modules():\n            for module_param_name, value in module.named_parameters(recurse=False):\n                if not value.requires_grad:\n                    continue\n                # Avoid duplicating parameters\n                if value in memo:\n                    continue\n                memo.add(value)\n\n                hyperparams = copy.copy(defaults)\n                if \"backbone\" in module_name:\n                    hyperparams[\"lr\"] = hyperparams[\"lr\"] * cfg.SOLVER.BACKBONE_MULTIPLIER\n                if (\n                    \"relative_position_bias_table\" in module_param_name\n                    or \"absolute_pos_embed\" in module_param_name\n                ):\n                    print(module_param_name)\n                    hyperparams[\"weight_decay\"] = 0.0\n                if isinstance(module, norm_module_types):\n                    hyperparams[\"weight_decay\"] = weight_decay_norm\n                if isinstance(module, torch.nn.Embedding):\n                    hyperparams[\"weight_decay\"] = weight_decay_embed\n                params.append({\"params\": [value], **hyperparams})\n\n        def 
maybe_add_full_model_gradient_clipping(optim):\n            # detectron2 doesn't have full model gradient clipping now\n            clip_norm_val = cfg.SOLVER.CLIP_GRADIENTS.CLIP_VALUE\n            enable = (\n                cfg.SOLVER.CLIP_GRADIENTS.ENABLED\n                and cfg.SOLVER.CLIP_GRADIENTS.CLIP_TYPE == \"full_model\"\n                and clip_norm_val > 0.0\n            )\n\n            class FullModelGradientClippingOptimizer(optim):\n                def step(self, closure=None):\n                    all_params = itertools.chain(*[x[\"params\"] for x in self.param_groups])\n                    torch.nn.utils.clip_grad_norm_(all_params, clip_norm_val)\n                    super().step(closure=closure)\n\n            return FullModelGradientClippingOptimizer if enable else optim\n\n        optimizer_type = cfg.SOLVER.OPTIMIZER\n        if optimizer_type == \"SGD\":\n            optimizer = maybe_add_full_model_gradient_clipping(torch.optim.SGD)(\n                params, cfg.SOLVER.BASE_LR, momentum=cfg.SOLVER.MOMENTUM\n            )\n        elif optimizer_type == \"ADAMW\":\n            optimizer = maybe_add_full_model_gradient_clipping(torch.optim.AdamW)(\n                params, cfg.SOLVER.BASE_LR\n            )\n        else:\n            raise NotImplementedError(f\"no optimizer type {optimizer_type}\")\n        if not cfg.SOLVER.CLIP_GRADIENTS.CLIP_TYPE == \"full_model\":\n            optimizer = maybe_add_gradient_clipping(cfg, optimizer)\n        return optimizer\n\n    @classmethod\n    def test_with_TTA(cls, cfg, model):\n        logger = logging.getLogger(\"detectron2.trainer\")\n        # In the end of training, run an evaluation with TTA.\n        logger.info(\"Running inference with test-time augmentation ...\")\n        model = SemanticSegmentorWithTTA(cfg, model)\n        evaluators = [\n            cls.build_evaluator(\n                cfg, name, output_folder=os.path.join(cfg.OUTPUT_DIR, \"inference_TTA\")\n            )\n           
 for name in cfg.DATASETS.TEST\n        ]\n        res = cls.test(cfg, model, evaluators)\n        res = OrderedDict({k + \"_TTA\": v for k, v in res.items()})\n        return res\n\n\ndef setup(args):\n    \"\"\"\n    Create configs and perform basic setups.\n    \"\"\"\n    cfg = get_cfg()\n    # for poly lr schedule\n    add_deeplab_config(cfg)\n    add_maskformer2_config(cfg)\n    cfg.merge_from_file(args.config_file)\n    cfg.merge_from_list(args.opts)\n    cfg.freeze()\n    default_setup(cfg, args)\n    # Setup logger for \"mask_former\" module\n    setup_logger(output=cfg.OUTPUT_DIR, distributed_rank=comm.get_rank(), name=\"mask2former\")\n    return cfg\n\n\ndef main(args):\n    cfg = setup(args)\n\n    if args.eval_only:\n        model = Trainer.build_model(cfg)\n        DetectionCheckpointer(model, save_dir=cfg.OUTPUT_DIR).resume_or_load(\n            cfg.MODEL.WEIGHTS, resume=args.resume\n        )\n        res = Trainer.test(cfg, model)\n        if cfg.TEST.AUG.ENABLED:\n            res.update(Trainer.test_with_TTA(cfg, model))\n        if comm.is_main_process():\n            verify_results(cfg, res)\n        return res\n\n    trainer = Trainer(cfg)\n    trainer.resume_or_load(resume=args.resume)\n    return trainer.train()\n\n\nif __name__ == \"__main__\":\n    args = default_argument_parser().parse_args()\n    print(\"Command Line Args:\", args)\n    launch(\n        main,\n        args.num_gpus,\n        num_machines=args.num_machines,\n        machine_rank=args.machine_rank,\n        dist_url=args.dist_url,\n        args=(args,),\n    )\n"
  },
  {
    "path": "train_net_video.py",
    "content": "\"\"\"\nThis script is a simplified version of the training script in detectron2/tools.\n\"\"\"\ntry:\n    # ignore ShapelyDeprecationWarning from fvcore\n    from shapely.errors import ShapelyDeprecationWarning\n    import warnings\n    warnings.filterwarnings('ignore', category=ShapelyDeprecationWarning)\nexcept:\n    pass\n\nimport copy\nimport itertools\nimport logging\nimport os\n\nfrom collections import OrderedDict\nfrom typing import Any, Dict, List, Set\n\nimport torch\n\nimport detectron2.utils.comm as comm\nfrom detectron2.checkpoint import DetectionCheckpointer\nfrom detectron2.config import get_cfg\nfrom detectron2.data import MetadataCatalog, build_detection_train_loader\nfrom detectron2.engine import (\n    DefaultTrainer,\n    default_argument_parser,\n    default_setup,\n    launch,\n)\nfrom detectron2.evaluation import (\n    DatasetEvaluator,\n    inference_on_dataset,\n    print_csv_format,\n    verify_results,\n)\nfrom detectron2.projects.deeplab import add_deeplab_config, build_lr_scheduler\nfrom detectron2.solver.build import maybe_add_gradient_clipping\nfrom detectron2.utils.logger import setup_logger\n\n# MaskFormer\nfrom mask2former import add_maskformer2_config\nfrom mask2former_video import (\n    YTVISDatasetMapper,\n    CocoClipDatasetMapper,\n    build_combined_loader,\n    YTVISEvaluator,\n    add_maskformer2_video_config,\n    build_detection_train_loader,\n    build_detection_test_loader,\n    get_detection_dataset_dicts,\n)\n\nfrom torch.utils.data import Dataset, ConcatDataset\n\nclass Trainer(DefaultTrainer):\n    \"\"\"\n    Extension of the Trainer class adapted to MaskFormer.\n    \"\"\"\n\n    @classmethod\n    def build_evaluator(cls, cfg, dataset_name, output_folder=None):\n        \"\"\"\n        Create evaluator(s) for a given dataset.\n        This uses the special metadata \"evaluator_type\" associated with each builtin dataset.\n        For your own dataset, you can simply create an evaluator manually 
in your\n        script and do not have to worry about the hacky if-else logic here.\n        \"\"\"\n        if output_folder is None:\n            output_folder = os.path.join(cfg.OUTPUT_DIR, \"inference\")\n            os.makedirs(output_folder, exist_ok=True)\n\n        return YTVISEvaluator(dataset_name, cfg, True, output_folder)\n\n    @classmethod\n    def build_train_loader(cls, cfg):\n        mappers = []\n        for d_i, dataset_name in enumerate(cfg.DATASETS.TRAIN):\n            if dataset_name.startswith('coco'):\n                mappers.append(\n                    CocoClipDatasetMapper(\n                        cfg, is_train=True, is_tgt=(d_i==len(cfg.DATASETS.TRAIN)-1), src_dataset_name=dataset_name\n                    )\n                )\n            elif dataset_name.startswith('ytvis') or dataset_name.startswith('ovis'):\n                mappers.append(\n                    YTVISDatasetMapper(cfg, is_train=True, is_tgt=(d_i==len(cfg.DATASETS.TRAIN)-1), src_dataset_name=dataset_name)\n                )\n\n        loaders = [\n                build_detection_train_loader(cfg, mapper=mapper, dataset_name=dataset_name)\n                for mapper, dataset_name in zip(mappers, cfg.DATASETS.TRAIN)\n        ]\n        # Hard-coded per-loader sampling ratios; must match len(cfg.DATASETS.TRAIN).\n        DATASET_RATIO = [1.0, 0.75]\n        combined_data_loader = build_combined_loader(cfg, loaders, DATASET_RATIO)\n        return combined_data_loader\n\n    @classmethod\n    def build_test_loader(cls, cfg, dataset_name):\n        mapper = YTVISDatasetMapper(cfg, is_train=False)\n        return build_detection_test_loader(cfg, dataset_name, mapper=mapper)\n\n    @classmethod\n    def build_lr_scheduler(cls, cfg, optimizer):\n        \"\"\"\n        It now calls :func:`detectron2.solver.build_lr_scheduler`.\n        Overwrite it if you'd like a different scheduler.\n        \"\"\"\n        return build_lr_scheduler(cfg, optimizer)\n\n    @classmethod\n    def build_optimizer(cls, cfg, 
model):\n        weight_decay_norm = cfg.SOLVER.WEIGHT_DECAY_NORM\n        weight_decay_embed = cfg.SOLVER.WEIGHT_DECAY_EMBED\n\n        defaults = {}\n        defaults[\"lr\"] = cfg.SOLVER.BASE_LR\n        defaults[\"weight_decay\"] = cfg.SOLVER.WEIGHT_DECAY\n\n        norm_module_types = (\n            torch.nn.BatchNorm1d,\n            torch.nn.BatchNorm2d,\n            torch.nn.BatchNorm3d,\n            torch.nn.SyncBatchNorm,\n            # NaiveSyncBatchNorm inherits from BatchNorm2d\n            torch.nn.GroupNorm,\n            torch.nn.InstanceNorm1d,\n            torch.nn.InstanceNorm2d,\n            torch.nn.InstanceNorm3d,\n            torch.nn.LayerNorm,\n            torch.nn.LocalResponseNorm,\n        )\n\n        params: List[Dict[str, Any]] = []\n        memo: Set[torch.nn.parameter.Parameter] = set()\n        for module_name, module in model.named_modules():\n            for module_param_name, value in module.named_parameters(recurse=False):\n                if not value.requires_grad:\n                    continue\n                # Avoid duplicating parameters\n                if value in memo:\n                    continue\n                memo.add(value)\n\n                hyperparams = copy.copy(defaults)\n                if \"backbone\" in module_name:\n                    hyperparams[\"lr\"] = hyperparams[\"lr\"] * cfg.SOLVER.BACKBONE_MULTIPLIER\n                if (\n                    \"relative_position_bias_table\" in module_param_name\n                    or \"absolute_pos_embed\" in module_param_name\n                ):\n                    print(module_param_name)\n                    hyperparams[\"weight_decay\"] = 0.0\n                if isinstance(module, norm_module_types):\n                    hyperparams[\"weight_decay\"] = weight_decay_norm\n                if isinstance(module, torch.nn.Embedding):\n                    hyperparams[\"weight_decay\"] = weight_decay_embed\n                params.append({\"params\": [value], 
**hyperparams})\n\n        def maybe_add_full_model_gradient_clipping(optim):\n            # detectron2 doesn't have full model gradient clipping now\n            clip_norm_val = cfg.SOLVER.CLIP_GRADIENTS.CLIP_VALUE\n            enable = (\n                cfg.SOLVER.CLIP_GRADIENTS.ENABLED\n                and cfg.SOLVER.CLIP_GRADIENTS.CLIP_TYPE == \"full_model\"\n                and clip_norm_val > 0.0\n            )\n\n            class FullModelGradientClippingOptimizer(optim):\n                def step(self, closure=None):\n                    all_params = itertools.chain(*[x[\"params\"] for x in self.param_groups])\n                    torch.nn.utils.clip_grad_norm_(all_params, clip_norm_val)\n                    super().step(closure=closure)\n\n            return FullModelGradientClippingOptimizer if enable else optim\n\n        optimizer_type = cfg.SOLVER.OPTIMIZER\n        if optimizer_type == \"SGD\":\n            optimizer = maybe_add_full_model_gradient_clipping(torch.optim.SGD)(\n                params, cfg.SOLVER.BASE_LR, momentum=cfg.SOLVER.MOMENTUM\n            )\n        elif optimizer_type == \"ADAMW\":\n            optimizer = maybe_add_full_model_gradient_clipping(torch.optim.AdamW)(\n                params, cfg.SOLVER.BASE_LR\n            )\n        else:\n            raise NotImplementedError(f\"no optimizer type {optimizer_type}\")\n        if not cfg.SOLVER.CLIP_GRADIENTS.CLIP_TYPE == \"full_model\":\n            optimizer = maybe_add_gradient_clipping(cfg, optimizer)\n        return optimizer\n\n    @classmethod\n    def test(cls, cfg, model, evaluators=None):\n        \"\"\"\n        Evaluate the given model. The given model is expected to already contain\n        weights to evaluate.\n        Args:\n            cfg (CfgNode):\n            model (nn.Module):\n            evaluators (list[DatasetEvaluator] or None): if None, will call\n                :meth:`build_evaluator`. 
Otherwise, must have the same length as\n                ``cfg.DATASETS.TEST``.\n        Returns:\n            dict: a dict of result metrics\n        \"\"\"\n        from torch.cuda.amp import autocast\n        logger = logging.getLogger(__name__)\n        if isinstance(evaluators, DatasetEvaluator):\n            evaluators = [evaluators]\n        if evaluators is not None:\n            assert len(cfg.DATASETS.TEST) == len(evaluators), \"{} != {}\".format(\n                len(cfg.DATASETS.TEST), len(evaluators)\n            )\n\n        results = OrderedDict()\n        for idx, dataset_name in enumerate(cfg.DATASETS.TEST):\n            data_loader = cls.build_test_loader(cfg, dataset_name)\n            # When evaluators are passed in as arguments,\n            # implicitly assume that evaluators can be created before data_loader.\n            if evaluators is not None:\n                evaluator = evaluators[idx]\n            else:\n                try:\n                    evaluator = cls.build_evaluator(cfg, dataset_name)\n                except NotImplementedError:\n                    logger.warning(\n                        \"No evaluator found. Use `DefaultTrainer.test(evaluators=)`, \"\n                        \"or implement its `build_evaluator` method.\"\n                    )\n                    results[dataset_name] = {}\n                    continue\n            with autocast():\n                results_i = inference_on_dataset(model, data_loader, evaluator)\n            results[dataset_name] = results_i\n            if comm.is_main_process():\n                assert isinstance(\n                    results_i, dict\n                ), \"Evaluator must return a dict on the main process. 
Got {} instead.\".format(\n                    results_i\n                )\n                logger.info(\"Evaluation results for {} in csv format:\".format(dataset_name))\n                print_csv_format(results_i)\n\n        if len(results) == 1:\n            results = list(results.values())[0]\n        return results\n\n\ndef setup(args):\n    \"\"\"\n    Create configs and perform basic setups.\n    \"\"\"\n    cfg = get_cfg()\n    # for poly lr schedule\n    add_deeplab_config(cfg)\n    add_maskformer2_config(cfg)\n    add_maskformer2_video_config(cfg)\n    cfg.merge_from_file(args.config_file)\n    cfg.merge_from_list(args.opts)\n    cfg.freeze()\n    default_setup(cfg, args)\n    # Setup logger for \"mask_former\" module\n    setup_logger(name=\"mask2former\")\n    setup_logger(output=cfg.OUTPUT_DIR, distributed_rank=comm.get_rank(), name=\"mask2former_video\")\n    return cfg\n\n\ndef main(args):\n    cfg = setup(args)\n\n    if args.eval_only:\n        model = Trainer.build_model(cfg)\n        DetectionCheckpointer(model, save_dir=cfg.OUTPUT_DIR).resume_or_load(\n            cfg.MODEL.WEIGHTS, resume=args.resume\n        )\n        res = Trainer.test(cfg, model)\n        if cfg.TEST.AUG.ENABLED:\n            raise NotImplementedError\n        if comm.is_main_process():\n            verify_results(cfg, res)\n        return res\n\n    trainer = Trainer(cfg)\n    trainer.resume_or_load(resume=args.resume)\n    return trainer.train()\n\n\nif __name__ == \"__main__\":\n    args = default_argument_parser().parse_args()\n    print(\"Command Line Args:\", args)\n    launch(\n        main,\n        args.num_gpus,\n        num_machines=args.num_machines,\n        machine_rank=args.machine_rank,\n        dist_url=args.dist_url,\n        args=(args,),\n    )\n"
  },
  {
    "path": "util/__init__.py",
    "content": "# ------------------------------------------------------------------------\n# SeqFormer\n# ------------------------------------------------------------------------\n# Modified from Deformable DETR (https://github.com/fundamentalvision/Deformable-DETR)\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# ------------------------------------------------------------------------\n"
  },
  {
    "path": "util/box_ops.py",
    "content": "# ------------------------------------------------------------------------\n# SeqFormer\n# ------------------------------------------------------------------------\n# Modified from Deformable DETR (https://github.com/fundamentalvision/Deformable-DETR)\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# ------------------------------------------------------------------------\n\n\"\"\"\nUtilities for bounding box manipulation and GIoU.\n\"\"\"\nimport torch\nfrom torchvision.ops.boxes import box_area\n\n\ndef box_cxcywh_to_xyxy(x):\n    # print('box:\\n', x)\n\n    x_c, y_c, w, h = x.unbind(-1)\n    b = [(x_c - 0.5 * w), (y_c - 0.5 * h),\n         (x_c + 0.5 * w), (y_c + 0.5 * h)]\n    return torch.stack(b, dim=-1)\n\n\ndef box_xyxy_to_cxcywh(x):\n    x0, y0, x1, y1 = x.unbind(-1)\n    b = [(x0 + x1) / 2, (y0 + y1) / 2,\n         (x1 - x0), (y1 - y0)]\n    return torch.stack(b, dim=-1)\n\n\n# modified from torchvision to also return the union\ndef box_iou(boxes1, boxes2):\n    area1 = box_area(boxes1)\n    area2 = box_area(boxes2)\n\n    lt = torch.max(boxes1[:, None, :2], boxes2[:, :2])  # [N,M,2]\n    rb = torch.min(boxes1[:, None, 2:], boxes2[:, 2:])  # [N,M,2]\n\n    wh = (rb - lt).clamp(min=0)  # [N,M,2]\n    inter = wh[:, :, 0] * wh[:, :, 1]  # [N,M]\n\n    union = area1[:, None] + area2 - inter \n\n    iou = inter / (union + 1e-7)\n    return iou, union\n\ndef multi_box_iou(boxes1, boxes2):\n    area1 = box_area(boxes1.flatten(0,1)).reshape(boxes1.shape[0], boxes1.shape[1])\n    area2 = box_area(boxes2.flatten(0,1)).reshape(boxes2.shape[0], boxes2.shape[1])\n\n    lt = torch.max(boxes1[:, :, None, :2], boxes2[:, None, :, :2])  # [nf,N,M,2]\n    rb = torch.min(boxes1[:, :, None, 2:], boxes2[:, None, :, 2:])  # [nf,N,M,2]\n\n    wh = (rb - lt).clamp(min=0)  # [nf,N,M,2]\n    inter = wh[:, :, :, 0] * wh[:, :, :, 1]  # [nf,N,M]\n\n    union = area1[:, :, None] + area2[:, None, :] - inter\n\n    iou = inter / (union + 1e-7)\n    return iou, 
union\n\ndef generalized_box_iou(boxes1, boxes2):\n    \"\"\"\n    Generalized IoU from https://giou.stanford.edu/\n\n    The boxes should be in [x0, y0, x1, y1] format\n\n    Returns a [N, M] pairwise matrix, where N = len(boxes1)\n    and M = len(boxes2)\n    \"\"\"\n    # degenerate boxes give inf / nan results\n    # so do an early check\n\n    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()\n    assert (boxes2[:, 2:] >= boxes2[:, :2]).all()\n    iou, union = box_iou(boxes1, boxes2)\n\n    lt = torch.min(boxes1[:, None, :2], boxes2[:, :2])\n    rb = torch.max(boxes1[:, None, 2:], boxes2[:, 2:])\n\n    wh = (rb - lt).clamp(min=0)  # [N,M,2]\n    area = wh[:, :, 0] * wh[:, :, 1]\n\n    # return iou - (area - union) / area\n    return iou - (area - union) / (area + 1e-7)\n\ndef generalized_multi_box_iou(boxes1, boxes2):\n    \"\"\"\n    Generalized IoU from https://giou.stanford.edu/\n\n    The boxes should be in [x0, y0, x1, y1] format\n    boxes1.shape = [nf, N, 4]\n    boxes2.shape = [nf, M, 4]\n    Returns a [nf, N, M] pairwise matrix, where N = boxes1.shape[1]\n    and M = boxes2.shape[1]\n    \"\"\"\n    # degenerate boxes give inf / nan results\n    # so do an early check\n\n    assert (boxes1[:, :, 2:] >= boxes1[:, :, :2]).all()\n    assert (boxes2[:, :, 2:] >= boxes2[:, :, :2]).all()\n    iou, union = multi_box_iou(boxes1, boxes2)\n\n    lt = torch.min(boxes1[:, :, None, :2], boxes2[:, None, :, :2])\n    rb = torch.max(boxes1[:, :, None, 2:], boxes2[:, None, :, 2:])\n\n    wh = (rb - lt).clamp(min=0)  # [nf,N,M,2]\n    area = wh[:, :, :, 0] * wh[:, :, :, 1]\n\n    return iou - (area - union) / (area + 1e-7)\n\n\ndef masks_to_boxes(masks):\n    \"\"\"Compute the bounding boxes around the provided masks\n\n    The masks should be in format [N, H, W] where N is the number of masks, (H, W) are the spatial dimensions.\n\n    Returns a [N, 4] tensor, with the boxes in xyxy format\n    \"\"\"\n    if masks.numel() == 0:\n        return torch.zeros((0, 4), 
device=masks.device)\n\n    h, w = masks.shape[-2:]\n\n    y = torch.arange(0, h, dtype=torch.float, device=masks.device)\n    x = torch.arange(0, w, dtype=torch.float, device=masks.device)\n    y, x = torch.meshgrid(y, x)\n\n    x_mask = (masks * x.unsqueeze(0))\n    x_max = x_mask.flatten(1).max(-1)[0]\n    x_min = x_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1)[0]\n\n    y_mask = (masks * y.unsqueeze(0))\n    y_max = y_mask.flatten(1).max(-1)[0]\n    y_min = y_mask.masked_fill(~(masks.bool()), 1e8).flatten(1).min(-1)[0]\n\n    return torch.stack([x_min, y_min, x_max, y_max], 1)\n\n\n\n"
  },
  {
    "path": "util/misc.py",
    "content": "# ------------------------------------------------------------------------\n# SeqFormer\n# ------------------------------------------------------------------------\n# Modified from Deformable DETR (https://github.com/fundamentalvision/Deformable-DETR)\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# ------------------------------------------------------------------------\n\n\"\"\"\nMisc functions, including distributed helpers.\n\nMostly copy-paste from torchvision references.\n\"\"\"\nimport os\nimport subprocess\nimport time\nfrom collections import defaultdict, deque\nimport datetime\nimport pickle\nfrom typing import Optional, List\n\nimport torch\nimport torch.nn as nn\nimport torch.distributed as dist\nfrom torch import Tensor\n\n# needed due to empty tensor bug in pytorch and torchvision 0.5\nimport torchvision\n# compare (major, minor) tuples: float(version[:3]) misreads two-digit minors such as 0.10\nif tuple(int(p) for p in torchvision.__version__.split('.')[:2]) < (0, 5):\n    import math\n    from torchvision.ops.misc import _NewEmptyTensorOp\n    def _check_size_scale_factor(dim, size, scale_factor):\n        # type: (int, Optional[List[int]], Optional[float]) -> None\n        if size is None and scale_factor is None:\n            raise ValueError(\"either size or scale_factor should be defined\")\n        if size is not None and scale_factor is not None:\n            raise ValueError(\"only one of size or scale_factor should be defined\")\n        # a sequence scale_factor must provide one factor per spatial dim\n        if scale_factor is not None and isinstance(scale_factor, (list, tuple)) and len(scale_factor) != dim:\n            raise ValueError(\n                \"scale_factor shape must match input shape. 
\"\n                \"Input is {}D, scale_factor size is {}\".format(dim, len(scale_factor))\n            )\n    def _output_size(dim, input, size, scale_factor):\n        # type: (int, Tensor, Optional[List[int]], Optional[float]) -> List[int]\n        assert dim == 2\n        _check_size_scale_factor(dim, size, scale_factor)\n        if size is not None:\n            return size\n        # if dim is not 2 or scale_factor is iterable use _ntuple instead of concat\n        assert scale_factor is not None and isinstance(scale_factor, (int, float))\n        scale_factors = [scale_factor, scale_factor]\n        # math.floor might return float in py2.7\n        return [\n            int(math.floor(input.size(i + 2) * scale_factors[i])) for i in range(dim)\n        ]\nelif float(torchvision.__version__[:3]) < 0.7:\n    from torchvision.ops import _new_empty_tensor\n    from torchvision.ops.misc import _output_size\n\n\nclass SmoothedValue(object):\n    \"\"\"Track a series of values and provide access to smoothed values over a\n    window or the global series average.\n    \"\"\"\n\n    def __init__(self, window_size=20, fmt=None):\n        if fmt is None:\n            fmt = \"{median:.4f} ({global_avg:.4f})\"\n        self.deque = deque(maxlen=window_size)\n        self.total = 0.0\n        self.count = 0\n        self.fmt = fmt\n\n    def update(self, value, n=1):\n        self.deque.append(value)\n        self.count += n\n        self.total += value * n\n\n    def synchronize_between_processes(self):\n        \"\"\"\n        Warning: does not synchronize the deque!\n        \"\"\"\n        if not is_dist_avail_and_initialized():\n            return\n        t = torch.tensor([self.count, self.total], dtype=torch.float64, device='cuda')\n        dist.barrier()\n        dist.all_reduce(t)\n        t = t.tolist()\n        self.count = int(t[0])\n        self.total = t[1]\n\n    @property\n    def median(self):\n        d = torch.tensor(list(self.deque))\n        return 
d.median().item()\n\n    @property\n    def avg(self):\n        d = torch.tensor(list(self.deque), dtype=torch.float32)\n        return d.mean().item()\n\n    @property\n    def global_avg(self):\n        return self.total / self.count\n\n    @property\n    def max(self):\n        return max(self.deque)\n\n    @property\n    def value(self):\n        return self.deque[-1]\n\n    def __str__(self):\n        return self.fmt.format(\n            median=self.median,\n            avg=self.avg,\n            global_avg=self.global_avg,\n            max=self.max,\n            value=self.value)\n\n\ndef all_gather(data):\n    \"\"\"\n    Run all_gather on arbitrary picklable data (not necessarily tensors)\n    Args:\n        data: any picklable object\n    Returns:\n        list[data]: list of data gathered from each rank\n    \"\"\"\n    world_size = get_world_size()\n    if world_size == 1:\n        return [data]\n\n    # serialized to a Tensor\n    buffer = pickle.dumps(data)\n    storage = torch.ByteStorage.from_buffer(buffer)\n    tensor = torch.ByteTensor(storage).to(\"cuda\")\n\n    # obtain Tensor size of each rank\n    local_size = torch.tensor([tensor.numel()], device=\"cuda\")\n    size_list = [torch.tensor([0], device=\"cuda\") for _ in range(world_size)]\n    dist.all_gather(size_list, local_size)\n    size_list = [int(size.item()) for size in size_list]\n    max_size = max(size_list)\n\n    # receiving Tensor from all ranks\n    # we pad the tensor because torch all_gather does not support\n    # gathering tensors of different shapes\n    tensor_list = []\n    for _ in size_list:\n        tensor_list.append(torch.empty((max_size,), dtype=torch.uint8, device=\"cuda\"))\n    if local_size != max_size:\n        padding = torch.empty(size=(max_size - local_size,), dtype=torch.uint8, device=\"cuda\")\n        tensor = torch.cat((tensor, padding), dim=0)\n    dist.all_gather(tensor_list, tensor)\n\n    data_list = []\n    for size, tensor in zip(size_list, 
tensor_list):\n        buffer = tensor.cpu().numpy().tobytes()[:size]\n        data_list.append(pickle.loads(buffer))\n\n    return data_list\n\n\ndef reduce_dict(input_dict, average=True):\n    \"\"\"\n    Args:\n        input_dict (dict): all the values will be reduced\n        average (bool): whether to do average or sum\n    Reduce the values in the dictionary from all processes so that all processes\n    have the averaged results. Returns a dict with the same fields as\n    input_dict, after reduction.\n    \"\"\"\n    world_size = get_world_size()\n    if world_size < 2:\n        return input_dict\n    with torch.no_grad():\n        names = []\n        values = []\n        # sort the keys so that they are consistent across processes\n        for k in sorted(input_dict.keys()):\n            names.append(k)\n            values.append(input_dict[k])\n        values = torch.stack(values, dim=0)\n        dist.all_reduce(values)\n        if average:\n            values /= world_size\n        reduced_dict = {k: v for k, v in zip(names, values)}\n    return reduced_dict\n\n\nclass MetricLogger(object):\n    def __init__(self, delimiter=\"\\t\"):\n        self.meters = defaultdict(SmoothedValue)\n        self.delimiter = delimiter\n\n    def update(self, **kwargs):\n        for k, v in kwargs.items():\n            if isinstance(v, torch.Tensor):\n                v = v.item()\n            assert isinstance(v, (float, int))\n            self.meters[k].update(v)\n\n    def __getattr__(self, attr):\n        if attr in self.meters:\n            return self.meters[attr]\n        if attr in self.__dict__:\n            return self.__dict__[attr]\n        raise AttributeError(\"'{}' object has no attribute '{}'\".format(\n            type(self).__name__, attr))\n\n    def __str__(self):\n        loss_str = []\n        for name, meter in self.meters.items():\n            loss_str.append(\n                \"{}: {}\".format(name, str(meter))\n            )\n        return 
self.delimiter.join(loss_str)\n\n    def synchronize_between_processes(self):\n        for meter in self.meters.values():\n            meter.synchronize_between_processes()\n\n    def add_meter(self, name, meter):\n        self.meters[name] = meter\n\n    def log_every(self, iterable, print_freq, header=None):\n        i = 0\n        if not header:\n            header = ''\n        start_time = time.time()\n        end = time.time()\n        iter_time = SmoothedValue(fmt='{avg:.4f}')\n        data_time = SmoothedValue(fmt='{avg:.4f}')\n        space_fmt = ':' + str(len(str(len(iterable)))) + 'd'\n        if torch.cuda.is_available():\n            log_msg = self.delimiter.join([\n                header,\n                '[{0' + space_fmt + '}/{1}]',\n                'eta: {eta}',\n                '{meters}',\n                'time: {time}',\n                'data: {data}',\n                'max mem: {memory:.0f}'\n            ])\n        else:\n            log_msg = self.delimiter.join([\n                header,\n                '[{0' + space_fmt + '}/{1}]',\n                'eta: {eta}',\n                '{meters}',\n                'time: {time}',\n                'data: {data}'\n            ])\n        MB = 1024.0 * 1024.0\n        for obj in iterable:\n            data_time.update(time.time() - end)\n            yield obj\n            iter_time.update(time.time() - end)\n            if i % print_freq == 0 or i == len(iterable) - 1:\n                eta_seconds = iter_time.global_avg * (len(iterable) - i)\n                eta_string = str(datetime.timedelta(seconds=int(eta_seconds)))\n                if torch.cuda.is_available():\n                    print(log_msg.format(\n                        i, len(iterable), eta=eta_string,\n                        meters=str(self),\n                        time=str(iter_time), data=str(data_time),\n                        memory=torch.cuda.max_memory_allocated() / MB))\n                else:\n                    
print(log_msg.format(\n                        i, len(iterable), eta=eta_string,\n                        meters=str(self),\n                        time=str(iter_time), data=str(data_time)))\n            i += 1\n            end = time.time()\n        total_time = time.time() - start_time\n        total_time_str = str(datetime.timedelta(seconds=int(total_time)))\n        print('{} Total time: {} ({:.4f} s / it)'.format(\n            header, total_time_str, total_time / len(iterable)))\n\n\ndef get_sha():\n    cwd = os.path.dirname(os.path.abspath(__file__))\n\n    def _run(command):\n        return subprocess.check_output(command, cwd=cwd).decode('ascii').strip()\n    sha = 'N/A'\n    diff = \"clean\"\n    branch = 'N/A'\n    try:\n        sha = _run(['git', 'rev-parse', 'HEAD'])\n        subprocess.check_output(['git', 'diff'], cwd=cwd)\n        diff = _run(['git', 'diff-index', 'HEAD'])\n        diff = \"has uncommitted changes\" if diff else \"clean\"\n        branch = _run(['git', 'rev-parse', '--abbrev-ref', 'HEAD'])\n    except Exception:\n        pass\n    message = f\"sha: {sha}, status: {diff}, branch: {branch}\"\n    return message\n\n\ndef collate_fn(batch):\n    batch = list(zip(*batch))\n    batch[0] = nested_tensor_from_tensor_list(batch[0], size_divisibility=32)\n    return tuple(batch)\n\n\ndef _max_by_axis(the_list):\n    # type: (List[List[int]]) -> List[int]\n    maxes = the_list[0]\n    for sublist in the_list[1:]:\n        for index, item in enumerate(sublist):\n            maxes[index] = max(maxes[index], item)\n    return maxes\n\n\ndef nested_tensor_from_tensor_list(tensor_list: List[Tensor], size_divisibility=1, split=True):\n    if split:\n        tensor_list = [tensor.split(3, dim=0) for tensor in tensor_list]\n        tensor_list = [item for sublist in tensor_list for item in sublist]\n\n    # TODO make this more general\n    if tensor_list[0].ndim == 3:\n        # TODO make it support different-sized images\n        max_size = 
_max_by_axis([list(img.shape) for img in tensor_list])\n\n        if size_divisibility > 1:\n            stride = size_divisibility\n            # the last two dims are H,W, both subject to divisibility requirement\n            max_size[-2] = (max_size[-2] + (stride - 1)) // stride * stride\n            max_size[-1] = (max_size[-1] + (stride - 1)) // stride * stride\n\n        # min_size = tuple(min(s) for s in zip(*[img.shape for img in tensor_list]))\n        batch_shape = [len(tensor_list)] + max_size\n        b, c, h, w = batch_shape\n        dtype = tensor_list[0].dtype\n        device = tensor_list[0].device\n        tensor = torch.zeros(batch_shape, dtype=dtype, device=device)\n        mask = torch.ones((b, h, w), dtype=torch.bool, device=device)\n        for img, pad_img, m in zip(tensor_list, tensor, mask):\n            pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)\n            m[: img.shape[1], :img.shape[2]] = False\n    else:\n        raise ValueError('not supported')\n    return NestedTensor(tensor, mask)\n\n\nclass NestedTensor(object):\n    def __init__(self, tensors, mask: Optional[Tensor]):\n        self.tensors = tensors\n        self.mask = mask\n\n    def to(self, device, non_blocking=False):\n        # type: (Device) -> NestedTensor # noqa\n        cast_tensor = self.tensors.to(device, non_blocking=non_blocking)\n        mask = self.mask\n        if mask is not None:\n            assert mask is not None\n            cast_mask = mask.to(device, non_blocking=non_blocking)\n        else:\n            cast_mask = None\n        return NestedTensor(cast_tensor, cast_mask)\n\n    def record_stream(self, *args, **kwargs):\n        self.tensors.record_stream(*args, **kwargs)\n        if self.mask is not None:\n            self.mask.record_stream(*args, **kwargs)\n\n    def decompose(self):\n        return self.tensors, self.mask\n\n    def __repr__(self):\n        return str(self.tensors)\n\n\ndef setup_for_distributed(is_master):\n 
   \"\"\"\n    This function disables printing when not in master process\n    \"\"\"\n    import builtins as __builtin__\n    builtin_print = __builtin__.print\n\n    def print(*args, **kwargs):\n        force = kwargs.pop('force', False)\n        if is_master or force:\n            builtin_print(*args, **kwargs)\n\n    __builtin__.print = print\n\n\ndef is_dist_avail_and_initialized():\n    if not dist.is_available():\n        return False\n    if not dist.is_initialized():\n        return False\n    return True\n\n\ndef get_world_size():\n    if not is_dist_avail_and_initialized():\n        return 1\n    return dist.get_world_size()\n\n\ndef get_rank():\n    if not is_dist_avail_and_initialized():\n        return 0\n    return dist.get_rank()\n\n\ndef get_local_size():\n    if not is_dist_avail_and_initialized():\n        return 1\n    return int(os.environ['LOCAL_SIZE'])\n\n\ndef get_local_rank():\n    if not is_dist_avail_and_initialized():\n        return 0\n    return int(os.environ['LOCAL_RANK'])\n\n\ndef is_main_process():\n    return get_rank() == 0\n\n\ndef save_on_master(*args, **kwargs):\n    if is_main_process():\n        torch.save(*args, **kwargs)\n\n\ndef init_distributed_mode(args):\n    if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:\n        args.rank = int(os.environ[\"RANK\"])\n        args.world_size = int(os.environ['WORLD_SIZE'])\n        args.gpu = int(os.environ['LOCAL_RANK'])\n        args.dist_url = 'env://'\n        os.environ['LOCAL_SIZE'] = str(torch.cuda.device_count())\n    elif 'SLURM_PROCID' in os.environ:\n        proc_id = int(os.environ['SLURM_PROCID'])\n        ntasks = int(os.environ['SLURM_NTASKS'])\n        node_list = os.environ['SLURM_NODELIST']\n        num_gpus = torch.cuda.device_count()\n        addr = subprocess.getoutput(\n            'scontrol show hostname {} | head -n1'.format(node_list))\n        os.environ['MASTER_PORT'] = os.environ.get('MASTER_PORT', '29500')\n        os.environ['MASTER_ADDR'] = 
addr\n        os.environ['WORLD_SIZE'] = str(ntasks)\n        os.environ['RANK'] = str(proc_id)\n        os.environ['LOCAL_RANK'] = str(proc_id % num_gpus)\n        os.environ['LOCAL_SIZE'] = str(num_gpus)\n        args.dist_url = 'env://'\n        args.world_size = ntasks\n        args.rank = proc_id\n        args.gpu = proc_id % num_gpus\n    else:\n        print('Not using distributed mode')\n        args.distributed = False\n        return\n\n    args.distributed = True\n\n    torch.cuda.set_device(args.gpu)\n    args.dist_backend = 'nccl'\n    print('| distributed init (rank {}): {}'.format(\n        args.rank, args.dist_url), flush=True)\n    torch.distributed.init_process_group(backend=args.dist_backend, init_method=args.dist_url,\n                                         world_size=args.world_size, rank=args.rank)\n    torch.distributed.barrier()\n    setup_for_distributed(args.rank == 0)\n\n\n@torch.no_grad()\ndef accuracy(output, target, topk=(1,)):\n    \"\"\"Computes the precision@k for the specified values of k\"\"\"\n    if target.numel() == 0:\n        return [torch.zeros([], device=output.device)]\n    maxk = max(topk)\n    batch_size = target.size(0)\n\n    _, pred = output.topk(maxk, 1, True, True)\n    pred = pred.t()\n    correct = pred.eq(target.view(1, -1).expand_as(pred))\n\n    res = []\n    for k in topk:\n        correct_k = correct[:k].view(-1).float().sum(0)\n        res.append(correct_k.mul_(100.0 / batch_size))\n    return res\n\n\ndef interpolate(input, size=None, scale_factor=None, mode=\"nearest\", align_corners=None):\n    # type: (Tensor, Optional[List[int]], Optional[float], str, Optional[bool]) -> Tensor\n    \"\"\"\n    Equivalent to nn.functional.interpolate, but with support for empty batch sizes.\n    This will eventually be supported natively by PyTorch, and this\n    class can go away.\n    \"\"\"\n    if float(torchvision.__version__[:3]) < 0.7:\n        if input.numel() > 0:\n            return 
torch.nn.functional.interpolate(\n                input, size, scale_factor, mode, align_corners\n            )\n\n        output_shape = _output_size(2, input, size, scale_factor)\n        output_shape = list(input.shape[:-2]) + list(output_shape)\n        # compare (major, minor) tuples rather than float(version[:3]), which misreads e.g. 0.10\n        if tuple(int(p) for p in torchvision.__version__.split('.')[:2]) < (0, 5):\n            return _NewEmptyTensorOp.apply(input, output_shape)\n        return _new_empty_tensor(input, output_shape)\n    else:\n        return torchvision.ops.misc.interpolate(input, size, scale_factor, mode, align_corners)\n\n\ndef get_total_grad_norm(parameters, norm_type=2):\n    parameters = list(filter(lambda p: p.grad is not None, parameters))\n    norm_type = float(norm_type)\n    device = parameters[0].grad.device\n    total_norm = torch.norm(torch.stack([torch.norm(p.grad.detach(), norm_type).to(device) for p in parameters]),\n                            norm_type)\n    return total_norm\n\ndef inverse_sigmoid(x, eps=1e-5):\n    x = x.clamp(min=0, max=1)\n    x1 = x.clamp(min=eps)\n    x2 = (1 - x).clamp(min=eps)\n    return torch.log(x1/x2)\n\n\n"
  },
  {
    "path": "util/plot_utils.py",
    "content": "# ------------------------------------------------------------------------\n# SeqFormer\n# ------------------------------------------------------------------------\n# Modified from Deformable DETR (https://github.com/fundamentalvision/Deformable-DETR)\n# Copyright (c) 2020 SenseTime. All Rights Reserved.\n# ------------------------------------------------------------------------\n\"\"\"\nPlotting utilities to visualize training logs.\n\"\"\"\nimport torch\nimport pandas as pd\nimport seaborn as sns\nimport matplotlib.pyplot as plt\n\nfrom pathlib import Path, PurePath\n\n\ndef plot_logs(logs, fields=('class_error', 'loss_bbox_unscaled', 'mAP'), ewm_col=0, log_name='log.txt'):\n    '''\n    Function to plot specific fields from training log(s). Plots both training and test results.\n\n    :: Inputs - logs = list containing Path objects, each pointing to individual dir with a log file\n              - fields = which results to plot from each log file - plots both training and test for each field.\n              - ewm_col = optional, which column to use as the exponential weighted smoothing of the plots\n              - log_name = optional, name of log file if different than default 'log.txt'.\n\n    :: Outputs - matplotlib plots of results in fields, color coded for each log file.\n               - solid lines are training results, dashed lines are test results.\n\n    '''\n    func_name = \"plot_utils.py::plot_logs\"\n\n    # verify logs is a list of Paths (list[Paths]) or single Pathlib object Path,\n    # convert single Path to list to avoid 'not iterable' error\n\n    if not isinstance(logs, list):\n        if isinstance(logs, PurePath):\n            logs = [logs]\n            print(f\"{func_name} info: logs param expects a list argument, converted to list[Path].\")\n        else:\n            raise ValueError(f\"{func_name} - invalid argument for logs parameter.\\n \\\n            Expect list[Path] or single Path obj, received {type(logs)}\")\n\n 
   # verify valid dir(s) and that every item in list is Path object\n    for i, dir in enumerate(logs):\n        if not isinstance(dir, PurePath):\n            raise ValueError(f\"{func_name} - non-Path object in logs argument of {type(dir)}: \\n{dir}\")\n        if dir.exists():\n            continue\n        raise ValueError(f\"{func_name} - invalid directory in logs argument:\\n{dir}\")\n\n    # load log file(s) and plot\n    import numpy as np  # local import; the pd.np alias was removed in pandas 1.0\n    dfs = [pd.read_json(Path(p) / log_name, lines=True) for p in logs]\n\n    fig, axs = plt.subplots(ncols=len(fields), figsize=(16, 5))\n\n    for df, color in zip(dfs, sns.color_palette(n_colors=len(logs))):\n        for j, field in enumerate(fields):\n            if field == 'mAP':\n                coco_eval = pd.DataFrame(np.stack(df.test_coco_eval.dropna().values)[:, 1]).ewm(com=ewm_col).mean()\n                axs[j].plot(coco_eval, c=color)\n            else:\n                df.interpolate().ewm(com=ewm_col).mean().plot(\n                    y=[f'train_{field}', f'test_{field}'],\n                    ax=axs[j],\n                    color=[color] * 2,\n                    style=['-', '--']\n                )\n    for ax, field in zip(axs, fields):\n        ax.legend([Path(p).name for p in logs])\n        ax.set_title(field)\n\n\ndef plot_precision_recall(files, naming_scheme='iter'):\n    if naming_scheme == 'exp_id':\n        # name becomes exp_id\n        names = [f.parts[-3] for f in files]\n    elif naming_scheme == 'iter':\n        names = [f.stem for f in files]\n    else:\n        raise ValueError(f'not supported {naming_scheme}')\n    fig, axs = plt.subplots(ncols=2, figsize=(16, 5))\n    for f, color, name in zip(files, sns.color_palette(\"Blues\", n_colors=len(files)), names):\n        data = torch.load(f)\n        # precision is n_iou, n_points, n_cat, n_area, max_det\n        precision = data['precision']\n        recall = data['params'].recThrs\n        scores = data['scores']\n        # take precision for all classes, 
all areas and 100 detections\n        precision = precision[0, :, :, 0, -1].mean(1)\n        scores = scores[0, :, :, 0, -1].mean(1)\n        prec = precision.mean()\n        rec = data['recall'][0, :, 0, -1].mean()\n        print(f'{naming_scheme} {name}: mAP@50={prec * 100: 05.1f}, ' +\n              f'score={scores.mean():0.3f}, ' +\n              f'f1={2 * prec * rec / (prec + rec + 1e-8):0.3f}'\n              )\n        axs[0].plot(recall, precision, c=color)\n        axs[1].plot(recall, scores, c=color)\n\n    axs[0].set_title('Precision / Recall')\n    axs[0].legend(names)\n    axs[1].set_title('Scores / Recall')\n    axs[1].legend(names)\n    return fig, axs\n\n\n\n"
  }
]