[
  {
    "path": "CODE_OF_CONDUCT.md",
    "content": "# Code of Conduct\n\nFacebook has adopted a Code of Conduct that we expect project participants to adhere to.\nPlease read the [full text](https://code.fb.com/codeofconduct/)\nso that you can understand what actions will and will not be tolerated.\n"
  },
  {
    "path": "CONTRIBUTING.md",
    "content": "# Contributing to PySlowFast\nWe want to make contributing to this project as easy and transparent as\npossible.\n\n## Pull Requests\nWe actively welcome your pull requests.\n\n1. Fork the repo and create your branch from `master`.\n2. If you've changed APIs, update the documentation.\n3. Ensure the test suite passes.\n4. Make sure your code lints.\n5. Ensure no regressions in baseline model speed and accuracy.\n6. If you haven't already, complete the Contributor License Agreement (\"CLA\").\n\n## Contributor License Agreement (\"CLA\")\nIn order to accept your pull request, we need you to submit a CLA. You only need\nto do this once to work on any of Facebook's open source projects.\n\nComplete your CLA here: <https://code.facebook.com/cla>\n\n## Issues\n\nPlease ensure your description is clear and has sufficient instructions to be able to reproduce the issue. The recommended issue format is:\n------\n\n#### To Reproduce\n```How to reproduce the issue.```\n#### Expected behavior\n```Expected output.```\n#### Environment\n```Your environment.```\n\n------\n\n## Coding Style  \n* 4 spaces for indentation rather than tabs\n* 80 character line length\n* PEP8 formatting\n\n## License\nBy contributing to PySlowFast, you agree that your contributions will be licensed under the LICENSE file in the root directory of this source tree.\n"
  },
  {
    "path": "GETTING_STARTED.md",
    "content": "# Getting Started with PySlowFast\n\nThis document gives a brief introduction to launching training and testing jobs in PySlowFast. Before launching any job, make sure you have installed PySlowFast following the instructions in [README.md](README.md) and prepared the dataset in the correct format following [DATASET.md](slowfast/datasets/DATASET.md).\n\n## Train a Standard Model from Scratch\n\nStart by training a simple C2D model:\n\n```\npython tools/run_net.py \\\n  --cfg configs/Kinetics/C2D_8x8_R50.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_dataset \\\n  NUM_GPUS 2 \\\n  TRAIN.BATCH_SIZE 16\n```\nYou can pass the location of your dataset on the command line by adding `DATA.PATH_TO_DATA_DIR path_to_your_dataset`, or you can add\n\n```\nDATA:\n  PATH_TO_DATA_DIR: path_to_your_dataset\n```\nto the YAML config file, so that you do not need to pass it on the command line every time.\n\n\nTo launch a quick job for debugging on your local machine, you may also want to add:\n```\n  DATA_LOADER.NUM_WORKERS 0 \\\n  NUM_GPUS 2 \\\n  TRAIN.BATCH_SIZE 16\n```\n\n## Resume from an Existing Checkpoint\nIf your checkpoint was trained with PyTorch, add the following to the command line or to the YAML config:\n\n```\nTRAIN.CHECKPOINT_FILE_PATH path_to_your_PyTorch_checkpoint\n```\n\nIf the checkpoint was trained with Caffe2, use the following instead:\n\n```\nTRAIN.CHECKPOINT_FILE_PATH path_to_your_Caffe2_checkpoint \\\nTRAIN.CHECKPOINT_TYPE caffe2\n```\n\nIf you need to perform inflation on the checkpoint, remember to set `TRAIN.CHECKPOINT_INFLATE` to True.\n\n\n## Perform Test\n`TRAIN.ENABLE` and `TEST.ENABLE` control whether training or testing is run for the current job. 
To run testing only, set `TRAIN.ENABLE` to False and pass the path of the model you want to test to `TEST.CHECKPOINT_FILE_PATH`:\n```\npython tools/run_net.py \\\n  --cfg configs/Kinetics/C2D_8x8_R50.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_dataset \\\n  TEST.CHECKPOINT_FILE_PATH path_to_your_checkpoint \\\n  TRAIN.ENABLE False\n```\n\n### Run command\n```\npython tools/run_net.py --cfg path/to/<pretrained_model_config_file>.yaml\n```\n"
  },
  {
    "path": "INSTALL.md",
    "content": "# Installation\n\n## Requirements\n- Python >= 3.8\n- NumPy\n- PyTorch >= 1.3\n- [fvcore](https://github.com/facebookresearch/fvcore/): `pip install 'git+https://github.com/facebookresearch/fvcore'`\n- [torchvision](https://github.com/pytorch/vision/) that matches the PyTorch installation.\n  You can install them together at [pytorch.org](https://pytorch.org) to make sure of this.\n- simplejson: `pip install simplejson`\n- GCC >= 4.9\n- PyAV: `conda install av -c conda-forge`\n- ffmpeg (4.0 is preferred, will be installed along with PyAV)\n- PyYaml: (will be installed along with fvcore)\n- tqdm: (will be installed along with fvcore)\n- iopath: `pip install -U iopath` or `conda install -c iopath iopath`\n- psutil: `pip install psutil`\n- OpenCV: `pip install opencv-python`\n- torchvision: `pip install torchvision` or `conda install torchvision -c pytorch`\n- tensorboard: `pip install tensorboard`\n- moviepy: (optional, for visualizing video on tensorboard) `conda install -c conda-forge moviepy` or `pip install moviepy`\n- PyTorchVideo: `pip install pytorchvideo`\n- FairScale: `pip install 'git+https://github.com/facebookresearch/fairscale'`\n- [Detectron2](https://github.com/facebookresearch/detectron2):\n```\n    pip install -U torch torchvision cython\n    pip install -U 'git+https://github.com/facebookresearch/fvcore.git' 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'\n    git clone https://github.com/facebookresearch/detectron2 detectron2_repo\n    pip install -e detectron2_repo\n    # You can find more details at https://github.com/facebookresearch/detectron2/blob/master/INSTALL.md\n```\n\n## PyTorch\nPlease follow the official PyTorch instructions to install from source:\n```\ngit clone --recursive https://github.com/pytorch/pytorch\n```\n\n## PySlowFast\n\nClone the PySlowFast Video Understanding repository.\n```\ngit clone https://github.com/facebookresearch/slowfast\n```\n\nAdd this repository to 
$PYTHONPATH.\n```\nexport PYTHONPATH=/path/to/SlowFast/slowfast:$PYTHONPATH\n```\n\n### Build PySlowFast\n\nAfter installing the above dependencies, run the following from the directory that contains the cloned repository:\n```\ncd slowfast\npython setup.py build develop\n```\n\nOnce the installation is finished, run the pipeline with:\n```\npython tools/run_net.py --cfg configs/Kinetics/C2D_8x8_R50.yaml NUM_GPUS 1 TRAIN.BATCH_SIZE 8 SOLVER.BASE_LR 0.0125 DATA.PATH_TO_DATA_DIR path_to_your_data_folder\n```\n"
  },
  {
    "path": "LICENSE",
    "content": "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/\n\nTERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION\n\n1. Definitions.\n\n\"License\" shall mean the terms and conditions for use, reproduction,\nand distribution as defined by Sections 1 through 9 of this document.\n\n\"Licensor\" shall mean the copyright owner or entity authorized by\nthe copyright owner that is granting the License.\n\n\"Legal Entity\" shall mean the union of the acting entity and all\nother entities that control, are controlled by, or are under common\ncontrol with that entity. For the purposes of this definition,\n\"control\" means (i) the power, direct or indirect, to cause the\ndirection or management of such entity, whether by contract or\notherwise, or (ii) ownership of fifty percent (50%) or more of the\noutstanding shares, or (iii) beneficial ownership of such entity.\n\n\"You\" (or \"Your\") shall mean an individual or Legal Entity\nexercising permissions granted by this License.\n\n\"Source\" form shall mean the preferred form for making modifications,\nincluding but not limited to software source code, documentation\nsource, and configuration files.\n\n\"Object\" form shall mean any form resulting from mechanical\ntransformation or translation of a Source form, including but\nnot limited to compiled object code, generated documentation,\nand conversions to other media types.\n\n\"Work\" shall mean the work of authorship, whether in Source or\nObject form, made available under the License, as indicated by a\ncopyright notice that is included in or attached to the work\n(an example is provided in the Appendix below).\n\n\"Derivative Works\" shall mean any work, whether in Source or Object\nform, that is based on (or derived from) the Work and for which the\neditorial revisions, annotations, elaborations, or other modifications\nrepresent, as a whole, an original work of authorship. 
For the purposes\nof this License, Derivative Works shall not include works that remain\nseparable from, or merely link (or bind by name) to the interfaces of,\nthe Work and Derivative Works thereof.\n\n\"Contribution\" shall mean any work of authorship, including\nthe original version of the Work and any modifications or additions\nto that Work or Derivative Works thereof, that is intentionally\nsubmitted to Licensor for inclusion in the Work by the copyright owner\nor by an individual or Legal Entity authorized to submit on behalf of\nthe copyright owner. For the purposes of this definition, \"submitted\"\nmeans any form of electronic, verbal, or written communication sent\nto the Licensor or its representatives, including but not limited to\ncommunication on electronic mailing lists, source code control systems,\nand issue tracking systems that are managed by, or on behalf of, the\nLicensor for the purpose of discussing and improving the Work, but\nexcluding communication that is conspicuously marked or otherwise\ndesignated in writing by the copyright owner as \"Not a Contribution.\"\n\n\"Contributor\" shall mean Licensor and any individual or Legal Entity\non behalf of whom a Contribution has been received by Licensor and\nsubsequently incorporated within the Work.\n\n2. Grant of Copyright License. Subject to the terms and conditions of\nthis License, each Contributor hereby grants to You a perpetual,\nworldwide, non-exclusive, no-charge, royalty-free, irrevocable\ncopyright license to reproduce, prepare Derivative Works of,\npublicly display, publicly perform, sublicense, and distribute the\nWork and such Derivative Works in Source or Object form.\n\n3. Grant of Patent License. 
Subject to the terms and conditions of\nthis License, each Contributor hereby grants to You a perpetual,\nworldwide, non-exclusive, no-charge, royalty-free, irrevocable\n(except as stated in this section) patent license to make, have made,\nuse, offer to sell, sell, import, and otherwise transfer the Work,\nwhere such license applies only to those patent claims licensable\nby such Contributor that are necessarily infringed by their\nContribution(s) alone or by combination of their Contribution(s)\nwith the Work to which such Contribution(s) was submitted. If You\ninstitute patent litigation against any entity (including a\ncross-claim or counterclaim in a lawsuit) alleging that the Work\nor a Contribution incorporated within the Work constitutes direct\nor contributory patent infringement, then any patent licenses\ngranted to You under this License for that Work shall terminate\nas of the date such litigation is filed.\n\n4. Redistribution. You may reproduce and distribute copies of the\nWork or Derivative Works thereof in any medium, with or without\nmodifications, and in Source or Object form, provided that You\nmeet the following conditions:\n\n(a) You must give any other recipients of the Work or\nDerivative Works a copy of this License; and\n\n(b) You must cause any modified files to carry prominent notices\nstating that You changed the files; and\n\n(c) You must retain, in the Source form of any Derivative Works\nthat You distribute, all copyright, patent, trademark, and\nattribution notices from the Source form of the Work,\nexcluding those notices that do not pertain to any part of\nthe Derivative Works; and\n\n(d) If the Work includes a \"NOTICE\" text file as part of its\ndistribution, then any Derivative Works that You distribute must\ninclude a readable copy of the attribution notices contained\nwithin such NOTICE file, excluding those notices that do not\npertain to any part of the Derivative Works, in at least one\nof the following places: within a 
NOTICE text file distributed\nas part of the Derivative Works; within the Source form or\ndocumentation, if provided along with the Derivative Works; or,\nwithin a display generated by the Derivative Works, if and\nwherever such third-party notices normally appear. The contents\nof the NOTICE file are for informational purposes only and\ndo not modify the License. You may add Your own attribution\nnotices within Derivative Works that You distribute, alongside\nor as an addendum to the NOTICE text from the Work, provided\nthat such additional attribution notices cannot be construed\nas modifying the License.\n\nYou may add Your own copyright statement to Your modifications and\nmay provide additional or different license terms and conditions\nfor use, reproduction, or distribution of Your modifications, or\nfor any such Derivative Works as a whole, provided Your use,\nreproduction, and distribution of the Work otherwise complies with\nthe conditions stated in this License.\n\n5. Submission of Contributions. Unless You explicitly state otherwise,\nany Contribution intentionally submitted for inclusion in the Work\nby You to the Licensor shall be under the terms and conditions of\nthis License, without any additional terms or conditions.\nNotwithstanding the above, nothing herein shall supersede or modify\nthe terms of any separate license agreement you may have executed\nwith Licensor regarding such Contributions.\n\n6. Trademarks. This License does not grant permission to use the trade\nnames, trademarks, service marks, or product names of the Licensor,\nexcept as required for reasonable and customary use in describing the\norigin of the Work and reproducing the content of the NOTICE file.\n\n7. Disclaimer of Warranty. 
Unless required by applicable law or\nagreed to in writing, Licensor provides the Work (and each\nContributor provides its Contributions) on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or\nimplied, including, without limitation, any warranties or conditions\nof TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A\nPARTICULAR PURPOSE. You are solely responsible for determining the\nappropriateness of using or redistributing the Work and assume any\nrisks associated with Your exercise of permissions under this License.\n\n8. Limitation of Liability. In no event and under no legal theory,\nwhether in tort (including negligence), contract, or otherwise,\nunless required by applicable law (such as deliberate and grossly\nnegligent acts) or agreed to in writing, shall any Contributor be\nliable to You for damages, including any direct, indirect, special,\nincidental, or consequential damages of any character arising as a\nresult of this License or out of the use or inability to use the\nWork (including but not limited to damages for loss of goodwill,\nwork stoppage, computer failure or malfunction, or any and all\nother commercial damages or losses), even if such Contributor\nhas been advised of the possibility of such damages.\n\n9. Accepting Warranty or Additional Liability. While redistributing\nthe Work or Derivative Works thereof, You may choose to offer,\nand charge a fee for, acceptance of support, warranty, indemnity,\nor other liability obligations and/or rights consistent with this\nLicense. 
However, in accepting such obligations, You may act only\non Your own behalf and on Your sole responsibility, not on behalf\nof any other Contributor, and only if You agree to indemnify,\ndefend, and hold each Contributor harmless for any liability\nincurred by, or claims asserted against, such Contributor by reason\nof your accepting any such warranty or additional liability.\n\nEND OF TERMS AND CONDITIONS\n\nAPPENDIX: How to apply the Apache License to your work.\n\nTo apply the Apache License to your work, attach the following\nboilerplate notice, with the fields enclosed by brackets \"[]\"\nreplaced with your own identifying information. (Don't include\nthe brackets!)  The text should be enclosed in the appropriate\ncomment syntax for the file format. We also recommend that a\nfile or class name and description of purpose be included on the\nsame \"printed page\" as the copyright notice for easier\nidentification within third-party archives.\n\nCopyright 2019, Facebook, Inc\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\nhttp://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n"
  },
  {
    "path": "MODEL_ZOO.md",
    "content": "# PySlowFast Model Zoo and Baselines\n\n## Kinetics 400 and 600\n\n| architecture | size |  crops x clips |  frame length x sample rate | top1 |  top5  |  model | config | dataset |\n| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |\n| C2D | R50 | 3 x 10 | 8 x 8 | 67.2 | 87.8 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/C2D_NOPOOL_8x8_R50.pkl) | Kinetics/c2/C2D_NOPOOL_8x8_R50 | K400 |\n| I3D | R50 | 3 x 10 | 8 x 8 | 73.5 | 90.8 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/I3D_8x8_R50.pkl) | Kinetics/c2/I3D_8x8_R50 | K400 |\n| I3D NLN | R50 | 3 x 10 | 8 x 8 | 74.0 | 91.1 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/I3D_NLN_8x8_R50.pkl) | Kinetics/c2/I3D_NLN_8x8_R50 | K400 |\n| Slow | R50 | 3 x 10 | 4 x 16 | 72.7 | 90.3 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/SLOWONLY_4x16_R50.pkl) | Kinetics/c2/SLOW_4x16_R50 | K400 |\n| Slow | R50 | 3 x 10 | 8 x 8 | 74.8 | 91.6 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/SLOWONLY_8x8_R50.pkl) | Kinetics/c2/SLOW_8x8_R50 | K400 |\n| SlowFast | R50 | 3 x 10 | 4 x 16 | 75.6 | 92.0 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/SLOWFAST_4x16_R50.pkl) | Kinetics/c2/SLOWFAST_4x16_R50 | K400 |\n| SlowFast | R50 | 3 x 10 | 8 x 8 | 77.0 | 92.6 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/SLOWFAST_8x8_R50.pkl) | Kinetics/c2/SLOWFAST_8x8_R50 | K400 |\n| MViTv1 | B-Conv | 1 x 5 | 16 x 4 | 78.4 | 93.5 | [`link`](https://drive.google.com/file/d/194gJinVejq6A1FmySNKQ8vAN5-FOY-QL/view?usp=sharing) | Kinetics/MVIT_B_16x4_CONV | K400 |\n| rev-MViT | B-Conv | 1 x 5 | 16 x 4 | 78.4 | 93.4 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/rev/REV_MVIT_B_16x4.pyth) | Kinetics/REV_MVIT_B_16x4_CONV | K400 |\n| MViTv1 | B-Conv | 1 x 5 
| 32 x 3 | 80.4 | 94.8 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvit/k400.pyth) | Kinetics/MVIT_B_32x3_CONV | K400 |\n| MViTv1 | B-Conv | 1 x 5 | 32 x 3 | 83.9 | 96.5 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvit/k600.pyth) | Kinetics/MVIT_B_32x3_CONV_K600 | K600 |\n| MViTv2 | S | 1 x 5 | 16 x 4 | 81.0 | 94.6 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvitv2/pysf_video_models/MViTv2_S_16x4_k400_f302660347.pyth) | Kinetics/MVITv2_S_16x4 | K400 |\n| MViTv2 | B | 1 x 5 | 32 x 3 | 82.9 | 95.7 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvitv2/pysf_video_models/MViTv2_B_32x3_k400_f304025456.pyth) | Kinetics/MVITv2_B_32x3 | K400 |\n\n## X3D models (details in projects/x3d)\n\n|    architecture     |  size  | pretrain |    frame length x sample rate     | top1 10-view | top1 30-view | parameters (M) | FLOPs (G) | model | config |\n| :-------------: | :-----: | :-----: | :-------------: | :------: | :------: | :------------: | :----: | :------: | :------: |\n| X3D | XS | - | 4 x 12 | 68.7 | 69.5 | 3.8 | 0.60 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/x3d_models/x3d_xs.pyth) | Kinetics/X3D_XS |\n| X3D | S | - | 13 x 6 | 73.1 | 73.5 | 3.8 | 1.96 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/x3d_models/x3d_s.pyth) | Kinetics/X3D_S |\n| X3D | M | - | 16 x 5 | 75.1 | 76.2 | 3.8 | 4.73 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/x3d_models/x3d_m.pyth) | Kinetics/X3D_M |\n| X3D | L | - | 16 x 5 | 76.9 | 77.5 | 6.2 | 18.37 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/x3d_models/x3d_l.pyth) | Kinetics/X3D_L |\n\n## AVA\n\n| architecture | size | Pretrain Model |  frame length x sample rate  | MAP | AVA version | model |\n| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |------------- |\n| Slow | R50 | [Kinetics 400](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/ava/pretrain/C2D_8x8_R50.pkl) | 4 x 16 | 19.5 | 2.2 | 
[`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/ava/C2D_8x8_R50.pkl) |\n| SlowFast | R101 | [Kinetics 600](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/ava/pretrain/SLOWFAST_32x2_R101_50_50_v2.1.pkl) | 8 x 8 | 28.2 | 2.1 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/ava/SLOWFAST_32x2_R101_50_50_v2.1.pkl) |\n| SlowFast | R101 | [Kinetics 600](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/ava/pretrain/SLOWFAST_32x2_R101_50_50.pkl) | 8 x 8 | 29.1 | 2.2 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/ava/SLOWFAST_32x2_R101_50_50.pkl) |\n| SlowFast | R101 | [Kinetics 600](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/ava/pretrain/SLOWFAST_64x2_R101_50_50.pkl) | 16 x 8 | 29.4 | 2.2 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/ava/SLOWFAST_64x2_R101_50_50.pkl) |\n\n## Multigrid Training\n\n***Update June, 2020:*** In the following we provide (reimplemented) models from  \"[A Multigrid Method for Efficiently Training Video Models\n](https://arxiv.org/abs/1912.00998)\" paper. The multigrid method trains about 3-6x faster than the original training on multiple datasets. See [projects/multigrid](projects/multigrid/README.md) for more information. 
The following provides models, results, and example config files.\n\n#### Kinetics:\n| architecture | size |  pretrain |  frame length x sample rate | training | top1 |  top5  |  model | config |\n| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |\n| SlowFast | R50 | - | 8 x 8 | Standard | 76.8 | 92.7 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/pyslowfast/model_zoo/multigrid/model_zoo/Kinetics/SLOWFAST_8x8_R50_stepwise.pkl) | Kinetics/SLOWFAST_8x8_R50_stepwise |\n| SlowFast | R50 | - | 8 x 8 | Multigrid | 76.6 | 92.7 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/pyslowfast/model_zoo/multigrid/model_zoo/Kinetics/SLOWFAST_8x8_R50_stepwise_multigrid.pkl) | Kinetics/SLOWFAST_8x8_R50_stepwise_multigrid |\n\n(Here we use stepwise learning rate schedule.)\n\n#### Something-Something V2:\n| architecture | size |  pretrain |  frame length x sample rate | training | top1 |  top5  |  model | config |\n| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |\n| SlowFast | R50 | [Kinetics 400](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/SLOWFAST_8x8_R50.pkl) | 16 x 8 | Standard | 63.0 | 88.5 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/pyslowfast/model_zoo/multigrid/model_zoo/SSv2/SLOWFAST_16x8_R50.pkl) | SSv2/SLOWFAST_16x8_R50 |\n| SlowFast | R50 | [Kinetics 400](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/SLOWFAST_8x8_R50.pkl) | 16 x 8 | Multigrid | 63.5 | 88.7 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/pyslowfast/model_zoo/multigrid/model_zoo/SSv2/SLOWFAST_16x8_R50_multigrid.pkl) | SSv2/SLOWFAST_16x8_R50_multigrid |\n\n\n#### Charades\n| architecture | size |  pretrain |  frame length x sample rate | training | mAP |  model | config |\n| ------------- | ------------- | ------------- | ------------- | ------------- | 
------------- | ------------- | ------------- |\n| SlowFast | R50 | [Kinetics 400](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/SLOWFAST_8x8_R50.pkl) | 16 x 8 | Standard | 38.9 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/pyslowfast/model_zoo/multigrid/model_zoo/Charades/SLOWFAST_16x8_R50.pkl) | SSv2/SLOWFAST_16x8_R50 |\n| SlowFast | R50 | [Kinetics 400](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/SLOWFAST_8x8_R50.pkl) | 16 x 8 | Multigrid | 38.6 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/pyslowfast/model_zoo/multigrid/model_zoo/Charades/SLOWFAST_16x8_R50_multigrid.pkl) | SSv2/SLOWFAST_16x8_R50_multigrid |\n\n\n## ImageNet\n\nWe also release the imagenet pretrained model if finetuning from ImageNet is preferred. The reported accuracy is obtained by center crop testing on the validation set.\n\n| architecture | size |  Top1 |  Top5  |  model  | Config |\n| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |\n| ResNet | R50 | 76.4 | 93.2 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/kinetics400/R50_IN1K.pyth) | ImageNet/RES_R50 |\n| MVIT | B-16-Conv | 82.9 | 96.3 | [`link`](https://drive.google.com/file/d/1dYYqUB-3DSgBVc9d6o-rW8ojtVsrFLgp/view?usp=sharing) | ImageNet/MVIT_B_16_CONV |\n| rev-VIT | Small | 79.9 | 94.9 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/rev/REV_VIT_S.pyth) | ImageNet/REV_VIT_S.yaml |\n| rev-VIT | Base |  81.8 | 95.6 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/rev/REV_VIT_B.pyth) | ImageNet/REV_VIT_B.yaml |\n| rev-MVIT | Base |  82.9* | 96.3 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/rev/REV_MVIT_B.pyth) | ImageNet/REV_MVIT_B_16_CONV.yaml |\n\n*please refer to [Reversible Model Zoo](projects/rev/README.md).\n\n## PyTorchVideo\n\nWe support and benchmark PyTorchVideo models and datasets in PySlowFast. 
See [projects/pytorchvideo](projects/pytorchvideo/README.md) for more information about PyTorchVideo Model Zoo.\n"
  },
  {
    "path": "README.md",
    "content": "# PySlowFast\n\nPySlowFast is an open source video understanding codebase from FAIR that provides state-of-the-art video classification models with efficient training. This repository includes implementations of the following methods:\n\n- [SlowFast Networks for Video Recognition](https://arxiv.org/abs/1812.03982)\n- [Non-local Neural Networks](https://arxiv.org/abs/1711.07971)\n- [A Multigrid Method for Efficiently Training Video Models](https://arxiv.org/abs/1912.00998)\n- [X3D: Progressive Network Expansion for Efficient Video Recognition](https://arxiv.org/abs/2004.04730)\n- [Multiscale Vision Transformers](https://arxiv.org/abs/2104.11227)\n- [A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning](https://arxiv.org/abs/2104.14558)\n- [MViTv2: Improved Multiscale Vision Transformers for Classification and Detection](https://arxiv.org/abs/2112.01526)\n- [Masked Feature Prediction for Self-Supervised Visual Pre-Training](https://arxiv.org/abs/2112.09133)\n- [Masked Autoencoders As Spatiotemporal Learners](https://arxiv.org/abs/2205.09113)\n- [Reversible Vision Transformers](https://openaccess.thecvf.com/content/CVPR2022/papers/Mangalam_Reversible_Vision_Transformers_CVPR_2022_paper.pdf)\n\n<div align=\"center\">\n  <img src=\"demo/ava_demo.gif\" width=\"600px\"/>\n</div>\n\n## Introduction\n\nThe goal of PySlowFast is to provide a high-performance, light-weight PyTorch codebase with state-of-the-art video backbones for video understanding research on different tasks (classification, detection, etc.). It is designed to support rapid implementation and evaluation of novel video research ideas. 
PySlowFast includes implementations of the following backbone network architectures:\n\n- SlowFast\n- Slow\n- C2D\n- I3D\n- Non-local Network\n- X3D\n- MViTv1 and MViTv2\n- Rev-ViT and Rev-MViT\n\n## Updates\n - We now support [Reversible Vision Transformers](https://openaccess.thecvf.com/content/CVPR2022/papers/Mangalam_Reversible_Vision_Transformers_CVPR_2022_paper.pdf). Both Reversible ViT and MViT models are released. See [`projects/rev`](./projects/rev/README.md).\n - We now support [MAE for Video](https://arxiv.org/abs/2205.09113). See [`projects/mae`](./projects/mae/README.md) for more information.\n - We now support [MaskFeat](https://arxiv.org/abs/2112.09133). See [`projects/maskfeat`](./projects/maskfeat/README.md) for more information.\n - We now support [MViTv2](https://arxiv.org/abs/2112.01526) in PySlowFast. See [`projects/mvitv2`](./projects/mvitv2/README.md) for more information.\n - We now support [A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning](https://arxiv.org/abs/2104.14558). See [`projects/contrastive_ssl`](./projects/contrastive_ssl/README.md) for more information.\n - We now support [Multiscale Vision Transformers](https://arxiv.org/abs/2104.11227) on Kinetics and ImageNet. See [`projects/mvit`](./projects/mvit/README.md) for more information.\n - We now support [PyTorchVideo](https://github.com/facebookresearch/pytorchvideo) models and datasets. See [`projects/pytorchvideo`](./projects/pytorchvideo/README.md) for more information.\n - We now support [X3D Models](https://arxiv.org/abs/2004.04730). See [`projects/x3d`](./projects/x3d/README.md) for more information.\n - We now support [Multigrid Training](https://arxiv.org/abs/1912.00998) for efficiently training video models. 
See [`projects/multigrid`](./projects/multigrid/README.md) for more information.\n - PySlowFast is released in conjunction with our [ICCV 2019 Tutorial](https://alexander-kirillov.github.io/tutorials/visual-recognition-iccv19/).\n\n## License\n\nPySlowFast is released under the [Apache 2.0 license](LICENSE).\n\n## Model Zoo and Baselines\n\nWe provide a large set of baseline results and trained models available for download in the PySlowFast [Model Zoo](MODEL_ZOO.md).\n\n## Installation\n\nPlease find installation instructions for PyTorch and PySlowFast in [INSTALL.md](INSTALL.md). You may follow the instructions in [DATASET.md](slowfast/datasets/DATASET.md) to prepare the datasets.\n\n## Quick Start\n\nFollow the example in [GETTING_STARTED.md](GETTING_STARTED.md) to start working with video models in PySlowFast.\n\n## Visualization Tools\n\nWe offer a range of visualization tools for the train/eval/test processes, for model analysis, and for running inference with trained models.\nMore information is available at [Visualization Tools](VISUALIZATION_TOOLS.md).\n\n## Contributors\nPySlowFast is written and maintained by [Haoqi Fan](https://haoqifan.github.io/), [Yanghao Li](https://lyttonhao.github.io/), [Bo Xiong](https://www.cs.utexas.edu/~bxiong/), [Wan-Yen Lo](https://www.linkedin.com/in/wanyenlo/), [Christoph Feichtenhofer](https://feichtenhofer.github.io/).\n\n## Citing PySlowFast\nIf you find PySlowFast useful in your research, please use the following BibTeX entry for citation.\n```BibTeX\n@misc{fan2020pyslowfast,\n  author =       {Haoqi Fan and Yanghao Li and Bo Xiong and Wan-Yen Lo and\n                  Christoph Feichtenhofer},\n  title =        {PySlowFast},\n  howpublished = {\\url{https://github.com/facebookresearch/slowfast}},\n  year =         {2020}\n}\n```\n"
  },
  {
    "path": "VISUALIZATION_TOOLS.md",
    "content": "# Visualization Tools for PySlowFast\n\nThis document provides a brief intro for running various visualization tools provided with PySlowFast. Before launching any job, make sure you have properly installed the PySlowFast following the instruction in [README.md](README.md) and you have prepared the dataset following [DATASET.md](slowfast/datasets/DATASET.md) with the correct format.\n\n## Tensorboard Support for Train/Eval/Test\n\nWe provide Tensorboard support during the train/eval/test pipeline to assist live monitoring various metrics, and class-level performance\nwith loss/error graphs, confusion matrices and histograms. Enable Tensorboard support by adding the following to your yaml config file:\n\n```\nTENSORBOARD:\n  ENABLE: True\n  LOG_DIR: # Leave empty to use cfg.OUTPUT_DIR/runs-{cfg.TRAIN.DATASET} as path.\n  CLASS_NAMES_PATH: # Path to json file providing class_name - id mapping.\n  CONFUSION_MATRIX:\n    ENABLE: True\n    SUBSET_PATH: # Path to txt file contains class names separated by newline characters.\n                 # Only classes in this file will be visualized in the confusion matrix.\n  HISTOGRAM:\n    ENABLE: True\n    TOP_K: 10   # Top-k most frequently predicted classes for each class in the dataset.\n    SUBSET_PATH: # Path to txt file contains class names separated by newline characters.\n                 # Only classes in this file will be visualized with histograms.\n```\n\nMore details can be found at [defaults.py](slowfast/config/defaults.py)\n\n### Loss & Error Graphs on Tensorboard:\n\n<div align=\"center\">\n  <img src=\"demo/visualization/metrics/loss.png\" width=\"800px\"/>\n</div>\n\n### Confusion matrices:\n\n<div align=\"center\">\n  <img src=\"demo/visualization/metrics/cf_subset.png\" width=\"367px\" style=\"margin:10px;\"/>\n  <img src=\"demo/visualization/metrics/cf_parent.png\" width=\"350px\" style=\"margin:10px;\"/>\n</div>\n\n<div align=\"center\">\n\n</div>\n\nTo enable this mode, 
set:\n```\nTENSORBOARD:\n  ENABLE: True\n  CATEGORIES_PATH: # Path to a json file for categories -> classes mapping\n                   # in the format {\"parent_class\": [\"child_class1\", \"child_class2\",...], ...}.\n  CONFUSION_MATRIX:\n    ENABLE: True\n```\n\n### Histograms of top-k most frequent predictions:\n\n<div align=\"center\">\n  <img src=\"demo/visualization/metrics/hist1.png\" width=\"400px\" style=\"margin:10px;\"/>\n  <img src=\"demo/visualization/metrics/hist2.png\" width=\"406px\" style=\"margin:10px;\"/>\n</div>\n\n## Model Analysis\n\nIn addition, we provide tools to help you understand your trained model(s). More options are listed in [defaults.py](slowfast/config/defaults.py).\n\nAdd the following to your yaml config file:\n```\nTENSORBOARD:\n  ENABLE: True\n  MODEL_VIS:\n    ENABLE: True\n    MODEL_WEIGHTS: # Set to True to visualize model weights.\n    ACTIVATIONS: # Set to True to visualize feature maps.\n    INPUT_VIDEO: # Set to True to visualize the input video(s) for the corresponding feature maps.\n    LAYER_LIST: # List of layer names to visualize weights and activations for.\n    GRAD_CAM:\n      ENABLE: True\n      LAYER_LIST: # List of CNN layers to use for the Grad-CAM visualization method.\n                  # The number of layers must equal the number of pathway(s).\n```\n\n### Weights Visualization on Tensorboard:\n\n<div align=\"center\">\n  <img src=\"demo/visualization/analysis/weights1.png\" width=\"300px\" style=\"margin:10px;\"/>\n  <img src=\"demo/visualization/analysis/weights2.png\" width=\"328px\" style=\"margin:18px;\"/>\n</div>\n\n### Feature Maps & Inputs Visualization:\n\n<div align=\"center\">\n  <img src=\"demo/visualization/analysis/activations.gif\" width=\"800px\"/>\n</div>\n\n### Input Videos Visualization with Grad-CAM:\n\n<div align=\"center\">\n  <img src=\"demo/visualization/analysis/gradcam.gif\" width=\"400px\" style=\"margin:10px;\"/>\n  <img src=\"demo/visualization/analysis/gradcam2.gif\" 
width=\"400px\" style=\"margin:10px;\"/>\n</div>\n\n## Run the Demo on Videos/Camera\n\nTo run inference with PySlowFast model(s) on in-the-wild video(s), add the following to your yaml config file:\n\n```\nDEMO:\n  ENABLE: True\n  LABEL_FILE_PATH: # Path to json file providing class_name - id mapping.\n  INPUT_VIDEO: # Path to input video file.\n  OUTPUT_FILE: # Path to output video file to write results to.\n               # Leave as an empty string to display results in a window.\n  THREAD_ENABLE: # Run video reader/writer in the background with multi-threading.\n  NUM_VIS_INSTANCES: # Number of CPU(s)/processes used to run the video visualizer.\n  NUM_CLIPS_SKIP: # Number of clips to skip for prediction/visualization\n                  # (mostly to smooth/improve display quality with webcam input).\n```\n\nTo use a webcam as input, set `DEMO.WEBCAM` to the index of the webcam instead of setting `DEMO.INPUT_VIDEO`. More options are listed in [defaults.py](slowfast/config/defaults.py).\n\n### Action Recognition Demo:\n<div align=\"center\">\n  <img src=\"demo/visualization/demo_gifs/recognition.gif\" width=\"600px\"/>\n</div>\n\n### Action Detection Demo:\n\n<div align=\"center\">\n  <img src=\"demo/visualization/demo_gifs/detection.gif\" width=\"600px\" style=\"margin:10px;\"/>\n</div>\n\n### Demo with AVA video(s):\nWe also offer an option to use trained models to create and visualize prediction results and ground-truth labels on AVA-format videos and metadata. 
An example config is:\n\n```\nDEMO:\n  ENABLE: True\n  OUTPUT_FILE: yourPath/output.mp4\n  LABEL_FILE_PATH: yourPath/ava_classnames.json\n  INPUT_VIDEO: yourPath/frames/HVAmkvLrthQ  # Path to a video file or image folder\n  PREDS_BOXES: yourPath/ava_detection_train_boxes_and_labels_include_negative.csv # Path to pre-computed bounding boxes in AVA format.\n  GT_BOXES: yourPath/ava_train_v2.2.csv # Path to ground-truth boxes and labels in AVA format (optional).\n```\n\n<div align=\"center\">\n  <img src=\"demo/visualization/demo_gifs/ava_demo2.gif\" width=\"600px\" style=\"margin:10px;\"/>\n</div>\n\n\n### Run command\n```\npython tools/run_net.py --cfg path/to/<pretrained_model_config_file>.yaml\n```\n### Download class name files\n- [AVA class names json file](https://dl.fbaipublicfiles.com/pyslowfast/dataset/class_names/ava_classids.json)\n- [Kinetics class names json file](https://dl.fbaipublicfiles.com/pyslowfast/dataset/class_names/kinetics_classnames.json)\n- [Kinetics parent-child class mapping](https://dl.fbaipublicfiles.com/pyslowfast/dataset/class_names/parents.json)\n"
  },
  {
    "path": "ava_evaluation/README.md",
    "content": "The code under this folder is from the official [ActivityNet repo](https://github.com/activitynet/ActivityNet).\n"
  },
  {
    "path": "ava_evaluation/ava_action_list_v2.1_for_activitynet_2018.pbtxt.txt",
    "content": "item {\n  name: \"bend/bow (at the waist)\"\n  id: 1\n}\nitem {\n  name: \"crouch/kneel\"\n  id: 3\n}\nitem {\n  name: \"dance\"\n  id: 4\n}\nitem {\n  name: \"fall down\"\n  id: 5\n}\nitem {\n  name: \"get up\"\n  id: 6\n}\nitem {\n  name: \"jump/leap\"\n  id: 7\n}\nitem {\n  name: \"lie/sleep\"\n  id: 8\n}\nitem {\n  name: \"martial art\"\n  id: 9\n}\nitem {\n  name: \"run/jog\"\n  id: 10\n}\nitem {\n  name: \"sit\"\n  id: 11\n}\nitem {\n  name: \"stand\"\n  id: 12\n}\nitem {\n  name: \"swim\"\n  id: 13\n}\nitem {\n  name: \"walk\"\n  id: 14\n}\nitem {\n  name: \"answer phone\"\n  id: 15\n}\nitem {\n  name: \"carry/hold (an object)\"\n  id: 17\n}\nitem {\n  name: \"climb (e.g., a mountain)\"\n  id: 20\n}\nitem {\n  name: \"close (e.g., a door, a box)\"\n  id: 22\n}\nitem {\n  name: \"cut\"\n  id: 24\n}\nitem {\n  name: \"dress/put on clothing\"\n  id: 26\n}\nitem {\n  name: \"drink\"\n  id: 27\n}\nitem {\n  name: \"drive (e.g., a car, a truck)\"\n  id: 28\n}\nitem {\n  name: \"eat\"\n  id: 29\n}\nitem {\n  name: \"enter\"\n  id: 30\n}\nitem {\n  name: \"hit (an object)\"\n  id: 34\n}\nitem {\n  name: \"lift/pick up\"\n  id: 36\n}\nitem {\n  name: \"listen (e.g., to music)\"\n  id: 37\n}\nitem {\n  name: \"open (e.g., a window, a car door)\"\n  id: 38\n}\nitem {\n  name: \"play musical instrument\"\n  id: 41\n}\nitem {\n  name: \"point to (an object)\"\n  id: 43\n}\nitem {\n  name: \"pull (an object)\"\n  id: 45\n}\nitem {\n  name: \"push (an object)\"\n  id: 46\n}\nitem {\n  name: \"put down\"\n  id: 47\n}\nitem {\n  name: \"read\"\n  id: 48\n}\nitem {\n  name: \"ride (e.g., a bike, a car, a horse)\"\n  id: 49\n}\nitem {\n  name: \"sail boat\"\n  id: 51\n}\nitem {\n  name: \"shoot\"\n  id: 52\n}\nitem {\n  name: \"smoke\"\n  id: 54\n}\nitem {\n  name: \"take a photo\"\n  id: 56\n}\nitem {\n  name: \"text on/look at a cellphone\"\n  id: 57\n}\nitem {\n  name: \"throw\"\n  id: 58\n}\nitem {\n  name: \"touch (an object)\"\n  id: 59\n}\nitem {\n  
name: \"turn (e.g., a screwdriver)\"\n  id: 60\n}\nitem {\n  name: \"watch (e.g., TV)\"\n  id: 61\n}\nitem {\n  name: \"work on a computer\"\n  id: 62\n}\nitem {\n  name: \"write\"\n  id: 63\n}\nitem {\n  name: \"fight/hit (a person)\"\n  id: 64\n}\nitem {\n  name: \"give/serve (an object) to (a person)\"\n  id: 65\n}\nitem {\n  name: \"grab (a person)\"\n  id: 66\n}\nitem {\n  name: \"hand clap\"\n  id: 67\n}\nitem {\n  name: \"hand shake\"\n  id: 68\n}\nitem {\n  name: \"hand wave\"\n  id: 69\n}\nitem {\n  name: \"hug (a person)\"\n  id: 70\n}\nitem {\n  name: \"kiss (a person)\"\n  id: 72\n}\nitem {\n  name: \"lift (a person)\"\n  id: 73\n}\nitem {\n  name: \"listen to (a person)\"\n  id: 74\n}\nitem {\n  name: \"push (another person)\"\n  id: 76\n}\nitem {\n  name: \"sing to (e.g., self, a person, a group)\"\n  id: 77\n}\nitem {\n  name: \"take (an object) from (a person)\"\n  id: 78\n}\nitem {\n  name: \"talk to (e.g., self, a person, a group)\"\n  id: 79\n}\nitem {\n  name: \"watch (a person)\"\n  id: 80\n}\n"
  },
  {
    "path": "ava_evaluation/label_map_util.py",
    "content": "# Copyright 2017 The TensorFlow Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\"\"\"Label map utility functions.\"\"\"\n\nfrom __future__ import absolute_import, division, print_function, unicode_literals\n\nimport logging\n\n# from google.protobuf import text_format\n# from google3.third_party.tensorflow_models.object_detection.protos import string_int_label_map_pb2\n\n\ndef _validate_label_map(label_map):\n    \"\"\"Checks if a label map is valid.\n\n    Args:\n      label_map: StringIntLabelMap to validate.\n\n    Raises:\n      ValueError: if label map is invalid.\n    \"\"\"\n    for item in label_map.item:\n        if item.id < 1:\n            raise ValueError(\"Label map ids should be >= 1.\")\n\n\ndef create_category_index(categories):\n    \"\"\"Creates dictionary of COCO compatible categories keyed by category id.\n\n    Args:\n      categories: a list of dicts, each of which has the following keys:\n        'id': (required) an integer id uniquely identifying this category.\n        'name': (required) string representing category name\n          e.g., 'cat', 'dog', 'pizza'.\n\n    Returns:\n      category_index: a dict containing the same entries as categories, but keyed\n        by the 'id' field of each category.\n    \"\"\"\n    category_index = {}\n    for cat in categories:\n        category_index[cat[\"id\"]] = cat\n 
   return category_index\n\n\ndef get_max_label_map_index(label_map):\n    \"\"\"Get maximum index in label map.\n\n    Args:\n      label_map: a StringIntLabelMapProto\n\n    Returns:\n      an integer\n    \"\"\"\n    return max([item.id for item in label_map.item])\n\n\ndef convert_label_map_to_categories(label_map, max_num_classes, use_display_name=True):\n    \"\"\"Loads label map proto and returns categories list compatible with eval.\n\n    This function loads a label map and returns a list of dicts, each of which\n    has the following keys:\n      'id': (required) an integer id uniquely identifying this category.\n      'name': (required) string representing category name\n        e.g., 'cat', 'dog', 'pizza'.\n    We only allow class into the list if its id-label_id_offset is\n    between 0 (inclusive) and max_num_classes (exclusive).\n    If there are several items mapping to the same id in the label map,\n    we will only keep the first one in the categories list.\n\n    Args:\n      label_map: a StringIntLabelMapProto or None.  If None, a default categories\n        list is created with max_num_classes categories.\n      max_num_classes: maximum number of (consecutive) label indices to include.\n      use_display_name: (boolean) choose whether to load 'display_name' field\n        as category name.  
If False or if the display_name field does not exist,\n        uses 'name' field as category names instead.\n    Returns:\n      categories: a list of dictionaries representing all possible categories.\n    \"\"\"\n    categories = []\n    list_of_ids_already_added = []\n    if not label_map:\n        label_id_offset = 1\n        for class_id in range(max_num_classes):\n            categories.append(\n                {\n                    \"id\": class_id + label_id_offset,\n                    \"name\": \"category_{}\".format(class_id + label_id_offset),\n                }\n            )\n        return categories\n    for item in label_map.item:\n        if not 0 < item.id <= max_num_classes:\n            logging.info(\n                \"Ignore item %d since it falls outside of requested label range.\",\n                item.id,\n            )\n            continue\n        if use_display_name and item.HasField(\"display_name\"):\n            name = item.display_name\n        else:\n            name = item.name\n        if item.id not in list_of_ids_already_added:\n            list_of_ids_already_added.append(item.id)\n            categories.append({\"id\": item.id, \"name\": name})\n    return categories\n\n\ndef load_labelmap(path):\n    \"\"\"Loads label map proto.\n\n    Args:\n      path: path to StringIntLabelMap proto text file.\n    Returns:\n      a StringIntLabelMapProto\n    \"\"\"\n    with open(path, \"r\") as fid:\n        label_map_string = fid.read()\n        label_map = string_int_label_map_pb2.StringIntLabelMap()  # noqa\n        try:\n            text_format.Merge(label_map_string, label_map)  # noqa\n        except text_format.ParseError:  # noqa\n            label_map.ParseFromString(label_map_string)\n    _validate_label_map(label_map)\n    return label_map\n\n\ndef get_label_map_dict(label_map_path, use_display_name=False):\n    \"\"\"Reads a label map and returns a dictionary of label names to id.\n\n    Args:\n      label_map_path: path 
to label_map.\n      use_display_name: whether to use the label map items' display names as keys.\n\n    Returns:\n      A dictionary mapping label names to id.\n    \"\"\"\n    label_map = load_labelmap(label_map_path)\n    label_map_dict = {}\n    for item in label_map.item:\n        if use_display_name:\n            label_map_dict[item.display_name] = item.id\n        else:\n            label_map_dict[item.name] = item.id\n    return label_map_dict\n\n\ndef create_category_index_from_labelmap(label_map_path):\n    \"\"\"Reads a label map and returns a category index.\n\n    Args:\n      label_map_path: Path to `StringIntLabelMap` proto text file.\n\n    Returns:\n      A category index, which is a dictionary that maps integer ids to dicts\n      containing categories, e.g.\n      {1: {'id': 1, 'name': 'dog'}, 2: {'id': 2, 'name': 'cat'}, ...}\n    \"\"\"\n    label_map = load_labelmap(label_map_path)\n    max_num_classes = max(item.id for item in label_map.item)\n    categories = convert_label_map_to_categories(label_map, max_num_classes)\n    return create_category_index(categories)\n\n\ndef create_class_agnostic_category_index():\n    \"\"\"Creates a category index with a single `object` class.\"\"\"\n    return {1: {\"id\": 1, \"name\": \"object\"}}\n"
  },
  {
    "path": "ava_evaluation/metrics.py",
    "content": "# Copyright 2017 The TensorFlow Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Functions for computing metrics like precision, recall, CorLoc and etc.\"\"\"\n\nfrom __future__ import division\n\nimport numpy as np\n\n\ndef compute_precision_recall(scores, labels, num_gt):\n    \"\"\"Compute precision and recall.\n\n    Args:\n      scores: A float numpy array representing detection score\n      labels: A boolean numpy array representing true/false positive labels\n      num_gt: Number of ground truth instances\n\n    Raises:\n      ValueError: if the input is not of the correct format\n\n    Returns:\n      precision: Fraction of positive instances over detected ones. 
This value is\n        None if no ground truth labels are present.\n      recall: Fraction of detected positive instances over all positive instances.\n        This value is None if no ground truth labels are present.\n\n    \"\"\"\n    if (\n        not isinstance(labels, np.ndarray)\n        or labels.dtype != bool\n        or len(labels.shape) != 1\n    ):\n        raise ValueError(\"labels must be single dimension bool numpy array\")\n\n    if not isinstance(scores, np.ndarray) or len(scores.shape) != 1:\n        raise ValueError(\"scores must be single dimension numpy array\")\n\n    if num_gt < np.sum(labels):\n        raise ValueError(\"Number of true positives must not exceed num_gt.\")\n\n    if len(scores) != len(labels):\n        raise ValueError(\"scores and labels must be of the same size.\")\n\n    if num_gt == 0:\n        return None, None\n\n    sorted_indices = np.argsort(scores)\n    sorted_indices = sorted_indices[::-1]\n    labels = labels.astype(int)\n    true_positive_labels = labels[sorted_indices]\n    false_positive_labels = 1 - true_positive_labels\n    cum_true_positives = np.cumsum(true_positive_labels)\n    cum_false_positives = np.cumsum(false_positive_labels)\n    precision = cum_true_positives.astype(float) / (\n        cum_true_positives + cum_false_positives\n    )\n    recall = cum_true_positives.astype(float) / num_gt\n    return precision, recall\n\n\ndef compute_average_precision(precision, recall):\n    \"\"\"Compute Average Precision according to the definition in VOCdevkit.\n\n    Precision is modified to ensure that it does not decrease as recall\n    decreases.\n\n    Args:\n      precision: A float [N, 1] numpy array of precisions\n      recall: A float [N, 1] numpy array of recalls\n\n    Raises:\n      ValueError: if the input is not of the correct format\n\n    Returns:\n      average_precision: The area under the precision recall curve. 
NaN if\n        precision and recall are None.\n\n    \"\"\"\n    if precision is None:\n        if recall is not None:\n            raise ValueError(\"If precision is None, recall must also be None\")\n        return np.nan\n\n    if not isinstance(precision, np.ndarray) or not isinstance(recall, np.ndarray):\n        raise ValueError(\"precision and recall must be numpy arrays\")\n    if precision.dtype != float or recall.dtype != float:\n        raise ValueError(\"input must be float numpy array.\")\n    if len(precision) != len(recall):\n        raise ValueError(\"precision and recall must be of the same size.\")\n    if not precision.size:\n        return 0.0\n    if np.amin(precision) < 0 or np.amax(precision) > 1:\n        raise ValueError(\"Precision must be in the range of [0, 1].\")\n    if np.amin(recall) < 0 or np.amax(recall) > 1:\n        raise ValueError(\"recall must be in the range of [0, 1].\")\n    if not all(recall[i] <= recall[i + 1] for i in range(len(recall) - 1)):\n        raise ValueError(\"recall must be a non-decreasing array\")\n\n    recall = np.concatenate([[0], recall, [1]])\n    precision = np.concatenate([[0], precision, [0]])\n\n    # Sweep from the end so that precision is non-increasing as recall increases.\n    for i in range(len(precision) - 2, -1, -1):\n        precision[i] = np.maximum(precision[i], precision[i + 1])\n\n    indices = np.where(recall[1:] != recall[:-1])[0] + 1\n    average_precision = np.sum(\n        (recall[indices] - recall[indices - 1]) * precision[indices]\n    )\n    return average_precision\n\n\ndef compute_cor_loc(num_gt_imgs_per_class, num_images_correctly_detected_per_class):\n    \"\"\"Compute CorLoc according to the definition in the following paper.\n\n    https://www.robots.ox.ac.uk/~vgg/rg/papers/deselaers-eccv10.pdf\n\n    Returns nans if there are no ground truth images for a class.\n\n    Args:\n      num_gt_imgs_per_class: 1D array, representing number of images containing\n          at least one object instance of a 
particular class\n      num_images_correctly_detected_per_class: 1D array, representing number of\n          images in which at least one object instance of a particular\n          class is correctly detected\n\n    Returns:\n      corloc_per_class: A float numpy array representing the CorLoc score of\n        each class\n    \"\"\"\n    # Divide by zero expected for classes with no gt examples.\n    with np.errstate(divide=\"ignore\", invalid=\"ignore\"):\n        return np.where(\n            num_gt_imgs_per_class == 0,\n            np.nan,\n            num_images_correctly_detected_per_class / num_gt_imgs_per_class,\n        )\n"
  },
  {
    "path": "ava_evaluation/np_box_list.py",
    "content": "# Copyright 2017 The TensorFlow Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Numpy BoxList classes and functions.\"\"\"\n\nfrom __future__ import absolute_import, division, print_function, unicode_literals\n\nimport numpy as np\n\n\nclass BoxList:\n    \"\"\"Box collection.\n\n    BoxList represents a list of bounding boxes as numpy array, where each\n    bounding box is represented as a row of 4 numbers,\n    [y_min, x_min, y_max, x_max].  
It is assumed that all bounding boxes within a\n    given list correspond to a single image.\n\n    Optionally, users can add additional related fields (such as\n    objectness/classification scores).\n    \"\"\"\n\n    def __init__(self, data):\n        \"\"\"Constructs box collection.\n\n        Args:\n          data: a numpy array of shape [N, 4] representing box coordinates\n\n        Raises:\n          ValueError: if bbox data is not a numpy array\n          ValueError: if invalid dimensions for bbox data\n          ValueError: if bbox data is not float32 or float64\n          ValueError: if any box has y_min > y_max or x_min > x_max\n        \"\"\"\n        if not isinstance(data, np.ndarray):\n            raise ValueError(\"data must be a numpy array.\")\n        if len(data.shape) != 2 or data.shape[1] != 4:\n            raise ValueError(\"Invalid dimensions for box data.\")\n        if data.dtype != np.float32 and data.dtype != np.float64:\n            raise ValueError(\"Invalid data type for box data: float is required.\")\n        if not self._is_valid_boxes(data):\n            raise ValueError(\n                \"Invalid box data. data must be a numpy array of \"\n                \"N*[y_min, x_min, y_max, x_max]\"\n            )\n        self.data = {\"boxes\": data}\n\n    def num_boxes(self):\n        \"\"\"Return number of boxes held in collections.\"\"\"\n        return self.data[\"boxes\"].shape[0]\n\n    def get_extra_fields(self):\n        \"\"\"Return all non-box fields.\"\"\"\n        return [k for k in self.data.keys() if k != \"boxes\"]\n\n    def has_field(self, field):\n        return field in self.data\n\n    def add_field(self, field, field_data):\n        \"\"\"Add data to a specified field.\n\n        Args:\n          field: a string parameter used to specify a related field to be accessed.\n          field_data: a numpy array of [N, ...] 
representing the data associated\n              with the field.\n        Raises:\n          ValueError: if the field already exists or the dimension of the field\n              data does not match the number of boxes.\n        \"\"\"\n        if self.has_field(field):\n            raise ValueError(\"Field \" + field + \" already exists\")\n        if len(field_data.shape) < 1 or field_data.shape[0] != self.num_boxes():\n            raise ValueError(\"Invalid dimensions for field data\")\n        self.data[field] = field_data\n\n    def get(self):\n        \"\"\"Convenience function for accessing box coordinates.\n\n        Returns:\n          a numpy array of shape [N, 4] representing box corners\n        \"\"\"\n        return self.get_field(\"boxes\")\n\n    def get_field(self, field):\n        \"\"\"Accesses data associated with the specified field in the box collection.\n\n        Args:\n          field: a string parameter used to specify a related field to be accessed.\n\n        Returns:\n          a numpy array representing the data of the associated field\n\n        Raises:\n          ValueError: if invalid field\n        \"\"\"\n        if not self.has_field(field):\n            raise ValueError(\"field {} does not exist\".format(field))\n        return self.data[field]\n\n    def get_coordinates(self):\n        \"\"\"Get corner coordinates of boxes.\n\n        Returns:\n         a list of 4 1-d numpy arrays [y_min, x_min, y_max, x_max]\n        \"\"\"\n        box_coordinates = self.get()\n        y_min = box_coordinates[:, 0]\n        x_min = box_coordinates[:, 1]\n        y_max = box_coordinates[:, 2]\n        x_max = box_coordinates[:, 3]\n        return [y_min, x_min, y_max, x_max]\n\n    def _is_valid_boxes(self, data):\n        \"\"\"Check whether data fulfills the format of N*[ymin, xmin, ymax, xmax].\n\n        Args:\n          data: a numpy array of shape [N, 4] representing box coordinates\n\n        Returns:\n          a boolean indicating 
whether all ymax of boxes are equal or greater than\n              ymin, and all xmax of boxes are equal or greater than xmin.\n        \"\"\"\n        if data.shape[0] > 0:\n            for i in range(data.shape[0]):\n                if data[i, 0] > data[i, 2] or data[i, 1] > data[i, 3]:\n                    return False\n        return True\n"
  },
  {
    "path": "ava_evaluation/np_box_list_ops.py",
    "content": "# Copyright 2017 The TensorFlow Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Bounding Box List operations for Numpy BoxLists.\n\nExample box operations that are supported:\n  * Areas: compute bounding box areas\n  * IOU: pairwise intersection-over-union scores\n\"\"\"\n\nfrom __future__ import absolute_import, division, print_function, unicode_literals\n\nimport numpy as np\n\nfrom . 
import np_box_list, np_box_ops\n\n\nclass SortOrder:\n    \"\"\"Enum class for sort order.\n\n    Attributes:\n      ascend: ascend order.\n      descend: descend order.\n    \"\"\"\n\n    ASCEND = 1\n    DESCEND = 2\n\n\ndef area(boxlist):\n    \"\"\"Computes area of boxes.\n\n    Args:\n      boxlist: BoxList holding N boxes\n\n    Returns:\n      a numpy array with shape [N*1] representing box areas\n    \"\"\"\n    y_min, x_min, y_max, x_max = boxlist.get_coordinates()\n    return (y_max - y_min) * (x_max - x_min)\n\n\ndef intersection(boxlist1, boxlist2):\n    \"\"\"Compute pairwise intersection areas between boxes.\n\n    Args:\n      boxlist1: BoxList holding N boxes\n      boxlist2: BoxList holding M boxes\n\n    Returns:\n      a numpy array with shape [N*M] representing pairwise intersection area\n    \"\"\"\n    return np_box_ops.intersection(boxlist1.get(), boxlist2.get())\n\n\ndef iou(boxlist1, boxlist2):\n    \"\"\"Computes pairwise intersection-over-union between box collections.\n\n    Args:\n      boxlist1: BoxList holding N boxes\n      boxlist2: BoxList holding M boxes\n\n    Returns:\n      a numpy array with shape [N, M] representing pairwise iou scores.\n    \"\"\"\n    return np_box_ops.iou(boxlist1.get(), boxlist2.get())\n\n\ndef ioa(boxlist1, boxlist2):\n    \"\"\"Computes pairwise intersection-over-area between box collections.\n\n    Intersection-over-area (ioa) between two boxes box1 and box2 is defined as\n    their intersection area over box2's area. 
Note that ioa is not symmetric,\n    that is, IOA(box1, box2) != IOA(box2, box1).\n\n    Args:\n      boxlist1: BoxList holding N boxes\n      boxlist2: BoxList holding M boxes\n\n    Returns:\n      a numpy array with shape [N, M] representing pairwise ioa scores.\n    \"\"\"\n    return np_box_ops.ioa(boxlist1.get(), boxlist2.get())\n\n\ndef gather(boxlist, indices, fields=None):\n    \"\"\"Gather boxes from BoxList according to indices and return new BoxList.\n\n    By default, gather returns boxes corresponding to the input index list, as\n    well as all additional fields stored in the boxlist (indexing into the\n    first dimension).  However one can optionally only gather from a\n    subset of fields.\n\n    Args:\n      boxlist: BoxList holding N boxes\n      indices: a 1-d numpy array of type int_\n      fields: (optional) list of fields to also gather from.  If None (default),\n          all fields are gathered from.  Pass an empty fields list to only gather\n          the box coordinates.\n\n    Returns:\n      subboxlist: a BoxList corresponding to the subset of the input BoxList\n          specified by indices\n\n    Raises:\n      ValueError: if specified field is not contained in boxlist or if the\n          indices are not of type int_\n    \"\"\"\n    if indices.size:\n        if np.amax(indices) >= boxlist.num_boxes() or np.amin(indices) < 0:\n            raise ValueError(\"indices are out of valid range.\")\n    subboxlist = np_box_list.BoxList(boxlist.get()[indices, :])\n    if fields is None:\n        fields = boxlist.get_extra_fields()\n    for field in fields:\n        extra_field_data = boxlist.get_field(field)\n        subboxlist.add_field(field, extra_field_data[indices, ...])\n    return subboxlist\n\n\ndef sort_by_field(boxlist, field, order=SortOrder.DESCEND):\n    \"\"\"Sort boxes and associated fields according to a scalar field.\n\n    A common use case is reordering the boxes according to descending scores.\n\n    Args:\n      
boxlist: BoxList holding N boxes.\n      field: A BoxList field for sorting and reordering the BoxList.\n      order: (Optional) SortOrder.DESCEND or SortOrder.ASCEND. Default is\n        SortOrder.DESCEND.\n\n    Returns:\n      sorted_boxlist: A sorted BoxList with the field in the specified order.\n\n    Raises:\n      ValueError: if specified field does not exist or is not of single dimension.\n      ValueError: if the order is neither descend nor ascend.\n    \"\"\"\n    if not boxlist.has_field(field):\n        raise ValueError(\"Field \" + field + \" does not exist\")\n    if len(boxlist.get_field(field).shape) != 1:\n        raise ValueError(\"Field \" + field + \" should be single dimension.\")\n    if order != SortOrder.DESCEND and order != SortOrder.ASCEND:\n        raise ValueError(\"Invalid sort order\")\n\n    field_to_sort = boxlist.get_field(field)\n    sorted_indices = np.argsort(field_to_sort)\n    if order == SortOrder.DESCEND:\n        sorted_indices = sorted_indices[::-1]\n    return gather(boxlist, sorted_indices)\n\n\ndef non_max_suppression(\n    boxlist, max_output_size=10000, iou_threshold=1.0, score_threshold=-10.0\n):\n    \"\"\"Non maximum suppression.\n\n    This op greedily selects a subset of detection bounding boxes, pruning\n    away boxes that have high IOU (intersection over union) overlap (> thresh)\n    with already selected boxes. In each iteration, the detected bounding box with\n    highest score in the available pool is selected.\n\n    Args:\n      boxlist: BoxList holding N boxes.  Must contain a 'scores' field\n        representing detection scores. All scores belong to the same class.\n      max_output_size: maximum number of retained boxes\n      iou_threshold: intersection over union threshold.\n      score_threshold: minimum score threshold. Remove the boxes with scores\n                       less than this value. Default value is set to -10. 
A very\n                       low threshold to pass pretty much all the boxes, unless\n                       the user sets a different score threshold.\n\n    Returns:\n      a BoxList holding M boxes where M <= max_output_size\n    Raises:\n      ValueError: if 'scores' field does not exist\n      ValueError: if threshold is not in [0, 1]\n      ValueError: if max_output_size < 0\n    \"\"\"\n    if not boxlist.has_field(\"scores\"):\n        raise ValueError(\"Field scores does not exist\")\n    if iou_threshold < 0.0 or iou_threshold > 1.0:\n        raise ValueError(\"IOU threshold must be in [0, 1]\")\n    if max_output_size < 0:\n        raise ValueError(\"max_output_size must be bigger than 0.\")\n\n    boxlist = filter_scores_greater_than(boxlist, score_threshold)\n    if boxlist.num_boxes() == 0:\n        return boxlist\n\n    boxlist = sort_by_field(boxlist, \"scores\")\n\n    # Prevent further computation if NMS is disabled.\n    if iou_threshold == 1.0:\n        if boxlist.num_boxes() > max_output_size:\n            selected_indices = np.arange(max_output_size)\n            return gather(boxlist, selected_indices)\n        else:\n            return boxlist\n\n    boxes = boxlist.get()\n    num_boxes = boxlist.num_boxes()\n    # is_index_valid is True only for all remaining valid boxes,\n    is_index_valid = np.full(num_boxes, 1, dtype=bool)\n    selected_indices = []\n    num_output = 0\n    for i in range(num_boxes):\n        if num_output < max_output_size:\n            if is_index_valid[i]:\n                num_output += 1\n                selected_indices.append(i)\n                is_index_valid[i] = False\n                valid_indices = np.where(is_index_valid)[0]\n                if valid_indices.size == 0:\n                    break\n\n                intersect_over_union = np_box_ops.iou(\n                    np.expand_dims(boxes[i, :], axis=0), boxes[valid_indices, :]\n                )\n                intersect_over_union = 
np.squeeze(intersect_over_union, axis=0)\n                is_index_valid[valid_indices] = np.logical_and(\n                    is_index_valid[valid_indices],\n                    intersect_over_union <= iou_threshold,\n                )\n    return gather(boxlist, np.array(selected_indices))\n\n\ndef multi_class_non_max_suppression(boxlist, score_thresh, iou_thresh, max_output_size):\n    \"\"\"Multi-class version of non maximum suppression.\n\n    This op greedily selects a subset of detection bounding boxes, pruning\n    away boxes that have high IOU (intersection over union) overlap (> thresh)\n    with already selected boxes.  It operates independently for each class for\n    which scores are provided (via the scores field of the input box_list),\n    pruning boxes with score less than a provided threshold prior to\n    applying NMS.\n\n    Args:\n      boxlist: BoxList holding N boxes.  Must contain a 'scores' field\n        representing detection scores.  This scores field is a tensor that can\n        be 1-dimensional (in the case of a single class) or 2-dimensional, in\n        which case we assume that it takes the shape [num_boxes, num_classes].\n        We further assume that this rank is known statically and that\n        scores.shape[1] is also known (i.e., the number of classes is fixed\n        and known at graph construction time).\n      score_thresh: scalar threshold for score (low scoring boxes are removed).\n      iou_thresh: scalar threshold for IOU (boxes that have high IOU overlap\n        with previously selected boxes are removed).\n      max_output_size: maximum number of retained boxes per class.\n\n    Returns:\n      a BoxList holding M boxes with a rank-1 scores field representing\n        corresponding scores for each box with scores sorted in decreasing order\n        and a rank-1 classes field representing a class label for each box.\n    Raises:\n      ValueError: if iou_thresh is not in [0, 1] or if input boxlist does not 
have\n        a valid scores field.\n    \"\"\"\n    if not 0 <= iou_thresh <= 1.0:\n        raise ValueError(\"thresh must be between 0 and 1\")\n    if not isinstance(boxlist, np_box_list.BoxList):\n        raise ValueError(\"boxlist must be a BoxList\")\n    if not boxlist.has_field(\"scores\"):\n        raise ValueError(\"input boxlist must have 'scores' field\")\n    scores = boxlist.get_field(\"scores\")\n    if len(scores.shape) == 1:\n        scores = np.reshape(scores, [-1, 1])\n    elif len(scores.shape) == 2:\n        if scores.shape[1] is None:\n            raise ValueError(\n                \"scores field must have statically defined second dimension\"\n            )\n    else:\n        raise ValueError(\"scores field must be of rank 1 or 2\")\n    num_boxes = boxlist.num_boxes()\n    num_scores = scores.shape[0]\n    num_classes = scores.shape[1]\n\n    if num_boxes != num_scores:\n        raise ValueError(\"Incorrect scores field length: actual vs expected.\")\n\n    selected_boxes_list = []\n    for class_idx in range(num_classes):\n        boxlist_and_class_scores = np_box_list.BoxList(boxlist.get())\n        class_scores = np.reshape(scores[0:num_scores, class_idx], [-1])\n        boxlist_and_class_scores.add_field(\"scores\", class_scores)\n        boxlist_filt = filter_scores_greater_than(\n            boxlist_and_class_scores, score_thresh\n        )\n        nms_result = non_max_suppression(\n            boxlist_filt,\n            max_output_size=max_output_size,\n            iou_threshold=iou_thresh,\n            score_threshold=score_thresh,\n        )\n        nms_result.add_field(\n            \"classes\", np.zeros_like(nms_result.get_field(\"scores\")) + class_idx\n        )\n        selected_boxes_list.append(nms_result)\n    selected_boxes = concatenate(selected_boxes_list)\n    sorted_boxes = sort_by_field(selected_boxes, \"scores\")\n    return sorted_boxes\n\n\ndef scale(boxlist, y_scale, x_scale):\n    \"\"\"Scale box coordinates in 
x and y dimensions.\n\n    Args:\n      boxlist: BoxList holding N boxes\n      y_scale: float\n      x_scale: float\n\n    Returns:\n      boxlist: BoxList holding N boxes\n    \"\"\"\n    y_min, x_min, y_max, x_max = np.array_split(boxlist.get(), 4, axis=1)\n    y_min = y_scale * y_min\n    y_max = y_scale * y_max\n    x_min = x_scale * x_min\n    x_max = x_scale * x_max\n    scaled_boxlist = np_box_list.BoxList(np.hstack([y_min, x_min, y_max, x_max]))\n\n    fields = boxlist.get_extra_fields()\n    for field in fields:\n        extra_field_data = boxlist.get_field(field)\n        scaled_boxlist.add_field(field, extra_field_data)\n\n    return scaled_boxlist\n\n\ndef clip_to_window(boxlist, window):\n    \"\"\"Clip bounding boxes to a window.\n\n    This op clips input bounding boxes (represented by bounding box\n    corners) to a window, optionally filtering out boxes that do not\n    overlap at all with the window.\n\n    Args:\n      boxlist: BoxList holding M_in boxes\n      window: a numpy array of shape [4] representing the\n              [y_min, x_min, y_max, x_max] window to which the op\n              should clip boxes.\n\n    Returns:\n      a BoxList holding M_out boxes where M_out <= M_in\n    \"\"\"\n    y_min, x_min, y_max, x_max = np.array_split(boxlist.get(), 4, axis=1)\n    win_y_min = window[0]\n    win_x_min = window[1]\n    win_y_max = window[2]\n    win_x_max = window[3]\n    y_min_clipped = np.fmax(np.fmin(y_min, win_y_max), win_y_min)\n    y_max_clipped = np.fmax(np.fmin(y_max, win_y_max), win_y_min)\n    x_min_clipped = np.fmax(np.fmin(x_min, win_x_max), win_x_min)\n    x_max_clipped = np.fmax(np.fmin(x_max, win_x_max), win_x_min)\n    clipped = np_box_list.BoxList(\n        np.hstack([y_min_clipped, x_min_clipped, y_max_clipped, x_max_clipped])\n    )\n    clipped = _copy_extra_fields(clipped, boxlist)\n    areas = area(clipped)\n    nonzero_area_indices = np.reshape(np.nonzero(np.greater(areas, 0.0)), [-1]).astype(\n        np.int32\n    
)\n    return gather(clipped, nonzero_area_indices)\n\n\ndef prune_non_overlapping_boxes(boxlist1, boxlist2, minoverlap=0.0):\n    \"\"\"Prunes the boxes in boxlist1 that overlap less than thresh with boxlist2.\n\n    For each box in boxlist1, we want its IOA to be more than minoverlap with\n    at least one of the boxes in boxlist2. If it does not, we remove it.\n\n    Args:\n      boxlist1: BoxList holding N boxes.\n      boxlist2: BoxList holding M boxes.\n      minoverlap: Minimum required overlap between boxes, to count them as\n                  overlapping.\n\n    Returns:\n      A pruned boxlist with size [N', 4].\n    \"\"\"\n    intersection_over_area = ioa(boxlist2, boxlist1)  # [M, N] tensor\n    intersection_over_area = np.amax(intersection_over_area, axis=0)  # [N] tensor\n    keep_bool = np.greater_equal(intersection_over_area, np.array(minoverlap))\n    keep_inds = np.nonzero(keep_bool)[0]\n    new_boxlist1 = gather(boxlist1, keep_inds)\n    return new_boxlist1\n\n\ndef prune_outside_window(boxlist, window):\n    \"\"\"Prunes bounding boxes that fall outside a given window.\n\n    This function prunes bounding boxes that even partially fall outside the given\n    window. 
See also clip_to_window, which only prunes bounding boxes that fall\n    completely outside the window, and clips any bounding boxes that partially\n    overflow.\n\n    Args:\n      boxlist: a BoxList holding M_in boxes.\n      window: a numpy array of size 4, representing [ymin, xmin, ymax, xmax]\n              of the window.\n\n    Returns:\n      pruned_boxlist: a BoxList holding M_out boxes where M_out <= M_in.\n      valid_indices: a numpy array with shape [M_out] indexing the valid\n        bounding boxes in the input boxlist.\n    \"\"\"\n\n    y_min, x_min, y_max, x_max = np.array_split(boxlist.get(), 4, axis=1)\n    win_y_min = window[0]\n    win_x_min = window[1]\n    win_y_max = window[2]\n    win_x_max = window[3]\n    coordinate_violations = np.hstack(\n        [\n            np.less(y_min, win_y_min),\n            np.less(x_min, win_x_min),\n            np.greater(y_max, win_y_max),\n            np.greater(x_max, win_x_max),\n        ]\n    )\n    valid_indices = np.reshape(\n        np.where(np.logical_not(np.max(coordinate_violations, axis=1))), [-1]\n    )\n    return gather(boxlist, valid_indices), valid_indices\n\n\ndef concatenate(boxlists, fields=None):\n    \"\"\"Concatenate list of BoxLists.\n\n    This op concatenates a list of input BoxLists into a larger BoxList.  It also\n    handles concatenation of BoxList fields as long as the field tensor shapes\n    are equal except for the first dimension.\n\n    Args:\n      boxlists: list of BoxList objects\n      fields: optional list of fields to also concatenate.  
By default, all\n        fields from the first BoxList in the list are included in the\n        concatenation.\n\n    Returns:\n      a BoxList with number of boxes equal to\n        sum([boxlist.num_boxes() for boxlist in BoxList])\n    Raises:\n      ValueError: if boxlists is invalid (i.e., is not a list, is empty, or\n        contains non BoxList objects), or if requested fields are not contained in\n        all boxlists\n    \"\"\"\n    if not isinstance(boxlists, list):\n        raise ValueError(\"boxlists should be a list\")\n    if not boxlists:\n        raise ValueError(\"boxlists should have nonzero length\")\n    for boxlist in boxlists:\n        if not isinstance(boxlist, np_box_list.BoxList):\n            raise ValueError(\"all elements of boxlists should be BoxList objects\")\n    concatenated = np_box_list.BoxList(\n        np.vstack([boxlist.get() for boxlist in boxlists])\n    )\n    if fields is None:\n        fields = boxlists[0].get_extra_fields()\n    for field in fields:\n        first_field_shape = boxlists[0].get_field(field).shape\n        first_field_shape = first_field_shape[1:]\n        for boxlist in boxlists:\n            if not boxlist.has_field(field):\n                raise ValueError(\"boxlist must contain all requested fields\")\n            field_shape = boxlist.get_field(field).shape\n            field_shape = field_shape[1:]\n            if field_shape != first_field_shape:\n                raise ValueError(\n                    \"field %s must have same shape for all boxlists \"\n                    \"except for the 0th dimension.\" % field\n                )\n        concatenated_field = np.concatenate(\n            [boxlist.get_field(field) for boxlist in boxlists], axis=0\n        )\n        concatenated.add_field(field, concatenated_field)\n    return concatenated\n\n\ndef filter_scores_greater_than(boxlist, thresh):\n    \"\"\"Filter to keep only boxes with score exceeding a given threshold.\n\n    This op keeps the 
collection of boxes whose corresponding scores are\n    greater than the input threshold.\n\n    Args:\n      boxlist: BoxList holding N boxes.  Must contain a 'scores' field\n        representing detection scores.\n      thresh: scalar threshold\n\n    Returns:\n      a BoxList holding M boxes where M <= N\n\n    Raises:\n      ValueError: if boxlist not a BoxList object or if it does not\n        have a scores field\n    \"\"\"\n    if not isinstance(boxlist, np_box_list.BoxList):\n        raise ValueError(\"boxlist must be a BoxList\")\n    if not boxlist.has_field(\"scores\"):\n        raise ValueError(\"input boxlist must have 'scores' field\")\n    scores = boxlist.get_field(\"scores\")\n    if len(scores.shape) > 2:\n        raise ValueError(\"Scores should have rank 1 or 2\")\n    if len(scores.shape) == 2 and scores.shape[1] != 1:\n        raise ValueError(\n            \"Scores should have rank 1 or have shape consistent with [None, 1]\"\n        )\n    high_score_indices = np.reshape(np.where(np.greater(scores, thresh)), [-1]).astype(\n        np.int32\n    )\n    return gather(boxlist, high_score_indices)\n\n\ndef change_coordinate_frame(boxlist, window):\n    \"\"\"Change coordinate frame of the boxlist to be relative to window's frame.\n\n    Given a window of the form [ymin, xmin, ymax, xmax],\n    changes bounding box coordinates from boxlist to be relative to this window\n    (e.g., the min corner maps to (0,0) and the max corner maps to (1,1)).\n\n    An example use case is data augmentation: where we are given groundtruth\n    boxes (boxlist) and would like to randomly crop the image to some\n    window (window). 
In this case we need to change the coordinate frame of\n    each groundtruth box to be relative to this new window.\n\n    Args:\n      boxlist: A BoxList object holding N boxes.\n      window: a size 4 1-D numpy array.\n\n    Returns:\n      Returns a BoxList object with N boxes.\n    \"\"\"\n    win_height = window[2] - window[0]\n    win_width = window[3] - window[1]\n    boxlist_new = scale(\n        np_box_list.BoxList(\n            boxlist.get() - [window[0], window[1], window[0], window[1]]\n        ),\n        1.0 / win_height,\n        1.0 / win_width,\n    )\n    _copy_extra_fields(boxlist_new, boxlist)\n\n    return boxlist_new\n\n\ndef _copy_extra_fields(boxlist_to_copy_to, boxlist_to_copy_from):\n    \"\"\"Copies the extra fields of boxlist_to_copy_from to boxlist_to_copy_to.\n\n    Args:\n      boxlist_to_copy_to: BoxList to which extra fields are copied.\n      boxlist_to_copy_from: BoxList from which fields are copied.\n\n    Returns:\n      boxlist_to_copy_to with extra fields.\n    \"\"\"\n    for field in boxlist_to_copy_from.get_extra_fields():\n        boxlist_to_copy_to.add_field(field, boxlist_to_copy_from.get_field(field))\n    return boxlist_to_copy_to\n\n\ndef _update_valid_indices_by_removing_high_iou_boxes(\n    selected_indices, is_index_valid, intersect_over_union, threshold\n):\n    max_iou = np.max(intersect_over_union[:, selected_indices], axis=1)\n    return np.logical_and(is_index_valid, max_iou <= threshold)\n"
  },
  {
    "path": "ava_evaluation/np_box_mask_list.py",
    "content": "# Copyright 2017 The TensorFlow Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Numpy BoxMaskList classes and functions.\"\"\"\n\nfrom __future__ import absolute_import, division, print_function, unicode_literals\n\nimport numpy as np\n\nfrom . import np_box_list\n\n\nclass BoxMaskList(np_box_list.BoxList):\n    \"\"\"Convenience wrapper for BoxList with masks.\n\n    BoxMaskList extends the np_box_list.BoxList to contain masks as well.\n    In particular, its constructor receives both boxes and masks. Note that the\n    masks correspond to the full image.\n    \"\"\"\n\n    def __init__(self, box_data, mask_data):\n        \"\"\"Constructs box collection.\n\n        Args:\n          box_data: a numpy array of shape [N, 4] representing box coordinates\n          mask_data: a numpy array of shape [N, height, width] representing masks\n            with values are in {0,1}. The masks correspond to the full\n            image. 
The height and the width will be equal to image height and width.\n\n        Raises:\n          ValueError: if bbox data is not a numpy array\n          ValueError: if invalid dimensions for bbox data\n          ValueError: if mask data is not a numpy array\n          ValueError: if invalid dimension for mask data\n        \"\"\"\n        super(BoxMaskList, self).__init__(box_data)\n        if not isinstance(mask_data, np.ndarray):\n            raise ValueError(\"Mask data must be a numpy array.\")\n        if len(mask_data.shape) != 3:\n            raise ValueError(\"Invalid dimensions for mask data.\")\n        if mask_data.dtype != np.uint8:\n            raise ValueError(\"Invalid data type for mask data: uint8 is required.\")\n        if mask_data.shape[0] != box_data.shape[0]:\n            raise ValueError(\"There should be the same number of boxes and masks.\")\n        self.data[\"masks\"] = mask_data\n\n    def get_masks(self):\n        \"\"\"Convenience function for accessing masks.\n\n        Returns:\n          a numpy array of shape [N, height, width] representing masks\n        \"\"\"\n        return self.get_field(\"masks\")\n"
  },
  {
    "path": "ava_evaluation/np_box_mask_list_ops.py",
    "content": "# Copyright 2017 The TensorFlow Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Operations for np_box_mask_list.BoxMaskList.\n\nExample box operations that are supported:\n  * Areas: compute bounding box areas\n  * IOU: pairwise intersection-over-union scores\n\"\"\"\n\nfrom __future__ import absolute_import, division, print_function, unicode_literals\n\nimport numpy as np\n\nfrom . 
import np_box_list_ops, np_box_mask_list, np_mask_ops\n\n\ndef box_list_to_box_mask_list(boxlist):\n    \"\"\"Converts a BoxList containing 'masks' into a BoxMaskList.\n\n    Args:\n      boxlist: An np_box_list.BoxList object.\n\n    Returns:\n      An np_box_mask_list.BoxMaskList object.\n\n    Raises:\n      ValueError: If boxlist does not contain `masks` as a field.\n    \"\"\"\n    if not boxlist.has_field(\"masks\"):\n        raise ValueError(\"boxlist does not contain mask field.\")\n    box_mask_list = np_box_mask_list.BoxMaskList(\n        box_data=boxlist.get(), mask_data=boxlist.get_field(\"masks\")\n    )\n    extra_fields = boxlist.get_extra_fields()\n    for key in extra_fields:\n        if key != \"masks\":\n            box_mask_list.data[key] = boxlist.get_field(key)\n    return box_mask_list\n\n\ndef area(box_mask_list):\n    \"\"\"Computes area of masks.\n\n    Args:\n      box_mask_list: np_box_mask_list.BoxMaskList holding N boxes and masks\n\n    Returns:\n      a numpy array with shape [N*1] representing mask areas\n    \"\"\"\n    return np_mask_ops.area(box_mask_list.get_masks())\n\n\ndef intersection(box_mask_list1, box_mask_list2):\n    \"\"\"Compute pairwise intersection areas between masks.\n\n    Args:\n      box_mask_list1: BoxMaskList holding N boxes and masks\n      box_mask_list2: BoxMaskList holding M boxes and masks\n\n    Returns:\n      a numpy array with shape [N*M] representing pairwise intersection area\n    \"\"\"\n    return np_mask_ops.intersection(\n        box_mask_list1.get_masks(), box_mask_list2.get_masks()\n    )\n\n\ndef iou(box_mask_list1, box_mask_list2):\n    \"\"\"Computes pairwise intersection-over-union between box and mask collections.\n\n    Args:\n      box_mask_list1: BoxMaskList holding N boxes and masks\n      box_mask_list2: BoxMaskList holding M boxes and masks\n\n    Returns:\n      a numpy array with shape [N, M] representing pairwise iou scores.\n    \"\"\"\n    return 
np_mask_ops.iou(box_mask_list1.get_masks(), box_mask_list2.get_masks())\n\n\ndef ioa(box_mask_list1, box_mask_list2):\n    \"\"\"Computes pairwise intersection-over-area between box and mask collections.\n\n    Intersection-over-area (ioa) between two masks mask1 and mask2 is defined as\n    their intersection area over mask2's area. Note that ioa is not symmetric,\n    that is, IOA(mask1, mask2) != IOA(mask2, mask1).\n\n    Args:\n      box_mask_list1: np_box_mask_list.BoxMaskList holding N boxes and masks\n      box_mask_list2: np_box_mask_list.BoxMaskList holding M boxes and masks\n\n    Returns:\n      a numpy array with shape [N, M] representing pairwise ioa scores.\n    \"\"\"\n    return np_mask_ops.ioa(box_mask_list1.get_masks(), box_mask_list2.get_masks())\n\n\ndef gather(box_mask_list, indices, fields=None):\n    \"\"\"Gather boxes from np_box_mask_list.BoxMaskList according to indices.\n\n    By default, gather returns boxes corresponding to the input index list, as\n    well as all additional fields stored in the box_mask_list (indexing into the\n    first dimension).  However one can optionally only gather from a\n    subset of fields.\n\n    Args:\n      box_mask_list: np_box_mask_list.BoxMaskList holding N boxes\n      indices: a 1-d numpy array of type int_\n      fields: (optional) list of fields to also gather from.  If None (default),\n          all fields are gathered from.  
Pass an empty fields list to only gather\n          the box coordinates.\n\n    Returns:\n      subbox_mask_list: a np_box_mask_list.BoxMaskList corresponding to the subset\n          of the input box_mask_list specified by indices\n\n    Raises:\n      ValueError: if specified field is not contained in box_mask_list or if the\n          indices are not of type int_\n    \"\"\"\n    if fields is not None and \"masks\" not in fields:\n        # Build a new list so the caller's `fields` argument is not mutated.\n        fields = list(fields) + [\"masks\"]\n    return box_list_to_box_mask_list(\n        np_box_list_ops.gather(boxlist=box_mask_list, indices=indices, fields=fields)\n    )\n\n\ndef sort_by_field(box_mask_list, field, order=np_box_list_ops.SortOrder.DESCEND):\n    \"\"\"Sort boxes and associated fields according to a scalar field.\n\n    A common use case is reordering the boxes according to descending scores.\n\n    Args:\n      box_mask_list: BoxMaskList holding N boxes.\n      field: A BoxMaskList field for sorting and reordering the BoxMaskList.\n      order: (Optional) 'descend' or 'ascend'. Default is descend.\n\n    Returns:\n      sorted_box_mask_list: A sorted BoxMaskList with the field in the specified\n        order.\n    \"\"\"\n    return box_list_to_box_mask_list(\n        np_box_list_ops.sort_by_field(boxlist=box_mask_list, field=field, order=order)\n    )\n\n\ndef non_max_suppression(\n    box_mask_list,\n    max_output_size=10000,\n    iou_threshold=1.0,\n    score_threshold=-10.0,\n):\n    \"\"\"Non maximum suppression.\n\n    This op greedily selects a subset of detection bounding boxes, pruning\n    away boxes that have high IOU (intersection over union) overlap (> thresh)\n    with already selected boxes. In each iteration, the detected bounding box with\n    highest score in the available pool is selected.\n\n    Args:\n      box_mask_list: np_box_mask_list.BoxMaskList holding N boxes.  Must contain\n        a 'scores' field representing detection scores. 
All scores belong to the\n        same class.\n      max_output_size: maximum number of retained boxes\n      iou_threshold: intersection over union threshold.\n      score_threshold: minimum score threshold. Remove the boxes with scores\n                       less than this value. Default value is set to -10. A very\n                       low threshold to pass pretty much all the boxes, unless\n                       the user sets a different score threshold.\n\n    Returns:\n      an np_box_mask_list.BoxMaskList holding M boxes where M <= max_output_size\n\n    Raises:\n      ValueError: if 'scores' field does not exist\n      ValueError: if threshold is not in [0, 1]\n      ValueError: if max_output_size < 0\n    \"\"\"\n    if not box_mask_list.has_field(\"scores\"):\n        raise ValueError(\"Field scores does not exist\")\n    if iou_threshold < 0.0 or iou_threshold > 1.0:\n        raise ValueError(\"IOU threshold must be in [0, 1]\")\n    if max_output_size < 0:\n        raise ValueError(\"max_output_size must be bigger than 0.\")\n\n    box_mask_list = filter_scores_greater_than(box_mask_list, score_threshold)\n    if box_mask_list.num_boxes() == 0:\n        return box_mask_list\n\n    box_mask_list = sort_by_field(box_mask_list, \"scores\")\n\n    # Prevent further computation if NMS is disabled.\n    if iou_threshold == 1.0:\n        if box_mask_list.num_boxes() > max_output_size:\n            selected_indices = np.arange(max_output_size)\n            return gather(box_mask_list, selected_indices)\n        else:\n            return box_mask_list\n\n    masks = box_mask_list.get_masks()\n    num_masks = box_mask_list.num_boxes()\n\n    # is_index_valid is True only for all remaining valid boxes,\n    is_index_valid = np.full(num_masks, 1, dtype=bool)\n    selected_indices = []\n    num_output = 0\n    for i in range(num_masks):\n        if num_output < max_output_size:\n            if is_index_valid[i]:\n                num_output += 1\n                
selected_indices.append(i)\n                is_index_valid[i] = False\n                valid_indices = np.where(is_index_valid)[0]\n                if valid_indices.size == 0:\n                    break\n\n                intersect_over_union = np_mask_ops.iou(\n                    np.expand_dims(masks[i], axis=0), masks[valid_indices]\n                )\n                intersect_over_union = np.squeeze(intersect_over_union, axis=0)\n                is_index_valid[valid_indices] = np.logical_and(\n                    is_index_valid[valid_indices],\n                    intersect_over_union <= iou_threshold,\n                )\n    return gather(box_mask_list, np.array(selected_indices))\n\n\ndef multi_class_non_max_suppression(\n    box_mask_list, score_thresh, iou_thresh, max_output_size\n):\n    \"\"\"Multi-class version of non maximum suppression.\n\n    This op greedily selects a subset of detection bounding boxes, pruning\n    away boxes that have high IOU (intersection over union) overlap (> thresh)\n    with already selected boxes.  It operates independently for each class for\n    which scores are provided (via the scores field of the input box_list),\n    pruning boxes with score less than a provided threshold prior to\n    applying NMS.\n\n    Args:\n      box_mask_list: np_box_mask_list.BoxMaskList holding N boxes.  Must contain a\n        'scores' field representing detection scores.  This scores field is a\n        tensor that can be 1 dimensional (in the case of a single class) or\n        2-dimensional, in which case we assume that it takes the\n        shape [num_boxes, num_classes]. 
We further assume that this rank is known\n        statically and that scores.shape[1] is also known (i.e., the number of\n        classes is fixed and known at graph construction time).\n      score_thresh: scalar threshold for score (low scoring boxes are removed).\n      iou_thresh: scalar threshold for IOU (boxes that have high IOU overlap\n        with previously selected boxes are removed).\n      max_output_size: maximum number of retained boxes per class.\n\n    Returns:\n      a box_mask_list holding M boxes with a rank-1 scores field representing\n        corresponding scores for each box with scores sorted in decreasing order\n        and a rank-1 classes field representing a class label for each box.\n    Raises:\n      ValueError: if iou_thresh is not in [0, 1] or if input box_mask_list does\n        not have a valid scores field.\n    \"\"\"\n    if not 0 <= iou_thresh <= 1.0:\n        raise ValueError(\"thresh must be between 0 and 1\")\n    if not isinstance(box_mask_list, np_box_mask_list.BoxMaskList):\n        raise ValueError(\"box_mask_list must be a BoxMaskList\")\n    if not box_mask_list.has_field(\"scores\"):\n        raise ValueError(\"input box_mask_list must have 'scores' field\")\n    scores = box_mask_list.get_field(\"scores\")\n    if len(scores.shape) == 1:\n        scores = np.reshape(scores, [-1, 1])\n    elif len(scores.shape) == 2:\n        if scores.shape[1] is None:\n            raise ValueError(\n                \"scores field must have statically defined second dimension\"\n            )\n    else:\n        raise ValueError(\"scores field must be of rank 1 or 2\")\n\n    num_boxes = box_mask_list.num_boxes()\n    num_scores = scores.shape[0]\n    num_classes = scores.shape[1]\n\n    if num_boxes != num_scores:\n        raise ValueError(\"Incorrect scores field length: actual vs expected.\")\n\n    selected_boxes_list = []\n    for class_idx in range(num_classes):\n        box_mask_list_and_class_scores = 
np_box_mask_list.BoxMaskList(\n            box_data=box_mask_list.get(), mask_data=box_mask_list.get_masks()\n        )\n        class_scores = np.reshape(scores[0:num_scores, class_idx], [-1])\n        box_mask_list_and_class_scores.add_field(\"scores\", class_scores)\n        box_mask_list_filt = filter_scores_greater_than(\n            box_mask_list_and_class_scores, score_thresh\n        )\n        nms_result = non_max_suppression(\n            box_mask_list_filt,\n            max_output_size=max_output_size,\n            iou_threshold=iou_thresh,\n            score_threshold=score_thresh,\n        )\n        nms_result.add_field(\n            \"classes\", np.zeros_like(nms_result.get_field(\"scores\")) + class_idx\n        )\n        selected_boxes_list.append(nms_result)\n    selected_boxes = np_box_list_ops.concatenate(selected_boxes_list)\n    sorted_boxes = np_box_list_ops.sort_by_field(selected_boxes, \"scores\")\n    return box_list_to_box_mask_list(boxlist=sorted_boxes)\n\n\ndef prune_non_overlapping_masks(box_mask_list1, box_mask_list2, minoverlap=0.0):\n    \"\"\"Prunes the boxes in list1 that overlap less than thresh with list2.\n\n    For each mask in box_mask_list1, we want its IOA to be more than minoverlap\n    with at least one of the masks in box_mask_list2. If it does not, we remove\n    it. 
If the masks are not full size image, we do the pruning based on boxes.\n\n    Args:\n      box_mask_list1: np_box_mask_list.BoxMaskList holding N boxes and masks.\n      box_mask_list2: np_box_mask_list.BoxMaskList holding M boxes and masks.\n      minoverlap: Minimum required overlap between boxes, to count them as\n                  overlapping.\n\n    Returns:\n      A pruned box_mask_list with size [N', 4].\n    \"\"\"\n    intersection_over_area = ioa(box_mask_list2, box_mask_list1)  # [M, N] tensor\n    intersection_over_area = np.amax(intersection_over_area, axis=0)  # [N] tensor\n    keep_bool = np.greater_equal(intersection_over_area, np.array(minoverlap))\n    keep_inds = np.nonzero(keep_bool)[0]\n    new_box_mask_list1 = gather(box_mask_list1, keep_inds)\n    return new_box_mask_list1\n\n\ndef concatenate(box_mask_lists, fields=None):\n    \"\"\"Concatenate list of box_mask_lists.\n\n    This op concatenates a list of input box_mask_lists into a larger\n    box_mask_list.  It also\n    handles concatenation of box_mask_list fields as long as the field tensor\n    shapes are equal except for the first dimension.\n\n    Args:\n      box_mask_lists: list of np_box_mask_list.BoxMaskList objects\n      fields: optional list of fields to also concatenate.  
By default, all\n        fields from the first BoxMaskList in the list are included in the\n        concatenation.\n\n    Returns:\n      a box_mask_list with number of boxes equal to\n        sum([box_mask_list.num_boxes() for box_mask_list in box_mask_lists])\n    Raises:\n      ValueError: if box_mask_lists is invalid (i.e., is not a list, is empty, or\n        contains non box_mask_list objects), or if requested fields are not\n        contained in all box_mask_lists\n    \"\"\"\n    if fields is not None:\n        if \"masks\" not in fields:\n            fields.append(\"masks\")\n    return box_list_to_box_mask_list(\n        np_box_list_ops.concatenate(boxlists=box_mask_lists, fields=fields)\n    )\n\n\ndef filter_scores_greater_than(box_mask_list, thresh):\n    \"\"\"Filter to keep only boxes and masks with score exceeding a given threshold.\n\n    This op keeps the collection of boxes and masks whose corresponding scores are\n    greater than the input threshold.\n\n    Args:\n      box_mask_list: BoxMaskList holding N boxes and masks.  
Must contain a\n        'scores' field representing detection scores.\n      thresh: scalar threshold\n\n    Returns:\n      a BoxMaskList holding M boxes and masks where M <= N\n\n    Raises:\n      ValueError: if box_mask_list not a np_box_mask_list.BoxMaskList object or\n        if it does not have a scores field\n    \"\"\"\n    if not isinstance(box_mask_list, np_box_mask_list.BoxMaskList):\n        raise ValueError(\"box_mask_list must be a BoxMaskList\")\n    if not box_mask_list.has_field(\"scores\"):\n        raise ValueError(\"input box_mask_list must have 'scores' field\")\n    scores = box_mask_list.get_field(\"scores\")\n    if len(scores.shape) > 2:\n        raise ValueError(\"Scores should have rank 1 or 2\")\n    if len(scores.shape) == 2 and scores.shape[1] != 1:\n        raise ValueError(\n            \"Scores should have rank 1 or have shape consistent with [None, 1]\"\n        )\n    high_score_indices = np.reshape(np.where(np.greater(scores, thresh)), [-1]).astype(\n        np.int32\n    )\n    return gather(box_mask_list, high_score_indices)\n"
  },
  {
    "path": "ava_evaluation/np_box_ops.py",
    "content": "# Copyright 2017 The TensorFlow Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Operations for [N, 4] numpy arrays representing bounding boxes.\n\nExample box operations that are supported:\n  * Areas: compute bounding box areas\n  * IOU: pairwise intersection-over-union scores\n\"\"\"\n\nfrom __future__ import absolute_import, division, print_function, unicode_literals\n\nimport numpy as np\n\n\ndef area(boxes):\n    \"\"\"Computes area of boxes.\n\n    Args:\n      boxes: Numpy array with shape [N, 4] holding N boxes\n\n    Returns:\n      a numpy array with shape [N*1] representing box areas\n    \"\"\"\n    return (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])\n\n\ndef intersection(boxes1, boxes2):\n    \"\"\"Compute pairwise intersection areas between boxes.\n\n    Args:\n      boxes1: a numpy array with shape [N, 4] holding N boxes\n      boxes2: a numpy array with shape [M, 4] holding M boxes\n\n    Returns:\n      a numpy array with shape [N*M] representing pairwise intersection area\n    \"\"\"\n    [y_min1, x_min1, y_max1, x_max1] = np.split(boxes1, 4, axis=1)\n    [y_min2, x_min2, y_max2, x_max2] = np.split(boxes2, 4, axis=1)\n\n    all_pairs_min_ymax = np.minimum(y_max1, np.transpose(y_max2))\n    all_pairs_max_ymin = np.maximum(y_min1, np.transpose(y_min2))\n    intersect_heights = np.maximum(\n        
np.zeros(all_pairs_max_ymin.shape),\n        all_pairs_min_ymax - all_pairs_max_ymin,\n    )\n    all_pairs_min_xmax = np.minimum(x_max1, np.transpose(x_max2))\n    all_pairs_max_xmin = np.maximum(x_min1, np.transpose(x_min2))\n    intersect_widths = np.maximum(\n        np.zeros(all_pairs_max_xmin.shape),\n        all_pairs_min_xmax - all_pairs_max_xmin,\n    )\n    return intersect_heights * intersect_widths\n\n\ndef iou(boxes1, boxes2):\n    \"\"\"Computes pairwise intersection-over-union between box collections.\n\n    Args:\n      boxes1: a numpy array with shape [N, 4] holding N boxes.\n      boxes2: a numpy array with shape [M, 4] holding M boxes.\n\n    Returns:\n      a numpy array with shape [N, M] representing pairwise iou scores.\n    \"\"\"\n    intersect = intersection(boxes1, boxes2)\n    area1 = area(boxes1)\n    area2 = area(boxes2)\n    union = np.expand_dims(area1, axis=1) + np.expand_dims(area2, axis=0) - intersect\n    return intersect / union\n\n\ndef ioa(boxes1, boxes2):\n    \"\"\"Computes pairwise intersection-over-area between box collections.\n\n    Intersection-over-area (ioa) between two boxes box1 and box2 is defined as\n    their intersection area over box2's area. Note that ioa is not symmetric,\n    that is, IOA(box1, box2) != IOA(box2, box1).\n\n    Args:\n      boxes1: a numpy array with shape [N, 4] holding N boxes.\n      boxes2: a numpy array with shape [M, 4] holding M boxes.\n\n    Returns:\n      a numpy array with shape [N, M] representing pairwise ioa scores.\n    \"\"\"\n    intersect = intersection(boxes1, boxes2)\n    areas = np.expand_dims(area(boxes2), axis=0)\n    return intersect / areas\n"
  },
  {
    "path": "ava_evaluation/np_mask_ops.py",
    "content": "# Copyright 2017 The TensorFlow Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Operations for [N, height, width] numpy arrays representing masks.\n\nExample mask operations that are supported:\n  * Areas: compute mask areas\n  * IOU: pairwise intersection-over-union scores\n\"\"\"\n\nfrom __future__ import absolute_import, division, print_function, unicode_literals\n\nimport numpy as np\n\nEPSILON = 1e-7\n\n\ndef area(masks):\n    \"\"\"Computes area of masks.\n\n    Args:\n      masks: Numpy array with shape [N, height, width] holding N masks. Masks\n        values are of type np.uint8 and values are in {0,1}.\n\n    Returns:\n      a numpy array with shape [N*1] representing mask areas.\n\n    Raises:\n      ValueError: If masks.dtype is not np.uint8\n    \"\"\"\n    if masks.dtype != np.uint8:\n        raise ValueError(\"Masks type should be np.uint8\")\n    return np.sum(masks, axis=(1, 2), dtype=np.float32)\n\n\ndef intersection(masks1, masks2):\n    \"\"\"Compute pairwise intersection areas between masks.\n\n    Args:\n      masks1: a numpy array with shape [N, height, width] holding N masks. Masks\n        values are of type np.uint8 and values are in {0,1}.\n      masks2: a numpy array with shape [M, height, width] holding M masks. 
Masks\n        values are of type np.uint8 and values are in {0,1}.\n\n    Returns:\n      a numpy array with shape [N, M] representing pairwise intersection area.\n\n    Raises:\n      ValueError: If masks1 and masks2 are not of type np.uint8.\n    \"\"\"\n    if masks1.dtype != np.uint8 or masks2.dtype != np.uint8:\n        raise ValueError(\"masks1 and masks2 should be of type np.uint8\")\n    n = masks1.shape[0]\n    m = masks2.shape[0]\n    answer = np.zeros([n, m], dtype=np.float32)\n    for i in np.arange(n):\n        for j in np.arange(m):\n            answer[i, j] = np.sum(np.minimum(masks1[i], masks2[j]), dtype=np.float32)\n    return answer\n\n\ndef iou(masks1, masks2):\n    \"\"\"Computes pairwise intersection-over-union between mask collections.\n\n    Args:\n      masks1: a numpy array with shape [N, height, width] holding N masks. Masks\n        values are of type np.uint8 and values are in {0,1}.\n      masks2: a numpy array with shape [M, height, width] holding M masks. Masks\n        values are of type np.uint8 and values are in {0,1}.\n\n    Returns:\n      a numpy array with shape [N, M] representing pairwise iou scores.\n\n    Raises:\n      ValueError: If masks1 and masks2 are not of type np.uint8.\n    \"\"\"\n    if masks1.dtype != np.uint8 or masks2.dtype != np.uint8:\n        raise ValueError(\"masks1 and masks2 should be of type np.uint8\")\n    intersect = intersection(masks1, masks2)\n    area1 = area(masks1)\n    area2 = area(masks2)\n    union = np.expand_dims(area1, axis=1) + np.expand_dims(area2, axis=0) - intersect\n    return intersect / np.maximum(union, EPSILON)\n\n\ndef ioa(masks1, masks2):\n    \"\"\"Computes pairwise intersection-over-area between mask collections.\n\n    Intersection-over-area (ioa) between two masks, mask1 and mask2 is defined as\n    their intersection area over mask2's area. 
Note that ioa is not symmetric,\n    that is, IOA(mask1, mask2) != IOA(mask2, mask1).\n\n    Args:\n      masks1: a numpy array with shape [N, height, width] holding N masks. Masks\n        values are of type np.uint8 and values are in {0,1}.\n      masks2: a numpy array with shape [M, height, width] holding M masks. Masks\n        values are of type np.uint8 and values are in {0,1}.\n\n    Returns:\n      a numpy array with shape [N, M] representing pairwise ioa scores.\n\n    Raises:\n      ValueError: If masks1 and masks2 are not of type np.uint8.\n    \"\"\"\n    if masks1.dtype != np.uint8 or masks2.dtype != np.uint8:\n        raise ValueError(\"masks1 and masks2 should be of type np.uint8\")\n    intersect = intersection(masks1, masks2)\n    areas = np.expand_dims(area(masks2), axis=0)\n    return intersect / (areas + EPSILON)\n"
  },
  {
    "path": "ava_evaluation/object_detection_evaluation.py",
    "content": "# Copyright 2017 The TensorFlow Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\"\"\"object_detection_evaluation module.\n\nObjectDetectionEvaluation is a class which manages ground truth information of a\nobject detection dataset, and computes frequently used detection metrics such as\nPrecision, Recall, CorLoc of the provided detection results.\nIt supports the following operations:\n1) Add ground truth information of images sequentially.\n2) Add detection result of images sequentially.\n3) Evaluate detection metrics on already inserted detection results.\n4) Write evaluation result into a pickle file for future processing or\n   visualization.\n\nNote: This module operates on numpy boxes and box lists.\n\"\"\"\n\nfrom __future__ import absolute_import, division, print_function, unicode_literals\n\nimport collections\nimport logging\nfrom abc import ABCMeta, abstractmethod\n\nimport numpy as np\n\nfrom . 
import label_map_util, metrics, per_image_evaluation, standard_fields\n\n\nclass DetectionEvaluator:\n    \"\"\"Interface for object detection evalution classes.\n\n    Example usage of the Evaluator:\n    ------------------------------\n    evaluator = DetectionEvaluator(categories)\n\n    # Detections and groundtruth for image 1.\n    evaluator.add_single_groundtruth_image_info(...)\n    evaluator.add_single_detected_image_info(...)\n\n    # Detections and groundtruth for image 2.\n    evaluator.add_single_groundtruth_image_info(...)\n    evaluator.add_single_detected_image_info(...)\n\n    metrics_dict = evaluator.evaluate()\n    \"\"\"\n\n    __metaclass__ = ABCMeta\n\n    def __init__(self, categories):\n        \"\"\"Constructor.\n\n        Args:\n          categories: A list of dicts, each of which has the following keys -\n            'id': (required) an integer id uniquely identifying this category.\n            'name': (required) string representing category name e.g., 'cat', 'dog'.\n        \"\"\"\n        self._categories = categories\n\n    @abstractmethod\n    def add_single_ground_truth_image_info(self, image_id, groundtruth_dict):\n        \"\"\"Adds groundtruth for a single image to be used for evaluation.\n\n        Args:\n          image_id: A unique string/integer identifier for the image.\n          groundtruth_dict: A dictionary of groundtruth numpy arrays required\n            for evaluations.\n        \"\"\"\n        pass\n\n    @abstractmethod\n    def add_single_detected_image_info(self, image_id, detections_dict):\n        \"\"\"Adds detections for a single image to be used for evaluation.\n\n        Args:\n          image_id: A unique string/integer identifier for the image.\n          detections_dict: A dictionary of detection numpy arrays required\n            for evaluation.\n        \"\"\"\n        pass\n\n    @abstractmethod\n    def evaluate(self):\n        \"\"\"Evaluates detections and returns a dictionary of metrics.\"\"\"\n     
   pass\n\n    @abstractmethod\n    def clear(self):\n        \"\"\"Clears the state to prepare for a fresh evaluation.\"\"\"\n        pass\n\n\nclass ObjectDetectionEvaluator(DetectionEvaluator):\n    \"\"\"A class to evaluate detections.\"\"\"\n\n    def __init__(\n        self,\n        categories,\n        matching_iou_threshold=0.5,\n        evaluate_corlocs=False,\n        metric_prefix=None,\n        use_weighted_mean_ap=False,\n        evaluate_masks=False,\n    ):\n        \"\"\"Constructor.\n\n        Args:\n          categories: A list of dicts, each of which has the following keys -\n            'id': (required) an integer id uniquely identifying this category.\n            'name': (required) string representing category name e.g., 'cat', 'dog'.\n          matching_iou_threshold: IOU threshold to use for matching groundtruth\n            boxes to detection boxes.\n          evaluate_corlocs: (optional) boolean which determines if corloc scores\n            are to be returned or not.\n          metric_prefix: (optional) string prefix for metric name; if None, no\n            prefix is used.\n          use_weighted_mean_ap: (optional) boolean which determines if the mean\n            average precision is computed directly from the scores and tp_fp_labels\n            of all classes.\n          evaluate_masks: If False, evaluation will be performed based on boxes.\n            If True, mask evaluation will be performed instead.\n\n        Raises:\n          ValueError: If the category ids are not 1-indexed.\n        \"\"\"\n        super(ObjectDetectionEvaluator, self).__init__(categories)\n        self._num_classes = max([cat[\"id\"] for cat in categories])\n        if min(cat[\"id\"] for cat in categories) < 1:\n            raise ValueError(\"Classes should be 1-indexed.\")\n        self._matching_iou_threshold = matching_iou_threshold\n        self._use_weighted_mean_ap = use_weighted_mean_ap\n        self._label_id_offset = 1\n        
self._evaluate_masks = evaluate_masks\n        self._evaluation = ObjectDetectionEvaluation(\n            num_groundtruth_classes=self._num_classes,\n            matching_iou_threshold=self._matching_iou_threshold,\n            use_weighted_mean_ap=self._use_weighted_mean_ap,\n            label_id_offset=self._label_id_offset,\n        )\n        self._image_ids = set([])\n        self._evaluate_corlocs = evaluate_corlocs\n        self._metric_prefix = (metric_prefix + \"_\") if metric_prefix else \"\"\n\n    def add_single_ground_truth_image_info(self, image_id, groundtruth_dict):\n        \"\"\"Adds groundtruth for a single image to be used for evaluation.\n\n        Args:\n          image_id: A unique string/integer identifier for the image.\n          groundtruth_dict: A dictionary containing -\n            standard_fields.InputDataFields.groundtruth_boxes: float32 numpy array\n              of shape [num_boxes, 4] containing `num_boxes` groundtruth boxes of\n              the format [ymin, xmin, ymax, xmax] in absolute image coordinates.\n            standard_fields.InputDataFields.groundtruth_classes: integer numpy array\n              of shape [num_boxes] containing 1-indexed groundtruth classes for the\n              boxes.\n            standard_fields.InputDataFields.groundtruth_difficult: Optional length\n              M numpy boolean array denoting whether a ground truth box is a\n              difficult instance or not. This field is optional to support the case\n              that no boxes are difficult.\n            standard_fields.InputDataFields.groundtruth_instance_masks: Optional\n              numpy array of shape [num_boxes, height, width] with values in {0, 1}.\n\n        Raises:\n          ValueError: On adding groundtruth for an image more than once. 
Will also\n            raise error if instance masks are not in groundtruth dictionary.\n        \"\"\"\n        if image_id in self._image_ids:\n            raise ValueError(\"Image with id {} already added.\".format(image_id))\n\n        groundtruth_classes = (\n            groundtruth_dict[standard_fields.InputDataFields.groundtruth_classes]\n            - self._label_id_offset\n        )\n        # If the key is not present in the groundtruth_dict or the array is empty\n        # (unless there are no annotations for the groundtruth on this image)\n        # use values from the dictionary or insert None otherwise.\n        if (\n            standard_fields.InputDataFields.groundtruth_difficult\n            in groundtruth_dict.keys()\n            and (\n                groundtruth_dict[\n                    standard_fields.InputDataFields.groundtruth_difficult\n                ].size\n                or not groundtruth_classes.size\n            )\n        ):\n            groundtruth_difficult = groundtruth_dict[\n                standard_fields.InputDataFields.groundtruth_difficult\n            ]\n        else:\n            groundtruth_difficult = None\n            if not len(self._image_ids) % 1000:\n                logging.warning(\n                    \"image %s does not have groundtruth difficult flag specified\",\n                    image_id,\n                )\n        groundtruth_masks = None\n        if self._evaluate_masks:\n            if (\n                standard_fields.InputDataFields.groundtruth_instance_masks\n                not in groundtruth_dict\n            ):\n                raise ValueError(\"Instance masks not in groundtruth dictionary.\")\n            groundtruth_masks = groundtruth_dict[\n                standard_fields.InputDataFields.groundtruth_instance_masks\n            ]\n        self._evaluation.add_single_ground_truth_image_info(\n            image_key=image_id,\n            groundtruth_boxes=groundtruth_dict[\n                
standard_fields.InputDataFields.groundtruth_boxes\n            ],\n            groundtruth_class_labels=groundtruth_classes,\n            groundtruth_is_difficult_list=groundtruth_difficult,\n            groundtruth_masks=groundtruth_masks,\n        )\n        self._image_ids.update([image_id])\n\n    def add_single_detected_image_info(self, image_id, detections_dict):\n        \"\"\"Adds detections for a single image to be used for evaluation.\n\n        Args:\n          image_id: A unique string/integer identifier for the image.\n          detections_dict: A dictionary containing -\n            standard_fields.DetectionResultFields.detection_boxes: float32 numpy\n              array of shape [num_boxes, 4] containing `num_boxes` detection boxes\n              of the format [ymin, xmin, ymax, xmax] in absolute image coordinates.\n            standard_fields.DetectionResultFields.detection_scores: float32 numpy\n              array of shape [num_boxes] containing detection scores for the boxes.\n            standard_fields.DetectionResultFields.detection_classes: integer numpy\n              array of shape [num_boxes] containing 1-indexed detection classes for\n              the boxes.\n            standard_fields.DetectionResultFields.detection_masks: uint8 numpy\n              array of shape [num_boxes, height, width] containing `num_boxes` masks\n              of values ranging between 0 and 1.\n\n        Raises:\n          ValueError: If detection masks are not in detections dictionary.\n        \"\"\"\n        detection_classes = (\n            detections_dict[standard_fields.DetectionResultFields.detection_classes]\n            - self._label_id_offset\n        )\n        detection_masks = None\n        if self._evaluate_masks:\n            if (\n                standard_fields.DetectionResultFields.detection_masks\n                not in detections_dict\n            ):\n                raise ValueError(\"Detection masks not in detections dictionary.\")\n      
      detection_masks = detections_dict[\n                standard_fields.DetectionResultFields.detection_masks\n            ]\n        self._evaluation.add_single_detected_image_info(\n            image_key=image_id,\n            detected_boxes=detections_dict[\n                standard_fields.DetectionResultFields.detection_boxes\n            ],\n            detected_scores=detections_dict[\n                standard_fields.DetectionResultFields.detection_scores\n            ],\n            detected_class_labels=detection_classes,\n            detected_masks=detection_masks,\n        )\n\n    def evaluate(self):\n        \"\"\"Compute evaluation result.\n\n        Returns:\n          A dictionary of metrics with the following fields -\n\n          1. summary_metrics:\n            'Precision/mAP@<matching_iou_threshold>IOU': mean average precision at\n            the specified IOU threshold.\n\n          2. per_category_ap: category specific results with keys of the form\n            'PerformanceByCategory/mAP@<matching_iou_threshold>IOU/category'.\n        \"\"\"\n        (\n            per_class_ap,\n            mean_ap,\n            _,\n            _,\n            per_class_corloc,\n            mean_corloc,\n        ) = self._evaluation.evaluate()\n        pascal_metrics = {\n            self._metric_prefix\n            + \"Precision/mAP@{}IOU\".format(self._matching_iou_threshold): mean_ap\n        }\n        if self._evaluate_corlocs:\n            pascal_metrics[\n                self._metric_prefix\n                + \"Precision/meanCorLoc@{}IOU\".format(self._matching_iou_threshold)\n            ] = mean_corloc\n        category_index = label_map_util.create_category_index(self._categories)\n        for idx in range(per_class_ap.size):\n            if idx + self._label_id_offset in category_index:\n                display_name = (\n                    self._metric_prefix\n                    + \"PerformanceByCategory/AP@{}IOU/{}\".format(\n                   
     self._matching_iou_threshold,\n                        category_index[idx + self._label_id_offset][\"name\"],\n                    )\n                )\n                pascal_metrics[display_name] = per_class_ap[idx]\n\n                # Optionally add CorLoc metrics.classes\n                if self._evaluate_corlocs:\n                    display_name = (\n                        self._metric_prefix\n                        + \"PerformanceByCategory/CorLoc@{}IOU/{}\".format(\n                            self._matching_iou_threshold,\n                            category_index[idx + self._label_id_offset][\"name\"],\n                        )\n                    )\n                    pascal_metrics[display_name] = per_class_corloc[idx]\n\n        return pascal_metrics\n\n    def clear(self):\n        \"\"\"Clears the state to prepare for a fresh evaluation.\"\"\"\n        self._evaluation = ObjectDetectionEvaluation(\n            num_groundtruth_classes=self._num_classes,\n            matching_iou_threshold=self._matching_iou_threshold,\n            use_weighted_mean_ap=self._use_weighted_mean_ap,\n            label_id_offset=self._label_id_offset,\n        )\n        self._image_ids.clear()\n\n\nclass PascalDetectionEvaluator(ObjectDetectionEvaluator):\n    \"\"\"A class to evaluate detections using PASCAL metrics.\"\"\"\n\n    def __init__(self, categories, matching_iou_threshold=0.5):\n        super(PascalDetectionEvaluator, self).__init__(\n            categories,\n            matching_iou_threshold=matching_iou_threshold,\n            evaluate_corlocs=False,\n            metric_prefix=\"PascalBoxes\",\n            use_weighted_mean_ap=False,\n        )\n\n\nclass WeightedPascalDetectionEvaluator(ObjectDetectionEvaluator):\n    \"\"\"A class to evaluate detections using weighted PASCAL metrics.\n\n    Weighted PASCAL metrics computes the mean average precision as the average\n    precision given the scores and tp_fp_labels of all classes. 
In comparison,\n    PASCAL metrics computes the mean average precision as the mean of the\n    per-class average precisions.\n\n    This definition is very similar to the mean of the per-class average\n    precisions weighted by class frequency. However, they are typically not the\n    same as the average precision is not a linear function of the scores and\n    tp_fp_labels.\n    \"\"\"\n\n    def __init__(self, categories, matching_iou_threshold=0.5):\n        super(WeightedPascalDetectionEvaluator, self).__init__(\n            categories,\n            matching_iou_threshold=matching_iou_threshold,\n            evaluate_corlocs=False,\n            metric_prefix=\"WeightedPascalBoxes\",\n            use_weighted_mean_ap=True,\n        )\n\n\nclass PascalInstanceSegmentationEvaluator(ObjectDetectionEvaluator):\n    \"\"\"A class to evaluate instance masks using PASCAL metrics.\"\"\"\n\n    def __init__(self, categories, matching_iou_threshold=0.5):\n        super(PascalInstanceSegmentationEvaluator, self).__init__(\n            categories,\n            matching_iou_threshold=matching_iou_threshold,\n            evaluate_corlocs=False,\n            metric_prefix=\"PascalMasks\",\n            use_weighted_mean_ap=False,\n            evaluate_masks=True,\n        )\n\n\nclass WeightedPascalInstanceSegmentationEvaluator(ObjectDetectionEvaluator):\n    \"\"\"A class to evaluate instance masks using weighted PASCAL metrics.\n\n    Weighted PASCAL metrics computes the mean average precision as the average\n    precision given the scores and tp_fp_labels of all classes. In comparison,\n    PASCAL metrics computes the mean average precision as the mean of the\n    per-class average precisions.\n\n    This definition is very similar to the mean of the per-class average\n    precisions weighted by class frequency. 
However, they are typically not the\n    same as the average precision is not a linear function of the scores and\n    tp_fp_labels.\n    \"\"\"\n\n    def __init__(self, categories, matching_iou_threshold=0.5):\n        super(WeightedPascalInstanceSegmentationEvaluator, self).__init__(\n            categories,\n            matching_iou_threshold=matching_iou_threshold,\n            evaluate_corlocs=False,\n            metric_prefix=\"WeightedPascalMasks\",\n            use_weighted_mean_ap=True,\n            evaluate_masks=True,\n        )\n\n\nclass OpenImagesDetectionEvaluator(ObjectDetectionEvaluator):\n    \"\"\"A class to evaluate detections using Open Images V2 metrics.\n\n    Open Images V2 introduce group_of type of bounding boxes and this metric\n    handles those boxes appropriately.\n    \"\"\"\n\n    def __init__(self, categories, matching_iou_threshold=0.5, evaluate_corlocs=False):\n        \"\"\"Constructor.\n\n        Args:\n          categories: A list of dicts, each of which has the following keys -\n            'id': (required) an integer id uniquely identifying this category.\n            'name': (required) string representing category name e.g., 'cat', 'dog'.\n          matching_iou_threshold: IOU threshold to use for matching groundtruth\n            boxes to detection boxes.\n          evaluate_corlocs: if True, additionally evaluates and returns CorLoc.\n        \"\"\"\n        super(OpenImagesDetectionEvaluator, self).__init__(\n            categories,\n            matching_iou_threshold,\n            evaluate_corlocs,\n            metric_prefix=\"OpenImagesV2\",\n        )\n\n    def add_single_ground_truth_image_info(self, image_id, groundtruth_dict):\n        \"\"\"Adds groundtruth for a single image to be used for evaluation.\n\n        Args:\n          image_id: A unique string/integer identifier for the image.\n          groundtruth_dict: A dictionary containing -\n            standard_fields.InputDataFields.groundtruth_boxes: float32 
numpy array\n              of shape [num_boxes, 4] containing `num_boxes` groundtruth boxes of\n              the format [ymin, xmin, ymax, xmax] in absolute image coordinates.\n            standard_fields.InputDataFields.groundtruth_classes: integer numpy array\n              of shape [num_boxes] containing 1-indexed groundtruth classes for the\n              boxes.\n            standard_fields.InputDataFields.groundtruth_group_of: Optional length\n              M numpy boolean array denoting whether a groundtruth box contains a\n              group of instances.\n\n        Raises:\n          ValueError: On adding groundtruth for an image more than once.\n        \"\"\"\n        if image_id in self._image_ids:\n            raise ValueError(\"Image with id {} already added.\".format(image_id))\n\n        groundtruth_classes = (\n            groundtruth_dict[standard_fields.InputDataFields.groundtruth_classes]\n            - self._label_id_offset\n        )\n        # If the key is not present in the groundtruth_dict or the array is empty\n        # (unless there are no annotations for the groundtruth on this image)\n        # use values from the dictionary or insert None otherwise.\n        if (\n            standard_fields.InputDataFields.groundtruth_group_of\n            in groundtruth_dict.keys()\n            and (\n                groundtruth_dict[\n                    standard_fields.InputDataFields.groundtruth_group_of\n                ].size\n                or not groundtruth_classes.size\n            )\n        ):\n            groundtruth_group_of = groundtruth_dict[\n                standard_fields.InputDataFields.groundtruth_group_of\n            ]\n        else:\n            groundtruth_group_of = None\n            if not len(self._image_ids) % 1000:\n                logging.warning(\n                    \"image %s does not have groundtruth group_of flag specified\",\n                    image_id,\n                )\n        
self._evaluation.add_single_ground_truth_image_info(\n            image_id,\n            groundtruth_dict[standard_fields.InputDataFields.groundtruth_boxes],\n            groundtruth_classes,\n            groundtruth_is_difficult_list=None,\n            groundtruth_is_group_of_list=groundtruth_group_of,\n        )\n        self._image_ids.update([image_id])\n\n\nObjectDetectionEvalMetrics = collections.namedtuple(\n    \"ObjectDetectionEvalMetrics\",\n    [\n        \"average_precisions\",\n        \"mean_ap\",\n        \"precisions\",\n        \"recalls\",\n        \"corlocs\",\n        \"mean_corloc\",\n    ],\n)\n\n\nclass ObjectDetectionEvaluation:\n    \"\"\"Internal implementation of Pascal object detection metrics.\"\"\"\n\n    def __init__(\n        self,\n        num_groundtruth_classes,\n        matching_iou_threshold=0.5,\n        nms_iou_threshold=1.0,\n        nms_max_output_boxes=10000,\n        use_weighted_mean_ap=False,\n        label_id_offset=0,\n    ):\n        if num_groundtruth_classes < 1:\n            raise ValueError(\"Need at least 1 groundtruth class for evaluation.\")\n\n        self.per_image_eval = per_image_evaluation.PerImageEvaluation(\n            num_groundtruth_classes=num_groundtruth_classes,\n            matching_iou_threshold=matching_iou_threshold,\n        )\n        self.num_class = num_groundtruth_classes\n        self.use_weighted_mean_ap = use_weighted_mean_ap\n        self.label_id_offset = label_id_offset\n\n        self.groundtruth_boxes = {}\n        self.groundtruth_class_labels = {}\n        self.groundtruth_masks = {}\n        self.groundtruth_is_difficult_list = {}\n        self.groundtruth_is_group_of_list = {}\n        self.num_gt_instances_per_class = np.zeros(self.num_class, dtype=int)\n        self.num_gt_imgs_per_class = np.zeros(self.num_class, dtype=int)\n\n        self._initialize_detections()\n\n    def _initialize_detections(self):\n        self.detection_keys = set()\n        self.scores_per_class = 
[[] for _ in range(self.num_class)]\n        self.tp_fp_labels_per_class = [[] for _ in range(self.num_class)]\n        self.num_images_correctly_detected_per_class = np.zeros(self.num_class)\n        self.average_precision_per_class = np.empty(self.num_class, dtype=float)\n        self.average_precision_per_class.fill(np.nan)\n        self.precisions_per_class = []\n        self.recalls_per_class = []\n        self.corloc_per_class = np.ones(self.num_class, dtype=float)\n\n    def clear_detections(self):\n        self._initialize_detections()\n\n    def add_single_ground_truth_image_info(\n        self,\n        image_key,\n        groundtruth_boxes,\n        groundtruth_class_labels,\n        groundtruth_is_difficult_list=None,\n        groundtruth_is_group_of_list=None,\n        groundtruth_masks=None,\n    ):\n        \"\"\"Adds groundtruth for a single image to be used for evaluation.\n\n        Args:\n          image_key: A unique string/integer identifier for the image.\n          groundtruth_boxes: float32 numpy array of shape [num_boxes, 4]\n            containing `num_boxes` groundtruth boxes of the format\n            [ymin, xmin, ymax, xmax] in absolute image coordinates.\n          groundtruth_class_labels: integer numpy array of shape [num_boxes]\n            containing 0-indexed groundtruth classes for the boxes.\n          groundtruth_is_difficult_list: A length M numpy boolean array denoting\n            whether a ground truth box is a difficult instance or not. To support\n            the case that no boxes are difficult, it is by default set as None.\n          groundtruth_is_group_of_list: A length M numpy boolean array denoting\n              whether a ground truth box is a group-of box or not. 
To support\n              the case that no boxes are groups-of, it is by default set as None.\n          groundtruth_masks: uint8 numpy array of shape\n            [num_boxes, height, width] containing `num_boxes` groundtruth masks.\n            The mask values range from 0 to 1.\n        \"\"\"\n        if image_key in self.groundtruth_boxes:\n            logging.warning(\n                \"image %s has already been added to the ground truth database.\",\n                image_key,\n            )\n            return\n\n        self.groundtruth_boxes[image_key] = groundtruth_boxes\n        self.groundtruth_class_labels[image_key] = groundtruth_class_labels\n        self.groundtruth_masks[image_key] = groundtruth_masks\n        if groundtruth_is_difficult_list is None:\n            num_boxes = groundtruth_boxes.shape[0]\n            groundtruth_is_difficult_list = np.zeros(num_boxes, dtype=bool)\n        self.groundtruth_is_difficult_list[image_key] = (\n            groundtruth_is_difficult_list.astype(dtype=bool)\n        )\n        if groundtruth_is_group_of_list is None:\n            num_boxes = groundtruth_boxes.shape[0]\n            groundtruth_is_group_of_list = np.zeros(num_boxes, dtype=bool)\n        self.groundtruth_is_group_of_list[image_key] = (\n            groundtruth_is_group_of_list.astype(dtype=bool)\n        )\n\n        self._update_ground_truth_statistics(\n            groundtruth_class_labels,\n            groundtruth_is_difficult_list.astype(dtype=bool),\n            groundtruth_is_group_of_list.astype(dtype=bool),\n        )\n\n    def add_single_detected_image_info(\n        self,\n        image_key,\n        detected_boxes,\n        detected_scores,\n        detected_class_labels,\n        detected_masks=None,\n    ):\n        \"\"\"Adds detections for a single image to be used for evaluation.\n\n        Args:\n          image_key: A unique string/integer identifier for the image.\n          detected_boxes: float32 numpy array of shape 
[num_boxes, 4]\n            containing `num_boxes` detection boxes of the format\n            [ymin, xmin, ymax, xmax] in absolute image coordinates.\n          detected_scores: float32 numpy array of shape [num_boxes] containing\n            detection scores for the boxes.\n          detected_class_labels: integer numpy array of shape [num_boxes] containing\n            0-indexed detection classes for the boxes.\n          detected_masks: np.uint8 numpy array of shape [num_boxes, height, width]\n            containing `num_boxes` detection masks with values ranging\n            between 0 and 1.\n\n        Raises:\n          ValueError: if the number of boxes, scores and class labels differ in\n            length.\n        \"\"\"\n        if len(detected_boxes) != len(detected_scores) or len(detected_boxes) != len(\n            detected_class_labels\n        ):\n            # Format all three lengths as one tuple; applying % to a single value\n            # would raise TypeError instead of the intended ValueError.\n            raise ValueError(\n                \"detected_boxes, detected_scores and \"\n                \"detected_class_labels should all have same lengths. Got \"\n                \"[%d, %d, %d]\"\n                % (\n                    len(detected_boxes),\n                    len(detected_scores),\n                    len(detected_class_labels),\n                )\n            )\n\n        if image_key in self.detection_keys:\n            logging.warning(\n                \"image %s has already been added to the detection result database\",\n                image_key,\n            )\n            return\n\n        self.detection_keys.add(image_key)\n        if image_key in self.groundtruth_boxes:\n            groundtruth_boxes = self.groundtruth_boxes[image_key]\n            groundtruth_class_labels = self.groundtruth_class_labels[image_key]\n            # Masks are popped instead of looked up. 
The reason is that we do not want\n            # to keep all masks in memory which can cause memory overflow.\n            groundtruth_masks = self.groundtruth_masks.pop(image_key)\n            groundtruth_is_difficult_list = self.groundtruth_is_difficult_list[\n                image_key\n            ]\n            groundtruth_is_group_of_list = self.groundtruth_is_group_of_list[image_key]\n        else:\n            groundtruth_boxes = np.empty(shape=[0, 4], dtype=float)\n            groundtruth_class_labels = np.array([], dtype=int)\n            if detected_masks is None:\n                groundtruth_masks = None\n            else:\n                groundtruth_masks = np.empty(shape=[0, 1, 1], dtype=float)\n            groundtruth_is_difficult_list = np.array([], dtype=bool)\n            groundtruth_is_group_of_list = np.array([], dtype=bool)\n        (\n            scores,\n            tp_fp_labels,\n        ) = self.per_image_eval.compute_object_detection_metrics(\n            detected_boxes=detected_boxes,\n            detected_scores=detected_scores,\n            detected_class_labels=detected_class_labels,\n            groundtruth_boxes=groundtruth_boxes,\n            groundtruth_class_labels=groundtruth_class_labels,\n            groundtruth_is_difficult_list=groundtruth_is_difficult_list,\n            groundtruth_is_group_of_list=groundtruth_is_group_of_list,\n            detected_masks=detected_masks,\n            groundtruth_masks=groundtruth_masks,\n        )\n\n        for i in range(self.num_class):\n            if scores[i].shape[0] > 0:\n                self.scores_per_class[i].append(scores[i])\n                self.tp_fp_labels_per_class[i].append(tp_fp_labels[i])\n\n    def _update_ground_truth_statistics(\n        self,\n        groundtruth_class_labels,\n        groundtruth_is_difficult_list,\n        groundtruth_is_group_of_list,\n    ):\n        \"\"\"Updates ground truth statistics.\n\n        1. 
Difficult boxes are ignored when counting the number of ground truth\n        instances as done in Pascal VOC devkit.\n        2. Difficult boxes are treated as normal boxes when computing CorLoc related\n        statistics.\n\n        Args:\n          groundtruth_class_labels: An integer numpy array of length M,\n              representing M class labels of object instances in ground truth\n          groundtruth_is_difficult_list: A boolean numpy array of length M denoting\n              whether a ground truth box is a difficult instance or not\n          groundtruth_is_group_of_list: A boolean numpy array of length M denoting\n              whether a ground truth box is a group-of box or not\n        \"\"\"\n        for class_index in range(self.num_class):\n            num_gt_instances = np.sum(\n                groundtruth_class_labels[\n                    ~groundtruth_is_difficult_list & ~groundtruth_is_group_of_list\n                ]\n                == class_index\n            )\n            self.num_gt_instances_per_class[class_index] += num_gt_instances\n            if np.any(groundtruth_class_labels == class_index):\n                self.num_gt_imgs_per_class[class_index] += 1\n\n    def evaluate(self):\n        \"\"\"Computes the evaluation result.\n\n        Returns:\n          A named tuple with the following fields -\n            average_precisions: float numpy array of average precision for\n                each class.\n            mean_ap: mean average precision of all classes, float scalar\n            precisions: List of precisions, each precision is a float numpy\n                array\n            recalls: List of recalls, each recall is a float numpy array\n            corlocs: float numpy array of CorLoc scores for each class\n            mean_corloc: Mean CorLoc score across all classes, float scalar\n        \"\"\"\n        if (self.num_gt_instances_per_class == 0).any():\n            logging.info(\n                \"The following classes have no ground truth examples: %s\",\n    
            np.squeeze(np.argwhere(self.num_gt_instances_per_class == 0))\n                + self.label_id_offset,\n            )\n\n        if self.use_weighted_mean_ap:\n            all_scores = np.array([], dtype=float)\n            all_tp_fp_labels = np.array([], dtype=bool)\n\n        for class_index in range(self.num_class):\n            if self.num_gt_instances_per_class[class_index] == 0:\n                continue\n            if not self.scores_per_class[class_index]:\n                scores = np.array([], dtype=float)\n                tp_fp_labels = np.array([], dtype=bool)\n            else:\n                scores = np.concatenate(self.scores_per_class[class_index])\n                tp_fp_labels = np.concatenate(self.tp_fp_labels_per_class[class_index])\n            if self.use_weighted_mean_ap:\n                all_scores = np.append(all_scores, scores)\n                all_tp_fp_labels = np.append(all_tp_fp_labels, tp_fp_labels)\n            precision, recall = metrics.compute_precision_recall(\n                scores,\n                tp_fp_labels,\n                self.num_gt_instances_per_class[class_index],\n            )\n            self.precisions_per_class.append(precision)\n            self.recalls_per_class.append(recall)\n            average_precision = metrics.compute_average_precision(precision, recall)\n            self.average_precision_per_class[class_index] = average_precision\n\n        self.corloc_per_class = metrics.compute_cor_loc(\n            self.num_gt_imgs_per_class,\n            self.num_images_correctly_detected_per_class,\n        )\n\n        if self.use_weighted_mean_ap:\n            num_gt_instances = np.sum(self.num_gt_instances_per_class)\n            precision, recall = metrics.compute_precision_recall(\n                all_scores, all_tp_fp_labels, num_gt_instances\n            )\n            mean_ap = metrics.compute_average_precision(precision, recall)\n        else:\n            mean_ap = 
np.nanmean(self.average_precision_per_class)\n        mean_corloc = np.nanmean(self.corloc_per_class)\n        return ObjectDetectionEvalMetrics(\n            self.average_precision_per_class,\n            mean_ap,\n            self.precisions_per_class,\n            self.recalls_per_class,\n            self.corloc_per_class,\n            mean_corloc,\n        )\n"
  },
  {
    "path": "ava_evaluation/per_image_evaluation.py",
    "content": "# Copyright 2017 The TensorFlow Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\"\"\"Evaluate Object Detection result on a single image.\n\nAnnotate each detected result as true positives or false positive according to\na predefined IOU ratio. Non Maximum Supression is used by default. Multi class\ndetection is supported by default.\nBased on the settings, per image evaluation is either performed on boxes or\non object masks.\n\"\"\"\n\nfrom __future__ import absolute_import, division, print_function, unicode_literals\n\nimport numpy as np\n\nfrom . 
import np_box_list, np_box_list_ops\n\n\nclass PerImageEvaluation:\n    \"\"\"Evaluates the detection result of a single image.\"\"\"\n\n    def __init__(self, num_groundtruth_classes, matching_iou_threshold=0.5):\n        \"\"\"Initializes PerImageEvaluation with evaluation parameters.\n\n        Args:\n          num_groundtruth_classes: Number of ground truth object classes\n          matching_iou_threshold: A ratio of area intersection to union, which is\n              the threshold for deciding whether a detection is a true positive\n        \"\"\"\n        self.matching_iou_threshold = matching_iou_threshold\n        self.num_groundtruth_classes = num_groundtruth_classes\n\n    def compute_object_detection_metrics(\n        self,\n        detected_boxes,\n        detected_scores,\n        detected_class_labels,\n        groundtruth_boxes,\n        groundtruth_class_labels,\n        groundtruth_is_difficult_list,\n        groundtruth_is_group_of_list,\n        detected_masks=None,\n        groundtruth_masks=None,\n    ):\n        \"\"\"Evaluates detections as being tp, fp or ignored from a single image.\n\n        The evaluation is done in two stages:\n         1. All detections are matched to non group-of boxes; true positives are\n            determined and detections matched to difficult boxes are ignored.\n         2. 
Detections that are determined as false positives are matched against\n            group-of boxes and ignored if matched.\n\n        Args:\n          detected_boxes: A float numpy array of shape [N, 4], representing N\n              detected object regions.\n              Each row is of the format [y_min, x_min, y_max, x_max]\n          detected_scores: A float numpy array of shape [N, 1], representing\n              the confidence scores of the detected N object instances.\n          detected_class_labels: An integer numpy array of shape [N, 1],\n              representing the class labels of the detected N object instances.\n          groundtruth_boxes: A float numpy array of shape [M, 4], representing M\n              regions of object instances in ground truth\n          groundtruth_class_labels: An integer numpy array of shape [M, 1],\n              representing M class labels of object instances in ground truth\n          groundtruth_is_difficult_list: A boolean numpy array of length M denoting\n              whether a ground truth box is a difficult instance or not\n          groundtruth_is_group_of_list: A boolean numpy array of length M denoting\n              whether a ground truth box has group-of tag\n          detected_masks: (optional) A uint8 numpy array of shape\n            [N, height, width]. If not None, the metrics will be computed based\n            on masks.\n          groundtruth_masks: (optional) A uint8 numpy array of shape\n            [M, height, width].\n\n        Returns:\n          scores: A list of C float numpy arrays. Each numpy array is of\n              shape [K, 1], representing K scores detected with object class\n              label c\n          tp_fp_labels: A list of C boolean numpy arrays. 
Each numpy array\n              is of shape [K, 1], representing K True/False positive labels of\n              object instances detected with class label c\n        \"\"\"\n        (\n            detected_boxes,\n            detected_scores,\n            detected_class_labels,\n            detected_masks,\n        ) = self._remove_invalid_boxes(\n            detected_boxes,\n            detected_scores,\n            detected_class_labels,\n            detected_masks,\n        )\n        scores, tp_fp_labels = self._compute_tp_fp(\n            detected_boxes=detected_boxes,\n            detected_scores=detected_scores,\n            detected_class_labels=detected_class_labels,\n            groundtruth_boxes=groundtruth_boxes,\n            groundtruth_class_labels=groundtruth_class_labels,\n            groundtruth_is_difficult_list=groundtruth_is_difficult_list,\n            groundtruth_is_group_of_list=groundtruth_is_group_of_list,\n            detected_masks=detected_masks,\n            groundtruth_masks=groundtruth_masks,\n        )\n\n        return scores, tp_fp_labels\n\n    def _compute_tp_fp(\n        self,\n        detected_boxes,\n        detected_scores,\n        detected_class_labels,\n        groundtruth_boxes,\n        groundtruth_class_labels,\n        groundtruth_is_difficult_list,\n        groundtruth_is_group_of_list,\n        detected_masks=None,\n        groundtruth_masks=None,\n    ):\n        \"\"\"Labels true/false positives of detections of an image across all classes.\n\n        Args:\n          detected_boxes: A float numpy array of shape [N, 4], representing N\n              detected object regions.\n              Each row is of the format [y_min, x_min, y_max, x_max]\n          detected_scores: A float numpy array of shape [N, 1], representing\n              the confidence scores of the detected N object instances.\n          detected_class_labels: An integer numpy array of shape [N, 1],\n              representing the class labels 
of the detected N object instances.\n          groundtruth_boxes: A float numpy array of shape [M, 4], representing M\n              regions of object instances in ground truth\n          groundtruth_class_labels: An integer numpy array of shape [M, 1],\n              representing M class labels of object instances in ground truth\n          groundtruth_is_difficult_list: A boolean numpy array of length M denoting\n              whether a ground truth box is a difficult instance or not\n          groundtruth_is_group_of_list: A boolean numpy array of length M denoting\n              whether a ground truth box has group-of tag\n          detected_masks: (optional) A np.uint8 numpy array of shape\n            [N, height, width]. If not None, the scores will be computed based\n            on masks.\n          groundtruth_masks: (optional) A np.uint8 numpy array of shape\n            [M, height, width].\n\n        Returns:\n          result_scores: A list of float numpy arrays. Each numpy array is of\n              shape [K, 1], representing K scores detected with object class\n              label c\n          result_tp_fp_labels: A list of boolean numpy array. 
Each numpy array is of\n              shape [K, 1], representing K True/False positive label of object\n              instances detected with class label c\n\n        Raises:\n          ValueError: If detected masks is not None but groundtruth masks are None,\n            or the other way around.\n        \"\"\"\n        if detected_masks is not None and groundtruth_masks is None:\n            raise ValueError(\n                \"Detected masks is available but groundtruth masks is not.\"\n            )\n        if detected_masks is None and groundtruth_masks is not None:\n            raise ValueError(\n                \"Groundtruth masks is available but detected masks is not.\"\n            )\n\n        result_scores = []\n        result_tp_fp_labels = []\n        for i in range(self.num_groundtruth_classes):\n            groundtruth_is_difficult_list_at_ith_class = groundtruth_is_difficult_list[\n                groundtruth_class_labels == i\n            ]\n            groundtruth_is_group_of_list_at_ith_class = groundtruth_is_group_of_list[\n                groundtruth_class_labels == i\n            ]\n            (\n                gt_boxes_at_ith_class,\n                gt_masks_at_ith_class,\n                detected_boxes_at_ith_class,\n                detected_scores_at_ith_class,\n                detected_masks_at_ith_class,\n            ) = self._get_ith_class_arrays(\n                detected_boxes,\n                detected_scores,\n                detected_masks,\n                detected_class_labels,\n                groundtruth_boxes,\n                groundtruth_masks,\n                groundtruth_class_labels,\n                i,\n            )\n            scores, tp_fp_labels = self._compute_tp_fp_for_single_class(\n                detected_boxes=detected_boxes_at_ith_class,\n                detected_scores=detected_scores_at_ith_class,\n                groundtruth_boxes=gt_boxes_at_ith_class,\n                
groundtruth_is_difficult_list=groundtruth_is_difficult_list_at_ith_class,\n                groundtruth_is_group_of_list=groundtruth_is_group_of_list_at_ith_class,\n                detected_masks=detected_masks_at_ith_class,\n                groundtruth_masks=gt_masks_at_ith_class,\n            )\n            result_scores.append(scores)\n            result_tp_fp_labels.append(tp_fp_labels)\n        return result_scores, result_tp_fp_labels\n\n    def _get_overlaps_and_scores_box_mode(\n        self,\n        detected_boxes,\n        detected_scores,\n        groundtruth_boxes,\n        groundtruth_is_group_of_list,\n    ):\n        \"\"\"Computes overlaps and scores between detected and groundtruth boxes.\n\n        Args:\n          detected_boxes: A numpy array of shape [N, 4] representing detected box\n              coordinates\n          detected_scores: A 1-d numpy array of length N representing classification\n              score\n          groundtruth_boxes: A numpy array of shape [M, 4] representing ground truth\n              box coordinates\n          groundtruth_is_group_of_list: A boolean numpy array of length M denoting\n              whether a ground truth box has group-of tag. If a groundtruth box\n              is group-of box, every detection matching this box is ignored.\n\n        Returns:\n          iou: A float numpy array of size [num_detected_boxes, num_gt_boxes]. If\n              gt_non_group_of_boxlist.num_boxes() == 0 it will be None.\n          ioa: A float numpy array of size [num_detected_boxes, num_gt_boxes]. 
If\n              gt_group_of_boxlist.num_boxes() == 0 it will be None.\n          scores: The score of the detected boxlist.\n          num_boxes: Number of non-maximum suppressed detected boxes.\n        \"\"\"\n        detected_boxlist = np_box_list.BoxList(detected_boxes)\n        detected_boxlist.add_field(\"scores\", detected_scores)\n        gt_non_group_of_boxlist = np_box_list.BoxList(\n            groundtruth_boxes[~groundtruth_is_group_of_list]\n        )\n        iou = np_box_list_ops.iou(detected_boxlist, gt_non_group_of_boxlist)\n        scores = detected_boxlist.get_field(\"scores\")\n        num_boxes = detected_boxlist.num_boxes()\n        return iou, None, scores, num_boxes\n\n    def _compute_tp_fp_for_single_class(\n        self,\n        detected_boxes,\n        detected_scores,\n        groundtruth_boxes,\n        groundtruth_is_difficult_list,\n        groundtruth_is_group_of_list,\n        detected_masks=None,\n        groundtruth_masks=None,\n    ):\n        \"\"\"Labels boxes detected with the same class from the same image as tp/fp.\n\n        Args:\n          detected_boxes: A numpy array of shape [N, 4] representing detected box\n              coordinates\n          detected_scores: A 1-d numpy array of length N representing classification\n              score\n          groundtruth_boxes: A numpy array of shape [M, 4] representing ground truth\n              box coordinates\n          groundtruth_is_difficult_list: A boolean numpy array of length M denoting\n              whether a ground truth box is a difficult instance or not. If a\n              groundtruth box is difficult, every detection matching this box\n              is ignored.\n          groundtruth_is_group_of_list: A boolean numpy array of length M denoting\n              whether a ground truth box has group-of tag. 
If a groundtruth box\n              is group-of box, every detection matching this box is ignored.\n          detected_masks: (optional) A uint8 numpy array of shape\n            [N, height, width]. If not None, the scores will be computed based\n            on masks.\n          groundtruth_masks: (optional) A uint8 numpy array of shape\n            [M, height, width].\n\n        Returns:\n          Two arrays of the same size, containing all boxes that were evaluated as\n          being true positives or false positives; if a box matched to a difficult\n          box or to a group-of box, it is ignored.\n\n          scores: A numpy array representing the detection scores.\n          tp_fp_labels: a boolean numpy array indicating whether a detection is a\n              true positive.\n        \"\"\"\n        if detected_boxes.size == 0:\n            return np.array([], dtype=float), np.array([], dtype=bool)\n\n        (\n            iou,\n            _,\n            scores,\n            num_detected_boxes,\n        ) = self._get_overlaps_and_scores_box_mode(\n            detected_boxes=detected_boxes,\n            detected_scores=detected_scores,\n            groundtruth_boxes=groundtruth_boxes,\n            groundtruth_is_group_of_list=groundtruth_is_group_of_list,\n        )\n\n        if groundtruth_boxes.size == 0:\n            return scores, np.zeros(num_detected_boxes, dtype=bool)\n\n        tp_fp_labels = np.zeros(num_detected_boxes, dtype=bool)\n        is_matched_to_difficult_box = np.zeros(num_detected_boxes, dtype=bool)\n        is_matched_to_group_of_box = np.zeros(num_detected_boxes, dtype=bool)\n\n        # The evaluation is done in two stages:\n        # 1. All detections are matched to non group-of boxes; true positives are\n        #    determined and detections matched to difficult boxes are ignored.\n        # 2. 
Detections that are determined as false positives are matched against\n        #    group-of boxes and ignored if matched.\n\n        # Tp-fp evaluation for non-group of boxes (if any).\n        if iou.shape[1] > 0:\n            groundtruth_nongroup_of_is_difficult_list = groundtruth_is_difficult_list[\n                ~groundtruth_is_group_of_list\n            ]\n            max_overlap_gt_ids = np.argmax(iou, axis=1)\n            is_gt_box_detected = np.zeros(iou.shape[1], dtype=bool)\n            for i in range(num_detected_boxes):\n                gt_id = max_overlap_gt_ids[i]\n                if iou[i, gt_id] >= self.matching_iou_threshold:\n                    if not groundtruth_nongroup_of_is_difficult_list[gt_id]:\n                        if not is_gt_box_detected[gt_id]:\n                            tp_fp_labels[i] = True\n                            is_gt_box_detected[gt_id] = True\n                    else:\n                        is_matched_to_difficult_box[i] = True\n\n        return (\n            scores[~is_matched_to_difficult_box & ~is_matched_to_group_of_box],\n            tp_fp_labels[~is_matched_to_difficult_box & ~is_matched_to_group_of_box],\n        )\n\n    def _get_ith_class_arrays(\n        self,\n        detected_boxes,\n        detected_scores,\n        detected_masks,\n        detected_class_labels,\n        groundtruth_boxes,\n        groundtruth_masks,\n        groundtruth_class_labels,\n        class_index,\n    ):\n        \"\"\"Returns numpy arrays belonging to class with index `class_index`.\n\n        Args:\n          detected_boxes: A numpy array containing detected boxes.\n          detected_scores: A numpy array containing detected scores.\n          detected_masks: A numpy array containing detected masks.\n          detected_class_labels: A numpy array containing detected class labels.\n          groundtruth_boxes: A numpy array containing groundtruth boxes.\n          groundtruth_masks: A numpy array containing groundtruth 
masks.\n          groundtruth_class_labels: A numpy array containing groundtruth class\n            labels.\n          class_index: An integer index.\n\n        Returns:\n          gt_boxes_at_ith_class: A numpy array containing groundtruth boxes labeled\n            as ith class.\n          gt_masks_at_ith_class: A numpy array containing groundtruth masks labeled\n            as ith class.\n          detected_boxes_at_ith_class: A numpy array containing detected boxes\n            corresponding to the ith class.\n          detected_scores_at_ith_class: A numpy array containing detected scores\n            corresponding to the ith class.\n          detected_masks_at_ith_class: A numpy array containing detected masks\n            corresponding to the ith class.\n        \"\"\"\n        selected_groundtruth = groundtruth_class_labels == class_index\n        gt_boxes_at_ith_class = groundtruth_boxes[selected_groundtruth]\n        if groundtruth_masks is not None:\n            gt_masks_at_ith_class = groundtruth_masks[selected_groundtruth]\n        else:\n            gt_masks_at_ith_class = None\n        selected_detections = detected_class_labels == class_index\n        detected_boxes_at_ith_class = detected_boxes[selected_detections]\n        detected_scores_at_ith_class = detected_scores[selected_detections]\n        if detected_masks is not None:\n            detected_masks_at_ith_class = detected_masks[selected_detections]\n        else:\n            detected_masks_at_ith_class = None\n        return (\n            gt_boxes_at_ith_class,\n            gt_masks_at_ith_class,\n            detected_boxes_at_ith_class,\n            detected_scores_at_ith_class,\n            detected_masks_at_ith_class,\n        )\n\n    def _remove_invalid_boxes(\n        self,\n        detected_boxes,\n        detected_scores,\n        detected_class_labels,\n        detected_masks=None,\n    ):\n        \"\"\"Removes entries with invalid boxes.\n\n        A box is invalid if either 
its xmax is not greater than its xmin, or\n        its ymax is not greater than its ymin.\n\n        Args:\n          detected_boxes: A float numpy array of size [num_boxes, 4] containing box\n            coordinates in [ymin, xmin, ymax, xmax] format.\n          detected_scores: A float numpy array of size [num_boxes].\n          detected_class_labels: An int32 numpy array of size [num_boxes].\n          detected_masks: A uint8 numpy array of size [num_boxes, height, width].\n\n        Returns:\n          valid_detected_boxes: A float numpy array of size [num_valid_boxes, 4]\n            containing box coordinates in [ymin, xmin, ymax, xmax] format.\n          valid_detected_scores: A float numpy array of size [num_valid_boxes].\n          valid_detected_class_labels: An int32 numpy array of size\n            [num_valid_boxes].\n          valid_detected_masks: A uint8 numpy array of size\n            [num_valid_boxes, height, width].\n        \"\"\"\n        valid_indices = np.logical_and(\n            detected_boxes[:, 0] < detected_boxes[:, 2],\n            detected_boxes[:, 1] < detected_boxes[:, 3],\n        )\n        detected_boxes = detected_boxes[valid_indices]\n        detected_scores = detected_scores[valid_indices]\n        detected_class_labels = detected_class_labels[valid_indices]\n        if detected_masks is not None:\n            detected_masks = detected_masks[valid_indices]\n        return [\n            detected_boxes,\n            detected_scores,\n            detected_class_labels,\n            detected_masks,\n        ]\n"
  },
  {
    "path": "ava_evaluation/standard_fields.py",
    "content": "# Copyright 2017 The TensorFlow Authors. All Rights Reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n# ==============================================================================\n\n\"\"\"Contains classes specifying naming conventions used for object detection.\n\n\nSpecifies:\n  InputDataFields: standard fields used by reader/preprocessor/batcher.\n  DetectionResultFields: standard fields returned by object detector.\n  BoxListFields: standard field used by BoxList\n  TfExampleFields: standard fields for tf-example data format (go/tf-example).\n\"\"\"\n\nfrom __future__ import absolute_import, division, print_function, unicode_literals\n\n\nclass InputDataFields:\n    \"\"\"Names for the input tensors.\n\n    Holds the standard data field names to use for identifying input tensors. This\n    should be used by the decoder to identify keys for the returned tensor_dict\n    containing input tensors. 
And it should be used by the model to identify the\n    tensors it needs.\n\n    Attributes:\n      image: image.\n      original_image: image in the original input size.\n      key: unique key corresponding to image.\n      source_id: source of the original image.\n      filename: original filename of the dataset (without common path).\n      groundtruth_image_classes: image-level class labels.\n      groundtruth_boxes: coordinates of the ground truth boxes in the image.\n      groundtruth_classes: box-level class labels.\n      groundtruth_label_types: box-level label types (e.g. explicit negative).\n      groundtruth_is_crowd: [DEPRECATED, use groundtruth_group_of instead]\n        is the groundtruth a single object or a crowd.\n      groundtruth_area: area of a groundtruth segment.\n      groundtruth_difficult: is a `difficult` object\n      groundtruth_group_of: is a `group_of` objects, e.g. multiple objects of the\n        same class, forming a connected group, where instances are heavily\n        occluding each other.\n      proposal_boxes: coordinates of object proposal boxes.\n      proposal_objectness: objectness score of each proposal.\n      groundtruth_instance_masks: ground truth instance masks.\n      groundtruth_instance_boundaries: ground truth instance boundaries.\n      groundtruth_instance_classes: instance mask-level class labels.\n      groundtruth_keypoints: ground truth keypoints.\n      groundtruth_keypoint_visibilities: ground truth keypoint visibilities.\n      groundtruth_label_scores: groundtruth label scores.\n      groundtruth_weights: groundtruth weight factor for bounding boxes.\n      num_groundtruth_boxes: number of groundtruth boxes.\n      true_image_shape: true shapes of images in the resized images, as resized\n        images can be padded with zeros.\n    \"\"\"\n\n    image = \"image\"\n    original_image = \"original_image\"\n    key = \"key\"\n    source_id = \"source_id\"\n    filename = \"filename\"\n    
groundtruth_image_classes = \"groundtruth_image_classes\"\n    groundtruth_boxes = \"groundtruth_boxes\"\n    groundtruth_classes = \"groundtruth_classes\"\n    groundtruth_label_types = \"groundtruth_label_types\"\n    groundtruth_is_crowd = \"groundtruth_is_crowd\"\n    groundtruth_area = \"groundtruth_area\"\n    groundtruth_difficult = \"groundtruth_difficult\"\n    groundtruth_group_of = \"groundtruth_group_of\"\n    proposal_boxes = \"proposal_boxes\"\n    proposal_objectness = \"proposal_objectness\"\n    groundtruth_instance_masks = \"groundtruth_instance_masks\"\n    groundtruth_instance_boundaries = \"groundtruth_instance_boundaries\"\n    groundtruth_instance_classes = \"groundtruth_instance_classes\"\n    groundtruth_keypoints = \"groundtruth_keypoints\"\n    groundtruth_keypoint_visibilities = \"groundtruth_keypoint_visibilities\"\n    groundtruth_label_scores = \"groundtruth_label_scores\"\n    groundtruth_weights = \"groundtruth_weights\"\n    num_groundtruth_boxes = \"num_groundtruth_boxes\"\n    true_image_shape = \"true_image_shape\"\n\n\nclass DetectionResultFields:\n    \"\"\"Naming conventions for storing the output of the detector.\n\n    Attributes:\n      source_id: source of the original image.\n      key: unique key corresponding to image.\n      detection_boxes: coordinates of the detection boxes in the image.\n      detection_scores: detection scores for the detection boxes in the image.\n      detection_classes: detection-level class labels.\n      detection_masks: contains a segmentation mask for each detection box.\n      detection_boundaries: contains an object boundary for each detection box.\n      detection_keypoints: contains detection keypoints for each detection box.\n      num_detections: number of detections in the batch.\n    \"\"\"\n\n    source_id = \"source_id\"\n    key = \"key\"\n    detection_boxes = \"detection_boxes\"\n    detection_scores = \"detection_scores\"\n    detection_classes = \"detection_classes\"\n    
detection_masks = \"detection_masks\"\n    detection_boundaries = \"detection_boundaries\"\n    detection_keypoints = \"detection_keypoints\"\n    num_detections = \"num_detections\"\n\n\nclass BoxListFields:\n    \"\"\"Naming conventions for BoxLists.\n\n    Attributes:\n      boxes: bounding box coordinates.\n      classes: classes per bounding box.\n      scores: scores per bounding box.\n      weights: sample weights per bounding box.\n      objectness: objectness score per bounding box.\n      masks: masks per bounding box.\n      boundaries: boundaries per bounding box.\n      keypoints: keypoints per bounding box.\n      keypoint_heatmaps: keypoint heatmaps per bounding box.\n    \"\"\"\n\n    boxes = \"boxes\"\n    classes = \"classes\"\n    scores = \"scores\"\n    weights = \"weights\"\n    objectness = \"objectness\"\n    masks = \"masks\"\n    boundaries = \"boundaries\"\n    keypoints = \"keypoints\"\n    keypoint_heatmaps = \"keypoint_heatmaps\"\n\n\nclass TfExampleFields:\n    \"\"\"TF-example proto feature names for object detection.\n\n    Holds the standard feature names to load from an Example proto for object\n    detection.\n\n    Attributes:\n      image_encoded: JPEG encoded string\n      image_format: image format, e.g. \"JPEG\"\n      filename: filename\n      channels: number of channels of image\n      colorspace: colorspace, e.g. \"RGB\"\n      height: height of image in pixels, e.g. 462\n      width: width of image in pixels, e.g. 581\n      source_id: original source of the image\n      object_class_text: labels in text format, e.g. [\"person\", \"cat\"]\n      object_class_label: labels in numbers, e.g. [16, 8]\n      object_bbox_xmin: xmin coordinates of groundtruth box, e.g. 10, 30\n      object_bbox_xmax: xmax coordinates of groundtruth box, e.g. 50, 40\n      object_bbox_ymin: ymin coordinates of groundtruth box, e.g. 40, 50\n      object_bbox_ymax: ymax coordinates of groundtruth box, e.g. 
80, 70\n      object_view: viewpoint of object, e.g. [\"frontal\", \"left\"]\n      object_truncated: is object truncated, e.g. [true, false]\n      object_occluded: is object occluded, e.g. [true, false]\n      object_difficult: is object difficult, e.g. [true, false]\n      object_group_of: is object a single object or a group of objects\n      object_depiction: is object a depiction\n      object_is_crowd: [DEPRECATED, use object_group_of instead]\n        is the object a single object or a crowd\n      object_segment_area: the area of the segment.\n      object_weight: a weight factor for the object's bounding box.\n      instance_masks: instance segmentation masks.\n      instance_boundaries: instance boundaries.\n      instance_classes: Classes for each instance segmentation mask.\n      detection_class_label: class label in numbers.\n      detection_bbox_ymin: ymin coordinates of a detection box.\n      detection_bbox_xmin: xmin coordinates of a detection box.\n      detection_bbox_ymax: ymax coordinates of a detection box.\n      detection_bbox_xmax: xmax coordinates of a detection box.\n      detection_score: detection score for the class label and box.\n    \"\"\"\n\n    image_encoded = \"image/encoded\"\n    image_format = \"image/format\"  # format is reserved keyword\n    filename = \"image/filename\"\n    channels = \"image/channels\"\n    colorspace = \"image/colorspace\"\n    height = \"image/height\"\n    width = \"image/width\"\n    source_id = \"image/source_id\"\n    object_class_text = \"image/object/class/text\"\n    object_class_label = \"image/object/class/label\"\n    object_bbox_ymin = \"image/object/bbox/ymin\"\n    object_bbox_xmin = \"image/object/bbox/xmin\"\n    object_bbox_ymax = \"image/object/bbox/ymax\"\n    object_bbox_xmax = \"image/object/bbox/xmax\"\n    object_view = \"image/object/view\"\n    object_truncated = \"image/object/truncated\"\n    object_occluded = \"image/object/occluded\"\n    object_difficult = 
\"image/object/difficult\"\n    object_group_of = \"image/object/group_of\"\n    object_depiction = \"image/object/depiction\"\n    object_is_crowd = \"image/object/is_crowd\"\n    object_segment_area = \"image/object/segment/area\"\n    object_weight = \"image/object/weight\"\n    instance_masks = \"image/segmentation/object\"\n    instance_boundaries = \"image/boundaries/object\"\n    instance_classes = \"image/segmentation/object/class\"\n    detection_class_label = \"image/detection/label\"\n    detection_bbox_ymin = \"image/detection/bbox/ymin\"\n    detection_bbox_xmin = \"image/detection/bbox/xmin\"\n    detection_bbox_ymax = \"image/detection/bbox/ymax\"\n    detection_bbox_xmax = \"image/detection/bbox/xmax\"\n    detection_score = \"image/detection/score\"\n"
  },
  {
    "path": "configs/AVA/SLOWFAST_32x2_R50_SHORT.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: ava\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 5\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: path to the pretrain checkpoint file.\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3, 3]\nDETECTION:\n  ENABLE: True\n  ALIGNED: True\nAVA:\n  DETECTION_SCORE_THRESH: 0.8\n  TRAIN_PREDICT_BOX_LISTS: [\n    \"ava_train_v2.2.csv\",\n    \"person_box_67091280_iou90/ava_detection_train_boxes_and_labels_include_negative_v2.2.csv\",\n  ]\n  TEST_PREDICT_BOX_LISTS: [\"person_box_67091280_iou90/ava_detection_val_boxes_and_labels.csv\"]\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 7\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [2, 2]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\n  POOL: [[[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]]]\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: steps_with_relative_lrs\n  STEPS: [0, 10, 15, 20]\n  LRS: [1, 0.1, 0.01, 0.001]\n  MAX_EPOCH: 20\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-7\n  WARMUP_EPOCHS: 5.0\n  WARMUP_START_LR: 0.000125\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 80\n  ARCH: slowfast\n  MODEL_NAME: SlowFast\n  LOSS_FUNC: bce\n  DROPOUT_RATE: 0.5\n  HEAD_ACT: sigmoid\nTEST:\n  ENABLE: True\n  DATASET: ava\n  BATCH_SIZE: 8\nDATA_LOADER:\n  NUM_WORKERS: 2\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/AVA/SLOW_8x8_R50_SHORT.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: ava\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 5\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: path to the pretrain checkpoint file.\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 4\n  SAMPLING_RATE: 16\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\nDETECTION:\n  ENABLE: True\n  ALIGNED: True\nAVA:\n  DETECTION_SCORE_THRESH: 0.9\n  TRAIN_PREDICT_BOX_LISTS: [\n    \"ava_train_v2.2.csv\",\n    \"person_box_67091280_iou90/ava_detection_train_boxes_and_labels_include_negative_v2.2.csv\",\n  ]\n  TEST_PREDICT_BOX_LISTS: [\"person_box_67091280_iou90/ava_detection_val_boxes_and_labels.csv\"]\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\n  SPATIAL_DILATIONS: [[1], [1], [1], [2]]\n  SPATIAL_STRIDES: [[1], [2], [2], [1]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: softmax\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: steps_with_relative_lrs\n  STEPS: [0, 10, 15, 20]\n  LRS: [1, 0.1, 0.01, 0.001]\n  MAX_EPOCH: 20\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-7\n  WARMUP_EPOCHS: 5.0\n  WARMUP_START_LR: 0.000125\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 80\n  ARCH: slow\n  MODEL_NAME: ResNet\n  LOSS_FUNC: bce\n  DROPOUT_RATE: 0.5\n  HEAD_ACT: sigmoid\nTEST:\n  ENABLE: True\n  DATASET: ava\n  BATCH_SIZE: 8\nDATA_LOADER:\n  NUM_WORKERS: 2\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/AVA/c2/SLOWFAST_32x2_R101_50_50.yaml",
    "content": "TRAIN:\n  ENABLE: False\n  DATASET: ava\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 1\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: path to pretrain model\n  CHECKPOINT_TYPE: pytorch\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\nDETECTION:\n  ENABLE: True\n  ALIGNED: False\nAVA:\n  BGR: False\n  DETECTION_SCORE_THRESH: 0.8\n  TEST_PREDICT_BOX_LISTS: [\"person_box_67091280_iou90/ava_detection_val_boxes_and_labels.csv\"]\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 5\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 101\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [2, 2]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[6, 13, 20], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\n  POOL: [[[2, 2, 2], [2, 2, 2]], [[2, 2, 2], [2, 2, 2]], [[2, 2, 2], [2, 2, 2]], [[2, 2, 2], [2, 2, 2]]]\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-7\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 80\n  ARCH: slowfast\n  MODEL_NAME: SlowFast\n  LOSS_FUNC: bce\n  DROPOUT_RATE: 0.5\n  HEAD_ACT: sigmoid\nTEST:\n  ENABLE: True\n  DATASET: ava\n  BATCH_SIZE: 8\nDATA_LOADER:\n  NUM_WORKERS: 2\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/AVA/c2/SLOWFAST_32x2_R101_50_50_v2.1.yaml",
    "content": "TRAIN:\n  ENABLE: False\n  DATASET: ava\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 1\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: path to pretrain model\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\nDETECTION:\n  ENABLE: True\n  ALIGNED: False\nAVA:\n  BGR: False\n  DETECTION_SCORE_THRESH: 0.8\n  TEST_PREDICT_BOX_LISTS: [\"person_box_67091280_iou90/ava_detection_val_boxes_and_labels.csv\"]\n  TRAIN_GT_BOX_LISTS: [\"ava_train_v2.1.csv\"]\n  LABEL_MAP_FILE: ava_action_list_v2.1_for_activitynet_2018.pbtxt\n  EXCLUSION_FILE: ava_val_excluded_timestamps_v2.1.csv\n  GROUNDTRUTH_FILE: ava_val_v2.1.csv\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 5\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 101\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [2, 2]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[6, 13, 20], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\n  POOL: [[[2, 2, 2], [2, 2, 2]], [[2, 2, 2], [2, 2, 2]], [[2, 2, 2], [2, 2, 2]], [[2, 2, 2], [2, 2, 2]]]\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-7\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 80\n  ARCH: slowfast\n  MODEL_NAME: SlowFast\n  LOSS_FUNC: bce\n  DROPOUT_RATE: 0.5\n  HEAD_ACT: sigmoid\nTEST:\n  ENABLE: True\n  DATASET: ava\n  BATCH_SIZE: 8\nDATA_LOADER:\n  NUM_WORKERS: 2\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/AVA/c2/SLOWFAST_32x2_R50.yaml",
    "content": "TRAIN:\n  ENABLE: False\n  DATASET: ava\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 1\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: path to pretrain model\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\nDETECTION:\n  ENABLE: True\n  ALIGNED: False\nAVA:\n  BGR: False\n  DETECTION_SCORE_THRESH: 0.8\n  TEST_PREDICT_BOX_LISTS: [\"person_box_67091280_iou90/ava_detection_val_boxes_and_labels.csv\"]\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 7\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [2, 2]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\n  POOL: [[[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]]]\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-7\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 80\n  ARCH: slowfast\n  MODEL_NAME: SlowFast\n  LOSS_FUNC: bce\n  DROPOUT_RATE: 0.5\n  HEAD_ACT: sigmoid\nTEST:\n  ENABLE: True\n  DATASET: ava\n  BATCH_SIZE: 8\nDATA_LOADER:\n  NUM_WORKERS: 2\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/AVA/c2/SLOWFAST_64x2_R101_50_50.yaml",
    "content": "TRAIN:\n  ENABLE: False\n  DATASET: ava\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 1\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: path to pretrain model\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 64\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\nDETECTION:\n  ENABLE: True\n  ALIGNED: False\nAVA:\n  BGR: False\n  DETECTION_SCORE_THRESH: 0.8\n  TEST_PREDICT_BOX_LISTS: [\"person_box_67091280_iou90/ava_detection_val_boxes_and_labels.csv\"]\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 5\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 101\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [2, 2]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[6, 13, 20], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\n  POOL: [[[2, 2, 2], [2, 2, 2]], [[2, 2, 2], [2, 2, 2]], [[2, 2, 2], [2, 2, 2]], [[2, 2, 2], [2, 2, 2]]]\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-7\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 80\n  ARCH: slowfast\n  MODEL_NAME: SlowFast\n  LOSS_FUNC: bce\n  DROPOUT_RATE: 0.5\n  HEAD_ACT: sigmoid\nTEST:\n  ENABLE: True\n  DATASET: ava\n  BATCH_SIZE: 8\nDATA_LOADER:\n  NUM_WORKERS: 0\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/AVA/c2/SLOW_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: False\n  DATASET: ava\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 1\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: path to pretrain model\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 4\n  SAMPLING_RATE: 16\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\nDETECTION:\n  ENABLE: True\n  ALIGNED: False\nAVA:\n  BGR: False\n  DETECTION_SCORE_THRESH: 0.9\n  TEST_PREDICT_BOX_LISTS: [\"person_box_67091280_iou75/ava_detection_val_boxes_and_labels.csv\"]\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\n  SPATIAL_DILATIONS: [[1], [1], [1], [2]]\n  SPATIAL_STRIDES: [[1], [2], [2], [1]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: softmax\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-7\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 80\n  ARCH: slow\n  MODEL_NAME: ResNet\n  LOSS_FUNC: bce\n  DROPOUT_RATE: 0.5\n  HEAD_ACT: sigmoid\nTEST:\n  ENABLE: True\n  DATASET: ava\n  BATCH_SIZE: 8\nDATA_LOADER:\n  NUM_WORKERS: 2\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Charades/SLOWFAST_16x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: charades\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 6\n  CHECKPOINT_PERIOD: 6\n  AUTO_RESUME: True\n  CHECKPOINT_FILE_PATH: SLOWFAST_8x8_R50.pkl # please download from the model zoo.\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 64\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 340]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\n  MULTI_LABEL: True\n  INV_UNIFORM_SAMPLE: True\n  ENSEMBLE_METHOD: max\n  REVERSE_INPUT_CHANNEL: True\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 7\nRESNET:\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\n  NORM_TYPE: sync_batchnorm\n  NUM_SYNC_DEVICES: 4\nSOLVER:\n  BASE_LR: 0.0375\n  LR_POLICY: steps_with_relative_lrs\n  LRS: [1, 0.1, 0.01, 0.001, 0.0001, 0.00001]\n  STEPS: [0, 41, 49]\n  MAX_EPOCH: 57\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 4.0\n  WARMUP_START_LR: 0.0001\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 157\n  ARCH: slowfast\n  LOSS_FUNC: bce_logit\n  HEAD_ACT: sigmoid\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: charades\n  BATCH_SIZE: 16\n  NUM_ENSEMBLE_VIEWS: 10\n  NUM_SPATIAL_CROPS: 3\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nLOG_MODEL_INFO: False\n"
  },
  {
    "path": "configs/Charades/SLOWFAST_16x8_R50_multigrid.yaml",
    "content": "MULTIGRID:\n  SHORT_CYCLE: True\n  LONG_CYCLE: True\nTRAIN:\n  ENABLE: True\n  DATASET: charades\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 6\n  CHECKPOINT_PERIOD: 6\n  AUTO_RESUME: True\n  CHECKPOINT_FILE_PATH: SLOWFAST_8x8_R50.pkl # please download from the model zoo.\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 64\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 340]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\n  MULTI_LABEL: True\n  INV_UNIFORM_SAMPLE: True\n  ENSEMBLE_METHOD: max\n  REVERSE_INPUT_CHANNEL: True\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 7\nRESNET:\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\n  NORM_TYPE: sync_batchnorm\n  NUM_SYNC_DEVICES: 4\nSOLVER:\n  BASE_LR: 0.0375\n  LR_POLICY: steps_with_relative_lrs\n  LRS: [1, 0.1, 0.01, 0.001, 0.0001, 0.00001]\n  STEPS: [0, 41, 49]\n  MAX_EPOCH: 57\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 4.0\n  WARMUP_START_LR: 0.0001\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 157\n  ARCH: slowfast\n  LOSS_FUNC: bce_logit\n  HEAD_ACT: sigmoid\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: charades\n  BATCH_SIZE: 16\n  NUM_ENSEMBLE_VIEWS: 10\n  NUM_SPATIAL_CROPS: 3\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nLOG_MODEL_INFO: False\n"
  },
  {
    "path": "configs/Charades/pytorchvideo/SLOWFAST_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: ptvcharades\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 6\n  CHECKPOINT_PERIOD: 6\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: # please download from the model zoo.\n  CHECKPOINT_EPOCH_RESET: True\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\n  MULTI_LABEL: True\n  INV_UNIFORM_SAMPLE: True\n  ENSEMBLE_METHOD: max\n  REVERSE_INPUT_CHANNEL: True\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 7\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\n  POOL: [[[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]]]\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.25\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  COSINE_END_LR: 1e-6\n  MAX_EPOCH: 60\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 5.0\n  WARMUP_START_LR: 0.0001\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 157\n  ARCH: slowfast\n  MODEL_NAME: PTVSlowFast\n  LOSS_FUNC: bce_logit\n  HEAD_ACT: sigmoid\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: ptvcharades\n  BATCH_SIZE: 64\n  NUM_ENSEMBLE_VIEWS: 10\n  NUM_SPATIAL_CROPS: 3\nDATA_LOADER:\n  NUM_WORKERS: 1\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Charades/pytorchvideo/SLOW_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: ptvcharades\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 6\n  CHECKPOINT_PERIOD: 6\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: # please download from the model zoo.\n  CHECKPOINT_EPOCH_RESET: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 340]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n  MULTI_LABEL: True\n  INV_UNIFORM_SAMPLE: True\n  ENSEMBLE_METHOD: max\n  REVERSE_INPUT_CHANNEL: True\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: softmax\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.25\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  COSINE_END_LR: 1e-6\n  MAX_EPOCH: 60\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 5.0\n  WARMUP_START_LR: 0.0001\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 157\n  ARCH: slow\n  MODEL_NAME: PTVResNet\n  LOSS_FUNC: bce_logit\n  HEAD_ACT: sigmoid\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: ptvcharades\n  BATCH_SIZE: 64\n  NUM_ENSEMBLE_VIEWS: 10\n  NUM_SPATIAL_CROPS: 3\nDATA_LOADER:\n  NUM_WORKERS: 1\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/ImageNet/MVIT_B_16_CONV.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: imagenet\n  BATCH_SIZE: 256\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  # PATH_TO_DATA_DIR: path-to-imagenet-dir\n  MEAN: [0.485, 0.456, 0.406]\n  STD: [0.229, 0.224, 0.225]\n  NUM_FRAMES: 1\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\nMVIT:\n  PATCH_2D: True\n  ZERO_DECAY_POS_CLS: False\n  MODE: \"conv\"\n  CLS_EMBED_ON: True\n  PATCH_KERNEL: [7, 7]\n  PATCH_STRIDE: [4, 4]\n  PATCH_PADDING: [3, 3]\n  EMBED_DIM: 96\n  NUM_HEADS: 1\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  DROPPATH_RATE: 0.1\n  DEPTH: 16\n  NORM: \"layernorm\"\n  DIM_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]\n  HEAD_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]\n  POOL_KVQ_KERNEL: [1, 3, 3]\n  POOL_KV_STRIDE_ADAPTIVE: [1, 4, 4]\n  POOL_Q_STRIDE: [[1, 1, 2, 2], [3, 1, 2, 2], [14, 1, 2, 2]]\nAUG:\n  ENABLE: True\n  COLOR_JITTER: 0.4\n  AA_TYPE: rand-m9-n6-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nSOLVER:\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 0.00025\n  LR_POLICY: cosine\n  MAX_EPOCH: 300\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  WARMUP_EPOCHS: 70.0\n  WARMUP_START_LR: 1e-8\n  OPTIMIZING_METHOD: adamw\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  ZERO_WD_1D_PARAM: True\n  CLIP_GRAD_L2NORM: 1.0\nMODEL:\n  NUM_CLASSES: 1000\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy\n  DROPOUT_RATE: 0.0\nTEST:\n  ENABLE: False\n  DATASET: imagenet\n  BATCH_SIZE: 256\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/ImageNet/MVITv2_B.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: imagenet\n  BATCH_SIZE: 256\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  # PATH_TO_DATA_DIR: path-to-imagenet-dir\n  MEAN: [0.485, 0.456, 0.406]\n  STD: [0.229, 0.224, 0.225]\n  NUM_FRAMES: 1\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\nMVIT:\n  PATCH_2D: True\n  ZERO_DECAY_POS_CLS: False\n  MODE: \"conv\"\n  CLS_EMBED_ON: False\n  PATCH_KERNEL: [7, 7]\n  PATCH_STRIDE: [4, 4]\n  PATCH_PADDING: [3, 3]\n  EMBED_DIM: 96\n  NUM_HEADS: 1\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  DROPPATH_RATE: 0.3\n  DEPTH: 24\n  NORM: \"layernorm\"\n  DIM_MUL: [[2, 2.0], [5, 2.0], [21, 2.0]]\n  HEAD_MUL: [[2, 2.0], [5, 2.0], [21, 2.0]]\n  POOL_KVQ_KERNEL: [1, 3, 3]\n  POOL_KV_STRIDE_ADAPTIVE: [1, 4, 4]\n  POOL_Q_STRIDE: [[0, 1, 1, 1], [1, 1, 1, 1], [2, 1, 2, 2], [3, 1, 1, 1], [4, 1, 1, 1], [5, 1, 2, 2], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 1, 1], [9, 1, 1, 1], [10, 1, 1, 1], [11, 1, 1, 1], [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 1, 1], [15, 1, 1, 1], [16, 1, 1, 1], [17, 1, 1, 1], [18, 1, 1, 1], [19, 1, 1, 1], [20, 1, 1, 1], [21, 1, 2, 2], [22, 1, 1, 1], [23, 1, 1, 1]]\n  RESIDUAL_POOLING: True\n  USE_ABS_POS: False\n  REL_POS_SPATIAL: True\n  DIM_MUL_IN_ATT: True\nAUG:\n  ENABLE: True\n  COLOR_JITTER: 0.4\n  AA_TYPE: rand-m9-n6-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nSOLVER:\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 0.00025\n  LR_POLICY: cosine\n  MAX_EPOCH: 300\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  WARMUP_EPOCHS: 70.0\n  WARMUP_START_LR: 1e-8\n  OPTIMIZING_METHOD: adamw\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  ZERO_WD_1D_PARAM: True\n  CLIP_GRAD_L2NORM: 1.0\nMODEL:\n  NUM_CLASSES: 1000\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy\n  DROPOUT_RATE: 
0.0\nTEST:\n  ENABLE: True\n  DATASET: imagenet\n  BATCH_SIZE: 256\n  NUM_ENSEMBLE_VIEWS: 1\n  NUM_SPATIAL_CROPS: 1\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/ImageNet/MVITv2_S.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: imagenet\n  BATCH_SIZE: 256\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  # PATH_TO_DATA_DIR: path-to-imagenet-dir\n  MEAN: [0.485, 0.456, 0.406]\n  STD: [0.229, 0.224, 0.225]\n  NUM_FRAMES: 1\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\nMVIT:\n  PATCH_2D: True\n  ZERO_DECAY_POS_CLS: False\n  MODE: \"conv\"\n  CLS_EMBED_ON: False\n  PATCH_KERNEL: [7, 7]\n  PATCH_STRIDE: [4, 4]\n  PATCH_PADDING: [3, 3]\n  EMBED_DIM: 96\n  NUM_HEADS: 1\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  DROPPATH_RATE: 0.1\n  DEPTH: 16\n  NORM: \"layernorm\"\n  DIM_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]\n  HEAD_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]\n  POOL_KVQ_KERNEL: [1, 3, 3]\n  POOL_KV_STRIDE: [[0, 1, 4, 4], [1, 1, 2, 2], [2, 1, 2, 2], [3, 1, 1, 1], [4, 1, 1, 1], [5, 1, 1, 1], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 1, 1], [9, 1, 1, 1], [10, 1, 1, 1], [11, 1, 1, 1], [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 1, 1], [15, 1, 1, 1]]\n  POOL_Q_STRIDE: [[0, 1, 1, 1], [1, 1, 2, 2], [2, 1, 1, 1], [3, 1, 2, 2], [4, 1, 1, 1], [5, 1, 1, 1], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 1, 1], [9, 1, 1, 1], [10, 1, 1, 1], [11, 1, 1, 1], [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 2, 2], [15, 1, 1, 1]]\n  RESIDUAL_POOLING: True\n  USE_ABS_POS: False\n  REL_POS_SPATIAL: True\n  DIM_MUL_IN_ATT: True\nAUG:\n  ENABLE: True\n  COLOR_JITTER: 0.4\n  AA_TYPE: rand-m9-n6-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nSOLVER:\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 0.00025\n  LR_POLICY: cosine\n  MAX_EPOCH: 300\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  WARMUP_EPOCHS: 70.0\n  WARMUP_START_LR: 1e-8\n  OPTIMIZING_METHOD: adamw\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  ZERO_WD_1D_PARAM: True\n  CLIP_GRAD_L2NORM: 1.0\nMODEL:\n  NUM_CLASSES: 1000\n  
ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy\n  DROPOUT_RATE: 0.0\nTEST:\n  ENABLE: True\n  DATASET: imagenet\n  BATCH_SIZE: 256\n  NUM_ENSEMBLE_VIEWS: 1\n  NUM_SPATIAL_CROPS: 1\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/ImageNet/MVITv2_T.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: imagenet\n  BATCH_SIZE: 256\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  # PATH_TO_DATA_DIR: path-to-imagenet-dir\n  MEAN: [0.485, 0.456, 0.406]\n  STD: [0.229, 0.224, 0.225]\n  NUM_FRAMES: 1\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\nMVIT:\n  PATCH_2D: True\n  ZERO_DECAY_POS_CLS: False\n  MODE: \"conv\"\n  CLS_EMBED_ON: False\n  PATCH_KERNEL: [7, 7]\n  PATCH_STRIDE: [4, 4]\n  PATCH_PADDING: [3, 3]\n  EMBED_DIM: 96\n  NUM_HEADS: 1\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  DROPPATH_RATE: 0.1\n  DEPTH: 10\n  NORM: \"layernorm\"\n  DIM_MUL: [[1, 2.0], [3, 2.0], [8, 2.0]]\n  HEAD_MUL: [[1, 2.0], [3, 2.0], [8, 2.0]]\n  POOL_KVQ_KERNEL: [1, 3, 3]\n  POOL_KV_STRIDE: [[0, 1, 4, 4], [1, 1, 2, 2], [2, 1, 2, 2], [3, 1, 1, 1], [4, 1, 1, 1], [5, 1, 1, 1], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 1, 1], [9, 1, 1, 1]]\n  POOL_Q_STRIDE: [[0, 1, 1, 1], [1, 1, 2, 2], [2, 1, 1, 1], [3, 1, 2, 2], [4, 1, 1, 1], [5, 1, 1, 1], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 2, 2], [9, 1, 1, 1]]\n  RESIDUAL_POOLING: True\n  USE_ABS_POS: False\n  REL_POS_SPATIAL: True\n  DIM_MUL_IN_ATT: True\nAUG:\n  ENABLE: True\n  COLOR_JITTER: 0.4\n  AA_TYPE: rand-m9-n6-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nSOLVER:\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 0.00025\n  LR_POLICY: cosine\n  MAX_EPOCH: 300\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  WARMUP_EPOCHS: 70.0\n  WARMUP_START_LR: 1e-8\n  OPTIMIZING_METHOD: adamw\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  ZERO_WD_1D_PARAM: True\n  CLIP_GRAD_L2NORM: 1.0\nMODEL:\n  NUM_CLASSES: 1000\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy\n  DROPOUT_RATE: 0.0\nTEST:\n  ENABLE: True\n  DATASET: imagenet\n  BATCH_SIZE: 256\n  NUM_ENSEMBLE_VIEWS: 1\n  
NUM_SPATIAL_CROPS: 1\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/ImageNet/RES_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: imagenet\n  BATCH_SIZE: 256\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  # PATH_TO_DATA_DIR: path-to-imagenet-dir\n  MEAN: [0.485, 0.456, 0.406]\n  STD: [0.229, 0.224, 0.225]\n  NUM_FRAMES: 1\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333333333333333]\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: softmax\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: steps_with_relative_lrs\n  MAX_EPOCH: 100\n  STEPS: [0, 30, 60, 90]\n  LRS: [1, 0.1, 0.01, 0.001]\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 0.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 1000\n  ARCH: 2d\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: imagenet\n  BATCH_SIZE: 256\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/ImageNet/REV_MVIT_B_16_CONV.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: imagenet\n  BATCH_SIZE: 256\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n\nDATA:\n  # PATH_TO_DATA_DIR: path-to-imagenet-dir\n  MEAN: [0.485, 0.456, 0.406]\n  STD: [0.229, 0.224, 0.225]\n  NUM_FRAMES: 1\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\nMVIT:\n  PATCH_2D: True\n  ZERO_DECAY_POS_CLS: False\n  MODE: \"conv\"\n  CLS_EMBED_ON: False\n  PATCH_KERNEL: [7, 7]\n  PATCH_STRIDE: [4, 4]\n  PATCH_PADDING: [3, 3]\n  EMBED_DIM: 96\n  NUM_HEADS: 1\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  DROPPATH_RATE: 0.1\n  DEPTH: 16\n  NORM: \"layernorm\"\n  DIM_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]\n  HEAD_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]\n  POOL_KVQ_KERNEL: [1, 3, 3]\n  POOL_KV_STRIDE_ADAPTIVE: [1, 4, 4]\n  POOL_Q_STRIDE: [[1, 1, 2, 2], [3, 1, 2, 2], [14, 1, 2, 2]]\n  SEPARATE_QKV : True\n  REV:\n    ENABLE: True\n    RESPATH_FUSE: \"concat\"\n    BUFFER_LAYERS : [1,3, 14]\n    RES_PATH : \"conv\"\n    PRE_Q_FUSION: \"concat_linear_2\"\nAUG:\n  ENABLE: True\n  COLOR_JITTER: 0.4\n  AA_TYPE: rand-m9-n6-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nSOLVER:\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 0.00025\n  LR_POLICY: cosine\n  MAX_EPOCH: 300\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  WARMUP_EPOCHS: 70.0\n  WARMUP_START_LR: 1e-8\n  OPTIMIZING_METHOD: adamw\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  ZERO_WD_1D_PARAM: True\n  CLIP_GRAD_L2NORM: 1.0\nMODEL:\n  NUM_CLASSES: 1000\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy\n  DROPOUT_RATE: 0.0\nTEST:\n  ENABLE: False\n  DATASET: imagenet\n  BATCH_SIZE: 256\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/ImageNet/REV_VIT_B.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: imagenet\n  BATCH_SIZE: 512\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n\nDATA:\n  # PATH_TO_DATA_DIR: path-to-imagenet-dir\n  MEAN: [0.485, 0.456, 0.406]\n  STD: [0.229, 0.224, 0.225]\n  NUM_FRAMES: 1\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\nMVIT:\n  # USE_SOFTMAX: False\n  PATCH_2D: True\n  ZERO_DECAY_POS_CLS: False\n  MODE: \"conv\"\n  CLS_EMBED_ON: False\n  PATCH_KERNEL: [16, 16]\n  PATCH_STRIDE: [16, 16]\n  PATCH_PADDING: [0, 0]\n  EMBED_DIM: 768\n  NUM_HEADS: 12\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  DROPPATH_RATE: 0.2  # 0.1\n  DEPTH: 12\n  NORM: \"layernorm\"\n  POOL_KV_STRIDE: []\n  SEPARATE_QKV : True\n  REV:\n    ENABLE: True\n    RESPATH_FUSE: \"concat\"\n    RES_PATH: \"conv\"\n    PRE_Q_FUSION: \"concat\"\n\nAUG:\n  ENABLE: True\n  COLOR_JITTER: 0.4\n  AA_TYPE: rand-m6-n5-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nSOLVER:\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 0.0005\n  # BASE_LR: 0.0003\n  LR_POLICY: cosine\n  MAX_EPOCH: 300\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.1\n  WARMUP_EPOCHS: 70.0\n  WARMUP_START_LR: 1e-8\n  OPTIMIZING_METHOD: adamw\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  ZERO_WD_1D_PARAM: True\n  CLIP_GRAD_L2NORM: 2.0\nMODEL:\n  NUM_CLASSES: 1000\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy\n  DROPOUT_RATE: 0.0\nTEST:\n  ENABLE: False\n  DATASET: imagenet\n  BATCH_SIZE: 256\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/ImageNet/REV_VIT_S.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: imagenet\n  BATCH_SIZE: 512\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n\nDATA:\n  # PATH_TO_DATA_DIR: path-to-imagenet-dir\n  MEAN: [0.485, 0.456, 0.406]\n  STD: [0.229, 0.224, 0.225]\n  NUM_FRAMES: 1\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\nMVIT:\n  PATCH_2D: True\n  ZERO_DECAY_POS_CLS: False\n  MODE: \"conv\"\n  CLS_EMBED_ON: False\n  PATCH_KERNEL: [16, 16]\n  PATCH_STRIDE: [16, 16]\n  PATCH_PADDING: [0, 0]\n  EMBED_DIM: 384\n  NUM_HEADS: 6\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  DROPPATH_RATE: 0.1\n  DEPTH: 12\n  NORM: \"layernorm\"\n  POOL_KV_STRIDE: []\n  SEPARATE_QKV : True\n  REV:\n    ENABLE: True\n    RESPATH_FUSE: \"concat\"\n    RES_PATH: \"conv\"\nAUG:\n  ENABLE: True\n  COLOR_JITTER: 0.4\n  AA_TYPE: rand-m7-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nSOLVER:\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 0.0005\n  LR_POLICY: cosine\n  MAX_EPOCH: 300\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  WARMUP_EPOCHS: 70.0\n  WARMUP_START_LR: 1e-8\n  OPTIMIZING_METHOD: adamw\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  ZERO_WD_1D_PARAM: True\n  CLIP_GRAD_L2NORM: 1.0\nMODEL:\n  NUM_CLASSES: 1000\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy\n  DROPOUT_RATE: 0.0\nTEST:\n  ENABLE: False\n  DATASET: imagenet\n  BATCH_SIZE: 256\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/C2D_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: softmax\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: c2d\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/C2D_8x8_R50_IN1K.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: /path/to/your/directory/imagenet50_pretrain.pyth\n  CHECKPOINT_INFLATE: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n  # PATH_TO_DATA_DIR: /path/to/your/dataset\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: softmax\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.01\n  LR_POLICY: steps_with_relative_lrs\n  STEPS: [0, 44, 88, 118]\n  LRS: [1, 0.1, 0.01, 0.001]\n  MAX_EPOCH: 118\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: c2d\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/C2D_NLN_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[1, 3]], [[1, 3, 5]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: softmax\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: c2d\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/C2D_NLN_8x8_R50_IN1K.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: /path/to/your/directory/imagenet50_pretrain.pyth\n  CHECKPOINT_INFLATE: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n  # PATH_TO_DATA_DIR: /path/to/your/dataset\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[1, 3]], [[1, 3, 5]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: softmax\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.01\n  LR_POLICY: steps_with_relative_lrs\n  STEPS: [0, 44, 88, 118]\n  LRS: [1, 0.1, 0.01, 0.001]\n  MAX_EPOCH: 118\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: c2d\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/I3D_8x8_R101.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 101\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [23], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: softmax\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: i3d\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/I3D_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: softmax\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: i3d\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/I3D_8x8_R50_IN1K.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: /path/to/your/directory/imagenet50_pretrain.pyth\n  CHECKPOINT_INFLATE: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n  # PATH_TO_DATA_DIR: /path/to/your/dataset\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: softmax\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.01\n  LR_POLICY: steps_with_relative_lrs\n  STEPS: [0, 44, 88, 118]\n  LRS: [1, 0.1, 0.01, 0.001]\n  MAX_EPOCH: 118\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: i3d\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/I3D_NLN_8x8_R101.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 101\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [23], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[1, 3]], [[1, 3, 5]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: softmax\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: i3d\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/I3D_NLN_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[1, 3]], [[1, 3, 5]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: softmax\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: i3d\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/I3D_NLN_8x8_R50_IN1K.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: /path/to/your/directory/imagenet50_pretrain.pyth\n  CHECKPOINT_INFLATE: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n  # PATH_TO_DATA_DIR: /path/to/your/dataset\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[1, 3]], [[1, 3, 5]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: softmax\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.01\n  LR_POLICY: steps_with_relative_lrs\n  STEPS: [0, 44, 88, 118]\n  LRS: [1, 0.1, 0.01, 0.001]\n  MAX_EPOCH: 118\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: i3d\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/MVIT_B_16x4_CONV.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\nDATA:\n  USE_OFFSET_SAMPLING: True\n  DECODING_BACKEND: torchvision\n  NUM_FRAMES: 16\n  SAMPLING_RATE: 4\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\n  # PATH_TO_DATA_DIR: path-to-k400-dir\n  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\nMVIT:\n  ZERO_DECAY_POS_CLS: False\n  SEP_POS_EMBED: True\n  DEPTH: 16\n  NUM_HEADS: 1\n  EMBED_DIM: 96\n  PATCH_KERNEL: (3, 7, 7)\n  PATCH_STRIDE: (2, 4, 4)\n  PATCH_PADDING: (1, 3, 3)\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  DROPPATH_RATE: 0.2\n  NORM: \"layernorm\"\n  MODE: \"conv\"\n  CLS_EMBED_ON: True\n  DIM_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]\n  HEAD_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]\n  POOL_KVQ_KERNEL: [3, 3, 3]\n  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]\n  POOL_Q_STRIDE: [[1, 1, 2, 2], [3, 1, 2, 2], [14, 1, 2, 2]]\n  DROPOUT_RATE: 0.0\nAUG:\n  NUM_SAMPLE: 2\n  ENABLE: True\n  COLOR_JITTER: 0.4\n  AA_TYPE: rand-m7-n4-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  ZERO_WD_1D_PARAM: True\n  CLIP_GRAD_L2NORM: 1.0\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 0.0001\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  WARMUP_START_LR: 1e-6\n  WARMUP_EPOCHS: 30.0\n  LR_POLICY: cosine\n  MAX_EPOCH: 200\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  OPTIMIZING_METHOD: adamw\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 1\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 
8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/MVIT_B_32x3_CONV.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\nDATA:\n  USE_OFFSET_SAMPLING: True\n  DECODING_BACKEND: torchvision\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 3\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\n  # PATH_TO_DATA_DIR: path-to-k400-or-k600-dir\n  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\nMVIT:\n  ZERO_DECAY_POS_CLS: False\n  SEP_POS_EMBED: True\n  NUM_HEADS: 1\n  EMBED_DIM: 96\n  PATCH_KERNEL: (3, 7, 7)\n  PATCH_STRIDE: (2, 4, 4)\n  PATCH_PADDING: (1, 3, 3)\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  DROPPATH_RATE: 0.3\n  NORM: \"layernorm\"\n  MODE: \"conv\"\n  DEPTH: 24\n  POOL_Q_STRIDE: [[2, 1, 2, 2], [5, 1, 2, 2], [21, 1, 2, 2]]\n  DIM_MUL: [[2, 2.0], [5, 2.0], [21, 2.0]]\n  HEAD_MUL: [[2, 2.0], [5, 2.0], [21, 2.0]]\n  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]\nAUG:\n  ENABLE: True\n  COLOR_JITTER: 0.4\n  AA_TYPE: rand-m7-n4-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  NUM_SAMPLE: 2\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  LABEL_SMOOTH_VALUE: 0.1\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\nSOLVER:\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 1e-4\n  CLIP_GRAD_L2NORM: 1.0\n  LR_POLICY: cosine\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  MAX_EPOCH: 200\n  WARMUP_EPOCHS: 30.0\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 5e-2\n  ZERO_WD_1D_PARAM: True\n  WARMUP_START_LR: 1e-6\n  OPTIMIZING_METHOD: adamw\nMODEL:\n  NUM_CLASSES: 400 # or 600 for K600\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 1\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/MVITv2_B_32x3.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\nDATA:\n  USE_OFFSET_SAMPLING: True\n  DECODING_BACKEND: torchvision\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 3\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\n  # PATH_TO_DATA_DIR: path-to-k400-dir\n  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\nMVIT:\n  ZERO_DECAY_POS_CLS: False\n  USE_ABS_POS: False\n  REL_POS_SPATIAL: True\n  REL_POS_TEMPORAL: True\n  DEPTH: 24\n  NUM_HEADS: 1\n  EMBED_DIM: 96\n  PATCH_KERNEL: (3, 7, 7)\n  PATCH_STRIDE: (2, 4, 4)\n  PATCH_PADDING: (1, 3, 3)\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  DROPPATH_RATE: 0.3\n  NORM: \"layernorm\"\n  MODE: \"conv\"\n  CLS_EMBED_ON: True\n  DIM_MUL: [[2, 2.0], [5, 2.0], [21, 2.0]]\n  HEAD_MUL: [[2, 2.0], [5, 2.0], [21, 2.0]]\n  POOL_KVQ_KERNEL: [3, 3, 3]\n  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]\n  POOL_Q_STRIDE: [[0, 1, 1, 1], [1, 1, 1, 1], [2, 1, 2, 2], [3, 1, 1, 1], [4, 1, 1, 1], [5, 1, 2, 2], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 1, 1], [9, 1, 1, 1], [10, 1, 1, 1], [11, 1, 1, 1], [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 1, 1], [15, 1, 1, 1], [16, 1, 1, 1], [17, 1, 1, 1], [18, 1, 1, 1], [19, 1, 1, 1], [20, 1, 1, 1], [21, 1, 2, 2], [22, 1, 1, 1], [23, 1, 1, 1]]\n  DROPOUT_RATE: 0.0\n  DIM_MUL_IN_ATT: True\n  RESIDUAL_POOLING: True\nAUG:\n  NUM_SAMPLE: 2\n  ENABLE: True\n  COLOR_JITTER: 0.4\n  AA_TYPE: rand-m7-n4-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nSOLVER:\n  ZERO_WD_1D_PARAM: True\n  BASE_LR_SCALE_NUM_SHARDS: True\n  CLIP_GRAD_L2NORM: 1.0\n  BASE_LR: 0.0001\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  WARMUP_START_LR: 1e-6\n  WARMUP_EPOCHS: 30.0\n  LR_POLICY: cosine\n  
MAX_EPOCH: 200\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  OPTIMIZING_METHOD: adamw\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 1\n  NUM_ENSEMBLE_VIEWS: 5\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/MVITv2_L_40x3_test.yaml",
    "content": "TRAIN:\n  ENABLE: False\nDATA:\n  USE_OFFSET_SAMPLING: True\n  DECODING_BACKEND: torchvision\n  NUM_FRAMES: 40\n  SAMPLING_RATE: 3\n  TRAIN_JITTER_SCALES: [356, 446]\n  TRAIN_CROP_SIZE: 312\n  TEST_CROP_SIZE: 312\n  INPUT_CHANNEL_NUM: [3]\n  # PATH_TO_DATA_DIR: path-to-k400-dir\n  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\n  MEAN: [0.485, 0.456, 0.406]\n  STD: [0.229, 0.224, 0.225]\nMVIT:\n  ZERO_DECAY_POS_CLS: False\n  USE_ABS_POS: False\n  REL_POS_SPATIAL: True\n  REL_POS_TEMPORAL: True\n  DEPTH: 48\n  NUM_HEADS: 2\n  EMBED_DIM: 144\n  PATCH_KERNEL: (3, 7, 7)\n  PATCH_STRIDE: (2, 4, 4)\n  PATCH_PADDING: (1, 3, 3)\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  DROPPATH_RATE: 0.75\n  NORM: \"layernorm\"\n  MODE: \"conv\"\n  CLS_EMBED_ON: True\n  DIM_MUL: [[2, 2.0], [8, 2.0], [44, 2.0]]\n  HEAD_MUL: [[2, 2.0], [8, 2.0], [44, 2.0]]\n  POOL_KVQ_KERNEL: [3, 3, 3]\n  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]\n  POOL_Q_STRIDE: [[0, 1, 1, 1], [1, 1, 1, 1], [2, 1, 2, 2], [3, 1, 1, 1], [4, 1, 1, 1], [5, 1, 1, 1], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 2, 2], [9, 1, 1, 1], [10, 1, 1, 1],\n  [11, 1, 1, 1], [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 1, 1], [15, 1, 1, 1], [16, 1, 1, 1], [17, 1, 1, 1], [18, 1, 1, 1], [19, 1, 1, 1], [20, 1, 1, 1],\n  [21, 1, 1, 1], [22, 1, 1, 1], [23, 1, 1, 1], [24, 1, 1, 1], [25, 1, 1, 1], [26, 1, 1, 1], [27, 1, 1, 1], [28, 1, 1, 1], [29, 1, 1, 1], [30, 1, 1, 1],\n  [31, 1, 1, 1], [32, 1, 1, 1], [33, 1, 1, 1], [34, 1, 1, 1], [35, 1, 1, 1], [36, 1, 1, 1], [37, 1, 1, 1], [38, 1, 1, 1], [39, 1, 1, 1], [40, 1, 1, 1],\n  [41, 1, 1, 1], [42, 1, 1, 1], [43, 1, 1, 1], [44, 1, 2, 2], [45, 1, 1, 1], [46, 1, 1, 1], [47, 1, 1, 1] ]\n  DROPOUT_RATE: 0.0\n  DIM_MUL_IN_ATT: True\n  RESIDUAL_POOLING: True\nAUG:\n  # NUM_SAMPLE: 2\n  ENABLE: True\n  COLOR_JITTER: 0.4\n  AA_TYPE: rand-m7-n4-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  
ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 0.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy\n  DROPOUT_RATE: 0.5\n  ACT_CHECKPOINT: True\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 8\n  NUM_SPATIAL_CROPS: 3\n  NUM_ENSEMBLE_VIEWS: 5\n  # CHECKPOINT_FILE_PATH: # download pre-trained model\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/MVITv2_S_16x4.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\nDATA:\n  USE_OFFSET_SAMPLING: True\n  DECODING_BACKEND: torchvision\n  NUM_FRAMES: 16\n  SAMPLING_RATE: 4\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\n  # PATH_TO_DATA_DIR: path-to-k400-dir\n  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\nMVIT:\n  ZERO_DECAY_POS_CLS: False\n  USE_ABS_POS: False\n  REL_POS_SPATIAL: True\n  REL_POS_TEMPORAL: True\n  DEPTH: 16\n  NUM_HEADS: 1\n  EMBED_DIM: 96\n  PATCH_KERNEL: (3, 7, 7)\n  PATCH_STRIDE: (2, 4, 4)\n  PATCH_PADDING: (1, 3, 3)\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  DROPPATH_RATE: 0.2\n  NORM: \"layernorm\"\n  MODE: \"conv\"\n  CLS_EMBED_ON: True\n  DIM_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]\n  HEAD_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]\n  POOL_KVQ_KERNEL: [3, 3, 3]\n  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]\n  POOL_Q_STRIDE: [[0, 1, 1, 1], [1, 1, 2, 2], [2, 1, 1, 1], [3, 1, 2, 2], [4, 1, 1, 1], [5, 1, 1, 1], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 1, 1], [9, 1, 1, 1], [10, 1, 1, 1], [11, 1, 1, 1], [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 2, 2], [15, 1, 1, 1]]\n  DROPOUT_RATE: 0.0\n  DIM_MUL_IN_ATT: True\n  RESIDUAL_POOLING: True\nAUG:\n  NUM_SAMPLE: 2\n  ENABLE: True\n  COLOR_JITTER: 0.4\n  AA_TYPE: rand-m7-n4-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nSOLVER:\n  ZERO_WD_1D_PARAM: True\n  BASE_LR_SCALE_NUM_SHARDS: True\n  CLIP_GRAD_L2NORM: 1.0\n  BASE_LR: 0.0001\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  WARMUP_START_LR: 1e-6\n  WARMUP_EPOCHS: 30.0\n  LR_POLICY: cosine\n  MAX_EPOCH: 200\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  OPTIMIZING_METHOD: adamw\n  COSINE_AFTER_WARMUP: True\nMODEL:\n  
NUM_CLASSES: 400\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 1\n  NUM_ENSEMBLE_VIEWS: 5\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/REV_MVIT_B_16x4_CONV.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\n\nDATA:\n  DECODING_BACKEND: torchvision\n  NUM_FRAMES: 16\n  SAMPLING_RATE: 4\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\n  # PATH_TO_DATA_DIR: path-to-k400-dir\n  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\n\nMVIT:\n  ZERO_DECAY_POS_CLS: False\n  CLS_EMBED_ON: True\n  SEP_POS_EMBED: True\n  USE_ABS_POS: True\n  DEPTH: 16\n  NUM_HEADS: 1\n  EMBED_DIM: 96\n  PATCH_KERNEL: (3, 7, 7)\n  PATCH_STRIDE: (2, 4, 4)\n  PATCH_PADDING: (1, 3, 3)\n  MLP_RATIO: 4.0\n  QKV_BIAS: False\n  DROPPATH_RATE: 0.05\n  NORM: \"layernorm\"\n  MODE: \"conv\"\n  DIM_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]\n  HEAD_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]\n  POOL_KVQ_KERNEL: [3, 3, 3]\n  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]\n  POOL_Q_STRIDE: [[1, 1, 2, 2], [3, 1, 2, 2], [14, 1, 2, 2]]\n\n  REV:\n    ENABLE: True\n    RESPATH_FUSE: \"concat\"\n    BUFFER_LAYERS : [1, 3, 14]\n    RES_PATH : \"conv\"\n\nAUG:\n  ENABLE: True\n  COLOR_JITTER: 0.4\n  AA_TYPE: rand-m7-n4-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  NUM_SAMPLE: 2\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\n\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  LABEL_SMOOTH_VALUE: 0.1\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n\nSOLVER:\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 0.0001\n  LR_POLICY: cosine\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR:  1e-6\n  MAX_EPOCH: 200\n  WARMUP_EPOCHS: 50.0\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 7e-2\n  ZERO_WD_1D_PARAM: True\n  WARMUP_START_LR: 0.000001\n  OPTIMIZING_METHOD: adamw\n\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slow\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy\n  DROPOUT_RATE: 0.5\n\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 1\n  NUM_ENSEMBLE_VIEWS: 5 # [1, 3, 5, 7, 10]\n\nDATA_LOADER:\n  
NUM_WORKERS: 8\n  PIN_MEMORY: True\n\nNUM_GPUS: 8\nNUM_SHARDS: 16\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/SLOWFAST_4x16_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\nSLOWFAST:\n  ALPHA: 8\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 5\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slowfast\n  MODEL_NAME: SlowFast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/SLOWFAST_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 7\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slowfast\n  MODEL_NAME: SlowFast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/SLOWFAST_8x8_R50_stepwise.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 7\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: steps_with_relative_lrs\n  LRS: [1, 0.1, 0.01, 0.001, 0.0001, 0.00001]\n  STEPS: [0, 94, 154, 196]\n  MAX_EPOCH: 239\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slowfast\n  MODEL_NAME: SlowFast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/SLOWFAST_8x8_R50_stepwise_multigrid.yaml",
    "content": "MULTIGRID:\n  SHORT_CYCLE: True\n  LONG_CYCLE: True\nTRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3, 3]\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 7\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: steps_with_relative_lrs\n  LRS: [1, 0.1, 0.01, 0.001, 0.0001, 0.00001]\n  STEPS: [0, 94, 154, 196]\n  MAX_EPOCH: 239\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slowfast\n  MODEL_NAME: SlowFast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/SLOWFAST_NLN_4x16_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\nSLOWFAST:\n  ALPHA: 8\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 5\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[1, 3], []], [[1, 3, 5], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slowfast\n  MODEL_NAME: SlowFast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/SLOWFAST_NLN_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 5\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[1, 3], []], [[1, 3, 5], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slowfast\n  MODEL_NAME: SlowFast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/SLOW_4x16_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 4\n  SAMPLING_RATE: 16\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slow\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/SLOW_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slow\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/SLOW_NLN_4x16_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 4\n  SAMPLING_RATE: 16\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[1, 3]], [[1, 3, 5]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slow\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/SLOW_NLN_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[1, 3]], [[1, 3, 5]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slow\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/X3D_L.yaml",
    "content": "TRAIN:\n  # ENABLE: False # default True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\nX3D:\n  WIDTH_FACTOR: 2.0\n  DEPTH_FACTOR: 5.0\n  BOTTLENECK_FACTOR: 2.25\n  DIM_C5: 2048\n  DIM_C1: 12\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 32\n  # CHECKPOINT_FILE_PATH: 'x3d_l.pyth' # 77.48% top1 30-view accuracy to download from the model zoo (optional).\n  # NUM_SPATIAL_CROPS: 1\n  NUM_SPATIAL_CROPS: 3\nDATA:\n  NUM_FRAMES: 16\n  SAMPLING_RATE: 5\n  TRAIN_JITTER_SCALES: [356, 446]\n  TRAIN_CROP_SIZE: 312\n  # TEST_CROP_SIZE: 312 # use if TEST.NUM_SPATIAL_CROPS: 1\n  TEST_CROP_SIZE: 356 # use if TEST.NUM_SPATIAL_CROPS: 3\n  INPUT_CHANNEL_NUM: [3]\n  DECODING_BACKEND: torchvision\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  TRANS_FUNC: x3d_transform\n  STRIDE_1X1: False\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\n  WEIGHT_DECAY: 0.0\nSOLVER:\n  BASE_LR: 0.05 # 1 machine\n  BASE_LR_SCALE_NUM_SHARDS: True\n  MAX_EPOCH: 256\n  LR_POLICY: cosine\n  WEIGHT_DECAY: 5e-5\n  WARMUP_EPOCHS: 35.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: x3d\n  MODEL_NAME: X3D\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/X3D_M.yaml",
    "content": "TRAIN:\n  # ENABLE: False # default True\n  DATASET: kinetics\n  BATCH_SIZE: 128\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\nX3D:\n  WIDTH_FACTOR: 2.0\n  DEPTH_FACTOR: 2.2\n  BOTTLENECK_FACTOR: 2.25\n  DIM_C5: 2048\n  DIM_C1: 12\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  # CHECKPOINT_FILE_PATH: 'x3d_m.pyth' # 76.21% top1 30-view accuracy to download from the model zoo (optional).\n  # NUM_SPATIAL_CROPS: 1\n  NUM_SPATIAL_CROPS: 3\nDATA:\n  NUM_FRAMES: 16\n  SAMPLING_RATE: 5\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  # TEST_CROP_SIZE: 224 # use if TEST.NUM_SPATIAL_CROPS: 1\n  TEST_CROP_SIZE: 256 # use if TEST.NUM_SPATIAL_CROPS: 3\n  INPUT_CHANNEL_NUM: [3]\n  DECODING_BACKEND: torchvision\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  TRANS_FUNC: x3d_transform\n  STRIDE_1X1: False\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\n  WEIGHT_DECAY: 0.0\nSOLVER:\n  BASE_LR: 0.1 # 1 machine\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 300\n  WEIGHT_DECAY: 5e-5\n  WARMUP_EPOCHS: 35.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: x3d\n  MODEL_NAME: X3D\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/X3D_S.yaml",
    "content": "TRAIN:\n  # ENABLE: False # default True\n  DATASET: kinetics\n  BATCH_SIZE: 128\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\nX3D:\n  WIDTH_FACTOR: 2.0\n  DEPTH_FACTOR: 2.2\n  BOTTLENECK_FACTOR: 2.25\n  DIM_C5: 2048\n  DIM_C1: 12\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  # CHECKPOINT_FILE_PATH: 'x3d_s.pyth' # 73.50% top1 30-view accuracy to download from the model zoo (optional).\n  # NUM_SPATIAL_CROPS: 1\n  NUM_SPATIAL_CROPS: 3\nDATA:\n  NUM_FRAMES: 13\n  SAMPLING_RATE: 6\n  TRAIN_JITTER_SCALES: [182, 228]\n  TRAIN_CROP_SIZE: 160\n  # TEST_CROP_SIZE: 160 # use if TEST.NUM_SPATIAL_CROPS: 1\n  TEST_CROP_SIZE: 182 # use if TEST.NUM_SPATIAL_CROPS: 3\n  INPUT_CHANNEL_NUM: [3]\n  DECODING_BACKEND: torchvision\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  TRANS_FUNC: x3d_transform\n  STRIDE_1X1: False\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\n  WEIGHT_DECAY: 0.0\nSOLVER:\n  BASE_LR: 0.1 # 16 machine\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 300\n  WEIGHT_DECAY: 5e-5\n  WARMUP_EPOCHS: 35.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: x3d\n  MODEL_NAME: X3D\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/X3D_XS.yaml",
    "content": "TRAIN:\n  # ENABLE: False # default True\n  DATASET: kinetics\n  BATCH_SIZE: 128\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\nX3D:\n  WIDTH_FACTOR: 2.0\n  DEPTH_FACTOR: 2.2\n  BOTTLENECK_FACTOR: 2.25\n  DIM_C5: 2048\n  DIM_C1: 12\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  # CHECKPOINT_FILE_PATH: 'x3d_xs.pyth' # 69.46% top1 30-view accuracy to download from the model zoo (optional).\n  # NUM_SPATIAL_CROPS: 1\n  NUM_SPATIAL_CROPS: 3\nDATA:\n  NUM_FRAMES: 4\n  SAMPLING_RATE: 12\n  TRAIN_JITTER_SCALES: [182, 228]\n  TRAIN_CROP_SIZE: 160\n  # TEST_CROP_SIZE: 160 # use if TEST.NUM_SPATIAL_CROPS: 1\n  TEST_CROP_SIZE: 182 # use if TEST.NUM_SPATIAL_CROPS: 3\n  INPUT_CHANNEL_NUM: [3]\n  DECODING_BACKEND: torchvision\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  TRANS_FUNC: x3d_transform\n  STRIDE_1X1: False\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\n  WEIGHT_DECAY: 0.0\nSOLVER:\n  BASE_LR: 0.1 # 1 machine\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 300\n  WEIGHT_DECAY: 5e-5\n  WARMUP_EPOCHS: 35.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: x3d\n  MODEL_NAME: X3D\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/c2/C2D_NOPOOL_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: False\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: path to the model to test\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: softmax\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: c2d\n  MODEL_NAME: ResNet_nopool\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/c2/I3D_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: False\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: path to the model to test\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: softmax\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: i3d\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/c2/I3D_NLN_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: False\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: path to the model to test\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[1, 3]], [[1, 3, 5]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: softmax\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: i3d\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/c2/SLOWFAST_16x8_R101_50_50.yaml",
    "content": "TRAIN:\n  ENABLE: False\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: path to the model to test\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 64\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 5\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 101\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slowfast\n  MODEL_NAME: SlowFast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/c2/SLOWFAST_4x16_R50.yaml",
    "content": "TRAIN:\n  ENABLE: False\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: path to the model to test\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\nSLOWFAST:\n  ALPHA: 8\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 5\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slowfast\n  MODEL_NAME: SlowFast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/c2/SLOWFAST_8x8_R101_101_101.yaml",
    "content": "TRAIN:\n  ENABLE: False\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: path to the model to test\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 5\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 101\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [23, 23], [3, 3]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slowfast\n  MODEL_NAME: SlowFast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/c2/SLOWFAST_8x8_R101_50_101.yaml",
    "content": "TRAIN:\n  ENABLE: False\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: path to the model to test\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 5\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 101\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 23], [3, 3]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slowfast\n  MODEL_NAME: SlowFast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/c2/SLOWFAST_8x8_R101_50_50.yaml",
    "content": "TRAIN:\n  ENABLE: False\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: path to the model to test\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 5\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 101\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slowfast\n  MODEL_NAME: SlowFast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/c2/SLOWFAST_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: False\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: path to the model to test\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 7\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slowfast\n  MODEL_NAME: SlowFast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/c2/SLOWFAST_NLN_16x8_R101_50_50.yaml",
    "content": "TRAIN:\n  ENABLE: False\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: path to the model to test\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 64\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 5\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 101\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[6, 13, 20], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slowfast\n  MODEL_NAME: SlowFast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/c2/SLOW_4x16_R50.yaml",
    "content": "TRAIN:\n  ENABLE: False\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: path to the model to test\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 4\n  SAMPLING_RATE: 16\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slow\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/c2/SLOW_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: False\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: path to the model to test\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slow\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/pytorchvideo/C2D_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1 # 1-node\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: c2d\n  MODEL_NAME: PTVResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 3\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nTENSORBOARD:\n  ENABLE: True\n"
  },
  {
    "path": "configs/Kinetics/pytorchvideo/CSN_32x2_R101.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n\n\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  DEPTH: 101\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1 # 1-node\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: csn\n  MODEL_NAME: PTVCSN\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\n  SINGLE_PATHWAY_ARCH: ['csn']\nTEST:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 3\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nTENSORBOARD:\n  ENABLE: True\n"
  },
  {
    "path": "configs/Kinetics/pytorchvideo/I3D_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n\n\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1 # 1-node\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: i3d\n  MODEL_NAME: PTVResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 3\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nTENSORBOARD:\n  ENABLE: True\n"
  },
  {
    "path": "configs/Kinetics/pytorchvideo/MVIT_B_16x4_CONV.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\nDATA:\n  USE_OFFSET_SAMPLING: True\n  DECODING_BACKEND: torchvision\n  NUM_FRAMES: 16\n  SAMPLING_RATE: 4\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\n  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\nMVIT:\n  ZERO_DECAY_POS_CLS: False\n  SEP_POS_EMBED: True\n  DEPTH: 16\n  NUM_HEADS: 1\n  EMBED_DIM: 96\n  PATCH_KERNEL: (3, 7, 7)\n  PATCH_STRIDE: (2, 4, 4)\n  PATCH_PADDING: (1, 3, 3)\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  DROPPATH_RATE: 0.2\n  NORM: \"layernorm\"\n  MODE: \"conv\"\n  CLS_EMBED_ON: True\n  DIM_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]\n  HEAD_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]\n  POOL_KVQ_KERNEL: [3, 3, 3]\n  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]\n  POOL_Q_STRIDE: [[1, 1, 2, 2], [3, 1, 2, 2], [14, 1, 2, 2]]\n  DROPOUT_RATE: 0.0\nAUG:\n  NUM_SAMPLE: 2\n  ENABLE: True\n  COLOR_JITTER: 0.4\n  AA_TYPE: rand-m7-n4-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  ZERO_WD_1D_PARAM: True\n  CLIP_GRAD_L2NORM: 1.0\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 0.0001\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  WARMUP_START_LR: 1e-6\n  WARMUP_EPOCHS: 30.0\n  LR_POLICY: cosine\n  MAX_EPOCH: 200\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  OPTIMIZING_METHOD: adamw\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: mvit\n  MODEL_NAME: PTVMViT\n  LOSS_FUNC: soft_cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 1\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/pytorchvideo/R2PLUS1D_16x4_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 16\n  SAMPLING_RATE: 4\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n\n\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  DEPTH: 50\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1 # 1-node\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: r2plus1d\n  MODEL_NAME: PTVR2plus1D\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\n  SINGLE_PATHWAY_ARCH: ['r2plus1d']\nTEST:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 3\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nTENSORBOARD:\n  ENABLE: True\n"
  },
  {
    "path": "configs/Kinetics/pytorchvideo/SLOWFAST_16x8_R101_50_50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 2\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 64\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 360]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 5\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 101\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1 # 1-node\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slowfast\n  MODEL_NAME: PTVSlowFast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 3\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nTENSORBOARD:\n  ENABLE: True\n"
  },
  {
    "path": "configs/Kinetics/pytorchvideo/SLOWFAST_4x16_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\n\n\nSLOWFAST:\n  ALPHA: 8\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 5\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1 # 1-node\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slowfast\n  MODEL_NAME: PTVSlowFast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 3\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nTENSORBOARD:\n  ENABLE: True\n"
  },
  {
    "path": "configs/Kinetics/pytorchvideo/SLOWFAST_8x8_R101.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 2\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\n\n\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 5\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 101\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [23, 23], [3, 3]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1 # 1-node\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slowfast\n  MODEL_NAME: PTVSlowFast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 3\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nTENSORBOARD:\n  ENABLE: True\n"
  },
  {
    "path": "configs/Kinetics/pytorchvideo/SLOWFAST_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\n\n\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 7\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1 # 1-node\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slowfast\n  MODEL_NAME: PTVSlowFast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 3\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nTENSORBOARD:\n  ENABLE: True\n"
  },
  {
    "path": "configs/Kinetics/pytorchvideo/SLOW_4x16_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 4\n  SAMPLING_RATE: 16\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n\n\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1 # 1-node\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slow\n  MODEL_NAME: PTVResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 3\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nTENSORBOARD:\n  ENABLE: True\n"
  },
  {
    "path": "configs/Kinetics/pytorchvideo/SLOW_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n\n\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1 # 1-node\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slow\n  MODEL_NAME: PTVResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 3\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nTENSORBOARD:\n  ENABLE: True\n"
  },
  {
    "path": "configs/Kinetics/pytorchvideo/X3D_L.yaml",
    "content": "TRAIN:\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 2\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\nX3D:\n  WIDTH_FACTOR: 2.0\n  DEPTH_FACTOR: 5.0\n  BOTTLENECK_FACTOR: 2.25\n  DIM_C5: 2048\n  DIM_C1: 12\nTEST:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 32\n  # NUM_SPATIAL_CROPS: 1\n  NUM_SPATIAL_CROPS: 3\nDATA:\n  NUM_FRAMES: 16\n  SAMPLING_RATE: 5\n  TRAIN_JITTER_SCALES: [356, 446]\n  TRAIN_CROP_SIZE: 312\n  # TEST_CROP_SIZE: 312 # use if TEST.NUM_SPATIAL_CROPS: 1\n  TEST_CROP_SIZE: 356 # use if TEST.NUM_SPATIAL_CROPS: 3\n  INPUT_CHANNEL_NUM: [3]\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  TRANS_FUNC: x3d_transform\n  STRIDE_1X1: False\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\n  WEIGHT_DECAY: 0.0\nSOLVER:\n  BASE_LR: 0.05 # 1 machine\n  BASE_LR_SCALE_NUM_SHARDS: True\n  MAX_EPOCH: 256\n  LR_POLICY: cosine\n  WEIGHT_DECAY: 5e-5\n  WARMUP_EPOCHS: 35.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: x3d\n  MODEL_NAME: PTVX3D\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/pytorchvideo/X3D_M.yaml",
    "content": "TRAIN:\n  # ENABLE: False # default True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 128\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\nX3D:\n  WIDTH_FACTOR: 2.0\n  DEPTH_FACTOR: 2.2\n  BOTTLENECK_FACTOR: 2.25\n  DIM_C5: 2048\n  DIM_C1: 12\nTEST:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  # CHECKPOINT_FILE_PATH: 'x3d_m.pyth' # 76.21% top1 30-view accuracy to download from the model zoo (optional).\n  # NUM_SPATIAL_CROPS: 1\n  NUM_SPATIAL_CROPS: 3\nDATA:\n  NUM_FRAMES: 16\n  SAMPLING_RATE: 5\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  # TEST_CROP_SIZE: 224 # use if TEST.NUM_SPATIAL_CROPS: 1\n  TEST_CROP_SIZE: 256 # use if TEST.NUM_SPATIAL_CROPS: 3\n  INPUT_CHANNEL_NUM: [3]\n\n\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  TRANS_FUNC: x3d_transform\n  STRIDE_1X1: False\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\n  WEIGHT_DECAY: 0.0\nSOLVER:\n  BASE_LR: 0.1 # 1 machine\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 300\n  WEIGHT_DECAY: 5e-5\n  WARMUP_EPOCHS: 35.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: x3d\n  MODEL_NAME: PTVX3D\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/pytorchvideo/X3D_S.yaml",
    "content": "TRAIN:\n  # ENABLE: False # default True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 128\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\nX3D:\n  WIDTH_FACTOR: 2.0\n  DEPTH_FACTOR: 2.2\n  BOTTLENECK_FACTOR: 2.25\n  DIM_C5: 2048\n  DIM_C1: 12\nTEST:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  # CHECKPOINT_FILE_PATH: 'x3d_s.pyth' # 73.50% top1 30-view accuracy to download from the model zoo (optional).\n  # NUM_SPATIAL_CROPS: 1\n  NUM_SPATIAL_CROPS: 3\nDATA:\n  NUM_FRAMES: 13\n  SAMPLING_RATE: 6\n  TRAIN_JITTER_SCALES: [182, 228]\n  TRAIN_CROP_SIZE: 160\n  # TEST_CROP_SIZE: 160 # use if TEST.NUM_SPATIAL_CROPS: 1\n  TEST_CROP_SIZE: 182 # use if TEST.NUM_SPATIAL_CROPS: 3\n  INPUT_CHANNEL_NUM: [3]\n  # DECODING_BACKEND: torchvision\n\n\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  TRANS_FUNC: x3d_transform\n  STRIDE_1X1: False\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\n  WEIGHT_DECAY: 0.0\nSOLVER:\n  BASE_LR: 0.1 # 16 machine\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 300\n  WEIGHT_DECAY: 5e-5\n  WARMUP_EPOCHS: 35.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: x3d\n  MODEL_NAME: PTVX3D\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/Kinetics/pytorchvideo/X3D_XS.yaml",
    "content": "TRAIN:\n  # ENABLE: False # default True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 128\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\nX3D:\n  WIDTH_FACTOR: 2.0\n  DEPTH_FACTOR: 2.2\n  BOTTLENECK_FACTOR: 2.25\n  DIM_C5: 2048\n  DIM_C1: 12\nTEST:\n  ENABLE: True\n  DATASET: ptvkinetics\n  BATCH_SIZE: 64\n  # CHECKPOINT_FILE_PATH: 'x3d_xs.pyth' # 69.46% top1 30-view accuracy to download from the model zoo (optional).\n  # NUM_SPATIAL_CROPS: 1\n  NUM_SPATIAL_CROPS: 3\nDATA:\n  NUM_FRAMES: 4\n  SAMPLING_RATE: 12\n  TRAIN_JITTER_SCALES: [182, 228]\n  TRAIN_CROP_SIZE: 160\n  # TEST_CROP_SIZE: 160 # use if TEST.NUM_SPATIAL_CROPS: 1\n  TEST_CROP_SIZE: 182 # use if TEST.NUM_SPATIAL_CROPS: 3\n  INPUT_CHANNEL_NUM: [3]\n\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  TRANS_FUNC: x3d_transform\n  STRIDE_1X1: False\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\n  WEIGHT_DECAY: 0.0\nSOLVER:\n  BASE_LR: 0.1 # 1 machine\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 300\n  WEIGHT_DECAY: 5e-5\n  WARMUP_EPOCHS: 35.0\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: x3d\n  MODEL_NAME: PTVX3D\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/SSv2/MVITv2_B_32x3.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: ssv2\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 2\n  CHECKPOINT_PERIOD: 2\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH:\n  CHECKPOINT_TYPE: pytorch\n  CHECKPOINT_EPOCH_RESET: True\nDATA:\n  DECODING_BACKEND: torchvision\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 3\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\n  INV_UNIFORM_SAMPLE: True\n  RANDOM_FLIP: False\n  REVERSE_INPUT_CHANNEL: True\n  # PATH_TO_DATA_DIR:\nMVIT:\n  ZERO_DECAY_POS_CLS: False\n  USE_ABS_POS: False\n  REL_POS_SPATIAL: True\n  REL_POS_TEMPORAL: True\n  DEPTH: 24\n  NUM_HEADS: 1\n  EMBED_DIM: 96\n  PATCH_KERNEL: (3, 7, 7)\n  PATCH_STRIDE: (2, 4, 4)\n  PATCH_PADDING: (1, 3, 3)\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  DROPPATH_RATE: 0.4\n  NORM: \"layernorm\"\n  MODE: \"conv\"\n  CLS_EMBED_ON: True\n  DIM_MUL: [[2, 2.0], [5, 2.0], [21, 2.0]]\n  HEAD_MUL: [[2, 2.0], [5, 2.0], [21, 2.0]]\n  POOL_KVQ_KERNEL: [3, 3, 3]\n  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]\n  POOL_Q_STRIDE: [[0, 1, 1, 1], [1, 1, 1, 1], [2, 1, 2, 2], [3, 1, 1, 1], [4, 1, 1, 1], [5, 1, 2, 2], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 1, 1], [9, 1, 1, 1], [10, 1, 1, 1], [11, 1, 1, 1], [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 1, 1], [15, 1, 1, 1], [16, 1, 1, 1], [17, 1, 1, 1], [18, 1, 1, 1], [19, 1, 1, 1], [20, 1, 1, 1], [21, 1, 2, 2], [22, 1, 1, 1], [23, 1, 1, 1]]\n  DROPOUT_RATE: 0.0\n  DIM_MUL_IN_ATT: True\n  RESIDUAL_POOLING: True\nAUG:\n  NUM_SAMPLE: 1\n  ENABLE: True\n  COLOR_JITTER: 0.4\n  AA_TYPE: rand-m7-n4-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  ZERO_WD_1D_PARAM: True\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 0.0015\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  WARMUP_START_LR: 
1e-4\n  WARMUP_EPOCHS: 3.0\n  LR_POLICY: cosine\n  MAX_EPOCH: 100\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 174\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: ssv2\n  BATCH_SIZE: 64\n  NUM_ENSEMBLE_VIEWS: 1\n  NUM_SPATIAL_CROPS: 3\nDATA_LOADER:\n  NUM_WORKERS: 2\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nPREFETCH:\n  NUM_LOADERS: 3\n"
  },
  {
    "path": "configs/SSv2/MVITv2_L_40x3.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: ssv2\n  BATCH_SIZE: 8\n  EVAL_PERIOD: 2\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  #CHECKPOINT_FILE_PATH:\n  CHECKPOINT_IN_INIT: True\n  CHECKPOINT_TYPE: pytorch\n  CHECKPOINT_EPOCH_RESET: True\nDATA:\n  DECODING_BACKEND: torchvision\n  NUM_FRAMES: 40\n  SAMPLING_RATE: 3\n  TRAIN_JITTER_SCALES: [356, 446]\n  TRAIN_CROP_SIZE: 312\n  TEST_CROP_SIZE: 312\n  INPUT_CHANNEL_NUM: [3]\n  INV_UNIFORM_SAMPLE: True\n  RANDOM_FLIP: False\n  REVERSE_INPUT_CHANNEL: True\n  # PATH_TO_DATA_DIR:\n  MEAN: [0.485, 0.456, 0.406]\n  STD: [0.229, 0.224, 0.225]\nMVIT:\n  ZERO_DECAY_POS_CLS: False\n  USE_ABS_POS: False\n  REL_POS_SPATIAL: True\n  REL_POS_TEMPORAL: True\n  DEPTH: 48\n  NUM_HEADS: 2\n  EMBED_DIM: 144\n  PATCH_KERNEL: (3, 7, 7)\n  PATCH_STRIDE: (2, 4, 4)\n  PATCH_PADDING: (1, 3, 3)\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  DROPPATH_RATE: 0.75\n  NORM: \"layernorm\"\n  MODE: \"conv\"\n  CLS_EMBED_ON: True\n  DIM_MUL: [[2, 2.0], [8, 2.0], [44, 2.0]]\n  HEAD_MUL: [[2, 2.0], [8, 2.0], [44, 2.0]]\n  POOL_KVQ_KERNEL: [3, 3, 3]\n  POOL_KV_STRIDE_ADAPTIVE: [2, 4, 4]\n  DROPOUT_RATE: 0.0\n  POOL_Q_STRIDE: [[2, 1, 2, 2], [8, 1, 2, 2], [44, 1, 2, 2]]\n  RESIDUAL_POOLING: True\n  MODE: \"conv\"\n  DIM_MUL_IN_ATT: True\n  POOL_Q_STRIDE: [[0, 1, 1, 1], [1, 1, 1, 1], [2, 1, 2, 2], [3, 1, 1, 1], [4, 1, 1, 1], [5, 1, 1, 1], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 2, 2], [9, 1, 1, 1], [10, 1, 1, 1],\n  [11, 1, 1, 1], [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 1, 1], [15, 1, 1, 1], [16, 1, 1, 1], [17, 1, 1, 1], [18, 1, 1, 1], [19, 1, 1, 1], [20, 1, 1, 1],\n  [21, 1, 1, 1], [22, 1, 1, 1], [23, 1, 1, 1], [24, 1, 1, 1], [25, 1, 1, 1], [26, 1, 1, 1], [27, 1, 1, 1], [28, 1, 1, 1], [29, 1, 1, 1], [30, 1, 1, 1],\n  [31, 1, 1, 1], [32, 1, 1, 1], [33, 1, 1, 1], [34, 1, 1, 1], [35, 1, 1, 1], [36, 1, 1, 1], [37, 1, 1, 1], [38, 1, 1, 1], [39, 1, 1, 1], [40, 1, 1, 1],\n  [41, 1, 1, 1], [42, 1, 1, 1], [43, 1, 1, 1], [44, 1, 2, 2], [45, 1, 1, 1], [46, 1, 1, 1], [47, 
1, 1, 1] ]\n  POOL_FIRST: False # default: False\n  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]\n\nAUG:\n  NUM_SAMPLE: 1\n  ENABLE: True\n  COLOR_JITTER: 0.4\n  AA_TYPE: rand-m7-n4-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 0.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  CLIP_GRAD_L2NORM: 2.0\n  ZERO_WD_1D_PARAM: True\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 0.00125\n  # BASE_LR: 0.0001\n\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  WARMUP_START_LR: 1e-6\n  WARMUP_EPOCHS: 3.0\n  LR_POLICY: cosine\n  MAX_EPOCH: 40\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 174\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy\n  DROPOUT_RATE: 0.5\n  ACT_CHECKPOINT: True\nTEST:\n  ENABLE: True\n  DATASET: ssv2\n  BATCH_SIZE: 8\n  NUM_ENSEMBLE_VIEWS: 1\n  NUM_SPATIAL_CROPS: 3\nDATA_LOADER:\n  NUM_WORKERS: 2\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nPREFETCH:\n  NUM_LOADERS: 3\n"
  },
  {
    "path": "configs/SSv2/MVITv2_S_16x4.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: ssv2\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 2\n  CHECKPOINT_PERIOD: 2\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH:\n  CHECKPOINT_TYPE: pytorch\n  CHECKPOINT_EPOCH_RESET: True\nDATA:\n  DECODING_BACKEND: torchvision\n  NUM_FRAMES: 16\n  SAMPLING_RATE: 4\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\n  INV_UNIFORM_SAMPLE: True\n  RANDOM_FLIP: False\n  REVERSE_INPUT_CHANNEL: True\n  #PATH_TO_DATA_DIR:\nMVIT:\n  ZERO_DECAY_POS_CLS: False\n  USE_ABS_POS: False\n  REL_POS_SPATIAL: True\n  REL_POS_TEMPORAL: True\n  DEPTH: 16\n  NUM_HEADS: 1\n  EMBED_DIM: 96\n  PATCH_KERNEL: (3, 7, 7)\n  PATCH_STRIDE: (2, 4, 4)\n  PATCH_PADDING: (1, 3, 3)\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  DROPPATH_RATE: 0.4\n  NORM: \"layernorm\"\n  MODE: \"conv\"\n  CLS_EMBED_ON: True\n  DIM_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]\n  HEAD_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]\n  POOL_KVQ_KERNEL: [3, 3, 3]\n  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]\n  POOL_Q_STRIDE: [[0, 1, 1, 1], [1, 1, 2, 2], [2, 1, 1, 1], [3, 1, 2, 2], [4, 1, 1, 1], [5, 1, 1, 1], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 1, 1], [9, 1, 1, 1], [10, 1, 1, 1], [11, 1, 1, 1], [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 2, 2], [15, 1, 1, 1]]\n  DROPOUT_RATE: 0.0\n  DIM_MUL_IN_ATT: True\n  RESIDUAL_POOLING: True\nAUG:\n  NUM_SAMPLE: 1\n  ENABLE: True\n  COLOR_JITTER: 0.4\n  AA_TYPE: rand-m7-n4-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nSOLVER:\n  ZERO_WD_1D_PARAM: True\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 0.0025\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  WARMUP_START_LR: 1e-4\n  WARMUP_EPOCHS: 3.0\n  LR_POLICY: cosine\n  MAX_EPOCH: 100\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  OPTIMIZING_METHOD: sgd\n  COSINE_AFTER_WARMUP: True\nMODEL:\n  
NUM_CLASSES: 174\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: ssv2\n  BATCH_SIZE: 64\n  NUM_ENSEMBLE_VIEWS: 1\n  NUM_SPATIAL_CROPS: 3\nDATA_LOADER:\n  NUM_WORKERS: 2\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nPREFETCH:\n  NUM_LOADERS: 3\n"
  },
  {
    "path": "configs/SSv2/SLOWFAST_16x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: ssv2\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 2\n  CHECKPOINT_PERIOD: 2\n  AUTO_RESUME: True\n  CHECKPOINT_FILE_PATH: SLOWFAST_8x8_R50.pkl # please download from the model zoo.\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 64\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\n  INV_UNIFORM_SAMPLE: True\n  RANDOM_FLIP: False\n  REVERSE_INPUT_CHANNEL: True\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 7\nRESNET:\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\n  NORM_TYPE: sync_batchnorm\n  NUM_SYNC_DEVICES: 4\nSOLVER:\n  BASE_LR: 0.03\n  LR_POLICY: steps_with_relative_lrs\n  LRS: [1, 0.1, 0.01, 0.001, 0.0001, 0.00001]\n  STEPS: [0, 14, 18]\n  MAX_EPOCH: 22\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-6\n  WARMUP_EPOCHS: 0.19\n  WARMUP_START_LR: 0.0001\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 174\n  ARCH: slowfast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: ssv2\n  BATCH_SIZE: 16\n  NUM_ENSEMBLE_VIEWS: 1\n  NUM_SPATIAL_CROPS: 1\nDATA_LOADER:\n  NUM_WORKERS: 4\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nLOG_MODEL_INFO: False\n"
  },
  {
    "path": "configs/SSv2/SLOWFAST_16x8_R50_multigrid.yaml",
    "content": "MULTIGRID:\n  SHORT_CYCLE: True\n  LONG_CYCLE: True\nTRAIN:\n  ENABLE: True\n  DATASET: ssv2\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 2\n  CHECKPOINT_PERIOD: 2\n  AUTO_RESUME: True\n  CHECKPOINT_FILE_PATH: SLOWFAST_8x8_R50.pkl # please download from the model zoo.\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 64\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3, 3]\n  INV_UNIFORM_SAMPLE: True\n  RANDOM_FLIP: False\n  REVERSE_INPUT_CHANNEL: True\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 7\nRESNET:\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\n  NORM_TYPE: sync_batchnorm\n  NUM_SYNC_DEVICES: 4\nSOLVER:\n  BASE_LR: 0.03\n  LR_POLICY: steps_with_relative_lrs\n  LRS: [1, 0.1, 0.01, 0.001, 0.0001, 0.00001]\n  STEPS: [0, 14, 18]\n  MAX_EPOCH: 22\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-6\n  WARMUP_EPOCHS: 0.19\n  WARMUP_START_LR: 0.0001\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 174\n  ARCH: slowfast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: ssv2\n  BATCH_SIZE: 16\n  NUM_ENSEMBLE_VIEWS: 1\n  NUM_SPATIAL_CROPS: 1\nDATA_LOADER:\n  NUM_WORKERS: 4\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nLOG_MODEL_INFO: False\n"
  },
  {
    "path": "configs/SSv2/pytorchvideo/SLOWFAST_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: ptvssv2\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 2\n  CHECKPOINT_PERIOD: 2\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: # please download from the model zoo.\n  CHECKPOINT_EPOCH_RESET: True\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\n  INV_UNIFORM_SAMPLE: True\n  RANDOM_FLIP: False\n  REVERSE_INPUT_CHANNEL: True\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 7\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\n  POOL: [[[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]], [[1, 2, 2], [1, 2, 2]]]\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 150\nSOLVER:\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 30\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 3.0\n  WARMUP_START_LR: 0.08\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 174\n  ARCH: slowfast\n  MODEL_NAME: PTVSlowFast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: ptvssv2\n  BATCH_SIZE: 64\n  NUM_ENSEMBLE_VIEWS: 1\n  NUM_SPATIAL_CROPS: 3\nDATA_LOADER:\n  NUM_WORKERS: 2\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/SSv2/pytorchvideo/SLOW_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: ptvssv2\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 2\n  CHECKPOINT_PERIOD: 2\n  AUTO_RESUME: True\n  # CHECKPOINT_FILE_PATH: # please download from the model zoo.\n  CHECKPOINT_EPOCH_RESET: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n  INV_UNIFORM_SAMPLE: True\n  RANDOM_FLIP: False\n  REVERSE_INPUT_CHANNEL: True\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 30\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 3.0\n  WARMUP_START_LR: 0.08\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 174\n  ARCH: slow\n  MODEL_NAME: PTVResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: ptvssv2\n  BATCH_SIZE: 64\n  NUM_ENSEMBLE_VIEWS: 1\n  NUM_SPATIAL_CROPS: 3\nDATA_LOADER:\n  NUM_WORKERS: 2\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/contrastive_ssl/BYOL_SlowR50_8x8.yaml",
    "content": "TASK: ssl\nTRAIN:\n  DATASET: kinetics\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 5\n  AUTO_RESUME: True\nMODEL:\n  NUM_CLASSES: 256\n  MODEL_NAME: ContrastiveModel\n  ARCH: slow_c2d\n  ARCH: slow\n  LOSS_FUNC: contrastive_loss\n  DROPOUT_RATE: 0.0\n  HEAD_ACT: none\nCONTRASTIVE:\n  T: 0.5\n  DIM: 256 # 128 default, if changed, change nCls too\n  NUM_MLP_LAYERS: 2 # default 1\n  BN_MLP: True\n  BN_SYNC_MLP: True\n  MLP_DIM: 4096\n  SEQUENTIAL: True  # def fault\n  MOMENTUM: 0.996 # default 0.5\n  MOMENTUM_ANNEALING: True # default false\n  TYPE: byol # default mem\n  PREDICTOR_DEPTHS: [2]\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  # NUM_FRAMES: 16 # dont forget to change these parameters in linear & finetuning configs\n  # SAMPLING_RATE: 4\n  TRAIN_CROP_NUM_TEMPORAL: 2  # default 1\n  TRAIN_CROP_NUM_SPATIAL: 1  # default 1\n  TRAIN_JITTER_SCALES_RELATIVE: [0.2, 0.766]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\n  SSL_MOCOV2_AUG: True\n  SSL_COLOR_JITTER: True # default false\n  COLOR_RND_GRAYSCALE: 0.2 # default 0.0\n  SSL_COLOR_HUE: 0.15\n  SSL_COLOR_BRI_CON_SAT: [0.6, 0.6, 0.6] # default [0.4, 0.4, 0.4]\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n  PATH_LABEL_SEPARATOR: \" \"\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\n  WEIGHT_DECAY: 0.0\n  NUM_SYNC_DEVICES: 8\n  NORM_TYPE: \"sync_batchnorm\"\n  # NORM_TYPE: \"sync_batchnorm_apex\"\nSOLVER:\n  # BASE_LR: 1.2 # for rho=4 clips\n  BASE_LR: 0.6\n  LARS_ON: True\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 200\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-6\n  WARMUP_EPOCHS: 35.0\n  WARMUP_START_LR: 0.001\n  OPTIMIZING_METHOD: sgd\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 10\n  PIN_MEMORY: True\nNUM_GPUS: 8\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/contrastive_ssl/MoCo_SlowR50_8x8.yaml",
    "content": "TASK: ssl\nTRAIN:\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 5\n  AUTO_RESUME: True\n  MIXED_PRECISION: True\nMODEL:\n  NUM_CLASSES: 128\n  MODEL_NAME: ContrastiveModel\n  ARCH: slow_c2d\n  ARCH: slow\n  LOSS_FUNC: contrastive_loss\n  DROPOUT_RATE: 0.0\n  HEAD_ACT: none\nCONTRASTIVE:\n  T: 0.1\n  NUM_MLP_LAYERS: 3 # default 1\n  SEQUENTIAL: True  # def fault\n  MOCO_MULTI_VIEW_QUEUE: True # default: False\n  MOMENTUM: 0.994 # default 0.5 ours: 0.999\n  MOMENTUM_ANNEALING: True # default false\n  TYPE: moco # default mem\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_CROP_NUM_TEMPORAL: 4  # default 1\n  TRAIN_CROP_NUM_SPATIAL: 1  # default 1\n  TRAIN_JITTER_SCALES_RELATIVE: [0.2, 0.766]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\n  SSL_MOCOV2_AUG: True\n  COLOR_RND_GRAYSCALE: 0.2 # default 0.0\n  SSL_COLOR_JITTER: True # default false\n  SSL_COLOR_HUE: 0.15\n  SSL_COLOR_BRI_CON_SAT: [0.6, 0.6, 0.6] # default [0.4, 0.4, 0.4]\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n  PATH_LABEL_SEPARATOR: \" \"\nDATA_LOADER:\n  NUM_WORKERS: 8\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\n  WEIGHT_DECAY: 0.0\nSOLVER:\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 0.05\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 200\n  WARMUP_EPOCHS: 35.0\n  MOMENTUM: 0.9\n  WARMUP_START_LR: 0.001\n  OPTIMIZING_METHOD: sgd\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nNUM_GPUS: 8\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/contrastive_ssl/SimCLR_SlowR50_8x8.yaml",
    "content": "TASK: ssl\nTRAIN:\n  DATASET: kinetics\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\nMODEL:\n  NUM_CLASSES: 128\n  MODEL_NAME: ContrastiveModel\n  ARCH: slow\n  LOSS_FUNC: contrastive_loss\n  DROPOUT_RATE: 0.0\n  HEAD_ACT: none\nCONTRASTIVE:\n  TYPE: simclr # default mem\n  T: 0.1 # default 0.07\n  DIM: 128 # 128 default, if changed, change nCls too\n  NUM_CLASSES_DOWNSTREAM: 400\n  NUM_MLP_LAYERS: 3 # default 1\n  BN_SYNC_MLP: True\n  SIMCLR_DIST_ON: True # default false\n  SEQUENTIAL: True  # def fault\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_CROP_NUM_TEMPORAL: 4  # default 1\n  TRAIN_CROP_NUM_SPATIAL: 1  # default 1\n  SSL_MOCOV2_AUG: True\n  SSL_COLOR_JITTER: True # default false\n  TRAIN_JITTER_SCALES_RELATIVE: [0.2, 0.766]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\n  COLOR_RND_GRAYSCALE: 0.2 # default 0.0\n  SSL_COLOR_HUE: 0.15\n  SSL_COLOR_BRI_CON_SAT: [0.6, 0.6, 0.6] # default [0.4, 0.4, 0.4]\n  TRAIN_JITTER_SCALES: [224, 224]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n  PATH_LABEL_SEPARATOR: \" \"\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\n  WEIGHT_DECAY: 0.0\n  NUM_SYNC_DEVICES: 8\n  NORM_TYPE: \"sync_batchnorm\"\n  # NORM_TYPE: \"sync_batchnorm_apex\"\nSOLVER:\n  BASE_LR: 0.6\n  BASE_LR: 1.2\n\n  LARS_ON: True\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 200\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-6\n  WARMUP_EPOCHS: 35.0\n  WARMUP_START_LR: 0.001\n  OPTIMIZING_METHOD: sgd\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 10\n  PIN_MEMORY: True\nNUM_GPUS: 8\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/contrastive_ssl/SwAV_Slow_R50_8x8.yaml",
    "content": "TASK: ssl\nTRAIN:\n  DATASET: kinetics\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\nMODEL:\n  NUM_CLASSES: 128\n  MODEL_NAME: ContrastiveModel\n  ARCH: slow\n  LOSS_FUNC: contrastive_loss\n  DROPOUT_RATE: 0.0\n  HEAD_ACT: none\nCONTRASTIVE:\n  TYPE: swav # default mem\n  T: 0.1 # default 0.07\n  DIM: 128 # 128 default, if changed, change nCls too\n  NUM_CLASSES_DOWNSTREAM: 400\n  NUM_MLP_LAYERS: 3 # default 1\n  BN_MLP: True\n  BN_SYNC_MLP: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_CROP_NUM_TEMPORAL: 2  # default 1\n  TRAIN_CROP_NUM_SPATIAL: 1  # default 1\n  TRAIN_JITTER_SCALES_RELATIVE: [0.2, 0.766]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\n  SSL_MOCOV2_AUG: True\n  SSL_COLOR_JITTER: True # default false\n  COLOR_RND_GRAYSCALE: 0.2 # default 0.0\n  SSL_COLOR_HUE: 0.15\n  SSL_COLOR_BRI_CON_SAT: [0.6, 0.6, 0.6] # default [0.4, 0.4, 0.4]\n  TRAIN_JITTER_SCALES: [224, 224]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n  PATH_LABEL_SEPARATOR: \" \"\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\n  WEIGHT_DECAY: 0.0\n  NUM_SYNC_DEVICES: 8\n  NORM_TYPE: \"sync_batchnorm\"\nSOLVER:\n  BASE_LR: 0.6\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 200\n  LARS_ON: True\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-6\n  WARMUP_EPOCHS: 35.0\n  WARMUP_START_LR: 0.001\n  OPTIMIZING_METHOD: sgd\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 10\n  PIN_MEMORY: True\nNUM_GPUS: 8\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/contrastive_ssl/finetune_SSv2_Slow_R50_syn0.yaml",
    "content": "TASK: ssl_eval_ssv2\nTRAIN:\n  ENABLE: True\n  DATASET: ssv2\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 4\n  CHECKPOINT_PERIOD: 4\n  AUTO_RESUME: True\n  CHECKPOINT_TYPE: pytorch\n  CHECKPOINT_EPOCH_RESET: True\n  CHECKPOINT_CLEAR_NAME_PATTERN: (\"backbone.\",)\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  DECODING_BACKEND: torchvision\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n  INV_UNIFORM_SAMPLE: True\n  RANDOM_FLIP: False\n  PATH_TO_DATA_DIR: # pls add\n  PATH_PREFIX: # pls add\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\n  WEIGHT_DECAY: 0.0\nSOLVER:\n  BASE_LR: 0.12\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: steps_with_relative_lrs\n  LRS: [1, 0.1, 0.01, 0.001, 0.0001, 0.00001]\n  STEPS: [0, 14, 18]\n  MAX_EPOCH: 22\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-6\n  WARMUP_EPOCHS: 0.19\n  WARMUP_START_LR: 0.0001\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 174\n  ARCH: slow\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: ssv2\n  BATCH_SIZE: 64\n  NUM_ENSEMBLE_VIEWS: 1\n  NUM_SPATIAL_CROPS: 1\nDATA_LOADER:\n  NUM_WORKERS: 2\n  PIN_MEMORY: True\nNUM_GPUS: 8\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/contrastive_ssl/finetune_SSv2_Slow_R50_syn8.yaml",
    "content": "TASK: ssl_eval_ssv2\nTRAIN:\n  ENABLE: True\n  DATASET: ssv2\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 4\n  CHECKPOINT_PERIOD: 4\n  AUTO_RESUME: True\n  CHECKPOINT_TYPE: pytorch\n  CHECKPOINT_EPOCH_RESET: True\n  CHECKPOINT_CLEAR_NAME_PATTERN: (\"backbone.\",)\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  DECODING_BACKEND: torchvision\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n  INV_UNIFORM_SAMPLE: True\n  RANDOM_FLIP: False\n  PATH_TO_DATA_DIR: # add\n  PATH_PREFIX: # add\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\n  WEIGHT_DECAY: 0.0\n  NUM_SYNC_DEVICES: 8\n  NORM_TYPE: \"sync_batchnorm\"\n  # NORM_TYPE: \"sync_batchnorm_apex\"\nSOLVER:\n  BASE_LR: 0.12\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: steps_with_relative_lrs\n  LRS: [1, 0.1, 0.01, 0.001, 0.0001, 0.00001]\n  STEPS: [0, 14, 18]\n  MAX_EPOCH: 22\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-6\n  WARMUP_EPOCHS: 0.19\n  WARMUP_START_LR: 0.0001\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 174\n  ARCH: slow\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: ssv2\n  BATCH_SIZE: 64\n  NUM_ENSEMBLE_VIEWS: 1\n  NUM_SPATIAL_CROPS: 1\nDATA_LOADER:\n  NUM_WORKERS: 2\n  PIN_MEMORY: True\nNUM_GPUS: 8\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/contrastive_ssl/finetune_ava_Slow_R50_syn0.yaml",
    "content": "TASK: ssl_eval_ava\nTRAIN:\n  DATASET: ava\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 4\n  CHECKPOINT_PERIOD: 4\n  AUTO_RESUME: True\n  CHECKPOINT_CLEAR_NAME_PATTERN: (\"backbone.\",)\n  CHECKPOINT_TYPE: pytorch\n  CHECKPOINT_EPOCH_RESET: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\nDETECTION:\n  ENABLE: True\n  ALIGNED: True\nAVA:\n  DETECTION_SCORE_THRESH: 0.9\n  TRAIN_PREDICT_BOX_LISTS: # add\n  TEST_PREDICT_BOX_LISTS: # add\n  ANNOTATION_DIR: # add\n  FRAME_LIST_DIR: # add\n  FRAME_DIR: # add\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\n  SPATIAL_DILATIONS: [[1], [1], [1], [2]]\n  SPATIAL_STRIDES: [[1], [2], [2], [1]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: softmax\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.3  # default 0.1\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: steps_with_relative_lrs\n  STEPS: [0, 16, 24, 28, 32]\n  LRS: [1, 0.1, 0.01, 0.001]\n  MAX_EPOCH: 32\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-7\n  WARMUP_EPOCHS: 5.3\n  WARMUP_START_LR: 0.0005  # 0.001 / 16 / 32 * 64 = 0.000125\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 80\n  ARCH: slow\n  MODEL_NAME: ResNet\n  LOSS_FUNC: bce\n  DROPOUT_RATE: 0.5\n  HEAD_ACT: sigmoid\nTEST:\n  ENABLE: True\n  DATASET: ava\n  BATCH_SIZE: 8\nDATA_LOADER:\n  NUM_WORKERS: 2\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/contrastive_ssl/finetune_ava_Slow_R50_syn8.yaml",
    "content": "TASK: ssl_eval_ava\nTRAIN:\n  DATASET: ava\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 4\n  CHECKPOINT_PERIOD: 4\n  AUTO_RESUME: True\n  CHECKPOINT_CLEAR_NAME_PATTERN: (\"backbone.\",)\n  CHECKPOINT_TYPE: pytorch\n  CHECKPOINT_EPOCH_RESET: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\nDETECTION:\n  ENABLE: True\n  ALIGNED: True\nAVA:\n  DETECTION_SCORE_THRESH: 0.9\n  TRAIN_PREDICT_BOX_LISTS: # add\n  TEST_PREDICT_BOX_LISTS: # add\n  ANNOTATION_DIR: # add\n  FRAME_LIST_DIR: # add\n  FRAME_DIR: # add\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3], [4], [6], [3]]\n  SPATIAL_DILATIONS: [[1], [1], [1], [2]]\n  SPATIAL_STRIDES: [[1], [2], [2], [1]]\nNONLOCAL:\n  LOCATION: [[[]], [[]], [[]], [[]]]\n  GROUP: [[1], [1], [1], [1]]\n  INSTANTIATION: softmax\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\n  NUM_SYNC_DEVICES: 8\n  NORM_TYPE: \"sync_batchnorm\"\n  # NORM_TYPE: \"sync_batchnorm_apex\"\nSOLVER:\n  BASE_LR: 0.3  # default 0.1\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: steps_with_relative_lrs\n  STEPS: [0, 16, 24, 28, 32]\n  LRS: [1, 0.1, 0.01, 0.001]\n  MAX_EPOCH: 32\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-7\n  WARMUP_EPOCHS: 5.3\n  WARMUP_START_LR: 0.0005  # 0.001 / 16 / 32 * 64 = 0.000125\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 80\n  ARCH: slow\n  MODEL_NAME: ResNet\n  LOSS_FUNC: bce\n  DROPOUT_RATE: 0.5\n  HEAD_ACT: sigmoid\nTEST:\n  ENABLE: True\n  DATASET: ava\n  BATCH_SIZE: 8\nDATA_LOADER:\n  NUM_WORKERS: 2\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/contrastive_ssl/finetune_ucf_Slow_R50_syn0.yaml",
    "content": "TASK: ssl_eval_ucf\nTRAIN:\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 20\n  CHECKPOINT_PERIOD: 20\n  AUTO_RESUME: True\n  CHECKPOINT_CLEAR_NAME_PATTERN: (\"backbone.\",)\n  CHECKPOINT_EPOCH_RESET: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  SSL_COLOR_JITTER: True # default false\n  COLOR_RND_GRAYSCALE: 0.2 # default 0.0\n  SSL_COLOR_HUE: 0.1\n  SSL_COLOR_BRI_CON_SAT: [0.4, 0.4, 0.4] # default [0.4, 0.4, 0.4]\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n  PATH_TO_DATA_DIR: # add\n  DECODING_BACKEND: torchvision\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\n  WEIGHT_DECAY: 0.0\nSOLVER:\n  BASE_LR: 0.005\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 200\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 0.0\n  WARMUP_START_LR: 0.001\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 101\n  ARCH: slow\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/contrastive_ssl/finetune_ucf_Slow_R50_syn8.yaml",
    "content": "TASK: ssl_eval_ucf\nTRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 20\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\n  CHECKPOINT_CLEAR_NAME_PATTERN: (\"backbone.\",)\n  CHECKPOINT_EPOCH_RESET: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  SSL_COLOR_JITTER: True # default false\n  COLOR_RND_GRAYSCALE: 0.2 # default 0.0\n  SSL_COLOR_HUE: 0.1\n  SSL_COLOR_BRI_CON_SAT: [0.4, 0.4, 0.4] # default [0.4, 0.4, 0.4]\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n  PATH_TO_DATA_DIR: # add\n  DECODING_BACKEND: torchvision\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\n  WEIGHT_DECAY: 0.0\n  NUM_SYNC_DEVICES: 8\n  NORM_TYPE: \"sync_batchnorm\"\nSOLVER:\n  BASE_LR: 0.01\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  MAX_EPOCH: 200\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.0\n  WARMUP_EPOCHS: 0.0\n  WARMUP_START_LR: 0.001\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 101\n  MODEL_NAME: ResNet\n  ARCH: slow\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.8\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/contrastive_ssl/linear_k400_Slow_8x8_R50_syn0.yaml",
    "content": "TASK: ssl_eval_k400\nTRAIN:\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 20\n  CHECKPOINT_PERIOD: 20\n  AUTO_RESUME: True\n  CHECKPOINT_CLEAR_NAME_PATTERN: (\"backbone.\",)\n  CHECKPOINT_TYPE: pytorch\n  CHECKPOINT_EPOCH_RESET: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n  PATH_TO_DATA_DIR: # insert pls\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 50\n  WEIGHT_DECAY: 0.0\nSOLVER:\n  BASE_LR: 0.5 # for 8 gpus\n  BASE_LR_SCALE_NUM_SHARDS: True\n  MAX_EPOCH: 60\n  LR_POLICY: cosine\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.0 # 1e-4 default\n  WARMUP_EPOCHS: 8.0\n  WARMUP_START_LR: 0.0\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  MODEL_NAME: ResNet\n  ARCH: slow\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.0\n  DETACH_FINAL_FC: True\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/contrastive_ssl/linear_k400_Slow_8x8_R50_syn8.yaml",
    "content": "TASK: ssl_eval_k400\nTRAIN:\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 20\n  CHECKPOINT_PERIOD: 20\n  AUTO_RESUME: True\n  CHECKPOINT_CLEAR_NAME_PATTERN: (\"backbone.\",)\n  CHECKPOINT_TYPE: pytorch\n  CHECKPOINT_EPOCH_RESET: True\nDATA:\n  NUM_FRAMES: 8\n  SAMPLING_RATE: 8\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3]\n  PATH_TO_DATA_DIR: # plz enter\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 50\n  WEIGHT_DECAY: 0.0\n  NUM_SYNC_DEVICES: 8\n  NORM_TYPE: \"sync_batchnorm_apex\"\n  NORM_TYPE: \"sync_batchnorm\"\nSOLVER:\n  BASE_LR: 0.5 # slightly better +0.3%\n  BASE_LR_SCALE_NUM_SHARDS: True\n  MAX_EPOCH: 60\n  LR_POLICY: cosine\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.0 # 1e-4 default\n  WARMUP_EPOCHS: 8.0\n  WARMUP_START_LR: 0.0\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slow\n  MODEL_NAME: ResNet\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.0\n  DETACH_FINAL_FC: True\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/masked_ssl/in1k_VIT_B_MaskFeat_FT.yaml",
    "content": "TASK: ssl_eval\nTRAIN:\n  ENABLE: True\n  DATASET: imagenet\n  BATCH_SIZE: 512\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 10\n  CHECKPOINT_EPOCH_RESET: True\n  AUTO_RESUME: True\nDATA:\n  MEAN: [0.485, 0.456, 0.406]\n  STD: [0.229, 0.224, 0.225]\n  NUM_FRAMES: 1\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nMVIT:\n  PATCH_2D: True\n  ZERO_DECAY_POS_CLS: False\n  MODE: \"conv\"\n  CLS_EMBED_ON: True\n  PATCH_KERNEL: [16, 16]\n  PATCH_STRIDE: [16, 16]\n  PATCH_PADDING: [0, 0]\n  EMBED_DIM: 768\n  NUM_HEADS: 12\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  DROPPATH_RATE: 0.1\n  LAYER_SCALE_INIT_VALUE: 0.0\n  DEPTH: 12\n  NORM: \"layernorm\"\n  HEAD_INIT_SCALE: 0.001\n  USE_MEAN_POOLING: True\nAUG:\n  ENABLE: True\n  COLOR_JITTER: 0.4\n  AA_TYPE: rand-m9-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nSOLVER:\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LAYER_DECAY: 0.65\n  BASE_LR: 0.002\n  LR_POLICY: cosine\n  MAX_EPOCH: 100\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  WARMUP_EPOCHS: 20.0\n  WARMUP_START_LR: 1e-8\n  OPTIMIZING_METHOD: adamw\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  ZERO_WD_1D_PARAM: True\nMODEL:\n  NUM_CLASSES: 1000\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy\n  DROPOUT_RATE: 0.0\nTEST:\n  ENABLE: False\n  DATASET: imagenet\n  BATCH_SIZE: 256\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/masked_ssl/in1k_VIT_B_MaskFeat_PT.yaml",
    "content": "TASK: ssl\nTRAIN:\n  ENABLE: True\n  DATASET: imagenet\n  BATCH_SIZE: 256\n  EVAL_PERIOD: 10000\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\nDATA:\n  TRAIN_JITTER_SCALES_RELATIVE: [0.5, 1.0]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\n  MEAN: [0.485, 0.456, 0.406]\n  STD: [0.229, 0.224, 0.225]\n  NUM_FRAMES: 1\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nMVIT:\n  PATCH_2D: True\n  ZERO_DECAY_POS_CLS: False\n  MODE: \"conv\"\n  CLS_EMBED_ON: True\n  PATCH_KERNEL: [16, 16]\n  PATCH_STRIDE: [16, 16]\n  PATCH_PADDING: [0, 0]\n  EMBED_DIM: 768\n  NUM_HEADS: 12\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  DROPPATH_RATE: 0.0\n  DEPTH: 12\n  NORM: \"layernorm\"\nMASK:\n  ENABLE: True\n  PRETRAIN_DEPTH: [11]\n  HEAD_TYPE: \"separate\"\n  PRED_HOG: True\nAUG:\n  ENABLE: True\n  RE_PROB: 0.0\n  COLOR_JITTER: None\n  AA_TYPE: \"\"\n  GEN_MASK_LOADER: True\n  MASK_RATIO: 0.4\nMIXUP:\n  ENABLE: False\nSOLVER:\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 0.0002 # bs256\n  LR_POLICY: cosine\n  MAX_EPOCH: 300\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  WARMUP_EPOCHS: 30.0\n  WARMUP_START_LR: 1e-6\n  OPTIMIZING_METHOD: adamw\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  ZERO_WD_1D_PARAM: True\n  CLIP_GRAD_L2NORM: 0.02\nMODEL:\n  ARCH: maskmvit\n  MODEL_NAME: MaskMViT\n  LOSS_FUNC: multi_mse\n  DROPOUT_RATE: 0.0\nTEST:\n  ENABLE: False\n  DATASET: imagenet\n  BATCH_SIZE: 256\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nLOG_MODEL_INFO: False\n"
  },
  {
    "path": "configs/masked_ssl/in1k_VIT_L_MaskFeat_FT.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: imagenet\n  BATCH_SIZE: 256\n  EVAL_PERIOD: 1\n  CHECKPOINT_PERIOD: 5\n  CHECKPOINT_EPOCH_RESET: True\n  AUTO_RESUME: True\n  MIXED_PRECISION: False\nDATA:\n  MEAN: [0.485, 0.456, 0.406]\n  STD: [0.229, 0.224, 0.225]\n  NUM_FRAMES: 1\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nMVIT:\n  PATCH_2D: True\n  ZERO_DECAY_POS_CLS: False\n  MODE: \"conv\"\n  CLS_EMBED_ON: True\n  PATCH_KERNEL: [16, 16]\n  PATCH_STRIDE: [16, 16]\n  PATCH_PADDING: [0, 0]\n  EMBED_DIM: 1024\n  NUM_HEADS: 16\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  DROPPATH_RATE: 0.1\n  LAYER_SCALE_INIT_VALUE: 0.0\n  DEPTH: 24\n  NORM: \"layernorm\"\n  HEAD_INIT_SCALE: 0.001\n  USE_MEAN_POOLING: True\nAUG:\n  ENABLE: True\n  COLOR_JITTER: 0.4\n  AA_TYPE: rand-m9-mstd0.5-inc1    # !\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nSOLVER:\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LAYER_DECAY: 0.75\n  BASE_LR: 0.001\n  LR_POLICY: cosine\n  MAX_EPOCH: 50\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  WARMUP_EPOCHS: 5.0\n  WARMUP_START_LR: 1e-8\n  OPTIMIZING_METHOD: adamw\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  ZERO_WD_1D_PARAM: True\nMODEL:\n  NUM_CLASSES: 1000\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy\n  DROPOUT_RATE: 0.0\nTEST:\n  ENABLE: False\n  DATASET: imagenet\n  BATCH_SIZE: 256\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/masked_ssl/in1k_VIT_L_MaskFeat_PT.yaml",
    "content": "TRAIN:\n  ENABLE: True\n  DATASET: imagenet\n  BATCH_SIZE: 256\n  EVAL_PERIOD: 10000\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\n  MIXED_PRECISION: False\nDATA:\n  TRAIN_JITTER_SCALES_RELATIVE: [0.5, 1.0]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\n  MEAN: [0.485, 0.456, 0.406]\n  STD: [0.229, 0.224, 0.225]\n  NUM_FRAMES: 1\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nMVIT:\n  PATCH_2D: True\n  ZERO_DECAY_POS_CLS: False\n  MODE: \"conv\"\n  CLS_EMBED_ON: True\n  PATCH_KERNEL: [16, 16]\n  PATCH_STRIDE: [16, 16]\n  PATCH_PADDING: [0, 0]\n  EMBED_DIM: 1024\n  NUM_HEADS: 16\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  DROPPATH_RATE: 0.0\n  DEPTH: 24\n  NORM: \"layernorm\"\nMASK:\n  ENABLE: True\n  PRETRAIN_DEPTH: [23]\n  HEAD_TYPE: \"separate\"\n  PRED_HOG: True\nAUG:\n  ENABLE: True\n  RE_PROB: 0.0\n  COLOR_JITTER: None             # important, not using color aug\n  AA_TYPE: \"\"                    # important, not using color aug\n  GEN_MASK_LOADER: True\n  MASK_RATIO: 0.4\nMIXUP:\n  ENABLE: False\nSOLVER:\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 0.0002\n  LR_POLICY: cosine\n  MAX_EPOCH: 300\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  WARMUP_EPOCHS: 30.0\n  WARMUP_START_LR: 1e-6\n  OPTIMIZING_METHOD: adamw\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  ZERO_WD_1D_PARAM: True\n  CLIP_GRAD_L2NORM: 0.02\nMODEL:\n  ARCH: maskmvit\n  MODEL_NAME: MaskMViT\n  LOSS_FUNC: multi_mse\n  DROPOUT_RATE: 0.0\nTEST:\n  ENABLE: False\n  DATASET: imagenet\n  BATCH_SIZE: 256\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nLOG_MODEL_INFO: False\n"
  },
  {
    "path": "configs/masked_ssl/k400_MVITv2_L_16x4_FT.yaml",
    "content": "TASK: ssl_eval\nTRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 5\n  CHECKPOINT_PERIOD: 5\n  AUTO_RESUME: True\n  CHECKPOINT_EPOCH_RESET: True\nDATA:\n  USE_OFFSET_SAMPLING: True\n  DECODING_BACKEND: torchvision\n  NUM_FRAMES: 16\n  SAMPLING_RATE: 4\n  TRAIN_JITTER_SCALES: [256, 320]\n  DECODING_SHORT_SIZE: 320\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\n  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\nMVIT:\n  ZERO_DECAY_POS_CLS: False\n  SEP_POS_EMBED: True\n  PATCH_KERNEL: (3, 7, 7)\n  PATCH_STRIDE: (2, 4, 4)\n  PATCH_PADDING: (1, 3, 3)\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  NORM: \"layernorm\"\n  EMBED_DIM: 144\n  NUM_HEADS: 2\n  DEPTH: 48 # [2, 6, 36, 2]\n  DIM_MUL: [[2, 2.0], [8, 2.0], [44, 2.0]]\n  HEAD_MUL: [[2, 2.0], [8, 2.0], [44, 2.0]]\n  POOL_Q_STRIDE: [[0, 1, 1, 1], [1, 1, 1, 1], [2, 1, 2, 2], [3, 1, 1, 1], [4, 1, 1, 1], [5, 1, 1, 1], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 2, 2], [9, 1, 1, 1], [10, 1, 1, 1],\n  [11, 1, 1, 1], [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 1, 1], [15, 1, 1, 1], [16, 1, 1, 1], [17, 1, 1, 1], [18, 1, 1, 1], [19, 1, 1, 1], [20, 1, 1, 1],\n  [21, 1, 1, 1], [22, 1, 1, 1], [23, 1, 1, 1], [24, 1, 1, 1], [25, 1, 1, 1], [26, 1, 1, 1], [27, 1, 1, 1], [28, 1, 1, 1], [29, 1, 1, 1], [30, 1, 1, 1],\n  [31, 1, 1, 1], [32, 1, 1, 1], [33, 1, 1, 1], [34, 1, 1, 1], [35, 1, 1, 1], [36, 1, 1, 1], [37, 1, 1, 1], [38, 1, 1, 1], [39, 1, 1, 1], [40, 1, 1, 1],\n  [41, 1, 1, 1], [42, 1, 1, 1], [43, 1, 1, 1], [44, 1, 2, 2], [45, 1, 1, 1], [46, 1, 1, 1], [47, 1, 1, 1]]\n  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]\n  POOL_KVQ_KERNEL: [3, 3, 3]\n  # Please check if your loaded model has cls token.\n  # It would be better to have cls token. 
But some older models forgot to turn this on.\n  CLS_EMBED_ON: True # default: True\n  # uncomment for abs pos:\n  # SEP_POS_EMBED: True\n\n  # uncomment for rel pos:\n  USE_ABS_POS: False # default: True\n  REL_POS_SPATIAL: True # default: false\n  REL_POS_TEMPORAL: True # default: false\n\n  MODE: \"conv\"\n  RESIDUAL_POOLING: True\n  DROPPATH_RATE: 0.2\n  LAYER_SCALE_INIT_VALUE: 0.0\n  USE_MEAN_POOLING: True\n  HEAD_INIT_SCALE: 0.001\nAUG:\n  ENABLE: True\n  COLOR_JITTER: None\n  AA_TYPE: rand-m7-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nSOLVER:\n  MAX_EPOCH: 75\n  LAYER_DECAY: 0.875\n  WARMUP_EPOCHS: 5.0\n  CLIP_GRAD_L2NORM: 5.0\n  ZERO_WD_1D_PARAM: True\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 0.0006\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-5\n  WARMUP_START_LR: 1e-8\n  LR_POLICY: cosine\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  OPTIMIZING_METHOD: adamw\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy # default cross_entropy\n  DROPOUT_RATE: 0.0\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 1\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/masked_ssl/k400_MVITv2_L_16x4_MaskFeat_PT.yaml",
    "content": "TASK: ssl\nTRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 32\n  EVAL_PERIOD: 100000\n  CHECKPOINT_PERIOD: 5\n  AUTO_RESUME: True\nDATA:\n  USE_OFFSET_SAMPLING: True\n  DECODING_BACKEND: torchvision\n  NUM_FRAMES: 16\n  SAMPLING_RATE: 4\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\n  DECODING_SHORT_SIZE: 320\n  TRAIN_JITTER_SCALES_RELATIVE: [0.5, 1.0]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\nMVIT:\n  ZERO_DECAY_POS_CLS: False\n  SEP_POS_EMBED: True\n  PATCH_KERNEL: (3, 7, 7)\n  PATCH_STRIDE: (2, 4, 4)\n  PATCH_PADDING: (1, 3, 3)\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  NORM: \"layernorm\"\n  EMBED_DIM: 144\n  NUM_HEADS: 2\n  DEPTH: 48\n  DIM_MUL: [[2, 2.0], [8, 2.0], [44, 2.0]]\n  HEAD_MUL: [[2, 2.0], [8, 2.0], [44, 2.0]]\n  # Highlight: [44, 1, 1, 1] instead of [44, 1, 2, 2] for 14x14 output\n  POOL_Q_STRIDE: [[0, 1, 1, 1], [1, 1, 1, 1], [2, 1, 2, 2], [3, 1, 1, 1], [4, 1, 1, 1], [5, 1, 1, 1], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 2, 2], [9, 1, 1, 1], [10, 1, 1, 1],\n  [11, 1, 1, 1], [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 1, 1], [15, 1, 1, 1], [16, 1, 1, 1], [17, 1, 1, 1], [18, 1, 1, 1], [19, 1, 1, 1], [20, 1, 1, 1],\n  [21, 1, 1, 1], [22, 1, 1, 1], [23, 1, 1, 1], [24, 1, 1, 1], [25, 1, 1, 1], [26, 1, 1, 1], [27, 1, 1, 1], [28, 1, 1, 1], [29, 1, 1, 1], [30, 1, 1, 1],\n  [31, 1, 1, 1], [32, 1, 1, 1], [33, 1, 1, 1], [34, 1, 1, 1], [35, 1, 1, 1], [36, 1, 1, 1], [37, 1, 1, 1], [38, 1, 1, 1], [39, 1, 1, 1], [40, 1, 1, 1],\n  [41, 1, 1, 1], [42, 1, 1, 1], [43, 1, 1, 1], [44, 1, 1, 1], [45, 1, 1, 1], [46, 1, 1, 1], [47, 1, 1, 1]]\n  DROPPATH_RATE: 0.0\n  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]\n  POOL_KVQ_KERNEL: [3, 3, 3]\n  CLS_EMBED_ON: True # defauult: True\n  # uncomment comment for abs pos:\n  # SEP_POS_EMBED: True\n\n  # uncomment for rel pos:\n  USE_ABS_POS: False # default: True\n  REL_POS_SPATIAL: True # default: false\n  REL_POS_TEMPORAL: True # default: false\n\n  MODE: 
\"conv\"\n  RESIDUAL_POOLING: True\nMASK:\n  ENABLE: True\n  PRETRAIN_DEPTH: [47]\n  PRED_HOG: True # default: false\nAUG:\n  ENABLE: True\n  COLOR_JITTER: None\n  AA_TYPE: \"\"\n  RE_PROB: 0.0\n  GEN_MASK_LOADER: True\n  MASK_RATIO: 0.4\n\n  # Mask Cube (Default)\n  MASK_TUBE: False\n  MASK_FRAMES: False\n  MASK_WINDOW_SIZE: [8, 7, 7]\n\nMIXUP:\n  ENABLE: False\nSOLVER:\n  CLIP_GRAD_L2NORM: 0.02\n  BASE_LR: 0.0001\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  MAX_EPOCH: 300\n  WARMUP_EPOCHS: 10.0\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  ZERO_WD_1D_PARAM: True\n  WARMUP_START_LR: 1e-6\n  OPTIMIZING_METHOD: adamw\nMODEL:\n  ARCH: maskmvit\n  MODEL_NAME: MaskMViT\n  LOSS_FUNC: multi_mse\n  DROPOUT_RATE: 0.0\nTEST:\n  ENABLE: False\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 1\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nLOG_MODEL_INFO: False\n"
  },
  {
    "path": "configs/masked_ssl/k400_MVITv2_S_16x4_FT.yaml",
    "content": "TASK: ssl_eval\nTRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 10\n  CHECKPOINT_EPOCH_RESET: True\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\n  MIXED_PRECISION: False\nDATA:\n  USE_OFFSET_SAMPLING: True\n  DECODING_BACKEND: torchvision\n  NUM_FRAMES: 16\n  SAMPLING_RATE: 4\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\n  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\n  DECODING_SHORT_SIZE: 320\nMVIT:\n  DEPTH: 16 # def 12\n  NUM_HEADS: 3 # default 12 16\n  EMBED_DIM: 384 # def 192 768 384 1024\n  NUM_HEADS: 1 # default 12 16\n  EMBED_DIM: 96 # def 256 192 768 384 1024\n  PATCH_KERNEL: [3, 7, 7]\n  PATCH_STRIDE: [2, 4, 4]  # default:empty\n  PATCH_PADDING: [1, 3, 3]\n  ZERO_DECAY_POS_CLS: False\n  QKV_BIAS: True\n  DIM_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]\n  HEAD_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]\n  POOL_Q_STRIDE: [[0, 1, 1, 1], [1, 1, 2, 2], [2, 1, 1, 1], [3, 1, 2, 2], [4, 1, 1, 1], [5, 1, 1, 1], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 1, 1], [9, 1, 1, 1], [10, 1, 1, 1], [11, 1, 1, 1], [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 2, 2], [15, 1, 1, 1]]\n  POOL_KVQ_KERNEL: [3, 3, 3]\n  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]\n  CLS_EMBED_ON: True # default: True\n  USE_ABS_POS: False # default: True\n  SEP_POS_EMBED: True # default: false\n  REL_POS_SPATIAL: True # default: false\n  REL_POS_TEMPORAL: True # default: false\n  RESIDUAL_POOLING: True\n  MODE: \"conv\"\n  DROPPATH_RATE: 0.1\n  LAYER_SCALE_INIT_VALUE: 0.0\n  USE_MEAN_POOLING: True\n  HEAD_INIT_SCALE: 0.001\nAUG:\n  ENABLE: True\n  COLOR_JITTER: 0.4\n  AA_TYPE: rand-m7-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  NUM_SAMPLE: 2\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nSOLVER:\n  ZERO_WD_1D_PARAM: True\n  CLIP_GRAD_L2NORM: 
5.0\n  LAYER_DECAY: 0.75                # !\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 0.0006                  # !\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  WARMUP_START_LR: 1e-6\n  WARMUP_EPOCHS: 20.0\n  LR_POLICY: cosine\n  MAX_EPOCH: 100\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  OPTIMIZING_METHOD: adamw\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy\n  DROPOUT_RATE: 0.0\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 1\n  NUM_TEMPORAL_CLIPS: [1, 3, 4, 5, 7, 10]\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/masked_ssl/k400_MVITv2_S_16x4_MaskFeat_PT.yaml",
    "content": "TASK: ssl\nTRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 32\n  EVAL_PERIOD: 1000000\n  CHECKPOINT_PERIOD: 10\n  AUTO_RESUME: True\nDATA:\n  USE_OFFSET_SAMPLING: True\n  DECODING_BACKEND: torchvision\n  NUM_FRAMES: 16\n  SAMPLING_RATE: 4\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\n  DECODING_SHORT_SIZE: 320\n  TRAIN_JITTER_SCALES_RELATIVE: [0.5, 1.0]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\nMVIT:\n  DEPTH: 16\n  NUM_HEADS: 1\n  EMBED_DIM: 96\n  PATCH_KERNEL: [3, 7, 7]\n  PATCH_STRIDE: [2, 4, 4]\n  PATCH_PADDING: [1, 3, 3]\n  ZERO_DECAY_POS_CLS: False\n  QKV_BIAS: True\n  DIM_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]\n  HEAD_MUL: [[1, 2.0], [3, 2.0], [14, 2.0]]\n  # Highlight: [14, 1, 1, 1] instead of [14, 1, 2, 2] for 14x14 output\n  POOL_Q_STRIDE: [[0, 1, 1, 1], [1, 1, 2, 2], [2, 1, 1, 1], [3, 1, 2, 2], [4, 1, 1, 1], [5, 1, 1, 1], [6, 1, 1, 1], [7, 1, 1, 1], [8, 1, 1, 1], [9, 1, 1, 1], [10, 1, 1, 1], [11, 1, 1, 1], [12, 1, 1, 1], [13, 1, 1, 1], [14, 1, 1, 1], [15, 1, 1, 1]]\n  POOL_KVQ_KERNEL: [3, 3, 3]\n  POOL_KV_STRIDE_ADAPTIVE: [1, 8, 8]\n  DROPPATH_RATE: 0.0\n  CLS_EMBED_ON: True # default: True\n  USE_ABS_POS: False # default: True\n  SEP_POS_EMBED: True # default: false\n  REL_POS_SPATIAL: True # default: false\n  REL_POS_TEMPORAL: True # default: false\n  RESIDUAL_POOLING: True\n  MODE: \"conv\"\nMASK:\n  ENABLE: True\n  PRETRAIN_DEPTH: [15]\n  HEAD_TYPE: \"separate\"\n  PRED_HOG: True\nAUG:\n  ENABLE: True\n  COLOR_JITTER: None\n  AA_TYPE: \"\"\n  RE_PROB: 0.0\n  GEN_MASK_LOADER: True\n  MASK_RATIO: 0.4\n\n  # Mask Cube (Default)\n  MASK_TUBE: False\n  MASK_FRAMES: False\n  MASK_WINDOW_SIZE: [8, 7, 7]\nMIXUP:\n  ENABLE: False\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  CLIP_GRAD_L2NORM: 0.02\n  BASE_LR: 0.0001\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  MAX_EPOCH: 
300\n  WARMUP_EPOCHS: 10.0\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  ZERO_WD_1D_PARAM: True\n  WARMUP_START_LR: 1e-6\n  OPTIMIZING_METHOD: adamw\nMODEL:\n  ARCH: maskmvit\n  MODEL_NAME: MaskMViT\n  LOSS_FUNC: multi_mse\n  DROPOUT_RATE: 0.0\nTEST:\n  ENABLE: False\n  DATASET: kinetics\n  BATCH_SIZE: 16\n  NUM_SPATIAL_CROPS: 1\n  NUM_TEMPORAL_CLIPS: [5, 10]\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nLOG_MODEL_INFO: False\n"
  },
  {
    "path": "configs/masked_ssl/k400_VIT_B_16x4_FT.yaml",
    "content": "TASK: ssl_eval\nTRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 1\n  EVAL_PERIOD: 1\n  CHECKPOINT_PERIOD: 5\n  AUTO_RESUME: True\n  CHECKPOINT_EPOCH_RESET: True\nDATA:\n  USE_OFFSET_SAMPLING: True\n  DECODING_BACKEND: torchvision\n  NUM_FRAMES: 16\n  SAMPLING_RATE: 4\n  TRAIN_JITTER_SCALES: [256, 320]\n  DECODING_SHORT_SIZE: 320\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\n  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\nMVIT:\n  ZERO_DECAY_POS_CLS: False\n  SEP_POS_EMBED: True\n  PATCH_KERNEL: (2, 16, 16)\n  PATCH_STRIDE: (2, 16, 16)\n  PATCH_PADDING: (0, 0, 0)\n  EMBED_DIM: 768 # vit-B, change depth & layerdecay\n  NUM_HEADS: 12\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  NORM: \"layernorm\"\n  DEPTH: 12\n  MODE: \"conv\"\n  DROPPATH_RATE: 0.1\n  LAYER_SCALE_INIT_VALUE: 0.0\n  USE_MEAN_POOLING: True\n  HEAD_INIT_SCALE: 0.001\nAUG:\n  NUM_SAMPLE: 2\n  ENABLE: True\n  COLOR_JITTER: None\n  AA_TYPE: rand-m7-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nSOLVER:\n  MAX_EPOCH: 75\n  LAYER_DECAY: 0.75\n  LAYER_DECAY: 0.65 # vit-B\n  WARMUP_EPOCHS: 5.0\n  CLIP_GRAD_L2NORM: 5.0\n  ZERO_WD_1D_PARAM: True\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 6e-4\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-5\n  WARMUP_START_LR: 1e-8\n  LR_POLICY: cosine\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  OPTIMIZING_METHOD: adamw\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy # default cross_entropy\n  DROPOUT_RATE: 0.3\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 3\n  NUM_TEMPORAL_CLIPS: [5, 7, 10]\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/masked_ssl/k400_VIT_B_16x4_MAE_PT.yaml",
    "content": "TASK: ssl\nTRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 100000\n  CHECKPOINT_PERIOD: 5\n  AUTO_RESUME: True\n  KILL_LOSS_EXPLOSION_FACTOR: 2.0 # default 0.0 (not enforced)\nDATA:\n  USE_OFFSET_SAMPLING: True\n  DECODING_BACKEND: torchvision\n  NUM_FRAMES: 16\n  SAMPLING_RATE: 4\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\n  DECODING_SHORT_SIZE: 320\n  TRAIN_JITTER_SCALES_RELATIVE: [0.5, 1.0]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\n  TRAIN_CROP_NUM_TEMPORAL: 4  # default 1\nMVIT:\n  ZERO_DECAY_POS_CLS: False\n  SEP_POS_EMBED: True # default: False needs to be false for fixed sincos\n  PATCH_KERNEL: (2, 16, 16)\n  PATCH_STRIDE: (2, 16, 16)\n  PATCH_PADDING: (0, 0, 0)\n  EMBED_DIM: 768 # vit-B\n  NUM_HEADS: 12\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  NORM: \"layernorm\"\n  DEPTH: 12\n  DROPPATH_RATE: 0.0\n  MODE: \"conv\"\n  CLS_EMBED_ON: True\nMASK:\n  ENABLE: True\n  MAE_ON: True #  default False\n  MAE_RND_MASK: True # default: false\n  PRETRAIN_DEPTH: [11]\n  HEAD_TYPE: \"separate_xformer\"\n  DECODER_DEPTH: 4\n  DECODER_EMBED_DIM: 512\nAUG:\n  ENABLE: True\n  COLOR_JITTER: None\n  AA_TYPE: \"\"\n  RE_PROB: 0.0\n  MASK_RATIO: 0.9\nMIXUP:\n  ENABLE: False\nSOLVER:\n  CLIP_GRAD_L2NORM: 0.02\n  BASE_LR: 1e-4\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  MAX_EPOCH: 200\n  WARMUP_EPOCHS: 60.0\n  BETAS: (0.9, 0.95) # default (0.9, 0.999)\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  ZERO_WD_1D_PARAM: True # default False\n  WARMUP_START_LR: 1e-6\n  OPTIMIZING_METHOD: adamw\nMODEL:\n  ARCH: maskmvit\n  MODEL_NAME: MaskMViT\n  LOSS_FUNC: multi_mse\n  DROPOUT_RATE: 0.0\nTEST:\n  ENABLE: False\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 1\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nLOG_MODEL_INFO: False\n"
  },
  {
    "path": "configs/masked_ssl/k400_VIT_H_16x4_FT.yaml",
    "content": "TASK: ssl_eval\nTRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 5\n  CHECKPOINT_PERIOD: 5\n  AUTO_RESUME: True\n  CHECKPOINT_EPOCH_RESET: True\nDATA:\n  USE_OFFSET_SAMPLING: True\n  DECODING_BACKEND: torchvision\n  NUM_FRAMES: 16\n  SAMPLING_RATE: 4\n  TRAIN_JITTER_SCALES: [256, 320]\n  DECODING_SHORT_SIZE: 320\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\n  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\nMVIT:\n  ZERO_DECAY_POS_CLS: False\n  SEP_POS_EMBED: True\n  PATCH_KERNEL: (2, 16, 16)\n  PATCH_STRIDE: (2, 16, 16)\n  PATCH_PADDING: (0, 0, 0)\n  EMBED_DIM: 1280\n  NUM_HEADS: 16\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  NORM: \"layernorm\"\n  DEPTH: 32\n  MODE: \"conv\"\n  DROPPATH_RATE: 0.2\n  LAYER_SCALE_INIT_VALUE: 0.0\n  USE_MEAN_POOLING: True\n  HEAD_INIT_SCALE: 0.001\nAUG:\n  ENABLE: True\n  COLOR_JITTER: None\n  AA_TYPE: rand-m7-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nSOLVER:\n  MAX_EPOCH: 75\n  LAYER_DECAY: 0.8\n  WARMUP_EPOCHS: 5.0\n  CLIP_GRAD_L2NORM: 5.0\n  ZERO_WD_1D_PARAM: True\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 3e-4 #  works best\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  WARMUP_START_LR: 1e-8\n  LR_POLICY: cosine\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  OPTIMIZING_METHOD: adamw\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy # default cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 3\n  NUM_TEMPORAL_CLIPS: [5, 7, 10]\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/masked_ssl/k400_VIT_H_16x4_MAE_PT.yaml",
    "content": "TASK: ssl\nTRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 32\n  EVAL_PERIOD: 100000\n  CHECKPOINT_PERIOD: 5\n  AUTO_RESUME: True\n  KILL_LOSS_EXPLOSION_FACTOR: 2.0 # default 0.0 (not enforced)\nDATA:\n  USE_OFFSET_SAMPLING: True\n  DECODING_BACKEND: torchvision\n  NUM_FRAMES: 16\n  SAMPLING_RATE: 4\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\n  DECODING_SHORT_SIZE: 320\n  TRAIN_JITTER_SCALES_RELATIVE: [0.5, 1.0]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\n  TRAIN_CROP_NUM_TEMPORAL: 4  # default 1\nMVIT:\n  ZERO_DECAY_POS_CLS: False\n  SEP_POS_EMBED: True\n  PATCH_KERNEL: (2, 16, 16)\n  PATCH_STRIDE: (2, 16, 16)\n  PATCH_PADDING: (0, 0, 0)\n  EMBED_DIM: 1280\n  NUM_HEADS: 16\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  NORM: \"layernorm\"\n  DEPTH: 32\n  DROPPATH_RATE: 0.0\n  MODE: \"conv\"\n  CLS_EMBED_ON: True # default: True\nMASK:\n  ENABLE: True\n  MAE_ON: True #  default False\n  MAE_RND_MASK: True # default: false\n  PRETRAIN_DEPTH: [31]\n  HEAD_TYPE: \"separate_xformer\"\n  DECODER_DEPTH: 4               # default: 0\n  DECODER_EMBED_DIM: 512\nAUG:\n  NUM_SAMPLE: 1\n  ENABLE: True\n  COLOR_JITTER: None\n  AA_TYPE: \"\"\n  RE_PROB: 0.0\n  MASK_RATIO: 0.9\n\nMIXUP:\n  ENABLE: False\nSOLVER:\n  CLIP_GRAD_L2NORM: 0.02\n  BASE_LR: 1e-4 # for bs32\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  MAX_EPOCH: 400\n  WARMUP_EPOCHS: 60.0\n  BETAS: (0.9, 0.95) # default (0.9, 0.999)\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  ZERO_WD_1D_PARAM: True # default False\n  WARMUP_START_LR: 1e-6\n  OPTIMIZING_METHOD: adamw\nMODEL:\n  ARCH: maskmvit\n  MODEL_NAME: MaskMViT\n  LOSS_FUNC: multi_mse\n  DROPOUT_RATE: 0.0\nTEST:\n  ENABLE: False\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 1\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nLOG_MODEL_INFO: 
False\n"
  },
  {
    "path": "configs/masked_ssl/k400_VIT_L_16x4_FT.yaml",
    "content": "TASK: ssl_eval\nTRAIN:\n  DATASET: kinetics\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 5\n  CHECKPOINT_PERIOD: 5\n  AUTO_RESUME: True\n  CHECKPOINT_EPOCH_RESET: True\nDATA:\n  USE_OFFSET_SAMPLING: True\n  DECODING_BACKEND: torchvision\n  NUM_FRAMES: 16\n  SAMPLING_RATE: 4\n  TRAIN_JITTER_SCALES: [256, 320]\n  DECODING_SHORT_SIZE: 320\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\n  TRAIN_JITTER_SCALES_RELATIVE: [0.08, 1.0]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\nMVIT:\n  ZERO_DECAY_POS_CLS: False\n  SEP_POS_EMBED: True\n  PATCH_KERNEL: (2, 16, 16)\n  PATCH_STRIDE: (2, 16, 16)\n  PATCH_PADDING: (0, 0, 0)\n  EMBED_DIM: 1024\n  NUM_HEADS: 16\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  NORM: \"layernorm\"\n  DEPTH: 24\n  MODE: \"conv\"\n  DROPPATH_RATE: 0.2\n  LAYER_SCALE_INIT_VALUE: 0.0\n  USE_MEAN_POOLING: True\n  HEAD_INIT_SCALE: 0.001\nAUG:\n  NUM_SAMPLE: 2\n  ENABLE: True\n  COLOR_JITTER: None\n  AA_TYPE: rand-m7-mstd0.5-inc1\n  INTERPOLATION: bicubic\n  RE_PROB: 0.25\n  RE_MODE: pixel\n  RE_COUNT: 1\n  RE_SPLIT: False\nMIXUP:\n  ENABLE: True\n  ALPHA: 0.8\n  CUTMIX_ALPHA: 1.0\n  PROB: 1.0\n  SWITCH_PROB: 0.5\n  LABEL_SMOOTH_VALUE: 0.1\nSOLVER:\n  MAX_EPOCH: 75\n  MAX_EPOCH: 50\n  LAYER_DECAY: 0.75\n  WARMUP_EPOCHS: 5.0\n  CLIP_GRAD_L2NORM: 5.0\n  ZERO_WD_1D_PARAM: True\n  BASE_LR_SCALE_NUM_SHARDS: True\n  BASE_LR: 3e-4 #  works best\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-5\n  WARMUP_START_LR: 1e-8\n  LR_POLICY: cosine\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  OPTIMIZING_METHOD: adamw\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: mvit\n  MODEL_NAME: MViT\n  LOSS_FUNC: soft_cross_entropy # default cross_entropy\n  DROPOUT_RATE: 0.3\nTEST:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 3\n  NUM_TEMPORAL_CLIPS: [5, 7, 10]\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "configs/masked_ssl/k400_VIT_L_16x4_MAE_PT.yaml",
    "content": "TASK: ssl\nTRAIN:\n  ENABLE: True\n  DATASET: kinetics\n  BATCH_SIZE: 32 # 32 fits 32g\n  EVAL_PERIOD: 100000\n  CHECKPOINT_PERIOD: 5\n  AUTO_RESUME: True\n  KILL_LOSS_EXPLOSION_FACTOR: 2.0 # default 0.0 (not enforced)\nDATA:\n  USE_OFFSET_SAMPLING: True\n  DECODING_BACKEND: torchvision\n  NUM_FRAMES: 16\n  SAMPLING_RATE: 4\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 224\n  INPUT_CHANNEL_NUM: [3]\n  DECODING_SHORT_SIZE: 320\n  TRAIN_JITTER_SCALES_RELATIVE: [0.5, 1.0]\n  TRAIN_JITTER_ASPECT_RELATIVE: [0.75, 1.3333]\n  TRAIN_CROP_NUM_TEMPORAL: 4  # default 1\nMVIT:\n  ZERO_DECAY_POS_CLS: False\n  SEP_POS_EMBED: True # default: False needs to be false for fixed sincos\n  PATCH_KERNEL: (2, 16, 16)\n  PATCH_STRIDE: (2, 16, 16)\n  PATCH_PADDING: (0, 0, 0)\n  EMBED_DIM: 1024\n  NUM_HEADS: 16\n  MLP_RATIO: 4.0\n  QKV_BIAS: True\n  NORM: \"layernorm\"\n  DEPTH: 24\n  DROPPATH_RATE: 0.0\n  MODE: \"conv\"\n  CLS_EMBED_ON: True # default: True\nMASK:\n  ENABLE: True\n  MAE_ON: True #  default False\n  MAE_RND_MASK: True # default: false\n  PRETRAIN_DEPTH: [23]\n  HEAD_TYPE: \"separate_xformer\"\n  DECODER_DEPTH: 4               # default: 0\n  DECODER_EMBED_DIM: 512\nAUG:\n  # NUM_SAMPLE: 4\n  ENABLE: True\n  COLOR_JITTER: None\n  AA_TYPE: \"\"\n  RE_PROB: 0.0\n  MASK_RATIO: 0.9\nMIXUP:\n  ENABLE: False\nSOLVER:\n  CLIP_GRAD_L2NORM: 0.02\n  BASE_LR: 1e-4 # for bs64\n  BASE_LR_SCALE_NUM_SHARDS: True\n  LR_POLICY: cosine\n  COSINE_AFTER_WARMUP: True\n  COSINE_END_LR: 1e-6\n  MAX_EPOCH: 200\n  WARMUP_EPOCHS: 30.0\n  BETAS: (0.9, 0.95) # default (0.9, 0.999)\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 0.05\n  ZERO_WD_1D_PARAM: True # default False\n  WARMUP_START_LR: 1e-6\n  OPTIMIZING_METHOD: adamw\nMODEL:\n  ARCH: maskmvit\n  MODEL_NAME: MaskMViT\n  LOSS_FUNC: multi_mse\n  DROPOUT_RATE: 0.0\nTEST:\n  ENABLE: False\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  NUM_SPATIAL_CROPS: 1\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nNUM_GPUS: 
8\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nLOG_MODEL_INFO: False\n"
  },
  {
    "path": "demo/AVA/SLOWFAST_32x2_R101_50_50.yaml",
    "content": "TRAIN:\n  ENABLE: False\n  DATASET: ava\n  BATCH_SIZE: 16\n  EVAL_PERIOD: 1\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  CHECKPOINT_FILE_PATH: ./SLOWFAST_32x2_R101_50_50.pkl  #path to pretrain model\n  CHECKPOINT_TYPE: pytorch\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\nDETECTION:\n  ENABLE: True\n  ALIGNED: False\nAVA:\n  BGR: False\n  DETECTION_SCORE_THRESH: 0.8\n  TEST_PREDICT_BOX_LISTS: [\"person_box_67091280_iou90/ava_detection_val_boxes_and_labels.csv\"]\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 5\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 101\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [2, 2]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[6, 13, 20], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\n  POOL: [[[2, 2, 2], [2, 2, 2]], [[2, 2, 2], [2, 2, 2]], [[2, 2, 2], [2, 2, 2]], [[2, 2, 2], [2, 2, 2]]]\nBN:\n  USE_PRECISE_STATS: False\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-7\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 80\n  ARCH: slowfast\n  MODEL_NAME: SlowFast\n  LOSS_FUNC: bce\n  DROPOUT_RATE: 0.5\n  HEAD_ACT: sigmoid\nTEST:\n  ENABLE: False\n  DATASET: ava\n  BATCH_SIZE: 8\nDATA_LOADER:\n  NUM_WORKERS: 2\n  PIN_MEMORY: True\n\nNUM_GPUS: 1\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\nTENSORBOARD:\n  MODEL_VIS:\n    TOPK: 2\nDEMO:\n  ENABLE: True\n  LABEL_FILE_PATH:  # Add local label file path here.\n  WEBCAM: 0\n  DETECTRON2_CFG: \"COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml\"\n  DETECTRON2_WEIGHTS: detectron2://COCO-Detection/faster_rcnn_R_50_FPN_3x/137849458/model_final_280758.pkl\n"
  },
  {
    "path": "demo/Kinetics/SLOWFAST_8x8_R50.yaml",
    "content": "TRAIN:\n  ENABLE: False\n  DATASET: kinetics\n  BATCH_SIZE: 64\n  EVAL_PERIOD: 10\n  CHECKPOINT_PERIOD: 1\n  AUTO_RESUME: True\n  CHECKPOINT_FILE_PATH: \"./SLOWFAST_8x8_R50.pkl\" # path to pretrain model to run demo\n  CHECKPOINT_TYPE: caffe2\nDATA:\n  NUM_FRAMES: 32\n  SAMPLING_RATE: 2\n  TRAIN_JITTER_SCALES: [256, 320]\n  TRAIN_CROP_SIZE: 224\n  TEST_CROP_SIZE: 256\n  INPUT_CHANNEL_NUM: [3, 3]\nSLOWFAST:\n  ALPHA: 4\n  BETA_INV: 8\n  FUSION_CONV_CHANNEL_RATIO: 2\n  FUSION_KERNEL_SZ: 7\nRESNET:\n  ZERO_INIT_FINAL_BN: True\n  WIDTH_PER_GROUP: 64\n  NUM_GROUPS: 1\n  DEPTH: 50\n  TRANS_FUNC: bottleneck_transform\n  STRIDE_1X1: False\n  NUM_BLOCK_TEMP_KERNEL: [[3, 3], [4, 4], [6, 6], [3, 3]]\n  SPATIAL_STRIDES: [[1, 1], [2, 2], [2, 2], [2, 2]]\n  SPATIAL_DILATIONS: [[1, 1], [1, 1], [1, 1], [1, 1]]\nNONLOCAL:\n  LOCATION: [[[], []], [[], []], [[], []], [[], []]]\n  GROUP: [[1, 1], [1, 1], [1, 1], [1, 1]]\n  INSTANTIATION: dot_product\nBN:\n  USE_PRECISE_STATS: True\n  NUM_BATCHES_PRECISE: 200\nSOLVER:\n  BASE_LR: 0.1\n  LR_POLICY: cosine\n  MAX_EPOCH: 196\n  MOMENTUM: 0.9\n  WEIGHT_DECAY: 1e-4\n  WARMUP_EPOCHS: 34\n  WARMUP_START_LR: 0.01\n  OPTIMIZING_METHOD: sgd\nMODEL:\n  NUM_CLASSES: 400\n  ARCH: slowfast\n  LOSS_FUNC: cross_entropy\n  DROPOUT_RATE: 0.5\nTEST:\n  ENABLE: False\n  DATASET: kinetics\n  BATCH_SIZE: 64\nDATA_LOADER:\n  NUM_WORKERS: 8\n  PIN_MEMORY: True\nDEMO:\n  ENABLE: True\n  LABEL_FILE_PATH:  # Add local label file path here.\n  WEBCAM: 0\nNUM_GPUS: 1\nNUM_SHARDS: 1\nRNG_SEED: 0\nOUTPUT_DIR: .\n"
  },
  {
    "path": "linter.sh",
    "content": "#!/bin/bash -e\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n# Run this script at project root by \".linter.sh\" before you commit.\necho \"Running isort...\"\nisort -y -sp .\n\necho \"Running black...\"\nblack -l 80 .\n\necho \"Running flake...\"\nflake8 .\n\ncommand -v arc > /dev/null && {\n  echo \"Running arc lint ...\"\n  arc lint\n}\n"
  },
  {
    "path": "projects/contrastive_ssl/README.md",
    "content": "# A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning\n[Christoph Feichtenhofer](http://feichtenhofer.github.io/), [Haoqi Fan](https://haoqifan.github.io/), [Bo Xiong](https://www.cs.utexas.edu/~bxiong/), [Ross Girshick](http://www.cs.berkeley.edu/~rbg/), [Kaiming He](http://kaiminghe.com/)\n<br/>\nIn CVPR, 2021. [[Paper](https://arxiv.org/abs/2104.14558)]\n<br/>\n<div align=\"center\">\n  <img src=\"http://feichtenhofer.github.io/pubs/videomoco_concept2.png\" width=\"500px\">\n</div>\n<br/>\n\n\n## Kinetics 400 and 600\n\n| method | &rho;| architecture | size |  frames x sampling |  pretrain data | K400-linear |  UCF101-split1  | AVA  | SSv2  | model | config |\n| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |\n| MoCo | 2 | Slow-only | R50 | 8 x 8 | Kinetics-400 | 66.6 | 91.3  | 19.7 | 52.7 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/videomoco_models/MoCo_SlowR50_8x8_T2_epoch_00200.pyth) | contrastive_ssl/MoCo_SlowR50_8x8 |\n| BYOL | 2 | Slow-only | R50 | 8 x 8 | Kinetics-400 | 67.4 | 94.0  | 22.8 | 54.4 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/videomoco_models/BYOL_SlowR50_8x8_T2_epoch_00200.pyth) | contrastive_ssl/BYOL_SlowR50_8x8 |\n| SimCLR | 2 | Slow-only | R50 | 8 x 8 | Kinetics-400 | 61.5 | 88.3  | 17.5 | 51.4 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/videomoco_models/SimCLR_SlowR50_8x8_T2_epoch_00200.pyth) | contrastive_ssl/SimCLR_SlowR50_8x8 |\n| SwAV | 2 | Slow-only | R50 | 8 x 8 | Kinetics-400 |  62.6 | 90.2  | 19.2 | 52.5| [`link`](https://dl.fbaipublicfiles.com/pyslowfast/videomoco_models/SwAV_SlowR50_8x8_T2_epoch_00200.pyth) | contrastive_ssl/SwAV_SlowR50_8x8 |\n| MoCo | 4 | Slow-only | R50 | 8 x 8 | Kinetics-400 | 71.0 | 94.5  | 21.9 | 54.0 |  
[`link`](https://dl.fbaipublicfiles.com/pyslowfast/videomoco_models/MoCo_SlowR50_8x8_T4_epoch_00200.pyth) | contrastive_ssl/MoCo_SlowR50_8x8 |\n| BYOL | 4 | Slow-only | R50 | 8 x 8 | Kinetics-400 | 70.1 | 94.7 | xx | xx | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/videomoco_models/BYOL_SlowR50_8x8_T4_epoch_00200.pyth) | contrastive_ssl/BYOL_SlowR50_8x8 |\n| BYOL | 4 | Slow-only | R50 | 16 x 4 | Kinetics-400 | 71.1 | 95.4  | xx | xx | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/videomoco_models/BYOL_SlowR50_16x4_T4_epoch_00200.pyth) | contrastive_ssl/BYOL_SlowR50_8x8 |\n\n\n## Getting started\nTo use self-supervised learning techniques, please refer to the configs under `configs/contrastive_ssl`, or see [MODEL_ZOO.md](https://github.com/facebookresearch/SlowFast/blob/master/MODEL_ZOO.md) for pre-trained models. See the [paper](https://arxiv.org/abs/2104.14558) for details. For example, the command\n\n```\npython tools/run_net.py \\\n  --cfg configs/contrastive_ssl/MoCo_SlowR50_8x8.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_dataset\n```\n\nshould train a MoCo R50 Slow-only model with 8x8 sampling on your dataset.\n\n\n## Reference\nIf you find this useful for your research, please consider citing the paper using the following BibTeX entry.\n```BibTeX\n@inproceedings{videossl2021,\n  Author    = {Christoph Feichtenhofer and Haoqi Fan and Bo Xiong and Ross Girshick and Kaiming He},\n  Title     = {A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning},\n  Booktitle = {CVPR},\n  Year      = {2021}}\n```\n"
  },
  {
    "path": "projects/mae/README.md",
    "content": "# Masked Autoencoders As Spatiotemporal Learners\n[Christoph Feichtenhofer*](http://feichtenhofer.github.io/), [Haoqi Fan*](https://haoqifan.github.io/), [Yanghao Li](https://lyttonhao.github.io/), [Kaiming He](http://kaiminghe.com/)\n<br/>\nTechnical report, arXiv, May 2022. [[Paper](https://arxiv.org/abs/2205.09113)]\n<br/>\n<div align=\"center\">\n  <img src=\"http://feichtenhofer.github.io/pubs/mae_concept.jpg\" width=\"500px\">\n</div>\n<br/>\n\n## Results & Models\n\n### **Kinetics-400**; configs are under configs/masked_ssl/\n\n\n| name | frame length x sample rate | top1 |  Flops (G) x views | #params (M) |   config pre-train (PT) | config fine-tune | model PT |\n| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |\n| ViT-B | 16 x 4 | 81.3 | 180 x 3 x 7 | 87 |  k400_VIT_B_16x4_MAE_PT | k400_VIT_B_16x4_FT | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/masked_models/VIT_B_16x4_MAE_PT.pyth) |\n| ViT-L | 16 x 4 | 84.8 | 598 x 3 x 7 | 304 |  k400_VIT_L_16x4_MAE_PT | k400_VIT_L_16x4_FT | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/masked_models/VIT_L_16x4_MAE_PT.pyth) |\n| ViT-H | 16 x 4 | 85.1 | 1193 x 3 x 7 | 632 |  k400_VIT_H_16x4_MAE_PT | k400_VIT_H_16x4_FT | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/masked_models/VIT_H_16x4_MAE_PT_400e_85.1.pyth) |\n\n\n## Getting started\nTo use self-supervised learning techniques please refer to the configs under `configs/masked_ssl`. 
For example, the command\n\n```\npython tools/run_net.py \\\n  --cfg configs/masked_ssl/k400_VIT_L_16x4_MAE_PT.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_Kinetics_dataset\n```\n\nshould train an MAE ViT-L model on the Kinetics-400 dataset, and the command\n\n```\npython tools/run_net.py \\\n  --cfg configs/masked_ssl/k400_VIT_L_16x4_FT.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_Kinetics_dataset \\\n  TRAIN.CHECKPOINT_FILE_PATH path_to_your_pretrain_checkpoint\n```\n\nwill fine-tune the resulting model, after passing the checkpoint path to the config.\n\n\n## Reference\nIf you find this useful for your research, please consider citing the paper using the following BibTeX entry.\n```BibTeX\n@article{feichtenhofer2022masked,\n  title={Masked Autoencoders As Spatiotemporal Learners},\n  author={Feichtenhofer, Christoph and Fan, Haoqi and Li, Yanghao and He, Kaiming},\n  journal={arXiv preprint arXiv:2205.09113},\n  year={2022}\n}\n```\n"
  },
  {
    "path": "projects/maskfeat/README.md",
    "content": "# Masked Feature Prediction for Self-Supervised Visual Pre-Training\n[Chen Wei*](https://weichen582.github.io/), [Haoqi Fan](https://haoqifan.github.io/), [Saining Xie](https://www.sainingxie.com/), [Chao-Yuan Wu](https://chaoyuan.org/), [Alan Yuille](https://www.cs.jhu.edu/~ayuille/), [Christoph Feichtenhofer*](http://feichtenhofer.github.io/)\n<br/>\nIn CVPR, 2022. [[Paper](https://arxiv.org/abs/2112.09133)]\n<br/>\n<div align=\"center\">\n  <img src=\"http://feichtenhofer.github.io/pubs/maskfeat_concept.png\" width=\"500px\">\n</div>\n<br/>\n\n## Results & Models\n\n### **ImageNet-1K**; configs are under configs/masked_ssl/\n\n\n| name | top1 |  config pre-train (PT) | config fine-tune | model PT |\n| ------------- | ------------- | ------------- | ------------- | ------------- |\n| ViT-B | 84.0 |  in1k_VIT_B_MaskFeat_PT | in1k_VIT_B_MaskFeat_FT | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/masked_models/in1k_VIT_B_MaskFeat_PT_epoch_01600.pyth) |\n| ViT-L | 85.7 |  in1k_VIT_L_MaskFeat_PT | in1k_VIT_L_MaskFeat_FT | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/masked_models/in1k_VIT_L_MaskFeat_PT_epoch_01600.pyth) |\n\n\n\n### **Kinetics-400**; configs are under configs/masked_ssl/\n\n\n| name | frame length x sample rate | top1 |  Flops (G) x views | #params (M) |   config pre-train (PT) | config fine-tune | model PT |\n| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |\n| MViT-S | 16 x 4 | 82.2 | 71 x 1 x 10 | 36 |  k400_MVITv2_S_16x4_MaskFeat_PT | k400_MVITv2_S_16x4_FT | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/masked_models/k400_MVIT_S_MaskFeat_PT_epoch_00300.pyth) |\n| MViT-L | 16 x 4 | 84.3 | 377 x 1 x 10 | 218 |  k400_MVITv2_L_16x4_MaskFeat_PT | k400_MVITv2_L_16x4_FT | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/masked_models/k400_MVIT_L_MaskFeat_PT_epoch_00800.pyth) |\n\n\n\n## Getting started\nTo use self-supervised learning 
techniques, please refer to the configs under `configs/masked_ssl`. For example, the command\n\n```\npython tools/run_net.py \\\n  --cfg configs/masked_ssl/k400_MVITv2_L_16x4_MaskFeat_PT.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_Kinetics_dataset\n```\n\nshould train a MaskFeat MViT-L model on the Kinetics-400 dataset, and the command\n\n```\npython tools/run_net.py \\\n  --cfg configs/masked_ssl/k400_MVITv2_L_16x4_FT.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_Kinetics_dataset \\\n  TRAIN.CHECKPOINT_FILE_PATH path_to_your_pretrain_checkpoint\n```\n\nwill fine-tune the resulting model, after passing the checkpoint path to the config.\n\nFor images, the command\n\n```\npython tools/run_net.py \\\n  --cfg configs/masked_ssl/in1k_VIT_B_MaskFeat_PT.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_ImageNet_dataset\n```\n\nshould train a MaskFeat ViT-B model on the ImageNet dataset, and the command\n\n```\npython tools/run_net.py \\\n  --cfg configs/masked_ssl/in1k_VIT_B_MaskFeat_FT.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_ImageNet_dataset \\\n  TRAIN.CHECKPOINT_FILE_PATH path_to_your_pretrain_checkpoint\n```\n\nwill fine-tune the resulting model, after passing the checkpoint path to the config.\n\n## Reference\nIf you find this useful for your research, please consider citing the paper using the following BibTeX entry.\n```BibTeX\n@InProceedings{wei2022masked,\n    author    = {Wei, Chen and Fan, Haoqi and Xie, Saining and Wu, Chao-Yuan and Yuille, Alan and Feichtenhofer, Christoph},\n    title     = {Masked Feature Prediction for Self-Supervised Visual Pre-Training},\n    booktitle = {CVPR},\n    year      = {2022},\n}\n```\n"
  },
  {
    "path": "projects/multigrid/README.md",
    "content": "# A Multigrid Method for Efficiently Training Video Models\n[Chao-Yuan Wu](https://www.cs.utexas.edu/~cywu/),\n[Ross Girshick](http://rossgirshick.info),\n[Kaiming He](http://kaiminghe.com),\n[Christoph Feichtenhofer](http://feichtenhofer.github.io/),\n[Philipp Kr&auml;henb&uuml;hl](http://www.philkr.net/)\n<br/>\nIn CVPR, 2020. [[Paper](https://arxiv.org/abs/1912.00998)]\n<br/>\n<div align=\"center\">\n  <img src=\"multigrid.png\" width=\"700px\" />\n</div>\n<br/>\n\n\n## Getting started\nTo enable multigrid training, add `MULTIGRID.LONG_CYCLE True` and/or `MULTIGRID.SHORT_CYCLE True` when training your model. (Default multigrid training uses both long and short cycles; See [paper](https://arxiv.org/abs/1912.00998) for details.) For example,\n\n```\npython tools/run_net.py \\\n  --cfg configs/Charades/SLOWFAST_16x8_R50.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_dataset \\\n  MULTIGRID.LONG_CYCLE True \\\n  MULTIGRID.SHORT_CYCLE True \\\n```\nThis should train multiple times faster than training *without* multigrid training.\nNote that multigrid training might induce higher IO overhead.\nSystems with faster IO (e.g., with efficient local disk) might enjoy more speedup.\nPlease see [MODEL_ZOO.md](https://github.com/facebookresearch/SlowFast/blob/master/MODEL_ZOO.md) for more examples of multigrid training.\n\n## Citing Multigrid Training\nIf you use multigrid training or the models from MODEL_ZOO in your research, please use the following BibTeX entry.\n```BibTeX\n@inproceedings{multigrid2020,\n  Author    = {Chao-Yuan Wu and Ross Girshick and Kaiming He and Christoph Feichtenhofer\n               and Philipp Kr\\\"{a}henb\\\"{u}hl},\n  Title     = {{A Multigrid Method for Efficiently Training Video Models}},\n  Booktitle = {{CVPR}},\n  Year      = {2020}}\n```\n"
  },
  {
    "path": "projects/mvit/README.md",
    "content": "# Multiscale Vision Transformers\n[Haoqi Fan](https://haoqifan.github.io/)\\*, [Bo Xiong](https://www.cs.utexas.edu/~bxiong/)\\*, [Karttikeya Mangalam](https://karttikeya.github.io/)\\*, [Yanghao Li](https://lyttonhao.github.io/)\\*, [Zhicheng Yan](https://sites.google.com/view/zhicheng-yan), [Jitendra Malik](http://people.eecs.berkeley.edu/~malik/), [Christoph Feichtenhofer](http://feichtenhofer.github.io/)\\*,\n<br/>\nIn arXiv, 2104.11227, 2021. [[Paper](https://arxiv.org/abs/2104.11227.pdf)]\n<br/>\n<div align=\"center\">\n  <img src=\"teaser.png\" width=\"500px\" />\n</div>\n<br/>\n\n\n## Getting started\n\nTo use MViT-B models please refer to the configs under `configs/Kinetics`, or see the [MODEL_ZOO.md](https://github.com/facebookresearch/SlowFast/blob/master/MODEL_ZOO.md) for pre-trained models. See [paper](https://arxiv.org/abs/2104.11227.pdf) for details. For example, the command\n\n```\npython tools/run_net.py \\\n  --cfg configs/Kinetics/MVIT-B.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_dataset \\\n```\n\nshould train and test a MViT-B model on your dataset.\n\n## Citing MViT\nIf you find MViT useful for your research, please consider citing the paper using the following BibTeX entry.\n```BibTeX\n@Article{mvit2021,\n  author = {Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer},\n  title = {Multiscale Vision Transformers},\n  journal = {arXiv:2104.11227},\n  Year = {2021},\n}\n```\n"
  },
  {
    "path": "projects/mvitv2/README.md",
    "content": "# [MViTv2: Improved Multiscale Vision Transformers for Classification and Detection](https://arxiv.org/abs/2112.01526)\n\nOfficial PyTorch implementation of **MViTv2**, from the following paper:\n\n[MViTv2: Improved Multiscale Vision Transformers for Classification and Detection](https://arxiv.org/abs/2112.01526). CVPR 2022.\\\nYanghao Li*, Chao-Yuan Wu*, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer*\n\n---\n\nMViT is a multiscale transformer which serves as a general vision backbone for different visual recognition tasks. PySlowFast supports MViTv2 for video action recognition and detection tasks. For other tasks, please check:\n\n> **Image Classification**: See [MViTv2 for image classification](https://github.com/facebookresearch/mvit).\n\n> **Object Detection and Instance Segmentation**: See [MViTv2 in Detectron2](https://github.com/facebookresearch/detectron2/tree/main/projects/MViTv2).\n\n<div align=\"center\">\n  <img src=\"mvitv2.png\" width=\"500px\" />\n</div>\n<br/>\n\n## Results\n\n### Kinetics-400\n\n\n| name | frame length x sample rate | top1 |  top5  | Flops (G) x views | #params (M) |  model | config |\n| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |\n| MViTv2-S | 16 x 4 | 81.0 | 94.6 | 64 x 1 x 5 | 34.5 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvitv2/pysf_video_models/MViTv2_S_16x4_k400_f302660347.pyth) | Kinetics/MVITv2_S_16x4 |\n| MViTv2-B | 32 x 3 | 82.9 | 95.7 | 225 x 1 x 5 | 51.2 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvitv2/pysf_video_models/MViTv2_B_32x3_k400_f304025456.pyth) | Kinetics/MVITv2_B_32x3 |\n| MViTv2-L | 40 x 3 | 86.1 | 97.0 | 2828 x 3 x 5 | 217.6 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvitv2/pysf_video_models/MViTv2_L_40x3_k400_f306903192.pyth) | Kinetics/MVITv2_L_40x3_test |\n\n\n### SSv2\n\n\n| name | pretrain | frame length 
x sample rate | top1 |  top5  | Flops (G) x views | #params (M) |  model | config |\n| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |\n| MViTv2-S | K400 | 16 x 4 | 68.2 | 91.4 | 64 x 3 x 1 | 34.4 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvitv2/pysf_video_models/MViTv2_S_16x4_ssv2_f308341823.pyth) | SSv2/MVITv2_S_16x4 |\n| MViTv2-B | K400 | 32 x 3 | 70.5 | 92.7 | 225 x 3 x 1 | 51.1 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvitv2/pysf_video_models/MViTv2_B_32x3_ssv2_f305803282.pyth) | SSv2/MVITv2_B_32x3 |\n| MViTv2-L | IN21K + K400 | 40 x 3 | 73.3 | 94.1 | 2828 x 3 x 1 | 213.1 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/model_zoo/mvitv2/pysf_video_models/MViTv2_L_40x3_ssv2_f309603968.pyth) | SSv2/MVITv2_L_40x3 |\n\n\n### ImageNet-1K\n\n| name | resolution |acc@1 | #params | FLOPs | 1k model |\n|:---:|:---:|:---:|:---:| :---:|:---:|\n| MViTv2-T | 224x224 | 82.3 | 24M | 4.7G | [model](https://dl.fbaipublicfiles.com/mvit/mvitv2_models/MViTv2_T_in1k.pyth) |\n| MViTv2-S | 224x224 | 83.6 | 35M | 7.0G | [model](https://dl.fbaipublicfiles.com/mvit/mvitv2_models/MViTv2_S_in1k.pyth) |\n| MViTv2-B | 224x224 | 84.4 | 52M | 10.2G | [model](https://dl.fbaipublicfiles.com/mvit/mvitv2_models/MViTv2_B_in1k.pyth) |\n\nFor more ImageNet results, please check the MViTv2 for image classification [repo](https://github.com/facebookresearch/mvit).\n\n## Get started\n\nHere we can train a standard MViTv2 model from scratch by:\n\n```\npython tools/run_net.py \\\n  --cfg configs/Kinetics/MVITv2_S_16x4.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_dataset \\\n```\n\n\n## Citing MViTv2\nIf you find this repository helpful, please consider citing:\n```\n@inproceedings{li2021improved,\n  title={MViTv2: Improved multiscale vision transformers for classification and detection},\n  author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, 
Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},\n  booktitle={CVPR},\n  year={2022}\n}\n\n@inproceedings{fan2021multiscale,\n  title={Multiscale vision transformers},\n  author={Fan, Haoqi and Xiong, Bo and Mangalam, Karttikeya and Li, Yanghao and Yan, Zhicheng and Malik, Jitendra and Feichtenhofer, Christoph},\n  booktitle={ICCV},\n  year={2021}\n}\n```\n"
  },
  {
    "path": "projects/pytorchvideo/README.md",
    "content": "# Support PyTorchVideo in PySlowFast\n\n## Introduction\n\n[PyTorchVideo](https://pytorchvideo.org/) is a new deeplearning library with a focus on video understanding work, which provides reusable, modular and efficient components for video understanding. In PySlowFast, we add the support to incorporate PyTorchVideo components, including standard video datasets and state-of-the-art video models. Thus, we could use standard PySlowFast workflow to train and test PyTorchVideo datasets and models.\n\nWe add PySlowFast wrapper for different PyTorchVideo models and datasets. So we can easily construct PyTorchVideo datasets and models using PySlowFast config system. Right now, the supported [PyTorchVideo models](https://github.com/facebookresearch/SlowFast/blob/master/slowfast/models/ptv_model_builder.py) includes:\n  * [I3D](https://arxiv.org/pdf/1705.07750.pdf)\n  * [C2D](https://arxiv.org/pdf/1711.07971.pdf)\n  * [R(2+1)D](https://openaccess.thecvf.com/content_cvpr_2018/papers/Tran_A_Closer_Look_CVPR_2018_paper.pdf)\n  * [CSN](https://arxiv.org/abs/1904.02811)\n  * [Slow, SlowFast](https://arxiv.org/pdf/1812.03982.pdf)\n  * [X3D](https://arxiv.org/pdf/2004.04730.pdf)\n\nThe supported [PyTorchVideo datasets](https://github.com/facebookresearch/SlowFast/blob/master/slowfast/datasets/ptv_datasets.py) includes:\n  * Kinetics\n  * Charades\n  * Something-something v2\n\n## PyTorchVideo Model Zoo\n\nWe also provide a comprehensive PyTorchVideo Model Zoo using standard PySlowFast workflow and training recipe for PyTorchVideo datasets and models.\n\n\n### Kinetics-400\n\n| arch     | depth | pretrain | frame length x sample rate | top 1 | top 5 | Flops (G) x views | Params (M) | Model                                                                                                       | config                                         |\n| -------- | ----- | -------- | -------------------------- | ----- | ----- | ----------------- | ---------- | 
------------------------------------------------------------------------------------------------------------| ---------------------------------------------- |\n| C2D      | R50   | \\-       | 8x8                        | 71.46 | 89.68 | 25.89 x 3 x 10    | 24.33      | [link](https://dl.fbaipublicfiles.com/pytorchvideo/pysf_model_zoo/kinetics/C2D_8x8_R50.pyth)                | Kinetics/pytorchvideo/C2D_8x8_R50              |\n| I3D      | R50   | \\-       | 8x8                        | 73.27 | 90.70 | 37.53 x 3 x 10    | 28.04      | [link](https://dl.fbaipublicfiles.com/pytorchvideo/pysf_model_zoo/kinetics/I3D_8x8_R50.pyth)                | Kinetics/pytorchvideo/I3D_8x8_R50              |\n| Slow     | R50   | \\-       | 4x16                       | 72.40 | 90.18 | 27.55 x 3 x 10    | 32.45      | [link](https://dl.fbaipublicfiles.com/pytorchvideo/pysf_model_zoo/kinetics/SLOW_4x16_R50.pyth)              | Kinetics/pytorchvideo/SLOW_4x16_R50            |\n| Slow     | R50   | \\-       | 8x8                        | 74.58 | 91.63 | 54.52 x 3 x 10    | 32.45      | [link](https://dl.fbaipublicfiles.com/pytorchvideo/pysf_model_zoo/kinetics/SLOW_8x8_R50.pyth)               | Kinetics/pytorchvideo/SLOW_8x8_R50             |\n| SlowFast | R50   | \\-       | 4x16                       | 75.34 | 91.89 | 36.69 x 3 x 10    | 34.48      | [link](https://dl.fbaipublicfiles.com/pytorchvideo/pysf_model_zoo/kinetics/SLOWFAST_4x16_R50.pyth)          | Kinetics/pytorchvideo/SLOWFAST_4x16_R50        |\n| SlowFast | R50   | \\-       | 8x8                        | 76.94 | 92.69 | 65.71 x 3 x 10    | 34.57      | [link](https://dl.fbaipublicfiles.com/pytorchvideo/pysf_model_zoo/kinetics/SLOWFAST_8x8_R50.pyth)           | Kinetics/pytorchvideo/SLOWFAST_8x8_R50         |\n| SlowFast | R101  | \\-       | 8x8                        | 77.90 | 93.27 | 127.20 x 3 x 10   | 62.83      | [link](https://dl.fbaipublicfiles.com/pytorchvideo/pysf_model_zoo/kinetics/SLOWFAST_8x8_R101.pyth)     
     | Kinetics/pytorchvideo/SLOWFAST_8x8_R101        |\n| SlowFast | R101  | \\-       | 16x8                       | 78.70 | 93.61 | 215.61 x 3 x 10   | 53.77      | [link](https://dl.fbaipublicfiles.com/pytorchvideo/pysf_model_zoo/kinetics/SLOWFAST\\_16x8\\_R101_50_50.pyth) | Kinetics/pytorchvideo/SLOWFAST_16x8_R101_50_50 |\n| CSN      | R101  | \\-       | 32x2                       | 77.00 | 92.90 | 75.62 x 3 x 10    | 22.21      | [link](https://dl.fbaipublicfiles.com/pytorchvideo/pysf_model_zoo/kinetics/CSN_32x2_R101.pyth)              | Kinetics/pytorchvideo/CSN_32x2_R101            |\n| R(2+1)D  | R50   | \\-       | 16x4                       | 76.01 | 92.23 | 76.45 x 3 x 10    | 28.11      | [link](https://dl.fbaipublicfiles.com/pytorchvideo/pysf_model_zoo/kinetics/R2PLUS1D_16x4_R50.pyth)          | Kinetics/pytorchvideo/R2PLUS1D_16x4_R50        |\n| X3D      | XS    | \\-       | 4x12                       | 69.12 | 88.63 | 0.91 x 3 x 10     | 3.79       | [link](https://dl.fbaipublicfiles.com/pytorchvideo/pysf_model_zoo/kinetics/X3D_XS.pyth)                     | Kinetics/pytorchvideo/X3D_XS                   |\n| X3D      | S     | \\-       | 13x6                       | 73.33 | 91.27 | 2.96 x 3 x 10     | 3.79       | [link](https://dl.fbaipublicfiles.com/pytorchvideo/pysf_model_zoo/kinetics/X3D_S.pyth)                      | Kinetics/pytorchvideo/X3D_S                    |\n| X3D      | M     | \\-       | 16x5                       | 75.94 | 92.72 | 6.72 x 3 x 10     | 3.79       | [link](https://dl.fbaipublicfiles.com/pytorchvideo/pysf_model_zoo/kinetics/X3D_M.pyth)                      | Kinetics/pytorchvideo/X3D_M                    |\n| X3D      | L     | \\-       | 16x5                       | 77.44 | 93.31 | 26.64 x 3 x 10    | 6.15       | [link](https://dl.fbaipublicfiles.com/pytorchvideo/pysf_model_zoo/kinetics/X3D_L.pyth)                      | Kinetics/pytorchvideo/X3D_L                    |\n\n\n### Something-Something V2\n\n| arch    
 | depth | pretrain     | frame length x sample rate | top 1 | top 5 | Flops (G) x views | Params (M) | Model                                                                                               | config             |\n| -------- | ----- | ------------ | -------------------------- | ----- | ----- | ----------------- | ---------- | --------------------------------------------------------------------------------------------------- | ------------------ |\n| Slow     | R50   | Kinetics 400 | 8x8                        | 60.04 | 85.19 | 55.10 x 3 x 1     | 31.96      | [link](https://dl.fbaipublicfiles.com/pytorchvideo/pysf_model_zoo/ssv2/SLOW_8x8_R50.pyth)     | SSv2/pytorchvideo/SLOW_8x8_R50     |\n| SlowFast | R50   | Kinetics 400 | 8x8                        | 61.68 | 86.92 | 66.60 x 3 x 1     | 34.04      | [link](https://dl.fbaipublicfiles.com/pytorchvideo/pysf_model_zoo/ssv2/SLOWFAST_8x8_R50.pyth) | SSv2/pytorchvideo/SLOWFAST_8x8_R50 |\n\n### Charades\n\n| arch     | depth | pretrain     | frame x interval | MAP   | Flops (G) x views | Params (M) | Model                                                                                               | config             |\n| -------- | ----- | ------------ | ---------------- | ----- | ----------------- | ---------- | --------------------------------------------------------------------------------------------------- | ------------------ |\n| Slow     | R50   | Kinetics 400 | 8x8              | 34.72 | 55.10 x 3 x 10    | 31.96      | [link](https://dl.fbaipublicfiles.com/pytorchvideo/pysf_model_zoo/ssv2/SLOW_8x8_R50.pyth)     | Charades/pytorchvideo/SLOW_8x8_R50     |\n| SlowFast | R50   | Kinetics 400 | 8x8              | 37.24 | 66.60 x 3 x 10    | 34.00      | [link](https://dl.fbaipublicfiles.com/pytorchvideo/pysf_model_zoo/ssv2/SLOWFAST_8x8_R50.pyth) | Charades/pytorchvideo/SLOWFAST_8x8_R50 |\n\nNotes:\n* The model weights above differ slightly from those in the [PyTorchVideo official model 
zoo](https://github.com/facebookresearch/pytorchvideo/blob/master/docs/source/model_zoo.md). The layer names in the weights above carry an additional `model.` prefix because of the [model wrapper](https://github.com/facebookresearch/SlowFast/blob/master/slowfast/models/ptv_model_builder.py) in PySlowFast.\n* In the `Flops x views` column, we report the inference cost of a single \"view\" × the number of views (FLOPs × space_views × time_views). For example, we take 3 spatial crops for each of 10 temporal clips on Kinetics.\n"
  },
  {
    "path": "projects/rev/README.md",
    "content": "# Reversible Vision Transformers\n\nOfficial PyTorch implementation of **Rev-ViT** and **Rev-MViT** models, from the following paper:\n<br/>[Reversible Vision Transformers](https://openaccess.thecvf.com/content/CVPR2022/papers/Mangalam_Reversible_Vision_Transformers_CVPR_2022_paper.pdf)\n[Karttikeya Mangalam](https://karttikeya.github.io/)\\*, [Haoqi Fan](https://haoqifan.github.io/)\\, [Yanghao Li](https://lyttonhao.github.io/)\\, [Chao-Yuan Wu](https://chaoyuan.org/)\\,  [Bo Xiong](https://www.cs.utexas.edu/~bxiong/)\\,  [Christoph Feichtenhofer](http://feichtenhofer.github.io/)\\*, [Jitendra Malik](http://people.eecs.berkeley.edu/~malik/)\nCVPR 2022 (Oral)\n\n**Project Homepage**: https://karttikeya.github.io/publication/revvit/\n\n<br>\n<div  align=\"center\">\n<img  src=\"teaser.png\"  width=\"500px\"  />\n</div>\n<br/>\n\n## Pretrained Models\n\n### ImageNet\n\n| Architecture | #params (M) | FLOPs (G) | Top1 | weights | Config                |\n| ------------ | ----------- | --------- | ---- | ------- | --------------------- |\n| Rev-ViT-S    | 22          | 4.6       | 79.9 | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/rev/REV_VIT_S.pyth)       | ImageNet/REV_VIT_S    |\n| Rev-ViT-B    | 87          | 17.6      | 81.8      | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/rev/REV_VIT_B.pyth)       | ImageNet/REV_VIT_B    |\n| Rev-MViT-B   | 39          | 8.7       | 82.9*    | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/rev/REV_MVIT_B.pyth)       | ImageNet/REV_MVIT_B_16_CONV   |\n\n*improved from 82.5% reported in the [Paper Table 1](https://openaccess.thecvf.com/content/CVPR2022/papers/Mangalam_Reversible_Vision_Transformers_CVPR_2022_paper.pdf).\n\n### Kinetics 400\n\n| Architecture | frame length x sample rate | Top1 | Top5 | Flops (G) x views | #params (M) | weights | config                            |\n| ------------ | -------------------------- | ---- | ---- | ----------------- | ----------- | ------- | 
--------------------------------- |\n| Rev-MViT-B   | 16 x 4                     | 78.4   | 93.4    | 64 x 1 x 5        | 34.9        | [`link`](https://dl.fbaipublicfiles.com/pyslowfast/rev/REV_MVIT_B_16x4.pyth)       | Kinetics/REV_MVIT_B_16x4_CONV     |\n\n## Getting started\n\nTo use Rev-ViT (or Rev-MViT) image models, please refer to the configs under `configs/ImageNet`, and see the [paper](https://openaccess.thecvf.com/content/CVPR2022/papers/Mangalam_Reversible_Vision_Transformers_CVPR_2022_paper.pdf) for details. For example, the command\n\n```\npython tools/run_net.py \\\n  --cfg configs/ImageNet/REV_VIT_B.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_dataset \\\n  NUM_GPUS 1\n```\n\nshould train and test a Reversible ViT Base image model (trained with the DeiT recipe) on your dataset. Please refer to the [general repo-level](https://github.com/facebookresearch/SlowFast/blob/main/GETTING_STARTED.md) instructions for further details.\n\nFor Rev-MViT video models, please run the `configs/Kinetics` configs as follows:\n\n```\npython tools/run_net.py \\\n  --cfg configs/Kinetics/REV_MVIT_B_16x4_CONV.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_dataset \\\n  NUM_GPUS 1\n```\n\n## Citing Rev-ViT \\& Rev-MViT\n\nIf you find Reversible Models useful for your research, please consider citing the paper using the following BibTeX entry.\n\n```BibTeX\n@inproceedings{mangalam2022,\n  title = {Reversible Vision Transformers},\n  author = {Mangalam, Karttikeya and Fan, Haoqi and Li, Yanghao and Wu, Chao-Yuan and Xiong, Bo and Feichtenhofer, Christoph and Malik, Jitendra},\n  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},\n  year = {2022},\n}\n```\n"
  },
  {
    "path": "projects/x3d/README.md",
    "content": "# X3D: Progressive Network Expansion for Efficient Video Recognition\n[Christoph Feichtenhofer](http://feichtenhofer.github.io/),\n<br/>\nIn CVPR, 2020. [[Paper](https://arxiv.org/abs/2004.04730)]\n<br/>\n<div align=\"center\">\n  <img src=\"x3d_concept.png\" width=\"500px\" />\n  <img src=\"x3d_output.png\" width=\"350px\" />\n</div>\n<br/>\n\n\n## Getting started\n**IMPORTANT** The naïve implementation of channelwise 3D convolution (Conv3D operation with group size > 1) in PyTorch is extremely slow. To have fast GPU runtime with X3D models, please patch the following pull request before using X3D: [Conv3D pull request](https://github.com/pytorch/pytorch/pull/40801)\n\nTo use X3D models please refer to the configs under `configs/Kinetics`, or see the [MODEL_ZOO.md](https://github.com/facebookresearch/SlowFast/blob/master/MODEL_ZOO.md) for pre-trained models. See [paper](https://arxiv.org/abs/2004.04730) for details. For example, the command\n\n```\npython tools/run_net.py \\\n  --cfg configs/Kinetics/X3D-XS.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_dataset \\\n```\n\nshould train and test an extra small (XS) X3D model on your dataset.\n\n## Citing X3D\nIf you find X3D useful for your research, please consider citing the paper using the following BibTeX entry.\n```BibTeX\n@inproceedings{x3d2020,\n  Author    = {Christoph Feichtenhofer},\n  Title     = {{X3D}: Progressive Network Expansion for Efficient Video Recognition},\n  Booktitle = {{CVPR}},\n  Year      = {2020}}\n```\n"
  },
  {
    "path": "setup.cfg",
    "content": "[isort]\nline_length=100\nmulti_line_output=4\nknown_standard_library=numpy,setuptools\nknown_myself=slowfast\nknown_third_party=fvcore,iopath,av,torch,pycocotools,yacs,termcolor,scipy,simplejson,matplotlib,detectron2,torchvision,yaml,tqdm,psutil,opencv-python,pandas,tensorboard,moviepy,sklearn,cv2,PIL\nno_lines_before=STDLIB,THIRDPARTY\nsections=FUTURE,STDLIB,THIRDPARTY,myself,FIRSTPARTY,LOCALFOLDER\ndefault_section=FIRSTPARTY\n\n[mypy]\npython_version=3.6\nignore_missing_imports = True\nwarn_unused_configs = True\ndisallow_untyped_defs = True\ncheck_untyped_defs = True\nwarn_unused_ignores = True\nwarn_redundant_casts = True\nshow_column_numbers = True\nfollow_imports = silent\nallow_redefinition = True\n; Require all functions to be annotated\ndisallow_incomplete_defs = True\n"
  },
  {
    "path": "setup.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nfrom setuptools import find_packages, setup\n\nsetup(\n    name=\"slowfast\",\n    version=\"1.0\",\n    author=\"FAIR\",\n    url=\"unknown\",\n    description=\"SlowFast Video Understanding\",\n    install_requires=[\n        \"yacs>=0.1.6\",\n        \"pyyaml>=5.1\",\n        \"av\",\n        \"matplotlib\",\n        \"termcolor>=1.1\",\n        \"simplejson\",\n        \"tqdm\",\n        \"psutil\",\n        \"matplotlib\",\n        \"detectron2\",\n        \"opencv-python\",\n        \"pandas\",\n        \"torchvision>=0.4.2\",\n        \"PIL\",\n        \"sklearn\",\n        \"tensorboard\",\n        \"fairscale\",\n    ],\n    extras_require={\"tensorboard_video_visualization\": [\"moviepy\"]},\n    packages=find_packages(exclude=(\"configs\", \"tests\")),\n)\n"
  },
  {
    "path": "slowfast/__init__.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nfrom slowfast.utils.env import setup_environment\n\nsetup_environment()\n"
  },
  {
    "path": "slowfast/config/__init__.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n"
  },
  {
    "path": "slowfast/config/custom_config.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Add custom configs and default values\"\"\"\n\n\ndef add_custom_config(_C):\n    # Add your own customized configs.\n    pass\n"
  },
  {
    "path": "slowfast/config/defaults.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Configs.\"\"\"\n\nimport math\n\nfrom fvcore.common.config import CfgNode\n\nfrom . import custom_config\n\n# -----------------------------------------------------------------------------\n# Config definition\n# -----------------------------------------------------------------------------\n_C = CfgNode()\n\n# -----------------------------------------------------------------------------\n# Contrastive Model (for MoCo, SimCLR, SwAV, BYOL)\n# -----------------------------------------------------------------------------\n\n_C.CONTRASTIVE = CfgNode()\n\n# temperature used for contrastive losses\n_C.CONTRASTIVE.T = 0.07\n\n# output dimension for the loss\n_C.CONTRASTIVE.DIM = 128\n\n# number of training samples (for kNN bank)\n_C.CONTRASTIVE.LENGTH = 239975\n\n# the length of MoCo's and MemBanks' queues\n_C.CONTRASTIVE.QUEUE_LEN = 65536\n\n# momentum for momentum encoder updates\n_C.CONTRASTIVE.MOMENTUM = 0.5\n\n# whether to anneal momentum to value above with cosine schedule\n_C.CONTRASTIVE.MOMENTUM_ANNEALING = False\n\n# either memorybank, moco, simclr, byol, swav\n_C.CONTRASTIVE.TYPE = \"mem\"\n\n# wether to interpolate memorybank in time\n_C.CONTRASTIVE.INTERP_MEMORY = False\n\n# 1d or 2d (+temporal) memory\n_C.CONTRASTIVE.MEM_TYPE = \"1d\"\n\n# number of classes for online kNN evaluation\n_C.CONTRASTIVE.NUM_CLASSES_DOWNSTREAM = 400\n\n# use an MLP projection with these num layers\n_C.CONTRASTIVE.NUM_MLP_LAYERS = 1\n\n# dimension of projection and predictor MLPs\n_C.CONTRASTIVE.MLP_DIM = 2048\n\n# use BN in projection/prediction MLP\n_C.CONTRASTIVE.BN_MLP = False\n\n# use synchronized BN in projection/prediction MLP\n_C.CONTRASTIVE.BN_SYNC_MLP = False\n\n# shuffle BN only locally vs. 
across machines\n_C.CONTRASTIVE.LOCAL_SHUFFLE_BN = True\n\n# Whether to fill multiple clips (or just the first) into queue\n_C.CONTRASTIVE.MOCO_MULTI_VIEW_QUEUE = False\n\n# if sampling multiple clips per vid they need to be at least min frames apart\n_C.CONTRASTIVE.DELTA_CLIPS_MIN = -math.inf\n\n# if sampling multiple clips per vid they can be max frames apart\n_C.CONTRASTIVE.DELTA_CLIPS_MAX = math.inf\n\n# if non empty, use predictors with depth specified\n_C.CONTRASTIVE.PREDICTOR_DEPTHS = []\n\n# Whether to sequentially process multiple clips (=lower mem usage) or batch them\n_C.CONTRASTIVE.SEQUENTIAL = False\n\n# Whether to perform SimCLR loss across machines (or only locally)\n_C.CONTRASTIVE.SIMCLR_DIST_ON = True\n\n# Length of queue used in SwAV\n_C.CONTRASTIVE.SWAV_QEUE_LEN = 0\n\n# Whether to run online kNN evaluation during training\n_C.CONTRASTIVE.KNN_ON = True\n\n\n# ---------------------------------------------------------------------------- #\n# Batch norm options\n# ---------------------------------------------------------------------------- #\n_C.BN = CfgNode()\n\n# Precise BN stats.\n_C.BN.USE_PRECISE_STATS = False\n\n# Number of batches used to compute precise BN stats.\n_C.BN.NUM_BATCHES_PRECISE = 200\n\n# Weight decay value that applies on BN.\n_C.BN.WEIGHT_DECAY = 0.0\n\n# Norm type, options include `batchnorm`, `sub_batchnorm`, `sync_batchnorm`\n_C.BN.NORM_TYPE = \"batchnorm\"\n\n# Parameter for SubBatchNorm, where it splits the batch dimension into\n# NUM_SPLITS splits, and runs BN on each of them independently.\n_C.BN.NUM_SPLITS = 1\n\n# Parameter for NaiveSyncBatchNorm, where the stats across `NUM_SYNC_DEVICES`\n# devices will be synchronized. `NUM_SYNC_DEVICES` cannot be larger than number of\n# devices per machine; if global sync is desired, set `GLOBAL_SYNC`.\n# By default ONLY applies to NaiveSyncBatchNorm3d; consider also setting\n# CONTRASTIVE.BN_SYNC_MLP if appropriate.\n_C.BN.NUM_SYNC_DEVICES = 1\n\n# Parameter for NaiveSyncBatchNorm. 
Setting `GLOBAL_SYNC` to True synchronizes\n# stats across all devices, across all machines; in this case, `NUM_SYNC_DEVICES`\n# must be set to None.\n# By default ONLY applies to NaiveSyncBatchNorm3d; consider also setting\n# CONTRASTIVE.BN_SYNC_MLP if appropriate.\n_C.BN.GLOBAL_SYNC = False\n\n# ---------------------------------------------------------------------------- #\n# Training options.\n# ---------------------------------------------------------------------------- #\n_C.TRAIN = CfgNode()\n\n# If True Train the model, else skip training.\n_C.TRAIN.ENABLE = True\n\n# Kill training if loss explodes over this ratio from the previous 5 measurements.\n# Only enforced if > 0.0\n_C.TRAIN.KILL_LOSS_EXPLOSION_FACTOR = 0.0\n\n# Dataset.\n_C.TRAIN.DATASET = \"kinetics\"\n\n# Total mini-batch size.\n_C.TRAIN.BATCH_SIZE = 64\n\n# Evaluate model on test data every eval period epochs.\n_C.TRAIN.EVAL_PERIOD = 10\n\n# Save model checkpoint every checkpoint period epochs.\n_C.TRAIN.CHECKPOINT_PERIOD = 10\n\n# Resume training from the latest checkpoint in the output directory.\n_C.TRAIN.AUTO_RESUME = True\n\n# Path to the checkpoint to load the initial weight.\n_C.TRAIN.CHECKPOINT_FILE_PATH = \"\"\n\n# Checkpoint types include `caffe2` or `pytorch`.\n_C.TRAIN.CHECKPOINT_TYPE = \"pytorch\"\n\n# If True, perform inflation when loading checkpoint.\n_C.TRAIN.CHECKPOINT_INFLATE = False\n\n# If True, reset epochs when loading checkpoint.\n_C.TRAIN.CHECKPOINT_EPOCH_RESET = False\n\n# If set, clear all layer names according to the pattern provided.\n_C.TRAIN.CHECKPOINT_CLEAR_NAME_PATTERN = ()  # (\"backbone.\",)\n\n# If True, use FP16 for activations\n_C.TRAIN.MIXED_PRECISION = False\n\n# if True, inflate some params from imagenet model.\n_C.TRAIN.CHECKPOINT_IN_INIT = False\n\n# ---------------------------------------------------------------------------- #\n# Augmentation options.\n# ---------------------------------------------------------------------------- #\n_C.AUG = 
CfgNode()\n\n# Whether to enable randaug.\n_C.AUG.ENABLE = False\n\n# Number of repeated augmentations to use during training.\n# If this is greater than 1, then the actual batch size is\n# TRAIN.BATCH_SIZE * AUG.NUM_SAMPLE.\n_C.AUG.NUM_SAMPLE = 1\n\n# Not used if using randaug.\n_C.AUG.COLOR_JITTER = 0.4\n\n# RandAug parameters.\n_C.AUG.AA_TYPE = \"rand-m9-mstd0.5-inc1\"\n\n# Interpolation method.\n_C.AUG.INTERPOLATION = \"bicubic\"\n\n# Probability of random erasing.\n_C.AUG.RE_PROB = 0.25\n\n# Random erasing mode.\n_C.AUG.RE_MODE = \"pixel\"\n\n# Random erase count.\n_C.AUG.RE_COUNT = 1\n\n# Do not random erase first (clean) augmentation split.\n_C.AUG.RE_SPLIT = False\n\n# Whether to generate input mask during image processing.\n_C.AUG.GEN_MASK_LOADER = False\n\n# If True, masking mode is \"tube\". Default is \"cube\".\n_C.AUG.MASK_TUBE = False\n\n# If True, masking mode is \"frame\". Default is \"cube\".\n_C.AUG.MASK_FRAMES = False\n\n# The size of generated masks.\n_C.AUG.MASK_WINDOW_SIZE = [8, 7, 7]\n\n# The ratio of masked tokens out of all tokens. Also applies to MViT supervised training\n_C.AUG.MASK_RATIO = 0.0\n\n# The maximum number of masked patches per block. None means no maximum limit. 
(Used only in image MaskFeat.)\n_C.AUG.MAX_MASK_PATCHES_PER_BLOCK = None\n\n# ---------------------------------------------------------------------------- #\n# Masked pretraining visualization options.\n# ---------------------------------------------------------------------------- #\n_C.VIS_MASK = CfgNode()\n\n# Whether to do visualization.\n_C.VIS_MASK.ENABLE = False\n\n# ---------------------------------------------------------------------------- #\n# MixUp options.\n# ---------------------------------------------------------------------------- #\n_C.MIXUP = CfgNode()\n\n# Whether to use mixup.\n_C.MIXUP.ENABLE = False\n\n# Mixup alpha.\n_C.MIXUP.ALPHA = 0.8\n\n# Cutmix alpha.\n_C.MIXUP.CUTMIX_ALPHA = 1.0\n\n# Probability of performing mixup or cutmix when either/both is enabled.\n_C.MIXUP.PROB = 1.0\n\n# Probability of switching to cutmix when both mixup and cutmix are enabled.\n_C.MIXUP.SWITCH_PROB = 0.5\n\n# Label smoothing.\n_C.MIXUP.LABEL_SMOOTH_VALUE = 0.1\n\n# ---------------------------------------------------------------------------- #\n# Testing options\n# ---------------------------------------------------------------------------- #\n_C.TEST = CfgNode()\n\n# If True, test the model, else skip the testing.\n_C.TEST.ENABLE = True\n\n# Dataset for testing.\n_C.TEST.DATASET = \"kinetics\"\n\n# Total mini-batch size\n_C.TEST.BATCH_SIZE = 8\n\n# Path to the checkpoint to load the initial weight.\n_C.TEST.CHECKPOINT_FILE_PATH = \"\"\n\n# Number of clips to sample from a video uniformly for aggregating the\n# prediction results.\n_C.TEST.NUM_ENSEMBLE_VIEWS = 10\n\n# Number of crops to sample from a frame spatially for aggregating the\n# prediction results.\n_C.TEST.NUM_SPATIAL_CROPS = 3\n\n# Checkpoint types include `caffe2` or `pytorch`.\n_C.TEST.CHECKPOINT_TYPE = \"pytorch\"\n# Path for saving the prediction results file.\n_C.TEST.SAVE_RESULTS_PATH = \"\"\n\n_C.TEST.NUM_TEMPORAL_CLIPS = []\n# 
-----------------------------------------------------------------------------\n# ResNet options\n# -----------------------------------------------------------------------------\n_C.RESNET = CfgNode()\n\n# Transformation function.\n_C.RESNET.TRANS_FUNC = \"bottleneck_transform\"\n\n# Number of groups (1 for ResNet, larger than 1 for ResNeXt).\n_C.RESNET.NUM_GROUPS = 1\n\n# Width of each group (64 -> ResNet; 4 -> ResNeXt).\n_C.RESNET.WIDTH_PER_GROUP = 64\n\n# Apply relu in an in-place manner.\n_C.RESNET.INPLACE_RELU = True\n\n# Apply stride to 1x1 conv.\n_C.RESNET.STRIDE_1X1 = False\n\n# If True, initialize the gamma of the final BN of each block to zero.\n_C.RESNET.ZERO_INIT_FINAL_BN = False\n\n# If True, initialize the final conv layer of each block to zero.\n_C.RESNET.ZERO_INIT_FINAL_CONV = False\n\n# Number of weight layers.\n_C.RESNET.DEPTH = 50\n\n# If the current block has more than NUM_BLOCK_TEMP_KERNEL blocks, use temporal\n# kernel of 1 for the rest of the blocks.\n_C.RESNET.NUM_BLOCK_TEMP_KERNEL = [[3], [4], [6], [3]]\n\n# Size of stride on different res stages.\n_C.RESNET.SPATIAL_STRIDES = [[1], [2], [2], [2]]\n\n# Size of dilation on different res stages.\n_C.RESNET.SPATIAL_DILATIONS = [[1], [1], [1], [1]]\n\n# ---------------------------------------------------------------------------- #\n# X3D options\n# See https://arxiv.org/abs/2004.04730 for details about X3D Networks.\n# ---------------------------------------------------------------------------- #\n_C.X3D = CfgNode()\n\n# Width expansion factor.\n_C.X3D.WIDTH_FACTOR = 1.0\n\n# Depth expansion factor.\n_C.X3D.DEPTH_FACTOR = 1.0\n\n# Bottleneck expansion factor for the 3x3x3 conv.\n_C.X3D.BOTTLENECK_FACTOR = 1.0\n\n# Dimensions of the last linear layer before classification.\n_C.X3D.DIM_C5 = 2048\n\n# Dimensions of the first 3x3 conv layer.\n_C.X3D.DIM_C1 = 12\n\n# Whether to scale the width of Res2, default is false.\n_C.X3D.SCALE_RES2 = False\n\n# Whether to use a BatchNorm (BN) layer 
before the classifier, default is false.\n_C.X3D.BN_LIN5 = False\n\n# Whether to use channelwise (=depthwise) convolution in the center (3x3x3)\n# convolution operation of the residual blocks.\n_C.X3D.CHANNELWISE_3x3x3 = True\n\n# -----------------------------------------------------------------------------\n# Nonlocal options\n# -----------------------------------------------------------------------------\n_C.NONLOCAL = CfgNode()\n\n# Index of each stage and block to add nonlocal layers.\n_C.NONLOCAL.LOCATION = [[[]], [[]], [[]], [[]]]\n\n# Number of groups for nonlocal for each stage.\n_C.NONLOCAL.GROUP = [[1], [1], [1], [1]]\n\n# Instantiation to use for non-local layer.\n_C.NONLOCAL.INSTANTIATION = \"dot_product\"\n\n\n# Size of pooling layers used in Non-Local.\n_C.NONLOCAL.POOL = [\n    # Res2\n    [[1, 2, 2], [1, 2, 2]],\n    # Res3\n    [[1, 2, 2], [1, 2, 2]],\n    # Res4\n    [[1, 2, 2], [1, 2, 2]],\n    # Res5\n    [[1, 2, 2], [1, 2, 2]],\n]\n\n# -----------------------------------------------------------------------------\n# Model options\n# -----------------------------------------------------------------------------\n_C.MODEL = CfgNode()\n\n# Model architecture.\n_C.MODEL.ARCH = \"slowfast\"\n\n# Model name\n_C.MODEL.MODEL_NAME = \"SlowFast\"\n\n# The number of classes to predict for the model.\n_C.MODEL.NUM_CLASSES = 400\n\n# Loss function.\n_C.MODEL.LOSS_FUNC = \"cross_entropy\"\n\n# Model architectures that have one single pathway.\n_C.MODEL.SINGLE_PATHWAY_ARCH = [\n    \"2d\",\n    \"c2d\",\n    \"i3d\",\n    \"slow\",\n    \"x3d\",\n    \"mvit\",\n    \"maskmvit\",\n]\n\n# Model architectures that have multiple pathways.\n_C.MODEL.MULTI_PATHWAY_ARCH = [\"slowfast\"]\n\n# Dropout rate before final projection in the backbone.\n_C.MODEL.DROPOUT_RATE = 0.5\n\n# Random drop rate for Res-blocks, linearly increasing from res2 to res5\n_C.MODEL.DROPCONNECT_RATE = 0.0\n\n# The std to initialize the fc layer(s).\n_C.MODEL.FC_INIT_STD = 0.01\n\n# Activation 
layer for the output head.\n_C.MODEL.HEAD_ACT = \"softmax\"\n\n# Activation checkpointing enabled or not to save GPU memory.\n_C.MODEL.ACT_CHECKPOINT = False\n\n# If True, detach the final fc layer from the network, by doing so, only the\n# final fc layer will be trained.\n_C.MODEL.DETACH_FINAL_FC = False\n\n# If True, freeze batch norm stats during training.\n_C.MODEL.FROZEN_BN = False\n\n# If True, AllReduce gradients are compressed to fp16\n_C.MODEL.FP16_ALLREDUCE = False\n\n\n# -----------------------------------------------------------------------------\n# MViT options\n# -----------------------------------------------------------------------------\n_C.MVIT = CfgNode()\n\n# Options include `conv`, `max`.\n_C.MVIT.MODE = \"conv\"\n\n# If True, perform pool before projection in attention.\n_C.MVIT.POOL_FIRST = False\n\n# If True, use cls embed in the network, otherwise don't use cls_embed in transformer.\n_C.MVIT.CLS_EMBED_ON = True\n\n# Kernel size for patchification.\n_C.MVIT.PATCH_KERNEL = [3, 7, 7]\n\n# Stride size for patchification.\n_C.MVIT.PATCH_STRIDE = [2, 4, 4]\n\n# Padding size for patchification.\n_C.MVIT.PATCH_PADDING = [2, 4, 4]\n\n# If True, use 2d patch, otherwise use 3d patch.\n_C.MVIT.PATCH_2D = False\n\n# Base embedding dimension for the transformer.\n_C.MVIT.EMBED_DIM = 96\n\n# Base num of heads for the transformer.\n_C.MVIT.NUM_HEADS = 1\n\n# Dimension reduction ratio for the MLP layers.\n_C.MVIT.MLP_RATIO = 4.0\n\n# If True, use bias term in attention fc layers.\n_C.MVIT.QKV_BIAS = True\n\n# Drop path rate for the transformer.\n_C.MVIT.DROPPATH_RATE = 0.1\n\n# The initial value of layer scale gamma. Set 0.0 to disable layer scale.\n_C.MVIT.LAYER_SCALE_INIT_VALUE = 0.0\n\n# Depth of the transformer.\n_C.MVIT.DEPTH = 16\n\n# Normalization layer for the transformer. Only layernorm is supported now.\n_C.MVIT.NORM = \"layernorm\"\n\n# Dimension multiplication at layer i. 
If 2.0 is used, then the next block will increase\n# the dimension by 2 times. Format: [depth_i: mul_dim_ratio]\n_C.MVIT.DIM_MUL = []\n\n# Head number multiplication at layer i. If 2.0 is used, then the next block will\n# increase the number of heads by 2 times. Format: [depth_i: head_mul_ratio]\n_C.MVIT.HEAD_MUL = []\n\n# Stride size for the Pool KV at layer i.\n# Format: [[i, stride_t_i, stride_h_i, stride_w_i], ...,]\n_C.MVIT.POOL_KV_STRIDE = []\n\n# Initial stride size for KV at layer 1. The stride size will be further reduced with\n# the ratio of MVIT.DIM_MUL. It will overwrite MVIT.POOL_KV_STRIDE if not None.\n_C.MVIT.POOL_KV_STRIDE_ADAPTIVE = None\n\n# Stride size for the Pool Q at layer i.\n# Format: [[i, stride_t_i, stride_h_i, stride_w_i], ...,]\n_C.MVIT.POOL_Q_STRIDE = []\n\n# If not None, overwrite the KV_KERNEL and Q_KERNEL size with POOL_KVQ_KERNEL.\n# Otherwise the kernel_size is [s + 1 if s > 1 else s for s in stride_size].\n_C.MVIT.POOL_KVQ_KERNEL = None\n\n# If True, perform no decay on positional embedding and cls embedding.\n_C.MVIT.ZERO_DECAY_POS_CLS = True\n\n# If True, use norm after stem.\n_C.MVIT.NORM_STEM = False\n\n# If True, perform separate positional embedding.\n_C.MVIT.SEP_POS_EMBED = False\n\n# Dropout rate for the MViT backbone.\n_C.MVIT.DROPOUT_RATE = 0.0\n\n# If True, use absolute positional embedding.\n_C.MVIT.USE_ABS_POS = True\n\n# If True, use relative positional embedding for spatial dimensions\n_C.MVIT.REL_POS_SPATIAL = False\n\n# If True, use relative positional embedding for temporal dimensions\n_C.MVIT.REL_POS_TEMPORAL = False\n\n# If True, init rel with zero\n_C.MVIT.REL_POS_ZERO_INIT = False\n\n# If True, use Residual Pooling connection\n_C.MVIT.RESIDUAL_POOLING = False\n\n# Dim mul in qkv linear layers of attention block instead of MLP\n_C.MVIT.DIM_MUL_IN_ATT = False\n\n# If True, use separate linear layers for Q, K, V in attention blocks.\n_C.MVIT.SEPARATE_QKV = False\n\n# The initialization scale factor for the 
head parameters.\n_C.MVIT.HEAD_INIT_SCALE = 1.0\n\n# Whether to use the mean pooling of all patch tokens as the output.\n_C.MVIT.USE_MEAN_POOLING = False\n\n# If True, use frozen sin cos positional embedding.\n_C.MVIT.USE_FIXED_SINCOS_POS = False\n\n# -----------------------------------------------------------------------------\n# Masked pretraining options\n# -----------------------------------------------------------------------------\n_C.MASK = CfgNode()\n\n# Whether to enable Masked style pretraining.\n_C.MASK.ENABLE = False\n\n# Whether to enable MAE (discard encoder tokens).\n_C.MASK.MAE_ON = False\n\n# Whether to enable random masking in mae\n_C.MASK.MAE_RND_MASK = False\n\n# Whether to do random masking per-frame in mae\n_C.MASK.PER_FRAME_MASKING = False\n\n# only predict loss on temporal strided patches, or predict full time extent\n_C.MASK.TIME_STRIDE_LOSS = True\n\n# Whether to normalize the pred pixel loss\n_C.MASK.NORM_PRED_PIXEL = True\n\n# Whether to fix initialization with inverse depth of layer for pretraining.\n_C.MASK.SCALE_INIT_BY_DEPTH = False\n\n# Base embedding dimension for the decoder transformer.\n_C.MASK.DECODER_EMBED_DIM = 512\n\n# If True, use separate positional embedding in the decoder transformer.\n_C.MASK.DECODER_SEP_POS_EMBED = False\n\n# Use a KV kernel in decoder?\n_C.MASK.DEC_KV_KERNEL = []\n\n# Use a KV stride in decoder?\n_C.MASK.DEC_KV_STRIDE = []\n\n# The depths of features which are inputs of the prediction head.\n_C.MASK.PRETRAIN_DEPTH = [15]\n\n# The type of Masked pretraining prediction head.\n# Can be \"separate\", \"separate_xformer\".\n_C.MASK.HEAD_TYPE = \"separate\"\n\n# The depth of MAE's decoder\n_C.MASK.DECODER_DEPTH = 0\n\n# Whether to predict HOG targets.\n_C.MASK.PRED_HOG = False\n\n# Reversible Configs\n_C.MVIT.REV = CfgNode()\n\n# Enable Reversible Model\n_C.MVIT.REV.ENABLE = False\n\n# Method to fuse the reversible paths\n# see :class: `TwoStreamFusion` for all the options\n_C.MVIT.REV.RESPATH_FUSE = \"concat\"\n\n# Layers to 
buffer activations at\n# (at least Q-pooling layers needed)\n_C.MVIT.REV.BUFFER_LAYERS = []\n\n# 'conv' or 'max' operator for the respath in Qpooling\n_C.MVIT.REV.RES_PATH = \"conv\"\n\n# Method to merge hidden states before Qpoolinglayers\n_C.MVIT.REV.PRE_Q_FUSION = \"avg\"\n\n# -----------------------------------------------------------------------------\n# SlowFast options\n# -----------------------------------------------------------------------------\n_C.SLOWFAST = CfgNode()\n\n# Corresponds to the inverse of the channel reduction ratio, $\\beta$ between\n# the Slow and Fast pathways.\n_C.SLOWFAST.BETA_INV = 8\n\n# Corresponds to the frame rate reduction ratio, $\\alpha$ between the Slow and\n# Fast pathways.\n_C.SLOWFAST.ALPHA = 8\n\n# Ratio of channel dimensions between the Slow and Fast pathways.\n_C.SLOWFAST.FUSION_CONV_CHANNEL_RATIO = 2\n\n# Kernel dimension used for fusing information from Fast pathway to Slow\n# pathway.\n_C.SLOWFAST.FUSION_KERNEL_SZ = 5\n\n\n# -----------------------------------------------------------------------------\n# Data options\n# -----------------------------------------------------------------------------\n_C.DATA = CfgNode()\n\n# The path to the data directory.\n_C.DATA.PATH_TO_DATA_DIR = \"\"\n\n# The separator used between path and label.\n_C.DATA.PATH_LABEL_SEPARATOR = \" \"\n\n# Video path prefix if any.\n_C.DATA.PATH_PREFIX = \"\"\n\n# The number of frames of the input clip.\n_C.DATA.NUM_FRAMES = 8\n\n# The video sampling rate of the input clip.\n_C.DATA.SAMPLING_RATE = 8\n\n# Eigenvalues for PCA jittering. 
Note PCA is RGB based.\n_C.DATA.TRAIN_PCA_EIGVAL = [0.225, 0.224, 0.229]\n\n# Eigenvectors for PCA jittering.\n_C.DATA.TRAIN_PCA_EIGVEC = [\n    [-0.5675, 0.7192, 0.4009],\n    [-0.5808, -0.0045, -0.8140],\n    [-0.5836, -0.6948, 0.4203],\n]\n\n# If an imdb has been dumped to a local file with the following format:\n# `{\"im_path\": im_path, \"class\": cont_id}`\n# then we can skip the construction of imdb and load it from the local file.\n_C.DATA.PATH_TO_PRELOAD_IMDB = \"\"\n\n# The mean value of the video raw pixels across the R G B channels.\n_C.DATA.MEAN = [0.45, 0.45, 0.45]\n\n# List of input frame channel dimensions.\n_C.DATA.INPUT_CHANNEL_NUM = [3, 3]\n\n# The std value of the video raw pixels across the R G B channels.\n_C.DATA.STD = [0.225, 0.225, 0.225]\n\n# The spatial augmentation jitter scales for training.\n_C.DATA.TRAIN_JITTER_SCALES = [256, 320]\n\n# The relative scale range of Inception-style area based random resizing augmentation.\n# If this is provided, DATA.TRAIN_JITTER_SCALES above is ignored.\n_C.DATA.TRAIN_JITTER_SCALES_RELATIVE = []\n\n# The relative aspect ratio range of Inception-style area based random resizing\n# augmentation.\n_C.DATA.TRAIN_JITTER_ASPECT_RELATIVE = []\n\n# If True, perform stride length uniform temporal sampling.\n_C.DATA.USE_OFFSET_SAMPLING = False\n\n# Whether to apply motion shift for augmentation.\n_C.DATA.TRAIN_JITTER_MOTION_SHIFT = False\n\n# The spatial crop size for training.\n_C.DATA.TRAIN_CROP_SIZE = 224\n\n# The spatial crop size for testing.\n_C.DATA.TEST_CROP_SIZE = 256\n\n# Input videos may have different fps; convert them to the target video fps\n# before frame sampling.\n_C.DATA.TARGET_FPS = 30\n\n# Jitter TARGET_FPS by +- this number randomly\n_C.DATA.TRAIN_JITTER_FPS = 0.0\n\n# Decoding backend, options include `pyav` or `torchvision`\n_C.DATA.DECODING_BACKEND = \"torchvision\"\n\n# Decoding resize to short size (set to native size for best speed)\n_C.DATA.DECODING_SHORT_SIZE = 256\n\n# if True, sample 
uniformly in [1 / max_scale, 1 / min_scale] and take a\n# reciprocal to get the scale. If False, take a uniform sample from\n# [min_scale, max_scale].\n_C.DATA.INV_UNIFORM_SAMPLE = False\n\n# If True, perform random horizontal flip on the video frames during training.\n_C.DATA.RANDOM_FLIP = True\n\n# If True, calculate the mAP as the metric.\n_C.DATA.MULTI_LABEL = False\n\n# Method to perform the ensemble, options include \"sum\" and \"max\".\n_C.DATA.ENSEMBLE_METHOD = \"sum\"\n\n# If True, reverse the default input channel order (RGB <-> BGR).\n_C.DATA.REVERSE_INPUT_CHANNEL = False\n\n# how many samples (=clips) to decode from a single video\n_C.DATA.TRAIN_CROP_NUM_TEMPORAL = 1\n\n# how many spatial samples to crop from a single clip\n_C.DATA.TRAIN_CROP_NUM_SPATIAL = 1\n\n# color random percentage for grayscale conversion\n_C.DATA.COLOR_RND_GRAYSCALE = 0.0\n\n# loader can read .csv file in chunks of this chunk size\n_C.DATA.LOADER_CHUNK_SIZE = 0\n\n# if LOADER_CHUNK_SIZE > 0, define overall length of .csv file\n_C.DATA.LOADER_CHUNK_OVERALL_SIZE = 0\n\n# for chunked reading, dataloader can skip rows in (large)\n# training csv file\n_C.DATA.SKIP_ROWS = 0\n\n# augmentation probability to convert raw decoded video to\n# grayscale temporal difference\n_C.DATA.TIME_DIFF_PROB = 0.0\n\n# Apply SSL-based SimCLR / MoCo v1/v2 color augmentations,\n# with params below\n_C.DATA.SSL_COLOR_JITTER = False\n\n# color jitter percentage for brightness, contrast, saturation\n_C.DATA.SSL_COLOR_BRI_CON_SAT = [0.4, 0.4, 0.4]\n\n# color jitter percentage for hue\n_C.DATA.SSL_COLOR_HUE = 0.1\n\n# SimCLR / MoCo v2 augmentations on/off\n_C.DATA.SSL_MOCOV2_AUG = False\n\n# SimCLR / MoCo v2 blur augmentation minimum gaussian sigma\n_C.DATA.SSL_BLUR_SIGMA_MIN = [0.0, 0.1]\n\n# SimCLR / MoCo v2 blur augmentation maximum gaussian sigma\n_C.DATA.SSL_BLUR_SIGMA_MAX = [0.0, 2.0]\n\n\n# If combine train/val split as 
training for in21k\n_C.DATA.IN22K_TRAINVAL = False\n\n# If not None, use IN1k as val split when training in21k\n_C.DATA.IN22k_VAL_IN1K = \"\"\n\n# Large resolution models may use different crop ratios\n_C.DATA.IN_VAL_CROP_RATIO = 0.875  # 224/256 = 0.875\n\n# don't use real video for kinetics.py\n_C.DATA.DUMMY_LOAD = False\n\n# ---------------------------------------------------------------------------- #\n# Optimizer options\n# ---------------------------------------------------------------------------- #\n_C.SOLVER = CfgNode()\n\n# Base learning rate.\n_C.SOLVER.BASE_LR = 0.1\n\n# Learning rate policy (see utils/lr_policy.py for options and examples).\n_C.SOLVER.LR_POLICY = \"cosine\"\n\n# Final learning rates for 'cosine' policy.\n_C.SOLVER.COSINE_END_LR = 0.0\n\n# Exponential decay factor.\n_C.SOLVER.GAMMA = 0.1\n\n# Step size for 'exp' and 'cos' policies (in epochs).\n_C.SOLVER.STEP_SIZE = 1\n\n# Steps for 'steps_' policies (in epochs).\n_C.SOLVER.STEPS = []\n\n# Learning rates for 'steps_' policies.\n_C.SOLVER.LRS = []\n\n# Maximal number of epochs.\n_C.SOLVER.MAX_EPOCH = 300\n\n# Momentum.\n_C.SOLVER.MOMENTUM = 0.9\n\n# Momentum dampening.\n_C.SOLVER.DAMPENING = 0.0\n\n# Nesterov momentum.\n_C.SOLVER.NESTEROV = True\n\n# L2 regularization.\n_C.SOLVER.WEIGHT_DECAY = 1e-4\n\n# Start the warm up from SOLVER.BASE_LR * SOLVER.WARMUP_FACTOR.\n_C.SOLVER.WARMUP_FACTOR = 0.1\n\n# Gradually warm up the SOLVER.BASE_LR over this number of epochs.\n_C.SOLVER.WARMUP_EPOCHS = 0.0\n\n# The start learning rate of the warm up.\n_C.SOLVER.WARMUP_START_LR = 0.01\n\n# Optimization method.\n_C.SOLVER.OPTIMIZING_METHOD = \"sgd\"\n\n# Base learning rate is linearly scaled with NUM_SHARDS.\n_C.SOLVER.BASE_LR_SCALE_NUM_SHARDS = False\n\n# If True, start from the peak cosine learning rate after warm up.\n_C.SOLVER.COSINE_AFTER_WARMUP = False\n\n# If True, perform no weight decay on parameter with one dimension (bias term, etc).\n_C.SOLVER.ZERO_WD_1D_PARAM = False\n\n# Clip gradient at 
this value before optimizer update\n_C.SOLVER.CLIP_GRAD_VAL = None\n\n# Clip gradient at this norm before optimizer update\n_C.SOLVER.CLIP_GRAD_L2NORM = None\n\n# LARS optimizer\n_C.SOLVER.LARS_ON = False\n\n# The layer-wise decay of learning rate. Set to 1. to disable.\n_C.SOLVER.LAYER_DECAY = 1.0\n\n# Adam's beta\n_C.SOLVER.BETAS = (0.9, 0.999)\n# ---------------------------------------------------------------------------- #\n# Misc options\n# ---------------------------------------------------------------------------- #\n\n# The name of the current task; e.g. \"ssl\"/\"sl\" for (self)supervised learning\n_C.TASK = \"\"\n\n# Number of GPUs to use (applies to both training and testing).\n_C.NUM_GPUS = 1\n\n# Number of machine to use for the job.\n_C.NUM_SHARDS = 1\n\n# The index of the current machine.\n_C.SHARD_ID = 0\n\n# Output basedir.\n_C.OUTPUT_DIR = \".\"\n\n# Note that non-determinism may still be present due to non-deterministic\n# operator implementations in GPU operator libraries.\n_C.RNG_SEED = 1\n\n# Log period in iters.\n_C.LOG_PERIOD = 10\n\n# If True, log the model info.\n_C.LOG_MODEL_INFO = True\n\n# Distributed backend.\n_C.DIST_BACKEND = \"nccl\"\n\n# ---------------------------------------------------------------------------- #\n# Benchmark options\n# ---------------------------------------------------------------------------- #\n_C.BENCHMARK = CfgNode()\n\n# Number of epochs for data loading benchmark.\n_C.BENCHMARK.NUM_EPOCHS = 5\n\n# Log period in iters for data loading benchmark.\n_C.BENCHMARK.LOG_PERIOD = 100\n\n# If True, shuffle dataloader for epoch during benchmark.\n_C.BENCHMARK.SHUFFLE = True\n\n\n# ---------------------------------------------------------------------------- #\n# Common train/test data loader options\n# ---------------------------------------------------------------------------- #\n_C.DATA_LOADER = CfgNode()\n\n# Number of data loader workers per training process.\n_C.DATA_LOADER.NUM_WORKERS = 8\n\n# Load data to 
pinned host memory.\n_C.DATA_LOADER.PIN_MEMORY = True\n\n# Enable multi thread decoding.\n_C.DATA_LOADER.ENABLE_MULTI_THREAD_DECODE = False\n\n\n# ---------------------------------------------------------------------------- #\n# Detection options.\n# ---------------------------------------------------------------------------- #\n_C.DETECTION = CfgNode()\n\n# Whether to enable video detection.\n_C.DETECTION.ENABLE = False\n\n# Aligned version of RoI. More details can be found at slowfast/models/head_helper.py\n_C.DETECTION.ALIGNED = True\n\n# Spatial scale factor.\n_C.DETECTION.SPATIAL_SCALE_FACTOR = 16\n\n# RoI transformation resolution.\n_C.DETECTION.ROI_XFORM_RESOLUTION = 7\n\n\n# -----------------------------------------------------------------------------\n# AVA Dataset options\n# -----------------------------------------------------------------------------\n_C.AVA = CfgNode()\n\n# Directory path of frames.\n_C.AVA.FRAME_DIR = \"/mnt/fair-flash3-east/ava_trainval_frames.img/\"\n\n# Directory path for files of frame lists.\n_C.AVA.FRAME_LIST_DIR = (\n    \"/mnt/vol/gfsai-flash3-east/ai-group/users/haoqifan/ava/frame_list/\"\n)\n\n# Directory path for annotation files.\n_C.AVA.ANNOTATION_DIR = (\n    \"/mnt/vol/gfsai-flash3-east/ai-group/users/haoqifan/ava/frame_list/\"\n)\n\n# Filenames of training samples list files.\n_C.AVA.TRAIN_LISTS = [\"train.csv\"]\n\n# Filenames of test samples list files.\n_C.AVA.TEST_LISTS = [\"val.csv\"]\n\n# Filenames of box list files for training. 
Note that we assume files which\n# contains predicted boxes will have a suffix \"predicted_boxes\" in the\n# filename.\n_C.AVA.TRAIN_GT_BOX_LISTS = [\"ava_train_v2.2.csv\"]\n_C.AVA.TRAIN_PREDICT_BOX_LISTS = []\n\n# Filenames of box list files for test.\n_C.AVA.TEST_PREDICT_BOX_LISTS = [\"ava_val_predicted_boxes.csv\"]\n\n# This option controls the score threshold for the predicted boxes to use.\n_C.AVA.DETECTION_SCORE_THRESH = 0.9\n\n# If use BGR as the format of input frames.\n_C.AVA.BGR = False\n\n# Training augmentation parameters\n# Whether to use color augmentation method.\n_C.AVA.TRAIN_USE_COLOR_AUGMENTATION = False\n\n# Whether to only use PCA jitter augmentation when using color augmentation\n# method (otherwise combine with color jitter method).\n_C.AVA.TRAIN_PCA_JITTER_ONLY = True\n\n# Whether to do horizontal flipping during test.\n_C.AVA.TEST_FORCE_FLIP = False\n\n# Whether to use full test set for validation split.\n_C.AVA.FULL_TEST_ON_VAL = False\n\n# The name of the file to the ava label map.\n_C.AVA.LABEL_MAP_FILE = \"ava_action_list_v2.2_for_activitynet_2019.pbtxt\"\n\n# The name of the file to the ava exclusion.\n_C.AVA.EXCLUSION_FILE = \"ava_val_excluded_timestamps_v2.2.csv\"\n\n# The name of the file to the ava groundtruth.\n_C.AVA.GROUNDTRUTH_FILE = \"ava_val_v2.2.csv\"\n\n# Backend to process image, includes `pytorch` and `cv2`.\n_C.AVA.IMG_PROC_BACKEND = \"cv2\"\n\n# ---------------------------------------------------------------------------- #\n# Multigrid training options\n# See https://arxiv.org/abs/1912.00998 for details about multigrid training.\n# ---------------------------------------------------------------------------- #\n_C.MULTIGRID = CfgNode()\n\n# Multigrid training allows us to train for more epochs with fewer iterations.\n# This hyperparameter specifies how many times more epochs to train.\n# The default setting in paper trains for 1.5x more epochs than baseline.\n_C.MULTIGRID.EPOCH_FACTOR = 1.5\n\n# Enable short 
cycles.\n_C.MULTIGRID.SHORT_CYCLE = False\n# Short cycle additional spatial dimensions relative to the default crop size.\n_C.MULTIGRID.SHORT_CYCLE_FACTORS = [0.5, 0.5**0.5]\n\n_C.MULTIGRID.LONG_CYCLE = False\n# (Temporal, Spatial) dimensions relative to the default shape.\n_C.MULTIGRID.LONG_CYCLE_FACTORS = [\n    (0.25, 0.5**0.5),\n    (0.5, 0.5**0.5),\n    (0.5, 1),\n    (1, 1),\n]\n\n# While a standard BN computes stats across all examples in a GPU,\n# for multigrid training we fix the number of clips to compute BN stats on.\n# See https://arxiv.org/abs/1912.00998 for details.\n_C.MULTIGRID.BN_BASE_SIZE = 8\n\n# Multigrid training epochs are not proportional to actual training time or\n# computations, so _C.TRAIN.EVAL_PERIOD leads to too frequent or rare\n# evaluation. We use a multigrid-specific rule to determine when to evaluate:\n# This hyperparameter defines how many times to evaluate a model per long\n# cycle shape.\n_C.MULTIGRID.EVAL_FREQ = 3\n\n# No need to specify; Set automatically and used as global variables.\n_C.MULTIGRID.LONG_CYCLE_SAMPLING_RATE = 0\n_C.MULTIGRID.DEFAULT_B = 0\n_C.MULTIGRID.DEFAULT_T = 0\n_C.MULTIGRID.DEFAULT_S = 0\n\n# -----------------------------------------------------------------------------\n# Tensorboard Visualization Options\n# -----------------------------------------------------------------------------\n_C.TENSORBOARD = CfgNode()\n\n# Log to summary writer; this will automatically\n# log loss, lr and metrics during train/eval.\n_C.TENSORBOARD.ENABLE = False\n# Provide path to prediction results for visualization.\n# This is a pickle file of [prediction_tensor, label_tensor]\n_C.TENSORBOARD.PREDICTIONS_PATH = \"\"\n# Path to directory for tensorboard logs.\n# Defaults to cfg.OUTPUT_DIR/runs-{cfg.TRAIN.DATASET}.\n_C.TENSORBOARD.LOG_DIR = \"\"\n# Path to a json file providing class_name - id mapping\n# in the format {\"class_name1\": id1, \"class_name2\": id2, ...}.\n# This file must be provided to enable plotting 
confusion matrix\n# by a subset or parent categories.\n_C.TENSORBOARD.CLASS_NAMES_PATH = \"\"\n\n# Path to a json file for categories -> classes mapping\n# in the format {\"parent_class\": [\"child_class1\", \"child_class2\",...], ...}.\n_C.TENSORBOARD.CATEGORIES_PATH = \"\"\n\n# Config for confusion matrices visualization.\n_C.TENSORBOARD.CONFUSION_MATRIX = CfgNode()\n# Visualize confusion matrix.\n_C.TENSORBOARD.CONFUSION_MATRIX.ENABLE = False\n# Figure size of the confusion matrices plotted.\n_C.TENSORBOARD.CONFUSION_MATRIX.FIGSIZE = [8, 8]\n# Path to a subset of categories to visualize.\n# File contains class names separated by newline characters.\n_C.TENSORBOARD.CONFUSION_MATRIX.SUBSET_PATH = \"\"\n\n# Config for histogram visualization.\n_C.TENSORBOARD.HISTOGRAM = CfgNode()\n# Visualize histograms.\n_C.TENSORBOARD.HISTOGRAM.ENABLE = False\n# Path to a subset of classes to plot histograms.\n# Class names must be separated by newline characters.\n_C.TENSORBOARD.HISTOGRAM.SUBSET_PATH = \"\"\n# Visualize top-k most predicted classes on histograms for each\n# chosen true label.\n_C.TENSORBOARD.HISTOGRAM.TOPK = 10\n# Figure size of the histograms plotted.\n_C.TENSORBOARD.HISTOGRAM.FIGSIZE = [8, 8]\n\n# Config for layers' weights and activations visualization.\n# _C.TENSORBOARD.ENABLE must be True.\n_C.TENSORBOARD.MODEL_VIS = CfgNode()\n\n# If False, skip model visualization.\n_C.TENSORBOARD.MODEL_VIS.ENABLE = False\n\n# If False, skip visualizing model weights.\n_C.TENSORBOARD.MODEL_VIS.MODEL_WEIGHTS = False\n\n# If False, skip visualizing model activations.\n_C.TENSORBOARD.MODEL_VIS.ACTIVATIONS = False\n\n# If False, skip visualizing input videos.\n_C.TENSORBOARD.MODEL_VIS.INPUT_VIDEO = False\n\n\n# List of strings containing data about layer names and their indexing to\n# visualize weights and activations for. 
The indexing is meant for\n# choosing a subset of activations output by a layer for visualization.\n# If indexing is not specified, visualize all activations output by the layer.\n# For each string, layer name and indexing are separated by whitespace.\n# e.g.: [layer1 1,2;1,2, layer2, layer3 150,151;3,4]; this means for each array `arr`\n# along the batch dimension in `layer1`, we take arr[[1, 2], [1, 2]]\n_C.TENSORBOARD.MODEL_VIS.LAYER_LIST = []\n# Top-k predictions to plot on videos\n_C.TENSORBOARD.MODEL_VIS.TOPK_PREDS = 1\n# Colormap for text boxes and bounding box colors\n_C.TENSORBOARD.MODEL_VIS.COLORMAP = \"Pastel2\"\n# Config for visualizing video inputs with Grad-CAM.\n# _C.TENSORBOARD.ENABLE must be True.\n_C.TENSORBOARD.MODEL_VIS.GRAD_CAM = CfgNode()\n# Whether to run visualization using Grad-CAM technique.\n_C.TENSORBOARD.MODEL_VIS.GRAD_CAM.ENABLE = True\n# CNN layers to use for Grad-CAM. The number of layers must be equal to\n# number of pathway(s).\n_C.TENSORBOARD.MODEL_VIS.GRAD_CAM.LAYER_LIST = []\n# If True, visualize Grad-CAM using true labels for each instance.\n# If False, use the highest predicted class.\n_C.TENSORBOARD.MODEL_VIS.GRAD_CAM.USE_TRUE_LABEL = False\n# Colormap for text boxes and bounding box colors\n_C.TENSORBOARD.MODEL_VIS.GRAD_CAM.COLORMAP = \"viridis\"\n\n# Config for wrong prediction visualization.\n# _C.TENSORBOARD.ENABLE must be True.\n_C.TENSORBOARD.WRONG_PRED_VIS = CfgNode()\n_C.TENSORBOARD.WRONG_PRED_VIS.ENABLE = False\n# Folder tag to organize model eval videos under.\n_C.TENSORBOARD.WRONG_PRED_VIS.TAG = \"Incorrectly classified videos.\"\n# Subset of labels to visualize. 
Only wrong predictions with true labels\n# within this subset are visualized.\n_C.TENSORBOARD.WRONG_PRED_VIS.SUBSET_PATH = \"\"\n\n\n# ---------------------------------------------------------------------------- #\n# Demo options\n# ---------------------------------------------------------------------------- #\n_C.DEMO = CfgNode()\n\n# Run model in DEMO mode.\n_C.DEMO.ENABLE = False\n\n# Path to a json file providing class_name - id mapping\n# in the format {\"class_name1\": id1, \"class_name2\": id2, ...}.\n_C.DEMO.LABEL_FILE_PATH = \"\"\n\n# Specify a camera device as input. This will be prioritized\n# over input video if set.\n# If -1, use input video instead.\n_C.DEMO.WEBCAM = -1\n\n# Path to input video for demo.\n_C.DEMO.INPUT_VIDEO = \"\"\n# Custom width for reading input video data.\n_C.DEMO.DISPLAY_WIDTH = 0\n# Custom height for reading input video data.\n_C.DEMO.DISPLAY_HEIGHT = 0\n# Path to Detectron2 object detection model configuration,\n# only used for detection tasks.\n_C.DEMO.DETECTRON2_CFG = \"COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml\"\n# Path to Detectron2 object detection model pre-trained weights.\n_C.DEMO.DETECTRON2_WEIGHTS = \"detectron2://COCO-Detection/faster_rcnn_R_50_FPN_3x/137849458/model_final_280758.pkl\"\n# Threshold for choosing predicted bounding boxes by Detectron2.\n_C.DEMO.DETECTRON2_THRESH = 0.9\n# Number of overlapping frames between 2 consecutive clips.\n# Increase this number for more frequent action predictions.\n# The number of overlapping frames cannot be larger than\n# half of the sequence length `cfg.DATA.NUM_FRAMES * cfg.DATA.SAMPLING_RATE`\n_C.DEMO.BUFFER_SIZE = 0\n# If specified, the visualized outputs will be written to a video file at\n# this path. 
Otherwise, the visualized outputs will be displayed in a window.\n_C.DEMO.OUTPUT_FILE = \"\"\n# Frames per second rate for writing to output video file.\n# If not set (-1), use fps rate from input file.\n_C.DEMO.OUTPUT_FPS = -1\n# Input format from demo video reader (\"RGB\" or \"BGR\").\n_C.DEMO.INPUT_FORMAT = \"BGR\"\n# Draw visualization frames in [keyframe_idx - CLIP_VIS_SIZE, keyframe_idx + CLIP_VIS_SIZE] inclusively.\n_C.DEMO.CLIP_VIS_SIZE = 10\n# Number of processes to run video visualizer.\n_C.DEMO.NUM_VIS_INSTANCES = 2\n\n# Path to pre-computed predicted boxes\n_C.DEMO.PREDS_BOXES = \"\"\n# Whether to run with a multi-threaded video reader.\n_C.DEMO.THREAD_ENABLE = False\n# Take one clip for every `DEMO.NUM_CLIPS_SKIP` + 1 for prediction and visualization.\n# This is used for fast demo speed by reducing the prediction/visualization frequency.\n# If -1, take the most recently read clip for visualization. This mode is only supported\n# if `DEMO.THREAD_ENABLE` is set to True.\n_C.DEMO.NUM_CLIPS_SKIP = 0\n# Path to ground-truth boxes and labels (optional)\n_C.DEMO.GT_BOXES = \"\"\n# The starting second of the video w.r.t bounding boxes file.\n_C.DEMO.STARTING_SECOND = 900\n# Frames per second of the input video/folder of images.\n_C.DEMO.FPS = 30\n# Visualize with top-k predictions or predictions above certain threshold(s).\n# Option: {\"thres\", \"top-k\"}\n_C.DEMO.VIS_MODE = \"thres\"\n# Threshold for common class names.\n_C.DEMO.COMMON_CLASS_THRES = 0.7\n# Threshold for uncommon class names. 
This will not be\n# used if `_C.DEMO.COMMON_CLASS_NAMES` is empty.\n_C.DEMO.UNCOMMON_CLASS_THRES = 0.3\n# This is chosen based on distribution of examples in\n# each classes in AVA dataset.\n_C.DEMO.COMMON_CLASS_NAMES = [\n    \"watch (a person)\",\n    \"talk to (e.g., self, a person, a group)\",\n    \"listen to (a person)\",\n    \"touch (an object)\",\n    \"carry/hold (an object)\",\n    \"walk\",\n    \"sit\",\n    \"lie/sleep\",\n    \"bend/bow (at the waist)\",\n]\n# Slow-motion rate for the visualization. The visualized portions of the\n# video will be played `_C.DEMO.SLOWMO` times slower than usual speed.\n_C.DEMO.SLOWMO = 1\n\n# Add custom config with default values.\ncustom_config.add_custom_config(_C)\n\n\ndef assert_and_infer_cfg(cfg):\n    # BN assertions.\n    if cfg.BN.USE_PRECISE_STATS:\n        assert cfg.BN.NUM_BATCHES_PRECISE >= 0\n    # TRAIN assertions.\n    assert cfg.TRAIN.CHECKPOINT_TYPE in [\"pytorch\", \"caffe2\"]\n    assert cfg.NUM_GPUS == 0 or cfg.TRAIN.BATCH_SIZE % cfg.NUM_GPUS == 0\n\n    # TEST assertions.\n    assert cfg.TEST.CHECKPOINT_TYPE in [\"pytorch\", \"caffe2\"]\n    assert cfg.NUM_GPUS == 0 or cfg.TEST.BATCH_SIZE % cfg.NUM_GPUS == 0\n\n    # RESNET assertions.\n    assert cfg.RESNET.NUM_GROUPS > 0\n    assert cfg.RESNET.WIDTH_PER_GROUP > 0\n    assert cfg.RESNET.WIDTH_PER_GROUP % cfg.RESNET.NUM_GROUPS == 0\n\n    # Execute LR scaling by num_shards.\n    if cfg.SOLVER.BASE_LR_SCALE_NUM_SHARDS:\n        cfg.SOLVER.BASE_LR *= cfg.NUM_SHARDS\n        cfg.SOLVER.WARMUP_START_LR *= cfg.NUM_SHARDS\n        cfg.SOLVER.COSINE_END_LR *= cfg.NUM_SHARDS\n\n    # General assertions.\n    assert cfg.SHARD_ID < cfg.NUM_SHARDS\n    return cfg\n\n\ndef get_cfg():\n    \"\"\"\n    Get a copy of the default config.\n    \"\"\"\n    return _C.clone()\n"
  },
  {
    "path": "slowfast/datasets/DATASET.md",
    "content": "# Dataset Preparation\n\n## Kinetics\n\nThe Kinetics Dataset could be downloaded via the code released by ActivityNet:\n\n1. Download the videos via the official [scripts](https://github.com/activitynet/ActivityNet/tree/master/Crawler/Kinetics).\n\n2. After all the videos were downloaded, resize the video to the short edge size of 256, then prepare the csv files for training, validation, and testing set as `train.csv`, `val.csv`, `test.csv`. The format of the csv file is:\n\n```\npath_to_video_1 label_1\npath_to_video_2 label_2\npath_to_video_3 label_3\n...\npath_to_video_N label_N\n```\n\nAll the Kinetics models in the Model Zoo are trained and tested with the same data as [Non-local Network](https://github.com/facebookresearch/video-nonlocal-net/blob/master/DATASET.md). For dataset specific issues, please reach out to the [dataset provider](https://deepmind.com/research/open-source/kinetics).\n\n## AVA\n\nThe AVA Dataset could be downloaded from the [official site](https://research.google.com/ava/download.html#ava_actions_download)\n\nWe followed the same [downloading and preprocessing procedure](https://github.com/facebookresearch/video-long-term-feature-banks/blob/master/DATASET.md) as the [Long-Term Feature Banks for Detailed Video Understanding](https://arxiv.org/abs/1812.05038) do.\n\nYou could follow these steps to download and preprocess the data:\n\n1. Download videos\n\n```\nDATA_DIR=\"../../data/ava/videos\"\n\nif [[ ! -d \"${DATA_DIR}\" ]]; then\n  echo \"${DATA_DIR} doesn't exist. Creating it.\";\n  mkdir -p ${DATA_DIR}\nfi\n\nwget https://s3.amazonaws.com/ava-dataset/annotations/ava_file_names_trainval_v2.1.txt\n\nfor line in $(cat ava_file_names_trainval_v2.1.txt)\ndo\n  wget https://s3.amazonaws.com/ava-dataset/trainval/$line -P ${DATA_DIR}\ndone\n```\n\n2. Cut each video from its 15th to 30th minute\n\n```\nIN_DATA_DIR=\"../../data/ava/videos\"\nOUT_DATA_DIR=\"../../data/ava/videos_15min\"\n\nif [[ ! 
-d \"${OUT_DATA_DIR}\" ]]; then\n  echo \"${OUT_DATA_DIR} doesn't exist. Creating it.\";\n  mkdir -p ${OUT_DATA_DIR}\nfi\n\nfor video in $(ls -A1 -U ${IN_DATA_DIR}/*)\ndo\n  out_name=\"${OUT_DATA_DIR}/${video##*/}\"\n  if [ ! -f \"${out_name}\" ]; then\n    ffmpeg -ss 900 -t 901 -i \"${video}\" \"${out_name}\"\n  fi\ndone\n```\n\n3. Extract frames\n\n```\nIN_DATA_DIR=\"../../data/ava/videos_15min\"\nOUT_DATA_DIR=\"../../data/ava/frames\"\n\nif [[ ! -d \"${OUT_DATA_DIR}\" ]]; then\n  echo \"${OUT_DATA_DIR} doesn't exist. Creating it.\";\n  mkdir -p ${OUT_DATA_DIR}\nfi\n\nfor video in $(ls -A1 -U ${IN_DATA_DIR}/*)\ndo\n  video_name=${video##*/}\n\n  if [[ $video_name = *\".webm\" ]]; then\n    video_name=${video_name::-5}\n  else\n    video_name=${video_name::-4}\n  fi\n\n  out_video_dir=${OUT_DATA_DIR}/${video_name}/\n  mkdir -p \"${out_video_dir}\"\n\n  out_name=\"${out_video_dir}/${video_name}_%06d.jpg\"\n\n  ffmpeg -i \"${video}\" -r 30 -q:v 1 \"${out_name}\"\ndone\n```\n\n4. Download annotations\n\n```\nDATA_DIR=\"../../data/ava/annotations\"\n\nif [[ ! -d \"${DATA_DIR}\" ]]; then\n  echo \"${DATA_DIR} doesn't exist. Creating it.\";\n  mkdir -p ${DATA_DIR}\nfi\n\nwget https://research.google.com/ava/download/ava_train_v2.1.csv -P ${DATA_DIR}\nwget https://research.google.com/ava/download/ava_val_v2.1.csv -P ${DATA_DIR}\nwget https://research.google.com/ava/download/ava_action_list_v2.1_for_activitynet_2018.pbtxt -P ${DATA_DIR}\nwget https://research.google.com/ava/download/ava_train_excluded_timestamps_v2.1.csv -P ${DATA_DIR}\nwget https://research.google.com/ava/download/ava_val_excluded_timestamps_v2.1.csv -P ${DATA_DIR}\n```\n\n5. Download \"frame lists\" ([train](https://dl.fbaipublicfiles.com/video-long-term-feature-banks/data/ava/frame_lists/train.csv), [val](https://dl.fbaipublicfiles.com/video-long-term-feature-banks/data/ava/frame_lists/val.csv)) and put them in\nthe `frame_lists` folder (see structure above).\n\n6. 
Download person boxes ([train](https://dl.fbaipublicfiles.com/video-long-term-feature-banks/data/ava/annotations/ava_train_predicted_boxes.csv), [val](https://dl.fbaipublicfiles.com/video-long-term-feature-banks/data/ava/annotations/ava_val_predicted_boxes.csv), [test](https://dl.fbaipublicfiles.com/video-long-term-feature-banks/data/ava/annotations/ava_test_predicted_boxes.csv)) and put them in the `annotations` folder (see structure above).\nIf you prefer to use your own person detector, please see details\nin [here](https://github.com/facebookresearch/video-long-term-feature-banks/blob/master/GETTING_STARTED.md#ava-person-detector).\n\n\nDownload the ava dataset with the following structure:\n\n```\nava\n|_ frames\n|  |_ [video name 0]\n|  |  |_ [video name 0]_000001.jpg\n|  |  |_ [video name 0]_000002.jpg\n|  |  |_ ...\n|  |_ [video name 1]\n|     |_ [video name 1]_000001.jpg\n|     |_ [video name 1]_000002.jpg\n|     |_ ...\n|_ frame_lists\n|  |_ train.csv\n|  |_ val.csv\n|_ annotations\n   |_ [official AVA annotation files]\n   |_ ava_train_predicted_boxes.csv\n   |_ ava_val_predicted_boxes.csv\n```\n\nYou could also replace the `v2.1` by `v2.2` if you need the AVA v2.2 annotation. You can also download some pre-prepared annotations from [here](https://dl.fbaipublicfiles.com/pyslowfast/annotation/ava/ava_annotations.tar).\n\n\n## Charades\n1. Please download the Charades RGB frames from [dataset provider](http://ai2-website.s3.amazonaws.com/data/Charades_v1_rgb.tar).\n\n2. Download the *frame list* from the following links: ([train](https://dl.fbaipublicfiles.com/pyslowfast/dataset/charades/frame_lists/train.csv), [val](https://dl.fbaipublicfiles.com/pyslowfast/dataset/charades/frame_lists/val.csv)).\n\nPlease set `DATA.PATH_TO_DATA_DIR` to point to the folder containing the frame lists, and `DATA.PATH_PREFIX` to the folder containing RGB frames.\n\n\n## Something-Something V2\n1. 
Please download the dataset and annotations from the [dataset provider](https://20bn.com/datasets/something-something).\n\n2. Download the *frame list* from the following links: ([train](https://dl.fbaipublicfiles.com/pyslowfast/dataset/ssv2/frame_lists/train.csv), [val](https://dl.fbaipublicfiles.com/pyslowfast/dataset/ssv2/frame_lists/val.csv)).\n\n3. Extract the frames at 30 FPS. (We used ffmpeg-4.1.3 with the command\n`ffmpeg -i \"${video}\" -r 30 -q:v 1 \"${out_name}\"`\n   in experiments.) Please put the frames in a structure consistent with the frame lists.\n\n\nPlease put all annotation json files and the frame lists in the same folder, and set `DATA.PATH_TO_DATA_DIR` to that path. Set `DATA.PATH_PREFIX` to the path of the folder containing the extracted frames.\n"
  },
  {
    "path": "slowfast/datasets/__init__.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nfrom .ava_dataset import Ava  # noqa\nfrom .build import build_dataset, DATASET_REGISTRY  # noqa\nfrom .charades import Charades  # noqa\nfrom .imagenet import Imagenet  # noqa\nfrom .kinetics import Kinetics  # noqa\nfrom .ssv2 import Ssv2  # noqa\n\ntry:\n    from .ptv_datasets import Ptvcharades, Ptvkinetics, Ptvssv2  # noqa\nexcept Exception:\n    print(\"Please update your PyTorchVideo to latest master\")\n"
  },
  {
    "path": "slowfast/datasets/ava_dataset.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport logging\n\nimport numpy as np\nimport torch\n\nfrom . import (\n    ava_helper as ava_helper,\n    cv2_transform as cv2_transform,\n    transform as transform,\n    utils as utils,\n)\nfrom .build import DATASET_REGISTRY\n\nlogger = logging.getLogger(__name__)\n\n\n@DATASET_REGISTRY.register()\nclass Ava(torch.utils.data.Dataset):\n    \"\"\"\n    AVA Dataset\n    \"\"\"\n\n    def __init__(self, cfg, split):\n        self.cfg = cfg\n        self._split = split\n        self._sample_rate = cfg.DATA.SAMPLING_RATE\n        self._video_length = cfg.DATA.NUM_FRAMES\n        self._seq_len = self._video_length * self._sample_rate\n        self._num_classes = cfg.MODEL.NUM_CLASSES\n        # Augmentation params.\n        self._data_mean = cfg.DATA.MEAN\n        self._data_std = cfg.DATA.STD\n        self._use_bgr = cfg.AVA.BGR\n        self.random_horizontal_flip = cfg.DATA.RANDOM_FLIP\n        if self._split == \"train\":\n            self._crop_size = cfg.DATA.TRAIN_CROP_SIZE\n            self._jitter_min_scale = cfg.DATA.TRAIN_JITTER_SCALES[0]\n            self._jitter_max_scale = cfg.DATA.TRAIN_JITTER_SCALES[1]\n            self._use_color_augmentation = cfg.AVA.TRAIN_USE_COLOR_AUGMENTATION\n            self._pca_jitter_only = cfg.AVA.TRAIN_PCA_JITTER_ONLY\n            self._pca_eigval = cfg.DATA.TRAIN_PCA_EIGVAL\n            self._pca_eigvec = cfg.DATA.TRAIN_PCA_EIGVEC\n        else:\n            self._crop_size = cfg.DATA.TEST_CROP_SIZE\n            self._test_force_flip = cfg.AVA.TEST_FORCE_FLIP\n\n        self._load_data(cfg)\n\n    def _load_data(self, cfg):\n        \"\"\"\n        Load frame paths and annotations from files\n\n        Args:\n            cfg (CfgNode): config\n        \"\"\"\n        # Loading frame paths.\n        (\n            self._image_paths,\n            self._video_idx_to_name,\n        ) = 
ava_helper.load_image_lists(cfg, is_train=(self._split == \"train\"))\n\n        # Loading annotations for boxes and labels.\n        boxes_and_labels = ava_helper.load_boxes_and_labels(cfg, mode=self._split)\n\n        assert len(boxes_and_labels) == len(self._image_paths)\n\n        boxes_and_labels = [\n            boxes_and_labels[self._video_idx_to_name[i]]\n            for i in range(len(self._image_paths))\n        ]\n\n        # Get indices of keyframes and corresponding boxes and labels.\n        (\n            self._keyframe_indices,\n            self._keyframe_boxes_and_labels,\n        ) = ava_helper.get_keyframe_data(boxes_and_labels)\n\n        # Calculate the number of used boxes.\n        self._num_boxes_used = ava_helper.get_num_boxes_used(\n            self._keyframe_indices, self._keyframe_boxes_and_labels\n        )\n\n        self.print_summary()\n\n    def print_summary(self):\n        logger.info(\"=== AVA dataset summary ===\")\n        logger.info(\"Split: {}\".format(self._split))\n        logger.info(\"Number of videos: {}\".format(len(self._image_paths)))\n        total_frames = sum(\n            len(video_img_paths) for video_img_paths in self._image_paths\n        )\n        logger.info(\"Number of frames: {}\".format(total_frames))\n        logger.info(\"Number of key frames: {}\".format(len(self)))\n        logger.info(\"Number of boxes: {}.\".format(self._num_boxes_used))\n\n    def __len__(self):\n        \"\"\"\n        Returns:\n            (int): the number of keyframes (samples) in the dataset.\n        \"\"\"\n        return self.num_videos\n\n    @property\n    def num_videos(self):\n        \"\"\"\n        Returns:\n            (int): the number of keyframes (samples) in the dataset.\n        \"\"\"\n        return len(self._keyframe_indices)\n\n    def _images_and_boxes_preprocessing_cv2(self, imgs, boxes):\n        \"\"\"\n        This function performs preprocessing for the input images and\n        corresponding boxes for one clip with opencv as 
backend.\n\n        Args:\n            imgs (tensor): the images.\n            boxes (ndarray): the boxes for the current clip.\n\n        Returns:\n            imgs (tensor): list of preprocessed images.\n            boxes (ndarray): preprocessed boxes.\n        \"\"\"\n\n        height, width, _ = imgs[0].shape\n\n        boxes[:, [0, 2]] *= width\n        boxes[:, [1, 3]] *= height\n        boxes = cv2_transform.clip_boxes_to_image(boxes, height, width)\n\n        # `transform.py` is list of np.array. However, for AVA, we only have\n        # one np.array.\n        boxes = [boxes]\n\n        # The image now is in HWC, BGR format.\n        if self._split == \"train\":  # \"train\"\n            imgs, boxes = cv2_transform.random_short_side_scale_jitter_list(\n                imgs,\n                min_size=self._jitter_min_scale,\n                max_size=self._jitter_max_scale,\n                boxes=boxes,\n            )\n            imgs, boxes = cv2_transform.random_crop_list(\n                imgs, self._crop_size, order=\"HWC\", boxes=boxes\n            )\n\n            if self.random_horizontal_flip:\n                # random flip\n                imgs, boxes = cv2_transform.horizontal_flip_list(\n                    0.5, imgs, order=\"HWC\", boxes=boxes\n                )\n        elif self._split == \"val\":\n            # Short side to test_scale. 
Non-local and STRG uses 256.\n            imgs = [cv2_transform.scale(self._crop_size, img) for img in imgs]\n            boxes = [\n                cv2_transform.scale_boxes(self._crop_size, boxes[0], height, width)\n            ]\n            imgs, boxes = cv2_transform.spatial_shift_crop_list(\n                self._crop_size, imgs, 1, boxes=boxes\n            )\n\n            if self._test_force_flip:\n                imgs, boxes = cv2_transform.horizontal_flip_list(\n                    1, imgs, order=\"HWC\", boxes=boxes\n                )\n        elif self._split == \"test\":\n            # Short side to test_scale. Non-local and STRG uses 256.\n            imgs = [cv2_transform.scale(self._crop_size, img) for img in imgs]\n            boxes = [\n                cv2_transform.scale_boxes(self._crop_size, boxes[0], height, width)\n            ]\n\n            if self._test_force_flip:\n                imgs, boxes = cv2_transform.horizontal_flip_list(\n                    1, imgs, order=\"HWC\", boxes=boxes\n                )\n        else:\n            raise NotImplementedError(\"Unsupported split mode {}\".format(self._split))\n\n        # Convert image to CHW keeping BGR order.\n        imgs = [cv2_transform.HWC2CHW(img) for img in imgs]\n\n        # Image [0, 255] -> [0, 1].\n        imgs = [img / 255.0 for img in imgs]\n\n        imgs = [\n            np.ascontiguousarray(\n                # img.reshape((3, self._crop_size, self._crop_size))\n                img.reshape((3, imgs[0].shape[1], imgs[0].shape[2]))\n            ).astype(np.float32)\n            for img in imgs\n        ]\n\n        # Do color augmentation (after divided by 255.0).\n        if self._split == \"train\" and self._use_color_augmentation:\n            if not self._pca_jitter_only:\n                imgs = cv2_transform.color_jitter_list(\n                    imgs,\n                    img_brightness=0.4,\n                    img_contrast=0.4,\n                    
img_saturation=0.4,\n                )\n\n            imgs = cv2_transform.lighting_list(\n                imgs,\n                alphastd=0.1,\n                eigval=np.array(self._pca_eigval).astype(np.float32),\n                eigvec=np.array(self._pca_eigvec).astype(np.float32),\n            )\n\n        # Normalize images by mean and std.\n        imgs = [\n            cv2_transform.color_normalization(\n                img,\n                np.array(self._data_mean, dtype=np.float32),\n                np.array(self._data_std, dtype=np.float32),\n            )\n            for img in imgs\n        ]\n\n        # Concat list of images to single ndarray.\n        imgs = np.concatenate([np.expand_dims(img, axis=1) for img in imgs], axis=1)\n\n        if not self._use_bgr:\n            # Convert image format from BGR to RGB.\n            imgs = imgs[::-1, ...]\n\n        imgs = np.ascontiguousarray(imgs)\n        imgs = torch.from_numpy(imgs)\n        boxes = cv2_transform.clip_boxes_to_image(\n            boxes[0], imgs[0].shape[1], imgs[0].shape[2]\n        )\n        return imgs, boxes\n\n    def _images_and_boxes_preprocessing(self, imgs, boxes):\n        \"\"\"\n        This function performs preprocessing for the input images and\n        corresponding boxes for one clip.\n\n        Args:\n            imgs (tensor): the images.\n            boxes (ndarray): the boxes for the current clip.\n\n        Returns:\n            imgs (tensor): list of preprocessed images.\n            boxes (ndarray): preprocessed boxes.\n        \"\"\"\n        # Image [0, 255] -> [0, 1].\n        imgs = imgs.float()\n        imgs = imgs / 255.0\n\n        height, width = imgs.shape[2], imgs.shape[3]\n        # The format of boxes is [x1, y1, x2, y2]. 
The input boxes are in the\n        # range of [0, 1].\n        boxes[:, [0, 2]] *= width\n        boxes[:, [1, 3]] *= height\n        boxes = transform.clip_boxes_to_image(boxes, height, width)\n\n        if self._split == \"train\":\n            # Train split\n            imgs, boxes = transform.random_short_side_scale_jitter(\n                imgs,\n                min_size=self._jitter_min_scale,\n                max_size=self._jitter_max_scale,\n                boxes=boxes,\n            )\n            imgs, boxes = transform.random_crop(imgs, self._crop_size, boxes=boxes)\n\n            # Random flip.\n            imgs, boxes = transform.horizontal_flip(0.5, imgs, boxes=boxes)\n        elif self._split == \"val\":\n            # Val split\n            # Resize short side to crop_size. Non-local and STRG uses 256.\n            imgs, boxes = transform.random_short_side_scale_jitter(\n                imgs,\n                min_size=self._crop_size,\n                max_size=self._crop_size,\n                boxes=boxes,\n            )\n\n            # Apply center crop for val split\n            imgs, boxes = transform.uniform_crop(\n                imgs, size=self._crop_size, spatial_idx=1, boxes=boxes\n            )\n\n            if self._test_force_flip:\n                imgs, boxes = transform.horizontal_flip(1, imgs, boxes=boxes)\n        elif self._split == \"test\":\n            # Test split\n            # Resize short side to crop_size. 
Non-local and STRG uses 256.\n            imgs, boxes = transform.random_short_side_scale_jitter(\n                imgs,\n                min_size=self._crop_size,\n                max_size=self._crop_size,\n                boxes=boxes,\n            )\n\n            if self._test_force_flip:\n                imgs, boxes = transform.horizontal_flip(1, imgs, boxes=boxes)\n        else:\n            raise NotImplementedError(\"{} split not supported yet!\".format(self._split))\n\n        # Do color augmentation (after divided by 255.0).\n        if self._split == \"train\" and self._use_color_augmentation:\n            if not self._pca_jitter_only:\n                imgs = transform.color_jitter(\n                    imgs,\n                    img_brightness=0.4,\n                    img_contrast=0.4,\n                    img_saturation=0.4,\n                )\n\n            imgs = transform.lighting_jitter(\n                imgs,\n                alphastd=0.1,\n                eigval=np.array(self._pca_eigval).astype(np.float32),\n                eigvec=np.array(self._pca_eigvec).astype(np.float32),\n            )\n\n        # Normalize images by mean and std.\n        imgs = transform.color_normalization(\n            imgs,\n            np.array(self._data_mean, dtype=np.float32),\n            np.array(self._data_std, dtype=np.float32),\n        )\n\n        if not self._use_bgr:\n            # Convert image format from BGR to RGB.\n            # Note that Kinetics pre-training uses RGB!\n            imgs = imgs[:, [2, 1, 0], ...]\n\n        boxes = transform.clip_boxes_to_image(boxes, self._crop_size, self._crop_size)\n\n        return imgs, boxes\n\n    def __getitem__(self, idx):\n        \"\"\"\n        Generate corresponding clips, boxes, labels and metadata for given idx.\n\n        Args:\n            idx (int): the video index provided by the pytorch sampler.\n        Returns:\n            frames (tensor): the frames of sampled from the video. 
The dimension\n                is `channel` x `num frames` x `height` x `width`.\n            label (ndarray): the labels of the corresponding boxes for the current clip.\n            idx (int): the video index provided by the pytorch sampler.\n            time index (zero): The time index is currently not supported for AVA.\n            extra_data (dict): a dict containing extra data fields, like \"boxes\",\n                \"ori_boxes\" and \"metadata\".\n        \"\"\"\n        short_cycle_idx = None\n        # When short cycle is used, input index is a tuple.\n        if isinstance(idx, tuple):\n            idx, self._num_yielded = idx\n            if self.cfg.MULTIGRID.SHORT_CYCLE:\n                idx, short_cycle_idx = idx\n\n        video_idx, sec_idx, sec, center_idx = self._keyframe_indices[idx]\n        # Get the frame idxs for current clip.\n        seq = utils.get_sequence(\n            center_idx,\n            self._seq_len // 2,\n            self._sample_rate,\n            num_frames=len(self._image_paths[video_idx]),\n        )\n\n        clip_label_list = self._keyframe_boxes_and_labels[video_idx][sec_idx]\n        assert len(clip_label_list) > 0\n\n        # Get boxes and labels for current clip.\n        boxes = []\n        labels = []\n        for box_labels in clip_label_list:\n            boxes.append(box_labels[0])\n            labels.append(box_labels[1])\n        boxes = np.array(boxes)\n        # Score is not used.\n        boxes = boxes[:, :4].copy()\n        ori_boxes = boxes.copy()\n\n        # Load images of current clip.\n        image_paths = [self._image_paths[video_idx][frame] for frame in seq]\n        imgs = utils.retry_load_images(\n            image_paths, backend=self.cfg.AVA.IMG_PROC_BACKEND\n        )\n        if self.cfg.AVA.IMG_PROC_BACKEND == \"pytorch\":\n            # T H W C -> T C H W.\n            imgs = imgs.permute(0, 3, 1, 2)\n            # Preprocess images and boxes.\n            imgs, boxes = 
self._images_and_boxes_preprocessing(imgs, boxes=boxes)\n            # T C H W -> C T H W.\n            imgs = imgs.permute(1, 0, 2, 3)\n        else:\n            # Preprocess images and boxes\n            imgs, boxes = self._images_and_boxes_preprocessing_cv2(imgs, boxes=boxes)\n\n        # Construct label arrays.\n        label_arrs = np.zeros((len(labels), self._num_classes), dtype=np.int32)\n        for i, box_labels in enumerate(labels):\n            # AVA label index starts from 1.\n            for label in box_labels:\n                if label == -1:\n                    continue\n                assert label >= 1 and label <= 80\n                label_arrs[i][label - 1] = 1\n\n        imgs = utils.pack_pathway_output(self.cfg, imgs)\n        metadata = [[video_idx, sec]] * len(boxes)\n\n        extra_data = {\n            \"boxes\": boxes,\n            \"ori_boxes\": ori_boxes,\n            \"metadata\": metadata,\n        }\n\n        return imgs, label_arrs, idx, torch.zeros(1), extra_data\n"
  },
  {
    "path": "slowfast/datasets/ava_helper.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport logging\nimport os\nfrom collections import defaultdict\n\nfrom slowfast.utils.env import pathmgr\n\nlogger = logging.getLogger(__name__)\n\nFPS = 30\nAVA_VALID_FRAMES = range(902, 1799)\n\n\ndef load_image_lists(cfg, is_train):\n    \"\"\"\n    Loading image paths from corresponding files.\n\n    Args:\n        cfg (CfgNode): config.\n        is_train (bool): if it is training dataset or not.\n\n    Returns:\n        image_paths (list[list]): a list of items. Each item (also a list)\n            corresponds to one video and contains the paths of images for\n            this video.\n        video_idx_to_name (list): a list which stores video names.\n    \"\"\"\n    list_filenames = [\n        os.path.join(cfg.AVA.FRAME_LIST_DIR, filename)\n        for filename in (cfg.AVA.TRAIN_LISTS if is_train else cfg.AVA.TEST_LISTS)\n    ]\n    image_paths = defaultdict(list)\n    video_name_to_idx = {}\n    video_idx_to_name = []\n    for list_filename in list_filenames:\n        with pathmgr.open(list_filename, \"r\") as f:\n            f.readline()\n            for line in f:\n                row = line.split()\n                # The format of each row should follow:\n                # original_vido_id video_id frame_id path labels.\n                assert len(row) == 5\n                video_name = row[0]\n\n                if video_name not in video_name_to_idx:\n                    idx = len(video_name_to_idx)\n                    video_name_to_idx[video_name] = idx\n                    video_idx_to_name.append(video_name)\n\n                data_key = video_name_to_idx[video_name]\n\n                image_paths[data_key].append(os.path.join(cfg.AVA.FRAME_DIR, row[3]))\n\n    image_paths = [image_paths[i] for i in range(len(image_paths))]\n\n    logger.info(\"Finished loading image paths from: %s\" % \", \".join(list_filenames))\n\n    return 
image_paths, video_idx_to_name\n\n\ndef load_boxes_and_labels(cfg, mode):\n    \"\"\"\n    Loading boxes and labels from csv files.\n\n    Args:\n        cfg (CfgNode): config.\n        mode (str): 'train', 'val', or 'test' mode.\n    Returns:\n        all_boxes (dict): a dict which maps from `video_name` and\n            `frame_sec` to a list of `box`. Each `box` is a\n            [`box_coord`, `box_labels`] where `box_coord` is the\n            coordinates of box and 'box_labels` are the corresponding\n            labels for the box.\n    \"\"\"\n    gt_lists = cfg.AVA.TRAIN_GT_BOX_LISTS if mode == \"train\" else []\n    pred_lists = (\n        cfg.AVA.TRAIN_PREDICT_BOX_LISTS\n        if mode == \"train\"\n        else cfg.AVA.TEST_PREDICT_BOX_LISTS\n    )\n    ann_filenames = [\n        os.path.join(cfg.AVA.ANNOTATION_DIR, filename)\n        for filename in gt_lists + pred_lists\n    ]\n    ann_is_gt_box = [True] * len(gt_lists) + [False] * len(pred_lists)\n\n    detect_thresh = cfg.AVA.DETECTION_SCORE_THRESH\n    # Only select frame_sec % 4 = 0 samples for validation if not\n    # set FULL_TEST_ON_VAL.\n    boxes_sample_rate = 4 if mode == \"val\" and not cfg.AVA.FULL_TEST_ON_VAL else 1\n    all_boxes, count, unique_box_count = parse_bboxes_file(\n        ann_filenames=ann_filenames,\n        ann_is_gt_box=ann_is_gt_box,\n        detect_thresh=detect_thresh,\n        boxes_sample_rate=boxes_sample_rate,\n    )\n    logger.info(\"Finished loading annotations from: %s\" % \", \".join(ann_filenames))\n    logger.info(\"Detection threshold: {}\".format(detect_thresh))\n    logger.info(\"Number of unique boxes: %d\" % unique_box_count)\n    logger.info(\"Number of annotations: %d\" % count)\n\n    return all_boxes\n\n\ndef get_keyframe_data(boxes_and_labels):\n    \"\"\"\n    Getting keyframe indices, boxes and labels in the dataset.\n\n    Args:\n        boxes_and_labels (list[dict]): a list which maps from video_idx to a dict.\n            Each dict `frame_sec` to 
a list of boxes and corresponding labels.\n\n    Returns:\n        keyframe_indices (list): a list of indices of the keyframes.\n        keyframe_boxes_and_labels (list[list[list]]): a list of list which maps from\n            video_idx and sec_idx to a list of boxes and corresponding labels.\n    \"\"\"\n\n    def sec_to_frame(sec):\n        \"\"\"\n        Convert time index (in seconds) to frame index.\n        sec 900 -> frame 0\n        sec 901 -> frame 30\n        \"\"\"\n        return (sec - 900) * FPS\n\n    keyframe_indices = []\n    keyframe_boxes_and_labels = []\n    count = 0\n    for video_idx in range(len(boxes_and_labels)):\n        sec_idx = 0\n        keyframe_boxes_and_labels.append([])\n        for sec in boxes_and_labels[video_idx].keys():\n            if sec not in AVA_VALID_FRAMES:\n                continue\n\n            if len(boxes_and_labels[video_idx][sec]) > 0:\n                keyframe_indices.append((video_idx, sec_idx, sec, sec_to_frame(sec)))\n                keyframe_boxes_and_labels[video_idx].append(\n                    boxes_and_labels[video_idx][sec]\n                )\n                sec_idx += 1\n                count += 1\n    logger.info(\"%d keyframes used.\" % count)\n\n    return keyframe_indices, keyframe_boxes_and_labels\n\n\ndef get_num_boxes_used(keyframe_indices, keyframe_boxes_and_labels):\n    \"\"\"\n    Get total number of used boxes.\n\n    Args:\n        keyframe_indices (list): a list of indices of the keyframes.\n        keyframe_boxes_and_labels (list[list[list]]): a list of list which maps from\n            video_idx and sec_idx to a list of boxes and corresponding labels.\n\n    Returns:\n        count (int): total number of used boxes.\n    \"\"\"\n\n    count = 0\n    for video_idx, sec_idx, _, _ in keyframe_indices:\n        count += len(keyframe_boxes_and_labels[video_idx][sec_idx])\n    return count\n\n\ndef parse_bboxes_file(ann_filenames, ann_is_gt_box, detect_thresh, boxes_sample_rate=1):\n    \"\"\"\n    Parse AVA 
bounding boxes files.\n    Args:\n        ann_filenames (list of str(s)): a list of AVA bounding boxes annotation files.\n        ann_is_gt_box (list of bools): a list of boolean to indicate whether the corresponding\n            ann_file is ground-truth. `ann_is_gt_box[i]` correspond to `ann_filenames[i]`.\n        detect_thresh (float): threshold for accepting predicted boxes, range [0, 1].\n        boxes_sample_rate (int): sample rate for test bounding boxes. Get 1 every `boxes_sample_rate`.\n    \"\"\"\n    all_boxes = {}\n    count = 0\n    unique_box_count = 0\n    for filename, is_gt_box in zip(ann_filenames, ann_is_gt_box):\n        with pathmgr.open(filename, \"r\") as f:\n            for line in f:\n                row = line.strip().split(\",\")\n                # When we use predicted boxes to train/eval, we need to\n                # ignore the boxes whose scores are below the threshold.\n                if not is_gt_box:\n                    score = float(row[7])\n                    if score < detect_thresh:\n                        continue\n\n                video_name, frame_sec = row[0], int(row[1])\n                if frame_sec % boxes_sample_rate != 0:\n                    continue\n\n                # Box with format [x1, y1, x2, y2] with a range of [0, 1] as float.\n                box_key = \",\".join(row[2:6])\n                box = list(map(float, row[2:6]))\n                label = -1 if row[6] == \"\" else int(row[6])\n\n                if video_name not in all_boxes:\n                    all_boxes[video_name] = {}\n                    for sec in AVA_VALID_FRAMES:\n                        all_boxes[video_name][sec] = {}\n\n                if box_key not in all_boxes[video_name][frame_sec]:\n                    all_boxes[video_name][frame_sec][box_key] = [box, []]\n                    unique_box_count += 1\n\n                all_boxes[video_name][frame_sec][box_key][1].append(label)\n                if label != -1:\n                    
count += 1\n\n    for video_name in all_boxes.keys():\n        for frame_sec in all_boxes[video_name].keys():\n            # Save in format of a list of [box_i, box_i_labels].\n            all_boxes[video_name][frame_sec] = list(\n                all_boxes[video_name][frame_sec].values()\n            )\n\n    return all_boxes, count, unique_box_count\n"
  },
  {
    "path": "slowfast/datasets/build.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nfrom fvcore.common.registry import Registry\n\nDATASET_REGISTRY = Registry(\"DATASET\")\nDATASET_REGISTRY.__doc__ = \"\"\"\nRegistry for dataset.\n\nThe registered object will be called with `obj(cfg, split)`.\nThe call should return a `torch.utils.data.Dataset` object.\n\"\"\"\n\n\ndef build_dataset(dataset_name, cfg, split):\n    \"\"\"\n    Build a dataset, defined by `dataset_name`.\n    Args:\n        dataset_name (str): the name of the dataset to be constructed.\n        cfg (CfgNode): configs. Details can be found in\n            slowfast/config/defaults.py\n        split (str): the split of the data loader. Options include `train`,\n            `val`, and `test`.\n    Returns:\n        Dataset: a constructed dataset specified by dataset_name.\n    \"\"\"\n    # Capitalize the the first letter of the dataset_name since the dataset_name\n    # in configs may be in lowercase but the name of dataset class should always\n    # start with an uppercase letter.\n    name = dataset_name.capitalize()\n    return DATASET_REGISTRY.get(name)(cfg, split)\n"
  },
  {
    "path": "slowfast/datasets/charades.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport os\nimport random\nfrom itertools import chain as chain\n\nimport slowfast.utils.logging as logging\nimport torch\nimport torch.utils.data\nfrom slowfast.utils.env import pathmgr\n\nfrom . import utils as utils\nfrom .build import DATASET_REGISTRY\n\nlogger = logging.get_logger(__name__)\n\n\n@DATASET_REGISTRY.register()\nclass Charades(torch.utils.data.Dataset):\n    \"\"\"\n    Charades video loader. Construct the Charades video loader, then sample\n    clips from the videos. For training and validation, a single clip is randomly\n    sampled from every video with random cropping, scaling, and flipping. For\n    testing, multiple clips are uniformaly sampled from every video with uniform\n    cropping. For uniform cropping, we take the left, center, and right crop if\n    the width is larger than height, or take top, center, and bottom crop if the\n    height is larger than the width.\n    \"\"\"\n\n    def __init__(self, cfg, mode, num_retries=10):\n        \"\"\"\n        Load Charades data (frame paths, labels, etc. 
) to a given Dataset object.\n        The dataset could be downloaded from the Charades official website\n        (https://allenai.org/plato/charades/).\n        Please see datasets/DATASET.md for more information about the data format.\n        Args:\n            cfg (CfgNode): configs.\n            mode (string): Options include `train`, `val`, or `test` mode.\n                For the train and val mode, the data loader will take data\n                from the train or val set, and sample one clip per video.\n                For the test mode, the data loader will take data from test set,\n                and sample multiple clips per video.\n            num_retries (int): number of retries.\n        \"\"\"\n        # Only support train, val, and test mode.\n        assert mode in [\n            \"train\",\n            \"val\",\n            \"test\",\n        ], \"Split '{}' not supported for Charades \".format(mode)\n        self.mode = mode\n        self.cfg = cfg\n\n        self._video_meta = {}\n        self._num_retries = num_retries\n        # For training or validation mode, one single clip is sampled from every\n        # video. For testing, NUM_ENSEMBLE_VIEWS clips are sampled from every\n        # video. 
For every clip, NUM_SPATIAL_CROPS crops are taken spatially from\n        # the frames.\n        if self.mode in [\"train\", \"val\"]:\n            self._num_clips = 1\n        elif self.mode in [\"test\"]:\n            self._num_clips = cfg.TEST.NUM_ENSEMBLE_VIEWS * cfg.TEST.NUM_SPATIAL_CROPS\n\n        logger.info(\"Constructing Charades {}...\".format(mode))\n        self._construct_loader()\n\n    def _construct_loader(self):\n        \"\"\"\n        Construct the video loader.\n        \"\"\"\n        path_to_file = os.path.join(\n            self.cfg.DATA.PATH_TO_DATA_DIR,\n            \"{}.csv\".format(\"train\" if self.mode == \"train\" else \"val\"),\n        )\n        assert pathmgr.exists(path_to_file), \"{} file not found\".format(path_to_file)\n        (self._path_to_videos, self._labels) = utils.load_image_lists(\n            path_to_file, self.cfg.DATA.PATH_PREFIX, return_list=True\n        )\n\n        if self.mode != \"train\":\n            # Form video-level labels from frame level annotations.\n            self._labels = utils.convert_to_video_level_labels(self._labels)\n\n        self._path_to_videos = list(\n            chain.from_iterable([[x] * self._num_clips for x in self._path_to_videos])\n        )\n        self._labels = list(\n            chain.from_iterable([[x] * self._num_clips for x in self._labels])\n        )\n        self._spatial_temporal_idx = list(\n            chain.from_iterable(\n                [range(self._num_clips) for _ in range(len(self._labels))]\n            )\n        )\n\n        logger.info(\n            \"Charades dataloader constructed (size: {}) from {}\".format(\n                len(self._path_to_videos), path_to_file\n            )\n        )\n\n    def get_seq_frames(self, index):\n        \"\"\"\n        Given the video index, return the list of indices of sampled frames.\n        Args:\n            index (int): the video index.\n        Returns:\n            seq (list): the indices of sampled frames from the 
video.\n        \"\"\"\n        temporal_sample_index = (\n            -1\n            if self.mode in [\"train\", \"val\"]\n            else self._spatial_temporal_idx[index] // self.cfg.TEST.NUM_SPATIAL_CROPS\n        )\n        num_frames = self.cfg.DATA.NUM_FRAMES\n        sampling_rate = utils.get_random_sampling_rate(\n            self.cfg.MULTIGRID.LONG_CYCLE_SAMPLING_RATE,\n            self.cfg.DATA.SAMPLING_RATE,\n        )\n        video_length = len(self._path_to_videos[index])\n        assert video_length == len(self._labels[index])\n\n        clip_length = (num_frames - 1) * sampling_rate + 1\n        if temporal_sample_index == -1:\n            if clip_length > video_length:\n                start = random.randint(video_length - clip_length, 0)\n            else:\n                start = random.randint(0, video_length - clip_length)\n        else:\n            # Avoid division by zero when NUM_ENSEMBLE_VIEWS is 1.\n            gap = float(max(video_length - clip_length, 0)) / max(\n                self.cfg.TEST.NUM_ENSEMBLE_VIEWS - 1, 1\n            )\n            start = int(round(gap * temporal_sample_index))\n\n        seq = [\n            max(min(start + i * sampling_rate, video_length - 1), 0)\n            for i in range(num_frames)\n        ]\n\n        return seq\n\n    def __getitem__(self, index):\n        \"\"\"\n        Given the video index, return the list of frames, label, and video\n        index if the video frames can be fetched.\n        Args:\n            index (int): the video index provided by the pytorch sampler.\n        Returns:\n            frames (tensor): the frames sampled from the video. 
The dimension\n                is `channel` x `num frames` x `height` x `width`.\n            label (tensor): the multi-hot label vector of the current clip.\n            index (int): the index of the video.\n            time index (zero): The time index is currently not supported.\n            extra_data (dict): extra data, currently not supported (always empty).\n        \"\"\"\n        short_cycle_idx = None\n        # When short cycle is used, input index is a tuple.\n        if isinstance(index, tuple):\n            index, self._num_yielded = index\n            if self.cfg.MULTIGRID.SHORT_CYCLE:\n                index, short_cycle_idx = index\n\n        if self.mode in [\"train\", \"val\"]:\n            # -1 indicates random sampling.\n            spatial_sample_index = -1\n            min_scale = self.cfg.DATA.TRAIN_JITTER_SCALES[0]\n            max_scale = self.cfg.DATA.TRAIN_JITTER_SCALES[1]\n            crop_size = self.cfg.DATA.TRAIN_CROP_SIZE\n            if short_cycle_idx in [0, 1]:\n                crop_size = int(\n                    round(\n                        self.cfg.MULTIGRID.SHORT_CYCLE_FACTORS[short_cycle_idx]\n                        * self.cfg.MULTIGRID.DEFAULT_S\n                    )\n                )\n            if self.cfg.MULTIGRID.DEFAULT_S > 0:\n                # Decreasing the scale is equivalent to using a larger \"span\"\n                # in a sampling grid.\n                min_scale = int(\n                    round(float(min_scale) * crop_size / self.cfg.MULTIGRID.DEFAULT_S)\n                )\n        elif self.mode in [\"test\"]:\n            # spatial_sample_index is in [0, 1, 2]. 
Corresponding to left,\n            # center, or right if width is larger than height, and top, middle,\n            # or bottom if height is larger than width.\n            spatial_sample_index = (\n                self._spatial_temporal_idx[index] % self.cfg.TEST.NUM_SPATIAL_CROPS\n            )\n            min_scale, max_scale, crop_size = [self.cfg.DATA.TEST_CROP_SIZE] * 3\n            # The testing is deterministic and no jitter should be performed.\n            # min_scale, max_scale, and crop_size are expect to be the same.\n            assert len({min_scale, max_scale, crop_size}) == 1\n        else:\n            raise NotImplementedError(\"Does not support {} mode\".format(self.mode))\n\n        seq = self.get_seq_frames(index)\n        frames = torch.as_tensor(\n            utils.retry_load_images(\n                [self._path_to_videos[index][frame] for frame in seq],\n                self._num_retries,\n            )\n        )\n\n        label = utils.aggregate_labels(\n            [self._labels[index][i] for i in range(seq[0], seq[-1] + 1)]\n        )\n        label = torch.as_tensor(\n            utils.as_binary_vector(label, self.cfg.MODEL.NUM_CLASSES)\n        )\n\n        # Perform color normalization.\n        frames = utils.tensor_normalize(frames, self.cfg.DATA.MEAN, self.cfg.DATA.STD)\n        # T H W C -> C T H W.\n        frames = frames.permute(3, 0, 1, 2)\n        # Perform data augmentation.\n        frames = utils.spatial_sampling(\n            frames,\n            spatial_idx=spatial_sample_index,\n            min_scale=min_scale,\n            max_scale=max_scale,\n            crop_size=crop_size,\n            random_horizontal_flip=self.cfg.DATA.RANDOM_FLIP,\n            inverse_uniform_sampling=self.cfg.DATA.INV_UNIFORM_SAMPLE,\n        )\n        frames = utils.pack_pathway_output(self.cfg, frames)\n        return frames, label, index, 0, {}\n\n    def __len__(self):\n        \"\"\"\n        Returns:\n            (int): the number 
of videos in the dataset.\n        \"\"\"\n        return self.num_videos\n\n    @property\n    def num_videos(self):\n        \"\"\"\n        Returns:\n            (int): the number of videos in the dataset.\n        \"\"\"\n        return len(self._path_to_videos)\n"
  },
  {
    "path": "slowfast/datasets/cv2_transform.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport math\n\nimport cv2\nimport numpy as np\n\n\ndef clip_boxes_to_image(boxes, height, width):\n    \"\"\"\n    Clip the boxes with the height and width of the image size.\n    Args:\n        boxes (ndarray): bounding boxes to peform crop. The dimension is\n        `num boxes` x 4.\n        height (int): the height of the image.\n        width (int): the width of the image.\n    Returns:\n        boxes (ndarray): cropped bounding boxes.\n    \"\"\"\n    boxes[:, [0, 2]] = np.minimum(width - 1.0, np.maximum(0.0, boxes[:, [0, 2]]))\n    boxes[:, [1, 3]] = np.minimum(height - 1.0, np.maximum(0.0, boxes[:, [1, 3]]))\n    return boxes\n\n\ndef random_short_side_scale_jitter_list(images, min_size, max_size, boxes=None):\n    \"\"\"\n    Perform a spatial short scale jittering on the given images and\n    corresponding boxes.\n    Args:\n        images (list): list of images to perform scale jitter. Dimension is\n            `height` x `width` x `channel`.\n        min_size (int): the minimal size to scale the frames.\n        max_size (int): the maximal size to scale the frames.\n        boxes (list): optional. Corresponding boxes to images. 
Dimension is\n            `num boxes` x 4.\n    Returns:\n        (list): the list of scaled images with dimension of\n            `new height` x `new width` x `channel`.\n        (ndarray or None): the scaled boxes with dimension of\n            `num boxes` x 4.\n    \"\"\"\n    size = int(round(1.0 / np.random.uniform(1.0 / max_size, 1.0 / min_size)))\n\n    height = images[0].shape[0]\n    width = images[0].shape[1]\n    if (width <= height and width == size) or (height <= width and height == size):\n        return images, boxes\n    new_width = size\n    new_height = size\n    if width < height:\n        new_height = int(math.floor((float(height) / width) * size))\n        if boxes is not None:\n            boxes = [proposal * float(new_height) / height for proposal in boxes]\n    else:\n        new_width = int(math.floor((float(width) / height) * size))\n        if boxes is not None:\n            boxes = [proposal * float(new_width) / width for proposal in boxes]\n    return (\n        [\n            cv2.resize(\n                image, (new_width, new_height), interpolation=cv2.INTER_LINEAR\n            ).astype(np.float32)\n            for image in images\n        ],\n        boxes,\n    )\n\n\ndef scale(size, image):\n    \"\"\"\n    Scale the short side of the image to size.\n    Args:\n        size (int): size to scale the image.\n        image (array): image to perform short side scale. 
Dimension is\n            `height` x `width` x `channel`.\n    Returns:\n        (ndarray): the scaled image with dimension of\n            `height` x `width` x `channel`.\n    \"\"\"\n    height = image.shape[0]\n    width = image.shape[1]\n    if (width <= height and width == size) or (height <= width and height == size):\n        return image\n    new_width = size\n    new_height = size\n    if width < height:\n        new_height = int(math.floor((float(height) / width) * size))\n    else:\n        new_width = int(math.floor((float(width) / height) * size))\n    img = cv2.resize(image, (new_width, new_height), interpolation=cv2.INTER_LINEAR)\n    return img.astype(np.float32)\n\n\ndef scale_boxes(size, boxes, height, width):\n    \"\"\"\n    Scale the short side of the box to size.\n    Args:\n        size (int): size to scale the image.\n        boxes (ndarray): bounding boxes to peform scale. The dimension is\n        `num boxes` x 4.\n        height (int): the height of the image.\n        width (int): the width of the image.\n    Returns:\n        boxes (ndarray): scaled bounding boxes.\n    \"\"\"\n    if (width <= height and width == size) or (height <= width and height == size):\n        return boxes\n\n    new_width = size\n    new_height = size\n    if width < height:\n        new_height = int(math.floor((float(height) / width) * size))\n        boxes *= float(new_height) / height\n    else:\n        new_width = int(math.floor((float(width) / height) * size))\n        boxes *= float(new_width) / width\n    return boxes\n\n\ndef horizontal_flip_list(prob, images, order=\"CHW\", boxes=None):\n    \"\"\"\n    Horizontally flip the list of image and optional boxes.\n    Args:\n        prob (float): probability to flip.\n        image (list): ilist of images to perform short side scale. 
Dimension is\n            `height` x `width` x `channel` or `channel` x `height` x `width`.\n        order (str): order of the `height`, `channel` and `width`.\n        boxes (list): optional. Corresponding boxes to images.\n            Dimension is `num boxes` x 4.\n    Returns:\n        (ndarray): the scaled image with dimension of\n            `height` x `width` x `channel`.\n        (list): optional. Corresponding boxes to images. Dimension is\n            `num boxes` x 4.\n    \"\"\"\n    _, width, _ = images[0].shape\n    if np.random.uniform() < prob:\n        if boxes is not None:\n            boxes = [flip_boxes(proposal, width) for proposal in boxes]\n        if order == \"CHW\":\n            out_images = []\n            for image in images:\n                image = np.asarray(image).swapaxes(2, 0)\n                image = image[::-1]\n                out_images.append(image.swapaxes(0, 2))\n            return out_images, boxes\n        elif order == \"HWC\":\n            return [cv2.flip(image, 1) for image in images], boxes\n    return images, boxes\n\n\ndef spatial_shift_crop_list(size, images, spatial_shift_pos, boxes=None):\n    \"\"\"\n    Perform left, center, or right crop of the given list of images.\n    Args:\n        size (int): size to crop.\n        image (list): ilist of images to perform short side scale. Dimension is\n            `height` x `width` x `channel` or `channel` x `height` x `width`.\n        spatial_shift_pos (int): option includes 0 (left), 1 (middle), and\n            2 (right) crop.\n        boxes (list): optional. Corresponding boxes to images.\n            Dimension is `num boxes` x 4.\n    Returns:\n        cropped (ndarray): the cropped list of images with dimension of\n            `height` x `width` x `channel`.\n        boxes (list): optional. Corresponding boxes to images. 
Dimension is\n            `num boxes` x 4.\n    \"\"\"\n\n    assert spatial_shift_pos in [0, 1, 2]\n\n    height = images[0].shape[0]\n    width = images[0].shape[1]\n    y_offset = int(math.ceil((height - size) / 2))\n    x_offset = int(math.ceil((width - size) / 2))\n\n    if height > width:\n        if spatial_shift_pos == 0:\n            y_offset = 0\n        elif spatial_shift_pos == 2:\n            y_offset = height - size\n    else:\n        if spatial_shift_pos == 0:\n            x_offset = 0\n        elif spatial_shift_pos == 2:\n            x_offset = width - size\n\n    cropped = [\n        image[y_offset : y_offset + size, x_offset : x_offset + size, :]\n        for image in images\n    ]\n    assert cropped[0].shape[0] == size, \"Image height not cropped properly\"\n    assert cropped[0].shape[1] == size, \"Image width not cropped properly\"\n\n    if boxes is not None:\n        for i in range(len(boxes)):\n            boxes[i][:, [0, 2]] -= x_offset\n            boxes[i][:, [1, 3]] -= y_offset\n    return cropped, boxes\n\n\ndef CHW2HWC(image):\n    \"\"\"\n    Transpose the dimension from `channel` x `height` x `width` to\n        `height` x `width` x `channel`.\n    Args:\n        image (array): image to transpose.\n    Returns\n        (array): transposed image.\n    \"\"\"\n    return image.transpose([1, 2, 0])\n\n\ndef HWC2CHW(image):\n    \"\"\"\n    Transpose the dimension from `height` x `width` x `channel` to\n        `channel` x `height` x `width`.\n    Args:\n        image (array): image to transpose.\n    Returns\n        (array): transposed image.\n    \"\"\"\n    return image.transpose([2, 0, 1])\n\n\ndef color_jitter_list(images, img_brightness=0, img_contrast=0, img_saturation=0):\n    \"\"\"\n    Perform color jitter on the list of images.\n    Args:\n        images (list): list of images to perform color jitter.\n        img_brightness (float): jitter ratio for brightness.\n        img_contrast (float): jitter ratio for contrast.\n  
      img_saturation (float): jitter ratio for saturation.\n    Returns:\n        images (list): the jittered list of images.\n    \"\"\"\n    jitter = []\n    if img_brightness != 0:\n        jitter.append(\"brightness\")\n    if img_contrast != 0:\n        jitter.append(\"contrast\")\n    if img_saturation != 0:\n        jitter.append(\"saturation\")\n\n    if len(jitter) > 0:\n        order = np.random.permutation(np.arange(len(jitter)))\n        for idx in range(0, len(jitter)):\n            if jitter[order[idx]] == \"brightness\":\n                images = brightness_list(img_brightness, images)\n            elif jitter[order[idx]] == \"contrast\":\n                images = contrast_list(img_contrast, images)\n            elif jitter[order[idx]] == \"saturation\":\n                images = saturation_list(img_saturation, images)\n    return images\n\n\ndef lighting_list(imgs, alphastd, eigval, eigvec, alpha=None):\n    \"\"\"\n    Perform AlexNet-style PCA jitter on the given list of images.\n    Args:\n        images (list): list of images to perform lighting jitter.\n        alphastd (float): jitter ratio for PCA jitter.\n        eigval (list): eigenvalues for PCA jitter.\n        eigvec (list[list]): eigenvectors for PCA jitter.\n    Returns:\n        out_images (list): the list of jittered images.\n    \"\"\"\n    if alphastd == 0:\n        return imgs\n    # generate alpha1, alpha2, alpha3\n    alpha = np.random.normal(0, alphastd, size=(1, 3))\n    eig_vec = np.array(eigvec)\n    eig_val = np.reshape(eigval, (1, 3))\n    rgb = np.sum(\n        eig_vec * np.repeat(alpha, 3, axis=0) * np.repeat(eig_val, 3, axis=0),\n        axis=1,\n    )\n    out_images = []\n    for img in imgs:\n        for idx in range(img.shape[0]):\n            img[idx] = img[idx] + rgb[2 - idx]\n        out_images.append(img)\n    return out_images\n\n\ndef color_normalization(image, mean, stddev):\n    \"\"\"\n    Perform color normalization on the image with the given mean and 
stddev.\n    Args:\n        image (array): image to perform color normalization.\n        mean (float): mean value to subtract.\n        stddev (float): stddev to divide by.\n    \"\"\"\n    # Input image should be in CHW format.\n    assert len(mean) == image.shape[0], \"channel mean not computed properly\"\n    assert len(stddev) == image.shape[0], \"channel stddev not computed properly\"\n    for idx in range(image.shape[0]):\n        image[idx] = image[idx] - mean[idx]\n        image[idx] = image[idx] / stddev[idx]\n    return image\n\n\ndef pad_image(image, pad_size, order=\"CHW\"):\n    \"\"\"\n    Pad the given image with the size of pad_size.\n    Args:\n        image (array): image to pad.\n        pad_size (int): size to pad.\n        order (str): order of the `height`, `channel` and `width`.\n    Returns:\n        img (array): padded image.\n    \"\"\"\n    if order == \"CHW\":\n        img = np.pad(\n            image,\n            ((0, 0), (pad_size, pad_size), (pad_size, pad_size)),\n            mode=str(\"constant\"),\n        )\n    elif order == \"HWC\":\n        img = np.pad(\n            image,\n            ((pad_size, pad_size), (pad_size, pad_size), (0, 0)),\n            mode=str(\"constant\"),\n        )\n    return img\n\n\ndef horizontal_flip(prob, image, order=\"CHW\"):\n    \"\"\"\n    Horizontally flip the image.\n    Args:\n        prob (float): probability to flip.\n        image (array): image to flip.\n        order (str): order of the `height`, `channel` and `width`.\n    Returns:\n        img (array): flipped image.\n    \"\"\"\n    assert order in [\"CHW\", \"HWC\"], \"order {} is not supported\".format(order)\n    if np.random.uniform() < prob:\n        if order == \"CHW\":\n            image = image[:, :, ::-1]\n        elif order == \"HWC\":\n            image = image[:, ::-1, :]\n        else:\n            raise NotImplementedError(\"Unknown order {}\".format(order))\n    return image\n\n\ndef flip_boxes(boxes, im_width):\n    
\"\"\"\n    Horizontally flip the boxes.\n    Args:\n        boxes (array): box to flip.\n        im_width (int): width of the image.\n    Returns:\n        boxes_flipped (array): flipped box.\n    \"\"\"\n\n    boxes_flipped = boxes.copy()\n    boxes_flipped[:, 0::4] = im_width - boxes[:, 2::4] - 1\n    boxes_flipped[:, 2::4] = im_width - boxes[:, 0::4] - 1\n    return boxes_flipped\n\n\ndef crop_boxes(boxes, x_offset, y_offset):\n    \"\"\"\n    Crop the boxes given the offsets.\n    Args:\n        boxes (array): boxes to crop.\n        x_offset (int): offset on x.\n        y_offset (int): offset on y.\n    \"\"\"\n    boxes[:, [0, 2]] = boxes[:, [0, 2]] - x_offset\n    boxes[:, [1, 3]] = boxes[:, [1, 3]] - y_offset\n    return boxes\n\n\ndef random_crop_list(images, size, pad_size=0, order=\"CHW\", boxes=None):\n    \"\"\"\n    Perform random crop on a list of images.\n    Args:\n        images (list): list of images to perform random crop.\n        size (int): size to crop.\n        pad_size (int): padding size.\n        order (str): order of the `height`, `channel` and `width`.\n        boxes (list): optional. Corresponding boxes to images.\n            Dimension is `num boxes` x 4.\n    Returns:\n        cropped (ndarray): the cropped list of images with dimension of\n            `height` x `width` x `channel`.\n        boxes (list): optional. Corresponding boxes to images. 
Dimension is\n            `num boxes` x 4.\n    \"\"\"\n    # explicitly dealing processing per image order to avoid flipping images.\n    if pad_size > 0:\n        images = [\n            pad_image(pad_size=pad_size, image=image, order=order) for image in images\n        ]\n\n    # image format should be CHW.\n    if order == \"CHW\":\n        if images[0].shape[1] == size and images[0].shape[2] == size:\n            return images, boxes\n        height = images[0].shape[1]\n        width = images[0].shape[2]\n        y_offset = 0\n        if height > size:\n            y_offset = int(np.random.randint(0, height - size))\n        x_offset = 0\n        if width > size:\n            x_offset = int(np.random.randint(0, width - size))\n        cropped = [\n            image[:, y_offset : y_offset + size, x_offset : x_offset + size]\n            for image in images\n        ]\n        assert cropped[0].shape[1] == size, \"Image not cropped properly\"\n        assert cropped[0].shape[2] == size, \"Image not cropped properly\"\n    elif order == \"HWC\":\n        if images[0].shape[0] == size and images[0].shape[1] == size:\n            return images, boxes\n        height = images[0].shape[0]\n        width = images[0].shape[1]\n        y_offset = 0\n        if height > size:\n            y_offset = int(np.random.randint(0, height - size))\n        x_offset = 0\n        if width > size:\n            x_offset = int(np.random.randint(0, width - size))\n        cropped = [\n            image[y_offset : y_offset + size, x_offset : x_offset + size, :]\n            for image in images\n        ]\n        assert cropped[0].shape[0] == size, \"Image not cropped properly\"\n        assert cropped[0].shape[1] == size, \"Image not cropped properly\"\n\n    if boxes is not None:\n        boxes = [crop_boxes(proposal, x_offset, y_offset) for proposal in boxes]\n    return cropped, boxes\n\n\ndef center_crop(size, image):\n    \"\"\"\n    Perform center crop on input images.\n    
Args:\n        size (int): size of the cropped height and width.\n        image (array): the image to perform center crop.\n    \"\"\"\n    height = image.shape[0]\n    width = image.shape[1]\n    y_offset = int(math.ceil((height - size) / 2))\n    x_offset = int(math.ceil((width - size) / 2))\n    cropped = image[y_offset : y_offset + size, x_offset : x_offset + size, :]\n    assert cropped.shape[0] == size, \"Image height not cropped properly\"\n    assert cropped.shape[1] == size, \"Image width not cropped properly\"\n    return cropped\n\n\n# ResNet style scale jittering: randomly select the scale from\n# [1/max_size, 1/min_size]\ndef random_scale_jitter(image, min_size, max_size):\n    \"\"\"\n    Perform ResNet style random scale jittering: randomly select the scale from\n        [1/max_size, 1/min_size].\n    Args:\n        image (array): image to perform random scale.\n        min_size (int): min size to scale.\n        max_size (int): max size to scale.\n    Returns:\n        image (array): scaled image.\n    \"\"\"\n    img_scale = int(round(1.0 / np.random.uniform(1.0 / max_size, 1.0 / min_size)))\n    image = scale(img_scale, image)\n    return image\n\n\ndef random_scale_jitter_list(images, min_size, max_size):\n    \"\"\"\n    Perform ResNet style random scale jittering on a list of images: randomly\n        select the scale from [1/max_size, 1/min_size]. Note that all images\n        will share the same scale.\n    Args:\n        images (list): list of images to perform random scale.\n        min_size (int): min size to scale.\n        max_size (int): max size to scale.\n    Returns:\n        images (list): list of scaled images.\n    \"\"\"\n    img_scale = int(round(1.0 / np.random.uniform(1.0 / max_size, 1.0 / min_size)))\n    return [scale(img_scale, image) for image in images]\n\n\ndef random_sized_crop(image, size, area_frac=0.08):\n    \"\"\"\n    Perform random sized cropping on the given image. 
Random crop with size\n        8% - 100% image area and aspect ratio in [3/4, 4/3].\n    Args:\n        image (array): image to crop.\n        size (int): size to crop.\n        area_frac (float): area of fraction.\n    Returns:\n        (array): cropped image.\n    \"\"\"\n    for _ in range(0, 10):\n        height = image.shape[0]\n        width = image.shape[1]\n        area = height * width\n        target_area = np.random.uniform(area_frac, 1.0) * area\n        aspect_ratio = np.random.uniform(3.0 / 4.0, 4.0 / 3.0)\n        w = int(round(math.sqrt(float(target_area) * aspect_ratio)))\n        h = int(round(math.sqrt(float(target_area) / aspect_ratio)))\n        if np.random.uniform() < 0.5:\n            w, h = h, w\n        if h <= height and w <= width:\n            if height == h:\n                y_offset = 0\n            else:\n                y_offset = np.random.randint(0, height - h)\n            if width == w:\n                x_offset = 0\n            else:\n                x_offset = np.random.randint(0, width - w)\n            y_offset = int(y_offset)\n            x_offset = int(x_offset)\n            cropped = image[y_offset : y_offset + h, x_offset : x_offset + w, :]\n            assert cropped.shape[0] == h and cropped.shape[1] == w, \"Wrong crop size\"\n            cropped = cv2.resize(cropped, (size, size), interpolation=cv2.INTER_LINEAR)\n            return cropped.astype(np.float32)\n    return center_crop(size, scale(size, image))\n\n\ndef lighting(img, alphastd, eigval, eigvec):\n    \"\"\"\n    Perform AlexNet-style PCA jitter on the given image.\n    Args:\n        image (array): list of images to perform lighting jitter.\n        alphastd (float): jitter ratio for PCA jitter.\n        eigval (array): eigenvalues for PCA jitter.\n        eigvec (list): eigenvectors for PCA jitter.\n    Returns:\n        img (tensor): the jittered image.\n    \"\"\"\n    if alphastd == 0:\n        return img\n    # generate alpha1, alpha2, alpha3.\n    
alpha = np.random.normal(0, alphastd, size=(1, 3))\n    eig_vec = np.array(eigvec)\n    eig_val = np.reshape(eigval, (1, 3))\n    rgb = np.sum(\n        eig_vec * np.repeat(alpha, 3, axis=0) * np.repeat(eig_val, 3, axis=0),\n        axis=1,\n    )\n    for idx in range(img.shape[0]):\n        img[idx] = img[idx] + rgb[2 - idx]\n    return img\n\n\ndef random_sized_crop_list(images, size, crop_area_fraction=0.08):\n    \"\"\"\n    Perform random sized cropping on the given list of images. Random crop with\n        size 8% - 100% image area and aspect ratio in [3/4, 4/3].\n    Args:\n        images (list): image to crop.\n        size (int): size to crop.\n        area_frac (float): area of fraction.\n    Returns:\n        (list): list of cropped image.\n    \"\"\"\n    for _ in range(0, 10):\n        height = images[0].shape[0]\n        width = images[0].shape[1]\n        area = height * width\n        target_area = np.random.uniform(crop_area_fraction, 1.0) * area\n        aspect_ratio = np.random.uniform(3.0 / 4.0, 4.0 / 3.0)\n        w = int(round(math.sqrt(float(target_area) * aspect_ratio)))\n        h = int(round(math.sqrt(float(target_area) / aspect_ratio)))\n        if np.random.uniform() < 0.5:\n            w, h = h, w\n        if h <= height and w <= width:\n            if height == h:\n                y_offset = 0\n            else:\n                y_offset = np.random.randint(0, height - h)\n            if width == w:\n                x_offset = 0\n            else:\n                x_offset = np.random.randint(0, width - w)\n            y_offset = int(y_offset)\n            x_offset = int(x_offset)\n\n            croppsed_images = []\n            for image in images:\n                cropped = image[y_offset : y_offset + h, x_offset : x_offset + w, :]\n                assert cropped.shape[0] == h and cropped.shape[1] == w, (\n                    \"Wrong crop size\"\n                )\n                cropped = cv2.resize(\n                    cropped, 
(size, size), interpolation=cv2.INTER_LINEAR\n                )\n                croppsed_images.append(cropped.astype(np.float32))\n            return croppsed_images\n\n    return [center_crop(size, scale(size, image)) for image in images]\n\n\ndef blend(image1, image2, alpha):\n    return image1 * alpha + image2 * (1 - alpha)\n\n\ndef grayscale(image):\n    \"\"\"\n    Convert the image to gray scale.\n    Args:\n        image (tensor): image to convert to gray scale. Dimension is\n            `channel` x `height` x `width`.\n    Returns:\n        img_gray (tensor): image in gray scale.\n    \"\"\"\n    # R -> 0.299, G -> 0.587, B -> 0.114.\n    img_gray = np.copy(image)\n    gray_channel = 0.299 * image[2] + 0.587 * image[1] + 0.114 * image[0]\n    img_gray[0] = gray_channel\n    img_gray[1] = gray_channel\n    img_gray[2] = gray_channel\n    return img_gray\n\n\ndef saturation(var, image):\n    \"\"\"\n    Perform color saturation on the given image.\n    Args:\n        var (float): variance.\n        image (array): image to perform color saturation.\n    Returns:\n        (array): image that performed color saturation.\n    \"\"\"\n    img_gray = grayscale(image)\n    alpha = 1.0 + np.random.uniform(-var, var)\n    return blend(image, img_gray, alpha)\n\n\ndef brightness(var, image):\n    \"\"\"\n    Perform color brightness on the given image.\n    Args:\n        var (float): variance.\n        image (array): image to perform color brightness.\n    Returns:\n        (array): image that performed color brightness.\n    \"\"\"\n    img_bright = np.zeros(image.shape).astype(image.dtype)\n    alpha = 1.0 + np.random.uniform(-var, var)\n    return blend(image, img_bright, alpha)\n\n\ndef contrast(var, image):\n    \"\"\"\n    Perform color contrast on the given image.\n    Args:\n        var (float): variance.\n        image (array): image to perform color contrast.\n    Returns:\n        (array): image that performed color contrast.\n    \"\"\"\n    img_gray = 
grayscale(image)\n    img_gray.fill(np.mean(img_gray[0]))\n    alpha = 1.0 + np.random.uniform(-var, var)\n    return blend(image, img_gray, alpha)\n\n\ndef saturation_list(var, images):\n    \"\"\"\n    Perform color saturation on the list of given images.\n    Args:\n        var (float): variance.\n        images (list): list of images to perform color saturation.\n    Returns:\n        (list): list of images that performed color saturation.\n    \"\"\"\n    alpha = 1.0 + np.random.uniform(-var, var)\n\n    out_images = []\n    for image in images:\n        img_gray = grayscale(image)\n        out_images.append(blend(image, img_gray, alpha))\n    return out_images\n\n\ndef brightness_list(var, images):\n    \"\"\"\n    Perform color brightness on the given list of images.\n    Args:\n        var (float): variance.\n        images (list): list of images to perform color brightness.\n    Returns:\n        (array): list of images that performed color brightness.\n    \"\"\"\n    alpha = 1.0 + np.random.uniform(-var, var)\n\n    out_images = []\n    for image in images:\n        img_bright = np.zeros(image.shape).astype(image.dtype)\n        out_images.append(blend(image, img_bright, alpha))\n    return out_images\n\n\ndef contrast_list(var, images):\n    \"\"\"\n    Perform color contrast on the given list of images.\n    Args:\n        var (float): variance.\n        images (list): list of images to perform color contrast.\n    Returns:\n        (array): image that performed color contrast.\n    \"\"\"\n    alpha = 1.0 + np.random.uniform(-var, var)\n\n    out_images = []\n    for image in images:\n        img_gray = grayscale(image)\n        img_gray.fill(np.mean(img_gray[0]))\n        out_images.append(blend(image, img_gray, alpha))\n    return out_images\n\n\ndef color_jitter(image, img_brightness=0, img_contrast=0, img_saturation=0):\n    \"\"\"\n    Perform color jitter on the given image.\n    Args:\n        image (array): image to perform color jitter.\n     
   img_brightness (float): jitter ratio for brightness.\n        img_contrast (float): jitter ratio for contrast.\n        img_saturation (float): jitter ratio for saturation.\n    Returns:\n        image (array): the jittered image.\n    \"\"\"\n    jitter = []\n    if img_brightness != 0:\n        jitter.append(\"brightness\")\n    if img_contrast != 0:\n        jitter.append(\"contrast\")\n    if img_saturation != 0:\n        jitter.append(\"saturation\")\n\n    if len(jitter) > 0:\n        order = np.random.permutation(np.arange(len(jitter)))\n        for idx in range(0, len(jitter)):\n            if jitter[order[idx]] == \"brightness\":\n                image = brightness(img_brightness, image)\n            elif jitter[order[idx]] == \"contrast\":\n                image = contrast(img_contrast, image)\n            elif jitter[order[idx]] == \"saturation\":\n                image = saturation(img_saturation, image)\n    return image\n\n\ndef revert_scaled_boxes(size, boxes, img_height, img_width):\n    \"\"\"\n    Revert scaled input boxes to match the original image size.\n    Args:\n        size (int): size of the cropped image.\n        boxes (array): shape (num_boxes, 4).\n        img_height (int): height of original image.\n        img_width (int): width of original image.\n    Returns:\n        reverted_boxes (array): boxes scaled back to the original image size.\n    \"\"\"\n    scaled_aspect = np.min([img_height, img_width])\n    scale_ratio = scaled_aspect / size\n    reverted_boxes = boxes * scale_ratio\n    return reverted_boxes\n"
  },
  {
    "path": "slowfast/datasets/decoder.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport logging\nimport math\nimport random\n\nimport numpy as np\nimport torch\nimport torchvision.io as io\n\nfrom . import transform as transform\n\nlogger = logging.getLogger(__name__)\n\n\ndef temporal_sampling(frames, start_idx, end_idx, num_samples):\n    \"\"\"\n    Given the start and end frame index, sample num_samples frames between\n    the start and end with equal interval.\n    Args:\n        frames (tensor): a tensor of video frames, dimension is\n            `num video frames` x `channel` x `height` x `width`.\n        start_idx (int): the index of the start frame.\n        end_idx (int): the index of the end frame.\n        num_samples (int): number of frames to sample.\n    Returns:\n        frames (tersor): a tensor of temporal sampled video frames, dimension is\n            `num clip frames` x `channel` x `height` x `width`.\n    \"\"\"\n    index = torch.linspace(start_idx, end_idx, num_samples)\n    index = torch.clamp(index, 0, frames.shape[0] - 1).long()\n    frames = torch.index_select(frames, 0, index)\n    return frames\n\n\ndef get_start_end_idx(\n    video_size, clip_size, clip_idx, num_clips_uniform, use_offset=False\n):\n    \"\"\"\n    Sample a clip of size clip_size from a video of size video_size and\n    return the indices of the first and last frame of the clip. If clip_idx is\n    -1, the clip is randomly sampled, otherwise uniformly split the video to\n    num_clips_uniform clips, and select the start and end index of clip_idx-th video\n    clip.\n    Args:\n        video_size (int): number of overall frames.\n        clip_size (int): size of the clip to sample from the frames.\n        clip_idx (int): if clip_idx is -1, perform random jitter sampling. 
If\n            clip_idx is larger than -1, uniformly split the video to num_clips_uniform\n            clips, and select the start and end index of the clip_idx-th video\n            clip.\n        num_clips_uniform (int): overall number of clips to uniformly sample from the\n            given video for testing.\n    Returns:\n        start_idx (int): the start frame index.\n        end_idx (int): the end frame index.\n    \"\"\"\n    delta = max(video_size - clip_size, 0)\n    if clip_idx == -1:\n        # Random temporal sampling.\n        start_idx = random.uniform(0, delta)\n    else:\n        if use_offset:\n            if num_clips_uniform == 1:\n                # Take the center clip if num_clips_uniform is 1.\n                start_idx = math.floor(delta / 2)\n            else:\n                # Uniformly sample the clip with the given index.\n                start_idx = clip_idx * math.floor(delta / (num_clips_uniform - 1))\n        else:\n            # Uniformly sample the clip with the given index.\n            start_idx = delta * clip_idx / num_clips_uniform\n    end_idx = start_idx + clip_size - 1\n\n    return start_idx, end_idx, start_idx / delta if delta != 0 else 0.0\n\n\ndef get_multiple_start_end_idx(\n    video_size,\n    clip_sizes,\n    clip_idx,\n    num_clips_uniform,\n    min_delta=0,\n    max_delta=math.inf,\n    use_offset=False,\n):\n    \"\"\"\n    Sample a clip of size clip_size from a video of size video_size and\n    return the indices of the first and last frame of the clip. If clip_idx is\n    -1, the clip is randomly sampled, otherwise uniformly split the video to\n    num_clips_uniform clips, and select the start and end index of clip_idx-th video\n    clip.\n    Args:\n        video_size (int): number of overall frames.\n        clip_sizes (list): size of the clip to sample from the frames.\n        clip_idx (int): if clip_idx is -1, perform random jitter sampling. 
If\n            clip_idx is larger than -1, uniformly split the video to num_clips_uniform\n            clips, and select the start and end index of the clip_idx-th video\n            clip.\n        num_clips_uniform (int): overall number of clips to uniformly sample from the\n            given video for testing.\n    Returns:\n        start_idx (int): the start frame index.\n        end_idx (int): the end frame index.\n    \"\"\"\n\n    def sample_clips(\n        video_size,\n        clip_sizes,\n        clip_idx,\n        num_clips_uniform,\n        min_delta=0,\n        max_delta=math.inf,\n        num_retries=100,\n        use_offset=False,\n    ):\n        se_inds = np.empty((0, 2))\n        dt = np.empty((0))\n        for clip_size in clip_sizes:\n            for i_try in range(num_retries):\n                # clip_size = int(clip_size)\n                max_start = max(video_size - clip_size, 0)\n                if clip_idx == -1:\n                    # Random temporal sampling.\n                    start_idx = random.uniform(0, max_start)\n                else:  # Uniformly sample the clip with the given index.\n                    if use_offset:\n                        if num_clips_uniform == 1:\n                            # Take the center clip if num_clips is 1.\n                            start_idx = math.floor(max_start / 2)\n                        else:\n                            start_idx = clip_idx * math.floor(\n                                max_start / (num_clips_uniform - 1)\n                            )\n                    else:\n                        start_idx = max_start * clip_idx / num_clips_uniform\n\n                end_idx = start_idx + clip_size - 1\n\n                se_inds_new = np.append(se_inds, [[start_idx, end_idx]], axis=0)\n                if se_inds.shape[0] < 1:\n                    se_inds = se_inds_new\n                    break\n\n                se_inds_new = np.sort(se_inds_new, 0)\n                t_start, 
t_end = se_inds_new[:, 0], se_inds_new[:, 1]\n                dt = t_start[1:] - t_end[:-1]\n                if (\n                    any(dt < min_delta) or any(dt > max_delta)\n                ) and i_try < num_retries - 1:\n                    continue  # there is overlap\n                else:\n                    se_inds = se_inds_new\n                    break\n        return se_inds, dt\n\n    num_retries, goodness = 100, -math.inf\n    for _ in range(num_retries):\n        se_inds, dt = sample_clips(\n            video_size,\n            clip_sizes,\n            clip_idx,\n            num_clips_uniform,\n            min_delta,\n            max_delta,\n            100,\n            use_offset,\n        )\n        success = not (any(dt < min_delta) or any(dt > max_delta))\n        if success or clip_idx != -1:\n            se_final, dt_final = se_inds, dt\n            break\n        else:\n            cur_goodness = np.r_[dt[dt < min_delta], -dt[dt > max_delta]].sum()\n            if goodness < cur_goodness:\n                se_final, dt_final = se_inds, dt\n                goodness = cur_goodness\n\n    delta_clips = np.concatenate((np.array([0]), dt_final))\n    start_end_delta_time = np.c_[se_final, delta_clips]\n\n    return start_end_delta_time\n\n\ndef pyav_decode_stream(\n    container, start_pts, end_pts, stream, stream_name, buffer_size=0\n):\n    \"\"\"\n    Decode the video with PyAV decoder.\n    Args:\n        container (container): PyAV container.\n        start_pts (int): the starting Presentation TimeStamp to fetch the\n            video frames.\n        end_pts (int): the ending Presentation TimeStamp of the decoded frames.\n        stream (stream): PyAV stream.\n        stream_name (dict): a dictionary of streams. 
For example, {\"video\": 0}\n            means video stream at stream index 0.\n        buffer_size (int): number of additional frames to decode beyond end_pts.\n    Returns:\n        result (list): list of frames decoded.\n        max_pts (int): max Presentation TimeStamp of the video sequence.\n    \"\"\"\n    # Seeking in the stream is imprecise. Thus, seek to an earlier PTS by a\n    # margin pts.\n    margin = 1024\n    seek_offset = max(start_pts - margin, 0)\n\n    container.seek(seek_offset, any_frame=False, backward=True, stream=stream)\n    frames = {}\n    buffer_count = 0\n    max_pts = 0\n    for frame in container.decode(**stream_name):\n        max_pts = max(max_pts, frame.pts)\n        if frame.pts < start_pts:\n            continue\n        if frame.pts <= end_pts:\n            frames[frame.pts] = frame\n        else:\n            buffer_count += 1\n            frames[frame.pts] = frame\n            if buffer_count >= buffer_size:\n                break\n    result = [frames[pts] for pts in sorted(frames)]\n    return result, max_pts\n\n\ndef torchvision_decode(\n    video_handle,\n    sampling_rate,\n    num_frames,\n    clip_idx,\n    video_meta,\n    num_clips_uniform=10,\n    target_fps=30,\n    modalities=(\"visual\",),\n    max_spatial_scale=0,\n    use_offset=False,\n    min_delta=-math.inf,\n    max_delta=math.inf,\n):\n    \"\"\"\n    If video_meta is not empty, perform temporal selective decoding to sample a\n    clip from the video with TorchVision decoder. If video_meta is empty, decode\n    the entire video and update the video_meta.\n    Args:\n        video_handle (bytes): raw bytes of the video file.\n        sampling_rate (int): frame sampling rate (interval between two sampled\n            frames).\n        num_frames (int): number of frames to sample.\n        clip_idx (int): if clip_idx is -1, perform random temporal\n            sampling. 
If clip_idx is larger than -1, uniformly split the\n            video to num_clips_uniform clips, and select the clip_idx-th video clip.\n        video_meta (dict): a dict containing VideoMetaData. Details can be found\n            at `pytorch/vision/torchvision/io/_video_opt.py`.\n        num_clips_uniform (int): overall number of clips to uniformly sample from the\n            given video.\n        target_fps (int): the input video may have a different fps; convert it to\n            the target video fps.\n        modalities (tuple): tuple of modalities to decode. Currently only\n            support `visual`, planning to support `acoustic` soon.\n        max_spatial_scale (int): the resolution of the spatial shorter\n            edge size during decoding.\n        min_delta (int): minimum distance between clips when sampling multiple.\n        max_delta (int): max distance between clips when sampling multiple.\n    Returns:\n        frames (tensor): decoded frames from the video.\n        fps (float): the number of frames per second of the video.\n        decode_all_video (bool): if True, the entire video was decoded.\n    \"\"\"\n    # Convert the bytes to a tensor.\n    video_tensor = torch.from_numpy(np.frombuffer(video_handle, dtype=np.uint8))\n\n    decode_all_video = True\n    video_start_pts, video_end_pts = 0, -1\n    # The video_meta is empty, fetch the meta data from the raw video.\n    if len(video_meta) == 0:\n        # Tracking the meta info for selective decoding in the future.\n        meta = io._probe_video_from_memory(video_tensor)\n        # Using the information from video_meta to perform selective decoding.\n        video_meta[\"video_timebase\"] = meta.video_timebase\n        video_meta[\"video_numerator\"] = meta.video_timebase.numerator\n        video_meta[\"video_denominator\"] = meta.video_timebase.denominator\n        video_meta[\"has_video\"] = meta.has_video\n        video_meta[\"video_duration\"] = meta.video_duration\n        
video_meta[\"video_fps\"] = meta.video_fps\n        video_meta[\"audio_timebas\"] = meta.audio_timebase\n        video_meta[\"audio_numerator\"] = meta.audio_timebase.numerator\n        video_meta[\"audio_denominator\"] = meta.audio_timebase.denominator\n        video_meta[\"has_audio\"] = meta.has_audio\n        video_meta[\"audio_duration\"] = meta.audio_duration\n        video_meta[\"audio_sample_rate\"] = meta.audio_sample_rate\n\n    fps = video_meta[\"video_fps\"]\n\n    if len(video_meta) > 0 and (\n        video_meta[\"has_video\"]\n        and video_meta[\"video_denominator\"] > 0\n        and video_meta[\"video_duration\"] > 0\n        and fps * video_meta[\"video_duration\"]\n        > sum(T * tau for T, tau in zip(num_frames, sampling_rate))\n    ):\n        decode_all_video = False  # try selective decoding\n\n        clip_sizes = [\n            np.maximum(1.0, sampling_rate[i] * num_frames[i] / target_fps * fps)\n            for i in range(len(sampling_rate))\n        ]\n        start_end_delta_time = get_multiple_start_end_idx(\n            fps * video_meta[\"video_duration\"],\n            clip_sizes,\n            clip_idx,\n            num_clips_uniform,\n            min_delta=min_delta,\n            max_delta=max_delta,\n            use_offset=use_offset,\n        )\n        frames_out = [None] * len(num_frames)\n        for k in range(len(num_frames)):\n            pts_per_frame = video_meta[\"video_denominator\"] / video_meta[\"video_fps\"]\n            video_start_pts = int(start_end_delta_time[k, 0] * pts_per_frame)\n            video_end_pts = int(start_end_delta_time[k, 1] * pts_per_frame)\n\n            # Decode the raw video with the tv decoder.\n            v_frames, _ = io._read_video_from_memory(\n                video_tensor,\n                seek_frame_margin=1.0,\n                read_video_stream=\"visual\" in modalities,\n                video_width=0,\n                video_height=0,\n                
video_min_dimension=max_spatial_scale,\n                video_pts_range=(video_start_pts, video_end_pts),\n                video_timebase_numerator=video_meta[\"video_numerator\"],\n                video_timebase_denominator=video_meta[\"video_denominator\"],\n                read_audio_stream=0,\n            )\n            if v_frames is None or v_frames.shape == torch.Size([0]):\n                decode_all_video = True\n                logger.info(\"TV decode FAILED try decode all\")\n                break\n            frames_out[k] = v_frames\n\n    if decode_all_video:\n        # failed selective decoding\n        decode_all_video = True\n        video_start_pts, video_end_pts = 0, -1\n        start_end_delta_time = None\n        v_frames, _ = io._read_video_from_memory(\n            video_tensor,\n            seek_frame_margin=1.0,\n            read_video_stream=\"visual\" in modalities,\n            video_width=0,\n            video_height=0,\n            video_min_dimension=max_spatial_scale,\n            video_pts_range=(video_start_pts, video_end_pts),\n            video_timebase_numerator=video_meta[\"video_numerator\"],\n            video_timebase_denominator=video_meta[\"video_denominator\"],\n            read_audio_stream=0,\n        )\n        if v_frames.shape == torch.Size([0]):\n            v_frames = None\n            logger.info(\"TV decode FAILED try decode all\")\n\n        frames_out = [v_frames]\n\n    # A None entry (failed decode) or a clip with zero frames both count as\n    # an empty video.\n    if any(t is None or t.shape[0] == 0 for t in frames_out):\n        frames_out = [None]\n        logger.info(\"TV decode FAILED: Decoded empty video\")\n\n    return frames_out, fps, decode_all_video, start_end_delta_time\n\n\ndef pyav_decode(\n    container,\n    sampling_rate,\n    num_frames,\n    clip_idx,\n    num_clips_uniform=10,\n    target_fps=30,\n    use_offset=False,\n):\n    \"\"\"\n    Convert the video from its original fps to the target_fps. 
If the video\n    supports selective decoding (contains decoding information in the video head),\n    then perform temporal selective decoding and sample a clip from the video\n    with the PyAV decoder. If the video does not support selective decoding,\n    decode the entire video.\n\n    Args:\n        container (container): pyav container.\n        sampling_rate (int): frame sampling rate (interval between two sampled\n            frames).\n        num_frames (int): number of frames to sample.\n        clip_idx (int): if clip_idx is -1, perform random temporal sampling. If\n            clip_idx is larger than -1, uniformly split the video to num_clips_uniform\n            clips, and select the clip_idx-th video clip.\n        num_clips_uniform (int): overall number of clips to uniformly sample from the\n            given video.\n        target_fps (int): the input video may have different fps, convert it to\n            the target video fps before frame sampling.\n    Returns:\n        frames (tensor): decoded frames from the video. Return None if no\n            video stream was found.\n        fps (float): the number of frames per second of the video.\n        decode_all_video (bool): If True, the entire video was decoded.\n    \"\"\"\n    # Try to fetch the decoding information from the video head. 
Some of the\n    # videos does not support fetching the decoding information, for that case\n    # it will get None duration.\n    fps = float(container.streams.video[0].average_rate)\n    frames_length = container.streams.video[0].frames\n    duration = container.streams.video[0].duration\n\n    if duration is None:\n        # If failed to fetch the decoding information, decode the entire video.\n        decode_all_video = True\n        video_start_pts, video_end_pts = 0, math.inf\n    else:\n        # Perform selective decoding.\n        decode_all_video = False\n        clip_size = np.maximum(\n            1.0, np.ceil(sampling_rate * (num_frames - 1) / target_fps * fps)\n        )\n        start_idx, end_idx, fraction = get_start_end_idx(\n            frames_length,\n            clip_size,\n            clip_idx,\n            num_clips_uniform,\n            use_offset=use_offset,\n        )\n        timebase = duration / frames_length\n        video_start_pts = int(start_idx * timebase)\n        video_end_pts = int(end_idx * timebase)\n\n    frames = None\n    # If video stream was found, fetch video frames from the video.\n    if container.streams.video:\n        video_frames, max_pts = pyav_decode_stream(\n            container,\n            video_start_pts,\n            video_end_pts,\n            container.streams.video[0],\n            {\"video\": 0},\n        )\n        container.close()\n\n        frames = [frame.to_rgb().to_ndarray() for frame in video_frames]\n        frames = torch.as_tensor(np.stack(frames))\n    return frames, fps, decode_all_video\n\n\ndef decode(\n    container,\n    sampling_rate,\n    num_frames,\n    clip_idx=-1,\n    num_clips_uniform=10,\n    video_meta=None,\n    target_fps=30,\n    backend=\"pyav\",\n    max_spatial_scale=0,\n    use_offset=False,\n    time_diff_prob=0.0,\n    gaussian_prob=0.0,\n    min_delta=-math.inf,\n    max_delta=math.inf,\n    temporally_rnd_clips=True,\n):\n    \"\"\"\n    Decode the video and 
perform temporal sampling.\n    Args:\n        container (container): pyav container.\n        sampling_rate (list of ints): frame sampling rate (interval between two sampled\n            frames).\n        num_frames (list of ints): number of frames to sample.\n        clip_idx (int): if clip_idx is -1, perform random temporal\n            sampling. If clip_idx is larger than -1, uniformly split the\n            video to num_clips_uniform clips, and select the\n            clip_idx-th video clip.\n        num_clips_uniform (int): overall number of clips to uniformly\n            sample from the given video.\n        video_meta (dict): a dict containing VideoMetaData. Details can be found\n            at `pytorch/vision/torchvision/io/_video_opt.py`.\n        target_fps (int): the input video may have different fps, convert it to\n            the target video fps before frame sampling.\n        backend (str): decoding backend, either `pyav` or `torchvision`. The\n            default is `pyav`.\n        max_spatial_scale (int): keep the aspect ratio and resize the frame so\n            that the shorter edge size is max_spatial_scale. 
Only used in\n            `torchvision` backend.\n    Returns:\n        frames (tensor): decoded frames from the video.\n    \"\"\"\n    # Currently support two decoders: 1) PyAV, and 2) TorchVision.\n    assert clip_idx >= -1, \"Not valied clip_idx {}\".format(clip_idx)\n    assert len(sampling_rate) == len(num_frames)\n    num_decode = len(num_frames)\n    num_frames_orig = num_frames\n    if num_decode > 1 and temporally_rnd_clips:\n        ind_clips = np.random.permutation(num_decode)\n        sampling_rate = [sampling_rate[i] for i in ind_clips]\n        num_frames = [num_frames[i] for i in ind_clips]\n    else:\n        ind_clips = np.arange(num_decode)  # clips come temporally ordered from decoder\n    try:\n        if backend == \"pyav\":\n            assert min_delta == -math.inf and max_delta == math.inf, (\n                \"delta sampling not supported in pyav\"\n            )\n            frames_decoded, fps, decode_all_video = pyav_decode(\n                container,\n                sampling_rate,\n                num_frames,\n                clip_idx,\n                num_clips_uniform,\n                target_fps,\n                use_offset=use_offset,\n            )\n        elif backend == \"torchvision\":\n            (\n                frames_decoded,\n                fps,\n                decode_all_video,\n                start_end_delta_time,\n            ) = torchvision_decode(\n                container,\n                sampling_rate,\n                num_frames,\n                clip_idx,\n                video_meta,\n                num_clips_uniform,\n                target_fps,\n                (\"visual\",),\n                max_spatial_scale,\n                use_offset=use_offset,\n                min_delta=min_delta,\n                max_delta=max_delta,\n            )\n        else:\n            raise NotImplementedError(\"Unknown decoding backend {}\".format(backend))\n    except Exception as e:\n        print(\"Failed to 
decode by {} with exception: {}\".format(backend, e))\n        return None, None, None\n\n    # Return None if the frames was not decoded successfully.\n    if frames_decoded is None or None in frames_decoded:\n        return None, None, None\n\n    if not isinstance(frames_decoded, list):\n        frames_decoded = [frames_decoded]\n    num_decoded = len(frames_decoded)\n    clip_sizes = [\n        np.maximum(1.0, sampling_rate[i] * num_frames[i] / target_fps * fps)\n        for i in range(len(sampling_rate))\n    ]\n\n    if decode_all_video:  # full video was decoded (not trimmed yet)\n        assert num_decoded == 1 and start_end_delta_time is None\n        start_end_delta_time = get_multiple_start_end_idx(\n            frames_decoded[0].shape[0],\n            clip_sizes,\n            clip_idx if decode_all_video else 0,\n            num_clips_uniform if decode_all_video else 1,\n            min_delta=min_delta,\n            max_delta=max_delta,\n            use_offset=use_offset,\n        )\n\n    frames_out, start_inds, time_diff_aug = (\n        [None] * num_decode,\n        [None] * num_decode,\n        [None] * num_decode,\n    )\n    augment_vid = gaussian_prob > 0.0 or time_diff_prob > 0.0\n    for k in range(num_decode):\n        T = num_frames[k]\n        # Perform temporal sampling from the decoded video.\n\n        if decode_all_video:\n            frames = frames_decoded[0]\n            if augment_vid:\n                frames = frames.clone()\n            start_idx, end_idx = (\n                start_end_delta_time[k, 0],\n                start_end_delta_time[k, 1],\n            )\n        else:\n            frames = frames_decoded[k]\n            # video is already trimmed so we just need subsampling\n            start_idx, end_idx, clip_position = get_start_end_idx(\n                frames.shape[0], clip_sizes[k], 0, 1\n            )\n        if augment_vid:\n            frames, time_diff_aug[k] = transform.augment_raw_frames(\n                
frames, time_diff_prob, gaussian_prob\n            )\n        frames_k = temporal_sampling(frames, start_idx, end_idx, T)\n        frames_out[k] = frames_k\n\n    # if we shuffle, need to randomize the output, otherwise it will always be past->future\n    if num_decode > 1 and temporally_rnd_clips:\n        frames_out_, time_diff_aug_ = [None] * num_decode, [None] * num_decode\n        start_end_delta_time_ = np.zeros_like(start_end_delta_time)\n        for i, j in enumerate(ind_clips):\n            frames_out_[j] = frames_out[i]\n            start_end_delta_time_[j, :] = start_end_delta_time[i, :]\n            time_diff_aug_[j] = time_diff_aug[i]\n\n        frames_out = frames_out_\n        start_end_delta_time = start_end_delta_time_\n        time_diff_aug = time_diff_aug_\n        assert all(\n            frames_out[i].shape[0] == num_frames_orig[i] for i in range(num_decode)\n        )\n\n    return frames_out, start_end_delta_time, time_diff_aug\n"
  },
  {
    "path": "slowfast/datasets/imagenet.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n\nimport json\nimport os\nimport random\nimport re\n\nimport numpy as np\nimport slowfast.datasets.transform as transform\nimport slowfast.utils.logging as logging\nimport torch\nimport torch.utils.data\nfrom PIL import Image\nfrom slowfast.models.utils import calc_mvit_feature_geometry\n\n# import cv2\nfrom slowfast.utils.env import pathmgr\nfrom torchvision import transforms as transforms_tv\n\nfrom .build import DATASET_REGISTRY\nfrom .transform import MaskingGenerator, transforms_imagenet_train\n\nlogger = logging.get_logger(__name__)\n\n\n@DATASET_REGISTRY.register()\nclass Imagenet(torch.utils.data.Dataset):\n    \"\"\"ImageNet dataset.\"\"\"\n\n    def __init__(self, cfg, mode, num_retries=10):\n        self.num_retries = num_retries\n        self.cfg = cfg\n        self.mode = mode\n        self.data_path = cfg.DATA.PATH_TO_DATA_DIR\n        assert mode in [\n            \"train\",\n            \"val\",\n            \"test\",\n        ], \"Split '{}' not supported for ImageNet\".format(mode)\n        logger.info(\"Constructing ImageNet {}...\".format(mode))\n        if cfg.DATA.PATH_TO_PRELOAD_IMDB == \"\":\n            self._construct_imdb()\n        else:\n            self._load_imdb()\n        self.num_videos = len(self._imdb)\n        self.feat_size, self.feat_stride = calc_mvit_feature_geometry(cfg)\n        self.dummy_output = None\n\n    def _load_imdb(self):\n        split_path = os.path.join(\n            self.cfg.DATA.PATH_TO_PRELOAD_IMDB, f\"{self.mode}.json\"\n        )\n        with pathmgr.open(split_path, \"r\") as f:\n            data = f.read()\n        self._imdb = json.loads(data)\n\n    def _construct_imdb(self):\n        \"\"\"Constructs the imdb.\"\"\"\n        # Compile the split data path\n        split_path = os.path.join(self.data_path, self.mode)\n        logger.info(\"{} data path: {}\".format(self.mode, split_path))\n        # Images are stored per class in 
subdirs (format: n<number>)\n        split_files = pathmgr.ls(split_path)\n        self._class_ids = sorted(f for f in split_files if re.match(r\"^n[0-9]+$\", f))\n        # Map ImageNet class ids to contiguous ids\n        self._class_id_cont_id = {v: i for i, v in enumerate(self._class_ids)}\n        # Construct the image db\n        self._imdb = []\n        for class_id in self._class_ids:\n            cont_id = self._class_id_cont_id[class_id]\n            im_dir = os.path.join(split_path, class_id)\n            for im_name in pathmgr.ls(im_dir):\n                im_path = os.path.join(im_dir, im_name)\n                self._imdb.append({\"im_path\": im_path, \"class\": cont_id})\n        logger.info(\"Number of images: {}\".format(len(self._imdb)))\n        logger.info(\"Number of classes: {}\".format(len(self._class_ids)))\n\n    def load_image(self, im_path):\n        \"\"\"Prepares the image for network input with format of CHW RGB float\"\"\"\n        with pathmgr.open(im_path, \"rb\") as f:\n            with Image.open(f) as im:\n                im = im.convert(\"RGB\")\n        im = torch.from_numpy(np.array(im).astype(np.float32) / 255.0)\n        # H W C to C H W\n        im = im.permute([2, 0, 1])\n        return im\n\n    def _prepare_im_res(self, im_path):\n        # Prepare resnet style augmentation.\n        im = self.load_image(im_path)\n        # Train and test setups differ\n        train_size, test_size = (\n            self.cfg.DATA.TRAIN_CROP_SIZE,\n            self.cfg.DATA.TEST_CROP_SIZE,\n        )\n        if self.mode == \"train\":\n            # For training use random_sized_crop, horizontal_flip, augment, lighting\n            im = transform.random_sized_crop_img(\n                im,\n                train_size,\n                jitter_scale=self.cfg.DATA.TRAIN_JITTER_SCALES_RELATIVE,\n                jitter_aspect=self.cfg.DATA.TRAIN_JITTER_ASPECT_RELATIVE,\n            )\n            im, _ = transform.horizontal_flip(prob=0.5, 
images=im)\n            # im = transforms.augment(im, cfg.TRAIN.AUGMENT)\n            im = transform.lighting_jitter(\n                im,\n                0.1,\n                self.cfg.DATA.TRAIN_PCA_EIGVAL,\n                self.cfg.DATA.TRAIN_PCA_EIGVEC,\n            )\n        else:\n            # For testing use scale and center crop\n            im, _ = transform.uniform_crop(\n                im, test_size, spatial_idx=1, scale_size=train_size\n            )\n        # For training and testing use color normalization\n        im = transform.color_normalization(im, self.cfg.DATA.MEAN, self.cfg.DATA.STD)\n        return im\n\n    def _prepare_im_tf(self, im_path):\n        with pathmgr.open(im_path, \"rb\") as f:\n            with Image.open(f) as im:\n                im = im.convert(\"RGB\")\n        # Convert HWC/BGR/int to HWC/RGB/float format for applying transforms\n        train_size, test_size = (\n            self.cfg.DATA.TRAIN_CROP_SIZE,\n            self.cfg.DATA.TEST_CROP_SIZE,\n        )\n\n        if self.mode == \"train\":\n            aug_transform = transforms_imagenet_train(\n                img_size=(train_size, train_size),\n                color_jitter=self.cfg.AUG.COLOR_JITTER,\n                auto_augment=self.cfg.AUG.AA_TYPE,\n                interpolation=self.cfg.AUG.INTERPOLATION,\n                re_prob=self.cfg.AUG.RE_PROB,\n                re_mode=self.cfg.AUG.RE_MODE,\n                re_count=self.cfg.AUG.RE_COUNT,\n                mean=self.cfg.DATA.MEAN,\n                std=self.cfg.DATA.STD,\n            )\n        else:\n            t = []\n            if self.cfg.DATA.IN_VAL_CROP_RATIO == 0.0:\n                t.append(\n                    transforms_tv.Resize(\n                        (test_size, test_size),\n                        interpolation=transforms_tv.InterpolationMode.BICUBIC,\n                    ),\n                )\n            else:\n                size = int(\n                    (1.0 / 
self.cfg.DATA.IN_VAL_CROP_RATIO) * test_size\n                )  # = 1/0.875 * test_size\n                t.append(\n                    transforms_tv.Resize(\n                        size, interpolation=transforms_tv.InterpolationMode.BICUBIC\n                    ),  # to maintain same ratio w.r.t. 224 images\n                )\n                t.append(transforms_tv.CenterCrop(test_size))\n            t.append(transforms_tv.ToTensor())\n            t.append(transforms_tv.Normalize(self.cfg.DATA.MEAN, self.cfg.DATA.STD))\n            aug_transform = transforms_tv.Compose(t)\n        im = aug_transform(im)\n        return im\n\n    def _prepare_im_masked(self, im_path):\n        with pathmgr.open(im_path, \"rb\") as f:\n            with Image.open(f) as im:\n                im = im.convert(\"RGB\")\n\n        if self.mode in [\"train\", \"val\"]:\n            depth = self.cfg.MASK.PRETRAIN_DEPTH[-1]\n            assert depth == max(self.cfg.MASK.PRETRAIN_DEPTH)\n            max_mask = self.cfg.AUG.MAX_MASK_PATCHES_PER_BLOCK\n            # use feat geometry for determining num masks\n            mask_window_size = self.feat_size[depth][-1]\n            num_mask = round(\n                self.feat_size[depth][-1]\n                * self.feat_size[depth][-2]\n                * self.cfg.AUG.MASK_RATIO\n            )\n            min_mask = num_mask // 5\n\n            train_size = self.cfg.DATA.TRAIN_CROP_SIZE\n            mask_generator = MaskingGenerator(\n                mask_window_size,\n                num_masking_patches=num_mask,\n                max_num_patches=max_mask,\n                min_num_patches=min_mask,\n            )\n            aug_transform = transforms_imagenet_train(\n                img_size=(train_size, train_size),\n                scale=self.cfg.DATA.TRAIN_JITTER_SCALES_RELATIVE,\n                ratio=self.cfg.DATA.TRAIN_JITTER_ASPECT_RELATIVE,\n                interpolation=self.cfg.AUG.INTERPOLATION,\n                
color_jitter=self.cfg.AUG.COLOR_JITTER,\n                auto_augment=self.cfg.AUG.AA_TYPE,\n                re_prob=0.0,\n                mean=self.cfg.DATA.MEAN,\n                std=self.cfg.DATA.STD,\n            )\n            im = aug_transform(im)\n            mask = mask_generator()\n            return [im, torch.Tensor(), mask]\n        else:\n            raise NotImplementedError\n        return aug_transform(im)\n\n    def __load__(self, index):\n        try:\n            # Load the image\n            im_path = self._imdb[index][\"im_path\"]\n            # Prepare the image for training / testing\n            if self.cfg.AUG.ENABLE:\n                if self.cfg.AUG.GEN_MASK_LOADER:\n                    return self._prepare_im_masked(im_path)\n                elif self.mode == \"train\" and self.cfg.AUG.NUM_SAMPLE > 1:\n                    im = []\n                    for _ in range(self.cfg.AUG.NUM_SAMPLE):\n                        crop = self._prepare_im_tf(im_path)\n                        im.append(crop)\n                    return im\n                else:\n                    im = self._prepare_im_tf(im_path)\n                    return im\n            else:\n                im = self._prepare_im_res(im_path)\n                return im\n        except Exception:\n            return None\n\n    def __getitem__(self, index):\n        if self.dummy_output is not None:\n            return self.dummy_output\n        # if the current image is corrupted, load a different image.\n        for _ in range(self.num_retries):\n            im = self.__load__(index)\n            # Data corrupted, retry with a different image.\n            if im is None:\n                assert self.mode == \"train\", f\"{index} failed loading\"\n                print(f\"{index} failed. 
retry\")\n                index = random.randint(0, len(self._imdb) - 1)\n            else:\n                break\n        # Retrieve the label\n        label = self._imdb[index][\"class\"]\n        if isinstance(im, list):\n            if self.cfg.AUG.GEN_MASK_LOADER:\n                dummy = torch.Tensor()\n                label = torch.Tensor()\n            else:\n                label = [label for _ in range(len(im))]\n                dummy = [torch.Tensor() for _ in range(len(im))]\n            if self.cfg.DATA.DUMMY_LOAD:\n                if self.dummy_output is None:\n                    self.dummy_output = (im, label, index, dummy, {})\n            return im, label, index, dummy, {}\n        else:\n            dummy = torch.Tensor()\n            if self.cfg.DATA.DUMMY_LOAD:\n                if self.dummy_output is None:\n                    self.dummy_output = ([im], label, index, dummy, {})\n            return [im], label, index, dummy, {}\n\n    def __len__(self):\n        return len(self._imdb)\n"
  },
  {
    "path": "slowfast/datasets/kinetics.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport os\nimport random\n\nimport numpy as np\nimport pandas\nimport slowfast.utils.logging as logging\nimport torch\nimport torch.utils.data\nfrom slowfast.utils.env import pathmgr\nfrom torchvision import transforms\n\nfrom . import (\n    decoder as decoder,\n    transform as transform,\n    utils as utils,\n    video_container as container,\n)\nfrom .build import DATASET_REGISTRY\nfrom .random_erasing import RandomErasing\nfrom .transform import create_random_augment, MaskingGenerator, MaskingGenerator3D\n\nlogger = logging.get_logger(__name__)\n\n\n@DATASET_REGISTRY.register()\nclass Kinetics(torch.utils.data.Dataset):\n    \"\"\"\n    Kinetics video loader. Construct the Kinetics video loader, then sample\n    clips from the videos. For training and validation, a single clip is\n    randomly sampled from every video with random cropping, scaling, and\n    flipping. For testing, multiple clips are uniformaly sampled from every\n    video with uniform cropping. For uniform cropping, we take the left, center,\n    and right crop if the width is larger than height, or take top, center, and\n    bottom crop if the height is larger than the width.\n    \"\"\"\n\n    def __init__(self, cfg, mode, num_retries=100):\n        \"\"\"\n        Construct the Kinetics video loader with a given csv file. 
The format of\n        the csv file is:\n        ```\n        path_to_video_1 label_1\n        path_to_video_2 label_2\n        ...\n        path_to_video_N label_N\n        ```\n        Args:\n            cfg (CfgNode): configs.\n            mode (string): Options include `train`, `val`, or `test` mode.\n                For the train and val mode, the data loader will take data\n                from the train or val set, and sample one clip per video.\n                For the test mode, the data loader will take data from the test set,\n                and sample multiple clips per video.\n            num_retries (int): number of retries.\n        \"\"\"\n        # Only support train, val, and test mode.\n        assert mode in [\n            \"train\",\n            \"val\",\n            \"test\",\n        ], \"Split '{}' not supported for Kinetics\".format(mode)\n        self.mode = mode\n        self.cfg = cfg\n        self.p_convert_gray = self.cfg.DATA.COLOR_RND_GRAYSCALE\n        self.p_convert_dt = self.cfg.DATA.TIME_DIFF_PROB\n        self._video_meta = {}\n        self._num_retries = num_retries\n        self._num_epoch = 0.0\n        self._num_yielded = 0\n        self.skip_rows = self.cfg.DATA.SKIP_ROWS\n        # Chunked CSV loading is only used for training.\n        self.use_chunk_loading = (\n            self.mode in [\"train\"] and self.cfg.DATA.LOADER_CHUNK_SIZE > 0\n        )\n        self.dummy_output = None\n        # For training or validation mode, one single clip is sampled from every\n        # video. For testing, NUM_ENSEMBLE_VIEWS clips are sampled from every\n        # video. 
For every clip, NUM_SPATIAL_CROPS is cropped spatially from\n        # the frames.\n        if self.mode in [\"train\", \"val\"]:\n            self._num_clips = 1\n        elif self.mode in [\"test\"]:\n            self._num_clips = cfg.TEST.NUM_ENSEMBLE_VIEWS * cfg.TEST.NUM_SPATIAL_CROPS\n\n        logger.info(\"Constructing Kinetics {}...\".format(mode))\n        self._construct_loader()\n        self.aug = False\n        self.rand_erase = False\n        self.use_temporal_gradient = False\n        self.temporal_gradient_rate = 0.0\n        self.cur_epoch = 0\n\n        if self.mode == \"train\" and self.cfg.AUG.ENABLE:\n            self.aug = True\n            if self.cfg.AUG.RE_PROB > 0:\n                self.rand_erase = True\n\n    def _construct_loader(self):\n        \"\"\"\n        Construct the video loader.\n        \"\"\"\n        path_to_file = os.path.join(\n            self.cfg.DATA.PATH_TO_DATA_DIR, \"{}.csv\".format(self.mode)\n        )\n        assert pathmgr.exists(path_to_file), \"{} dir not found\".format(path_to_file)\n\n        self._path_to_videos = []\n        self._labels = []\n        self._spatial_temporal_idx = []\n        self.cur_iter = 0\n        self.chunk_epoch = 0\n        self.epoch = 0.0\n        self.skip_rows = self.cfg.DATA.SKIP_ROWS\n\n        with pathmgr.open(path_to_file, \"r\") as f:\n            if self.use_chunk_loading:\n                rows = self._get_chunk(f, self.cfg.DATA.LOADER_CHUNK_SIZE)\n            else:\n                rows = f.read().splitlines()\n            for clip_idx, path_label in enumerate(rows):\n                fetch_info = path_label.split(self.cfg.DATA.PATH_LABEL_SEPARATOR)\n                if len(fetch_info) == 2:\n                    path, label = fetch_info\n                elif len(fetch_info) == 3:\n                    path, fn, label = fetch_info\n                elif len(fetch_info) == 1:\n                    path, label = fetch_info[0], 0\n                else:\n                    raise 
RuntimeError(\n                        \"Failed to parse {}: unexpected fetch info {}.\".format(\n                            path_to_file, fetch_info\n                        )\n                    )\n                for idx in range(self._num_clips):\n                    self._path_to_videos.append(\n                        os.path.join(self.cfg.DATA.PATH_PREFIX, path)\n                    )\n                    self._labels.append(int(label))\n                    self._spatial_temporal_idx.append(idx)\n                    self._video_meta[clip_idx * self._num_clips + idx] = {}\n        assert len(self._path_to_videos) > 0, (\n            \"Failed to load Kinetics split {} from {}\".format(\n                self.mode, path_to_file\n            )\n        )\n        logger.info(\n            \"Constructing kinetics dataloader (size: {} skip_rows {}) from {} \".format(\n                len(self._path_to_videos), self.skip_rows, path_to_file\n            )\n        )\n\n    def _set_epoch_num(self, epoch):\n        self.epoch = epoch\n\n    def _get_chunk(self, path_to_file, chunksize):\n        try:\n            chunk = next(\n                pandas.read_csv(\n                    path_to_file,\n                    chunksize=self.cfg.DATA.LOADER_CHUNK_SIZE,\n                    skiprows=self.skip_rows,\n                )\n            )\n        except Exception:\n            self.skip_rows = 0\n            return self._get_chunk(path_to_file, chunksize)\n        else:\n            return pandas.array(chunk.values.flatten(), dtype=\"string\")\n\n    def __getitem__(self, index):\n        \"\"\"\n        Given the video index, return the list of frames, label, and video\n        index if the video can be fetched and decoded successfully, otherwise\n        repeatedly find a random video that can be decoded as a replacement.\n        Args:\n            index (int): the video index provided by the pytorch sampler.\n        Returns:\n            frames (tensor): 
the frames sampled from the video. The dimension\n                is `channel` x `num frames` x `height` x `width`.\n            label (int): the label of the current video.\n            index (int): if the video provided by pytorch sampler can be\n                decoded, then return the index of the video. If not, return the\n                index of the video replacement that can be decoded.\n        \"\"\"\n        short_cycle_idx = None\n        # When short cycle is used, input index is a tuple.\n        if isinstance(index, tuple):\n            index, self._num_yielded = index\n            if self.cfg.MULTIGRID.SHORT_CYCLE:\n                index, short_cycle_idx = index\n        if self.dummy_output is not None:\n            return self.dummy_output\n        if self.mode in [\"train\", \"val\"]:\n            # -1 indicates random sampling.\n            temporal_sample_index = -1\n            spatial_sample_index = -1\n            min_scale = self.cfg.DATA.TRAIN_JITTER_SCALES[0]\n            max_scale = self.cfg.DATA.TRAIN_JITTER_SCALES[1]\n            crop_size = self.cfg.DATA.TRAIN_CROP_SIZE\n            if short_cycle_idx in [0, 1]:\n                crop_size = int(\n                    round(\n                        self.cfg.MULTIGRID.SHORT_CYCLE_FACTORS[short_cycle_idx]\n                        * self.cfg.MULTIGRID.DEFAULT_S\n                    )\n                )\n            if self.cfg.MULTIGRID.DEFAULT_S > 0:\n                # Decreasing the scale is equivalent to using a larger \"span\"\n                # in a sampling grid.\n                min_scale = int(\n                    round(float(min_scale) * crop_size / self.cfg.MULTIGRID.DEFAULT_S)\n                )\n        elif self.mode in [\"test\"]:\n            temporal_sample_index = (\n                self._spatial_temporal_idx[index] // self.cfg.TEST.NUM_SPATIAL_CROPS\n            )\n            # spatial_sample_index is in [0, 1, 2]. 
Corresponding to left,\n            # center, or right if width is larger than height, and top, middle,\n            # or bottom if height is larger than width.\n            spatial_sample_index = (\n                (self._spatial_temporal_idx[index] % self.cfg.TEST.NUM_SPATIAL_CROPS)\n                if self.cfg.TEST.NUM_SPATIAL_CROPS > 1\n                else 1\n            )\n            min_scale, max_scale, crop_size = (\n                [self.cfg.DATA.TEST_CROP_SIZE] * 3\n                if self.cfg.TEST.NUM_SPATIAL_CROPS > 1\n                else [self.cfg.DATA.TRAIN_JITTER_SCALES[0]] * 2\n                + [self.cfg.DATA.TEST_CROP_SIZE]\n            )\n            # The testing is deterministic and no jitter should be performed.\n            # min_scale and max_scale are expected to be the same.\n            assert len({min_scale, max_scale}) == 1\n        else:\n            raise NotImplementedError(\"Does not support {} mode\".format(self.mode))\n        num_decode = (\n            self.cfg.DATA.TRAIN_CROP_NUM_TEMPORAL if self.mode in [\"train\"] else 1\n        )\n        min_scale, max_scale, crop_size = [min_scale], [max_scale], [crop_size]\n        if len(min_scale) < num_decode:\n            min_scale += [self.cfg.DATA.TRAIN_JITTER_SCALES[0]] * (\n                num_decode - len(min_scale)\n            )\n            max_scale += [self.cfg.DATA.TRAIN_JITTER_SCALES[1]] * (\n                num_decode - len(max_scale)\n            )\n            crop_size += (\n                [self.cfg.MULTIGRID.DEFAULT_S] * (num_decode - len(crop_size))\n                if self.cfg.MULTIGRID.LONG_CYCLE or self.cfg.MULTIGRID.SHORT_CYCLE\n                else [self.cfg.DATA.TRAIN_CROP_SIZE] * (num_decode - len(crop_size))\n            )\n            assert self.mode in [\"train\", \"val\"]\n        # Try to decode and sample a clip from a video. 
If the video cannot be\n        # decoded, repeatedly pick a random replacement that can be decoded.\n        for i_try in range(self._num_retries):\n            video_container = None\n            try:\n                video_container = container.get_video_container(\n                    self._path_to_videos[index],\n                    self.cfg.DATA_LOADER.ENABLE_MULTI_THREAD_DECODE,\n                    self.cfg.DATA.DECODING_BACKEND,\n                )\n            except Exception as e:\n                logger.info(\n                    \"Failed to load video from {} with error {}\".format(\n                        self._path_to_videos[index], e\n                    )\n                )\n                if self.mode not in [\"test\"]:\n                    # let's try another one\n                    index = random.randint(0, len(self._path_to_videos) - 1)\n                continue  # Retry with the randomly selected video (same video in test mode).\n            if video_container is None:\n                logger.warning(\n                    \"Failed to meta load video idx {} from {}; trial {}\".format(\n                        index, self._path_to_videos[index], i_try\n                    )\n                )\n                if self.mode not in [\"test\"] and i_try > self._num_retries // 8:\n                    # let's try another one\n                    index = random.randint(0, len(self._path_to_videos) - 1)\n                continue\n\n            frames_decoded, time_idx_decoded = (\n                [None] * num_decode,\n                [None] * num_decode,\n            )\n\n            # for i in range(num_decode):\n            num_frames = [self.cfg.DATA.NUM_FRAMES]\n            sampling_rate = utils.get_random_sampling_rate(\n                self.cfg.MULTIGRID.LONG_CYCLE_SAMPLING_RATE,\n                self.cfg.DATA.SAMPLING_RATE,\n            )\n            sampling_rate = [sampling_rate]\n            if len(num_frames) < num_decode:\n         
       num_frames.extend(\n                    [num_frames[-1] for i in range(num_decode - len(num_frames))]\n                )\n                # base case where keys have same frame-rate as query\n                sampling_rate.extend(\n                    [sampling_rate[-1] for i in range(num_decode - len(sampling_rate))]\n                )\n            elif len(num_frames) > num_decode:\n                num_frames = num_frames[:num_decode]\n                sampling_rate = sampling_rate[:num_decode]\n\n            if self.mode in [\"train\"]:\n                assert len(min_scale) == len(max_scale) == len(crop_size) == num_decode\n\n            target_fps = self.cfg.DATA.TARGET_FPS\n            if self.cfg.DATA.TRAIN_JITTER_FPS > 0.0 and self.mode in [\"train\"]:\n                target_fps += random.uniform(0.0, self.cfg.DATA.TRAIN_JITTER_FPS)\n\n            # Decode video. Meta info is used to perform selective decoding.\n            frames, time_idx, tdiff = decoder.decode(\n                video_container,\n                sampling_rate,\n                num_frames,\n                temporal_sample_index,\n                self.cfg.TEST.NUM_ENSEMBLE_VIEWS,\n                video_meta=(\n                    self._video_meta[index] if len(self._video_meta) < 5e6 else {}\n                ),  # do not cache on huge datasets\n                target_fps=target_fps,\n                backend=self.cfg.DATA.DECODING_BACKEND,\n                use_offset=self.cfg.DATA.USE_OFFSET_SAMPLING,\n                max_spatial_scale=(\n                    min_scale[0] if all(x == min_scale[0] for x in min_scale) else 0\n                ),  # if self.mode in [\"test\"] else 0,\n                time_diff_prob=self.p_convert_dt if self.mode in [\"train\"] else 0.0,\n                temporally_rnd_clips=True,\n                min_delta=self.cfg.CONTRASTIVE.DELTA_CLIPS_MIN,\n                max_delta=self.cfg.CONTRASTIVE.DELTA_CLIPS_MAX,\n            )\n            frames_decoded = 
frames\n            time_idx_decoded = time_idx\n\n            # If decoding failed (wrong format, video too short, etc.),\n            # select another video.\n            if frames_decoded is None or None in frames_decoded:\n                logger.warning(\n                    \"Failed to decode video idx {} from {}; trial {}\".format(\n                        index, self._path_to_videos[index], i_try\n                    )\n                )\n                if (\n                    self.mode not in [\"test\"]\n                    and (i_try % (self._num_retries // 8)) == 0\n                ):\n                    # let's try another one\n                    index = random.randint(0, len(self._path_to_videos) - 1)\n                continue\n\n            num_aug = (\n                self.cfg.DATA.TRAIN_CROP_NUM_SPATIAL * self.cfg.AUG.NUM_SAMPLE\n                if self.mode in [\"train\"]\n                else 1\n            )\n            num_out = num_aug * num_decode\n            f_out, time_idx_out = [None] * num_out, [None] * num_out\n            idx = -1\n            label = self._labels[index]\n\n            for i in range(num_decode):\n                for _ in range(num_aug):\n                    idx += 1\n                    f_out[idx] = frames_decoded[i].clone()\n                    time_idx_out[idx] = time_idx_decoded[i, :]\n\n                    f_out[idx] = f_out[idx].float()\n                    f_out[idx] = f_out[idx] / 255.0\n\n                    if self.mode in [\"train\"] and self.cfg.DATA.SSL_COLOR_JITTER:\n                        f_out[idx] = transform.color_jitter_video_ssl(\n                            f_out[idx],\n                            bri_con_sat=self.cfg.DATA.SSL_COLOR_BRI_CON_SAT,\n                            hue=self.cfg.DATA.SSL_COLOR_HUE,\n                            p_convert_gray=self.p_convert_gray,\n                            moco_v2_aug=self.cfg.DATA.SSL_MOCOV2_AUG,\n                            
gaussan_sigma_min=self.cfg.DATA.SSL_BLUR_SIGMA_MIN,\n                            gaussan_sigma_max=self.cfg.DATA.SSL_BLUR_SIGMA_MAX,\n                        )\n\n                    if self.aug and self.cfg.AUG.AA_TYPE:\n                        aug_transform = create_random_augment(\n                            input_size=(f_out[idx].size(1), f_out[idx].size(2)),\n                            auto_augment=self.cfg.AUG.AA_TYPE,\n                            interpolation=self.cfg.AUG.INTERPOLATION,\n                        )\n                        # T H W C -> T C H W.\n                        f_out[idx] = f_out[idx].permute(0, 3, 1, 2)\n                        list_img = self._frame_to_list_img(f_out[idx])\n                        list_img = aug_transform(list_img)\n                        f_out[idx] = self._list_img_to_frames(list_img)\n                        f_out[idx] = f_out[idx].permute(0, 2, 3, 1)\n\n                    # Perform color normalization.\n                    f_out[idx] = utils.tensor_normalize(\n                        f_out[idx], self.cfg.DATA.MEAN, self.cfg.DATA.STD\n                    )\n\n                    # T H W C -> C T H W.\n                    f_out[idx] = f_out[idx].permute(3, 0, 1, 2)\n\n                    scl, asp = (\n                        self.cfg.DATA.TRAIN_JITTER_SCALES_RELATIVE,\n                        self.cfg.DATA.TRAIN_JITTER_ASPECT_RELATIVE,\n                    )\n                    relative_scales = (\n                        None if (self.mode not in [\"train\"] or len(scl) == 0) else scl\n                    )\n                    relative_aspect = (\n                        None if (self.mode not in [\"train\"] or len(asp) == 0) else asp\n                    )\n                    f_out[idx] = utils.spatial_sampling(\n                        f_out[idx],\n                        spatial_idx=spatial_sample_index,\n                        min_scale=min_scale[i],\n                        max_scale=max_scale[i],\n   
                     crop_size=crop_size[i],\n                        random_horizontal_flip=self.cfg.DATA.RANDOM_FLIP,\n                        inverse_uniform_sampling=self.cfg.DATA.INV_UNIFORM_SAMPLE,\n                        aspect_ratio=relative_aspect,\n                        scale=relative_scales,\n                        motion_shift=(\n                            self.cfg.DATA.TRAIN_JITTER_MOTION_SHIFT\n                            if self.mode in [\"train\"]\n                            else False\n                        ),\n                    )\n\n                    if self.rand_erase:\n                        erase_transform = RandomErasing(\n                            self.cfg.AUG.RE_PROB,\n                            mode=self.cfg.AUG.RE_MODE,\n                            max_count=self.cfg.AUG.RE_COUNT,\n                            num_splits=self.cfg.AUG.RE_COUNT,\n                            device=\"cpu\",\n                        )\n                        f_out[idx] = erase_transform(\n                            f_out[idx].permute(1, 0, 2, 3)\n                        ).permute(1, 0, 2, 3)\n\n                    f_out[idx] = utils.pack_pathway_output(self.cfg, f_out[idx])\n                    if self.cfg.AUG.GEN_MASK_LOADER:\n                        mask = self._gen_mask()\n                        f_out[idx] = f_out[idx] + [torch.Tensor(), mask]\n            frames = f_out[0] if num_out == 1 else f_out\n            time_idx = np.array(time_idx_out)\n            if (\n                num_aug * num_decode > 1\n                and not self.cfg.MODEL.MODEL_NAME == \"ContrastiveModel\"\n            ):\n                label = [label] * num_aug * num_decode\n                index = [index] * num_aug * num_decode\n            if self.cfg.DATA.DUMMY_LOAD:\n                if self.dummy_output is None:\n                    self.dummy_output = (frames, label, index, time_idx, {})\n            return frames, label, index, time_idx, {}\n        else:\n  
          logger.warning(\n                \"Failed to fetch video after {} retries.\".format(self._num_retries)\n            )\n\n    def _gen_mask(self):\n        if self.cfg.AUG.MASK_TUBE:\n            num_masking_patches = round(\n                np.prod(self.cfg.AUG.MASK_WINDOW_SIZE) * self.cfg.AUG.MASK_RATIO\n            )\n            min_mask = num_masking_patches // 5\n            masked_position_generator = MaskingGenerator(\n                mask_window_size=self.cfg.AUG.MASK_WINDOW_SIZE,\n                num_masking_patches=num_masking_patches,\n                max_num_patches=None,\n                min_num_patches=min_mask,\n            )\n            mask = masked_position_generator()\n            mask = np.tile(mask, (8, 1, 1))\n        elif self.cfg.AUG.MASK_FRAMES:\n            mask = np.zeros(shape=self.cfg.AUG.MASK_WINDOW_SIZE, dtype=int)\n            n_mask = round(self.cfg.AUG.MASK_WINDOW_SIZE[0] * self.cfg.AUG.MASK_RATIO)\n            mask_t_ind = random.sample(\n                range(0, self.cfg.AUG.MASK_WINDOW_SIZE[0]), n_mask\n            )\n            mask[mask_t_ind, :, :] += 1\n        else:\n            num_masking_patches = round(\n                np.prod(self.cfg.AUG.MASK_WINDOW_SIZE) * self.cfg.AUG.MASK_RATIO\n            )\n            max_mask = np.prod(self.cfg.AUG.MASK_WINDOW_SIZE[1:])\n            min_mask = max_mask // 5\n            masked_position_generator = MaskingGenerator3D(\n                mask_window_size=self.cfg.AUG.MASK_WINDOW_SIZE,\n                num_masking_patches=num_masking_patches,\n                max_num_patches=max_mask,\n                min_num_patches=min_mask,\n            )\n            mask = masked_position_generator()\n        return mask\n\n    def _frame_to_list_img(self, frames):\n        img_list = [transforms.ToPILImage()(frames[i]) for i in range(frames.size(0))]\n        return img_list\n\n    def _list_img_to_frames(self, img_list):\n        img_list = [transforms.ToTensor()(img) for img in 
img_list]\n        return torch.stack(img_list)\n\n    def __len__(self):\n        \"\"\"\n        Returns:\n            (int): the number of videos in the dataset.\n        \"\"\"\n        return self.num_videos\n\n    @property\n    def num_videos(self):\n        \"\"\"\n        Returns:\n            (int): the number of videos in the dataset.\n        \"\"\"\n        return len(self._path_to_videos)\n"
  },
  {
    "path": "slowfast/datasets/loader.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Data loader.\"\"\"\n\nimport itertools\nfrom functools import partial\n\nimport numpy as np\nimport torch\nfrom slowfast.datasets.multigrid_helper import ShortCycleBatchSampler\nfrom torch.utils.data._utils.collate import default_collate\nfrom torch.utils.data.distributed import DistributedSampler\nfrom torch.utils.data.sampler import RandomSampler\n\nfrom . import utils as utils\nfrom .build import build_dataset\n\n\ndef multiple_samples_collate(batch, fold=False):\n    \"\"\"\n    Collate function for repeated augmentation. Each instance in the batch has\n    more than one sample.\n    Args:\n        batch (tuple or list): data batch to collate.\n    Returns:\n        (tuple): collated data batch.\n    \"\"\"\n    inputs, labels, video_idx, time, extra_data = zip(*batch)\n    inputs = [item for sublist in inputs for item in sublist]\n    labels = [item for sublist in labels for item in sublist]\n    video_idx = [item for sublist in video_idx for item in sublist]\n    time = [item for sublist in time for item in sublist]\n\n    inputs, labels, video_idx, time, extra_data = (\n        default_collate(inputs),\n        default_collate(labels),\n        default_collate(video_idx),\n        default_collate(time),\n        default_collate(extra_data),\n    )\n    if fold:\n        return [inputs], labels, video_idx, time, extra_data\n    else:\n        return inputs, labels, video_idx, time, extra_data\n\n\ndef detection_collate(batch):\n    \"\"\"\n    Collate function for detection task. 
Concatenate bboxes, labels and\n    metadata from different samples in the first dimension instead of\n    stacking them to have a batch-size dimension.\n    Args:\n        batch (tuple or list): data batch to collate.\n    Returns:\n        (tuple): collated detection data batch.\n    \"\"\"\n    inputs, labels, video_idx, time, extra_data = zip(*batch)\n    inputs, video_idx = default_collate(inputs), default_collate(video_idx)\n    time = default_collate(time)\n    labels = torch.tensor(np.concatenate(labels, axis=0)).float()\n\n    collated_extra_data = {}\n    for key in extra_data[0].keys():\n        data = [d[key] for d in extra_data]\n        if key == \"boxes\" or key == \"ori_boxes\":\n            # Append idx info to the bboxes before concatenating them.\n            bboxes = [\n                np.concatenate(\n                    [np.full((data[i].shape[0], 1), float(i)), data[i]], axis=1\n                )\n                for i in range(len(data))\n            ]\n            bboxes = np.concatenate(bboxes, axis=0)\n            collated_extra_data[key] = torch.tensor(bboxes).float()\n        elif key == \"metadata\":\n            collated_extra_data[key] = torch.tensor(list(itertools.chain(*data))).view(\n                -1, 2\n            )\n        else:\n            collated_extra_data[key] = default_collate(data)\n\n    return inputs, labels, video_idx, time, collated_extra_data\n\n\ndef construct_loader(cfg, split, is_precise_bn=False):\n    \"\"\"\n    Constructs the data loader for the given dataset.\n    Args:\n        cfg (CfgNode): configs. Details can be found in\n            slowfast/config/defaults.py\n        split (str): the split of the data loader. 
Options include `train`,\n            `val`, and `test`.\n    \"\"\"\n    assert split in [\"train\", \"val\", \"test\"]\n    if split in [\"train\"]:\n        dataset_name = cfg.TRAIN.DATASET\n        batch_size = int(cfg.TRAIN.BATCH_SIZE / max(1, cfg.NUM_GPUS))\n        shuffle = True\n        drop_last = True\n    elif split in [\"val\"]:\n        dataset_name = cfg.TRAIN.DATASET\n        batch_size = int(cfg.TRAIN.BATCH_SIZE / max(1, cfg.NUM_GPUS))\n        shuffle = False\n        drop_last = False\n    elif split in [\"test\"]:\n        dataset_name = cfg.TEST.DATASET\n        batch_size = int(cfg.TEST.BATCH_SIZE / max(1, cfg.NUM_GPUS))\n        shuffle = False\n        drop_last = False\n\n    # Construct the dataset\n    dataset = build_dataset(dataset_name, cfg, split)\n\n    if isinstance(dataset, torch.utils.data.IterableDataset):\n        loader = torch.utils.data.DataLoader(\n            dataset,\n            batch_size=batch_size,\n            num_workers=cfg.DATA_LOADER.NUM_WORKERS,\n            pin_memory=cfg.DATA_LOADER.PIN_MEMORY,\n            drop_last=drop_last,\n            collate_fn=detection_collate if cfg.DETECTION.ENABLE else None,\n            worker_init_fn=utils.loader_worker_init_fn(dataset),\n        )\n    else:\n        if cfg.MULTIGRID.SHORT_CYCLE and split in [\"train\"] and not is_precise_bn:\n            # Create a sampler for multi-process training\n            sampler = utils.create_sampler(dataset, shuffle, cfg)\n            batch_sampler = ShortCycleBatchSampler(\n                sampler, batch_size=batch_size, drop_last=drop_last, cfg=cfg\n            )\n            # Create a loader\n            loader = torch.utils.data.DataLoader(\n                dataset,\n                batch_sampler=batch_sampler,\n                num_workers=cfg.DATA_LOADER.NUM_WORKERS,\n                pin_memory=cfg.DATA_LOADER.PIN_MEMORY,\n                worker_init_fn=utils.loader_worker_init_fn(dataset),\n            )\n        else:\n         
   # Create a sampler for multi-process training\n            sampler = utils.create_sampler(dataset, shuffle, cfg)\n            # Create a loader\n            if cfg.DETECTION.ENABLE:\n                collate_func = detection_collate\n            elif (\n                (\n                    cfg.AUG.NUM_SAMPLE > 1\n                    or cfg.DATA.TRAIN_CROP_NUM_TEMPORAL > 1\n                    or cfg.DATA.TRAIN_CROP_NUM_SPATIAL > 1\n                )\n                and split in [\"train\"]\n                and not cfg.MODEL.MODEL_NAME == \"ContrastiveModel\"\n            ):\n                collate_func = partial(\n                    multiple_samples_collate, fold=\"imagenet\" in dataset_name\n                )\n            else:\n                collate_func = None\n            loader = torch.utils.data.DataLoader(\n                dataset,\n                batch_size=batch_size,\n                shuffle=(False if sampler else shuffle),\n                sampler=sampler,\n                num_workers=cfg.DATA_LOADER.NUM_WORKERS,\n                pin_memory=cfg.DATA_LOADER.PIN_MEMORY,\n                drop_last=drop_last,\n                collate_fn=collate_func,\n                worker_init_fn=utils.loader_worker_init_fn(dataset),\n            )\n    return loader\n\n\ndef shuffle_dataset(loader, cur_epoch):\n    \"\"\"\n    Shuffles the data.\n    Args:\n        loader (loader): data loader to perform shuffle.\n        cur_epoch (int): number of the current epoch.\n    \"\"\"\n    if loader._dataset_kind == torch.utils.data.dataloader._DatasetKind.Iterable:\n        if hasattr(loader.dataset, \"sampler\"):\n            sampler = loader.dataset.sampler\n        else:\n            raise RuntimeError(\n                \"Unknown sampler for IterableDataset when shuffling dataset\"\n            )\n    else:\n        sampler = (\n            loader.batch_sampler.sampler\n            if isinstance(loader.batch_sampler, ShortCycleBatchSampler)\n            else 
loader.sampler\n        )\n    assert isinstance(sampler, (RandomSampler, DistributedSampler)), (\n        \"Sampler type '{}' not supported\".format(type(sampler))\n    )\n    # RandomSampler handles shuffling automatically\n    if isinstance(sampler, DistributedSampler):\n        # DistributedSampler shuffles data based on epoch\n        sampler.set_epoch(cur_epoch)\n\n    if hasattr(loader.dataset, \"prefetcher\"):\n        sampler = loader.dataset.prefetcher.sampler\n        if isinstance(sampler, DistributedSampler):\n            # DistributedSampler shuffles data based on epoch\n            print(\"prefetcher sampler\")\n            sampler.set_epoch(cur_epoch)\n"
  },
  {
    "path": "slowfast/datasets/mixup.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"\nThis implementation is based on\nhttps://github.com/rwightman/pytorch-image-models/blob/master/timm/data/mixup.py,\npublished under an Apache License 2.0.\n\nCOMMENT FROM ORIGINAL:\nMixup and Cutmix\nPapers:\nmixup: Beyond Empirical Risk Minimization (https://arxiv.org/abs/1710.09412)\nCutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features (https://arxiv.org/abs/1905.04899) # NOQA\nCode Reference:\nCutMix: https://github.com/clovaai/CutMix-PyTorch\nHacked together by / Copyright 2020 Ross Wightman\n\"\"\"\n\nimport numpy as np\nimport torch\n\n\ndef convert_to_one_hot(targets, num_classes, on_value=1.0, off_value=0.0):\n    \"\"\"\n    This function converts target class indices to one-hot vectors, given the\n    number of classes.\n    Args:\n        targets (loader): Class labels.\n        num_classes (int): Total number of classes.\n        on_value (float): Target Value for ground truth class.\n        off_value (float): Target Value for other classes.This value is used for\n            label smoothing.\n    \"\"\"\n\n    targets = targets.long().view(-1, 1)\n    return torch.full(\n        (targets.size()[0], num_classes), off_value, device=targets.device\n    ).scatter_(1, targets, on_value)\n\n\ndef mixup_target(target, num_classes, lam=1.0, smoothing=0.0):\n    \"\"\"\n    This function converts target class indices to one-hot vectors, given the\n    number of classes.\n    Args:\n        targets (loader): Class labels.\n        num_classes (int): Total number of classes.\n        lam (float): lamba value for mixup/cutmix.\n        smoothing (float): Label smoothing value.\n    \"\"\"\n    off_value = smoothing / num_classes\n    on_value = 1.0 - smoothing + off_value\n    target1 = convert_to_one_hot(\n        target,\n        num_classes,\n        on_value=on_value,\n        off_value=off_value,\n    )\n    target2 = 
convert_to_one_hot(\n        target.flip(0),\n        num_classes,\n        on_value=on_value,\n        off_value=off_value,\n    )\n    return target1 * lam + target2 * (1.0 - lam)\n\n\ndef rand_bbox(img_shape, lam, margin=0.0, count=None):\n    \"\"\"\n    Generates a random square bbox based on lambda value.\n\n    Args:\n        img_shape (tuple): Image shape as tuple\n        lam (float): Cutmix lambda value\n        margin (float): Percentage of bbox dimension to enforce as margin (reduce amount of box outside image)\n        count (int): Number of bbox to generate\n    \"\"\"\n    ratio = np.sqrt(1 - lam)\n    img_h, img_w = img_shape[-2:]\n    cut_h, cut_w = int(img_h * ratio), int(img_w * ratio)\n    margin_y, margin_x = int(margin * cut_h), int(margin * cut_w)\n    cy = np.random.randint(0 + margin_y, img_h - margin_y, size=count)\n    cx = np.random.randint(0 + margin_x, img_w - margin_x, size=count)\n    yl = np.clip(cy - cut_h // 2, 0, img_h)\n    yh = np.clip(cy + cut_h // 2, 0, img_h)\n    xl = np.clip(cx - cut_w // 2, 0, img_w)\n    xh = np.clip(cx + cut_w // 2, 0, img_w)\n    return yl, yh, xl, xh\n\n\ndef get_cutmix_bbox(img_shape, lam, correct_lam=True, count=None):\n    \"\"\"\n    Generates the box coordinates for cutmix.\n\n    Args:\n        img_shape (tuple): Image shape as tuple\n        lam (float): Cutmix lambda value\n        correct_lam (bool): Apply lambda correction when cutmix bbox clipped by\n            image borders.\n        count (int): Number of bbox to generate\n    \"\"\"\n\n    yl, yu, xl, xu = rand_bbox(img_shape, lam, count=count)\n    if correct_lam:\n        bbox_area = (yu - yl) * (xu - xl)\n        lam = 1.0 - bbox_area / float(img_shape[-2] * img_shape[-1])\n    return (yl, yu, xl, xu), lam\n\n\nclass MixUp:\n    \"\"\"\n    Apply mixup and/or cutmix for videos at batch level.\n    mixup: Beyond Empirical Risk Minimization (https://arxiv.org/abs/1710.09412)\n    CutMix: Regularization Strategy to Train Strong 
Classifiers with Localizable\n        Features (https://arxiv.org/abs/1905.04899)\n    \"\"\"\n\n    def __init__(\n        self,\n        mixup_alpha=1.0,\n        cutmix_alpha=0.0,\n        mix_prob=1.0,\n        switch_prob=0.5,\n        correct_lam=True,\n        label_smoothing=0.1,\n        num_classes=1000,\n    ):\n        \"\"\"\n        Args:\n            mixup_alpha (float): Mixup alpha value.\n            cutmix_alpha (float): Cutmix alpha value.\n            mix_prob (float): Probability of applying mixup or cutmix.\n            switch_prob (float): Probability of switching to cutmix instead of\n                mixup when both are active.\n            correct_lam (bool): Apply lambda correction when cutmix bbox\n                clipped by image borders.\n            label_smoothing (float): Apply label smoothing to the mixed target\n                tensor. If label_smoothing is not used, set it to 0.\n            num_classes (int): Number of classes for target.\n        \"\"\"\n        self.mixup_alpha = mixup_alpha\n        self.cutmix_alpha = cutmix_alpha\n        self.mix_prob = mix_prob\n        self.switch_prob = switch_prob\n        self.label_smoothing = label_smoothing\n        self.num_classes = num_classes\n        self.correct_lam = correct_lam\n\n    def _get_mixup_params(self):\n        lam = 1.0\n        use_cutmix = False\n        if np.random.rand() < self.mix_prob:\n            if self.mixup_alpha > 0.0 and self.cutmix_alpha > 0.0:\n                use_cutmix = np.random.rand() < self.switch_prob\n                lam_mix = (\n                    np.random.beta(self.cutmix_alpha, self.cutmix_alpha)\n                    if use_cutmix\n                    else np.random.beta(self.mixup_alpha, self.mixup_alpha)\n                )\n            elif self.mixup_alpha > 0.0:\n                lam_mix = np.random.beta(self.mixup_alpha, self.mixup_alpha)\n            elif self.cutmix_alpha > 0.0:\n                use_cutmix = True\n               
 lam_mix = np.random.beta(self.cutmix_alpha, self.cutmix_alpha)\n            lam = float(lam_mix)\n        return lam, use_cutmix\n\n    def _mix_batch(self, x):\n        lam, use_cutmix = self._get_mixup_params()\n        if lam == 1.0:\n            return 1.0\n        if use_cutmix:\n            (yl, yh, xl, xh), lam = get_cutmix_bbox(\n                x.shape,\n                lam,\n                correct_lam=self.correct_lam,\n            )\n            x[..., yl:yh, xl:xh] = x.flip(0)[..., yl:yh, xl:xh]\n        else:\n            x_flipped = x.flip(0).mul_(1.0 - lam)\n            x.mul_(lam).add_(x_flipped)\n        return lam\n\n    def __call__(self, x, target):\n        if self.mix_prob > 0.0:\n            assert len(x) > 1, \"Batch size should be greater than 1 for mixup.\"\n            lam = self._mix_batch(x)\n        elif self.mix_prob == 0.0:\n            lam = 1.0\n        else:\n            raise NotImplementedError\n        target = mixup_target(target, self.num_classes, lam, self.label_smoothing)\n        return x, target\n"
  },
  {
    "path": "slowfast/datasets/multigrid_helper.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Helper functions for multigrid training.\"\"\"\n\nimport numpy as np\nimport torch\nfrom torch.utils.data.sampler import Sampler\n\nTORCH_MAJOR = int(torch.__version__.split(\".\")[0])\nTORCH_MINOR = int(torch.__version__.split(\".\")[1])\n\nif TORCH_MAJOR >= 1 and TORCH_MINOR >= 8:\n    _int_classes = int\nelse:\n    from torch._six import int_classes as _int_classes\n\n\nclass ShortCycleBatchSampler(Sampler):\n    \"\"\"\n    Extend Sampler to support \"short cycle\" sampling.\n    See paper \"A Multigrid Method for Efficiently Training Video Models\",\n    Wu et al., 2019 (https://arxiv.org/abs/1912.00998) for details.\n    \"\"\"\n\n    def __init__(self, sampler, batch_size, drop_last, cfg):\n        if not isinstance(sampler, Sampler):\n            raise ValueError(\n                \"sampler should be an instance of \"\n                \"torch.utils.data.Sampler, but got sampler={}\".format(sampler)\n            )\n        if (\n            not isinstance(batch_size, _int_classes)\n            or isinstance(batch_size, bool)\n            or batch_size <= 0\n        ):\n            raise ValueError(\n                \"batch_size should be a positive integer value, \"\n                \"but got batch_size={}\".format(batch_size)\n            )\n        if not isinstance(drop_last, bool):\n            raise ValueError(\n                \"drop_last should be a boolean value, but got drop_last={}\".format(\n                    drop_last\n                )\n            )\n        self.sampler = sampler\n        self.drop_last = drop_last\n\n        bs_factor = [\n            int(\n                round(\n                    (float(cfg.DATA.TRAIN_CROP_SIZE) / (s * cfg.MULTIGRID.DEFAULT_S))\n                    ** 2\n                )\n            )\n            for s in cfg.MULTIGRID.SHORT_CYCLE_FACTORS\n        ]\n\n        self.batch_sizes = 
[\n            batch_size * bs_factor[0],\n            batch_size * bs_factor[1],\n            batch_size,\n        ]\n\n    def __iter__(self):\n        counter = 0\n        batch_size = self.batch_sizes[0]\n        batch = []\n        for idx in self.sampler:\n            batch.append((idx, counter % 3))\n            if len(batch) == batch_size:\n                yield batch\n                counter += 1\n                batch_size = self.batch_sizes[counter % 3]\n                batch = []\n        if len(batch) > 0 and not self.drop_last:\n            yield batch\n\n    def __len__(self):\n        avg_batch_size = sum(self.batch_sizes) / 3.0\n        if self.drop_last:\n            return int(np.floor(len(self.sampler) / avg_batch_size))\n        else:\n            return int(np.ceil(len(self.sampler) / avg_batch_size))\n"
  },
  {
    "path": "slowfast/datasets/ptv_datasets.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport functools\nimport os\nfrom typing import Dict\n\nimport slowfast.utils.logging as logging\nimport torch\nfrom pytorchvideo.data import Charades, LabeledVideoDataset, make_clip_sampler, SSv2\nfrom pytorchvideo.data.labeled_video_paths import LabeledVideoPaths\nfrom pytorchvideo.transforms import (\n    ApplyTransformToKey,\n    RandomShortSideScale,\n    ShortSideScale,\n    UniformCropVideo,\n    UniformTemporalSubsample,\n)\nfrom torch.utils.data import DistributedSampler, RandomSampler, SequentialSampler\nfrom torchvision.transforms import Compose, Lambda\nfrom torchvision.transforms._transforms_video import (\n    NormalizeVideo,\n    RandomCropVideo,\n    RandomHorizontalFlipVideo,\n)\n\nfrom . import utils as utils\nfrom .build import DATASET_REGISTRY\n\nlogger = logging.get_logger(__name__)\n\n\nclass PTVDatasetWrapper(torch.utils.data.IterableDataset):\n    \"\"\"\n    Wrapper for PyTorchVideo datasets.\n    \"\"\"\n\n    def __init__(self, num_videos, clips_per_video, crops_per_clip, dataset):\n        \"\"\"\n        Construct the dataset.\n\n        Args:\n            num_vidoes (int): number of videos in the dataset.\n            clips_per_video (int): number of clips per video in the dataset.\n            dataset (torch.utils.data.IterableDataset): a PyTorchVideo dataset.\n        \"\"\"\n        self._clips_per_video = clips_per_video\n        self._crops_per_clip = crops_per_clip\n        self._num_videos = num_videos\n        self.dataset = dataset\n\n    def __next__(self):\n        \"\"\"\n        Retrieves the next clip from the dataset.\n        \"\"\"\n        return self.dataset.__next__()\n\n    @property\n    def sampler(self):\n        \"\"\"\n        Returns:\n            (torch.utils.data.Sampler): video sampler for the dataset.\n        \"\"\"\n        return self.dataset.video_sampler\n\n    def __len__(self):\n        \"\"\"\n        
Returns:\n            (int): the number of clips per replica in the IterableDataset.\n        \"\"\"\n        return len(self.sampler) * self._clips_per_video * self._crops_per_clip\n\n    @property\n    def num_videos(self):\n        \"\"\"\n        Returns:\n            (int): the number of clips in total in the dataset.\n        \"\"\"\n        return self._num_videos * self._clips_per_video * self._crops_per_clip\n\n    def __iter__(self):\n        return self\n\n\nclass PackPathway(torch.nn.Module):\n    \"\"\"\n    Transform for converting video frames as a list of tensors. Each tensor\n    corresponding to a unique pathway.\n    \"\"\"\n\n    def __init__(self, cfg):\n        super().__init__()\n        self.cfg = cfg\n\n    def forward(self, x: torch.Tensor):\n        return utils.pack_pathway_output(self.cfg, x)\n\n\nclass DictToTuple(torch.nn.Module):\n    \"\"\"\n    Transform for converting output from dict to a tuple following PySlowFast\n    dataset output format.\n    \"\"\"\n\n    def __init__(self, num_clips, num_crops):\n        super().__init__()\n        self._num_clips = num_clips\n        self._num_crops = num_crops\n\n    def forward(self, x: Dict[str, torch.Tensor]):\n        index = (\n            x[\"video_index\"] * self._num_clips * self._num_crops\n            + x[\"clip_index\"] * self._num_crops\n            + x[\"aug_index\"]\n        )\n\n        return x[\"video\"], x[\"label\"], index, {}\n\n\ndef div255(x):\n    \"\"\"\n    Scale clip frames from [0, 255] to [0, 1].\n    Args:\n        x (Tensor): A tensor of the clip's RGB frames with shape:\n            (channel, time, height, width).\n\n    Returns:\n        x (Tensor): Scaled tensor by divide 255.\n    \"\"\"\n    return x / 255.0\n\n\n@DATASET_REGISTRY.register()\ndef Ptvkinetics(cfg, mode):\n    \"\"\"\n    Construct the Kinetics video loader with a given csv file. 
The format of\n    the csv file is:\n    ```\n    path_to_video_1 label_1\n    path_to_video_2 label_2\n    ...\n    path_to_video_N label_N\n    ```\n    For `train` and `val` mode, a single clip is randomly sampled from every video\n    with random cropping, scaling, and flipping. For `test` mode, multiple clips are\n    uniformaly sampled from every video with center cropping.\n    Args:\n        cfg (CfgNode): configs.\n        mode (string): Options includes `train`, `val`, or `test` mode.\n            For the train and val mode, the data loader will take data\n            from the train or val set, and sample one clip per video.\n            For the test mode, the data loader will take data from test set,\n            and sample multiple clips per video.\n    \"\"\"\n    # Only support train, val, and test mode.\n    assert mode in [\n        \"train\",\n        \"val\",\n        \"test\",\n    ], \"Split '{}' not supported\".format(mode)\n\n    logger.info(\"Constructing Ptvkinetics {}...\".format(mode))\n\n    clip_duration = cfg.DATA.NUM_FRAMES * cfg.DATA.SAMPLING_RATE / cfg.DATA.TARGET_FPS\n    path_to_file = os.path.join(cfg.DATA.PATH_TO_DATA_DIR, \"{}.csv\".format(mode))\n    labeled_video_paths = LabeledVideoPaths.from_path(path_to_file)\n    num_videos = len(labeled_video_paths)\n    labeled_video_paths.path_prefix = cfg.DATA.PATH_PREFIX\n    logger.info(\n        \"Constructing kinetics dataloader (size: {}) from {}\".format(\n            num_videos, path_to_file\n        )\n    )\n\n    if mode in [\"train\", \"val\"]:\n        num_clips = 1\n        num_crops = 1\n\n        transform = Compose(\n            [\n                ApplyTransformToKey(\n                    key=\"video\",\n                    transform=Compose(\n                        [\n                            UniformTemporalSubsample(cfg.DATA.NUM_FRAMES),\n                            Lambda(div255),\n                            NormalizeVideo(cfg.DATA.MEAN, cfg.DATA.STD),\n         
                   RandomShortSideScale(\n                                min_size=cfg.DATA.TRAIN_JITTER_SCALES[0],\n                                max_size=cfg.DATA.TRAIN_JITTER_SCALES[1],\n                            ),\n                            RandomCropVideo(cfg.DATA.TRAIN_CROP_SIZE),\n                        ]\n                        + (\n                            [RandomHorizontalFlipVideo(p=0.5)]\n                            if cfg.DATA.RANDOM_FLIP\n                            else []\n                        )\n                        + [PackPathway(cfg)]\n                    ),\n                ),\n                DictToTuple(num_clips, num_crops),\n            ]\n        )\n\n        clip_sampler = make_clip_sampler(\"random\", clip_duration)\n        if cfg.NUM_GPUS > 1:\n            video_sampler = DistributedSampler\n        else:\n            video_sampler = RandomSampler if mode == \"train\" else SequentialSampler\n    else:\n        num_clips = cfg.TEST.NUM_ENSEMBLE_VIEWS\n        num_crops = cfg.TEST.NUM_SPATIAL_CROPS\n\n        transform = Compose(\n            [\n                ApplyTransformToKey(\n                    key=\"video\",\n                    transform=Compose(\n                        [\n                            UniformTemporalSubsample(cfg.DATA.NUM_FRAMES),\n                            Lambda(div255),\n                            NormalizeVideo(cfg.DATA.MEAN, cfg.DATA.STD),\n                            ShortSideScale(size=cfg.DATA.TRAIN_JITTER_SCALES[0]),\n                        ]\n                    ),\n                ),\n                UniformCropVideo(size=cfg.DATA.TEST_CROP_SIZE),\n                ApplyTransformToKey(key=\"video\", transform=PackPathway(cfg)),\n                DictToTuple(num_clips, num_crops),\n            ]\n        )\n        clip_sampler = make_clip_sampler(\n            \"constant_clips_per_video\",\n            clip_duration,\n            num_clips,\n            num_crops,\n        )\n      
  video_sampler = DistributedSampler if cfg.NUM_GPUS > 1 else SequentialSampler\n\n    return PTVDatasetWrapper(\n        num_videos=num_videos,\n        clips_per_video=num_clips,\n        crops_per_clip=num_crops,\n        dataset=LabeledVideoDataset(\n            labeled_video_paths=labeled_video_paths,\n            clip_sampler=clip_sampler,\n            video_sampler=video_sampler,\n            transform=transform,\n            decode_audio=False,\n            decoder=cfg.DATA.DECODING_BACKEND,\n        ),\n    )\n\n\ndef process_charades_label(x, mode, num_classes):\n    \"\"\"\n    Process the video label for Charades dataset. Use clip-level labels\n    (aggregated over the sampled frames) for training mode, otherwise use the\n    video-level label. Then convert the label into a binary vector.\n    Args:\n        x (dict): a video clip including label index.\n        mode (string): Options include `train`, `val`, or `test` mode.\n        num_classes (int): Number of classes in the dataset.\n\n    Returns:\n        x (dict): video clip with updated label information.\n    \"\"\"\n    label = utils.aggregate_labels(x[\"label\"]) if mode == \"train\" else x[\"video_label\"]\n    x[\"label\"] = torch.as_tensor(utils.as_binary_vector(label, num_classes))\n\n    return x\n\n\ndef rgb2bgr(x):\n    \"\"\"\n    Convert clip frames from RGB mode to BGR mode.\n    Args:\n        x (Tensor): A tensor of the clip's RGB frames with shape:\n            (channel, time, height, width).\n\n    Returns:\n        x (Tensor): Converted tensor with channels in BGR order.\n    \"\"\"\n    return x[[2, 1, 0], ...]\n\n\n@DATASET_REGISTRY.register()\ndef Ptvcharades(cfg, mode):\n    \"\"\"\n    Construct PyTorchVideo Charades video loader.\n    Load Charades data (frame paths, labels, etc. 
) to Charades Dataset object.\n    The dataset could be downloaded from Charades official website\n    (https://allenai.org/plato/charades/).\n    Please see datasets/DATASET.md for more information about the data format.\n    For `train` and `val` mode, a single clip is randomly sampled from every video\n    with random cropping, scaling, and flipping. For `test` mode, multiple clips are\n    uniformly sampled from every video with center cropping.\n    Args:\n        cfg (CfgNode): configs.\n        mode (string): Options include `train`, `val`, or `test` mode.\n            For the train and val mode, the data loader will take data\n            from the train or val set, and sample one clip per video.\n            For the test mode, the data loader will take data from test set,\n            and sample multiple clips per video.\n    \"\"\"\n    # Only support train, val, and test mode.\n    assert mode in [\n        \"train\",\n        \"val\",\n        \"test\",\n    ], \"Split '{}' not supported\".format(mode)\n\n    logger.info(\"Constructing Ptvcharades {}...\".format(mode))\n\n    clip_duration = (\n        (cfg.DATA.NUM_FRAMES - 1) * cfg.DATA.SAMPLING_RATE + 1\n    ) / cfg.DATA.TARGET_FPS\n\n    if mode in [\"train\", \"val\"]:\n        num_clips = 1\n        num_crops = 1\n\n        transform = Compose(\n            [\n                ApplyTransformToKey(\n                    key=\"video\",\n                    transform=Compose(\n                        [\n                            Lambda(div255),\n                            NormalizeVideo(cfg.DATA.MEAN, cfg.DATA.STD),\n                            RandomShortSideScale(\n                                min_size=cfg.DATA.TRAIN_JITTER_SCALES[0],\n                                max_size=cfg.DATA.TRAIN_JITTER_SCALES[1],\n                            ),\n                            RandomCropVideo(cfg.DATA.TRAIN_CROP_SIZE),\n                            Lambda(rgb2bgr),\n                        ]\n            
            + (\n                            [RandomHorizontalFlipVideo(p=0.5)]\n                            if cfg.DATA.RANDOM_FLIP\n                            else []\n                        )\n                        + [PackPathway(cfg)]\n                    ),\n                ),\n                Lambda(\n                    functools.partial(\n                        process_charades_label,\n                        mode=mode,\n                        num_classes=cfg.MODEL.NUM_CLASSES,\n                    )\n                ),\n                DictToTuple(num_clips, num_crops),\n            ]\n        )\n        clip_sampler = make_clip_sampler(\"random\", clip_duration)\n        if cfg.NUM_GPUS > 1:\n            video_sampler = DistributedSampler\n        else:\n            video_sampler = RandomSampler if mode == \"train\" else SequentialSampler\n    else:\n        num_clips = cfg.TEST.NUM_ENSEMBLE_VIEWS\n        num_crops = cfg.TEST.NUM_SPATIAL_CROPS\n\n        transform = Compose(\n            [\n                ApplyTransformToKey(\n                    key=\"video\",\n                    transform=Compose(\n                        [\n                            Lambda(div255),\n                            NormalizeVideo(cfg.DATA.MEAN, cfg.DATA.STD),\n                            ShortSideScale(size=cfg.DATA.TEST_CROP_SIZE),\n                        ]\n                    ),\n                ),\n                UniformCropVideo(size=cfg.DATA.TEST_CROP_SIZE),\n                Lambda(\n                    functools.partial(\n                        process_charades_label,\n                        mode=mode,\n                        num_classes=cfg.MODEL.NUM_CLASSES,\n                    )\n                ),\n                ApplyTransformToKey(\n                    key=\"video\",\n                    transform=Compose(\n                        [Lambda(rgb2bgr), PackPathway(cfg)],\n                    ),\n                ),\n                
DictToTuple(num_clips, num_crops),\n            ]\n        )\n        clip_sampler = make_clip_sampler(\n            \"constant_clips_per_video\",\n            clip_duration,\n            num_clips,\n            num_crops,\n        )\n        video_sampler = DistributedSampler if cfg.NUM_GPUS > 1 else SequentialSampler\n\n    data_path = os.path.join(cfg.DATA.PATH_TO_DATA_DIR, \"{}.csv\".format(mode))\n    dataset = Charades(\n        data_path=data_path,\n        clip_sampler=clip_sampler,\n        video_sampler=video_sampler,\n        transform=transform,\n        video_path_prefix=cfg.DATA.PATH_PREFIX,\n        frames_per_clip=cfg.DATA.NUM_FRAMES,\n    )\n\n    logger.info(\n        \"Constructing charades dataloader (size: {}) from {}\".format(\n            len(dataset._path_to_videos), data_path\n        )\n    )\n\n    return PTVDatasetWrapper(\n        num_videos=len(dataset._path_to_videos),\n        clips_per_video=num_clips,\n        crops_per_clip=num_crops,\n        dataset=dataset,\n    )\n\n\n@DATASET_REGISTRY.register()\ndef Ptvssv2(cfg, mode):\n    \"\"\"\n    Construct PyTorchVideo Something-Something v2 (SSv2) video loader.\n    Load SSv2 data (frame paths, labels, etc.) to SSv2 Dataset object.\n    The dataset could be downloaded from the Something-Something official website\n    (https://20bn.com/datasets/something-something).\n    Please see datasets/DATASET.md for more information about the data format.\n    For training and validation, a single clip is randomly sampled from every\n    video with random cropping and scaling. For testing, multiple clips are\n    uniformly sampled from every video with uniform cropping. 
For uniform cropping,\n    we take the left, center, and right crop if the width is larger than height,\n    or take top, center, and bottom crop if the height is larger than the width.\n    Args:\n        cfg (CfgNode): configs.\n        mode (string): Options include `train`, `val`, or `test` mode.\n    \"\"\"\n    # Only support train, val, and test mode.\n    assert mode in [\n        \"train\",\n        \"val\",\n        \"test\",\n    ], \"Split '{}' not supported\".format(mode)\n\n    logger.info(\"Constructing Ptvssv2 {}...\".format(mode))\n\n    if mode in [\"train\", \"val\"]:\n        num_clips = 1\n        num_crops = 1\n\n        transform = Compose(\n            [\n                ApplyTransformToKey(\n                    key=\"video\",\n                    transform=Compose(\n                        [\n                            Lambda(div255),\n                            NormalizeVideo(cfg.DATA.MEAN, cfg.DATA.STD),\n                            RandomShortSideScale(\n                                min_size=cfg.DATA.TRAIN_JITTER_SCALES[0],\n                                max_size=cfg.DATA.TRAIN_JITTER_SCALES[1],\n                            ),\n                            RandomCropVideo(cfg.DATA.TRAIN_CROP_SIZE),\n                            Lambda(rgb2bgr),\n                        ]\n                        + (\n                            [RandomHorizontalFlipVideo(p=0.5)]\n                            if cfg.DATA.RANDOM_FLIP\n                            else []\n                        )\n                        + [PackPathway(cfg)]\n                    ),\n                ),\n                DictToTuple(num_clips, num_crops),\n            ]\n        )\n        clip_sampler = make_clip_sampler(\n            \"constant_clips_per_video\",\n            1,  # Put arbitrary duration as ssv2 always needs full video clip.\n            num_clips,\n            num_crops,\n        )\n        if cfg.NUM_GPUS > 1:\n            video_sampler = 
DistributedSampler\n        else:\n            video_sampler = RandomSampler if mode == \"train\" else SequentialSampler\n    else:\n        assert cfg.TEST.NUM_ENSEMBLE_VIEWS == 1\n        num_clips = cfg.TEST.NUM_ENSEMBLE_VIEWS\n        num_crops = cfg.TEST.NUM_SPATIAL_CROPS\n\n        transform = Compose(\n            [\n                ApplyTransformToKey(\n                    key=\"video\",\n                    transform=Compose(\n                        [\n                            Lambda(div255),\n                            NormalizeVideo(cfg.DATA.MEAN, cfg.DATA.STD),\n                            ShortSideScale(size=cfg.DATA.TEST_CROP_SIZE),\n                        ]\n                    ),\n                ),\n                UniformCropVideo(size=cfg.DATA.TEST_CROP_SIZE),\n                ApplyTransformToKey(\n                    key=\"video\",\n                    transform=Compose(\n                        [Lambda(rgb2bgr), PackPathway(cfg)],\n                    ),\n                ),\n                DictToTuple(num_clips, num_crops),\n            ]\n        )\n        clip_sampler = make_clip_sampler(\n            \"constant_clips_per_video\",\n            1,  # Put arbitrary duration as ssv2 always needs full video clip.\n            num_clips,\n            num_crops,\n        )\n        video_sampler = DistributedSampler if cfg.NUM_GPUS > 1 else SequentialSampler\n\n    label_name_file = os.path.join(\n        cfg.DATA.PATH_TO_DATA_DIR, \"something-something-v2-labels.json\"\n    )\n    video_label_file = os.path.join(\n        cfg.DATA.PATH_TO_DATA_DIR,\n        \"something-something-v2-{}.json\".format(\n            \"train\" if mode == \"train\" else \"validation\"\n        ),\n    )\n    data_path = os.path.join(\n        cfg.DATA.PATH_TO_DATA_DIR,\n        \"{}.csv\".format(\"train\" if mode == \"train\" else \"val\"),\n    )\n    dataset = SSv2(\n        label_name_file=label_name_file,\n        video_label_file=video_label_file,\n        
video_path_label_file=data_path,\n        clip_sampler=clip_sampler,\n        video_sampler=video_sampler,\n        transform=transform,\n        video_path_prefix=cfg.DATA.PATH_PREFIX,\n        frames_per_clip=cfg.DATA.NUM_FRAMES,\n        rand_sample_frames=mode == \"train\",\n    )\n\n    logger.info(\n        \"Constructing ssv2 dataloader (size: {}) from {}\".format(\n            len(dataset._path_to_videos), data_path\n        )\n    )\n\n    return PTVDatasetWrapper(\n        num_videos=len(dataset._path_to_videos),\n        clips_per_video=num_clips,\n        crops_per_clip=num_crops,\n        dataset=dataset,\n    )\n"
  },
  {
    "path": "slowfast/datasets/rand_augment.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"\nThis implementation is based on\nhttps://github.com/rwightman/pytorch-image-models/blob/master/timm/data/auto_augment.py\npulished under an Apache License 2.0.\n\nCOMMENT FROM ORIGINAL:\nAutoAugment, RandAugment, and AugMix for PyTorch\nThis code implements the searched ImageNet policies with various tweaks and\nimprovements and does not include any of the search code. AA and RA\nImplementation adapted from:\n    https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/autoaugment.py\nAugMix adapted from:\n    https://github.com/google-research/augmix\nPapers:\n    AutoAugment: Learning Augmentation Policies from Data\n    https://arxiv.org/abs/1805.09501\n    Learning Data Augmentation Strategies for Object Detection\n    https://arxiv.org/abs/1906.11172\n    RandAugment: Practical automated data augmentation...\n    https://arxiv.org/abs/1909.13719\n    AugMix: A Simple Data Processing Method to Improve Robustness and\n    Uncertainty https://arxiv.org/abs/1912.02781\n\nHacked together by / Copyright 2020 Ross Wightman\n\"\"\"\n\nimport math\nimport random\nimport re\n\nimport numpy as np\nimport PIL\nfrom PIL import Image, ImageEnhance, ImageOps\n\n_PIL_VER = tuple([int(x) for x in PIL.__version__.split(\".\")[:2]])\n\n_FILL = (128, 128, 128)\n\n# This signifies the max integer that the controller RNN could predict for the\n# augmentation scheme.\n_MAX_LEVEL = 10.0\n\n_HPARAMS_DEFAULT = {\n    \"translate_const\": 250,\n    \"img_mean\": _FILL,\n}\n\n_RANDOM_INTERPOLATION = (Image.BILINEAR, Image.BICUBIC)\n\n\ndef _interpolation(kwargs):\n    interpolation = kwargs.pop(\"resample\", Image.BILINEAR)\n    if isinstance(interpolation, (list, tuple)):\n        return random.choice(interpolation)\n    else:\n        return interpolation\n\n\ndef _check_args_tf(kwargs):\n    if \"fillcolor\" in kwargs and _PIL_VER < (5, 0):\n        
kwargs.pop(\"fillcolor\")\n    kwargs[\"resample\"] = _interpolation(kwargs)\n\n\ndef shear_x(img, factor, **kwargs):\n    _check_args_tf(kwargs)\n    return img.transform(img.size, Image.AFFINE, (1, factor, 0, 0, 1, 0), **kwargs)\n\n\ndef shear_y(img, factor, **kwargs):\n    _check_args_tf(kwargs)\n    return img.transform(img.size, Image.AFFINE, (1, 0, 0, factor, 1, 0), **kwargs)\n\n\ndef translate_x_rel(img, pct, **kwargs):\n    pixels = pct * img.size[0]\n    _check_args_tf(kwargs)\n    return img.transform(img.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), **kwargs)\n\n\ndef translate_y_rel(img, pct, **kwargs):\n    pixels = pct * img.size[1]\n    _check_args_tf(kwargs)\n    return img.transform(img.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), **kwargs)\n\n\ndef translate_x_abs(img, pixels, **kwargs):\n    _check_args_tf(kwargs)\n    return img.transform(img.size, Image.AFFINE, (1, 0, pixels, 0, 1, 0), **kwargs)\n\n\ndef translate_y_abs(img, pixels, **kwargs):\n    _check_args_tf(kwargs)\n    return img.transform(img.size, Image.AFFINE, (1, 0, 0, 0, 1, pixels), **kwargs)\n\n\ndef rotate(img, degrees, **kwargs):\n    _check_args_tf(kwargs)\n    if _PIL_VER >= (5, 2):\n        return img.rotate(degrees, **kwargs)\n    elif _PIL_VER >= (5, 0):\n        w, h = img.size\n        post_trans = (0, 0)\n        rotn_center = (w / 2.0, h / 2.0)\n        angle = -math.radians(degrees)\n        matrix = [\n            round(math.cos(angle), 15),\n            round(math.sin(angle), 15),\n            0.0,\n            round(-math.sin(angle), 15),\n            round(math.cos(angle), 15),\n            0.0,\n        ]\n\n        def transform(x, y, matrix):\n            (a, b, c, d, e, f) = matrix\n            return a * x + b * y + c, d * x + e * y + f\n\n        matrix[2], matrix[5] = transform(\n            -rotn_center[0] - post_trans[0],\n            -rotn_center[1] - post_trans[1],\n            matrix,\n        )\n        matrix[2] += rotn_center[0]\n        matrix[5] += 
rotn_center[1]\n        return img.transform(img.size, Image.AFFINE, matrix, **kwargs)\n    else:\n        return img.rotate(degrees, resample=kwargs[\"resample\"])\n\n\ndef auto_contrast(img, **__):\n    return ImageOps.autocontrast(img)\n\n\ndef invert(img, **__):\n    return ImageOps.invert(img)\n\n\ndef equalize(img, **__):\n    return ImageOps.equalize(img)\n\n\ndef solarize(img, thresh, **__):\n    return ImageOps.solarize(img, thresh)\n\n\ndef solarize_add(img, add, thresh=128, **__):\n    lut = []\n    for i in range(256):\n        if i < thresh:\n            lut.append(min(255, i + add))\n        else:\n            lut.append(i)\n    if img.mode in (\"L\", \"RGB\"):\n        if img.mode == \"RGB\" and len(lut) == 256:\n            lut = lut + lut + lut\n        return img.point(lut)\n    else:\n        return img\n\n\ndef posterize(img, bits_to_keep, **__):\n    if bits_to_keep >= 8:\n        return img\n    return ImageOps.posterize(img, bits_to_keep)\n\n\ndef contrast(img, factor, **__):\n    return ImageEnhance.Contrast(img).enhance(factor)\n\n\ndef color(img, factor, **__):\n    return ImageEnhance.Color(img).enhance(factor)\n\n\ndef brightness(img, factor, **__):\n    return ImageEnhance.Brightness(img).enhance(factor)\n\n\ndef sharpness(img, factor, **__):\n    return ImageEnhance.Sharpness(img).enhance(factor)\n\n\ndef _randomly_negate(v):\n    \"\"\"With 50% prob, negate the value\"\"\"\n    return -v if random.random() > 0.5 else v\n\n\ndef _rotate_level_to_arg(level, _hparams):\n    # range [-30, 30]\n    level = (level / _MAX_LEVEL) * 30.0\n    level = _randomly_negate(level)\n    return (level,)\n\n\ndef _enhance_level_to_arg(level, _hparams):\n    # range [0.1, 1.9]\n    return ((level / _MAX_LEVEL) * 1.8 + 0.1,)\n\n\ndef _enhance_increasing_level_to_arg(level, _hparams):\n    # the 'no change' level is 1.0, moving away from that towards 0. 
or 2.0 increases the enhancement blend\n    # range [0.1, 1.9]\n    level = (level / _MAX_LEVEL) * 0.9\n    level = 1.0 + _randomly_negate(level)\n    return (level,)\n\n\ndef _shear_level_to_arg(level, _hparams):\n    # range [-0.3, 0.3]\n    level = (level / _MAX_LEVEL) * 0.3\n    level = _randomly_negate(level)\n    return (level,)\n\n\ndef _translate_abs_level_to_arg(level, hparams):\n    translate_const = hparams[\"translate_const\"]\n    level = (level / _MAX_LEVEL) * float(translate_const)\n    level = _randomly_negate(level)\n    return (level,)\n\n\ndef _translate_rel_level_to_arg(level, hparams):\n    # default range [-0.45, 0.45]\n    translate_pct = hparams.get(\"translate_pct\", 0.45)\n    level = (level / _MAX_LEVEL) * translate_pct\n    level = _randomly_negate(level)\n    return (level,)\n\n\ndef _posterize_level_to_arg(level, _hparams):\n    # As per Tensorflow TPU EfficientNet impl\n    # range [0, 4], 'keep 0 up to 4 MSB of original image'\n    # intensity/severity of augmentation decreases with level\n    return (int((level / _MAX_LEVEL) * 4),)\n\n\ndef _posterize_increasing_level_to_arg(level, hparams):\n    # As per Tensorflow models research and UDA impl\n    # range [4, 0], 'keep 4 down to 0 MSB of original image',\n    # intensity/severity of augmentation increases with level\n    return (4 - _posterize_level_to_arg(level, hparams)[0],)\n\n\ndef _posterize_original_level_to_arg(level, _hparams):\n    # As per original AutoAugment paper description\n    # range [4, 8], 'keep 4 up to 8 MSB of image'\n    # intensity/severity of augmentation decreases with level\n    return (int((level / _MAX_LEVEL) * 4) + 4,)\n\n\ndef _solarize_level_to_arg(level, _hparams):\n    # range [0, 256]\n    # intensity/severity of augmentation decreases with level\n    return (int((level / _MAX_LEVEL) * 256),)\n\n\ndef _solarize_increasing_level_to_arg(level, _hparams):\n    # range [0, 256]\n    # intensity/severity of augmentation increases with level\n    return 
(256 - _solarize_level_to_arg(level, _hparams)[0],)\n\n\ndef _solarize_add_level_to_arg(level, _hparams):\n    # range [0, 110]\n    return (int((level / _MAX_LEVEL) * 110),)\n\n\nLEVEL_TO_ARG = {\n    \"AutoContrast\": None,\n    \"Equalize\": None,\n    \"Invert\": None,\n    \"Rotate\": _rotate_level_to_arg,\n    # There are several variations of the posterize level scaling in various Tensorflow/Google repositories/papers\n    \"Posterize\": _posterize_level_to_arg,\n    \"PosterizeIncreasing\": _posterize_increasing_level_to_arg,\n    \"PosterizeOriginal\": _posterize_original_level_to_arg,\n    \"Solarize\": _solarize_level_to_arg,\n    \"SolarizeIncreasing\": _solarize_increasing_level_to_arg,\n    \"SolarizeAdd\": _solarize_add_level_to_arg,\n    \"Color\": _enhance_level_to_arg,\n    \"ColorIncreasing\": _enhance_increasing_level_to_arg,\n    \"Contrast\": _enhance_level_to_arg,\n    \"ContrastIncreasing\": _enhance_increasing_level_to_arg,\n    \"Brightness\": _enhance_level_to_arg,\n    \"BrightnessIncreasing\": _enhance_increasing_level_to_arg,\n    \"Sharpness\": _enhance_level_to_arg,\n    \"SharpnessIncreasing\": _enhance_increasing_level_to_arg,\n    \"ShearX\": _shear_level_to_arg,\n    \"ShearY\": _shear_level_to_arg,\n    \"TranslateX\": _translate_abs_level_to_arg,\n    \"TranslateY\": _translate_abs_level_to_arg,\n    \"TranslateXRel\": _translate_rel_level_to_arg,\n    \"TranslateYRel\": _translate_rel_level_to_arg,\n}\n\n\nNAME_TO_OP = {\n    \"AutoContrast\": auto_contrast,\n    \"Equalize\": equalize,\n    \"Invert\": invert,\n    \"Rotate\": rotate,\n    \"Posterize\": posterize,\n    \"PosterizeIncreasing\": posterize,\n    \"PosterizeOriginal\": posterize,\n    \"Solarize\": solarize,\n    \"SolarizeIncreasing\": solarize,\n    \"SolarizeAdd\": solarize_add,\n    \"Color\": color,\n    \"ColorIncreasing\": color,\n    \"Contrast\": contrast,\n    \"ContrastIncreasing\": contrast,\n    \"Brightness\": brightness,\n    
\"BrightnessIncreasing\": brightness,\n    \"Sharpness\": sharpness,\n    \"SharpnessIncreasing\": sharpness,\n    \"ShearX\": shear_x,\n    \"ShearY\": shear_y,\n    \"TranslateX\": translate_x_abs,\n    \"TranslateY\": translate_y_abs,\n    \"TranslateXRel\": translate_x_rel,\n    \"TranslateYRel\": translate_y_rel,\n}\n\n\nclass AugmentOp:\n    \"\"\"\n    Apply for video.\n    \"\"\"\n\n    def __init__(self, name, prob=0.5, magnitude=10, hparams=None):\n        hparams = hparams or _HPARAMS_DEFAULT\n        self.aug_fn = NAME_TO_OP[name]\n        self.level_fn = LEVEL_TO_ARG[name]\n        self.prob = prob\n        self.magnitude = magnitude\n        self.hparams = hparams.copy()\n        self.kwargs = {\n            \"fillcolor\": hparams[\"img_mean\"] if \"img_mean\" in hparams else _FILL,\n            \"resample\": (\n                hparams[\"interpolation\"]\n                if \"interpolation\" in hparams\n                else _RANDOM_INTERPOLATION\n            ),\n        }\n\n        # If magnitude_std is > 0, we introduce some randomness\n        # in the usually fixed policy and sample magnitude from a normal distribution\n        # with mean `magnitude` and std-dev of `magnitude_std`.\n        # NOTE This is my own hack, being tested, not in papers or reference impls.\n        self.magnitude_std = self.hparams.get(\"magnitude_std\", 0)\n\n    def __call__(self, img_list):\n        if self.prob < 1.0 and random.random() > self.prob:\n            return img_list\n        magnitude = self.magnitude\n        if self.magnitude_std and self.magnitude_std > 0:\n            magnitude = random.gauss(magnitude, self.magnitude_std)\n        magnitude = min(_MAX_LEVEL, max(0, magnitude))  # clip to valid range\n        level_args = (\n            self.level_fn(magnitude, self.hparams) if self.level_fn is not None else ()\n        )\n\n        if isinstance(img_list, list):\n            return [self.aug_fn(img, *level_args, **self.kwargs) for img in img_list]\n  
      else:\n            return self.aug_fn(img_list, *level_args, **self.kwargs)\n\n\n_RAND_TRANSFORMS = [\n    \"AutoContrast\",\n    \"Equalize\",\n    \"Invert\",\n    \"Rotate\",\n    \"Posterize\",\n    \"Solarize\",\n    \"SolarizeAdd\",\n    \"Color\",\n    \"Contrast\",\n    \"Brightness\",\n    \"Sharpness\",\n    \"ShearX\",\n    \"ShearY\",\n    \"TranslateXRel\",\n    \"TranslateYRel\",\n]\n\n\n_RAND_INCREASING_TRANSFORMS = [\n    \"AutoContrast\",\n    \"Equalize\",\n    \"Invert\",\n    \"Rotate\",\n    \"PosterizeIncreasing\",\n    \"SolarizeIncreasing\",\n    \"SolarizeAdd\",\n    \"ColorIncreasing\",\n    \"ContrastIncreasing\",\n    \"BrightnessIncreasing\",\n    \"SharpnessIncreasing\",\n    \"ShearX\",\n    \"ShearY\",\n    \"TranslateXRel\",\n    \"TranslateYRel\",\n]\n\n\n# These experimental weights are based loosely on the relative improvements mentioned in paper.\n# They may not result in increased performance, but could likely be tuned to so.\n_RAND_CHOICE_WEIGHTS_0 = {\n    \"Rotate\": 0.3,\n    \"ShearX\": 0.2,\n    \"ShearY\": 0.2,\n    \"TranslateXRel\": 0.1,\n    \"TranslateYRel\": 0.1,\n    \"Color\": 0.025,\n    \"Sharpness\": 0.025,\n    \"AutoContrast\": 0.025,\n    \"Solarize\": 0.005,\n    \"SolarizeAdd\": 0.005,\n    \"Contrast\": 0.005,\n    \"Brightness\": 0.005,\n    \"Equalize\": 0.005,\n    \"Posterize\": 0,\n    \"Invert\": 0,\n}\n\n\ndef _select_rand_weights(weight_idx=0, transforms=None):\n    transforms = transforms or _RAND_TRANSFORMS\n    assert weight_idx == 0  # only one set of weights currently\n    rand_weights = _RAND_CHOICE_WEIGHTS_0\n    probs = [rand_weights[k] for k in transforms]\n    probs /= np.sum(probs)\n    return probs\n\n\ndef rand_augment_ops(magnitude=10, hparams=None, transforms=None):\n    hparams = hparams or _HPARAMS_DEFAULT\n    transforms = transforms or _RAND_TRANSFORMS\n    return [\n        AugmentOp(name, prob=0.5, magnitude=magnitude, hparams=hparams)\n        for name in transforms\n   
 ]\n\n\nclass RandAugment:\n    def __init__(self, ops, num_layers=2, choice_weights=None):\n        self.ops = ops\n        self.num_layers = num_layers\n        self.choice_weights = choice_weights\n\n    def __call__(self, img):\n        # no replacement when using weighted choice\n        ops = np.random.choice(\n            self.ops,\n            self.num_layers,\n            replace=self.choice_weights is None,\n            p=self.choice_weights,\n        )\n        for op in ops:\n            img = op(img)\n        return img\n\n\ndef rand_augment_transform(config_str, hparams):\n    \"\"\"\n    RandAugment: Practical automated data augmentation... - https://arxiv.org/abs/1909.13719\n\n    Create a RandAugment transform\n    :param config_str: String defining configuration of random augmentation. Consists of multiple sections separated by\n    dashes ('-'). The first section defines the specific variant of rand augment (currently only 'rand'). The remaining\n    sections, in no specific order, determine:\n        'm' - integer magnitude of rand augment\n        'n' - integer num layers (number of transform ops selected per image)\n        'w' - integer probability weight index (index of a set of weights to influence choice of op)\n        'mstd' -  float std deviation of magnitude noise applied\n        'inc' - integer (bool), use augmentations that increase in severity with magnitude (default: 0)\n    Ex 'rand-m9-n3-mstd0.5' results in RandAugment with magnitude 9, num_layers 3, magnitude_std 0.5\n    'rand-mstd1-w0' results in magnitude_std 1.0, weights 0, default magnitude of 10 and num_layers 2\n    :param hparams: Other hparams (kwargs) for the RandAugmentation scheme\n    :return: A PyTorch compatible Transform\n    \"\"\"\n    magnitude = _MAX_LEVEL  # default to _MAX_LEVEL for magnitude (currently 10)\n    num_layers = 2  # default to 2 ops per image\n    weight_idx = None  # default to no probability weights for op choice\n    transforms = 
_RAND_TRANSFORMS\n    config = config_str.split(\"-\")\n    assert config[0] == \"rand\"\n    config = config[1:]\n    for c in config:\n        cs = re.split(r\"(\\d.*)\", c)\n        if len(cs) < 2:\n            continue\n        key, val = cs[:2]\n        if key == \"mstd\":\n            # noise param injected via hparams for now\n            hparams.setdefault(\"magnitude_std\", float(val))\n        elif key == \"inc\":\n            # val is a string; bool(\"0\") is True, so convert to int first\n            if int(val):\n                transforms = _RAND_INCREASING_TRANSFORMS\n        elif key == \"m\":\n            magnitude = int(val)\n        elif key == \"n\":\n            num_layers = int(val)\n        elif key == \"w\":\n            weight_idx = int(val)\n        else:\n            # `assert NotImplementedError` always passes; raise instead\n            raise NotImplementedError(\n                \"Unknown RandAugment config section: {}\".format(c)\n            )\n    ra_ops = rand_augment_ops(\n        magnitude=magnitude, hparams=hparams, transforms=transforms\n    )\n    choice_weights = None if weight_idx is None else _select_rand_weights(weight_idx)\n    return RandAugment(ra_ops, num_layers, choice_weights=choice_weights)\n"
  },
  {
    "path": "slowfast/datasets/random_erasing.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"\nThis implementation is based on\nhttps://github.com/rwightman/pytorch-image-models/blob/master/timm/data/random_erasing.py\npulished under an Apache License 2.0.\n\nCOMMENT FROM ORIGINAL:\nOriginally inspired by impl at https://github.com/zhunzhong07/Random-Erasing, Apache 2.0\nCopyright Zhun Zhong & Liang Zheng\nHacked together by / Copyright 2020 Ross Wightman\n\"\"\"\n\nimport math\nimport random\n\nimport torch\n\n\ndef _get_pixels(per_pixel, rand_color, patch_size, dtype=torch.float32, device=\"cuda\"):\n    # NOTE I've seen CUDA illegal memory access errors being caused by the normal_()\n    # paths, flip the order so normal is run on CPU if this becomes a problem\n    # Issue has been fixed in master https://github.com/pytorch/pytorch/issues/19508\n    if per_pixel:\n        return torch.empty(patch_size, dtype=dtype, device=device).normal_()\n    elif rand_color:\n        return torch.empty((patch_size[0], 1, 1), dtype=dtype, device=device).normal_()\n    else:\n        return torch.zeros((patch_size[0], 1, 1), dtype=dtype, device=device)\n\n\nclass RandomErasing:\n    \"\"\"Randomly selects a rectangle region in an image and erases its pixels.\n        'Random Erasing Data Augmentation' by Zhong et al.\n        See https://arxiv.org/pdf/1708.04896.pdf\n        This variant of RandomErasing is intended to be applied to either a batch\n        or single image tensor after it has been normalized by dataset mean and std.\n    Args:\n         probability: Probability that the Random Erasing operation will be performed.\n         min_area: Minimum percentage of erased area wrt input image area.\n         max_area: Maximum percentage of erased area wrt input image area.\n         min_aspect: Minimum aspect ratio of erased area.\n         mode: pixel color mode, one of 'const', 'rand', or 'pixel'\n            'const' - erase block is constant color of 0 for all 
channels\n            'rand'  - erase block is same per-channel random (normal) color\n            'pixel' - erase block is per-pixel random (normal) color\n        max_count: maximum number of erasing blocks per image, area per box is scaled by count.\n            per-image count is randomly chosen between 1 and this value.\n    \"\"\"\n\n    def __init__(\n        self,\n        probability=0.5,\n        min_area=0.02,\n        max_area=1 / 3,\n        min_aspect=0.3,\n        max_aspect=None,\n        mode=\"const\",\n        min_count=1,\n        max_count=None,\n        num_splits=0,\n        device=\"cuda\",\n        cube=True,\n    ):\n        self.probability = probability\n        self.min_area = min_area\n        self.max_area = max_area\n        max_aspect = max_aspect or 1 / min_aspect\n        self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect))\n        self.min_count = min_count\n        self.max_count = max_count or min_count\n        self.num_splits = num_splits\n        mode = mode.lower()\n        self.rand_color = False\n        self.per_pixel = False\n        self.cube = cube\n        if mode == \"rand\":\n            self.rand_color = True  # per block random normal\n        elif mode == \"pixel\":\n            self.per_pixel = True  # per pixel random normal\n        else:\n            assert not mode or mode == \"const\"\n        self.device = device\n\n    def _erase(self, img, chan, img_h, img_w, dtype):\n        if random.random() > self.probability:\n            return\n        area = img_h * img_w\n        count = (\n            self.min_count\n            if self.min_count == self.max_count\n            else random.randint(self.min_count, self.max_count)\n        )\n        for _ in range(count):\n            for _ in range(10):\n                target_area = (\n                    random.uniform(self.min_area, self.max_area) * area / count\n                )\n                aspect_ratio = 
math.exp(random.uniform(*self.log_aspect_ratio))\n                h = int(round(math.sqrt(target_area * aspect_ratio)))\n                w = int(round(math.sqrt(target_area / aspect_ratio)))\n                if w < img_w and h < img_h:\n                    top = random.randint(0, img_h - h)\n                    left = random.randint(0, img_w - w)\n                    img[:, top : top + h, left : left + w] = _get_pixels(\n                        self.per_pixel,\n                        self.rand_color,\n                        (chan, h, w),\n                        dtype=dtype,\n                        device=self.device,\n                    )\n                    break\n\n    def _erase_cube(\n        self,\n        img,\n        batch_start,\n        batch_size,\n        chan,\n        img_h,\n        img_w,\n        dtype,\n    ):\n        if random.random() > self.probability:\n            return\n        area = img_h * img_w\n        count = (\n            self.min_count\n            if self.min_count == self.max_count\n            else random.randint(self.min_count, self.max_count)\n        )\n        for _ in range(count):\n            for _ in range(100):\n                target_area = (\n                    random.uniform(self.min_area, self.max_area) * area / count\n                )\n                aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio))\n                h = int(round(math.sqrt(target_area * aspect_ratio)))\n                w = int(round(math.sqrt(target_area / aspect_ratio)))\n                if w < img_w and h < img_h:\n                    top = random.randint(0, img_h - h)\n                    left = random.randint(0, img_w - w)\n                    for i in range(batch_start, batch_size):\n                        img_instance = img[i]\n                        img_instance[:, top : top + h, left : left + w] = _get_pixels(\n                            self.per_pixel,\n                            self.rand_color,\n               
             (chan, h, w),\n                            dtype=dtype,\n                            device=self.device,\n                        )\n                    break\n\n    def __call__(self, input):\n        if len(input.size()) == 3:\n            self._erase(input, *input.size(), input.dtype)\n        else:\n            batch_size, chan, img_h, img_w = input.size()\n            # skip first slice of batch if num_splits is set (for clean portion of samples)\n            batch_start = batch_size // self.num_splits if self.num_splits > 1 else 0\n            if self.cube:\n                self._erase_cube(\n                    input,\n                    batch_start,\n                    batch_size,\n                    chan,\n                    img_h,\n                    img_w,\n                    input.dtype,\n                )\n            else:\n                for i in range(batch_start, batch_size):\n                    self._erase(input[i], chan, img_h, img_w, input.dtype)\n        return input\n"
  },
  {
    "path": "slowfast/datasets/ssv2.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport json\nimport os\nimport random\nfrom itertools import chain as chain\n\nimport numpy as np\nimport slowfast.utils.logging as logging\nimport torch\nimport torch.utils.data\nfrom slowfast.utils.env import pathmgr\n\nfrom . import utils as utils\nfrom .build import DATASET_REGISTRY\n\nlogger = logging.get_logger(__name__)\n\n\n@DATASET_REGISTRY.register()\nclass Ssv2(torch.utils.data.Dataset):\n    \"\"\"\n    Something-Something v2 (SSV2) video loader. Construct the SSV2 video loader,\n    then sample clips from the videos. For training and validation, a single\n    clip is randomly sampled from every video with random cropping, scaling, and\n    flipping. For testing, multiple clips are uniformaly sampled from every\n    video with uniform cropping. For uniform cropping, we take the left, center,\n    and right crop if the width is larger than height, or take top, center, and\n    bottom crop if the height is larger than the width.\n    \"\"\"\n\n    def __init__(self, cfg, mode, num_retries=10):\n        \"\"\"\n        Load Something-Something V2 data (frame paths, labels, etc. ) to a given\n        Dataset object. 
The dataset can be downloaded from the Something-Something\n        official website (https://20bn.com/datasets/something-something).\n        Please see datasets/DATASET.md for more information about the data format.\n        Args:\n            cfg (CfgNode): configs.\n            mode (string): Options include `train`, `val`, or `test` mode.\n                For the train and val mode, the data loader will take data\n                from the train or val set, and sample one clip per video.\n                For the test mode, the data loader will take data from the test set,\n                and sample multiple clips per video.\n            num_retries (int): number of retries for reading frames from disk.\n        \"\"\"\n        # Only support train, val, and test mode.\n        assert mode in [\n            \"train\",\n            \"val\",\n            \"test\",\n        ], \"Split '{}' not supported for Something-Something V2\".format(mode)\n        self.mode = mode\n        self.cfg = cfg\n\n        self._video_meta = {}\n        self._num_retries = num_retries\n        # For training or validation mode, one single clip is sampled from every\n        # video. For testing, NUM_ENSEMBLE_VIEWS clips are sampled from every\n        # video. 
For every clip, NUM_SPATIAL_CROPS is cropped spatially from\n        # the frames.\n        if self.mode in [\"train\", \"val\"]:\n            self._num_clips = 1\n        elif self.mode in [\"test\"]:\n            self._num_clips = cfg.TEST.NUM_ENSEMBLE_VIEWS * cfg.TEST.NUM_SPATIAL_CROPS\n\n        logger.info(\"Constructing Something-Something V2 {}...\".format(mode))\n        self._construct_loader()\n\n        self.aug = False\n        self.rand_erase = False\n        self.use_temporal_gradient = False\n        self.temporal_gradient_rate = 0.0\n\n        if self.mode == \"train\" and self.cfg.AUG.ENABLE:\n            self.aug = True\n            if self.cfg.AUG.RE_PROB > 0:\n                self.rand_erase = True\n\n    def _construct_loader(self):\n        \"\"\"\n        Construct the video loader.\n        \"\"\"\n        # Loading label names.\n        with pathmgr.open(\n            os.path.join(\n                self.cfg.DATA.PATH_TO_DATA_DIR,\n                \"something-something-v2-labels.json\",\n            ),\n            \"r\",\n        ) as f:\n            label_dict = json.load(f)\n\n        # Loading labels.\n        label_file = os.path.join(\n            self.cfg.DATA.PATH_TO_DATA_DIR,\n            \"something-something-v2-{}.json\".format(\n                \"train\" if self.mode == \"train\" else \"validation\"\n            ),\n        )\n        with pathmgr.open(label_file, \"r\") as f:\n            label_json = json.load(f)\n\n        self._video_names = []\n        self._labels = []\n        for video in label_json:\n            video_name = video[\"id\"]\n            template = video[\"template\"]\n            template = template.replace(\"[\", \"\")\n            template = template.replace(\"]\", \"\")\n            label = int(label_dict[template])\n            self._video_names.append(video_name)\n            self._labels.append(label)\n\n        path_to_file = os.path.join(\n            self.cfg.DATA.PATH_TO_DATA_DIR,\n            
\"{}.csv\".format(\"train\" if self.mode == \"train\" else \"val\"),\n        )\n        assert pathmgr.exists(path_to_file), \"{} dir not found\".format(path_to_file)\n\n        self._path_to_videos, _ = utils.load_image_lists(\n            path_to_file, self.cfg.DATA.PATH_PREFIX\n        )\n\n        assert len(self._path_to_videos) == len(self._video_names), (\n            len(self._path_to_videos),\n            len(self._video_names),\n        )\n\n        # From dict to list.\n        new_paths, new_labels = [], []\n        for index in range(len(self._video_names)):\n            if self._video_names[index] in self._path_to_videos:\n                new_paths.append(self._path_to_videos[self._video_names[index]])\n                new_labels.append(self._labels[index])\n\n        self._labels = new_labels\n        self._path_to_videos = new_paths\n\n        # Extend self when self._num_clips > 1 (during testing).\n        self._path_to_videos = list(\n            chain.from_iterable([[x] * self._num_clips for x in self._path_to_videos])\n        )\n        self._labels = list(\n            chain.from_iterable([[x] * self._num_clips for x in self._labels])\n        )\n        self._spatial_temporal_idx = list(\n            chain.from_iterable(\n                [range(self._num_clips) for _ in range(len(self._path_to_videos))]\n            )\n        )\n        logger.info(\n            \"Something-Something V2 dataloader constructed  (size: {}) from {}\".format(\n                len(self._path_to_videos), path_to_file\n            )\n        )\n\n    def get_seq_frames(self, index):\n        \"\"\"\n        Given the video index, return the list of sampled frame indexes.\n        Args:\n            index (int): the video index.\n        Returns:\n            seq (list): the indexes of frames of sampled from the video.\n        \"\"\"\n        num_frames = self.cfg.DATA.NUM_FRAMES\n        video_length = len(self._path_to_videos[index])\n\n        seg_size = 
float(video_length - 1) / num_frames\n        seq = []\n        for i in range(num_frames):\n            start = int(np.round(seg_size * i))\n            end = int(np.round(seg_size * (i + 1)))\n            if self.mode == \"train\":\n                seq.append(random.randint(start, end))\n            else:\n                seq.append((start + end) // 2)\n\n        return seq\n\n    def __getitem__(self, index):\n        \"\"\"\n        Given the video index, return the list of frames, label, and video\n        index if the video frames can be fetched.\n        Args:\n            index (int): the video index provided by the pytorch sampler.\n        Returns:\n            frames (tensor): the frames of sampled from the video. The dimension\n                is `channel` x `num frames` x `height` x `width`.\n            label (int): the label of the current video.\n            index (int): the index of the video.\n        \"\"\"\n        short_cycle_idx = None\n        # When short cycle is used, input index is a tupple.\n        if isinstance(index, tuple):\n            index, self._num_yielded = index\n            if self.cfg.MULTIGRID.SHORT_CYCLE:\n                index, short_cycle_idx = index\n\n        if self.mode in [\"train\", \"val\"]:\n            # -1 indicates random sampling.\n            spatial_sample_index = -1\n            min_scale = self.cfg.DATA.TRAIN_JITTER_SCALES[0]\n            max_scale = self.cfg.DATA.TRAIN_JITTER_SCALES[1]\n            crop_size = self.cfg.DATA.TRAIN_CROP_SIZE\n            if short_cycle_idx in [0, 1]:\n                crop_size = int(\n                    round(\n                        self.cfg.MULTIGRID.SHORT_CYCLE_FACTORS[short_cycle_idx]\n                        * self.cfg.MULTIGRID.DEFAULT_S\n                    )\n                )\n            if self.cfg.MULTIGRID.DEFAULT_S > 0:\n                # Decreasing the scale is equivalent to using a larger \"span\"\n                # in a sampling grid.\n                
min_scale = int(\n                    round(float(min_scale) * crop_size / self.cfg.MULTIGRID.DEFAULT_S)\n                )\n        elif self.mode in [\"test\"]:\n            # spatial_sample_index is in [0, 1, 2]. Corresponding to left,\n            # center, or right if width is larger than height, and top, middle,\n            # or bottom if height is larger than width.\n            spatial_sample_index = (\n                self._spatial_temporal_idx[index] % self.cfg.TEST.NUM_SPATIAL_CROPS\n            )\n            min_scale, max_scale, crop_size = [self.cfg.DATA.TEST_CROP_SIZE] * 3\n            # The testing is deterministic and no jitter should be performed.\n            # min_scale, max_scale, and crop_size are expect to be the same.\n            assert len({min_scale, max_scale, crop_size}) == 1\n        else:\n            raise NotImplementedError(\"Does not support {} mode\".format(self.mode))\n\n        label = self._labels[index]\n\n        seq = self.get_seq_frames(index)\n\n        frames = torch.as_tensor(\n            utils.retry_load_images(\n                [self._path_to_videos[index][frame] for frame in seq],\n                self._num_retries,\n            )\n        )\n\n        if self.aug:\n            if self.cfg.AUG.NUM_SAMPLE > 1:\n                frame_list = []\n                label_list = []\n                index_list = []\n                for _ in range(self.cfg.AUG.NUM_SAMPLE):\n                    new_frames = utils.aug_frame(\n                        self.cfg,\n                        self.mode,\n                        self.rand_erase,\n                        frames,\n                        spatial_sample_index,\n                        min_scale,\n                        max_scale,\n                        crop_size,\n                    )\n                    new_frames = utils.pack_pathway_output(self.cfg, new_frames)\n                    frame_list.append(new_frames)\n                    label_list.append(label)\n       
             index_list.append(index)\n                return (\n                    frame_list,\n                    label_list,\n                    index_list,\n                    [0] * self.cfg.AUG.NUM_SAMPLE,\n                    {},\n                )\n\n            else:\n                frames = utils.aug_frame(\n                    self.cfg,\n                    self.mode,\n                    self.rand_erase,\n                    frames,\n                    spatial_sample_index,\n                    min_scale,\n                    max_scale,\n                    crop_size,\n                )\n        else:\n            # Perform color normalization.\n            frames = utils.tensor_normalize(\n                frames, self.cfg.DATA.MEAN, self.cfg.DATA.STD\n            )\n\n            # T H W C -> C T H W.\n            frames = frames.permute(3, 0, 1, 2)\n            # Perform data augmentation.\n            frames = utils.spatial_sampling(\n                frames,\n                spatial_idx=spatial_sample_index,\n                min_scale=min_scale,\n                max_scale=max_scale,\n                crop_size=crop_size,\n                random_horizontal_flip=self.cfg.DATA.RANDOM_FLIP,\n                inverse_uniform_sampling=self.cfg.DATA.INV_UNIFORM_SAMPLE,\n            )\n        frames = utils.pack_pathway_output(self.cfg, frames)\n        return frames, label, index, 0, {}\n\n    def __len__(self):\n        \"\"\"\n        Returns:\n            (int): the number of videos in the dataset.\n        \"\"\"\n        return self.num_videos\n\n    @property\n    def num_videos(self):\n        \"\"\"\n        Returns:\n            (int): the number of videos in the dataset.\n        \"\"\"\n        return len(self._path_to_videos)\n"
  },
  {
    "path": "slowfast/datasets/transform.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport logging\nimport math\n\n# import cv2\nimport random\n\nimport numpy as np\nimport torch\nimport torchvision as tv\nimport torchvision.transforms.functional as F\nfrom PIL import Image, ImageFilter\nfrom scipy.ndimage import gaussian_filter\nfrom torchvision import transforms\n\nfrom .rand_augment import rand_augment_transform\nfrom .random_erasing import RandomErasing\n\n_pil_interpolation_to_str = {\n    Image.NEAREST: \"PIL.Image.NEAREST\",\n    Image.BILINEAR: \"PIL.Image.BILINEAR\",\n    Image.BICUBIC: \"PIL.Image.BICUBIC\",\n    Image.LANCZOS: \"PIL.Image.LANCZOS\",\n    Image.HAMMING: \"PIL.Image.HAMMING\",\n    Image.BOX: \"PIL.Image.BOX\",\n}\n\n\n_RANDOM_INTERPOLATION = (Image.BILINEAR, Image.BICUBIC)\n\n\ndef _pil_interp(method):\n    if method == \"bicubic\":\n        return Image.BICUBIC\n    elif method == \"lanczos\":\n        return Image.LANCZOS\n    elif method == \"hamming\":\n        return Image.HAMMING\n    else:\n        return Image.BILINEAR\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef random_short_side_scale_jitter(\n    images, min_size, max_size, boxes=None, inverse_uniform_sampling=False\n):\n    \"\"\"\n    Perform a spatial short scale jittering on the given images and\n    corresponding boxes.\n    Args:\n        images (tensor): images to perform scale jitter. Dimension is\n            `num frames` x `channel` x `height` x `width`.\n        min_size (int): the minimal size to scale the frames.\n        max_size (int): the maximal size to scale the frames.\n        boxes (ndarray): optional. Corresponding boxes to images.\n            Dimension is `num boxes` x 4.\n        inverse_uniform_sampling (bool): if True, sample uniformly in\n            [1 / max_scale, 1 / min_scale] and take a reciprocal to get the\n            scale. 
If False, take a uniform sample from [min_scale, max_scale].\n    Returns:\n        (tensor): the scaled images with dimension of\n            `num frames` x `channel` x `new height` x `new width`.\n        (ndarray or None): the scaled boxes with dimension of\n            `num boxes` x 4.\n    \"\"\"\n    if inverse_uniform_sampling:\n        size = int(round(1.0 / np.random.uniform(1.0 / max_size, 1.0 / min_size)))\n    else:\n        size = int(round(np.random.uniform(min_size, max_size)))\n\n    height = images.shape[2]\n    width = images.shape[3]\n    if (width <= height and width == size) or (height <= width and height == size):\n        return images, boxes\n    new_width = size\n    new_height = size\n    if width < height:\n        new_height = int(math.floor((float(height) / width) * size))\n        if boxes is not None:\n            boxes = boxes * float(new_height) / height\n    else:\n        new_width = int(math.floor((float(width) / height) * size))\n        if boxes is not None:\n            boxes = boxes * float(new_width) / width\n\n    return (\n        torch.nn.functional.interpolate(\n            images,\n            size=(new_height, new_width),\n            mode=\"bilinear\",\n            align_corners=False,\n        ),\n        boxes,\n    )\n\n\ndef crop_boxes(boxes, x_offset, y_offset):\n    \"\"\"\n    Perform crop on the bounding boxes given the offsets.\n    Args:\n        boxes (ndarray or None): bounding boxes to perform crop. 
The dimension\n            is `num boxes` x 4.\n        x_offset (int): cropping offset in the x axis.\n        y_offset (int): cropping offset in the y axis.\n    Returns:\n        cropped_boxes (ndarray or None): the cropped boxes with dimension of\n            `num boxes` x 4.\n    \"\"\"\n    cropped_boxes = boxes.copy()\n    cropped_boxes[:, [0, 2]] = boxes[:, [0, 2]] - x_offset\n    cropped_boxes[:, [1, 3]] = boxes[:, [1, 3]] - y_offset\n\n    return cropped_boxes\n\n\ndef random_crop(images, size, boxes=None):\n    \"\"\"\n    Perform random spatial crop on the given images and corresponding boxes.\n    Args:\n        images (tensor): images to perform random crop. The dimension is\n            `num frames` x `channel` x `height` x `width`.\n        size (int): the size of height and width to crop on the image.\n        boxes (ndarray or None): optional. Corresponding boxes to images.\n            Dimension is `num boxes` x 4.\n    Returns:\n        cropped (tensor): cropped images with dimension of\n            `num frames` x `channel` x `size` x `size`.\n        cropped_boxes (ndarray or None): the cropped boxes with dimension of\n            `num boxes` x 4.\n    \"\"\"\n    if images.shape[2] == size and images.shape[3] == size:\n        return images, boxes\n    height = images.shape[2]\n    width = images.shape[3]\n    y_offset = 0\n    if height > size:\n        y_offset = int(np.random.randint(0, height - size))\n    x_offset = 0\n    if width > size:\n        x_offset = int(np.random.randint(0, width - size))\n    cropped = images[:, :, y_offset : y_offset + size, x_offset : x_offset + size]\n\n    cropped_boxes = crop_boxes(boxes, x_offset, y_offset) if boxes is not None else None\n\n    return cropped, cropped_boxes\n\n\ndef horizontal_flip(prob, images, boxes=None):\n    \"\"\"\n    Perform horizontal flip on the given images and corresponding boxes.\n    Args:\n        prob (float): probability to flip the images.\n        images (tensor): images 
to perform horizontal flip, the dimension is\n            `num frames` x `channel` x `height` x `width`.\n        boxes (ndarray or None): optional. Corresponding boxes to images.\n            Dimension is `num boxes` x 4.\n    Returns:\n        images (tensor): images with dimension of\n            `num frames` x `channel` x `height` x `width`.\n        flipped_boxes (ndarray or None): the flipped boxes with dimension of\n            `num boxes` x 4.\n    \"\"\"\n    if boxes is None:\n        flipped_boxes = None\n    else:\n        flipped_boxes = boxes.copy()\n\n    if np.random.uniform() < prob:\n        images = images.flip((-1))\n\n        if len(images.shape) == 3:\n            width = images.shape[2]\n        elif len(images.shape) == 4:\n            width = images.shape[3]\n        else:\n            raise NotImplementedError(\"Dimension is not supported\")\n        if boxes is not None:\n            flipped_boxes[:, [0, 2]] = width - boxes[:, [2, 0]] - 1\n\n    return images, flipped_boxes\n\n\ndef uniform_crop(images, size, spatial_idx, boxes=None, scale_size=None):\n    \"\"\"\n    Perform uniform spatial sampling on the images and corresponding boxes.\n    Args:\n        images (tensor): images to perform uniform crop. The dimension is\n            `num frames` x `channel` x `height` x `width`.\n        size (int): size of height and width to crop the images.\n        spatial_idx (int): 0, 1, or 2 for left, center, and right crop if width\n            is larger than height. Or 0, 1, or 2 for top, center, and bottom\n            crop if height is larger than width.\n        boxes (ndarray or None): optional. Corresponding boxes to images.\n            Dimension is `num boxes` x 4.\n        scale_size (int): optional. 
If not None, resize the images to scale_size before\n            performing any crop.\n    Returns:\n        cropped (tensor): images with dimension of\n            `num frames` x `channel` x `size` x `size`.\n        cropped_boxes (ndarray or None): the cropped boxes with dimension of\n            `num boxes` x 4.\n    \"\"\"\n    assert spatial_idx in [0, 1, 2]\n    ndim = len(images.shape)\n    if ndim == 3:\n        images = images.unsqueeze(0)\n    height = images.shape[2]\n    width = images.shape[3]\n\n    if scale_size is not None:\n        if width <= height:\n            width, height = scale_size, int(height / width * scale_size)\n        else:\n            width, height = int(width / height * scale_size), scale_size\n        images = torch.nn.functional.interpolate(\n            images,\n            size=(height, width),\n            mode=\"bilinear\",\n            align_corners=False,\n        )\n\n    y_offset = int(math.ceil((height - size) / 2))\n    x_offset = int(math.ceil((width - size) / 2))\n\n    if height > width:\n        if spatial_idx == 0:\n            y_offset = 0\n        elif spatial_idx == 2:\n            y_offset = height - size\n    else:\n        if spatial_idx == 0:\n            x_offset = 0\n        elif spatial_idx == 2:\n            x_offset = width - size\n    cropped = images[:, :, y_offset : y_offset + size, x_offset : x_offset + size]\n    cropped_boxes = crop_boxes(boxes, x_offset, y_offset) if boxes is not None else None\n    if ndim == 3:\n        cropped = cropped.squeeze(0)\n    return cropped, cropped_boxes\n\n\ndef clip_boxes_to_image(boxes, height, width):\n    \"\"\"\n    Clip an array of boxes to an image with the given height and width.\n    Args:\n        boxes (ndarray): bounding boxes to perform clipping.\n            Dimension is `num boxes` x 4.\n        height (int): given image height.\n        width (int): given image width.\n    Returns:\n        clipped_boxes (ndarray): the clipped boxes with dimension 
of\n            `num boxes` x 4.\n    \"\"\"\n    clipped_boxes = boxes.copy()\n    clipped_boxes[:, [0, 2]] = np.minimum(\n        width - 1.0, np.maximum(0.0, boxes[:, [0, 2]])\n    )\n    clipped_boxes[:, [1, 3]] = np.minimum(\n        height - 1.0, np.maximum(0.0, boxes[:, [1, 3]])\n    )\n    return clipped_boxes\n\n\ndef blend(images1, images2, alpha):\n    \"\"\"\n    Blend two images with a given weight alpha.\n    Args:\n        images1 (tensor): the first images to be blended, the dimension is\n            `num frames` x `channel` x `height` x `width`.\n        images2 (tensor): the second images to be blended, the dimension is\n            `num frames` x `channel` x `height` x `width`.\n        alpha (float): the blending weight.\n    Returns:\n        (tensor): blended images, the dimension is\n            `num frames` x `channel` x `height` x `width`.\n    \"\"\"\n    return images1 * alpha + images2 * (1 - alpha)\n\n\ndef grayscale(images):\n    \"\"\"\n    Get the grayscale for the input images. The channels of images should be\n    in order BGR.\n    Args:\n        images (tensor): the input images for getting grayscale. Dimension is\n            `num frames` x `channel` x `height` x `width`.\n    Returns:\n        img_gray (tensor): grayscale images, the dimension is\n            `num frames` x `channel` x `height` x `width`.\n    \"\"\"\n    # R -> 0.299, G -> 0.587, B -> 0.114.\n    # Copy the input; clone().detach() avoids the torch.tensor(tensor) warning.\n    img_gray = images.clone().detach()\n    gray_channel = 0.299 * images[:, 2] + 0.587 * images[:, 1] + 0.114 * images[:, 0]\n    img_gray[:, 0] = gray_channel\n    img_gray[:, 1] = gray_channel\n    img_gray[:, 2] = gray_channel\n    return img_gray\n\n\ndef color_jitter(images, img_brightness=0, img_contrast=0, img_saturation=0):\n    \"\"\"\n    Perform color jittering on the input images. The channels of images\n    should be in order BGR.\n    Args:\n        images (tensor): images to perform color jitter. 
Dimension is\n            `num frames` x `channel` x `height` x `width`.\n        img_brightness (float): jitter ratio for brightness.\n        img_contrast (float): jitter ratio for contrast.\n        img_saturation (float): jitter ratio for saturation.\n    Returns:\n        images (tensor): the jittered images, the dimension is\n            `num frames` x `channel` x `height` x `width`.\n    \"\"\"\n\n    jitter = []\n    if img_brightness != 0:\n        jitter.append(\"brightness\")\n    if img_contrast != 0:\n        jitter.append(\"contrast\")\n    if img_saturation != 0:\n        jitter.append(\"saturation\")\n\n    if len(jitter) > 0:\n        order = np.random.permutation(np.arange(len(jitter)))\n        for idx in range(0, len(jitter)):\n            if jitter[order[idx]] == \"brightness\":\n                images = brightness_jitter(img_brightness, images)\n            elif jitter[order[idx]] == \"contrast\":\n                images = contrast_jitter(img_contrast, images)\n            elif jitter[order[idx]] == \"saturation\":\n                images = saturation_jitter(img_saturation, images)\n    return images\n\n\ndef brightness_jitter(var, images):\n    \"\"\"\n    Perform brightness jittering on the input images. The channels of images\n    should be in order BGR.\n    Args:\n        var (float): jitter ratio for brightness.\n        images (tensor): images to perform color jitter. Dimension is\n            `num frames` x `channel` x `height` x `width`.\n    Returns:\n        images (tensor): the jittered images, the dimension is\n            `num frames` x `channel` x `height` x `width`.\n    \"\"\"\n    alpha = 1.0 + np.random.uniform(-var, var)\n\n    img_bright = torch.zeros(images.shape)\n    images = blend(images, img_bright, alpha)\n    return images\n\n\ndef contrast_jitter(var, images):\n    \"\"\"\n    Perform contrast jittering on the input images. 
The channels of images\n    should be in order BGR.\n    Args:\n        var (float): jitter ratio for contrast.\n        images (tensor): images to perform color jitter. Dimension is\n            `num frames` x `channel` x `height` x `width`.\n    Returns:\n        images (tensor): the jittered images, the dimension is\n            `num frames` x `channel` x `height` x `width`.\n    \"\"\"\n    alpha = 1.0 + np.random.uniform(-var, var)\n\n    img_gray = grayscale(images)\n    img_gray[:] = torch.mean(img_gray, dim=(1, 2, 3), keepdim=True)\n    images = blend(images, img_gray, alpha)\n    return images\n\n\ndef saturation_jitter(var, images):\n    \"\"\"\n    Perform saturation jittering on the input images. The channels of images\n    should be in order BGR.\n    Args:\n        var (float): jitter ratio for saturation.\n        images (tensor): images to perform color jitter. Dimension is\n            `num frames` x `channel` x `height` x `width`.\n    Returns:\n        images (tensor): the jittered images, the dimension is\n            `num frames` x `channel` x `height` x `width`.\n    \"\"\"\n    alpha = 1.0 + np.random.uniform(-var, var)\n    img_gray = grayscale(images)\n    images = blend(images, img_gray, alpha)\n\n    return images\n\n\ndef lighting_jitter(images, alphastd, eigval, eigvec):\n    \"\"\"\n    Perform AlexNet-style PCA jitter on the given images.\n    Args:\n        images (tensor): images to perform lighting jitter. 
Dimension is\n            `num frames` x `channel` x `height` x `width`.\n        alphastd (float): jitter ratio for PCA jitter.\n        eigval (list): eigenvalues for PCA jitter.\n        eigvec (list[list]): eigenvectors for PCA jitter.\n    Returns:\n        out_images (tensor): the jittered images, the dimension is\n            `num frames` x `channel` x `height` x `width`.\n    \"\"\"\n    if alphastd == 0:\n        return images\n    # generate alpha1, alpha2, alpha3.\n    alpha = np.random.normal(0, alphastd, size=(1, 3))\n    eig_vec = np.array(eigvec)\n    eig_val = np.reshape(eigval, (1, 3))\n    rgb = np.sum(\n        eig_vec * np.repeat(alpha, 3, axis=0) * np.repeat(eig_val, 3, axis=0),\n        axis=1,\n    )\n    out_images = torch.zeros_like(images)\n    if len(images.shape) == 3:\n        # C H W\n        channel_dim = 0\n    elif len(images.shape) == 4:\n        # T C H W\n        channel_dim = 1\n    else:\n        raise NotImplementedError(f\"Unsupported dimension {len(images.shape)}\")\n\n    for idx in range(images.shape[channel_dim]):\n        # C H W\n        if len(images.shape) == 3:\n            out_images[idx] = images[idx] + rgb[2 - idx]\n        # T C H W\n        elif len(images.shape) == 4:\n            out_images[:, idx] = images[:, idx] + rgb[2 - idx]\n        else:\n            raise NotImplementedError(f\"Unsupported dimension {len(images.shape)}\")\n\n    return out_images\n\n\ndef color_normalization(images, mean, stddev):\n    \"\"\"\n    Perform color normalization on the given images.\n    Args:\n        images (tensor): images to perform color normalization. 
Dimension is\n            `num frames` x `channel` x `height` x `width`.\n        mean (list): mean values for normalization.\n        stddev (list): standard deviations for normalization.\n\n    Returns:\n        out_images (tensor): the normalized images, the dimension is\n            `num frames` x `channel` x `height` x `width`.\n    \"\"\"\n    if len(images.shape) == 3:\n        assert len(mean) == images.shape[0], \"channel mean not computed properly\"\n        assert len(stddev) == images.shape[0], \"channel stddev not computed properly\"\n    elif len(images.shape) == 4:\n        assert len(mean) == images.shape[1], \"channel mean not computed properly\"\n        assert len(stddev) == images.shape[1], \"channel stddev not computed properly\"\n    else:\n        raise NotImplementedError(f\"Unsupported dimension {len(images.shape)}\")\n\n    out_images = torch.zeros_like(images)\n    for idx in range(len(mean)):\n        # C H W\n        if len(images.shape) == 3:\n            out_images[idx] = (images[idx] - mean[idx]) / stddev[idx]\n        elif len(images.shape) == 4:\n            out_images[:, idx] = (images[:, idx] - mean[idx]) / stddev[idx]\n        else:\n            raise NotImplementedError(f\"Unsupported dimension {len(images.shape)}\")\n    return out_images\n\n\ndef _get_param_spatial_crop(\n    scale, ratio, height, width, num_repeat=10, log_scale=True, switch_hw=False\n):\n    \"\"\"\n    Given scale, ratio, height and width, return sampled coordinates of the videos.\n    \"\"\"\n    for _ in range(num_repeat):\n        area = height * width\n        target_area = random.uniform(*scale) * area\n        if log_scale:\n            log_ratio = (math.log(ratio[0]), math.log(ratio[1]))\n            aspect_ratio = math.exp(random.uniform(*log_ratio))\n        else:\n            aspect_ratio = random.uniform(*ratio)\n\n        w = int(round(math.sqrt(target_area * aspect_ratio)))\n        h = int(round(math.sqrt(target_area / aspect_ratio)))\n\n      
  if np.random.uniform() < 0.5 and switch_hw:\n            w, h = h, w\n\n        if 0 < w <= width and 0 < h <= height:\n            i = random.randint(0, height - h)\n            j = random.randint(0, width - w)\n            return i, j, h, w\n\n    # Fallback to central crop\n    in_ratio = float(width) / float(height)\n    if in_ratio < min(ratio):\n        w = width\n        h = int(round(w / min(ratio)))\n    elif in_ratio > max(ratio):\n        h = height\n        w = int(round(h * max(ratio)))\n    else:  # whole image\n        w = width\n        h = height\n    i = (height - h) // 2\n    j = (width - w) // 2\n    return i, j, h, w\n\n\ndef random_resized_crop(\n    images,\n    target_height,\n    target_width,\n    scale=(0.8, 1.0),\n    ratio=(3.0 / 4.0, 4.0 / 3.0),\n):\n    \"\"\"\n    Crop the given images to random size and aspect ratio. A crop of random\n    size (default: of 0.08 to 1.0) of the original size and a random aspect\n    ratio (default: of 3/4 to 4/3) of the original aspect ratio is made. This\n    crop is finally resized to given size. 
This is popularly used to train the\n    Inception networks.\n\n    Args:\n        images: Images to perform resizing and cropping.\n        target_height: Desired height after cropping.\n        target_width: Desired width after cropping.\n        scale: Scale range of Inception-style area based random resizing.\n        ratio: Aspect ratio range of Inception-style area based random resizing.\n    \"\"\"\n\n    height = images.shape[2]\n    width = images.shape[3]\n\n    i, j, h, w = _get_param_spatial_crop(scale, ratio, height, width)\n    cropped = images[:, :, i : i + h, j : j + w]\n    return torch.nn.functional.interpolate(\n        cropped,\n        size=(target_height, target_width),\n        mode=\"bilinear\",\n        align_corners=False,\n    )\n\n\ndef random_resized_crop_with_shift(\n    images,\n    target_height,\n    target_width,\n    scale=(0.8, 1.0),\n    ratio=(3.0 / 4.0, 4.0 / 3.0),\n):\n    \"\"\"\n    This is similar to random_resized_crop. However, it samples two different\n    boxes (for cropping) for the first and last frame. 
It then linearly\n    interpolates the two boxes for other frames.\n\n    Args:\n        images: Images to perform resizing and cropping.\n        target_height: Desired height after cropping.\n        target_width: Desired width after cropping.\n        scale: Scale range of Inception-style area based random resizing.\n        ratio: Aspect ratio range of Inception-style area based random resizing.\n    \"\"\"\n    t = images.shape[1]\n    height = images.shape[2]\n    width = images.shape[3]\n\n    i, j, h, w = _get_param_spatial_crop(scale, ratio, height, width)\n    i_, j_, h_, w_ = _get_param_spatial_crop(scale, ratio, height, width)\n    i_s = [int(i) for i in torch.linspace(i, i_, steps=t).tolist()]\n    j_s = [int(i) for i in torch.linspace(j, j_, steps=t).tolist()]\n    h_s = [int(i) for i in torch.linspace(h, h_, steps=t).tolist()]\n    w_s = [int(i) for i in torch.linspace(w, w_, steps=t).tolist()]\n    out = torch.zeros((3, t, target_height, target_width))\n    for ind in range(t):\n        out[:, ind : ind + 1, :, :] = torch.nn.functional.interpolate(\n            images[\n                :,\n                ind : ind + 1,\n                i_s[ind] : i_s[ind] + h_s[ind],\n                j_s[ind] : j_s[ind] + w_s[ind],\n            ],\n            size=(target_height, target_width),\n            mode=\"bilinear\",\n            align_corners=False,\n        )\n    return out\n\n\ndef create_random_augment(\n    input_size,\n    auto_augment=None,\n    interpolation=\"bilinear\",\n):\n    \"\"\"\n    Get video randaug transform.\n\n    Args:\n        input_size: The size of the input video in tuple.\n        auto_augment: Parameters for randaug. 
An example:\n            \"rand-m7-n4-mstd0.5-inc1\" (m is the magnitude and n is the number\n            of operations to apply).\n        interpolation: Interpolation method.\n    \"\"\"\n    if isinstance(input_size, tuple):\n        img_size = input_size[-2:]\n    else:\n        img_size = input_size\n\n    if auto_augment:\n        assert isinstance(auto_augment, str)\n        if isinstance(img_size, tuple):\n            img_size_min = min(img_size)\n        else:\n            img_size_min = img_size\n        aa_params = {\"translate_const\": int(img_size_min * 0.45)}\n        if interpolation and interpolation != \"random\":\n            aa_params[\"interpolation\"] = _pil_interp(interpolation)\n        if auto_augment.startswith(\"rand\"):\n            return transforms.Compose([rand_augment_transform(auto_augment, aa_params)])\n    raise NotImplementedError\n\n\ndef random_sized_crop_img(\n    im,\n    size,\n    jitter_scale=(0.08, 1.0),\n    jitter_aspect=(3.0 / 4.0, 4.0 / 3.0),\n    max_iter=10,\n):\n    \"\"\"\n    Performs Inception-style cropping (used for training).\n    \"\"\"\n    assert len(im.shape) == 3, \"Currently only support image for random_sized_crop\"\n    h, w = im.shape[1:3]\n    i, j, h, w = _get_param_spatial_crop(\n        scale=jitter_scale,\n        ratio=jitter_aspect,\n        height=h,\n        width=w,\n        num_repeat=max_iter,\n        log_scale=False,\n        switch_hw=True,\n    )\n    cropped = im[:, i : i + h, j : j + w]\n    return torch.nn.functional.interpolate(\n        cropped.unsqueeze(0),\n        size=(size, size),\n        mode=\"bilinear\",\n        align_corners=False,\n    ).squeeze(0)\n\n\n# The following code is adapted from the timm library; we will replace it with a\n# dependency on PyTorchVideo.\n# https://github.com/facebookresearch/pytorchvideo\nclass RandomResizedCropAndInterpolation:\n    \"\"\"Crop the given PIL Image to random size and aspect ratio with random interpolation.\n  
  A crop of random size (default: of 0.08 to 1.0) of the original size and a random\n    aspect ratio (default: of 3/4 to 4/3) of the original aspect ratio is made. This crop\n    is finally resized to given size.\n    This is popularly used to train the Inception networks.\n    Args:\n        size: expected output size of each edge\n        scale: range of size of the origin size cropped\n        ratio: range of aspect ratio of the origin aspect ratio cropped\n        interpolation: Default: PIL.Image.BILINEAR\n    \"\"\"\n\n    def __init__(\n        self,\n        size,\n        scale=(0.08, 1.0),\n        ratio=(3.0 / 4.0, 4.0 / 3.0),\n        interpolation=\"bilinear\",\n    ):\n        if isinstance(size, tuple):\n            self.size = size\n        else:\n            self.size = (size, size)\n        if (scale[0] > scale[1]) or (ratio[0] > ratio[1]):\n            print(\"range should be of kind (min, max)\")\n\n        if interpolation == \"random\":\n            self.interpolation = _RANDOM_INTERPOLATION\n        else:\n            self.interpolation = _pil_interp(interpolation)\n        self.scale = scale\n        self.ratio = ratio\n\n    @staticmethod\n    def get_params(img, scale, ratio):\n        \"\"\"Get parameters for ``crop`` for a random sized crop.\n        Args:\n            img (PIL Image): Image to be cropped.\n            scale (tuple): range of size of the origin size cropped\n            ratio (tuple): range of aspect ratio of the origin aspect ratio cropped\n        Returns:\n            tuple: params (i, j, h, w) to be passed to ``crop`` for a random\n                sized crop.\n        \"\"\"\n        area = img.size[0] * img.size[1]\n\n        for _ in range(10):\n            target_area = random.uniform(*scale) * area\n            log_ratio = (math.log(ratio[0]), math.log(ratio[1]))\n            aspect_ratio = math.exp(random.uniform(*log_ratio))\n\n            w = int(round(math.sqrt(target_area * aspect_ratio)))\n            h = 
int(round(math.sqrt(target_area / aspect_ratio)))\n\n            if w <= img.size[0] and h <= img.size[1]:\n                i = random.randint(0, img.size[1] - h)\n                j = random.randint(0, img.size[0] - w)\n                return i, j, h, w\n\n        # Fallback to central crop\n        in_ratio = img.size[0] / img.size[1]\n        if in_ratio < min(ratio):\n            w = img.size[0]\n            h = int(round(w / min(ratio)))\n        elif in_ratio > max(ratio):\n            h = img.size[1]\n            w = int(round(h * max(ratio)))\n        else:  # whole image\n            w = img.size[0]\n            h = img.size[1]\n        i = (img.size[1] - h) // 2\n        j = (img.size[0] - w) // 2\n        return i, j, h, w\n\n    def __call__(self, img):\n        \"\"\"\n        Args:\n            img (PIL Image): Image to be cropped and resized.\n        Returns:\n            PIL Image: Randomly cropped and resized image.\n        \"\"\"\n        i, j, h, w = self.get_params(img, self.scale, self.ratio)\n        if isinstance(self.interpolation, (tuple, list)):\n            interpolation = random.choice(self.interpolation)\n        else:\n            interpolation = self.interpolation\n        return F.resized_crop(img, i, j, h, w, self.size, interpolation)\n\n    def __repr__(self):\n        if isinstance(self.interpolation, (tuple, list)):\n            interpolate_str = \" \".join(\n                [_pil_interpolation_to_str[x] for x in self.interpolation]\n            )\n        else:\n            interpolate_str = _pil_interpolation_to_str[self.interpolation]\n        format_string = self.__class__.__name__ + \"(size={0}\".format(self.size)\n        format_string += \", scale={0}\".format(tuple(round(s, 4) for s in self.scale))\n        format_string += \", ratio={0}\".format(tuple(round(r, 4) for r in self.ratio))\n        format_string += \", interpolation={0})\".format(interpolate_str)\n        return format_string\n\n\n\"\"\"\nThis implementation 
is based on\nhttps://github.com/microsoft/unilm/blob/master/beit/masking_generator.py\nLicensed under The MIT License\n\"\"\"\n\n\nclass MaskingGenerator:\n    def __init__(\n        self,\n        mask_window_size,\n        num_masking_patches,\n        min_num_patches=16,\n        max_num_patches=None,\n        min_aspect=0.3,\n        max_aspect=None,\n    ):\n        if not isinstance(\n            mask_window_size,\n            (\n                list,\n                tuple,\n            ),\n        ):\n            mask_window_size = (mask_window_size,) * 2\n        self.height, self.width = mask_window_size\n\n        self.num_patches = self.height * self.width\n        self.num_masking_patches = num_masking_patches\n\n        self.min_num_patches = min_num_patches\n        self.max_num_patches = (\n            num_masking_patches if max_num_patches is None else max_num_patches\n        )\n\n        max_aspect = max_aspect or 1 / min_aspect\n        self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect))\n\n    def __repr__(self):\n        repr_str = \"Generator(%d, %d -> [%d ~ %d], max = %d, %.3f ~ %.3f)\" % (\n            self.height,\n            self.width,\n            self.min_num_patches,\n            self.max_num_patches,\n            self.num_masking_patches,\n            self.log_aspect_ratio[0],\n            self.log_aspect_ratio[1],\n        )\n        return repr_str\n\n    def get_shape(self):\n        return self.height, self.width\n\n    def _mask(self, mask, max_mask_patches):\n        delta = 0\n        for _ in range(10):\n            target_area = random.uniform(self.min_num_patches, max_mask_patches)\n            aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio))\n            h = int(round(math.sqrt(target_area * aspect_ratio)))\n            w = int(round(math.sqrt(target_area / aspect_ratio)))\n            if w < self.width and h < self.height:\n                top = random.randint(0, self.height - h)\n         
       left = random.randint(0, self.width - w)\n\n                num_masked = mask[top : top + h, left : left + w].sum()\n                # Overlap\n                if 0 < h * w - num_masked <= max_mask_patches:\n                    for i in range(top, top + h):\n                        for j in range(left, left + w):\n                            if mask[i, j] == 0:\n                                mask[i, j] = 1\n                                delta += 1\n\n                if delta > 0:\n                    break\n        return delta\n\n    def __call__(self):\n        mask = np.zeros(shape=self.get_shape(), dtype=int)\n        mask_count = 0\n        while mask_count < self.num_masking_patches:\n            max_mask_patches = self.num_masking_patches - mask_count\n            max_mask_patches = min(max_mask_patches, self.max_num_patches)\n\n            delta = self._mask(mask, max_mask_patches)\n            if delta == 0:\n                break\n            else:\n                mask_count += delta\n\n        return mask\n\n\n\"\"\"\nThis implementation is based on\nhttps://github.com/microsoft/unilm/blob/master/beit/masking_generator.py\nLicensed under The MIT License\n\"\"\"\n\n\nclass MaskingGenerator3D:\n    def __init__(\n        self,\n        mask_window_size,\n        num_masking_patches,\n        min_num_patches=16,\n        max_num_patches=None,\n        min_aspect=0.3,\n        max_aspect=None,\n    ):\n        self.temporal, self.height, self.width = mask_window_size\n        self.num_masking_patches = num_masking_patches\n        self.min_num_patches = min_num_patches\n        self.max_num_patches = (\n            num_masking_patches if max_num_patches is None else max_num_patches\n        )\n        max_aspect = max_aspect or 1 / min_aspect\n        self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect))\n\n    def __repr__(self):\n        repr_str = \"Generator(%d, %d, %d -> [%d ~ %d], max = %d, %.3f ~ %.3f)\" % (\n            
self.temporal,\n            self.height,\n            self.width,\n            self.min_num_patches,\n            self.max_num_patches,\n            self.num_masking_patches,\n            self.log_aspect_ratio[0],\n            self.log_aspect_ratio[1],\n        )\n        return repr_str\n\n    def get_shape(self):\n        return self.temporal, self.height, self.width\n\n    def _mask(self, mask, max_mask_patches):\n        delta = 0\n        for _ in range(100):\n            target_area = random.uniform(self.min_num_patches, self.max_num_patches)\n            aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio))\n            h = int(round(math.sqrt(target_area * aspect_ratio)))\n            w = int(round(math.sqrt(target_area / aspect_ratio)))\n            t = random.randint(1, self.temporal)  # !\n            if w < self.width and h < self.height:\n                top = random.randint(0, self.height - h)\n                left = random.randint(0, self.width - w)\n                front = random.randint(0, self.temporal - t)\n\n                num_masked = mask[\n                    front : front + t, top : top + h, left : left + w\n                ].sum()\n                # Overlap\n                if 0 < h * w * t - num_masked <= max_mask_patches:\n                    for i in range(front, front + t):\n                        for j in range(top, top + h):\n                            for k in range(left, left + w):\n                                if mask[i, j, k] == 0:\n                                    mask[i, j, k] = 1\n                                    delta += 1\n\n                if delta > 0:\n                    break\n        return delta\n\n    def __call__(self):\n        mask = np.zeros(shape=self.get_shape(), dtype=int)\n        mask_count = 0\n        while mask_count < self.num_masking_patches:\n            max_mask_patches = self.num_masking_patches - mask_count\n\n            delta = self._mask(mask, max_mask_patches)\n            
if delta == 0:\n                break\n            else:\n                mask_count += delta\n\n        return mask\n\n\ndef transforms_imagenet_train(\n    img_size=224,\n    scale=None,\n    ratio=None,\n    hflip=0.5,\n    vflip=0.0,\n    color_jitter=0.4,\n    auto_augment=None,\n    interpolation=\"random\",\n    use_prefetcher=False,\n    mean=(0.485, 0.456, 0.406),\n    std=(0.229, 0.224, 0.225),\n    re_prob=0.0,\n    re_mode=\"const\",\n    re_count=1,\n    re_num_splits=0,\n    separate=False,\n):\n    \"\"\"\n    If separate==True, the transforms are returned as a tuple of 3 separate transforms\n    for use in a mixing dataset that passes\n     * all data through the first (primary) transform, called the 'clean' data\n     * a portion of the data through the secondary transform\n     * normalizes and converts the branches above with the third, final transform\n    \"\"\"\n    if isinstance(img_size, tuple):\n        img_size = img_size[-2:]\n    else:\n        img_size = img_size\n\n    scale = tuple(scale or (0.08, 1.0))  # default imagenet scale range\n    ratio = tuple(ratio or (3.0 / 4.0, 4.0 / 3.0))  # default imagenet ratio range\n    primary_tfl = [\n        RandomResizedCropAndInterpolation(\n            img_size, scale=scale, ratio=ratio, interpolation=interpolation\n        )\n    ]\n    if hflip > 0.0:\n        primary_tfl += [transforms.RandomHorizontalFlip(p=hflip)]\n    if vflip > 0.0:\n        primary_tfl += [transforms.RandomVerticalFlip(p=vflip)]\n\n    secondary_tfl = []\n    if auto_augment:\n        assert isinstance(auto_augment, str)\n        if isinstance(img_size, tuple):\n            img_size_min = min(img_size)\n        else:\n            img_size_min = img_size\n        aa_params = dict(\n            translate_const=int(img_size_min * 0.45),\n            img_mean=tuple([min(255, round(255 * x)) for x in mean]),\n        )\n        if interpolation and interpolation != \"random\":\n            aa_params[\"interpolation\"] = 
_pil_interp(interpolation)\n        if auto_augment.startswith(\"rand\"):\n            secondary_tfl += [rand_augment_transform(auto_augment, aa_params)]\n        elif auto_augment.startswith(\"augmix\"):\n            raise NotImplementedError(\"Augmix not implemented\")\n        else:\n            raise NotImplementedError(\"Auto aug not implemented\")\n    elif color_jitter is not None:\n        # color jitter is enabled when not using AA\n        if isinstance(color_jitter, (list, tuple)):\n            # color jitter should be a 3-tuple/list if spec brightness/contrast/saturation\n            # or 4 if also augmenting hue\n            assert len(color_jitter) in (3, 4)\n        else:\n            # if it's a scalar, duplicate for brightness, contrast, and saturation, no hue\n            color_jitter = (float(color_jitter),) * 3\n        secondary_tfl += [transforms.ColorJitter(*color_jitter)]\n\n    final_tfl = []\n    final_tfl += [\n        transforms.ToTensor(),\n        transforms.Normalize(mean=torch.tensor(mean), std=torch.tensor(std)),\n    ]\n    if re_prob > 0.0:\n        final_tfl.append(\n            RandomErasing(\n                re_prob,\n                mode=re_mode,\n                max_count=re_count,\n                num_splits=re_num_splits,\n                device=\"cpu\",\n                cube=False,\n            )\n        )\n\n    if separate:\n        return (\n            transforms.Compose(primary_tfl),\n            transforms.Compose(secondary_tfl),\n            transforms.Compose(final_tfl),\n        )\n    else:\n        return transforms.Compose(primary_tfl + secondary_tfl + final_tfl)\n\n\ndef temporal_difference(\n    frames,\n    use_grayscale=False,\n    absolute=False,\n):\n    if use_grayscale:\n        gray_channel = (\n            0.299 * frames[2, :] + 0.587 * frames[1, :] + 0.114 * frames[0, :]\n        )\n        frames[0, :] = gray_channel\n        frames[1, :] = gray_channel\n        frames[2, :] = gray_channel\n\n    
out_images = torch.zeros_like(frames)\n    t = frames.shape[1]\n\n    dt = frames[:, 0 : t - 1, :, :] - frames[:, 1:t, :, :]\n    if absolute:\n        dt = dt.abs()\n    out_images[:, 0 : t - 1, :, :] = dt\n    if t <= 1:\n        return out_images\n    out_images[:, -1, :, :] = dt[:, -1, :, :]\n    return out_images\n\n\ndef color_jitter_video_ssl(\n    frames,\n    bri_con_sat=[0.4] * 3,\n    hue=0.1,\n    p_convert_gray=0.0,\n    moco_v2_aug=False,\n    gaussan_sigma_min=[0.0, 0.1],\n    gaussan_sigma_max=[0.0, 2.0],\n):\n    # T H W C -> C T H W.\n    frames = frames.permute(3, 0, 1, 2)\n\n    if moco_v2_aug:\n        color_jitter = tv.transforms.Compose(\n            [\n                tv.transforms.ToPILImage(),\n                tv.transforms.RandomApply(\n                    [\n                        tv.transforms.ColorJitter(\n                            bri_con_sat[0], bri_con_sat[1], bri_con_sat[2], hue\n                        )\n                    ],\n                    p=0.8,\n                ),\n                tv.transforms.RandomGrayscale(p=p_convert_gray),\n                tv.transforms.RandomApply([GaussianBlur([0.1, 2.0])], p=0.5),\n                tv.transforms.ToTensor(),\n            ]\n        )\n    else:\n        color_jitter = tv.transforms.Compose(\n            [\n                tv.transforms.ToPILImage(),\n                tv.transforms.RandomGrayscale(p=p_convert_gray),\n                tv.transforms.ColorJitter(\n                    bri_con_sat[0], bri_con_sat[1], bri_con_sat[2], hue\n                ),\n                tv.transforms.ToTensor(),\n            ]\n        )\n\n    c, t, h, w = frames.shape\n    frames = frames.view(c, t * h, w)\n    frames = color_jitter(frames)\n    frames = frames.view(c, t, h, w)\n    # C T H W ->  T H W C.\n    frames = frames.permute(1, 2, 3, 0)\n\n    return frames\n\n\ndef augment_raw_frames(frames, time_diff_prob=0.0, gaussian_prob=0.0):\n    frames = frames.float()\n    if gaussian_prob > 
0.0:\n        blur_trans = tv.transforms.RandomApply([GaussianBlurVideo()], p=gaussian_prob)\n        frames = blur_trans(frames)\n\n    time_diff_out = False\n    if time_diff_prob > 0.0 and random.random() < time_diff_prob:\n        # T H W C -> C T H W.\n        frames = frames.permute(3, 0, 1, 2)\n        frames = temporal_difference(frames, use_grayscale=True, absolute=False)\n        # end_idx -= 1\n        frames += 255.0\n        frames /= 2.0\n        # C T H W -> T H W C\n        frames = frames.permute(1, 2, 3, 0)\n        time_diff_out = True\n\n    return frames, time_diff_out\n\n\nclass GaussianBlur:\n    \"\"\"Gaussian blur augmentation in SimCLR https://arxiv.org/abs/2002.05709\"\"\"\n\n    def __init__(self, sigma=[0.1, 2.0]):\n        self.sigma = sigma\n\n    def __call__(self, x):\n        if len(self.sigma) == 2:\n            sigma = random.uniform(self.sigma[0], self.sigma[1])\n        elif len(self.sigma) == 1:\n            sigma = self.sigma[0]\n        x = x.filter(ImageFilter.GaussianBlur(radius=sigma))\n        return x\n\n\nclass GaussianBlurVideo:\n    def __init__(self, sigma_min=[0.0, 0.1], sigma_max=[0.0, 2.0], use_PIL=False):\n        self.sigma_min = sigma_min\n        self.sigma_max = sigma_max\n\n    def __call__(self, frames):\n        sigma_y = sigma_x = random.uniform(self.sigma_min[1], self.sigma_max[1])\n        sigma_t = random.uniform(self.sigma_min[0], self.sigma_max[0])\n        frames = gaussian_filter(frames, sigma=(0.0, sigma_t, sigma_y, sigma_x))\n        frames = torch.from_numpy(frames)\n        return frames\n"
  },
  {
    "path": "slowfast/datasets/utils.py",
    "content": "#!/usr/bin/env python3\n\nimport logging\nimport os\nimport random\nimport time\nfrom collections import defaultdict\n\nimport cv2\nimport numpy as np\nimport torch\nfrom slowfast.utils.env import pathmgr\nfrom torch.utils.data.distributed import DistributedSampler\nfrom torchvision import transforms\n\nfrom . import transform as transform\nfrom .random_erasing import RandomErasing\nfrom .transform import create_random_augment\n\n\nlogger = logging.getLogger(__name__)\n\n\ndef retry_load_images(image_paths, retry=10, backend=\"pytorch\"):\n    \"\"\"\n    This function is to load images with support of retrying for failed load.\n\n    Args:\n        image_paths (list): paths of images needed to be loaded.\n        retry (int, optional): maximum time of loading retrying. Defaults to 10.\n        backend (str): `pytorch` or `cv2`.\n\n    Returns:\n        imgs (list): list of loaded images.\n    \"\"\"\n    for i in range(retry):\n        imgs = []\n        for image_path in image_paths:\n            with pathmgr.open(image_path, \"rb\") as f:\n                img_str = np.frombuffer(f.read(), np.uint8)\n                img = cv2.imdecode(img_str, flags=cv2.IMREAD_COLOR)\n            imgs.append(img)\n\n        if all(img is not None for img in imgs):\n            if backend == \"pytorch\":\n                imgs = torch.as_tensor(np.stack(imgs))\n            return imgs\n        else:\n            logger.warn(\"Reading failed. 
Will retry.\")\n            time.sleep(1.0)\n        if i == retry - 1:\n            raise Exception(\"Failed to load images {}\".format(image_paths))\n\n\ndef get_sequence(center_idx, half_len, sample_rate, num_frames):\n    \"\"\"\n    Sample frames among the corresponding clip.\n\n    Args:\n        center_idx (int): center frame idx for current clip\n        half_len (int): half of the clip length\n        sample_rate (int): sampling rate for sampling frames inside of the clip\n        num_frames (int): number of expected sampled frames\n\n    Returns:\n        seq (list): list of indexes of sampled frames in this clip.\n    \"\"\"\n    seq = list(range(center_idx - half_len, center_idx + half_len, sample_rate))\n\n    for seq_idx in range(len(seq)):\n        if seq[seq_idx] < 0:\n            seq[seq_idx] = 0\n        elif seq[seq_idx] >= num_frames:\n            seq[seq_idx] = num_frames - 1\n    return seq\n\n\ndef pack_pathway_output(cfg, frames):\n    \"\"\"\n    Prepare output as a list of tensors. Each tensor corresponding to a\n    unique pathway.\n    Args:\n        frames (tensor): frames of images sampled from the video. 
The\n            dimension is `channel` x `num frames` x `height` x `width`.\n    Returns:\n        frame_list (list): list of tensors with the dimension of\n            `channel` x `num frames` x `height` x `width`.\n    \"\"\"\n    if cfg.DATA.REVERSE_INPUT_CHANNEL:\n        frames = frames[[2, 1, 0], :, :, :]\n    if cfg.MODEL.ARCH in cfg.MODEL.SINGLE_PATHWAY_ARCH:\n        frame_list = [frames]\n    elif cfg.MODEL.ARCH in cfg.MODEL.MULTI_PATHWAY_ARCH:\n        fast_pathway = frames\n        # Perform temporal sampling from the fast pathway.\n        slow_pathway = torch.index_select(\n            frames,\n            1,\n            torch.linspace(\n                0, frames.shape[1] - 1, frames.shape[1] // cfg.SLOWFAST.ALPHA\n            ).long(),\n        )\n        frame_list = [slow_pathway, fast_pathway]\n    else:\n        raise NotImplementedError(\n            \"Model arch {} is not in {}\".format(\n                cfg.MODEL.ARCH,\n                cfg.MODEL.SINGLE_PATHWAY_ARCH + cfg.MODEL.MULTI_PATHWAY_ARCH,\n            )\n        )\n    return frame_list\n\n\ndef spatial_sampling(\n    frames,\n    spatial_idx=-1,\n    min_scale=256,\n    max_scale=320,\n    crop_size=224,\n    random_horizontal_flip=True,\n    inverse_uniform_sampling=False,\n    aspect_ratio=None,\n    scale=None,\n    motion_shift=False,\n):\n    \"\"\"\n    Perform spatial sampling on the given video frames. If spatial_idx is\n    -1, perform random scale, random crop, and random flip on the given\n    frames. If spatial_idx is 0, 1, or 2, perform spatial uniform sampling\n    with the given spatial_idx.\n    Args:\n        frames (tensor): frames of images sampled from the video. The\n            dimension is `num frames` x `height` x `width` x `channel`.\n        spatial_idx (int): if -1, perform random spatial sampling. 
If 0, 1,\n            or 2, perform left, center, right crop if width is larger than\n            height, and perform top, center, bottom crop if height is larger\n            than width.\n        min_scale (int): the minimal size of scaling.\n        max_scale (int): the maximal size of scaling.\n        crop_size (int): the size of height and width used to crop the\n            frames.\n        random_horizontal_flip (bool): if True, randomly flip the frames\n            horizontally with probability 0.5 during random sampling.\n        inverse_uniform_sampling (bool): if True, sample uniformly in\n            [1 / max_scale, 1 / min_scale] and take a reciprocal to get the\n            scale. If False, take a uniform sample from [min_scale,\n            max_scale].\n        aspect_ratio (list): Aspect ratio range for resizing.\n        scale (list): Scale range for resizing.\n        motion_shift (bool): Whether to apply motion shift for resizing.\n    Returns:\n        frames (tensor): spatially sampled frames.\n    \"\"\"\n    assert spatial_idx in [-1, 0, 1, 2]\n    if spatial_idx == -1:\n        if aspect_ratio is None and scale is None:\n            frames, _ = transform.random_short_side_scale_jitter(\n                images=frames,\n                min_size=min_scale,\n                max_size=max_scale,\n                inverse_uniform_sampling=inverse_uniform_sampling,\n            )\n            frames, _ = transform.random_crop(frames, crop_size)\n        else:\n            transform_func = (\n                transform.random_resized_crop_with_shift\n                if motion_shift\n                else transform.random_resized_crop\n            )\n            frames = transform_func(\n                images=frames,\n                target_height=crop_size,\n                target_width=crop_size,\n                scale=scale,\n                ratio=aspect_ratio,\n            )\n        if random_horizontal_flip:\n            frames, _ = transform.horizontal_flip(0.5, frames)\n    else:\n        # The testing is deterministic and no jitter should be performed.\n        # min_scale, max_scale, 
and crop_size are expected to be the same.\n        assert len({min_scale, max_scale}) == 1\n        frames, _ = transform.random_short_side_scale_jitter(\n            frames, min_scale, max_scale\n        )\n        frames, _ = transform.uniform_crop(frames, crop_size, spatial_idx)\n    return frames\n\n\ndef as_binary_vector(labels, num_classes):\n    \"\"\"\n    Construct binary label vector given a list of label indices.\n    Args:\n        labels (list): The input label list.\n        num_classes (int): Number of classes of the label vector.\n    Returns:\n        labels (numpy array): the resulting binary vector.\n    \"\"\"\n    label_arr = np.zeros((num_classes,))\n\n    for lbl in set(labels):\n        label_arr[lbl] = 1.0\n    return label_arr\n\n\ndef aggregate_labels(label_list):\n    \"\"\"\n    Join a list of label lists into a single list without duplicates.\n    Args:\n        label_list (list): The input list of label lists.\n    Returns:\n        labels (list): The union of all labels in the input lists.\n    \"\"\"\n    all_labels = []\n    for labels in label_list:\n        for label in labels:\n            all_labels.append(label)\n    return list(set(all_labels))\n\n\ndef convert_to_video_level_labels(labels):\n    \"\"\"\n    Aggregate annotations from all frames of a video to form video-level labels.\n    Args:\n        labels (list): The input label list.\n    Returns:\n        labels (list): Same as input, but with each label replaced by\n        a video-level one.\n    \"\"\"\n    for video_id in range(len(labels)):\n        video_level_labels = aggregate_labels(labels[video_id])\n        for i in range(len(labels[video_id])):\n            labels[video_id][i] = video_level_labels\n    return labels\n\n\ndef load_image_lists(frame_list_file, prefix=\"\", return_list=False):\n    \"\"\"\n    Load image paths and labels from a \"frame list\".\n    Each line of the frame list contains:\n    `original_vido_id video_id frame_id path labels`\n    Args:\n        frame_list_file (string): path to the frame list.\n 
       prefix (str): the prefix for the path.\n        return_list (bool): if True, return a list. If False, return a dict.\n    Returns:\n        image_paths (list or dict): list of list containing path to each frame.\n            If return_list is False, then return in a dict form.\n        labels (list or dict): list of list containing label of each frame.\n            If return_list is False, then return in a dict form.\n    \"\"\"\n    image_paths = defaultdict(list)\n    labels = defaultdict(list)\n    with pathmgr.open(frame_list_file, \"r\") as f:\n        assert f.readline().startswith(\"original_vido_id\")\n        for line in f:\n            row = line.split()\n            # original_vido_id video_id frame_id path labels\n            assert len(row) == 5\n            video_name = row[0]\n            if prefix == \"\":\n                path = row[3]\n            else:\n                path = os.path.join(prefix, row[3])\n            image_paths[video_name].append(path)\n            frame_labels = row[-1].replace('\"', \"\")\n            if frame_labels != \"\":\n                labels[video_name].append([int(x) for x in frame_labels.split(\",\")])\n            else:\n                labels[video_name].append([])\n\n    if return_list:\n        keys = image_paths.keys()\n        image_paths = [image_paths[key] for key in keys]\n        labels = [labels[key] for key in keys]\n        return image_paths, labels\n    return dict(image_paths), dict(labels)\n\n\ndef tensor_normalize(tensor, mean, std, func=None):\n    \"\"\"\n    Normalize a given tensor by subtracting the mean and dividing the std.\n    Args:\n        tensor (tensor): tensor to normalize.\n        mean (tensor or list): mean value to subtract.\n        std (tensor or list): std to divide.\n    \"\"\"\n    if tensor.dtype == torch.uint8:\n        tensor = tensor.float()\n        tensor = tensor / 255.0\n    if type(mean) == list:\n        mean = torch.tensor(mean)\n    if type(std) == list:\n   
     std = torch.tensor(std)\n    if func is not None:\n        tensor = func(tensor)\n    tensor = tensor - mean\n    tensor = tensor / std\n    return tensor\n\n\ndef get_random_sampling_rate(long_cycle_sampling_rate, sampling_rate):\n    \"\"\"\n    When multigrid training uses a fewer number of frames, we randomly\n    increase the sampling rate so that some clips cover the original span.\n    \"\"\"\n    if long_cycle_sampling_rate > 0:\n        assert long_cycle_sampling_rate >= sampling_rate\n        return random.randint(sampling_rate, long_cycle_sampling_rate)\n    else:\n        return sampling_rate\n\n\ndef revert_tensor_normalize(tensor, mean, std):\n    \"\"\"\n    Revert normalization for a given tensor by multiplying by the std and adding the mean.\n    Args:\n        tensor (tensor): tensor to revert normalization.\n        mean (tensor or list): mean value to add.\n        std (tensor or list): std to multiply.\n    \"\"\"\n    if type(mean) == list:\n        mean = torch.tensor(mean)\n    if type(std) == list:\n        std = torch.tensor(std)\n    tensor = tensor * std\n    tensor = tensor + mean\n    return tensor\n\n\ndef create_sampler(dataset, shuffle, cfg):\n    \"\"\"\n    Create sampler for the given dataset.\n    Args:\n        dataset (torch.utils.data.Dataset): the given dataset.\n        shuffle (bool): set to ``True`` to have the data reshuffled\n            at every epoch.\n        cfg (CfgNode): configs. 
Details can be found in\n            slowfast/config/defaults.py\n    Returns:\n        sampler (Sampler): the created sampler.\n    \"\"\"\n    sampler = DistributedSampler(dataset) if cfg.NUM_GPUS > 1 else None\n\n    return sampler\n\n\ndef loader_worker_init_fn(dataset):\n    \"\"\"\n    Create init function passed to pytorch data loader.\n    Args:\n        dataset (torch.utils.data.Dataset): the given dataset.\n    \"\"\"\n    return None\n\n\ndef aug_frame(\n    cfg,\n    mode,\n    rand_erase,\n    frames,\n    spatial_sample_index,\n    min_scale,\n    max_scale,\n    crop_size,\n):\n    \"\"\"\n    Perform augmentations on the given video frames, including\n    random augmentation, normalization, spatial sampling and optional random\n    erasing.\n    Args:\n        cfg (CfgNode): configs.\n        mode (string): Options includes `train`, `val`, or `test` mode.\n        rand_erase (bool): if performing random erasing.\n        frames (tensor): frames of images sampled from the video. 
The\n            dimension is `num frames` x `height` x `width` x `channel`.\n        spatial_sample_index (int): if -1, perform random spatial sampling.\n            If 0, 1, or 2, perform left, center, right crop if width is larger\n            than height, and perform top, center, bottom crop if height is\n            larger than width.\n        min_scale (int): the minimal size of scaling.\n        max_scale (int): the maximal size of scaling.\n        crop_size (int): the size of height and width used to crop the\n            frames.\n    Returns:\n        frames (tensor): augmented frames. The dimension is\n            `channel` x `num frames` x `height` x `width`.\n    \"\"\"\n    if cfg.AUG.AA_TYPE:\n        aug_transform = create_random_augment(\n            input_size=(frames.size(1), frames.size(2)),\n            auto_augment=cfg.AUG.AA_TYPE,\n            interpolation=cfg.AUG.INTERPOLATION,\n        )\n        # T H W C -> T C H W.\n        frames = frames.permute(0, 3, 1, 2)\n        list_img = _frame_to_list_img(frames)\n        list_img = aug_transform(list_img)\n        frames = _list_img_to_frames(list_img)\n        # T C H W -> T H W C.\n        frames = frames.permute(0, 2, 3, 1)\n\n    frames = tensor_normalize(frames, cfg.DATA.MEAN, cfg.DATA.STD)\n    # T H W C -> C T H W.\n    frames = frames.permute(3, 0, 1, 2)\n    # Perform data augmentation.\n    scl, asp = (\n        cfg.DATA.TRAIN_JITTER_SCALES_RELATIVE,\n        cfg.DATA.TRAIN_JITTER_ASPECT_RELATIVE,\n    )\n    relative_scales = None if (mode not in [\"train\"] or len(scl) == 0) else scl\n    relative_aspect = None if (mode not in [\"train\"] or len(asp) == 0) else asp\n    frames = spatial_sampling(\n        frames,\n        spatial_idx=spatial_sample_index,\n        min_scale=min_scale,\n        max_scale=max_scale,\n        crop_size=crop_size,\n        random_horizontal_flip=cfg.DATA.RANDOM_FLIP,\n        inverse_uniform_sampling=cfg.DATA.INV_UNIFORM_SAMPLE,\n        aspect_ratio=relative_aspect,\n        scale=relative_scales,\n        
motion_shift=cfg.DATA.TRAIN_JITTER_MOTION_SHIFT if mode in [\"train\"] else False,\n    )\n\n    if rand_erase:\n        erase_transform = RandomErasing(\n            cfg.AUG.RE_PROB,\n            mode=cfg.AUG.RE_MODE,\n            max_count=cfg.AUG.RE_COUNT,\n            num_splits=cfg.AUG.RE_COUNT,\n            device=\"cpu\",\n        )\n        frames = frames.permute(1, 0, 2, 3)\n        frames = erase_transform(frames)\n        frames = frames.permute(1, 0, 2, 3)\n\n    return frames\n\n\ndef _frame_to_list_img(frames):\n    img_list = [transforms.ToPILImage()(frames[i]) for i in range(frames.size(0))]\n    return img_list\n\n\ndef _list_img_to_frames(img_list):\n    img_list = [transforms.ToTensor()(img) for img in img_list]\n    return torch.stack(img_list)\n"
  },
  {
    "path": "slowfast/datasets/video_container.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport av\n\n\ndef get_video_container(path_to_vid, multi_thread_decode=False, backend=\"pyav\"):\n    \"\"\"\n    Given the path to the video, return the pyav video container.\n    Args:\n        path_to_vid (str): path to the video.\n        multi_thread_decode (bool): if True, perform multi-thread decoding.\n        backend (str): decoder backend, options include `pyav` and\n            `torchvision`, default is `pyav`.\n    Returns:\n        container (container): video container.\n    \"\"\"\n    if backend == \"torchvision\":\n        with open(path_to_vid, \"rb\") as fp:\n            container = fp.read()\n        return container\n    elif backend == \"pyav\":\n        container = av.open(path_to_vid)\n        if multi_thread_decode:\n            # Enable multiple threads for decoding.\n            container.streams.video[0].thread_type = \"AUTO\"\n        return container\n    else:\n        raise NotImplementedError(\"Unknown backend {}\".format(backend))\n"
  },
  {
    "path": "slowfast/models/__init__.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nfrom .build import build_model, MODEL_REGISTRY  # noqa\nfrom .contrastive import ContrastiveModel  # noqa\nfrom .masked import MaskMViT  # noqa\nfrom .video_model_builder import MViT, ResNet, SlowFast  # noqa\n\ntry:\n    from .ptv_model_builder import (\n        PTVCSN,\n        PTVR2plus1D,\n        PTVResNet,\n        PTVSlowFast,\n        PTVX3D,\n    )  # noqa\nexcept Exception:\n    print(\"Please update your PyTorchVideo to latest master\")\n"
  },
  {
    "path": "slowfast/models/attention.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\nimport numpy\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom slowfast.models.common import DropPath, Mlp\nfrom torch.nn.init import trunc_normal_\n\n\ndef attention_pool(tensor, pool, thw_shape, has_cls_embed=True, norm=None):\n    if pool is None:\n        return tensor, thw_shape\n    tensor_dim = tensor.ndim\n    if tensor_dim == 4:\n        pass\n    elif tensor_dim == 3:\n        tensor = tensor.unsqueeze(1)\n    else:\n        raise NotImplementedError(f\"Unsupported input dimension {tensor.shape}\")\n\n    if has_cls_embed:\n        cls_tok, tensor = tensor[:, :, :1, :], tensor[:, :, 1:, :]\n\n    B, N, L, C = tensor.shape\n    T, H, W = thw_shape\n    tensor = tensor.reshape(B * N, T, H, W, C).permute(0, 4, 1, 2, 3).contiguous()\n\n    tensor = pool(tensor)\n\n    thw_shape = [tensor.shape[2], tensor.shape[3], tensor.shape[4]]\n    L_pooled = tensor.shape[2] * tensor.shape[3] * tensor.shape[4]\n    tensor = tensor.reshape(B, N, C, L_pooled).transpose(2, 3)\n    if has_cls_embed:\n        tensor = torch.cat((cls_tok, tensor), dim=2)\n    if norm is not None:\n        tensor = norm(tensor)\n    # Assert tensor_dim in [3, 4]\n    if tensor_dim == 4:\n        pass\n    else:  #  tensor_dim == 3:\n        tensor = tensor.squeeze(1)\n    return tensor, thw_shape\n\n\ndef get_rel_pos(rel_pos, d):\n    if isinstance(d, int):\n        ori_d = rel_pos.shape[0]\n        if ori_d == d:\n            return rel_pos\n        else:\n            # Interpolate rel pos.\n            new_pos_embed = F.interpolate(\n                rel_pos.reshape(1, ori_d, -1).permute(0, 2, 1),\n                size=d,\n                mode=\"linear\",\n            )\n\n            return new_pos_embed.reshape(-1, d).permute(1, 0)\n\n\ndef cal_rel_pos_spatial(\n    attn, q, k, has_cls_embed, q_shape, k_shape, rel_pos_h, rel_pos_w\n):\n    \"\"\"\n    
Decomposed Spatial Relative Positional Embeddings.\n    \"\"\"\n    sp_idx = 1 if has_cls_embed else 0\n    q_t, q_h, q_w = q_shape\n    k_t, k_h, k_w = k_shape\n    dh = int(2 * max(q_h, k_h) - 1)\n    dw = int(2 * max(q_w, k_w) - 1)\n\n    # Scale up rel pos if shapes for q and k are different.\n    q_h_ratio = max(k_h / q_h, 1.0)\n    k_h_ratio = max(q_h / k_h, 1.0)\n    dist_h = (\n        torch.arange(q_h)[:, None] * q_h_ratio - torch.arange(k_h)[None, :] * k_h_ratio\n    )\n    dist_h += (k_h - 1) * k_h_ratio\n    q_w_ratio = max(k_w / q_w, 1.0)\n    k_w_ratio = max(q_w / k_w, 1.0)\n    dist_w = (\n        torch.arange(q_w)[:, None] * q_w_ratio - torch.arange(k_w)[None, :] * k_w_ratio\n    )\n    dist_w += (k_w - 1) * k_w_ratio\n\n    # Interpolate rel pos if needed.\n    rel_pos_h = get_rel_pos(rel_pos_h, dh)\n    rel_pos_w = get_rel_pos(rel_pos_w, dw)\n    Rh = rel_pos_h[dist_h.long()]\n    Rw = rel_pos_w[dist_w.long()]\n\n    B, n_head, q_N, dim = q.shape\n\n    r_q = q[:, :, sp_idx:].reshape(B, n_head, q_t, q_h, q_w, dim)\n    rel_h_q = torch.einsum(\"bythwc,hkc->bythwk\", r_q, Rh)  # [B, H, q_t, q_h, q_w, k_h]\n    rel_w_q = torch.einsum(\"bythwc,wkc->bythwk\", r_q, Rw)  # [B, H, q_t, q_h, q_w, k_w]\n\n    attn[:, :, sp_idx:, sp_idx:] = (\n        attn[:, :, sp_idx:, sp_idx:].view(B, -1, q_t, q_h, q_w, k_t, k_h, k_w)\n        + rel_h_q[:, :, :, :, :, None, :, None]\n        + rel_w_q[:, :, :, :, :, None, None, :]\n    ).view(B, -1, q_t * q_h * q_w, k_t * k_h * k_w)\n\n    return attn\n\n\ndef cal_rel_pos_temporal(attn, q, has_cls_embed, q_shape, k_shape, rel_pos_t):\n    \"\"\"\n    Temporal Relative Positional Embeddings.\n    \"\"\"\n    sp_idx = 1 if has_cls_embed else 0\n    q_t, q_h, q_w = q_shape\n    k_t, k_h, k_w = k_shape\n    dt = int(2 * max(q_t, k_t) - 1)\n    # Interpolate rel pos if needed.\n    rel_pos_t = get_rel_pos(rel_pos_t, dt)\n\n    # Scale up rel pos if shapes for q and k are different.\n    q_t_ratio = max(k_t / q_t, 1.0)\n    k_t_ratio 
= max(q_t / k_t, 1.0)\n    dist_t = (\n        torch.arange(q_t)[:, None] * q_t_ratio - torch.arange(k_t)[None, :] * k_t_ratio\n    )\n    dist_t += (k_t - 1) * k_t_ratio\n    Rt = rel_pos_t[dist_t.long()]\n\n    B, n_head, q_N, dim = q.shape\n\n    r_q = q[:, :, sp_idx:].reshape(B, n_head, q_t, q_h, q_w, dim)\n    # [B, H, q_t, q_h, q_w, dim] -> [q_t, B, H, q_h, q_w, dim] -> [q_t, B*H*q_h*q_w, dim]\n    r_q = r_q.permute(2, 0, 1, 3, 4, 5).reshape(q_t, B * n_head * q_h * q_w, dim)\n\n    # [q_t, B*H*q_h*q_w, dim] * [q_t, dim, k_t] = [q_t, B*H*q_h*q_w, k_t] -> [B*H*q_h*q_w, q_t, k_t]\n    rel = torch.matmul(r_q, Rt.transpose(1, 2)).transpose(0, 1)\n    # [B*H*q_h*q_w, q_t, k_t] -> [B, H, q_t, q_h, q_w, k_t]\n    rel = rel.view(B, n_head, q_h, q_w, q_t, k_t).permute(0, 1, 4, 2, 3, 5)\n\n    attn[:, :, sp_idx:, sp_idx:] = (\n        attn[:, :, sp_idx:, sp_idx:].view(B, -1, q_t, q_h, q_w, k_t, k_h, k_w)\n        + rel[:, :, :, :, :, :, None, None]\n    ).view(B, -1, q_t * q_h * q_w, k_t * k_h * k_w)\n\n    return attn\n\n\nclass MultiScaleAttention(nn.Module):\n    def __init__(\n        self,\n        dim,\n        dim_out,\n        input_size,\n        num_heads=8,\n        qkv_bias=False,\n        drop_rate=0.0,\n        kernel_q=(1, 1, 1),\n        kernel_kv=(1, 1, 1),\n        stride_q=(1, 1, 1),\n        stride_kv=(1, 1, 1),\n        norm_layer=nn.LayerNorm,\n        has_cls_embed=True,\n        # Options include `conv`, `avg`, and `max`.\n        mode=\"conv\",\n        # If True, perform pool before projection.\n        pool_first=False,\n        rel_pos_spatial=False,\n        rel_pos_temporal=False,\n        rel_pos_zero_init=False,\n        residual_pooling=False,\n        separate_qkv=False,\n    ):\n        super().__init__()\n        self.pool_first = pool_first\n        self.separate_qkv = separate_qkv\n        self.drop_rate = drop_rate\n        self.num_heads = num_heads\n        self.dim_out = dim_out\n        head_dim = dim_out // num_heads\n        
self.scale = head_dim**-0.5\n        self.has_cls_embed = has_cls_embed\n        self.mode = mode\n        padding_q = [int(q // 2) for q in kernel_q]\n        padding_kv = [int(kv // 2) for kv in kernel_kv]\n\n        if pool_first or separate_qkv:\n            self.q = nn.Linear(dim, dim_out, bias=qkv_bias)\n            self.k = nn.Linear(dim, dim_out, bias=qkv_bias)\n            self.v = nn.Linear(dim, dim_out, bias=qkv_bias)\n        else:\n            self.qkv = nn.Linear(dim, dim_out * 3, bias=qkv_bias)\n\n        self.proj = nn.Linear(dim_out, dim_out)\n        if drop_rate > 0.0:\n            self.proj_drop = nn.Dropout(drop_rate)\n\n        # Skip pooling with kernel and stride size of (1, 1, 1).\n        if numpy.prod(kernel_q) == 1 and numpy.prod(stride_q) == 1:\n            kernel_q = ()\n        if numpy.prod(kernel_kv) == 1 and numpy.prod(stride_kv) == 1:\n            kernel_kv = ()\n\n        if mode in (\"avg\", \"max\"):\n            pool_op = nn.MaxPool3d if mode == \"max\" else nn.AvgPool3d\n            self.pool_q = (\n                pool_op(kernel_q, stride_q, padding_q, ceil_mode=False)\n                if len(kernel_q) > 0\n                else None\n            )\n            self.pool_k = (\n                pool_op(kernel_kv, stride_kv, padding_kv, ceil_mode=False)\n                if len(kernel_kv) > 0\n                else None\n            )\n            self.pool_v = (\n                pool_op(kernel_kv, stride_kv, padding_kv, ceil_mode=False)\n                if len(kernel_kv) > 0\n                else None\n            )\n        elif mode == \"conv\" or mode == \"conv_unshared\":\n            if pool_first:\n                dim_conv = dim // num_heads if mode == \"conv\" else dim\n            else:\n                dim_conv = dim_out // num_heads if mode == \"conv\" else dim_out\n            self.pool_q = (\n                nn.Conv3d(\n                    dim_conv,\n                    dim_conv,\n                    kernel_q,\n      
              stride=stride_q,\n                    padding=padding_q,\n                    groups=dim_conv,\n                    bias=False,\n                )\n                if len(kernel_q) > 0\n                else None\n            )\n            self.norm_q = norm_layer(dim_conv) if len(kernel_q) > 0 else None\n            self.pool_k = (\n                nn.Conv3d(\n                    dim_conv,\n                    dim_conv,\n                    kernel_kv,\n                    stride=stride_kv,\n                    padding=padding_kv,\n                    groups=dim_conv,\n                    bias=False,\n                )\n                if len(kernel_kv) > 0\n                else None\n            )\n            self.norm_k = norm_layer(dim_conv) if len(kernel_kv) > 0 else None\n            self.pool_v = (\n                nn.Conv3d(\n                    dim_conv,\n                    dim_conv,\n                    kernel_kv,\n                    stride=stride_kv,\n                    padding=padding_kv,\n                    groups=dim_conv,\n                    bias=False,\n                )\n                if len(kernel_kv) > 0\n                else None\n            )\n            self.norm_v = norm_layer(dim_conv) if len(kernel_kv) > 0 else None\n        else:\n            raise NotImplementedError(f\"Unsupported model {mode}\")\n\n        self.rel_pos_spatial = rel_pos_spatial\n        self.rel_pos_temporal = rel_pos_temporal\n        if self.rel_pos_spatial:\n            assert input_size[1] == input_size[2]\n            size = input_size[1]\n            q_size = size // stride_q[1] if len(stride_q) > 0 else size\n            kv_size = size // stride_kv[1] if len(stride_kv) > 0 else size\n            rel_sp_dim = 2 * max(q_size, kv_size) - 1\n\n            self.rel_pos_h = nn.Parameter(torch.zeros(rel_sp_dim, head_dim))\n            self.rel_pos_w = nn.Parameter(torch.zeros(rel_sp_dim, head_dim))\n            if not rel_pos_zero_init:\n          
      trunc_normal_(self.rel_pos_h, std=0.02)\n                trunc_normal_(self.rel_pos_w, std=0.02)\n        if self.rel_pos_temporal:\n            self.rel_pos_t = nn.Parameter(torch.zeros(2 * input_size[0] - 1, head_dim))\n            if not rel_pos_zero_init:\n                trunc_normal_(self.rel_pos_t, std=0.02)\n\n        self.residual_pooling = residual_pooling\n\n    def forward(self, x, thw_shape):\n        B, N, _ = x.shape\n\n        if self.pool_first:\n            if self.mode == \"conv_unshared\":\n                fold_dim = 1\n            else:\n                fold_dim = self.num_heads\n            x = x.reshape(B, N, fold_dim, -1).permute(0, 2, 1, 3)\n            q = k = v = x\n        else:\n            assert self.mode != \"conv_unshared\"\n            if not self.separate_qkv:\n                qkv = (\n                    self.qkv(x)\n                    .reshape(B, N, 3, self.num_heads, -1)\n                    .permute(2, 0, 3, 1, 4)\n                )\n                q, k, v = qkv[0], qkv[1], qkv[2]\n            else:\n                q = k = v = x\n                q = self.q(q).reshape(B, N, self.num_heads, -1).permute(0, 2, 1, 3)\n                k = self.k(k).reshape(B, N, self.num_heads, -1).permute(0, 2, 1, 3)\n                v = self.v(v).reshape(B, N, self.num_heads, -1).permute(0, 2, 1, 3)\n\n        q, q_shape = attention_pool(\n            q,\n            self.pool_q,\n            thw_shape,\n            has_cls_embed=self.has_cls_embed,\n            norm=getattr(self, \"norm_q\", None),\n        )\n        k, k_shape = attention_pool(\n            k,\n            self.pool_k,\n            thw_shape,\n            has_cls_embed=self.has_cls_embed,\n            norm=getattr(self, \"norm_k\", None),\n        )\n        v, v_shape = attention_pool(\n            v,\n            self.pool_v,\n            thw_shape,\n            has_cls_embed=self.has_cls_embed,\n            norm=getattr(self, \"norm_v\", None),\n        )\n\n        
if self.pool_first:\n            q_N = numpy.prod(q_shape) + 1 if self.has_cls_embed else numpy.prod(q_shape)\n            k_N = numpy.prod(k_shape) + 1 if self.has_cls_embed else numpy.prod(k_shape)\n            v_N = numpy.prod(v_shape) + 1 if self.has_cls_embed else numpy.prod(v_shape)\n\n            q = q.permute(0, 2, 1, 3).reshape(B, q_N, -1)\n            q = self.q(q).reshape(B, q_N, self.num_heads, -1).permute(0, 2, 1, 3)\n\n            v = v.permute(0, 2, 1, 3).reshape(B, v_N, -1)\n            v = self.v(v).reshape(B, v_N, self.num_heads, -1).permute(0, 2, 1, 3)\n\n            k = k.permute(0, 2, 1, 3).reshape(B, k_N, -1)\n            k = self.k(k).reshape(B, k_N, self.num_heads, -1).permute(0, 2, 1, 3)\n\n        N = q.shape[2]\n        attn = (q * self.scale) @ k.transpose(-2, -1)\n        if self.rel_pos_spatial:\n            attn = cal_rel_pos_spatial(\n                attn,\n                q,\n                k,\n                self.has_cls_embed,\n                q_shape,\n                k_shape,\n                self.rel_pos_h,\n                self.rel_pos_w,\n            )\n\n        if self.rel_pos_temporal:\n            attn = cal_rel_pos_temporal(\n                attn,\n                q,\n                self.has_cls_embed,\n                q_shape,\n                k_shape,\n                self.rel_pos_t,\n            )\n        attn = attn.softmax(dim=-1)\n\n        x = attn @ v\n\n        if self.residual_pooling:\n            if self.has_cls_embed:\n                x[:, :, 1:, :] += q[:, :, 1:, :]\n            else:\n                x = x + q\n\n        x = x.transpose(1, 2).reshape(B, -1, self.dim_out)\n        x = self.proj(x)\n\n        if self.drop_rate > 0.0:\n            x = self.proj_drop(x)\n        return x, q_shape\n\n\nclass MultiScaleBlock(nn.Module):\n    def __init__(\n        self,\n        dim,\n        dim_out,\n        num_heads,\n        input_size,\n        mlp_ratio=4.0,\n        qkv_bias=False,\n        
qk_scale=None,\n        drop_rate=0.0,\n        drop_path=0.0,\n        layer_scale_init_value=0.0,\n        act_layer=nn.GELU,\n        norm_layer=nn.LayerNorm,\n        up_rate=None,\n        kernel_q=(1, 1, 1),\n        kernel_kv=(1, 1, 1),\n        stride_q=(1, 1, 1),\n        stride_kv=(1, 1, 1),\n        mode=\"conv\",\n        has_cls_embed=True,\n        pool_first=False,\n        rel_pos_spatial=False,\n        rel_pos_temporal=False,\n        rel_pos_zero_init=False,\n        residual_pooling=False,\n        dim_mul_in_att=False,\n        separate_qkv=False,\n    ):\n        super().__init__()\n        self.dim = dim\n        self.dim_out = dim_out\n        self.norm1 = norm_layer(dim)\n        self.dim_mul_in_att = dim_mul_in_att\n        kernel_skip = [s + 1 if s > 1 else s for s in stride_q]\n        stride_skip = stride_q\n        padding_skip = [int(skip // 2) for skip in kernel_skip]\n        att_dim = dim_out if dim_mul_in_att else dim\n        self.attn = MultiScaleAttention(\n            dim,\n            att_dim,\n            num_heads=num_heads,\n            input_size=input_size,\n            qkv_bias=qkv_bias,\n            drop_rate=drop_rate,\n            kernel_q=kernel_q,\n            kernel_kv=kernel_kv,\n            stride_q=stride_q,\n            stride_kv=stride_kv,\n            norm_layer=norm_layer,\n            has_cls_embed=has_cls_embed,\n            mode=mode,\n            pool_first=pool_first,\n            rel_pos_spatial=rel_pos_spatial,\n            rel_pos_temporal=rel_pos_temporal,\n            rel_pos_zero_init=rel_pos_zero_init,\n            residual_pooling=residual_pooling,\n            separate_qkv=separate_qkv,\n        )\n        self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()\n        self.norm2 = norm_layer(att_dim)\n        mlp_hidden_dim = int(att_dim * mlp_ratio)\n        self.has_cls_embed = has_cls_embed\n        # TODO: check the use case for up_rate, and merge the following lines\n 
       if up_rate is not None and up_rate > 1:\n            mlp_dim_out = dim * up_rate\n        else:\n            mlp_dim_out = dim_out\n        self.mlp = Mlp(\n            in_features=att_dim,\n            hidden_features=mlp_hidden_dim,\n            out_features=mlp_dim_out,\n            act_layer=act_layer,\n            drop_rate=drop_rate,\n        )\n        if layer_scale_init_value > 0:\n            self.gamma_1 = nn.Parameter(\n                layer_scale_init_value * torch.ones((dim)), requires_grad=True\n            )\n            self.gamma_2 = nn.Parameter(\n                layer_scale_init_value * torch.ones((dim_out)),\n                requires_grad=True,\n            )\n        else:\n            self.gamma_1, self.gamma_2 = None, None\n\n        if dim != dim_out:\n            self.proj = nn.Linear(dim, dim_out)\n\n        self.pool_skip = (\n            nn.MaxPool3d(kernel_skip, stride_skip, padding_skip, ceil_mode=False)\n            if len(stride_skip) > 0 and numpy.prod(stride_skip) > 1\n            else None\n        )\n\n    def forward(self, x, thw_shape=None):\n        x_norm = self.norm1(x)\n        x_block, thw_shape_new = self.attn(x_norm, thw_shape)\n        if self.dim_mul_in_att and self.dim != self.dim_out:\n            x = self.proj(x_norm)\n        x_res, _ = attention_pool(\n            x, self.pool_skip, thw_shape, has_cls_embed=self.has_cls_embed\n        )\n        if self.gamma_1 is not None:\n            x = x_res + self.drop_path(self.gamma_1 * x_block)\n        else:\n            x = x_res + self.drop_path(x_block)\n        x_norm = self.norm2(x)\n        x_mlp = self.mlp(x_norm)\n        if not self.dim_mul_in_att and self.dim != self.dim_out:\n            x = self.proj(x_norm)\n        if self.gamma_2 is not None:\n            x = x + self.drop_path(self.gamma_2 * x_mlp)\n        else:\n            x = x + self.drop_path(x_mlp)\n        if thw_shape:\n            return x, thw_shape_new\n        else:\n            
return x\n"
  },
  {
    "path": "slowfast/models/batchnorm_helper.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"BatchNorm (BN) utility functions and custom batch-size BN implementations\"\"\"\n\nfrom functools import partial\n\nimport torch\nimport torch.nn as nn\nfrom pytorchvideo.layers.batch_norm import (  # noqa: F401\n    NaiveSyncBatchNorm1d,\n    NaiveSyncBatchNorm3d,\n)\n\n\ndef get_norm(cfg):\n    \"\"\"\n    Args:\n        cfg (CfgNode): model building configs, details are in the comments of\n            the config file.\n    Returns:\n        nn.Module: the normalization layer.\n    \"\"\"\n    if cfg.BN.NORM_TYPE in {\"batchnorm\", \"sync_batchnorm_apex\"}:\n        return nn.BatchNorm3d\n    elif cfg.BN.NORM_TYPE == \"sub_batchnorm\":\n        return partial(SubBatchNorm3d, num_splits=cfg.BN.NUM_SPLITS)\n    elif cfg.BN.NORM_TYPE == \"sync_batchnorm\":\n        return partial(\n            NaiveSyncBatchNorm3d,\n            num_sync_devices=cfg.BN.NUM_SYNC_DEVICES,\n            global_sync=cfg.BN.GLOBAL_SYNC,\n        )\n    else:\n        raise NotImplementedError(\n            \"Norm type {} is not supported\".format(cfg.BN.NORM_TYPE)\n        )\n\n\nclass SubBatchNorm3d(nn.Module):\n    \"\"\"\n    The standard BN layer computes stats across all examples in a GPU. In some\n    cases it is desirable to compute stats across only a subset of examples\n    (e.g., in multigrid training https://arxiv.org/abs/1912.00998).\n    SubBatchNorm3d splits the batch dimension into N splits, and run BN on\n    each of them separately (so that the stats are computed on each subset of\n    examples (1/N of batch) independently. 
During evaluation, it aggregates\n    the stats from all splits into one BN.\n    \"\"\"\n\n    def __init__(self, num_splits, **args):\n        \"\"\"\n        Args:\n            num_splits (int): number of splits.\n            args (list): other arguments.\n        \"\"\"\n        super(SubBatchNorm3d, self).__init__()\n        self.num_splits = num_splits\n        num_features = args[\"num_features\"]\n        # Keep only one set of weight and bias.\n        if args.get(\"affine\", True):\n            self.affine = True\n            args[\"affine\"] = False\n            self.weight = torch.nn.Parameter(torch.ones(num_features))\n            self.bias = torch.nn.Parameter(torch.zeros(num_features))\n        else:\n            self.affine = False\n        self.bn = nn.BatchNorm3d(**args)\n        args[\"num_features\"] = num_features * num_splits\n        self.split_bn = nn.BatchNorm3d(**args)\n\n    def _get_aggregated_mean_std(self, means, stds, n):\n        \"\"\"\n        Calculate the aggregated mean and stds.\n        Args:\n            means (tensor): mean values.\n            stds (tensor): standard deviations.\n            n (int): number of sets of means and stds.\n        \"\"\"\n        mean = means.view(n, -1).sum(0) / n\n        std = (\n            stds.view(n, -1).sum(0) / n\n            + ((means.view(n, -1) - mean) ** 2).view(n, -1).sum(0) / n\n        )\n        return mean.detach(), std.detach()\n\n    def aggregate_stats(self):\n        \"\"\"\n        Synchronize running_mean, and running_var. 
Call this before eval.\n        \"\"\"\n        if self.split_bn.track_running_stats:\n            (\n                self.bn.running_mean.data,\n                self.bn.running_var.data,\n            ) = self._get_aggregated_mean_std(\n                self.split_bn.running_mean,\n                self.split_bn.running_var,\n                self.num_splits,\n            )\n\n    def forward(self, x):\n        if self.training:\n            n, c, t, h, w = x.shape\n            x = x.view(n // self.num_splits, c * self.num_splits, t, h, w)\n            x = self.split_bn(x)\n            x = x.view(n, c, t, h, w)\n        else:\n            x = self.bn(x)\n        if self.affine:\n            x = x * self.weight.view((-1, 1, 1, 1))\n            x = x + self.bias.view((-1, 1, 1, 1))\n        return x\n"
  },
  {
    "path": "slowfast/models/build.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Model construction functions.\"\"\"\n\nimport slowfast.utils.logging as logging\nimport torch\nfrom fvcore.common.registry import Registry\nfrom torch.distributed.algorithms.ddp_comm_hooks import default as comm_hooks_default\n\nlogger = logging.get_logger(__name__)\n\nMODEL_REGISTRY = Registry(\"MODEL\")\nMODEL_REGISTRY.__doc__ = \"\"\"\nRegistry for video model.\n\nThe registered object will be called with `obj(cfg)`.\nThe call should return a `torch.nn.Module` object.\n\"\"\"\n\n\ndef build_model(cfg, gpu_id=None):\n    \"\"\"\n    Builds the video model.\n    Args:\n        cfg (configs): configs that contains the hyper-parameters to build the\n        backbone. Details can be seen in slowfast/config/defaults.py.\n        gpu_id (Optional[int]): specify the gpu index to build model.\n    \"\"\"\n    if torch.cuda.is_available():\n        assert cfg.NUM_GPUS <= torch.cuda.device_count(), (\n            \"Cannot use more GPU devices than available\"\n        )\n    else:\n        assert cfg.NUM_GPUS == 0, (\n            \"Cuda is not available. 
Please set `NUM_GPUS: 0` for running on CPUs.\"\n        )\n\n    # Construct the model\n    name = cfg.MODEL.MODEL_NAME\n    model = MODEL_REGISTRY.get(name)(cfg)\n\n    if cfg.BN.NORM_TYPE == \"sync_batchnorm_apex\":\n        try:\n            import apex\n        except ImportError:\n            raise ImportError(\"APEX is required for this model, please install\")\n\n        logger.info(\"Converting BN layers to Apex SyncBN\")\n        process_group = apex.parallel.create_syncbn_process_group(\n            group_size=cfg.BN.NUM_SYNC_DEVICES\n        )\n        model = apex.parallel.convert_syncbn_model(model, process_group=process_group)\n\n    if cfg.NUM_GPUS:\n        if gpu_id is None:\n            # Determine the GPU used by the current process\n            cur_device = torch.cuda.current_device()\n        else:\n            cur_device = gpu_id\n        # Transfer the model to the current GPU device\n        model = model.cuda(device=cur_device)\n    # Use multi-process data parallel model in the multi-gpu setting\n    if cfg.NUM_GPUS > 1:\n        # Make model replica operate on the current device\n        model = torch.nn.parallel.DistributedDataParallel(\n            module=model,\n            device_ids=[cur_device],\n            output_device=cur_device,\n            find_unused_parameters=(\n                True\n                if cfg.MODEL.DETACH_FINAL_FC\n                or cfg.MODEL.MODEL_NAME == \"ContrastiveModel\"\n                else False\n            ),\n        )\n        if cfg.MODEL.FP16_ALLREDUCE:\n            model.register_comm_hook(\n                state=None, hook=comm_hooks_default.fp16_compress_hook\n            )\n    return model\n"
  },
  {
    "path": "slowfast/models/common.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n\nimport torch\nimport torch.nn as nn\n\n\nclass Mlp(nn.Module):\n    def __init__(\n        self,\n        in_features,\n        hidden_features=None,\n        out_features=None,\n        act_layer=nn.GELU,\n        drop_rate=0.0,\n    ):\n        super().__init__()\n        self.drop_rate = drop_rate\n        out_features = out_features or in_features\n        hidden_features = hidden_features or in_features\n        self.fc1 = nn.Linear(in_features, hidden_features)\n        self.act = act_layer()\n        self.fc2 = nn.Linear(hidden_features, out_features)\n        if self.drop_rate > 0.0:\n            self.drop = nn.Dropout(drop_rate)\n\n    def forward(self, x):\n        x = self.fc1(x)\n        x = self.act(x)\n        if self.drop_rate > 0.0:\n            x = self.drop(x)\n        x = self.fc2(x)\n        if self.drop_rate > 0.0:\n            x = self.drop(x)\n        return x\n\n\nclass Permute(nn.Module):\n    def __init__(self, dims):\n        super().__init__()\n        self.dims = dims\n\n    def forward(self, x):\n        return x.permute(*self.dims)\n\n\ndef drop_path(x, drop_prob: float = 0.0, training: bool = False):\n    \"\"\"\n    Stochastic Depth per sample.\n    \"\"\"\n    if drop_prob == 0.0 or not training:\n        return x\n    keep_prob = 1 - drop_prob\n    shape = (x.shape[0],) + (1,) * (\n        x.ndim - 1\n    )  # work with diff dim tensors, not just 2D ConvNets\n    mask = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)\n    mask.floor_()  # binarize\n    output = x.div(keep_prob) * mask\n    return output\n\n\nclass DropPath(nn.Module):\n    \"\"\"Drop paths (Stochastic Depth) per sample  (when applied in main path of residual blocks).\"\"\"\n\n    def __init__(self, drop_prob=None):\n        super(DropPath, self).__init__()\n        self.drop_prob = drop_prob\n\n    def forward(self, x):\n        return drop_path(x, self.drop_prob, 
self.training)\n\n\nclass TwoStreamFusion(nn.Module):\n    def __init__(self, mode, dim=None, kernel=3, padding=1):\n        \"\"\"\n        A general constructor for neural modules fusing two equal sized tensors\n        in forward. Following options are supported:\n\n        \"add\" / \"max\" / \"min\" / \"avg\"             : respective operations on the two halves.\n        \"concat\"                                  : NOOP.\n        \"concat_linear_{dim_mult}_{drop_rate}\"    : MLP to fuse with hidden dim \"dim_mult\"\n                                                    (optional, def 1.) higher than input dim\n                                                    with optional dropout \"drop_rate\" (def: 0.)\n        \"ln+concat_linear_{dim_mult}_{drop_rate}\" : perform MLP after layernorm on the input.\n\n        \"\"\"\n        super().__init__()\n        self.mode = mode\n        if mode == \"add\":\n            self.fuse_fn = lambda x: torch.stack(torch.chunk(x, 2, dim=2)).sum(dim=0)\n        elif mode == \"max\":\n            self.fuse_fn = (\n                lambda x: torch.stack(torch.chunk(x, 2, dim=2)).max(dim=0).values\n            )\n        elif mode == \"min\":\n            self.fuse_fn = (\n                lambda x: torch.stack(torch.chunk(x, 2, dim=2)).min(dim=0).values\n            )\n        elif mode == \"avg\":\n            self.fuse_fn = lambda x: torch.stack(torch.chunk(x, 2, dim=2)).mean(dim=0)\n        elif mode == \"concat\":\n            # x itself is the channel concat version\n            self.fuse_fn = lambda x: x\n        elif \"concat_linear\" in mode:\n            if len(mode.split(\"_\")) == 2:\n                dim_mult = 1.0\n                drop_rate = 0.0\n            elif len(mode.split(\"_\")) == 3:\n                dim_mult = float(mode.split(\"_\")[-1])\n                drop_rate = 0.0\n\n            elif len(mode.split(\"_\")) == 4:\n                dim_mult = float(mode.split(\"_\")[-2])\n                drop_rate = 
float(mode.split(\"_\")[-1])\n            else:\n                raise NotImplementedError\n\n            if mode.split(\"+\")[0] == \"ln\":\n                self.fuse_fn = nn.Sequential(\n                    nn.LayerNorm(dim),\n                    Mlp(\n                        in_features=dim,\n                        hidden_features=int(dim * dim_mult),\n                        act_layer=nn.GELU,\n                        out_features=dim,\n                        drop_rate=drop_rate,\n                    ),\n                )\n            else:\n                self.fuse_fn = Mlp(\n                    in_features=dim,\n                    hidden_features=int(dim * dim_mult),\n                    act_layer=nn.GELU,\n                    out_features=dim,\n                    drop_rate=drop_rate,\n                )\n\n        else:\n            raise NotImplementedError\n\n    def forward(self, x):\n        if \"concat_linear\" in self.mode:\n            return self.fuse_fn(x) + x\n\n        else:\n            return self.fuse_fn(x)\n"
  },
  {
    "path": "slowfast/models/contrastive.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport math\n\nimport numpy as np\nimport slowfast.models.losses as losses\nimport slowfast.utils.distributed as du\nimport slowfast.utils.logging as logging\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom slowfast.models.video_model_builder import MViT, ResNet, SlowFast, X3D\n\nfrom .build import MODEL_REGISTRY\n\nlogger = logging.get_logger(__name__)\n\n# Supported model types\n_MODEL_TYPES = {\n    \"slowfast\": SlowFast,\n    \"slow\": ResNet,\n    \"c2d\": ResNet,\n    \"i3d\": ResNet,\n    \"slow_c2d\": ResNet,\n    \"x3d\": X3D,\n    \"mvit\": MViT,\n}\n\n\n@MODEL_REGISTRY.register()\nclass ContrastiveModel(nn.Module):\n    \"\"\"\n    Contrastive Model, currently mainly focused on memory bank and CSC.\n    \"\"\"\n\n    def __init__(self, cfg):\n        super(ContrastiveModel, self).__init__()\n        # Construct the model.\n        self.backbone = _MODEL_TYPES[cfg.MODEL.ARCH](cfg)\n        self.type = cfg.CONTRASTIVE.TYPE\n        self.T = cfg.CONTRASTIVE.T\n        self.dim = cfg.CONTRASTIVE.DIM\n        self.length = cfg.CONTRASTIVE.LENGTH\n        self.k = cfg.CONTRASTIVE.QUEUE_LEN\n        self.mmt = cfg.CONTRASTIVE.MOMENTUM\n        self.momentum_annealing = cfg.CONTRASTIVE.MOMENTUM_ANNEALING\n        self.duration = 1\n        self.cfg = cfg\n        self.num_gpus = cfg.NUM_GPUS\n        self.l2_norm = Normalize()\n        self.knn_num_imgs = 0\n        self.knn_on = cfg.CONTRASTIVE.KNN_ON\n        self.train_labels = np.zeros((0,), dtype=np.int32)\n        self.num_pos = 2\n        self.num_crops = (\n            self.cfg.DATA.TRAIN_CROP_NUM_TEMPORAL * self.cfg.DATA.TRAIN_CROP_NUM_SPATIAL\n        )\n        self.nce_loss_fun = losses.get_loss_func(\"contrastive_loss\")(reduction=\"mean\")\n        assert self.cfg.MODEL.LOSS_FUNC == \"contrastive_loss\"\n        self.softmax = nn.Softmax(dim=1).cuda()\n\n    
    if self.type == \"mem\":\n            self.mem_type = cfg.CONTRASTIVE.MEM_TYPE\n            if self.mem_type == \"1d\":\n                self.memory = Memory1D(self.length, self.duration, self.dim, cfg)\n            else:\n                self.memory = Memory(self.length, self.duration, self.dim, cfg)\n            self.examplar_type = \"video\"\n            self.interp = cfg.CONTRASTIVE.INTERP_MEMORY\n        elif self.type == \"self\":\n            pass\n        elif self.type == \"moco\" or self.type == \"byol\":\n            # MoCo components\n            self.backbone_hist = _MODEL_TYPES[cfg.MODEL.ARCH](cfg)\n            for p in self.backbone_hist.parameters():\n                p.requires_grad = False\n            self.register_buffer(\"ptr\", torch.tensor([0]))\n            self.ptr.requires_grad = False\n            stdv = 1.0 / math.sqrt(self.dim / 3)\n            self.register_buffer(\n                \"queue_x\",\n                torch.rand(self.k, self.dim).mul_(2 * stdv).add_(-stdv),\n            )\n            self.register_buffer(\"iter\", torch.zeros([1], dtype=torch.long))\n            self._batch_shuffle_on = (\n                False\n                if (\n                    \"sync\" in cfg.BN.NORM_TYPE\n                    and cfg.BN.NUM_SYNC_DEVICES == cfg.NUM_GPUS\n                )\n                or self.type == \"byol\"\n                else True\n            )\n        elif self.type == \"swav\":\n            self.swav_use_public_code = True\n            if self.swav_use_public_code:\n                self.swav_prototypes = nn.Linear(\n                    self.dim, 1000, bias=False\n                )  # for orig implementation\n            else:\n                self.swav_prototypes = nn.Parameter(\n                    torch.randn((self.dim, 1000), dtype=torch.float)\n                )\n            self.swav_eps_sinkhorn = 0.05\n            self.swav_use_the_queue = False\n            # optionally starts a queue\n            if 
self.cfg.CONTRASTIVE.SWAV_QEUE_LEN > 0:\n                self.register_buffer(\n                    \"queue_swav\",\n                    torch.zeros(\n                        2,  # = args.crops_for_assign\n                        self.cfg.CONTRASTIVE.SWAV_QEUE_LEN // du.get_world_size(),\n                        self.dim,\n                    ),\n                )\n        elif self.type == \"simclr\":\n            self._simclr_precompute_pos_neg_mask_multi()\n        self.simclr_dist_on = cfg.CONTRASTIVE.SIMCLR_DIST_ON\n\n        # self.knn_mem = Memory1D(self.length, 1, self.dim, cfg) #  does not work\n        if self.knn_on:\n            self.knn_mem = Memory(self.length, 1, self.dim, cfg)\n\n    @torch.no_grad()\n    def knn_mem_update(self, q_knn, index):\n        if self.knn_on:\n            self.knn_mem.update(\n                q_knn,\n                momentum=1.0,\n                ind=index,\n                time=torch.zeros_like(index),\n                interp=False,\n            )\n\n    @torch.no_grad()\n    def init_knn_labels(self, train_loader):\n        logger.info(\"initializing knn labels\")\n        self.num_imgs = len(train_loader.dataset._labels)\n        self.train_labels = np.zeros((self.num_imgs,), dtype=np.int32)\n        for i in range(self.num_imgs):\n            self.train_labels[i] = train_loader.dataset._labels[i]\n        self.train_labels = torch.LongTensor(self.train_labels).cuda()\n        if self.length != self.num_imgs:\n            logger.error(\n                \"Kinetics dataloader size: {} differs from memorybank length {}\".format(\n                    self.num_imgs, self.length\n                )\n            )\n            self.knn_mem.resize(self.num_imgs, 1, self.dim)\n\n    @torch.no_grad()\n    def _update_history(self):\n        # momentum update\n        iter = int(self.iter)\n        m = self.mmt\n        dist = {}\n        for name, p in self.backbone.named_parameters():\n            dist[name] = p\n\n        if 
iter == 0:\n            for name, p in self.backbone_hist.named_parameters():\n                p.data.copy_(dist[name].data)\n\n        for name, p in self.backbone_hist.named_parameters():\n            p.data = dist[name].data * (1.0 - m) + p.data * m\n\n    @torch.no_grad()\n    def _batch_shuffle(self, x):\n        if len(x) == 2:\n            another_crop = True\n        else:\n            another_crop = False\n        if another_crop:\n            x, x_crop = x[0], x[1]\n        else:\n            x = x[0]\n\n        world_size = self.cfg.NUM_GPUS * self.cfg.NUM_SHARDS\n        if self.num_gpus > 1:\n            if self.cfg.CONTRASTIVE.LOCAL_SHUFFLE_BN:\n                x = du.cat_all_gather(x, local=True)\n                if another_crop:\n                    x_crop = du.cat_all_gather(x_crop, local=True)\n                world_size = du.get_local_size()\n                gpu_idx = du.get_local_rank()\n            else:\n                x = du.cat_all_gather(x)\n                if another_crop:\n                    x_crop = du.cat_all_gather(x_crop)\n                gpu_idx = torch.distributed.get_rank()\n\n        idx_randperm = torch.randperm(x.shape[0]).cuda()\n        if self.num_gpus > 1:\n            torch.distributed.broadcast(idx_randperm, src=0)\n        else:\n            gpu_idx = 0\n        idx_randperm = idx_randperm.view(world_size, -1)\n        x = x[idx_randperm[gpu_idx, :]]\n        if another_crop:\n            x_crop = x_crop[idx_randperm[gpu_idx, :]]\n\n        idx_restore = torch.argsort(idx_randperm.view(-1))\n        idx_restore = idx_restore.view(world_size, -1)\n        if another_crop:\n            return [x, x_crop], idx_restore\n        else:\n            return [x], idx_restore\n\n    @torch.no_grad()\n    def _batch_unshuffle(self, x, idx_restore):\n        if self.num_gpus > 1:\n            if self.cfg.CONTRASTIVE.LOCAL_SHUFFLE_BN:\n                x = du.cat_all_gather(x, local=True)\n                gpu_idx = 
du.get_local_rank()\n            else:\n                x = du.cat_all_gather(x)\n                gpu_idx = torch.distributed.get_rank()\n        else:\n            gpu_idx = 0\n\n        idx = idx_restore[gpu_idx, :]\n        x = x[idx]\n        return x\n\n    @torch.no_grad()\n    def eval_knn(self, q_knn, knn_k=200):\n        with torch.no_grad():\n            dist = torch.einsum(\n                \"nc,mc->nm\",\n                q_knn.view(q_knn.size(0), -1),\n                self.knn_mem.memory.view(self.knn_mem.memory.size(0), -1),\n            )\n            yd, yi = dist.topk(knn_k, dim=1, largest=True, sorted=True)\n        return yd, yi\n\n    def sim_loss(self, q, k):\n        similarity = torch.einsum(\"nc,nc->n\", [q, k])  # N-dim\n        # similarity += delta_t # higher if time distance is larger\n        # sim = sim - max_margin + delta_t * k\n        similarity /= self.T  # history-compatible\n        loss = -similarity.mean()\n        return loss\n\n    @torch.no_grad()\n    def momentum_anneal_cosine(self, epoch_exact):\n        self.mmt = (\n            1\n            - (1 - self.cfg.CONTRASTIVE.MOMENTUM)\n            * (math.cos(math.pi * epoch_exact / self.cfg.SOLVER.MAX_EPOCH) + 1.0)\n            * 0.5\n        )\n\n    @torch.no_grad()\n    def _dequeue_and_enqueue(self, keys, extra_keys=None):\n        ptr = int(self.ptr.item())\n        if (\n            not self.cfg.CONTRASTIVE.MOCO_MULTI_VIEW_QUEUE\n        ):  # TODO: add multiview negatives\n            keys_queue_update = [keys[0]]\n        else:\n            assert len(keys) > 0, \"need to have multiple views for adding them to queue\"\n            keys_queue_update = []\n            keys_queue_update += keys\n            if extra_keys:\n                keys_queue_update += [\n                    item for sublist in extra_keys for item in sublist\n                ]\n        for key in keys_queue_update:\n            # write the current feat into queue, at pointer\n            
num_items = int(key.size(0))\n\n            assert self.k % num_items == 0\n            assert ptr + num_items <= self.k\n            self.queue_x[ptr : ptr + num_items, :] = key\n            # move pointer\n            ptr += num_items\n            # reset pointer\n            if ptr == self.k:\n                ptr = 0\n            self.ptr[0] = ptr\n\n    @torch.no_grad()\n    def batch_clips(self, clips):\n        clips_batched = [None] * len(clips[0])\n        for i, clip in enumerate(clips):\n            for j, view in enumerate(clip):\n                if i == 0:\n                    clips_batched[j] = view\n                else:\n                    clips_batched[j] = torch.cat([clips_batched[j], view], dim=0)\n                del view\n        return clips_batched\n\n    @torch.no_grad()\n    def compute_key_feat(\n        self, clips_k, compute_predictor_keys=False, batched_inference=True\n    ):\n        assert self.training\n        # momentum update key encoder\n        self._update_history()\n        self.iter += 1\n        n_clips = len(clips_k)\n        bsz = clips_k[0][0].shape[0]\n        if n_clips * bsz * clips_k[0][0].numel() > 4 * 64 * 3 * 8 * 224 * 224:\n            batched_inference = False  # hack to avoid oom on large inputs\n        assert n_clips > 0\n        if batched_inference and all(\n            [\n                clips_k[i][j].shape[1:] == clips_k[0][j].shape[1:]\n                for i in range(len(clips_k))\n                for j in range(len(clips_k[i]))\n            ]\n        ):\n            clips_k = [self.batch_clips(clips_k)]\n            batched = True\n        else:\n            batched = False\n\n        keys, pred_keys = [], []\n        for k in range(0, len(clips_k)):\n            clip_k = clips_k[k]\n            if self._batch_shuffle_on:\n                with torch.no_grad():\n                    clip_k, idx_restore = self._batch_shuffle(clip_k)\n            with torch.no_grad():\n                hist_feat = 
self.backbone_hist(clip_k)\n                if isinstance(hist_feat, list):\n                    hist_time = hist_feat[1:]\n                    hist_feat = hist_feat[0]\n                    if compute_predictor_keys:\n                        tks = []\n                        for tk in hist_time:\n                            tk = self.l2_norm(tk)\n                            if self._batch_shuffle_on:\n                                tk = self._batch_unshuffle(tk, idx_restore).detach()\n                            tks.append(tk)\n                        pred_keys.append(tks)\n                x_hist = self.l2_norm(hist_feat)\n                if self._batch_shuffle_on:\n                    x_hist = self._batch_unshuffle(x_hist, idx_restore).detach()\n            keys.append(x_hist)\n        if batched:\n            assert len(keys) == 1, \"batched input uses single clip\"\n            batched_key = keys[0]\n            if compute_predictor_keys:\n                batched_pred_key = pred_keys[0]\n            keys, pred_keys = [], []\n            for k in range(0, n_clips):\n                keys.append(batched_key[k * bsz : (k + 1) * bsz])\n                if compute_predictor_keys:\n                    pred_keys.append(batched_pred_key[k * bsz : (k + 1) * bsz])\n        if compute_predictor_keys:\n            return keys, pred_keys\n        else:\n            return keys\n\n    def forward(self, clips, index=None, time=None, epoch_exact=None, keys=None):\n        if epoch_exact is not None and self.momentum_annealing:\n            self.momentum_anneal_cosine(epoch_exact)\n\n        if self.type == \"mem\":\n            batch_size = clips[0].size(0)\n            q = self.backbone(clips)\n            if index is None:\n                return q\n            q = self.l2_norm(q)\n\n            if not self.training:\n                assert self.knn_mem.duration == 1\n                return self.eval_knn(q)\n            time *= self.duration - 1\n            clip_ind = 
torch.randint(\n                0,\n                self.length,\n                size=(\n                    batch_size,\n                    self.k + 1,\n                ),\n            ).cuda()\n            clip_ind.select(1, 0).copy_(index.data)\n\n            if self.mem_type == \"2d\":\n                if self.interp:\n                    time_ind = (\n                        torch.empty(batch_size, self.k + 1)\n                        .uniform_(0, self.duration - 1)\n                        .cuda()\n                    )\n                else:\n                    time_ind = torch.randint(\n                        0,\n                        self.duration - 1,\n                        size=(\n                            batch_size,\n                            self.k + 1,\n                        ),\n                    ).cuda()\n            else:\n                time_ind = torch.zeros(size=(batch_size, self.k + 1), dtype=int).cuda()\n\n            if self.examplar_type == \"clip\":\n                # Diff clip from same video are negative.\n                time_ind.select(1, 0).copy_(time.data)\n            elif self.examplar_type == \"video\":\n                # Diff clip from same video are positive.\n                pass\n            else:\n                raise NotImplementedError(\n                    \"unsupported examplar_type {}\".format(self.examplar_type)\n                )\n            k = self.memory.get(clip_ind, time_ind, self.interp)\n            # q: N x C, k: N x K x C\n            prod = torch.einsum(\"nc,nkc->nk\", q, k)\n            prod = torch.div(prod, self.T)\n\n            loss = self.nce_loss_fun(prod)\n\n            self.memory.update(\n                q, momentum=self.mmt, ind=index, time=time, interp=self.interp\n            )\n            self.knn_mem_update(q, index)\n            return prod, 0.0, True\n        elif self.type == \"moco\":\n            if isinstance(clips[0], list):\n                n_clips = len(clips)\n      
          ind_clips = np.arange(\n                    n_clips\n                )  # clips come ordered temporally from decoder\n\n                clip_q = clips[ind_clips[0]]\n                clips_k = [clips[i] for i in ind_clips[1:]]\n                # rearange time\n                time_q = time[:, ind_clips[0], :]\n                time_k = (\n                    time[:, ind_clips[1:], :]\n                    if keys is None\n                    else time[:, ind_clips[0] + 1 :, :]\n                )\n            else:\n                clip_q = clips\n\n            feat_q = self.backbone(clip_q)\n            extra_projs = []\n            if isinstance(feat_q, list):\n                extra_projs = feat_q[1:]\n                feat_q = feat_q[0]\n                extra_projs = [self.l2_norm(feat) for feat in extra_projs]\n\n            if index is None:\n                return feat_q\n            q = self.l2_norm(feat_q)\n            q_knn = q\n\n            if not self.training:\n                return self.eval_knn(q_knn)\n\n            if keys is None:\n                keys = self.compute_key_feat(clips_k, compute_predictor_keys=False)\n                auto_enqueue_keys = True\n            else:\n                auto_enqueue_keys = False\n\n            # score computation\n            queue_neg = torch.einsum(\"nc,kc->nk\", [q, self.queue_x.clone().detach()])\n\n            for k, key in enumerate(keys):\n                out_pos = torch.einsum(\"nc,nc->n\", [q, key]).unsqueeze(-1)\n                lgt_k = torch.cat([out_pos, queue_neg], dim=1)\n                if k == 0:\n                    logits = lgt_k\n                else:\n                    logits = torch.cat([logits, lgt_k], dim=0)\n\n            logits = torch.div(logits, self.T)\n\n            loss = self.nce_loss_fun(logits)\n            # update queue\n            if self.training and auto_enqueue_keys:\n                self._dequeue_and_enqueue(keys)\n\n            self.knn_mem_update(q_knn, 
index)\n            return logits, loss\n\n        elif self.type == \"byol\":\n            clips_key = [None] * len(clips)\n            for i, clip in enumerate(clips):\n                p = []\n                for path in clip:\n                    p.append(path)\n                clips_key[i] = p\n            batch_clips = False\n            if isinstance(clips[0], list):\n                n_clips = len(clips)\n                ind_clips = np.arange(\n                    n_clips\n                )  # clips come ordered temporally from decoder\n                if batch_clips and n_clips > 1:\n                    clips_batched = self.batch_clips(clips)\n                    clips_key = [clips_batched]\n                    clip_q = clips_batched\n                else:\n                    clip_q = clips[0]\n            else:\n                clip_q = clips\n\n            feat_q = self.backbone(clip_q)\n            predictors = []\n            if isinstance(feat_q, list):\n                predictors = feat_q[1:]\n                feat_q = feat_q[0]\n                predictors = [self.l2_norm(feat) for feat in predictors]\n            else:\n                raise NotImplementedError(\"BYOL: predictor is missing\")\n            assert len(predictors) == 1\n            if index is None:\n                return feat_q\n            q = self.l2_norm(feat_q)\n\n            q_knn = q  # projector\n\n            if not self.training:\n                return self.eval_knn(q_knn)\n\n            ind_clips = np.arange(n_clips)  # clips come ordered temporally from decoder\n\n            # rest down is for training\n            if keys is None:\n                keys = self.compute_key_feat(clips_key, compute_predictor_keys=False)\n\n            if self.cfg.CONTRASTIVE.SEQUENTIAL:\n                loss_reg = self.sim_loss(predictors[0], keys[0])\n                for i in range(1, len(keys)):\n                    loss_reg += self.sim_loss(predictors[0], keys[i])\n                loss_reg 
/= len(keys)\n            else:\n                if batch_clips:\n                    bs = predictors[0].shape[0] // 2\n                    loss_reg = self.sim_loss(\n                        predictors[0][:bs, :], keys[0][bs:, :]\n                    ) + self.sim_loss(predictors[0][bs:, :], keys[0][:bs, :])\n                    q_knn = q_knn[:bs, :]\n                    del clips_batched[0]\n                else:\n                    loss_q1 = self.sim_loss(predictors[0], keys[1])\n                    assert len(clips) == 2\n                    clip_q2 = clips[1]\n                    feat_q2 = self.backbone(clip_q2)\n                    predictors2 = feat_q2[1:]\n                    # feat_q2 = feat_q2[0] # not used\n                    predictors2 = [self.l2_norm(feat) for feat in predictors2]\n                    assert len(predictors2) == 1\n\n                    loss_q2 = self.sim_loss(predictors2[0], keys[0])\n                    loss_reg = loss_q1 + loss_q2\n\n            # loss_pos = self.sim_loss(q1_proj, q2_proj)\n            dummy_logits = torch.cat(\n                (\n                    9999.0 * torch.ones((len(index), 1), dtype=torch.float).cuda(),\n                    torch.zeros((len(index), self.k), dtype=torch.float).cuda(),\n                ),\n                dim=1,\n            )\n\n            self.knn_mem_update(q_knn, index)\n\n            return dummy_logits, loss_reg\n\n        elif self.type == \"swav\":\n            if not isinstance(clips[0], list):\n                if self.swav_use_public_code:\n                    proj_1, _ = self.run_swav_orig_encoder_q(clips)\n                else:\n                    proj_1, _ = self.run_swav_encoder_q(clips)\n                if index is None:\n                    return proj_1\n                if not self.training:\n                    return self.eval_knn(proj_1)\n            n_clips = len(clips)\n            ind_clips = np.arange(n_clips)  # clips come ordered temporally from decoder\n          
  clip_q = clips[0]\n\n            if self.swav_use_public_code:\n                # uses official code of SwAV from\n                # https://github.com/facebookresearch/swav/blob/master/main_swav.py\n                with torch.no_grad():\n                    m = getattr(self, \"module\", self)\n                    w = m.swav_prototypes.weight.data.clone()\n                    w = nn.functional.normalize(w, dim=1, p=2)\n                    m.swav_prototypes.weight.copy_(w)\n\n                bs = clips[0][0].size(0)\n                output, embedding = [], []\n\n                for clip_q in clips:\n                    x = self.run_swav_orig_encoder_q(clip_q)\n                    embedding.append(x[0])\n                    output.append(x[1])\n                q_knn = embedding[0]\n                embedding = torch.cat(embedding, dim=0)\n                output = torch.cat(output, dim=0)\n\n                loss_swav = 0\n                swav_extra_crops = n_clips - 2\n                self.swav_crops_for_assign = np.arange(n_clips - swav_extra_crops)\n                for i, crop_id in enumerate(self.swav_crops_for_assign):\n                    with torch.no_grad():\n                        out = output[bs * crop_id : bs * (crop_id + 1)]\n                        if (\n                            self.cfg.CONTRASTIVE.SWAV_QEUE_LEN > 0\n                            and epoch_exact >= 15.0\n                        ):\n                            if self.swav_use_the_queue or not torch.all(\n                                self.queue_swav[i, -1, :] == 0\n                            ):\n                                self.swav_use_the_queue = True\n                                out = torch.cat(\n                                    (\n                                        torch.mm(\n                                            self.queue_swav[i],\n                                            m.swav_prototypes.weight.t(),\n                                        ),\n       
                                 out,\n                                    )\n                                )\n                            self.queue_swav[i, bs:] = self.queue_swav[i, :-bs].clone()\n                            self.queue_swav[i, :bs] = embedding[\n                                crop_id * bs : (crop_id + 1) * bs\n                            ]\n                        q = out / self.swav_eps_sinkhorn\n                        q = torch.exp(q).t()\n                        q = (\n                            self.distributed_sinkhorn(q, 3)[-bs:]\n                            if self.cfg.NUM_SHARDS > 1\n                            else self.sinkhorn(q.t(), 3)[-bs:]\n                        )\n                    subloss = 0\n                    for v in np.delete(np.arange(n_clips), crop_id):\n                        p = self.softmax(output[bs * v : bs * (v + 1)] / self.T)\n                        subloss -= torch.mean(torch.sum(q * torch.log(p), dim=1))\n                    loss_swav += subloss / (n_clips - 1)\n                loss_swav /= len(self.swav_crops_for_assign)\n            else:\n                proj_1, out_1 = self.run_swav_encoder_q(clip_q)\n                q_knn = proj_1\n                if not self.training:\n                    return self.eval_knn(q_knn)\n                proj_2, out_2 = self.run_swav_encoder_q(clips[1])\n                bs = proj_1.shape[0]\n                if self.cfg.CONTRASTIVE.SWAV_QEUE_LEN > 0:\n                    if epoch_exact >= 15.0 and not torch.all(\n                        self.queue_swav[0, -1, :] == 0\n                    ):\n                        swav_prototypes = F.normalize(\n                            self.swav_prototypes, dim=0, p=2\n                        ).detach()\n                        out_1 = torch.cat(\n                            (\n                                torch.mm(self.queue_swav[0].detach(), swav_prototypes),\n                                out_1,\n                            
)\n                        )\n                        out_2 = torch.cat(\n                            (\n                                torch.mm(self.queue_swav[1].detach(), swav_prototypes),\n                                out_2,\n                            )\n                        )\n                    # fill the queue\n                    self.queue_swav[0, bs:] = self.queue_swav[0, :-bs].clone()\n                    self.queue_swav[0, :bs] = proj_1.detach()\n                    self.queue_swav[1, bs:] = self.queue_swav[1, :-bs].clone()\n                    self.queue_swav[1, :bs] = proj_2.detach()\n\n                with torch.no_grad():\n                    code_1 = self.get_code(out_1)\n                    code_2 = self.get_code(out_2)\n                loss12 = self.KLDivLoss(out_1[-bs:], code_2[-bs:].detach())\n                loss21 = self.KLDivLoss(out_2[-bs:], code_1[-bs:].detach())\n                loss_swav = loss12 + loss21\n            self.knn_mem_update(q_knn, index)\n            dummy_logits = torch.cat(\n                (\n                    9999.0 * torch.ones((len(index), 1), dtype=torch.float).cuda(),\n                    torch.zeros((len(index), self.k), dtype=torch.float).cuda(),\n                ),\n                dim=1,\n            )\n            return dummy_logits, loss_swav\n\n        elif self.type == \"simclr\":\n            if isinstance(clips[0], list):\n                n_clips = len(clips)\n                clip_q = clips[0]\n            else:\n                clip_q = clips\n            feat_q = self.backbone(clip_q)\n            q = self.l2_norm(feat_q)\n            if index is None:\n                return q\n            q_knn = q\n            if not self.training:\n                return self.eval_knn(q_knn)\n            q2 = self.backbone(clips[1])\n            q2 = self.l2_norm(q2)\n            distributed_loss = False\n            if distributed_loss and self.cfg.NUM_GPUS > 1:\n                out = torch.cat([q, q2], 
dim=0)\n                if self.cfg.CONTRASTIVE.SIMCLR_DIST_ON:\n                    out_all = du.cat_all_gather(out)\n                else:\n                    out_all = out\n                similarity = torch.exp(torch.mm(out, out_all.t()) / self.T)\n                Z, loss = 0.0, 0.0\n                for loss_id in range(len(self.pos_mask)):\n                    pos = torch.sum(similarity * self.pos_mask[loss_id], 1)\n                    neg = torch.sum(similarity * self.neg_mask, 1)\n                    idx = (1 - torch.sum(self.pos_mask[loss_id], 1) > 0).detach()\n                    term_prob = pos / (pos + neg)\n                    term_prob[idx] = 1.0\n                    term_loss = torch.log(term_prob)\n                    Z += torch.sum(~idx).detach()\n                    loss -= torch.sum(term_loss)\n                loss /= Z\n            else:\n                cat_across_gpus = True\n                if cat_across_gpus and self.cfg.NUM_GPUS > 1:\n                    q = du.AllGatherWithGradient.apply(q)\n                    q2 = du.AllGatherWithGradient.apply(q2)\n                out = torch.cat([q, q2], dim=0)\n                # [2*B, 2*B]\n                sim_matrix = torch.exp(torch.mm(out, out.t().contiguous()) / self.T)\n                # SANITY:\n                mask = (\n                    torch.ones_like(sim_matrix)\n                    - torch.eye(out.shape[0], device=sim_matrix.device)\n                ).bool()\n                # [2*B, 2*B-1]\n                sim_matrix = sim_matrix.masked_select(mask).view(out.shape[0], -1)\n                # compute loss\n                pos_sim = torch.exp(torch.sum(q * q2, dim=-1) / self.T)\n                # [2*B]\n                pos_sim = torch.cat([pos_sim, pos_sim], dim=0)\n                loss = (-torch.log(pos_sim / sim_matrix.sum(dim=-1))).mean()\n            self.knn_mem_update(q_knn, index)\n            dummy_logits = torch.cat(\n                (\n                    9999.0 * 
torch.ones((len(index), 1), dtype=torch.float).cuda(),\n                    torch.zeros((len(index), self.k), dtype=torch.float).cuda(),\n                ),\n                dim=1,\n            )\n            return dummy_logits, loss\n        else:\n            raise NotImplementedError()\n\n    def _simclr_precompute_pos_neg_mask_multi(self):\n        # computed once at the beginning of training\n        distributed = self.cfg.CONTRASTIVE.SIMCLR_DIST_ON\n        if distributed:\n            total_images = self.cfg.TRAIN.BATCH_SIZE * self.cfg.NUM_SHARDS\n            world_size = du.get_world_size()\n            rank = du.get_rank()\n        else:\n            total_images = self.cfg.TRAIN.BATCH_SIZE\n            world_size = du.get_local_size()\n            rank = du.get_local_rank()\n        local_orig_images = total_images // world_size\n        local_crops = local_orig_images * self.num_crops\n\n        pos_temps = []\n        for d in np.arange(self.num_crops):\n            pos_temp, neg_temp = [], []\n            for i in range(world_size):\n                if i == rank:\n                    pos = np.eye(local_crops, k=d * local_orig_images) + np.eye(\n                        local_crops, k=-local_crops + d * local_orig_images\n                    )\n                    neg = np.ones((local_crops, local_crops))\n                else:\n                    pos = np.zeros((local_crops, local_crops))\n                    neg = np.zeros((local_crops, local_crops))\n                pos_temp.append(pos)\n                neg_temp.append(neg)\n            pos_temps.append(np.hstack(pos_temp))\n            neg_temp = np.hstack(neg_temp)\n\n        pos_mask = []\n        for i in range(self.num_crops - 1):\n            pos_mask.append(torch.from_numpy(pos_temps[1 + i]))\n        neg_mask = torch.from_numpy(neg_temp - sum(pos_temps))\n\n        if self.num_gpus:\n            for i in range(len(pos_mask)):\n                pos_mask[i] = 
pos_mask[i].cuda(non_blocking=True)\n            neg_mask = neg_mask.cuda(non_blocking=True)\n        self.pos_mask, self.neg_mask = pos_mask, neg_mask\n\n    def run_swav_encoder_q(self, im):\n        proj = self.backbone(im)  # Nx512, Nx128\n        proj = F.normalize(proj, dim=1)  # always normalize\n        swav_prototypes = F.normalize(self.swav_prototypes, dim=0, p=2)\n        out = proj @ swav_prototypes\n        return proj, out\n\n    @torch.no_grad()\n    def get_code(self, out):\n        with torch.no_grad():\n            Q = torch.exp(out / self.swav_eps_sinkhorn)  # BxK\n            if self.cfg.NUM_SHARDS > 1:\n                Q_sink = self.distributed_sinkhorn(Q.t(), 3)  # BxK\n            else:\n                Q_sink = self.sinkhorn(Q, 3)  # BxK\n        return Q_sink\n\n    def run_swav_orig_encoder_q(self, x):\n        x = self.backbone(x)  # Nx512, Nx128\n        x = nn.functional.normalize(x, dim=1, p=2)\n        if self.swav_prototypes is not None:\n            return x, self.swav_prototypes(x)\n        return x\n\n    @torch.no_grad()\n    def sinkhorn(self, Q, iters):\n        with torch.no_grad():\n            Q = Q.t()  # KxB\n            sum_Q = torch.sum(Q)\n            Q /= sum_Q\n\n            r = torch.ones(Q.shape[0]).cuda(non_blocking=True) / Q.shape[0]\n            c = torch.ones(Q.shape[1]).cuda(non_blocking=True) / Q.shape[1]\n\n            for _ in range(iters):\n                Q *= (r / torch.sum(Q, dim=1)).unsqueeze(1)\n                Q *= (c / torch.sum(Q, dim=0)).unsqueeze(0)\n\n            Q = Q / torch.sum(Q, dim=0, keepdim=True)\n            return Q.t().float()\n\n    def distributed_sinkhorn(self, Q, nmb_iters):\n        with torch.no_grad():\n            sum_Q = torch.sum(Q)\n            du.all_reduce([sum_Q], average=False)\n            Q /= sum_Q\n\n            u = torch.zeros(Q.shape[0]).cuda(non_blocking=True)\n            r = torch.ones(Q.shape[0]).cuda(non_blocking=True) / Q.shape[0]\n            c = 
torch.ones(Q.shape[1]).cuda(non_blocking=True) / (\n                du.get_world_size() * Q.shape[1]\n            )\n\n            curr_sum = torch.sum(Q, dim=1)\n            du.all_reduce([curr_sum], average=False)\n\n            for _ in range(nmb_iters):\n                u = curr_sum\n                Q *= (r / u).unsqueeze(1)\n                Q *= (c / torch.sum(Q, dim=0)).unsqueeze(0)\n                curr_sum = torch.sum(Q, dim=1)\n                du.all_reduce([curr_sum], average=False)\n            return (Q / torch.sum(Q, dim=0, keepdim=True)).t().float()\n\n    def KLDivLoss(self, out, code):\n        softmax = nn.Softmax(dim=1).cuda()\n        p = softmax(out / self.T)\n        loss = torch.mean(-torch.sum(code * torch.log(p), dim=1))\n        return loss\n\n\ndef l2_loss(x, y):\n    return 2 - 2 * (x * y).sum(dim=-1)\n\n\nclass Normalize(nn.Module):\n    def __init__(self, power=2, dim=1):\n        super(Normalize, self).__init__()\n        self.dim = dim\n        self.power = power\n\n    def forward(self, x):\n        norm = x.pow(self.power).sum(self.dim, keepdim=True).pow(1.0 / self.power)\n        out = x.div(norm)\n        return out\n\n\nclass Memory(nn.Module):\n    def __init__(self, length, duration, dim, cfg):\n        super(Memory, self).__init__()\n        self.length = length\n        self.duration = duration\n        self.dim = dim\n        stdv = 1.0 / math.sqrt(dim / 3)\n        self.register_buffer(\n            \"memory\",\n            torch.rand(length, duration, dim).mul_(2 * stdv).add_(-stdv),\n        )\n        self.device = self.memory.device\n        self.l2_norm = Normalize(dim=1)\n        self.l2_norm2d = Normalize(dim=2)\n        self.num_gpus = cfg.NUM_GPUS\n\n    def resize(self, length, duration, dim):\n        self.length = length\n        self.duration = duration\n        self.dim = dim\n        stdv = 1.0 / math.sqrt(dim / 3)\n        del self.memory\n        self.memory = (\n            torch.rand(length, duration, 
dim, device=self.device)\n            .mul_(2 * stdv)\n            .add_(-stdv)\n            .cuda()\n        )\n\n    def get(self, ind, time, interp=False):\n        batch_size = ind.size(0)\n        with torch.no_grad():\n            if interp:\n                # mem_idx = self.memory[ind.view(-1), :, :]\n                t0 = time.floor().long()  # - 1\n                t0 = torch.clamp(t0, 0, self.memory.shape[1] - 1)\n                t1 = t0 + 1\n                t1 = torch.clamp(t1, 0, self.memory.shape[1] - 1)\n\n                mem_t0 = self.memory[ind.view(-1), t0.view(-1), :]\n                mem_t1 = self.memory[ind.view(-1), t1.view(-1), :]\n                w2 = time.view(-1, 1) / self.duration\n                w_t1 = (time - t0).view(-1, 1).float()\n                w_t1 = 1 - w_t1  # hack for inverse\n                selected_mem = mem_t0 * (1 - w_t1) + mem_t1 * w_t1\n            else:\n                # logger.info(\"1dmem get ind shape {} time shape {}\".format(ind.shape, time.shape))\n                selected_mem = self.memory[ind.view(-1), time.long().view(-1), :]\n\n        out = selected_mem.view(batch_size, -1, self.dim)\n        return out\n\n    def update(self, mem, momentum, ind, time, interp=False):\n        if self.num_gpus > 1:\n            mem, ind, time = du.all_gather([mem, ind, time])\n        with torch.no_grad():\n            if interp:\n                t0 = time.floor().long()  # - 1\n                t0 = torch.clamp(t0, 0, self.memory.shape[1] - 1)\n                t1 = t0 + 1\n                t1 = torch.clamp(t1, 0, self.memory.shape[1] - 1)\n                mem_t0 = self.memory[ind.view(-1), t0.view(-1), :]\n                mem_t1 = self.memory[ind.view(-1), t1.view(-1), :]\n                w2 = time.float().view(-1, 1) / float(self.duration)\n                w_t1 = (time - t0).view(-1, 1).float()\n                w_t1 = 1 - w_t1  # hack for inverse\n\n                w_t0 = 1 - w_t1\n                # mem = mem.squeeze()\n        
        duo_update = False\n                if duo_update:\n                    update_t0 = (mem * w_t0 + mem_t0 * w_t1) * momentum + mem_t0 * (\n                        1 - momentum\n                    )\n                    update_t1 = (mem * w_t1 + mem_t1 * w_t0) * momentum + mem_t1 * (\n                        1 - momentum\n                    )\n                else:\n                    update_t0 = mem * w_t0 * momentum + mem_t0 * (1 - momentum)\n                    update_t1 = mem * w_t1 * momentum + mem_t1 * (1 - momentum)\n\n                update_t0 = self.l2_norm(update_t0)\n                update_t1 = self.l2_norm(update_t1)\n\n                self.memory[ind.view(-1), t0.view(-1), :] = update_t0.squeeze()\n                self.memory[ind.view(-1), t1.view(-1), :] = update_t1.squeeze()\n            else:\n                mem = mem.view(mem.size(0), 1, -1)\n                mem_old = self.get(ind, time, interp=interp)\n                mem_update = mem * momentum + mem_old * (1 - momentum)\n                mem_update = self.l2_norm2d(mem_update)\n                # logger.info(\"1dmem set ind shape {} time shape {}\".format(ind.shape, time.shape))\n\n                # my version\n                self.memory[ind.view(-1), time.long().view(-1), :] = (\n                    mem_update.squeeze()\n                )\n                return\n\n    def forward(self, inputs):\n        pass\n\n\nclass Memory1D(nn.Module):\n    def __init__(self, length, duration, dim, cfg):\n        super(Memory1D, self).__init__()\n        assert duration == 1\n        self.length = length\n        self.duration = duration\n        self.dim = dim\n        stdv = 1.0 / math.sqrt(dim / 3)\n        self.register_buffer(\n            \"memory\", torch.rand(length, dim).mul_(2 * stdv).add_(-stdv)\n        )\n        self.l2_norm = Normalize(dim=1)\n        self.num_gpus = cfg.NUM_GPUS\n\n    @torch.no_grad()\n    def get(self, ind, time, interp=False):\n        batch_size = ind.size(0)\n 
       if len(ind.shape) == 1:\n            return torch.index_select(self.memory, 0, ind.view(-1)).view(\n                batch_size, self.dim\n            )\n        else:\n            return torch.index_select(self.memory, 0, ind.view(-1)).view(\n                batch_size, -1, self.dim\n            )\n\n    @torch.no_grad()\n    def update(self, mem, momentum, ind, time, interp=False):\n        if self.num_gpus > 1:\n            mem, ind, time = du.all_gather([mem, ind, time])\n        mem = mem.view(mem.size(0), -1)\n        ind, time = ind.long(), time.long()\n\n        mem_old = self.get(ind, time, interp=interp)\n        mem_update = mem_old * (1 - momentum) + mem * momentum\n        mem_update = self.l2_norm(mem_update)\n\n        self.memory.index_copy_(0, ind, mem_update)\n        return\n\n\ndef contrastive_parameter_surgery(model, cfg, epoch_exact, cur_iter):\n    # cancel some gradients in first epoch of SwAV\n    if (\n        cfg.MODEL.MODEL_NAME == \"ContrastiveModel\"\n        and cfg.CONTRASTIVE.TYPE == \"swav\"\n        and epoch_exact <= 1.0\n    ):\n        for name, p in model.named_parameters():\n            if \"swav_prototypes\" in name:\n                p.grad = None\n\n    iters_noupdate = 0\n    if cfg.MODEL.MODEL_NAME == \"ContrastiveModel\" and cfg.CONTRASTIVE.TYPE == \"moco\":\n        assert cfg.CONTRASTIVE.QUEUE_LEN % (cfg.TRAIN.BATCH_SIZE * cfg.NUM_SHARDS) == 0\n        iters_noupdate = (\n            cfg.CONTRASTIVE.QUEUE_LEN // cfg.TRAIN.BATCH_SIZE // cfg.NUM_SHARDS\n        )\n\n    if cur_iter < iters_noupdate and epoch_exact < 1:  #  for e.g. 
MoCo\n        logger.info(\"Not updating parameters {}/{}\".format(cur_iter, iters_noupdate))\n        update_param = False\n    else:\n        update_param = True\n\n    return model, update_param\n\n\ndef contrastive_forward(model, cfg, inputs, index, time, epoch_exact, scaler):\n    if cfg.CONTRASTIVE.SEQUENTIAL:\n        perform_backward = False\n        mdl = getattr(model, \"module\", model)\n        keys = (\n            mdl.compute_key_feat(\n                inputs,\n                compute_predictor_keys=False,\n                batched_inference=True if len(inputs) < 2 else False,\n            )\n            if cfg.CONTRASTIVE.TYPE == \"moco\" or cfg.CONTRASTIVE.TYPE == \"byol\"\n            else [None] * len(inputs)\n        )\n        for k, vid in enumerate(inputs):\n            other_keys = keys[:k] + keys[k + 1 :]\n            time_cur = torch.cat(\n                [\n                    time[:, k : k + 1, :],\n                    time[:, :k, :],\n                    time[:, k + 1 :, :],\n                ],\n                1,\n            )  # q, kpre, kpost\n            vids = [vid]\n            if cfg.CONTRASTIVE.TYPE == \"swav\" or cfg.CONTRASTIVE.TYPE == \"simclr\":\n                if k < len(inputs) - 1:\n                    vids = inputs[k : k + 2]\n                else:\n                    break\n            lgt_k, loss_k = model(vids, index, time_cur, epoch_exact, keys=other_keys)\n            scaler.scale(loss_k).backward()\n            if k == 0:\n                preds, partial_loss = lgt_k, loss_k.detach()\n            else:\n                preds = torch.cat([preds, lgt_k], dim=0)\n                partial_loss += loss_k.detach()\n        partial_loss /= len(inputs) * 2.0  # to have same loss as symm model\n        if cfg.CONTRASTIVE.TYPE == \"moco\":\n            mdl._dequeue_and_enqueue(keys)\n    else:\n        perform_backward = True\n        preds, partial_loss = model(inputs, index, time, epoch_exact, keys=None)\n    return model, 
preds, partial_loss, perform_backward\n"
  },
  {
    "path": "slowfast/models/custom_video_model_builder.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\n\"\"\"A More Flexible Video models.\"\"\"\n"
  },
  {
    "path": "slowfast/models/head_helper.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"ResNe(X)t Head helper.\"\"\"\n\nfrom functools import partial\n\nimport slowfast.utils.logging as logging\nimport torch\nimport torch.nn as nn\nfrom detectron2.layers import ROIAlign\nfrom slowfast.models.attention import MultiScaleBlock\nfrom slowfast.models.batchnorm_helper import (\n    NaiveSyncBatchNorm1d as NaiveSyncBatchNorm1d,\n)\n\nlogger = logging.get_logger(__name__)\n\n\nclass ResNetRoIHead(nn.Module):\n    \"\"\"\n    ResNe(X)t RoI head.\n    \"\"\"\n\n    def __init__(\n        self,\n        dim_in,\n        num_classes,\n        pool_size,\n        resolution,\n        scale_factor,\n        dropout_rate=0.0,\n        act_func=\"softmax\",\n        aligned=True,\n        detach_final_fc=False,\n    ):\n        \"\"\"\n        The `__init__` method of any subclass should also contain these\n            arguments.\n        ResNetRoIHead takes p pathways as input where p in [1, infty].\n\n        Args:\n            dim_in (list): the list of channel dimensions of the p inputs to the\n                ResNetHead.\n            num_classes (int): the channel dimensions of the p outputs to the\n                ResNetHead.\n            pool_size (list): the list of kernel sizes of p spatial temporal\n                poolings, temporal pool kernel size, spatial pool kernel size,\n                spatial pool kernel size in order.\n            resolution (list): the list of spatial output size from the ROIAlign.\n            scale_factor (list): the list of ratio to the input boxes by this\n                number.\n            dropout_rate (float): dropout rate. If equal to 0.0, perform no\n                dropout.\n            act_func (string): activation function to use. 'softmax': applies\n                softmax on the output. 'sigmoid': applies sigmoid on the output.\n            aligned (bool): if False, use the legacy implementation. 
If True,\n                align the results more perfectly.\n            detach_final_fc (bool): if True, detach the final fc layer from the\n                gradient graph. By doing so, only the final fc layer will be\n                trained.\n        Note:\n            Given a continuous coordinate c, its two neighboring pixel indices\n            (in our pixel model) are computed by floor (c - 0.5) and ceil\n            (c - 0.5). For example, c=1.3 has pixel neighbors with discrete\n            indices [0] and [1] (which are sampled from the underlying signal at\n            continuous coordinates 0.5 and 1.5). But the original roi_align\n            (aligned=False) does not subtract the 0.5 when computing neighboring\n            pixel indices and therefore it uses pixels with a slightly incorrect\n            alignment (relative to our pixel model) when performing bilinear\n            interpolation.\n            With `aligned=True`, we first appropriately scale the ROI and then\n            shift it by -0.5 prior to calling roi_align. 
This produces the\n            correct neighbors; It makes negligible differences to the model's\n            performance if ROIAlign is used together with conv layers.\n        \"\"\"\n        super(ResNetRoIHead, self).__init__()\n        assert len({len(pool_size), len(dim_in)}) == 1, (\n            \"pathway dimensions are not consistent.\"\n        )\n        self.num_pathways = len(pool_size)\n        self.detach_final_fc = detach_final_fc\n\n        for pathway in range(self.num_pathways):\n            temporal_pool = nn.AvgPool3d([pool_size[pathway][0], 1, 1], stride=1)\n            self.add_module(\"s{}_tpool\".format(pathway), temporal_pool)\n\n            roi_align = ROIAlign(\n                resolution[pathway],\n                spatial_scale=1.0 / scale_factor[pathway],\n                sampling_ratio=0,\n                aligned=aligned,\n            )\n            self.add_module(\"s{}_roi\".format(pathway), roi_align)\n            spatial_pool = nn.MaxPool2d(resolution[pathway], stride=1)\n            self.add_module(\"s{}_spool\".format(pathway), spatial_pool)\n\n        if dropout_rate > 0.0:\n            self.dropout = nn.Dropout(dropout_rate)\n\n        # Perform FC in a fully convolutional manner. 
The FC layer will be\n        # initialized with a different std compared to convolutional layers.\n        self.projection = nn.Linear(sum(dim_in), num_classes, bias=True)\n\n        # Softmax for evaluation and testing.\n        if act_func == \"softmax\":\n            self.act = nn.Softmax(dim=1)\n        elif act_func == \"sigmoid\":\n            self.act = nn.Sigmoid()\n        else:\n            raise NotImplementedError(\n                \"{} is not supported as an activation function.\".format(act_func)\n            )\n\n    def forward(self, inputs, bboxes):\n        assert len(inputs) == self.num_pathways, (\n            \"Input tensor does not contain {} pathway\".format(self.num_pathways)\n        )\n        pool_out = []\n        for pathway in range(self.num_pathways):\n            t_pool = getattr(self, \"s{}_tpool\".format(pathway))\n            out = t_pool(inputs[pathway])\n            assert out.shape[2] == 1\n            out = torch.squeeze(out, 2)\n\n            roi_align = getattr(self, \"s{}_roi\".format(pathway))\n            out = roi_align(out, bboxes)\n\n            s_pool = getattr(self, \"s{}_spool\".format(pathway))\n            pool_out.append(s_pool(out))\n\n        # B C H W.\n        x = torch.cat(pool_out, 1)\n\n        # Perform dropout.\n        if hasattr(self, \"dropout\"):\n            x = self.dropout(x)\n\n        x = x.view(x.shape[0], -1)\n        if self.detach_final_fc:\n            x = x.detach()\n        x = self.projection(x)\n        x = self.act(x)\n        return x\n\n\nclass MLPHead(nn.Module):\n    def __init__(\n        self,\n        dim_in,\n        dim_out,\n        mlp_dim,\n        num_layers,\n        bn_on=False,\n        bias=True,\n        flatten=False,\n        xavier_init=True,\n        bn_sync_num=1,\n        global_sync=False,\n    ):\n        super(MLPHead, self).__init__()\n        self.flatten = flatten\n        b = False if bn_on else bias\n        # assert bn_on or bn_sync_num=1\n        
mlp_layers = [nn.Linear(dim_in, mlp_dim, bias=b)]\n        mlp_layers[-1].xavier_init = xavier_init\n        for i in range(1, num_layers):\n            if bn_on:\n                if global_sync or bn_sync_num > 1:\n                    mlp_layers.append(\n                        NaiveSyncBatchNorm1d(\n                            num_sync_devices=bn_sync_num,\n                            global_sync=global_sync,\n                            num_features=mlp_dim,\n                        )\n                    )\n                else:\n                    mlp_layers.append(nn.BatchNorm1d(num_features=mlp_dim))\n            mlp_layers.append(nn.ReLU(inplace=True))\n            if i == num_layers - 1:\n                d = dim_out\n                b = bias\n            else:\n                d = mlp_dim\n            mlp_layers.append(nn.Linear(mlp_dim, d, bias=b))\n            mlp_layers[-1].xavier_init = xavier_init\n        self.projection = nn.Sequential(*mlp_layers)\n\n    def forward(self, x):\n        if x.ndim == 5:\n            x = x.permute((0, 2, 3, 4, 1))\n        if self.flatten:\n            x = x.reshape(-1, x.shape[-1])\n\n        return self.projection(x)\n\n\nclass ResNetBasicHead(nn.Module):\n    \"\"\"\n    ResNe(X)t 3D head.\n    This layer performs a fully-connected projection during training, when the\n    input size is 1x1x1. It performs a convolutional projection during testing\n    when the input size is larger than 1x1x1. 
If the inputs are from multiple\n    different pathways, the inputs will be concatenated after pooling.\n    \"\"\"\n\n    def __init__(\n        self,\n        dim_in,\n        num_classes,\n        pool_size,\n        dropout_rate=0.0,\n        act_func=\"softmax\",\n        detach_final_fc=False,\n        cfg=None,\n    ):\n        \"\"\"\n        The `__init__` method of any subclass should also contain these\n            arguments.\n        ResNetBasicHead takes p pathways as input where p in [1, infty].\n\n        Args:\n            dim_in (list): the list of channel dimensions of the p inputs to the\n                ResNetHead.\n            num_classes (int): the channel dimensions of the p outputs to the\n                ResNetHead.\n            pool_size (list): the list of kernel sizes of p spatial temporal\n                poolings, temporal pool kernel size, spatial pool kernel size,\n                spatial pool kernel size in order.\n            dropout_rate (float): dropout rate. If equal to 0.0, perform no\n                dropout.\n            act_func (string): activation function to use. 'softmax': applies\n                softmax on the output. 'sigmoid': applies sigmoid on the output.\n            detach_final_fc (bool): if True, detach the fc layer from the\n                gradient graph. 
By doing so, only the final fc layer will be\n                trained.\n            cfg (struct): The config for the current experiment.\n        \"\"\"\n        super(ResNetBasicHead, self).__init__()\n        assert len({len(pool_size), len(dim_in)}) == 1, (\n            \"pathway dimensions are not consistent.\"\n        )\n        self.num_pathways = len(pool_size)\n        self.detach_final_fc = detach_final_fc\n        self.cfg = cfg\n        self.local_projection_modules = []\n        self.predictors = nn.ModuleList()\n        self.l2norm_feats = False\n\n        for pathway in range(self.num_pathways):\n            if pool_size[pathway] is None:\n                avg_pool = nn.AdaptiveAvgPool3d((1, 1, 1))\n            else:\n                avg_pool = nn.AvgPool3d(pool_size[pathway], stride=1)\n            self.add_module(\"pathway{}_avgpool\".format(pathway), avg_pool)\n\n        if dropout_rate > 0.0:\n            self.dropout = nn.Dropout(dropout_rate)\n        # Perform FC in a fully convolutional manner. 
The FC layer will be\n        # initialized with a different std compared to convolutional layers.\n        if cfg.CONTRASTIVE.NUM_MLP_LAYERS == 1:\n            self.projection = nn.Linear(sum(dim_in), num_classes, bias=True)\n        else:\n            self.projection = MLPHead(\n                sum(dim_in),\n                num_classes,\n                cfg.CONTRASTIVE.MLP_DIM,\n                cfg.CONTRASTIVE.NUM_MLP_LAYERS,\n                bn_on=cfg.CONTRASTIVE.BN_MLP,\n                bn_sync_num=(\n                    cfg.BN.NUM_SYNC_DEVICES if cfg.CONTRASTIVE.BN_SYNC_MLP else 1\n                ),\n                global_sync=(cfg.CONTRASTIVE.BN_SYNC_MLP and cfg.BN.GLOBAL_SYNC),\n            )\n\n        # Softmax for evaluation and testing.\n        if act_func == \"softmax\":\n            self.act = nn.Softmax(dim=4)\n        elif act_func == \"sigmoid\":\n            self.act = nn.Sigmoid()\n        elif act_func == \"none\":\n            self.act = None\n        else:\n            raise NotImplementedError(\n                \"{} is not supported as an activation function.\".format(act_func)\n            )\n\n        if cfg.CONTRASTIVE.PREDICTOR_DEPTHS:\n            d_in = num_classes\n            for n_layers in cfg.CONTRASTIVE.PREDICTOR_DEPTHS:\n                local_mlp = MLPHead(\n                    d_in,\n                    num_classes,\n                    cfg.CONTRASTIVE.MLP_DIM,\n                    n_layers,\n                    bn_on=cfg.CONTRASTIVE.BN_MLP,\n                    flatten=False,\n                    bn_sync_num=(\n                        cfg.BN.NUM_SYNC_DEVICES if cfg.CONTRASTIVE.BN_SYNC_MLP else 1\n                    ),\n                    global_sync=(cfg.CONTRASTIVE.BN_SYNC_MLP and cfg.BN.GLOBAL_SYNC),\n                )\n                self.predictors.append(local_mlp)\n\n    def forward(self, inputs):\n        assert len(inputs) == self.num_pathways, (\n            \"Input tensor does not contain {} 
pathway\".format(self.num_pathways)\n        )\n        pool_out = []\n        for pathway in range(self.num_pathways):\n            m = getattr(self, \"pathway{}_avgpool\".format(pathway))\n            pool_out.append(m(inputs[pathway]))\n        x = torch.cat(pool_out, 1)\n        # (N, C, T, H, W) -> (N, T, H, W, C).\n        x = x.permute((0, 2, 3, 4, 1))\n        # Perform dropout.\n        if hasattr(self, \"dropout\"):\n            x = self.dropout(x)\n        if self.detach_final_fc:\n            x = x.detach()\n        if self.l2norm_feats:\n            x = nn.functional.normalize(x, dim=1, p=2)\n\n        if (\n            x.shape[1:4] == torch.Size([1, 1, 1])\n            and self.cfg.MODEL.MODEL_NAME == \"ContrastiveModel\"\n        ):\n            x = x.view(x.shape[0], -1)\n\n        x_proj = self.projection(x)\n\n        time_projs = []\n        if self.predictors:\n            x_in = x_proj\n            for proj in self.predictors:\n                time_projs.append(proj(x_in))\n\n        if not self.training:\n            if self.act is not None:\n                x_proj = self.act(x_proj)\n            # Performs fully convolutional inference.\n            if x_proj.ndim == 5 and x_proj.shape[1:4] > torch.Size([1, 1, 1]):\n                x_proj = x_proj.mean([1, 2, 3])\n\n        x_proj = x_proj.view(x_proj.shape[0], -1)\n\n        if time_projs:\n            return [x_proj] + time_projs\n        else:\n            return x_proj\n\n\nclass X3DHead(nn.Module):\n    \"\"\"\n    X3D head.\n    This layer performs a fully-connected projection during training, when the\n    input size is 1x1x1. It performs a convolutional projection during testing\n    when the input size is larger than 1x1x1. 
If the inputs are from multiple\n    different pathways, the inputs will be concatenated after pooling.\n    \"\"\"\n\n    def __init__(\n        self,\n        dim_in,\n        dim_inner,\n        dim_out,\n        num_classes,\n        pool_size,\n        dropout_rate=0.0,\n        act_func=\"softmax\",\n        inplace_relu=True,\n        eps=1e-5,\n        bn_mmt=0.1,\n        norm_module=nn.BatchNorm3d,\n        bn_lin5_on=False,\n    ):\n        \"\"\"\n        The `__init__` method of any subclass should also contain these\n            arguments.\n        X3DHead takes a 5-dim feature tensor (BxCxTxHxW) as input.\n\n        Args:\n            dim_in (int): the channel dimension C of the input.\n            dim_inner (int): the channel dimension of the conv_5 output.\n            dim_out (int): the channel dimension of the lin_5 output, i.e. the\n                features fed to the classifier.\n            num_classes (int): the channel dimensions of the output.\n            pool_size (list): a single entry list of kernel size for\n                spatiotemporal pooling for the TxHxW dimensions.\n            dropout_rate (float): dropout rate. If equal to 0.0, perform no\n                dropout.\n            act_func (string): activation function to use. 'softmax': applies\n                softmax on the output. 'sigmoid': applies sigmoid on the output.\n            inplace_relu (bool): if True, calculate the relu on the original\n                input without allocating new memory.\n            eps (float): epsilon for batch norm.\n            bn_mmt (float): momentum for batch norm. Note that BN momentum in\n                PyTorch = 1 - BN momentum in Caffe2.\n            norm_module (nn.Module): nn.Module for the normalization layer. 
The\n                default is nn.BatchNorm3d.\n            bn_lin5_on (bool): if True, perform normalization on the features\n                before the classifier.\n        \"\"\"\n        super(X3DHead, self).__init__()\n        self.pool_size = pool_size\n        self.dropout_rate = dropout_rate\n        self.num_classes = num_classes\n        self.act_func = act_func\n        self.eps = eps\n        self.bn_mmt = bn_mmt\n        self.inplace_relu = inplace_relu\n        self.bn_lin5_on = bn_lin5_on\n        self._construct_head(dim_in, dim_inner, dim_out, norm_module)\n\n    def _construct_head(self, dim_in, dim_inner, dim_out, norm_module):\n        self.conv_5 = nn.Conv3d(\n            dim_in,\n            dim_inner,\n            kernel_size=(1, 1, 1),\n            stride=(1, 1, 1),\n            padding=(0, 0, 0),\n            bias=False,\n        )\n        self.conv_5_bn = norm_module(\n            num_features=dim_inner, eps=self.eps, momentum=self.bn_mmt\n        )\n        self.conv_5_relu = nn.ReLU(self.inplace_relu)\n\n        if self.pool_size is None:\n            self.avg_pool = nn.AdaptiveAvgPool3d((1, 1, 1))\n        else:\n            self.avg_pool = nn.AvgPool3d(self.pool_size, stride=1)\n\n        self.lin_5 = nn.Conv3d(\n            dim_inner,\n            dim_out,\n            kernel_size=(1, 1, 1),\n            stride=(1, 1, 1),\n            padding=(0, 0, 0),\n            bias=False,\n        )\n        if self.bn_lin5_on:\n            self.lin_5_bn = norm_module(\n                num_features=dim_out, eps=self.eps, momentum=self.bn_mmt\n            )\n        self.lin_5_relu = nn.ReLU(self.inplace_relu)\n\n        if self.dropout_rate > 0.0:\n            self.dropout = nn.Dropout(self.dropout_rate)\n        # Perform FC in a fully convolutional manner. 
The FC layer will be\n        # initialized with a different std compared to convolutional layers.\n        self.projection = nn.Linear(dim_out, self.num_classes, bias=True)\n\n        # Softmax for evaluation and testing.\n        if self.act_func == \"softmax\":\n            self.act = nn.Softmax(dim=4)\n        elif self.act_func == \"sigmoid\":\n            self.act = nn.Sigmoid()\n        else:\n            raise NotImplementedError(\n                \"{} is not supported as an activation function.\".format(self.act_func)\n            )\n\n    def forward(self, inputs):\n        # In its current design the X3D head is only usable for a single\n        # pathway input.\n        assert len(inputs) == 1, \"Input tensor does not contain 1 pathway\"\n        x = self.conv_5(inputs[0])\n        x = self.conv_5_bn(x)\n        x = self.conv_5_relu(x)\n        x = self.avg_pool(x)\n\n        x = self.lin_5(x)\n        if self.bn_lin5_on:\n            x = self.lin_5_bn(x)\n        x = self.lin_5_relu(x)\n\n        # (N, C, T, H, W) -> (N, T, H, W, C).\n        x = x.permute((0, 2, 3, 4, 1))\n        # Perform dropout.\n        if hasattr(self, \"dropout\"):\n            x = self.dropout(x)\n        x = self.projection(x)\n\n        # Performs fully convolutional inference.\n        if not self.training:\n            x = self.act(x)\n            x = x.mean([1, 2, 3])\n\n        x = x.view(x.shape[0], -1)\n        return x\n\n\nclass TransformerBasicHead(nn.Module):\n    \"\"\"\n    BasicHead. 
No pool.\n    \"\"\"\n\n    def __init__(\n        self,\n        dim_in,\n        num_classes,\n        dropout_rate=0.0,\n        act_func=\"softmax\",\n        cfg=None,\n    ):\n        \"\"\"\n        Perform linear projection and activation as head for transformers.\n        Args:\n            dim_in (int): the channel dimension of the input to the head.\n            num_classes (int): the channel dimensions of the output to the head.\n            dropout_rate (float): dropout rate. If equal to 0.0, perform no\n                dropout.\n            act_func (string): activation function to use. 'softmax': applies\n                softmax on the output. 'sigmoid': applies sigmoid on the output.\n                'none': applies no activation.\n            cfg (struct): The config for the current experiment.\n        \"\"\"\n        super(TransformerBasicHead, self).__init__()\n        if dropout_rate > 0.0:\n            self.dropout = nn.Dropout(dropout_rate)\n        self.projection = nn.Linear(dim_in, num_classes, bias=True)\n\n        if cfg.CONTRASTIVE.NUM_MLP_LAYERS == 1:\n            self.projection = nn.Linear(dim_in, num_classes, bias=True)\n        else:\n            self.projection = MLPHead(\n                dim_in,\n                num_classes,\n                cfg.CONTRASTIVE.MLP_DIM,\n                cfg.CONTRASTIVE.NUM_MLP_LAYERS,\n                bn_on=cfg.CONTRASTIVE.BN_MLP,\n                bn_sync_num=(\n                    cfg.BN.NUM_SYNC_DEVICES if cfg.CONTRASTIVE.BN_SYNC_MLP else 1\n                ),\n                global_sync=(cfg.CONTRASTIVE.BN_SYNC_MLP and cfg.BN.GLOBAL_SYNC),\n            )\n        self.detach_final_fc = cfg.MODEL.DETACH_FINAL_FC\n\n        # Softmax for evaluation and testing.\n        if act_func == \"softmax\":\n            self.act = nn.Softmax(dim=1)\n        elif act_func == \"sigmoid\":\n            self.act = nn.Sigmoid()\n        elif act_func == \"none\":\n            self.act = None\n        else:\n            raise NotImplementedError(\n                \"{} is not supported as an 
activation function.\".format(act_func)\n            )\n\n    def forward(self, x):\n        if hasattr(self, \"dropout\"):\n            x = self.dropout(x)\n        if self.detach_final_fc:\n            x = x.detach()\n        x = self.projection(x)\n\n        if not self.training:\n            if self.act is not None:\n                x = self.act(x)\n            # Performs fully convolutional inference.\n            if x.ndim == 5 and x.shape[1:4] > torch.Size([1, 1, 1]):\n                x = x.mean([1, 2, 3])\n\n        x = x.view(x.shape[0], -1)\n\n        return x\n\n\nclass MSSeparateHead(nn.Module):\n    \"\"\"\n    Perform linear projection or Transformer-based decoder (optionally MultiScale)\n    for mask prediction models.\n    Args:\n        blocks (list of MultiScaleBlock): the encoder blocks to provide input\n            dimensions of the head.\n        cfg (struct): The config for the current experiment.\n        num_classes (list): the dimensions of the prediction targets (e.g. HOG or\n            pixels), one entry per pretrain depth.\n        feat_sz (list): the spatiotemporal sizes of the input features.\n    \"\"\"\n\n    def __init__(\n        self,\n        blocks,\n        cfg,\n        num_classes,\n        feat_sz,\n    ):\n        super(MSSeparateHead, self).__init__()\n        head_type = cfg.MASK.HEAD_TYPE.split(\"_\")\n        assert head_type[0] == \"separate\"\n        if len(head_type) > 1:\n            transform_type = head_type[1]\n            assert transform_type in [\"xformer\"]\n        else:\n            transform_type = None\n\n        depth_list = cfg.MASK.PRETRAIN_DEPTH\n        mlp_ratio = cfg.MVIT.MLP_RATIO\n        qkv_bias = cfg.MVIT.QKV_BIAS\n        drop_rate = cfg.MVIT.DROPOUT_RATE\n        kernel_kv = cfg.MASK.DEC_KV_KERNEL\n        stride_kv = cfg.MASK.DEC_KV_STRIDE\n        mode = cfg.MVIT.MODE\n        self.cls_embed_on = cfg.MVIT.CLS_EMBED_ON\n        pool_first = cfg.MVIT.POOL_FIRST\n        if cfg.MVIT.NORM == \"layernorm\":\n            norm_layer = partial(nn.LayerNorm, eps=1e-6)\n        else:\n            raise 
NotImplementedError(\"Only supports layernorm.\")\n\n        self.transforms = nn.ModuleList()\n        self.projections = nn.ModuleList()\n        for depth, num_class, feature_size in zip(depth_list, num_classes, feat_sz):\n            head_dim = (\n                cfg.MASK.DECODER_EMBED_DIM if cfg.MASK.MAE_ON else blocks[depth].dim_out\n            )\n            op = []\n            if transform_type == \"xformer\":\n                assert cfg.MASK.DECODER_DEPTH > 0\n                for _ in range(cfg.MASK.DECODER_DEPTH):\n                    dim_out = cfg.MASK.DECODER_EMBED_DIM\n                    op.append(\n                        MultiScaleBlock(\n                            dim=head_dim,\n                            dim_out=dim_out,\n                            input_size=feature_size,\n                            num_heads=dim_out // 64,\n                            mlp_ratio=mlp_ratio,\n                            qkv_bias=qkv_bias,\n                            drop_rate=drop_rate,\n                            drop_path=0.0,\n                            norm_layer=norm_layer,\n                            kernel_q=[],\n                            kernel_kv=kernel_kv,\n                            stride_q=[],\n                            stride_kv=stride_kv,\n                            mode=mode,\n                            has_cls_embed=self.cls_embed_on,\n                            pool_first=pool_first,\n                        )\n                    )\n                    head_dim = dim_out\n\n            op.append(norm_layer(head_dim))\n            self.transforms.append(nn.Sequential(*op))\n            self.projections.append(nn.Linear(head_dim, num_class, bias=True))\n\n        self.apply(self._init_weights)\n\n    def _init_weights(self, m):\n        if isinstance(m, nn.Linear):\n            nn.init.trunc_normal_(m.weight, std=0.02)\n            # FYI: MAE uses xavier_uniform following official JAX ViT:\n            # 
torch.nn.init.xavier_uniform_(m.weight)\n            if m.bias is not None:\n                nn.init.constant_(m.bias, 0)\n        elif isinstance(m, nn.LayerNorm):\n            nn.init.constant_(m.bias, 0)\n            nn.init.constant_(m.weight, 1.0)\n\n    def forward(self, block_outputs, output_masks, return_all, thw):\n        model_outputs = []\n        for idx, x in enumerate(block_outputs):\n            for blk in self.transforms[idx]:\n                if isinstance(blk, MultiScaleBlock):\n                    x, thw = blk(x, thw)\n                else:\n                    x = blk(x)\n\n            if self.cls_embed_on:\n                x = x[:, 1:]\n            if not return_all:\n                mask = output_masks[idx]\n                x = x[mask]\n            x = self.projections[idx](x)\n            model_outputs.append(x)\n        return model_outputs\n"
  },
  {
    "path": "slowfast/models/losses.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Loss functions.\"\"\"\n\nfrom functools import partial\n\nimport torch\nimport torch.nn as nn\nfrom pytorchvideo.losses.soft_target_cross_entropy import SoftTargetCrossEntropyLoss\n\n\nclass ContrastiveLoss(nn.Module):\n    def __init__(self, reduction=\"mean\"):\n        super(ContrastiveLoss, self).__init__()\n        self.reduction = reduction\n\n    def forward(self, inputs, dummy_labels=None):\n        # The positive sample is always at index 0, so every target is class 0.\n        # Create the targets on the same device as the inputs.\n        targets = torch.zeros(inputs.shape[0], dtype=torch.long, device=inputs.device)\n        loss = nn.CrossEntropyLoss(reduction=self.reduction)(inputs, targets)\n        return loss\n\n\nclass MultipleMSELoss(nn.Module):\n    \"\"\"\n    Compute multiple MSE losses and return their weighted sum together with\n    the list of individual losses.\n    \"\"\"\n\n    def __init__(self, reduction=\"mean\"):\n        \"\"\"\n        Args:\n            reduction (str): specifies reduction to apply to the output. It can be\n                \"mean\" (default) or \"none\".\n        \"\"\"\n        super(MultipleMSELoss, self).__init__()\n        self.mse_func = nn.MSELoss(reduction=reduction)\n\n    def forward(self, x, y):\n        loss_sum = 0.0\n        multi_loss = []\n        for xt, yt in zip(x, y):\n            if isinstance(yt, (tuple,)):\n                if len(yt) == 2:\n                    yt, wt = yt\n                    lt = \"mse\"\n                elif len(yt) == 3:\n                    yt, wt, lt = yt\n                else:\n                    raise NotImplementedError\n            else:\n                wt, lt = 1.0, \"mse\"\n            if lt == \"mse\":\n                loss = self.mse_func(xt, yt)\n            else:\n                raise NotImplementedError\n            loss_sum += loss * wt\n            multi_loss.append(loss)\n        return loss_sum, multi_loss\n\n\n_LOSSES = {\n    \"cross_entropy\": nn.CrossEntropyLoss,\n    \"bce\": nn.BCELoss,\n    \"bce_logit\": nn.BCEWithLogitsLoss,\n    
\"soft_cross_entropy\": partial(SoftTargetCrossEntropyLoss, normalize_targets=False),\n    \"contrastive_loss\": ContrastiveLoss,\n    \"mse\": nn.MSELoss,\n    \"multi_mse\": MultipleMSELoss,\n}\n\n\ndef get_loss_func(loss_name):\n    \"\"\"\n    Retrieve the loss given the loss name.\n    Args:\n        loss_name (str): the name of the loss to use.\n    \"\"\"\n    if loss_name not in _LOSSES.keys():\n        raise NotImplementedError(\"Loss {} is not supported\".format(loss_name))\n    return _LOSSES[loss_name]\n"
  },
  {
    "path": "slowfast/models/masked.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport math\nfrom functools import partial\n\nimport slowfast.utils.logging as logging\nimport slowfast.utils.misc as misc\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom slowfast.models import head_helper\nfrom slowfast.models.utils import calc_mvit_feature_geometry\nfrom slowfast.models.video_model_builder import MViT\nfrom torch.nn.init import trunc_normal_\n\nfrom . import head_helper, operators, resnet_helper, stem_helper  # noqa\nfrom .build import MODEL_REGISTRY\n\nlogger = logging.get_logger(__name__)\n\n\n@MODEL_REGISTRY.register()\nclass MaskMViT(MViT):\n    def __init__(self, cfg):\n        super().__init__(cfg)\n        self.pretrain_depth = cfg.MASK.PRETRAIN_DEPTH\n        if self.pretrain_depth[-1] + 1 < cfg.MVIT.DEPTH:\n            del self.blocks[self.pretrain_depth[-1] + 1 :]\n        del self.norm\n        del self.head\n        self.feat_size, self.feat_stride = calc_mvit_feature_geometry(cfg)\n\n        self.head_type = cfg.MASK.HEAD_TYPE.split(\"_\")\n        feat_sz = [self.feat_size[depth] for depth in self.pretrain_depth]\n        if self.head_type[0] == \"separate\":\n            if not cfg.MASK.PRED_HOG:\n                pred_t_sz = (\n                    1 if self.cfg.MASK.TIME_STRIDE_LOSS else self.patch_stride[0]\n                )\n                num_classes = [\n                    pred_t_sz * (self.feat_stride[depth][-1] ** 2) * 3\n                    for depth in self.pretrain_depth\n                ]\n                self.pred_head = head_helper.MSSeparateHead(\n                    self.blocks, cfg, num_classes, feat_sz\n                )\n            else:\n                self.hogs = nn.ModuleList()\n                self.nbins = 9\n                self.cell_sz = 8\n                self.hogs.append(\n                    operators.HOGLayerC(\n                        nbins=self.nbins,\n      
                  pool=self.cell_sz,\n                    )\n                )\n                self.ncells = [\n                    (self.feat_stride[depth][-1] // self.cell_sz) ** 2\n                    for depth in self.pretrain_depth\n                ]\n                pred_hog_classes = [self.nbins * ncell for ncell in self.ncells]\n                pred_hog_classes = [\n                    pred_hog_class * 3  # 3 color channels\n                    for pred_hog_class in pred_hog_classes\n                ]\n                self.pred_head = head_helper.MSSeparateHead(\n                    self.blocks, cfg, pred_hog_classes, feat_sz\n                )\n                self.hog_loss = \"mse\"\n        else:\n            raise NotImplementedError\n\n        embed_dim = cfg.MVIT.EMBED_DIM\n        decoder_embed_dim = cfg.MASK.DECODER_EMBED_DIM\n        self.sep_pos_embed_decoder = cfg.MASK.DECODER_SEP_POS_EMBED\n        self.counter = 0\n        if cfg.MASK.MAE_ON:\n            # ----------------------------------------------------------------\n            # MAE decoder specifics\n            norm_layer = partial(nn.LayerNorm, eps=1e-6)\n            dim_in = self.blocks[-1].dim_out\n            self.norm = norm_layer(dim_in)\n            self.decoder_embed = nn.Linear(dim_in, decoder_embed_dim, bias=True)\n            num_patches = math.prod(self.patch_dims)\n            if self.use_abs_pos:\n                if self.sep_pos_embed_decoder:\n                    self.dec_pos_embed_spatial = nn.Parameter(\n                        torch.zeros(\n                            1,\n                            self.patch_dims[1] * self.patch_dims[2],\n                            decoder_embed_dim,\n                        )\n                    )\n                    self.dec_pos_embed_temporal = nn.Parameter(\n                        torch.zeros(1, self.patch_dims[0], decoder_embed_dim)\n                    )\n                    if self.cls_embed_on:\n                        
self.dec_pos_embed_class = nn.Parameter(\n                            torch.zeros(1, 1, decoder_embed_dim)\n                        )\n                else:\n                    self.decoder_pos_embed = nn.Parameter(\n                        torch.zeros(\n                            1,\n                            num_patches + 1 if self.cls_embed_on else num_patches,\n                            decoder_embed_dim,\n                        )\n                    )\n        self.mask_token = nn.Parameter(\n            torch.zeros(1, 1, decoder_embed_dim if cfg.MASK.MAE_ON else embed_dim)\n        )\n        trunc_normal_(self.mask_token, std=0.02)\n        if self.use_abs_pos and cfg.MASK.MAE_ON:\n            if self.sep_pos_embed_decoder:\n                trunc_normal_(self.dec_pos_embed_spatial, std=0.02)\n                trunc_normal_(self.dec_pos_embed_temporal, std=0.02)\n                if self.cls_embed_on:\n                    trunc_normal_(self.dec_pos_embed_class, std=0.02)\n            else:\n                trunc_normal_(self.decoder_pos_embed, std=0.02)\n\n        if cfg.MASK.SCALE_INIT_BY_DEPTH:\n            self.fix_init_weight()\n\n        self.pred_pixel_wt = 0.0 if cfg.MASK.PRED_HOG else 1.0\n        self.pred_hog_wt = 1.0 if cfg.MASK.PRED_HOG else 0.0\n\n    @torch.jit.ignore\n    def no_weight_decay(self):\n        names = []\n        if self.cfg.MVIT.ZERO_DECAY_POS_CLS:\n            if self.use_abs_pos:\n                if self.sep_pos_embed_decoder:\n                    names.extend(\n                        [\n                            \"dec_pos_embed_spatial\",\n                            \"dec_pos_embed_temporal\",\n                            \"dec_pos_embed_class\",\n                        ]\n                    )\n                else:\n                    # Must match the parameter name `decoder_pos_embed` above.\n                    names.extend([\"decoder_pos_embed\"])\n            if self.cls_embed_on:\n                names.append(\"cls_token\")\n\n        return names\n\n    def fix_init_weight(self):\n     
   def rescale(param, layer_id):\n            param.div_(math.sqrt(2.0 * layer_id))\n\n        for layer_id, layer in enumerate(self.blocks):\n            rescale(layer.attn.proj.weight.data, layer_id + 1)\n            rescale(layer.mlp.fc2.weight.data, layer_id + 1)\n        for trans in self.pred_head.transforms:\n            for layer_id, layer in enumerate(trans):\n                if hasattr(layer, \"attn\"):\n                    rescale(\n                        layer.attn.proj.weight.data,\n                        layer_id + 1 + len(self.blocks),\n                    )  # or + len(self.blocks)\n                    rescale(layer.mlp.fc2.weight.data, layer_id + 1)\n\n    def _get_multiscale_mask(self, mask):\n        if self.use_2d_patch:\n            mask = mask.unsqueeze(0)\n        output_masks = []\n        for depth in self.pretrain_depth:\n            size = self.feat_size[depth][-1]\n            output_mask = F.interpolate(mask, size=size)\n            if self.use_2d_patch:\n                output_mask = output_mask[0]\n            output_mask = output_mask.flatten(1).to(torch.bool)\n            output_masks.append(output_mask)\n        return output_masks\n\n    def _patchify(self, imgs, p=16, time_stride_loss=True):\n        N, _, T, H, W = imgs.shape\n        u = 1 if time_stride_loss else self.patch_stride[0]\n        assert H == W and H % p == 0 and T % u == 0\n        h = w = H // p\n        t = T // u\n        x = imgs.reshape(shape=(N, 3, t, u, h, p, w, p))\n        x = torch.einsum(\"nctuhpwq->nthwupqc\", x)\n        x = x.reshape(shape=(N, t * h * w, u * p**2 * 3))\n        self.patch_info = (N, T, H, W, p, u, t, h, w)\n        return x\n\n    def _unpatchify(self, x):\n        N, T, H, W, p, u, t, h, w = self.patch_info\n        x = x.reshape(shape=(N, t, h, w, u, p, p, 3))\n        x = torch.einsum(\"nthwupqc->nctuhpwq\", x)\n        imgs = x.reshape(shape=(N, 3, T, H, W))\n        return imgs\n\n    def _get_pixel_label_2d(self, input_img, 
output_masks, norm=True):\n        input_img = input_img.permute(0, 2, 3, 1)\n        labels = []\n        for depth, output_mask in zip(self.pretrain_depth, output_masks):\n            size = self.feat_stride[depth][-1]\n            label = input_img.unfold(1, size, size).unfold(2, size, size)\n            label = label.flatten(1, 2).flatten(2)\n            label = label[output_mask]\n            if norm:\n                mean = label.mean(dim=-1, keepdim=True)\n                var = label.var(dim=-1, keepdim=True)\n                label = (label - mean) / (var + 1.0e-6) ** 0.5\n            labels.append(label)\n        return labels\n\n    def _get_pixel_label_3d(\n        self, input_frames, output_masks, time_stride_loss=True, norm=True\n    ):\n        if time_stride_loss:\n            input_frames = input_frames[:, :, :: self.cfg.MVIT.PATCH_STRIDE[0], :, :]\n        imgs = input_frames\n        input_frames = input_frames.permute(0, 2, 3, 4, 1)\n        labels = []\n        for depth, output_mask in zip(self.pretrain_depth, output_masks):\n            size = self.feat_stride[depth][-1]\n            label = self._patchify(imgs, p=size, time_stride_loss=time_stride_loss)\n            label = label[output_mask]\n\n            if norm:  # self.norm_pix_loss:\n                mean = label.mean(dim=-1, keepdim=True)\n                var = label.var(dim=-1, keepdim=True)\n                label = (label - mean) / (var + 1.0e-6) ** 0.5\n            labels.append((label, self.pred_pixel_wt / len(self.pretrain_depth)))\n        return labels\n\n    def _get_hog_label_2d(self, input_frames, output_masks):\n        # input_frames, B C H W\n        labels = []\n        for depth, output_mask in zip(self.pretrain_depth, output_masks):\n            feat_size = self.feat_size[depth][-1]\n            hog_list = []\n            for hog in self.hogs:\n                tmp_hog = hog(input_frames).flatten(1, 2)  # return B C H W\n                unfold_size = tmp_hog.shape[-1] // 
feat_size\n                tmp_hog = (\n                    tmp_hog.permute(0, 2, 3, 1)\n                    .unfold(1, unfold_size, unfold_size)\n                    .unfold(2, unfold_size, unfold_size)\n                    .flatten(1, 2)\n                    .flatten(2)\n                )\n                tmp_hog = tmp_hog[output_mask]\n                hog_list.append(tmp_hog)\n            all_tlabel = torch.cat(hog_list, -1)\n            labels.append((all_tlabel, self.pred_hog_wt, self.hog_loss))\n        return labels\n\n    def _get_hog_label_3d(self, input_frames, output_masks):\n        input_frames = input_frames[\n            :, :, :: self.cfg.MVIT.PATCH_STRIDE[0], :, :\n        ]  # B C T H W\n        input_frames = input_frames.transpose(1, 2)  # B T C H W\n        B, T = input_frames.shape[:2]\n        input_frames = input_frames.flatten(0, 1)  # BT C H W\n        labels = []\n        for depth, output_mask in zip(self.pretrain_depth, output_masks):\n            feat_size = self.feat_size[depth][-1]\n            hog_list = []\n            for hog in self.hogs:\n                tmp_hog = hog(input_frames).flatten(1, 2)  # BT C H W\n                unfold_size = tmp_hog.shape[-1] // feat_size\n                tmp_hog = (\n                    tmp_hog.permute(0, 2, 3, 1)\n                    .unfold(1, unfold_size, unfold_size)\n                    .unfold(2, unfold_size, unfold_size)\n                )  # BT h w C wh ww\n                tmp_hog = tmp_hog.flatten(3).view(\n                    B, T, feat_size, feat_size, -1\n                )  # B T h w C (3 nbins h w)\n                tmp_hog = tmp_hog.flatten(1, 3)  # B N C\n                tmp_hog = tmp_hog[output_mask]\n                hog_list.append(tmp_hog)\n            all_tlabel = torch.cat(hog_list, -1)\n            labels.append((all_tlabel, self.pred_hog_wt, self.hog_loss))\n        return labels\n\n    def _mae_random_masking(self, x, mask_ratio, mask_in=None):\n        \"\"\"\n        Perform 
per-sample random masking by per-sample shuffling.\n        Per-sample shuffling is done by argsort random noise.\n        x: [N, L, D], sequence\n        \"\"\"\n        N, L, D = x.shape  # batch, length, dim\n        if mask_in is None:\n            if self.cfg.AUG.MASK_TUBE:\n                noise = (\n                    torch.rand(N, 1, self.H * self.W, device=x.device)\n                    .repeat([1, self.T, 1])\n                    .reshape(N, L)\n                )  # noise in [0, 1]\n            else:\n                noise = torch.rand(N, L, device=x.device)  # noise in [0, 1]\n        else:\n            noise = mask_in.flatten(1)\n            mask_ratio = sum(noise.flatten()) / noise.numel()  # alrdy masked\n        len_keep = int(L * (1 - mask_ratio))\n        assert len_keep > 1\n        # sort noise for each sample\n        ids_shuffle = torch.argsort(\n            noise, dim=1\n        )  # ascend: small is keep, large is remove\n        ids_restore = torch.argsort(ids_shuffle, dim=1)\n        # keep the first subset\n        ids_keep = ids_shuffle[:, :len_keep]\n        x_masked = torch.gather(x, dim=1, index=ids_keep.unsqueeze(-1).repeat(1, 1, D))\n        # generate the binary mask: 0 is keep, 1 is remove\n        mask = torch.ones([N, L], device=x.device)\n        mask[:, :len_keep] = 0\n        # unshuffle to get the binary mask\n        mask = torch.gather(mask, dim=1, index=ids_restore)\n        return x_masked, mask, ids_restore, ids_keep\n\n    def _mae_forward_encoder(self, x, mask_ratio, mask=None):\n        x, bcthw = self.patch_embed(x, keep_spatial=False)\n        bcthw = list(bcthw)\n        if len(bcthw) == 4:  # Fix bcthw in case of 4D tensor\n            bcthw.insert(2, torch.tensor(self.T))\n        T, H, W = bcthw[-3], bcthw[-2], bcthw[-1]\n        assert len(bcthw) == 5 and (T, H, W) == (self.T, self.H, self.W), bcthw\n        s = 1 if self.cls_embed_on else 0\n        B, N, C = x.shape\n\n        if self.use_fixed_sincos_pos:\n 
           x += self.pos_embed[:, s:, :]  # 0: no cls token\n\n        if self.cfg.MASK.PER_FRAME_MASKING:\n            x = x.reshape([B * T, H * W, C])\n        x, mask, ids_restore, ids_keep = self._mae_random_masking(\n            x, mask_ratio, None if self.cfg.MASK.MAE_RND_MASK else mask\n        )\n        if self.cfg.MASK.PER_FRAME_MASKING:\n            x = x.view([B, -1, C])\n\n        if self.cls_embed_on:\n            # append cls token\n            cls_token = self.cls_token  #\n            if self.use_fixed_sincos_pos:\n                cls_token = cls_token + self.pos_embed[:, :s, :]\n            cls_tokens = cls_token.expand(x.shape[0], -1, -1)\n            x = torch.cat((cls_tokens, x), dim=1)\n\n        if self.use_abs_pos and not self.use_fixed_sincos_pos:\n            if self.sep_pos_embed:\n                pos_embed = self.pos_embed_spatial.repeat(\n                    1, self.patch_dims[0], 1\n                ) + torch.repeat_interleave(\n                    self.pos_embed_temporal,\n                    self.patch_dims[1] * self.patch_dims[2],\n                    dim=1,\n                )\n                pos_embed = pos_embed.expand(x.shape[0], -1, -1)\n                pos_embed = torch.gather(\n                    pos_embed,\n                    dim=1,\n                    index=ids_keep.unsqueeze(-1).repeat(1, 1, pos_embed.shape[2]),\n                )\n                if self.cls_embed_on:\n                    pos_embed = torch.cat(\n                        [\n                            self.pos_embed_class.expand(pos_embed.shape[0], -1, -1),\n                            pos_embed,\n                        ],\n                        1,\n                    )\n                x += pos_embed\n            else:\n                pos_embed = self.pos_embed.expand(x.shape[0], -1, -1)\n                pos_embed_sampled = torch.gather(\n                    pos_embed[:, s:, :],\n                    dim=1,\n                    
index=ids_keep.unsqueeze(-1).repeat(1, 1, self.pos_embed.shape[2]),\n                )\n                if self.cls_embed_on:\n                    pos_embed_sampled = torch.cat(\n                        [pos_embed[:, :s, :], pos_embed_sampled], 1\n                    )\n                x += pos_embed_sampled\n\n        # apply Transformer blocks\n        B, N, C = x.shape\n        thw = [T, H, W]\n        for _, blk in enumerate(self.blocks):\n            x, thw = blk(x, thw)\n        x = self.norm(x)\n\n        return x, mask, ids_restore, thw\n\n    def _mae_forward_decoder(self, x, ids_restore, mask, thw):\n        # embed tokens\n        x = self.decoder_embed(x)\n        T, H, W = self.T, self.H, self.W\n        B, N, C = x.shape\n\n        s = 1 if self.cls_embed_on else 0\n\n        # append mask tokens to sequence\n        mask_tokens = self.mask_token.repeat(\n            B, T * H * W + s - x.shape[1], 1\n        )  # + s: no cls token\n        x_ = torch.cat([x[:, s:, :], mask_tokens], dim=1)  # no cls token\n        if self.cfg.MASK.PER_FRAME_MASKING:\n            x_ = x_.view([B * T, H * W, C])\n        x_ = torch.gather(\n            x_, dim=1, index=ids_restore.unsqueeze(-1).repeat(1, 1, x_.shape[2])\n        )  # unshuffle\n        if self.cfg.MASK.PER_FRAME_MASKING:\n            x_ = x_.view([B, T * H * W, C])\n        x = torch.cat([x[:, :s, :], x_], dim=1)  # append cls token\n\n        if self.sep_pos_embed_decoder:\n            pos_embed = self.dec_pos_embed_spatial.repeat(\n                1, self.patch_dims[0], 1\n            ) + torch.repeat_interleave(\n                self.dec_pos_embed_temporal,\n                self.patch_dims[1] * self.patch_dims[2],\n                dim=1,\n            )\n            pos_embed = pos_embed.expand(x.shape[0], -1, -1)\n            if self.cls_embed_on:\n                pos_embed = torch.cat(\n                    [\n                        self.dec_pos_embed_class.expand(pos_embed.shape[0], -1, -1),\n       
                 pos_embed,\n                    ],\n                    1,\n                )\n            x += pos_embed\n        else:\n            # add pos embed\n            x = x + self.decoder_pos_embed\n\n        pixel_outputs = self.pred_head(\n            [x],\n            [mask.to(torch.bool)],\n            return_all=self.cfg.VIS_MASK.ENABLE,\n            thw=thw,\n        )\n\n        return pixel_outputs\n\n    def _mae_forward(self, imgs, mask_ratio=0.75, mask=None):\n        latent, mask, ids_restore, thw = self._mae_forward_encoder(\n            imgs, mask_ratio, mask\n        )\n        pred = self._mae_forward_decoder(latent, ids_restore, mask, thw)\n        labels = []\n        if self.pred_pixel_wt:\n            if self.use_2d_patch:\n                labels += self._get_pixel_label_2d(\n                    imgs.detach(),\n                    [mask.to(torch.bool)],\n                    norm=self.cfg.MASK.NORM_PRED_PIXEL,\n                )\n            else:\n                labels += self._get_pixel_label_3d(\n                    imgs.detach(),\n                    [mask.to(torch.bool)],\n                    time_stride_loss=self.cfg.MASK.TIME_STRIDE_LOSS,\n                    norm=self.cfg.MASK.NORM_PRED_PIXEL,\n                )\n        if self.pred_hog_wt:\n            if self.use_2d_patch:\n                labels += self._get_hog_label_2d(imgs.detach(), [mask.to(torch.bool)])\n            else:\n                labels += self._get_hog_label_3d(imgs.detach(), [mask.to(torch.bool)])\n\n        self.counter += 1\n        if self.cfg.VIS_MASK.ENABLE:\n            return self._mae_visualize(imgs, pred, mask)\n        return pred, labels\n\n    def _mae_visualize(self, imgs, pred, mask):\n        N, T, H, W, p, u, t, h, w = self.patch_info\n        pred = pred[0]\n        if self.cfg.MASK.TIME_STRIDE_LOSS:\n            im_viz = imgs[:, :, :: self.cfg.MVIT.PATCH_STRIDE[0], :, :]\n        else:\n            im_viz = imgs\n        reconstruct = 
self._unpatchify(\n            pred * mask.reshape(N, t * h * w, 1)\n            + self._patchify(im_viz, time_stride_loss=self.cfg.MASK.TIME_STRIDE_LOSS)\n            * (1 - mask.reshape(N, t * h * w, 1))\n        )\n        masked = self._unpatchify(\n            self._patchify(im_viz, time_stride_loss=self.cfg.MASK.TIME_STRIDE_LOSS)\n            * (1 - mask.reshape(N, t * h * w, 1))\n        )\n\n        comparison = torch.stack(\n            [im_viz, masked, reconstruct],\n            dim=1,\n        ).permute([0, 1, 3, 2, 4, 5])\n        pfx = self.cfg.TEST.CHECKPOINT_FILE_PATH\n        mr = self.cfg.AUG.MASK_RATIO\n        for i in range(comparison.shape[0]):\n            misc.plot_input_normed(\n                comparison[i].cpu(),\n                bboxes=(),\n                texts=(),\n                path=self.cfg.OUTPUT_DIR\n                + \"/vis_mask/vid/{}vis_video_in_mask_out_mr{}/vis_{}_{}.mp4\".format(\n                    pfx[pfx.rfind(\"/\") + 1 : -5], mr, self.counter, i\n                ),\n                folder_path=self.cfg.OUTPUT_DIR\n                + \"/vis_mask/vid/{}vis_video_in_mask_out_mr{}\".format(\n                    pfx[pfx.rfind(\"/\") + 1 : -5], mr\n                ),\n                make_grids=True,\n                output_video=True,\n            )\n        return pred[0]\n\n    def _maskfeat_forward(self, x, mask, return_all=False):\n        x_embed, x_shape = self.patch_embed(x)\n        if self.cfg.MASK.MAE_RND_MASK:\n            _, mask, ids_restore, ids_keep = self._mae_random_masking(\n                x_embed, self.cfg.AUG.MASK_RATIO, None\n            )\n            output_masks = [mask.to(torch.bool)]\n        else:\n            # take masks and labels from loader\n            float_mask = mask.type_as(x)\n            output_masks = self._get_multiscale_mask(float_mask)\n        labels = []\n        if self.pred_pixel_wt:\n            if self.use_2d_patch:\n                labels += self._get_pixel_label_2d(\n       
             x.detach(), output_masks, norm=self.cfg.MASK.NORM_PRED_PIXEL\n                )\n            else:\n                labels += self._get_pixel_label_3d(\n                    x.detach(), output_masks, norm=self.cfg.MASK.NORM_PRED_PIXEL\n                )\n        if self.pred_hog_wt:\n            if self.use_2d_patch:\n                labels += self._get_hog_label_2d(x.detach(), output_masks)\n            else:\n                labels += self._get_hog_label_3d(x.detach(), output_masks)\n\n        x = x_embed\n        T, H, W = self.T, self.H, self.W\n        B, N, C = x.shape\n\n        # switch input tokens by mask_token\n        mask_tokens = self.mask_token.expand(B, N, -1)\n        if self.cfg.MASK.MAE_RND_MASK:\n            float_mask = mask.unsqueeze(-1)\n        else:\n            if self.use_2d_patch:\n                float_mask = F.interpolate(float_mask.unsqueeze(0), size=(H, W))[0]\n            else:\n                float_mask = F.interpolate(float_mask, size=(H, W))\n            float_mask = float_mask.flatten(1).unsqueeze(-1)\n        x = x * (1 - float_mask) + mask_tokens * float_mask\n\n        if self.cls_embed_on:\n            cls_tokens = self.cls_token.expand(B, -1, -1)\n            x = torch.cat((cls_tokens, x), dim=1)\n\n        if self.use_abs_pos:\n            if self.sep_pos_embed:\n                pos_embed = self.pos_embed_spatial.repeat(\n                    1, self.patch_dims[0], 1\n                ) + torch.repeat_interleave(\n                    self.pos_embed_temporal,\n                    self.patch_dims[1] * self.patch_dims[2],\n                    dim=1,\n                )\n                if self.cls_embed_on:\n                    pos_embed = torch.cat([self.pos_embed_class, pos_embed], 1)\n                x = x + pos_embed\n            else:\n                x = x + self.pos_embed\n\n        if self.drop_rate:\n            x = self.pos_drop(x)\n\n        if self.norm_stem:\n            x = self.norm_stem(x)\n\n        
thw = [T, H, W]\n        block_outputs = []\n        for idx, blk in enumerate(self.blocks):\n            x, thw = blk(x, thw)\n            if idx in self.pretrain_depth:\n                block_outputs.append(x)\n\n        model_outputs = []\n        if self.pred_pixel_wt:\n            pixel_outputs = self.pred_head(\n                block_outputs,\n                output_masks,\n                return_all=return_all,\n                thw=thw,\n            )\n            model_outputs += pixel_outputs\n        if self.pred_hog_wt:\n            hog_outputs = self.pred_head(\n                block_outputs,\n                output_masks,\n                return_all=return_all,\n                thw=thw,\n            )\n            model_outputs += hog_outputs\n\n        return model_outputs, labels\n\n    def forward(self, x, return_all=False):\n        if len(x) > 1:\n            x, meta, mask = x\n        else:\n            x, mask = x[0], None\n\n        if self.cfg.MASK.MAE_ON:\n            return self._mae_forward(x, mask_ratio=self.cfg.AUG.MASK_RATIO, mask=mask)\n        else:\n            return self._maskfeat_forward(x, mask, return_all)\n"
  },
  {
    "path": "slowfast/models/nonlocal_helper.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Non-local helper\"\"\"\n\nimport torch\nimport torch.nn as nn\n\n\nclass Nonlocal(nn.Module):\n    \"\"\"\n    Builds Non-local Neural Networks as a generic family of building\n    blocks for capturing long-range dependencies. Non-local Network\n    computes the response at a position as a weighted sum of the\n    features at all positions. This building block can be plugged into\n    many computer vision architectures.\n    More details in the paper: https://arxiv.org/pdf/1711.07971.pdf\n    \"\"\"\n\n    def __init__(\n        self,\n        dim,\n        dim_inner,\n        pool_size=None,\n        instantiation=\"softmax\",\n        zero_init_final_conv=False,\n        zero_init_final_norm=True,\n        norm_eps=1e-5,\n        norm_momentum=0.1,\n        norm_module=nn.BatchNorm3d,\n    ):\n        \"\"\"\n        Args:\n            dim (int): number of dimension for the input.\n            dim_inner (int): number of dimension inside of the Non-local block.\n            pool_size (list): the kernel size of spatial temporal pooling,\n                temporal pool kernel size, spatial pool kernel size, spatial\n                pool kernel size in order. By default pool_size is None,\n                then there would be no pooling used.\n            instantiation (string): supports two different instantiation method:\n                \"dot_product\": normalizing correlation matrix with L2.\n                \"softmax\": normalizing correlation matrix with Softmax.\n            zero_init_final_conv (bool): If true, zero initializing the final\n                convolution of the Non-local block.\n            zero_init_final_norm (bool):\n                If true, zero initializing the final batch norm of the Non-local\n                block.\n            norm_module (nn.Module): nn.Module for the normalization layer. 
The\n                default is nn.BatchNorm3d.\n        \"\"\"\n        super(Nonlocal, self).__init__()\n        self.dim = dim\n        self.dim_inner = dim_inner\n        self.pool_size = pool_size\n        self.instantiation = instantiation\n        self.use_pool = (\n            False if pool_size is None else any((size > 1 for size in pool_size))\n        )\n        self.norm_eps = norm_eps\n        self.norm_momentum = norm_momentum\n        self._construct_nonlocal(\n            zero_init_final_conv, zero_init_final_norm, norm_module\n        )\n\n    def _construct_nonlocal(\n        self, zero_init_final_conv, zero_init_final_norm, norm_module\n    ):\n        # Three convolution heads: theta, phi, and g.\n        self.conv_theta = nn.Conv3d(\n            self.dim, self.dim_inner, kernel_size=1, stride=1, padding=0\n        )\n        self.conv_phi = nn.Conv3d(\n            self.dim, self.dim_inner, kernel_size=1, stride=1, padding=0\n        )\n        self.conv_g = nn.Conv3d(\n            self.dim, self.dim_inner, kernel_size=1, stride=1, padding=0\n        )\n\n        # Final convolution output.\n        self.conv_out = nn.Conv3d(\n            self.dim_inner, self.dim, kernel_size=1, stride=1, padding=0\n        )\n        # Zero initializing the final convolution output.\n        self.conv_out.zero_init = zero_init_final_conv\n\n        # TODO: change the name to `norm`\n        self.bn = norm_module(\n            num_features=self.dim,\n            eps=self.norm_eps,\n            momentum=self.norm_momentum,\n        )\n        # Zero initializing the final bn.\n        self.bn.transform_final_bn = zero_init_final_norm\n\n        # Optional to add the spatial-temporal pooling.\n        if self.use_pool:\n            self.pool = nn.MaxPool3d(\n                kernel_size=self.pool_size,\n                stride=self.pool_size,\n                padding=[0, 0, 0],\n            )\n\n    def forward(self, x):\n        x_identity = x\n        N, C, T, H, 
W = x.size()\n\n        theta = self.conv_theta(x)\n\n        # Perform temporal-spatial pooling to reduce the computation.\n        if self.use_pool:\n            x = self.pool(x)\n\n        phi = self.conv_phi(x)\n        g = self.conv_g(x)\n\n        theta = theta.view(N, self.dim_inner, -1)\n        phi = phi.view(N, self.dim_inner, -1)\n        g = g.view(N, self.dim_inner, -1)\n\n        # (N, C, TxHxW) * (N, C, TxHxW) => (N, TxHxW, TxHxW).\n        theta_phi = torch.einsum(\"nct,ncp->ntp\", (theta, phi))\n        # For original Non-local paper, there are two main ways to normalize\n        # the affinity tensor:\n        #   1) Softmax normalization (norm on exp).\n        #   2) dot_product normalization.\n        if self.instantiation == \"softmax\":\n            # Normalizing the affinity tensor theta_phi before softmax.\n            theta_phi = theta_phi * (self.dim_inner**-0.5)\n            theta_phi = nn.functional.softmax(theta_phi, dim=2)\n        elif self.instantiation == \"dot_product\":\n            spatial_temporal_dim = theta_phi.shape[2]\n            theta_phi = theta_phi / spatial_temporal_dim\n        else:\n            raise NotImplementedError(\"Unknown norm type {}\".format(self.instantiation))\n\n        # (N, TxHxW, TxHxW) * (N, C, TxHxW) => (N, C, TxHxW).\n        theta_phi_g = torch.einsum(\"ntg,ncg->nct\", (theta_phi, g))\n\n        # (N, C, TxHxW) => (N, C, T, H, W).\n        theta_phi_g = theta_phi_g.view(N, self.dim_inner, T, H, W)\n\n        p = self.conv_out(theta_phi_g)\n        p = self.bn(p)\n        return x_identity + p\n"
  },
  {
    "path": "slowfast/models/operators.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Custom operators.\"\"\"\n\nimport math\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom pytorchvideo.layers.swish import Swish\nfrom slowfast.models.utils import get_gkern\n\n\nclass SE(nn.Module):\n    \"\"\"Squeeze-and-Excitation (SE) block w/ Swish: AvgPool, FC, Swish, FC, Sigmoid.\"\"\"\n\n    def _round_width(self, width, multiplier, min_width=8, divisor=8):\n        \"\"\"\n        Round width of filters based on width multiplier\n        Args:\n            width (int): the channel dimensions of the input.\n            multiplier (float): the multiplication factor.\n            min_width (int): the minimum width after multiplication.\n            divisor (int): the new width should be dividable by divisor.\n        \"\"\"\n        if not multiplier:\n            return width\n\n        width *= multiplier\n        min_width = min_width or divisor\n        width_out = max(min_width, int(width + divisor / 2) // divisor * divisor)\n        if width_out < 0.9 * width:\n            width_out += divisor\n        return int(width_out)\n\n    def __init__(self, dim_in, ratio, relu_act=True):\n        \"\"\"\n        Args:\n            dim_in (int): the channel dimensions of the input.\n            ratio (float): the channel reduction ratio for squeeze.\n            relu_act (bool): whether to use ReLU activation instead\n                of Swish (default).\n            divisor (int): the new width should be dividable by divisor.\n        \"\"\"\n        super(SE, self).__init__()\n        self.avg_pool = nn.AdaptiveAvgPool3d((1, 1, 1))\n        dim_fc = self._round_width(dim_in, ratio)\n        self.fc1 = nn.Conv3d(dim_in, dim_fc, 1, bias=True)\n        self.fc1_act = nn.ReLU() if relu_act else Swish()\n        self.fc2 = nn.Conv3d(dim_fc, dim_in, 1, bias=True)\n\n        self.fc2_sig = nn.Sigmoid()\n\n    def 
forward(self, x):\n        x_in = x\n        for module in self.children():\n            x = module(x)\n        return x_in * x\n\n\nclass HOGLayerC(nn.Module):\n    def __init__(self, nbins=9, pool=7, gaussian_window=0):\n        super(HOGLayerC, self).__init__()\n        self.nbins = nbins\n        self.pool = pool\n        self.pi = math.pi\n        weight_x = torch.FloatTensor([[1, 0, -1], [2, 0, -2], [1, 0, -1]])\n        weight_x = weight_x.view(1, 1, 3, 3).repeat(3, 1, 1, 1)\n        weight_y = weight_x.transpose(2, 3)\n        self.register_buffer(\"weight_x\", weight_x)\n        self.register_buffer(\"weight_y\", weight_y)\n\n        self.gaussian_window = gaussian_window\n        if gaussian_window:\n            gkern = get_gkern(gaussian_window, gaussian_window // 2)\n            self.register_buffer(\"gkern\", gkern)\n\n    @torch.no_grad()\n    def forward(self, x):\n        # input is RGB image with shape [B 3 H W]\n        x = F.pad(x, pad=(1, 1, 1, 1), mode=\"reflect\")\n        gx_rgb = F.conv2d(x, self.weight_x, bias=None, stride=1, padding=0, groups=3)\n        gy_rgb = F.conv2d(x, self.weight_y, bias=None, stride=1, padding=0, groups=3)\n        norm_rgb = torch.stack([gx_rgb, gy_rgb], dim=-1).norm(dim=-1)\n        phase = torch.atan2(gx_rgb, gy_rgb)\n        phase = phase / self.pi * self.nbins  # [-9, 9]\n\n        b, c, h, w = norm_rgb.shape\n        out = torch.zeros((b, c, self.nbins, h, w), dtype=torch.float, device=x.device)\n        phase = phase.view(b, c, 1, h, w)\n        norm_rgb = norm_rgb.view(b, c, 1, h, w)\n        if self.gaussian_window:\n            if h != self.gaussian_window:\n                assert h % self.gaussian_window == 0, \"h {} gw {}\".format(\n                    h, self.gaussian_window\n                )\n                repeat_rate = h // self.gaussian_window\n                temp_gkern = self.gkern.repeat([repeat_rate, repeat_rate])\n            else:\n                temp_gkern = self.gkern\n            
norm_rgb *= temp_gkern\n\n        out.scatter_add_(2, phase.floor().long() % self.nbins, norm_rgb)\n\n        out = out.unfold(3, self.pool, self.pool)\n        out = out.unfold(4, self.pool, self.pool)\n        out = out.sum(dim=[-1, -2])\n\n        out = torch.nn.functional.normalize(out, p=2, dim=2)\n\n        return out  # B 3 nbins H W\n"
  },
  {
    "path": "slowfast/models/optimizer.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Optimizer.\"\"\"\n\nimport slowfast.utils.lr_policy as lr_policy\nimport torch\n\n\ndef construct_optimizer(model, cfg):\n    \"\"\"\n    Construct a stochastic gradient descent or ADAM optimizer with momentum.\n    Details can be found in:\n    Herbert Robbins, and Sutton Monro. \"A stochastic approximation method.\"\n    and\n    Diederik P.Kingma, and Jimmy Ba.\n    \"Adam: A Method for Stochastic Optimization.\"\n\n    Args:\n        model (model): model to perform stochastic gradient descent\n        optimization or ADAM optimization.\n        cfg (config): configs of hyper-parameters of SGD or ADAM, includes base\n        learning rate,  momentum, weight_decay, dampening, and etc.\n    \"\"\"\n    if cfg.SOLVER.LAYER_DECAY > 0.0 and cfg.SOLVER.LAYER_DECAY < 1.0:\n        optim_params = get_param_groups(model, cfg)\n    elif cfg.SOLVER.LAYER_DECAY == 1.0:\n        bn_parameters = []\n        non_bn_parameters = []\n        zero_parameters = []\n        no_grad_parameters = []\n        skip = {}\n\n        if cfg.NUM_GPUS > 1:\n            if hasattr(model.module, \"no_weight_decay\"):\n                skip = model.module.no_weight_decay()\n        else:\n            if hasattr(model, \"no_weight_decay\"):\n                skip = model.no_weight_decay()\n\n        for name_m, m in model.named_modules():\n            is_bn = isinstance(m, torch.nn.modules.batchnorm._NormBase)\n            for name_p, p in m.named_parameters(recurse=False):\n                name = \"{}.{}\".format(name_m, name_p).strip(\".\")\n                if not p.requires_grad:\n                    no_grad_parameters.append(p)\n                elif is_bn:\n                    bn_parameters.append(p)\n                elif any(k in name for k in skip):\n                    zero_parameters.append(p)\n                elif cfg.SOLVER.ZERO_WD_1D_PARAM and (\n                    
len(p.shape) == 1 or name.endswith(\".bias\")\n                ):\n                    zero_parameters.append(p)\n                else:\n                    non_bn_parameters.append(p)\n\n        optim_params = [\n            {\n                \"params\": bn_parameters,\n                \"weight_decay\": cfg.BN.WEIGHT_DECAY,\n                \"layer_decay\": 1.0,\n                \"apply_LARS\": False,\n            },\n            {\n                \"params\": non_bn_parameters,\n                \"weight_decay\": cfg.SOLVER.WEIGHT_DECAY,\n                \"layer_decay\": 1.0,\n                \"apply_LARS\": cfg.SOLVER.LARS_ON,\n            },\n            {\n                \"params\": zero_parameters,\n                \"weight_decay\": 0.0,\n                \"layer_decay\": 1.0,\n                \"apply_LARS\": cfg.SOLVER.LARS_ON,\n            },\n        ]\n        optim_params = [x for x in optim_params if len(x[\"params\"])]\n\n        # Check all parameters will be passed into optimizer.\n        assert len(list(model.parameters())) == len(non_bn_parameters) + len(\n            bn_parameters\n        ) + len(zero_parameters) + len(no_grad_parameters), (\n            \"parameter size does not match: {} + {} + {} + {} != {}\".format(\n                len(non_bn_parameters),\n                len(bn_parameters),\n                len(zero_parameters),\n                len(no_grad_parameters),\n                len(list(model.parameters())),\n            )\n        )\n        print(\n            \"bn {}, non bn {}, zero {}, no grad {}\".format(\n                len(bn_parameters),\n                len(non_bn_parameters),\n                len(zero_parameters),\n                len(no_grad_parameters),\n            )\n        )\n    else:\n        raise ValueError(\n            \"Layer decay should be in (0, 1], but is {}\".format(cfg.SOLVER.LAYER_DECAY)\n        )\n\n    if cfg.SOLVER.OPTIMIZING_METHOD == \"sgd\":\n        optimizer = torch.optim.SGD(\n            
optim_params,\n            lr=cfg.SOLVER.BASE_LR,\n            momentum=cfg.SOLVER.MOMENTUM,\n            weight_decay=cfg.SOLVER.WEIGHT_DECAY,\n            dampening=cfg.SOLVER.DAMPENING,\n            nesterov=cfg.SOLVER.NESTEROV,\n        )\n    elif cfg.SOLVER.OPTIMIZING_METHOD == \"adam\":\n        optimizer = torch.optim.Adam(\n            optim_params,\n            lr=cfg.SOLVER.BASE_LR,\n            betas=cfg.SOLVER.BETAS,\n            weight_decay=cfg.SOLVER.WEIGHT_DECAY,\n        )\n    elif cfg.SOLVER.OPTIMIZING_METHOD == \"adamw\":\n        optimizer = torch.optim.AdamW(\n            optim_params,\n            lr=cfg.SOLVER.BASE_LR,\n            betas=cfg.SOLVER.BETAS,\n            eps=1e-08,\n            weight_decay=cfg.SOLVER.WEIGHT_DECAY,\n        )\n    elif cfg.SOLVER.OPTIMIZING_METHOD == \"mt_adamw\":\n        optimizer = torch.optim._multi_tensor.AdamW(\n            optim_params,\n            lr=cfg.SOLVER.BASE_LR,\n            betas=cfg.SOLVER.BETAS,\n            eps=1e-08,\n            weight_decay=cfg.SOLVER.WEIGHT_DECAY,\n        )\n    else:\n        raise NotImplementedError(\n            \"Does not support {} optimizer\".format(cfg.SOLVER.OPTIMIZING_METHOD)\n        )\n    if cfg.SOLVER.LARS_ON:\n        optimizer = LARS(optimizer=optimizer, trust_coefficient=0.001, clip=False)\n    return optimizer\n\n\ndef get_param_groups(model, cfg):\n    def _get_layer_decay(name):\n        layer_id = None\n        if name in (\"cls_token\", \"mask_token\"):\n            layer_id = 0\n        elif name.startswith(\"pos_embed\"):\n            layer_id = 0\n        elif name.startswith(\"patch_embed\"):\n            layer_id = 0\n        elif name.startswith(\"blocks\"):\n            layer_id = int(name.split(\".\")[1]) + 1\n        else:\n            layer_id = cfg.MVIT.DEPTH + 1\n        layer_decay = cfg.SOLVER.LAYER_DECAY ** (cfg.MVIT.DEPTH + 1 - layer_id)\n        return layer_id, layer_decay\n\n    for m in model.modules():\n        assert not 
isinstance(m, torch.nn.modules.batchnorm._NormBase), (\n            \"BN is not supported with layer decay\"\n        )\n\n    non_bn_parameters_count = 0\n    zero_parameters_count = 0\n    no_grad_parameters_count = 0\n    parameter_group_names = {}\n    parameter_group_vars = {}\n\n    skip = {}\n    if cfg.NUM_GPUS > 1:\n        if hasattr(model.module, \"no_weight_decay\"):\n            skip = model.module.no_weight_decay()\n            # skip = {\"module.\" + v for v in skip}\n    else:\n        if hasattr(model, \"no_weight_decay\"):\n            skip = model.no_weight_decay()\n\n    for name, p in model.named_parameters():\n        if not p.requires_grad:\n            group_name = \"no_grad\"\n            no_grad_parameters_count += 1\n            continue\n        name = name[len(\"module.\") :] if name.startswith(\"module.\") else name\n        if name in skip or (\n            (len(p.shape) == 1 or name.endswith(\".bias\"))\n            and cfg.SOLVER.ZERO_WD_1D_PARAM\n        ):\n            layer_id, layer_decay = _get_layer_decay(name)\n            group_name = \"layer_%d_%s\" % (layer_id, \"zero\")\n            weight_decay = 0.0\n            zero_parameters_count += 1\n        else:\n            layer_id, layer_decay = _get_layer_decay(name)\n            group_name = \"layer_%d_%s\" % (layer_id, \"non_bn\")\n            weight_decay = cfg.SOLVER.WEIGHT_DECAY\n            non_bn_parameters_count += 1\n\n        if group_name not in parameter_group_names:\n            parameter_group_names[group_name] = {\n                \"weight_decay\": weight_decay,\n                \"params\": [],\n                \"layer_decay\": layer_decay,\n            }\n            parameter_group_vars[group_name] = {\n                \"weight_decay\": weight_decay,\n                \"params\": [],\n                \"layer_decay\": layer_decay,\n            }\n        parameter_group_names[group_name][\"params\"].append(name)\n        
parameter_group_vars[group_name][\"params\"].append(p)\n\n    # print(\"Param groups = %s\" % json.dumps(parameter_group_names, indent=2))\n    optim_params = list(parameter_group_vars.values())\n\n    # Check all parameters will be passed into optimizer.\n    assert (\n        len(list(model.parameters()))\n        == non_bn_parameters_count + zero_parameters_count + no_grad_parameters_count\n    ), \"parameter size does not match: {} + {} + {} != {}\".format(\n        non_bn_parameters_count,\n        zero_parameters_count,\n        no_grad_parameters_count,\n        len(list(model.parameters())),\n    )\n    print(\n        \"non bn {}, zero {}, no grad {}\".format(\n            non_bn_parameters_count,\n            zero_parameters_count,\n            no_grad_parameters_count,\n        )\n    )\n\n    return optim_params\n\n\ndef get_epoch_lr(cur_epoch, cfg):\n    \"\"\"\n    Retrieves the lr for the given epoch (as specified by the lr policy).\n    Args:\n        cur_epoch (float): the epoch number of the current training stage.\n        cfg (config): configs of hyper-parameters of ADAM, includes base\n        learning rate, betas, and weight decay.\n    \"\"\"\n    return lr_policy.get_lr_at_epoch(cfg, cur_epoch)\n\n\ndef set_lr(optimizer, new_lr):\n    \"\"\"\n    Sets the optimizer lr to the specified value, scaled by the\n    layer_decay of each parameter group.\n    Args:\n        optimizer (optim): the optimizer used to optimize the current network.\n        new_lr (float): the new learning rate to set.\n    \"\"\"\n    for param_group in optimizer.param_groups:\n        param_group[\"lr\"] = new_lr * param_group[\"layer_decay\"]\n\n\nclass LARS:\n    \"\"\"\n    This class is adapted from https://github.com/NVIDIA/apex/blob/master/apex/parallel/LARC.py to\n    support skipping LARS for specific parameters (e.g. 1D params).\n\n    Args:\n        optimizer: PyTorch optimizer to wrap and modify learning rate for.\n        trust_coefficient: Trust coefficient for calculating the lr. 
See https://arxiv.org/abs/1708.03888\n        clip: Decides between clipping or scaling mode of LARS. If `clip=True` the learning rate is set to `min(optimizer_lr, local_lr)` for each parameter. If `clip=False` the learning rate is set to `local_lr*optimizer_lr`.\n        eps: epsilon kludge to help with numerical stability while calculating adaptive_lr\n    \"\"\"\n\n    def __init__(\n        self,\n        optimizer,\n        trust_coefficient=0.02,\n        clip=True,\n        eps=1e-8,\n        ignore_1d_param=True,\n    ):\n        self.optim = optimizer\n        self.trust_coefficient = trust_coefficient\n        self.eps = eps\n        self.clip = clip\n        self.ignore_1d_param = ignore_1d_param\n\n    def __getstate__(self):\n        return self.optim.__getstate__()\n\n    def __setstate__(self, state):\n        self.optim.__setstate__(state)\n\n    @property\n    def state(self):\n        return self.optim.state\n\n    def __repr__(self):\n        return self.optim.__repr__()\n\n    @property\n    def param_groups(self):\n        return self.optim.param_groups\n\n    @param_groups.setter\n    def param_groups(self, value):\n        self.optim.param_groups = value\n\n    def state_dict(self):\n        return self.optim.state_dict()\n\n    def load_state_dict(self, state_dict):\n        self.optim.load_state_dict(state_dict)\n\n    def zero_grad(self):\n        self.optim.zero_grad()\n\n    def add_param_group(self, param_group):\n        self.optim.add_param_group(param_group)\n\n    def step(self):\n        with torch.no_grad():\n            weight_decays = []\n            for group in self.optim.param_groups:\n                # absorb weight decay control from optimizer\n                weight_decay = group[\"weight_decay\"] if \"weight_decay\" in group else 0\n                weight_decays.append(weight_decay)\n                apply_LARS = group[\"apply_LARS\"] if \"apply_LARS\" in group else True\n                if not apply_LARS:\n                
    continue\n                group[\"weight_decay\"] = 0\n                for p in group[\"params\"]:\n                    if p.grad is None:\n                        continue\n                    if self.ignore_1d_param and p.ndim == 1:  # ignore bias\n                        continue\n                    param_norm = torch.norm(p.data)\n                    grad_norm = torch.norm(p.grad.data)\n\n                    if param_norm != 0 and grad_norm != 0:\n                        # calculate adaptive lr + weight decay\n                        adaptive_lr = (\n                            self.trust_coefficient\n                            * (param_norm)\n                            / (grad_norm + param_norm * weight_decay + self.eps)\n                        )\n\n                        # clip learning rate for LARS\n                        if self.clip:\n                            # calculation of adaptive_lr so that when multiplied by lr it equals `min(adaptive_lr, lr)`\n                            adaptive_lr = min(adaptive_lr / group[\"lr\"], 1)\n\n                        p.grad.data += weight_decay * p.data\n                        p.grad.data *= adaptive_lr\n\n        self.optim.step()\n        # return weight decay control to optimizer\n        for i, group in enumerate(self.optim.param_groups):\n            group[\"weight_decay\"] = weight_decays[i]\n\n\ndef get_grad_norm_(parameters, norm_type=2.0):\n    if isinstance(parameters, torch.Tensor):\n        parameters = [parameters]\n    parameters = [p for p in parameters if p.grad is not None]\n    norm_type = float(norm_type)\n    if len(parameters) == 0:\n        return torch.tensor(0.0)\n    device = parameters[0].grad.device\n    if norm_type == float(\"inf\"):\n        total_norm = max(p.grad.detach().abs().max().to(device) for p in parameters)\n    else:\n        total_norm = torch.norm(\n            torch.stack(\n                [torch.norm(p.grad.detach(), norm_type).to(device) for p in parameters]\n        
    ),\n            norm_type,\n        )\n    return total_norm\n"
  },
  {
    "path": "slowfast/models/ptv_model_builder.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\n\"\"\"Video models using PyTorchVideo model builder.\"\"\"\n\nfrom functools import partial\n\nimport torch.nn as nn\nfrom detectron2.layers import ROIAlign\nfrom pytorchvideo.models.csn import create_csn\nfrom pytorchvideo.models.head import create_res_basic_head, create_res_roi_pooling_head\nfrom pytorchvideo.models.r2plus1d import (\n    create_2plus1d_bottleneck_block,\n    create_r2plus1d,\n)\nfrom pytorchvideo.models.resnet import create_bottleneck_block, create_resnet\nfrom pytorchvideo.models.slowfast import create_slowfast\nfrom pytorchvideo.models.vision_transformers import (\n    create_multiscale_vision_transformers,\n)\nfrom pytorchvideo.models.x3d import create_x3d, create_x3d_bottleneck_block, Swish\nfrom slowfast.models.batchnorm_helper import get_norm\nfrom slowfast.models.video_model_builder import _POOL1, _TEMPORAL_KERNEL_BASIS\n\nfrom .build import MODEL_REGISTRY\n\n\ndef get_head_act(act_func):\n    \"\"\"\n    Return the actual head activation function given the activation fucntion name.\n\n    Args:\n        act_func (string): activation function to use. 'softmax': applies\n        softmax on the output. 
'sigmoid': applies sigmoid on the output.\n    Returns:\n        nn.Module: the activation layer.\n    \"\"\"\n    if act_func == \"softmax\":\n        return nn.Softmax(dim=1)\n    elif act_func == \"sigmoid\":\n        return nn.Sigmoid()\n    else:\n        raise NotImplementedError(\n            \"{} is not supported as a head activation function.\".format(act_func)\n        )\n\n\n@MODEL_REGISTRY.register()\nclass PTVResNet(nn.Module):\n    \"\"\"\n    ResNet models using PyTorchVideo model builder.\n    \"\"\"\n\n    def __init__(self, cfg):\n        \"\"\"\n        The `__init__` method of any subclass should also contain these\n            arguments.\n\n        Args:\n            cfg (CfgNode): model building configs, details are in the\n                comments of the config file.\n        \"\"\"\n        super(PTVResNet, self).__init__()\n\n        assert cfg.RESNET.STRIDE_1X1 is False, \"STRIDE_1x1 must be False for PTVResNet\"\n        assert cfg.RESNET.TRANS_FUNC == \"bottleneck_transform\", (\n            f\"Unsupported TRANS_FUNC type {cfg.RESNET.TRANS_FUNC} for PTVResNet\"\n        )\n        assert cfg.MODEL.ARCH in [\n            \"c2d\",\n            \"slow\",\n            \"i3d\",\n        ], f\"Unsupported MODEL.ARCH type {cfg.MODEL.ARCH} for PTVResNet\"\n\n        self.detection_mode = cfg.DETECTION.ENABLE\n        self._construct_network(cfg)\n\n    def _construct_network(self, cfg):\n        \"\"\"\n        Builds a single pathway ResNet model.\n\n        Args:\n            cfg (CfgNode): model building configs, details are in the\n                comments of the config file.\n        \"\"\"\n\n        # Params from configs.\n        norm_module = get_norm(cfg)\n        head_act = get_head_act(cfg.MODEL.HEAD_ACT)\n        pool_size = _POOL1[cfg.MODEL.ARCH]\n        num_groups = cfg.RESNET.NUM_GROUPS\n        spatial_dilations = cfg.RESNET.SPATIAL_DILATIONS\n        spatial_strides = cfg.RESNET.SPATIAL_STRIDES\n        temp_kernel = 
_TEMPORAL_KERNEL_BASIS[cfg.MODEL.ARCH]\n        stage1_pool = pool_size[0][0] != 1 or len(set(pool_size[0])) > 1\n        stage_spatial_stride = (\n            spatial_strides[0][0],\n            spatial_strides[1][0],\n            spatial_strides[2][0],\n            spatial_strides[3][0],\n        )\n        if cfg.MODEL.ARCH == \"i3d\":\n            stage_conv_a_kernel_size = (\n                (3, 1, 1),\n                [(3, 1, 1), (1, 1, 1)],\n                [(3, 1, 1), (1, 1, 1)],\n                [(1, 1, 1), (3, 1, 1)],\n            )\n        else:\n            stage_conv_a_kernel_size = (\n                (temp_kernel[1][0][0], 1, 1),\n                (temp_kernel[2][0][0], 1, 1),\n                (temp_kernel[3][0][0], 1, 1),\n                (temp_kernel[4][0][0], 1, 1),\n            )\n\n        # Head from config\n        if cfg.DETECTION.ENABLE:\n            self.detection_head = create_res_roi_pooling_head(\n                in_features=cfg.RESNET.WIDTH_PER_GROUP * 2 ** (4 + 1),\n                out_features=cfg.MODEL.NUM_CLASSES,\n                pool=nn.AvgPool3d,\n                output_size=(1, 1, 1),\n                pool_kernel_size=(\n                    cfg.DATA.NUM_FRAMES // pool_size[0][0],\n                    1,\n                    1,\n                ),\n                dropout_rate=cfg.MODEL.DROPOUT_RATE,\n                activation=None,\n                output_with_global_average=False,\n                pool_spatial=nn.MaxPool2d,\n                resolution=[cfg.DETECTION.ROI_XFORM_RESOLUTION] * 2,\n                spatial_scale=1.0 / float(cfg.DETECTION.SPATIAL_SCALE_FACTOR),\n                sampling_ratio=0,\n                roi=ROIAlign,\n            )\n\n        self.model = create_resnet(\n            # Input clip configs.\n            input_channel=cfg.DATA.INPUT_CHANNEL_NUM[0],\n            # Model configs.\n            model_depth=cfg.RESNET.DEPTH,\n            model_num_class=cfg.MODEL.NUM_CLASSES,\n            
dropout_rate=cfg.MODEL.DROPOUT_RATE,\n            # Normalization configs.\n            norm=norm_module,\n            # Activation configs.\n            activation=partial(nn.ReLU, inplace=cfg.RESNET.INPLACE_RELU),\n            # Stem configs.\n            stem_dim_out=cfg.RESNET.WIDTH_PER_GROUP,\n            stem_conv_kernel_size=(temp_kernel[0][0][0], 7, 7),\n            stem_conv_stride=(1, 2, 2),\n            stem_pool=nn.MaxPool3d,\n            stem_pool_kernel_size=(1, 3, 3),\n            stem_pool_stride=(1, 2, 2),\n            # Stage configs.\n            stage1_pool=nn.MaxPool3d if stage1_pool else None,\n            stage1_pool_kernel_size=pool_size[0],\n            stage_conv_a_kernel_size=stage_conv_a_kernel_size,\n            stage_conv_b_kernel_size=(\n                (1, 3, 3),\n                (1, 3, 3),\n                (1, 3, 3),\n                (1, 3, 3),\n            ),\n            stage_conv_b_num_groups=(\n                num_groups,\n                num_groups,\n                num_groups,\n                num_groups,\n            ),\n            stage_conv_b_dilation=(\n                (1, spatial_dilations[0][0], spatial_dilations[0][0]),\n                (1, spatial_dilations[1][0], spatial_dilations[1][0]),\n                (1, spatial_dilations[2][0], spatial_dilations[2][0]),\n                (1, spatial_dilations[3][0], spatial_dilations[3][0]),\n            ),\n            stage_spatial_h_stride=stage_spatial_stride,\n            stage_spatial_w_stride=stage_spatial_stride,\n            stage_temporal_stride=(1, 1, 1, 1),\n            bottleneck=create_bottleneck_block,\n            # Head configs.\n            head=create_res_basic_head if not self.detection_mode else None,\n            head_pool=nn.AvgPool3d,\n            head_pool_kernel_size=(\n                cfg.DATA.NUM_FRAMES // pool_size[0][0],\n                cfg.DATA.TRAIN_CROP_SIZE // 32 // pool_size[0][1],\n                cfg.DATA.TRAIN_CROP_SIZE // 32 // 
pool_size[0][2],\n            ),\n            head_activation=None,\n            head_output_with_global_average=False,\n        )\n\n        self.post_act = head_act\n\n    def forward(self, x, bboxes=None):\n        x = x[0]\n        x = self.model(x)\n        if self.detection_mode:\n            x = self.detection_head(x, bboxes)\n            x = self.post_act(x)\n        else:\n            # Performs fully convolutional inference.\n            if not self.training:\n                x = self.post_act(x)\n                x = x.mean([2, 3, 4])\n        x = x.view(x.shape[0], -1)\n        return x\n\n\n@MODEL_REGISTRY.register()\nclass PTVSlowFast(nn.Module):\n    def __init__(self, cfg):\n        \"\"\"\n        The `__init__` method of any subclass should also contain these\n            arguments.\n\n        Args:\n            cfg (CfgNode): model building configs, details are in the\n                comments of the config file.\n        \"\"\"\n        super(PTVSlowFast, self).__init__()\n\n        assert cfg.RESNET.STRIDE_1X1 is False, \"STRIDE_1x1 must be False for PTVSlowFast\"\n        assert cfg.RESNET.TRANS_FUNC == \"bottleneck_transform\", (\n            f\"Unsupported TRANS_FUNC type {cfg.RESNET.TRANS_FUNC} for PTVSlowFast\"\n        )\n\n        self.detection_mode = cfg.DETECTION.ENABLE\n        self._construct_network(cfg)\n\n    def _construct_network(self, cfg):\n        \"\"\"\n        Builds a SlowFast model.\n\n        Args:\n            cfg (CfgNode): model building configs, details are in the\n                comments of the config file.\n        \"\"\"\n        _MODEL_STAGE_DEPTH = {50: (3, 4, 6, 3), 101: (3, 4, 23, 3)}\n\n        # Params from configs.\n        norm_module = get_norm(cfg)\n        pool_size = _POOL1[cfg.MODEL.ARCH]\n        num_groups = cfg.RESNET.NUM_GROUPS\n        width_per_group = cfg.RESNET.WIDTH_PER_GROUP\n        spatial_dilations = cfg.RESNET.SPATIAL_DILATIONS\n        spatial_strides = cfg.RESNET.SPATIAL_STRIDES\n      
  temp_kernel = _TEMPORAL_KERNEL_BASIS[cfg.MODEL.ARCH]\n        num_block_temp_kernel = cfg.RESNET.NUM_BLOCK_TEMP_KERNEL\n        stage_depth = _MODEL_STAGE_DEPTH[cfg.RESNET.DEPTH]\n\n        stage_conv_a_kernel_sizes = [[], []]\n        for pathway in range(2):\n            for stage in range(4):\n                stage_conv_a_kernel_sizes[pathway].append(\n                    ((temp_kernel[stage + 1][pathway][0], 1, 1),)\n                    * num_block_temp_kernel[stage][pathway]\n                    + ((1, 1, 1),)\n                    * (stage_depth[stage] - num_block_temp_kernel[stage][pathway])\n                )\n\n        # Head from config\n        # Number of stages = 4\n        stage_dim_in = cfg.RESNET.WIDTH_PER_GROUP * 2 ** (4 + 1)\n        head_in_features = stage_dim_in + stage_dim_in // cfg.SLOWFAST.BETA_INV\n\n        if cfg.DETECTION.ENABLE:\n            self.detection_head = create_res_roi_pooling_head(\n                in_features=head_in_features,\n                out_features=cfg.MODEL.NUM_CLASSES,\n                pool=None,\n                output_size=(1, 1, 1),\n                dropout_rate=cfg.MODEL.DROPOUT_RATE,\n                activation=None,\n                output_with_global_average=False,\n                pool_spatial=nn.MaxPool2d,\n                resolution=[cfg.DETECTION.ROI_XFORM_RESOLUTION] * 2,\n                spatial_scale=1.0 / float(cfg.DETECTION.SPATIAL_SCALE_FACTOR),\n                sampling_ratio=0,\n                roi=ROIAlign,\n            )\n            head_pool_kernel_sizes = (\n                (\n                    cfg.DATA.NUM_FRAMES // cfg.SLOWFAST.ALPHA // pool_size[0][0],\n                    1,\n                    1,\n                ),\n                (cfg.DATA.NUM_FRAMES // pool_size[1][0], 1, 1),\n            )\n        else:\n            head_pool_kernel_sizes = (\n                (\n                    cfg.DATA.NUM_FRAMES // cfg.SLOWFAST.ALPHA // pool_size[0][0],\n                    
cfg.DATA.TRAIN_CROP_SIZE // 32 // pool_size[0][1],\n                    cfg.DATA.TRAIN_CROP_SIZE // 32 // pool_size[0][2],\n                ),\n                (\n                    cfg.DATA.NUM_FRAMES // pool_size[1][0],\n                    cfg.DATA.TRAIN_CROP_SIZE // 32 // pool_size[1][1],\n                    cfg.DATA.TRAIN_CROP_SIZE // 32 // pool_size[1][2],\n                ),\n            )\n\n        self.model = create_slowfast(\n            # SlowFast configs.\n            slowfast_channel_reduction_ratio=cfg.SLOWFAST.BETA_INV,\n            slowfast_conv_channel_fusion_ratio=cfg.SLOWFAST.FUSION_CONV_CHANNEL_RATIO,\n            slowfast_fusion_conv_kernel_size=(\n                cfg.SLOWFAST.FUSION_KERNEL_SZ,\n                1,\n                1,\n            ),\n            slowfast_fusion_conv_stride=(cfg.SLOWFAST.ALPHA, 1, 1),\n            # Input clip configs.\n            input_channels=cfg.DATA.INPUT_CHANNEL_NUM,\n            # Model configs.\n            model_depth=cfg.RESNET.DEPTH,\n            model_num_class=cfg.MODEL.NUM_CLASSES,\n            dropout_rate=cfg.MODEL.DROPOUT_RATE,\n            # Normalization configs.\n            norm=norm_module,\n            # Activation configs.\n            activation=partial(nn.ReLU, inplace=cfg.RESNET.INPLACE_RELU),\n            # Stem configs.\n            stem_dim_outs=(\n                width_per_group,\n                width_per_group // cfg.SLOWFAST.BETA_INV,\n            ),\n            stem_conv_kernel_sizes=(\n                (temp_kernel[0][0][0], 7, 7),\n                (temp_kernel[0][1][0], 7, 7),\n            ),\n            stem_conv_strides=((1, 2, 2), (1, 2, 2)),\n            stem_pool=nn.MaxPool3d,\n            stem_pool_kernel_sizes=((1, 3, 3), (1, 3, 3)),\n            stem_pool_strides=((1, 2, 2), (1, 2, 2)),\n            # Stage configs.\n            stage_conv_a_kernel_sizes=stage_conv_a_kernel_sizes,\n            stage_conv_b_kernel_sizes=(\n                ((1, 3, 3), (1, 3, 3), 
(1, 3, 3), (1, 3, 3)),\n                ((1, 3, 3), (1, 3, 3), (1, 3, 3), (1, 3, 3)),\n            ),\n            stage_conv_b_num_groups=(\n                (num_groups, num_groups, num_groups, num_groups),\n                (num_groups, num_groups, num_groups, num_groups),\n            ),\n            stage_conv_b_dilations=(\n                (\n                    (1, spatial_dilations[0][0], spatial_dilations[0][0]),\n                    (1, spatial_dilations[1][0], spatial_dilations[1][0]),\n                    (1, spatial_dilations[2][0], spatial_dilations[2][0]),\n                    (1, spatial_dilations[3][0], spatial_dilations[3][0]),\n                ),\n                (\n                    (1, spatial_dilations[0][1], spatial_dilations[0][1]),\n                    (1, spatial_dilations[1][1], spatial_dilations[1][1]),\n                    (1, spatial_dilations[2][1], spatial_dilations[2][1]),\n                    (1, spatial_dilations[3][1], spatial_dilations[3][1]),\n                ),\n            ),\n            stage_spatial_strides=(\n                (\n                    spatial_strides[0][0],\n                    spatial_strides[1][0],\n                    spatial_strides[2][0],\n                    spatial_strides[3][0],\n                ),\n                (\n                    spatial_strides[0][1],\n                    spatial_strides[1][1],\n                    spatial_strides[2][1],\n                    spatial_strides[3][1],\n                ),\n            ),\n            stage_temporal_strides=((1, 1, 1, 1), (1, 1, 1, 1)),\n            bottleneck=create_bottleneck_block,\n            # Head configs.\n            head=create_res_basic_head if not self.detection_mode else None,\n            head_pool=nn.AvgPool3d,\n            head_pool_kernel_sizes=head_pool_kernel_sizes,\n            head_activation=None,\n            head_output_with_global_average=False,\n        )\n\n        self.post_act = get_head_act(cfg.MODEL.HEAD_ACT)\n\n    
def forward(self, x, bboxes=None):\n        x = self.model(x)\n        if self.detection_mode:\n            x = self.detection_head(x, bboxes)\n            x = self.post_act(x)\n        else:\n            # Performs fully convolutional inference.\n            if not self.training:\n                x = self.post_act(x)\n                x = x.mean([2, 3, 4])\n        x = x.view(x.shape[0], -1)\n        return x\n\n\n@MODEL_REGISTRY.register()\nclass PTVX3D(nn.Module):\n    def __init__(self, cfg):\n        \"\"\"\n        The `__init__` method of any subclass should also contain these\n            arguments.\n\n        Args:\n            cfg (CfgNode): model building configs, details are in the\n                comments of the config file.\n        \"\"\"\n        super(PTVX3D, self).__init__()\n\n        assert cfg.RESNET.STRIDE_1X1 is False, \"STRIDE_1x1 must be False for PTVX3D\"\n        assert cfg.RESNET.TRANS_FUNC == \"x3d_transform\", (\n            f\"Unsupported TRANS_FUNC type {cfg.RESNET.TRANS_FUNC} for PTVX3D\"\n        )\n        assert cfg.DETECTION.ENABLE is False, (\n            \"Detection model is not supported for PTVX3D yet.\"\n        )\n\n        self._construct_network(cfg)\n\n    def _construct_network(self, cfg):\n        \"\"\"\n        Builds a X3D model.\n\n        Args:\n            cfg (CfgNode): model building configs, details are in the\n                comments of the config file.\n        \"\"\"\n\n        # Params from configs.\n        norm_module = get_norm(cfg)\n        temp_kernel = _TEMPORAL_KERNEL_BASIS[cfg.MODEL.ARCH]\n\n        self.model = create_x3d(\n            # Input clip configs.\n            input_channel=cfg.DATA.INPUT_CHANNEL_NUM[0],\n            input_clip_length=cfg.DATA.NUM_FRAMES,\n            input_crop_size=cfg.DATA.TRAIN_CROP_SIZE,\n            # Model configs.\n            model_num_class=cfg.MODEL.NUM_CLASSES,\n            dropout_rate=cfg.MODEL.DROPOUT_RATE,\n            width_factor=cfg.X3D.WIDTH_FACTOR,\n 
           depth_factor=cfg.X3D.DEPTH_FACTOR,\n            # Normalization configs.\n            norm=norm_module,\n            norm_eps=1e-5,\n            norm_momentum=0.1,\n            # Activation configs.\n            activation=partial(nn.ReLU, inplace=cfg.RESNET.INPLACE_RELU),\n            # Stem configs.\n            stem_dim_in=cfg.X3D.DIM_C1,\n            stem_conv_kernel_size=(temp_kernel[0][0][0], 3, 3),\n            stem_conv_stride=(1, 2, 2),\n            # Stage configs.\n            stage_conv_kernel_size=(\n                (temp_kernel[1][0][0], 3, 3),\n                (temp_kernel[2][0][0], 3, 3),\n                (temp_kernel[3][0][0], 3, 3),\n                (temp_kernel[4][0][0], 3, 3),\n            ),\n            stage_spatial_stride=(2, 2, 2, 2),\n            stage_temporal_stride=(1, 1, 1, 1),\n            bottleneck=create_x3d_bottleneck_block,\n            bottleneck_factor=cfg.X3D.BOTTLENECK_FACTOR,\n            se_ratio=0.0625,\n            inner_act=Swish,\n            # Head configs.\n            head_dim_out=cfg.X3D.DIM_C5,\n            head_pool_act=partial(nn.ReLU, inplace=cfg.RESNET.INPLACE_RELU),\n            head_bn_lin5_on=cfg.X3D.BN_LIN5,\n            head_activation=None,\n            head_output_with_global_average=False,\n        )\n\n        self.post_act = get_head_act(cfg.MODEL.HEAD_ACT)\n\n    def forward(self, x, bboxes=None):\n        x = x[0]\n        x = self.model(x)\n        # Performs fully convolutional inference.\n        if not self.training:\n            x = self.post_act(x)\n            x = x.mean([2, 3, 4])\n\n        x = x.reshape(x.shape[0], -1)\n        return x\n\n\n@MODEL_REGISTRY.register()\nclass PTVCSN(nn.Module):\n    \"\"\"\n    CSN models using PyTorchVideo model builder.\n    \"\"\"\n\n    def __init__(self, cfg):\n        \"\"\"\n        The `__init__` method of any subclass should also contain these\n            arguments.\n\n        Args:\n            cfg (CfgNode): model building configs, 
details are in the\n                comments of the config file.\n        \"\"\"\n        super(PTVCSN, self).__init__()\n\n        assert cfg.DETECTION.ENABLE is False, (\n            \"Detection model is not supported for PTVCSN yet.\"\n        )\n\n        self._construct_network(cfg)\n\n    def _construct_network(self, cfg):\n        \"\"\"\n        Builds a single pathway CSN model.\n\n        Args:\n            cfg (CfgNode): model building configs, details are in the\n                comments of the config file.\n        \"\"\"\n\n        # Params from configs.\n        norm_module = get_norm(cfg)\n\n        self.model = create_csn(\n            # Input clip configs.\n            input_channel=cfg.DATA.INPUT_CHANNEL_NUM[0],\n            # Model configs.\n            model_depth=cfg.RESNET.DEPTH,\n            model_num_class=cfg.MODEL.NUM_CLASSES,\n            dropout_rate=cfg.MODEL.DROPOUT_RATE,\n            # Normalization configs.\n            norm=norm_module,\n            # Activation configs.\n            activation=partial(nn.ReLU, inplace=cfg.RESNET.INPLACE_RELU),\n            # Stem configs.\n            stem_dim_out=cfg.RESNET.WIDTH_PER_GROUP,\n            stem_conv_kernel_size=(3, 7, 7),\n            stem_conv_stride=(1, 2, 2),\n            stem_pool=nn.MaxPool3d,\n            stem_pool_kernel_size=(1, 3, 3),\n            stem_pool_stride=(1, 2, 2),\n            # Stage configs.\n            stage_conv_a_kernel_size=(1, 1, 1),\n            stage_conv_b_kernel_size=(3, 3, 3),\n            stage_conv_b_width_per_group=1,\n            stage_spatial_stride=(1, 2, 2, 2),\n            stage_temporal_stride=(1, 2, 2, 2),\n            bottleneck=create_bottleneck_block,\n            # Head configs.\n            head_pool=nn.AvgPool3d,\n            head_pool_kernel_size=(\n                cfg.DATA.NUM_FRAMES // 8,\n                cfg.DATA.TRAIN_CROP_SIZE // 32,\n                cfg.DATA.TRAIN_CROP_SIZE // 32,\n            ),\n            
head_activation=None,\n            head_output_with_global_average=False,\n        )\n\n        self.post_act = get_head_act(cfg.MODEL.HEAD_ACT)\n\n    def forward(self, x, bboxes=None):\n        x = x[0]\n        x = self.model(x)\n        # Performs fully convolutional inference.\n        if not self.training:\n            x = self.post_act(x)\n            x = x.mean([2, 3, 4])\n\n        x = x.reshape(x.shape[0], -1)\n        return x\n\n\n@MODEL_REGISTRY.register()\nclass PTVR2plus1D(nn.Module):\n    \"\"\"\n    R(2+1)D models using PyTorchVideo model builder.\n    \"\"\"\n\n    def __init__(self, cfg):\n        \"\"\"\n        The `__init__` method of any subclass should also contain these\n            arguments.\n\n        Args:\n            cfg (CfgNode): model building configs, details are in the\n                comments of the config file.\n        \"\"\"\n        super(PTVR2plus1D, self).__init__()\n\n        assert cfg.DETECTION.ENABLE is False, (\n            \"Detection model is not supported for PTVR2plus1D yet.\"\n        )\n\n        self._construct_network(cfg)\n\n    def _construct_network(self, cfg):\n        \"\"\"\n        Builds a single pathway R(2+1)D model.\n\n        Args:\n            cfg (CfgNode): model building configs, details are in the\n                comments of the config file.\n        \"\"\"\n        self.model = create_r2plus1d(\n            # Input clip configs.\n            input_channel=cfg.DATA.INPUT_CHANNEL_NUM[0],\n            # Model configs.\n            model_depth=cfg.RESNET.DEPTH,\n            model_num_class=cfg.MODEL.NUM_CLASSES,\n            dropout_rate=cfg.MODEL.DROPOUT_RATE,\n            # Normalization configs.\n            norm=get_norm(cfg),\n            norm_eps=1e-5,\n            norm_momentum=0.1,\n            # Activation configs.\n            activation=partial(nn.ReLU, inplace=cfg.RESNET.INPLACE_RELU),\n            # Stem configs.\n            stem_dim_out=cfg.RESNET.WIDTH_PER_GROUP,\n            
stem_conv_kernel_size=(1, 7, 7),\n            stem_conv_stride=(1, 2, 2),\n            # Stage configs.\n            stage_conv_a_kernel_size=(\n                (1, 1, 1),\n                (1, 1, 1),\n                (1, 1, 1),\n                (1, 1, 1),\n            ),\n            stage_conv_b_kernel_size=(\n                (3, 3, 3),\n                (3, 3, 3),\n                (3, 3, 3),\n                (3, 3, 3),\n            ),\n            stage_conv_b_num_groups=(1, 1, 1, 1),\n            stage_conv_b_dilation=(\n                (1, 1, 1),\n                (1, 1, 1),\n                (1, 1, 1),\n                (1, 1, 1),\n            ),\n            stage_spatial_stride=(2, 2, 2, 2),\n            stage_temporal_stride=(1, 1, 2, 2),\n            stage_bottleneck=(\n                create_2plus1d_bottleneck_block,\n                create_2plus1d_bottleneck_block,\n                create_2plus1d_bottleneck_block,\n                create_2plus1d_bottleneck_block,\n            ),\n            # Head configs.\n            head_pool=nn.AvgPool3d,\n            head_pool_kernel_size=(\n                cfg.DATA.NUM_FRAMES // 4,\n                cfg.DATA.TRAIN_CROP_SIZE // 32,\n                cfg.DATA.TRAIN_CROP_SIZE // 32,\n            ),\n            head_activation=None,\n            head_output_with_global_average=False,\n        )\n\n        self.post_act = get_head_act(cfg.MODEL.HEAD_ACT)\n\n    def forward(self, x, bboxes=None):\n        x = x[0]\n        x = self.model(x)\n        # Performs fully convolutional inference.\n        if not self.training:\n            x = self.post_act(x)\n            x = x.mean([2, 3, 4])\n\n        x = x.view(x.shape[0], -1)\n        return x\n\n\n@MODEL_REGISTRY.register()\nclass PTVMViT(nn.Module):\n    \"\"\"\n    MViT models using PyTorchVideo model builder.\n    \"\"\"\n\n    def __init__(self, cfg):\n        \"\"\"\n        The `__init__` method of any subclass should also contain these\n            arguments.\n\n      
  Args:\n            cfg (CfgNode): model building configs, details are in the\n                comments of the config file.\n        \"\"\"\n        super(PTVMViT, self).__init__()\n\n        assert cfg.DETECTION.ENABLE is False, (\n            \"Detection model is not supported for PTVMViT yet.\"\n        )\n\n        self._construct_network(cfg)\n\n    def _construct_network(self, cfg):\n        \"\"\"\n        Builds a MViT model.\n\n        Args:\n            cfg (CfgNode): model building configs, details are in the\n                comments of the config file.\n        \"\"\"\n        self.model = create_multiscale_vision_transformers(\n            spatial_size=cfg.DATA.TRAIN_CROP_SIZE,\n            temporal_size=cfg.DATA.NUM_FRAMES,\n            cls_embed_on=cfg.MVIT.CLS_EMBED_ON,\n            sep_pos_embed=cfg.MVIT.SEP_POS_EMBED,\n            depth=cfg.MVIT.DEPTH,\n            norm=cfg.MVIT.NORM,\n            # Patch embed config.\n            input_channels=cfg.DATA.INPUT_CHANNEL_NUM[0],\n            patch_embed_dim=cfg.MVIT.EMBED_DIM,\n            conv_patch_embed_kernel=cfg.MVIT.PATCH_KERNEL,\n            conv_patch_embed_stride=cfg.MVIT.PATCH_STRIDE,\n            conv_patch_embed_padding=cfg.MVIT.PATCH_PADDING,\n            enable_patch_embed_norm=cfg.MVIT.NORM_STEM,\n            use_2d_patch=cfg.MVIT.PATCH_2D,\n            # Attention block config.\n            num_heads=cfg.MVIT.NUM_HEADS,\n            mlp_ratio=cfg.MVIT.MLP_RATIO,\n            qkv_bias=cfg.MVIT.QKV_BIAS,\n            dropout_rate_block=cfg.MVIT.DROPOUT_RATE,\n            droppath_rate_block=cfg.MVIT.DROPPATH_RATE,\n            pooling_mode=cfg.MVIT.MODE,\n            pool_first=cfg.MVIT.POOL_FIRST,\n            embed_dim_mul=cfg.MVIT.DIM_MUL,\n            atten_head_mul=cfg.MVIT.HEAD_MUL,\n            pool_q_stride_size=cfg.MVIT.POOL_Q_STRIDE,\n            pool_kv_stride_size=cfg.MVIT.POOL_KV_STRIDE,\n            pool_kv_stride_adaptive=cfg.MVIT.POOL_KV_STRIDE_ADAPTIVE,\n            
pool_kvq_kernel=cfg.MVIT.POOL_KVQ_KERNEL,\n            # Head config.\n            head_dropout_rate=cfg.MODEL.DROPOUT_RATE,\n            head_num_classes=cfg.MODEL.NUM_CLASSES,\n        )\n\n        self.post_act = get_head_act(cfg.MODEL.HEAD_ACT)\n\n    def forward(self, x, bboxes=None):\n        x = x[0]\n        x = self.model(x)\n\n        if not self.training:\n            x = self.post_act(x)\n\n        return x\n"
  },
  {
    "path": "slowfast/models/resnet_helper.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Video models.\"\"\"\n\nimport torch.nn as nn\nfrom slowfast.models.common import drop_path\nfrom slowfast.models.nonlocal_helper import Nonlocal\nfrom slowfast.models.operators import SE, Swish\n\n\ndef get_trans_func(name):\n    \"\"\"\n    Retrieves the transformation module by name.\n    \"\"\"\n    trans_funcs = {\n        \"bottleneck_transform\": BottleneckTransform,\n        \"basic_transform\": BasicTransform,\n        \"x3d_transform\": X3DTransform,\n    }\n    assert name in trans_funcs.keys(), (\n        \"Transformation function '{}' not supported\".format(name)\n    )\n    return trans_funcs[name]\n\n\nclass BasicTransform(nn.Module):\n    \"\"\"\n    Basic transformation: Tx3x3, 1x3x3, where T is the size of temporal kernel.\n    \"\"\"\n\n    def __init__(\n        self,\n        dim_in,\n        dim_out,\n        temp_kernel_size,\n        stride,\n        dim_inner=None,\n        num_groups=1,\n        stride_1x1=None,\n        inplace_relu=True,\n        eps=1e-5,\n        bn_mmt=0.1,\n        dilation=1,\n        norm_module=nn.BatchNorm3d,\n        block_idx=0,\n    ):\n        \"\"\"\n        Args:\n            dim_in (int): the channel dimensions of the input.\n            dim_out (int): the channel dimension of the output.\n            temp_kernel_size (int): the temporal kernel sizes of the first\n                convolution in the basic block.\n            stride (int): the stride of the bottleneck.\n            dim_inner (None): the inner dimension would not be used in\n                BasicTransform.\n            num_groups (int): number of groups for the convolution. 
Number of\n                group is always 1 for BasicTransform.\n            stride_1x1 (None): stride_1x1 will not be used in BasicTransform.\n            inplace_relu (bool): if True, calculate the relu on the original\n                input without allocating new memory.\n            eps (float): epsilon for batch norm.\n            bn_mmt (float): momentum for batch norm. Noted that BN momentum in\n                PyTorch = 1 - BN momentum in Caffe2.\n            norm_module (nn.Module): nn.Module for the normalization layer. The\n                default is nn.BatchNorm3d.\n        \"\"\"\n        super(BasicTransform, self).__init__()\n        self.temp_kernel_size = temp_kernel_size\n        self._inplace_relu = inplace_relu\n        self._eps = eps\n        self._bn_mmt = bn_mmt\n        self._construct(dim_in, dim_out, stride, dilation, norm_module)\n\n    def _construct(self, dim_in, dim_out, stride, dilation, norm_module):\n        # Tx3x3, BN, ReLU.\n        self.a = nn.Conv3d(\n            dim_in,\n            dim_out,\n            kernel_size=[self.temp_kernel_size, 3, 3],\n            stride=[1, stride, stride],\n            padding=[int(self.temp_kernel_size // 2), 1, 1],\n            bias=False,\n        )\n        self.a_bn = norm_module(\n            num_features=dim_out, eps=self._eps, momentum=self._bn_mmt\n        )\n        self.a_relu = nn.ReLU(inplace=self._inplace_relu)\n        # 1x3x3, BN.\n        self.b = nn.Conv3d(\n            dim_out,\n            dim_out,\n            kernel_size=[1, 3, 3],\n            stride=[1, 1, 1],\n            padding=[0, dilation, dilation],\n            dilation=[1, dilation, dilation],\n            bias=False,\n        )\n\n        self.b.final_conv = True\n\n        self.b_bn = norm_module(\n            num_features=dim_out, eps=self._eps, momentum=self._bn_mmt\n        )\n\n        self.b_bn.transform_final_bn = True\n\n    def forward(self, x):\n        x = self.a(x)\n        x = self.a_bn(x)\n        
x = self.a_relu(x)\n\n        x = self.b(x)\n        x = self.b_bn(x)\n        return x\n\n\nclass X3DTransform(nn.Module):\n    \"\"\"\n    X3D transformation: 1x1x1, Tx3x3 (channelwise, num_groups=dim_in), 1x1x1,\n        augmented with (optional) SE (squeeze-excitation) on the Tx3x3 output.\n        T is the temporal kernel size (defaulting to 3)\n    \"\"\"\n\n    def __init__(\n        self,\n        dim_in,\n        dim_out,\n        temp_kernel_size,\n        stride,\n        dim_inner,\n        num_groups,\n        stride_1x1=False,\n        inplace_relu=True,\n        eps=1e-5,\n        bn_mmt=0.1,\n        dilation=1,\n        norm_module=nn.BatchNorm3d,\n        se_ratio=0.0625,\n        swish_inner=True,\n        block_idx=0,\n    ):\n        \"\"\"\n        Args:\n            dim_in (int): the channel dimension of the input.\n            dim_out (int): the channel dimension of the output.\n            temp_kernel_size (int): the temporal kernel size of the middle\n                convolution in the bottleneck.\n            stride (int): the stride of the bottleneck.\n            dim_inner (int): the inner dimension of the block.\n            num_groups (int): number of groups for the convolution. num_groups=1\n                is for standard ResNet like networks, and num_groups>1 is for\n                ResNeXt like networks.\n            stride_1x1 (bool): if True, apply stride to 1x1 conv, otherwise\n                apply stride to the 3x3 conv.\n            inplace_relu (bool): if True, calculate the relu on the original\n                input without allocating new memory.\n            eps (float): epsilon for batch norm.\n            bn_mmt (float): momentum for batch norm. Note that BN momentum in\n                PyTorch = 1 - BN momentum in Caffe2.\n            dilation (int): size of dilation.\n            norm_module (nn.Module): nn.Module for the normalization layer. 
The\n                default is nn.BatchNorm3d.\n            se_ratio (float): if > 0, apply SE to the Tx3x3 conv, with the SE\n                channel dimensionality being se_ratio times the Tx3x3 conv dim.\n            swish_inner (bool): if True, apply swish to the Tx3x3 conv, otherwise\n                apply ReLU to the Tx3x3 conv.\n        \"\"\"\n        super(X3DTransform, self).__init__()\n        self.temp_kernel_size = temp_kernel_size\n        self._inplace_relu = inplace_relu\n        self._eps = eps\n        self._bn_mmt = bn_mmt\n        self._se_ratio = se_ratio\n        self._swish_inner = swish_inner\n        self._stride_1x1 = stride_1x1\n        self._block_idx = block_idx\n        self._construct(\n            dim_in,\n            dim_out,\n            stride,\n            dim_inner,\n            num_groups,\n            dilation,\n            norm_module,\n        )\n\n    def _construct(\n        self,\n        dim_in,\n        dim_out,\n        stride,\n        dim_inner,\n        num_groups,\n        dilation,\n        norm_module,\n    ):\n        (str1x1, str3x3) = (stride, 1) if self._stride_1x1 else (1, stride)\n\n        # 1x1x1, BN, ReLU.\n        self.a = nn.Conv3d(\n            dim_in,\n            dim_inner,\n            kernel_size=[1, 1, 1],\n            stride=[1, str1x1, str1x1],\n            padding=[0, 0, 0],\n            bias=False,\n        )\n        self.a_bn = norm_module(\n            num_features=dim_inner, eps=self._eps, momentum=self._bn_mmt\n        )\n        self.a_relu = nn.ReLU(inplace=self._inplace_relu)\n\n        # Tx3x3, BN, ReLU.\n        self.b = nn.Conv3d(\n            dim_inner,\n            dim_inner,\n            [self.temp_kernel_size, 3, 3],\n            stride=[1, str3x3, str3x3],\n            padding=[int(self.temp_kernel_size // 2), dilation, dilation],\n            groups=num_groups,\n            bias=False,\n            dilation=[1, dilation, dilation],\n        )\n        self.b_bn = 
norm_module(\n            num_features=dim_inner, eps=self._eps, momentum=self._bn_mmt\n        )\n\n        # Apply SE attention or not\n        use_se = True if (self._block_idx + 1) % 2 else False\n        if self._se_ratio > 0.0 and use_se:\n            self.se = SE(dim_inner, self._se_ratio)\n\n        if self._swish_inner:\n            self.b_relu = Swish()\n        else:\n            self.b_relu = nn.ReLU(inplace=self._inplace_relu)\n\n        # 1x1x1, BN.\n        self.c = nn.Conv3d(\n            dim_inner,\n            dim_out,\n            kernel_size=[1, 1, 1],\n            stride=[1, 1, 1],\n            padding=[0, 0, 0],\n            bias=False,\n        )\n        self.c_bn = norm_module(\n            num_features=dim_out, eps=self._eps, momentum=self._bn_mmt\n        )\n        self.c_bn.transform_final_bn = True\n\n    def forward(self, x):\n        for block in self.children():\n            x = block(x)\n        return x\n\n\nclass BottleneckTransform(nn.Module):\n    \"\"\"\n    Bottleneck transformation: Tx1x1, 1x3x3, 1x1x1, where T is the size of\n        temporal kernel.\n    \"\"\"\n\n    def __init__(\n        self,\n        dim_in,\n        dim_out,\n        temp_kernel_size,\n        stride,\n        dim_inner,\n        num_groups,\n        stride_1x1=False,\n        inplace_relu=True,\n        eps=1e-5,\n        bn_mmt=0.1,\n        dilation=1,\n        norm_module=nn.BatchNorm3d,\n        block_idx=0,\n    ):\n        \"\"\"\n        Args:\n            dim_in (int): the channel dimensions of the input.\n            dim_out (int): the channel dimension of the output.\n            temp_kernel_size (int): the temporal kernel sizes of the first\n                convolution in the bottleneck.\n            stride (int): the stride of the bottleneck.\n            dim_inner (int): the inner dimension of the block.\n            num_groups (int): number of groups for the convolution. 
num_groups=1\n                is for standard ResNet like networks, and num_groups>1 is for\n                ResNeXt like networks.\n            stride_1x1 (bool): if True, apply stride to 1x1 conv, otherwise\n                apply stride to the 3x3 conv.\n            inplace_relu (bool): if True, calculate the relu on the original\n                input without allocating new memory.\n            eps (float): epsilon for batch norm.\n            bn_mmt (float): momentum for batch norm. Note that BN momentum in\n                PyTorch = 1 - BN momentum in Caffe2.\n            dilation (int): size of dilation.\n            norm_module (nn.Module): nn.Module for the normalization layer. The\n                default is nn.BatchNorm3d.\n        \"\"\"\n        super(BottleneckTransform, self).__init__()\n        self.temp_kernel_size = temp_kernel_size\n        self._inplace_relu = inplace_relu\n        self._eps = eps\n        self._bn_mmt = bn_mmt\n        self._stride_1x1 = stride_1x1\n        self._construct(\n            dim_in,\n            dim_out,\n            stride,\n            dim_inner,\n            num_groups,\n            dilation,\n            norm_module,\n        )\n\n    def _construct(\n        self,\n        dim_in,\n        dim_out,\n        stride,\n        dim_inner,\n        num_groups,\n        dilation,\n        norm_module,\n    ):\n        (str1x1, str3x3) = (stride, 1) if self._stride_1x1 else (1, stride)\n\n        # Tx1x1, BN, ReLU.\n        self.a = nn.Conv3d(\n            dim_in,\n            dim_inner,\n            kernel_size=[self.temp_kernel_size, 1, 1],\n            stride=[1, str1x1, str1x1],\n            padding=[int(self.temp_kernel_size // 2), 0, 0],\n            bias=False,\n        )\n        self.a_bn = norm_module(\n            num_features=dim_inner, eps=self._eps, momentum=self._bn_mmt\n        )\n        self.a_relu = nn.ReLU(inplace=self._inplace_relu)\n\n        # 1x3x3, BN, ReLU.\n        self.b = nn.Conv3d(\n      
      dim_inner,\n            dim_inner,\n            [1, 3, 3],\n            stride=[1, str3x3, str3x3],\n            padding=[0, dilation, dilation],\n            groups=num_groups,\n            bias=False,\n            dilation=[1, dilation, dilation],\n        )\n        self.b_bn = norm_module(\n            num_features=dim_inner, eps=self._eps, momentum=self._bn_mmt\n        )\n        self.b_relu = nn.ReLU(inplace=self._inplace_relu)\n\n        # 1x1x1, BN.\n        self.c = nn.Conv3d(\n            dim_inner,\n            dim_out,\n            kernel_size=[1, 1, 1],\n            stride=[1, 1, 1],\n            padding=[0, 0, 0],\n            bias=False,\n        )\n        self.c.final_conv = True\n\n        self.c_bn = norm_module(\n            num_features=dim_out, eps=self._eps, momentum=self._bn_mmt\n        )\n        self.c_bn.transform_final_bn = True\n\n    def forward(self, x):\n        # Explicitly forward every layer.\n        # Branch2a.\n        x = self.a(x)\n        x = self.a_bn(x)\n        x = self.a_relu(x)\n\n        # Branch2b.\n        x = self.b(x)\n        x = self.b_bn(x)\n        x = self.b_relu(x)\n\n        # Branch2c.\n        x = self.c(x)\n        x = self.c_bn(x)\n        return x\n\n\nclass ResBlock(nn.Module):\n    \"\"\"\n    Residual block.\n    \"\"\"\n\n    def __init__(\n        self,\n        dim_in,\n        dim_out,\n        temp_kernel_size,\n        stride,\n        trans_func,\n        dim_inner,\n        num_groups=1,\n        stride_1x1=False,\n        inplace_relu=True,\n        eps=1e-5,\n        bn_mmt=0.1,\n        dilation=1,\n        norm_module=nn.BatchNorm3d,\n        block_idx=0,\n        drop_connect_rate=0.0,\n    ):\n        \"\"\"\n        ResBlock class constructs residual blocks. 
More details can be found in:\n            Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.\n            \"Deep residual learning for image recognition.\"\n            https://arxiv.org/abs/1512.03385\n        Args:\n            dim_in (int): the channel dimension of the input.\n            dim_out (int): the channel dimension of the output.\n            temp_kernel_size (int): the temporal kernel size of the middle\n                convolution in the bottleneck.\n            stride (int): the stride of the bottleneck.\n            trans_func (callable): transform class used to construct the\n                bottleneck, e.g. the result of get_trans_func.\n            dim_inner (int): the inner dimension of the block.\n            num_groups (int): number of groups for the convolution. num_groups=1\n                is for standard ResNet like networks, and num_groups>1 is for\n                ResNeXt like networks.\n            stride_1x1 (bool): if True, apply stride to 1x1 conv, otherwise\n                apply stride to the 3x3 conv.\n            inplace_relu (bool): if True, calculate the relu on the original\n                input without allocating new memory.\n            eps (float): epsilon for batch norm.\n            bn_mmt (float): momentum for batch norm. Note that BN momentum in\n                PyTorch = 1 - BN momentum in Caffe2.\n            dilation (int): size of dilation.\n            norm_module (nn.Module): nn.Module for the normalization layer. 
The\n                default is nn.BatchNorm3d.\n            drop_connect_rate (float): basic rate at which blocks are dropped,\n                linearly increases from input to output blocks.\n        \"\"\"\n        super(ResBlock, self).__init__()\n        self._inplace_relu = inplace_relu\n        self._eps = eps\n        self._bn_mmt = bn_mmt\n        self._drop_connect_rate = drop_connect_rate\n        self._construct(\n            dim_in,\n            dim_out,\n            temp_kernel_size,\n            stride,\n            trans_func,\n            dim_inner,\n            num_groups,\n            stride_1x1,\n            inplace_relu,\n            dilation,\n            norm_module,\n            block_idx,\n        )\n\n    def _construct(\n        self,\n        dim_in,\n        dim_out,\n        temp_kernel_size,\n        stride,\n        trans_func,\n        dim_inner,\n        num_groups,\n        stride_1x1,\n        inplace_relu,\n        dilation,\n        norm_module,\n        block_idx,\n    ):\n        # Use skip connection with projection if dim or res change.\n        if (dim_in != dim_out) or (stride != 1):\n            self.branch1 = nn.Conv3d(\n                dim_in,\n                dim_out,\n                kernel_size=1,\n                stride=[1, stride, stride],\n                padding=0,\n                bias=False,\n                dilation=1,\n            )\n            self.branch1_bn = norm_module(\n                num_features=dim_out, eps=self._eps, momentum=self._bn_mmt\n            )\n        self.branch2 = trans_func(\n            dim_in,\n            dim_out,\n            temp_kernel_size,\n            stride,\n            dim_inner,\n            num_groups,\n            stride_1x1=stride_1x1,\n            inplace_relu=inplace_relu,\n            dilation=dilation,\n            norm_module=norm_module,\n            block_idx=block_idx,\n        )\n        self.relu = nn.ReLU(self._inplace_relu)\n\n    def forward(self, x):\n  
      f_x = self.branch2(x)\n        if self.training and self._drop_connect_rate > 0.0:\n            f_x = drop_path(f_x, self._drop_connect_rate)\n        if hasattr(self, \"branch1\"):\n            x = self.branch1_bn(self.branch1(x)) + f_x\n        else:\n            x = x + f_x\n        x = self.relu(x)\n        return x\n\n\nclass ResStage(nn.Module):\n    \"\"\"\n    Stage of 3D ResNet. It expects to have one or more tensors as input for\n        single pathway (C2D, I3D, Slow), and multi-pathway (SlowFast) cases.\n        More details can be found here:\n\n        Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He.\n        \"SlowFast networks for video recognition.\"\n        https://arxiv.org/pdf/1812.03982.pdf\n    \"\"\"\n\n    def __init__(\n        self,\n        dim_in,\n        dim_out,\n        stride,\n        temp_kernel_sizes,\n        num_blocks,\n        dim_inner,\n        num_groups,\n        num_block_temp_kernel,\n        nonlocal_inds,\n        nonlocal_group,\n        nonlocal_pool,\n        dilation,\n        instantiation=\"softmax\",\n        trans_func_name=\"bottleneck_transform\",\n        stride_1x1=False,\n        inplace_relu=True,\n        norm_module=nn.BatchNorm3d,\n        drop_connect_rate=0.0,\n    ):\n        \"\"\"\n        The `__init__` method of any subclass should also contain these arguments.\n        ResStage builds p streams, where p can be greater or equal to one.\n        Args:\n            dim_in (list): list of p the channel dimensions of the input.\n                Different channel dimensions control the input dimension of\n                different pathways.\n            dim_out (list): list of p the channel dimensions of the output.\n                Different channel dimensions control the input dimension of\n                different pathways.\n            temp_kernel_sizes (list): list of the p temporal kernel sizes of the\n                convolution in the bottleneck. 
Different temp_kernel_sizes\n                control different pathways.\n            stride (list): list of the p strides of the bottleneck. Different\n                strides control different pathways.\n            num_blocks (list): list of the p numbers of blocks, one for each\n                pathway.\n            dim_inner (list): list of the p inner channel dimensions of the\n                input. Different channel dimensions control the input dimension\n                of different pathways.\n            num_groups (list): list of the p numbers of groups for the\n                convolution. num_groups=1 is for standard ResNet like networks,\n                and num_groups>1 is for ResNeXt like networks.\n            num_block_temp_kernel (list): extend the temp_kernel_sizes to\n                num_block_temp_kernel blocks, then fill temporal kernel size\n                of 1 for the rest of the layers.\n            nonlocal_inds (list): If the list is empty, no nonlocal layer will\n                be added. If the list is not empty, add nonlocal layers after\n                the index-th block.\n            dilation (list): size of dilation for each pathway.\n            nonlocal_group (list): list of the p numbers of nonlocal groups.\n                Each number controls how to fold temporal dimension to batch\n                dimension before applying nonlocal transformation.\n                https://github.com/facebookresearch/video-nonlocal-net.\n            instantiation (string): different instantiation for nonlocal layer.\n                Supports two different instantiation methods:\n                    \"dot_product\": normalizing correlation matrix with L2.\n                    \"softmax\": normalizing correlation matrix with Softmax.\n            trans_func_name (string): name of the transformation function to\n                apply on the network.\n            norm_module (nn.Module): nn.Module for the normalization layer. 
The\n                default is nn.BatchNorm3d.\n            drop_connect_rate (float): basic rate at which blocks are dropped,\n                linearly increases from input to output blocks.\n        \"\"\"\n        super(ResStage, self).__init__()\n        assert all(\n            (\n                num_block_temp_kernel[i] <= num_blocks[i]\n                for i in range(len(temp_kernel_sizes))\n            )\n        )\n        self.num_blocks = num_blocks\n        self.nonlocal_group = nonlocal_group\n        self._drop_connect_rate = drop_connect_rate\n        self.temp_kernel_sizes = [\n            (temp_kernel_sizes[i] * num_blocks[i])[: num_block_temp_kernel[i]]\n            + [1] * (num_blocks[i] - num_block_temp_kernel[i])\n            for i in range(len(temp_kernel_sizes))\n        ]\n        assert (\n            len(\n                {\n                    len(dim_in),\n                    len(dim_out),\n                    len(temp_kernel_sizes),\n                    len(stride),\n                    len(num_blocks),\n                    len(dim_inner),\n                    len(num_groups),\n                    len(num_block_temp_kernel),\n                    len(nonlocal_inds),\n                    len(nonlocal_group),\n                }\n            )\n            == 1\n        )\n        self.num_pathways = len(self.num_blocks)\n        self._construct(\n            dim_in,\n            dim_out,\n            stride,\n            dim_inner,\n            num_groups,\n            trans_func_name,\n            stride_1x1,\n            inplace_relu,\n            nonlocal_inds,\n            nonlocal_pool,\n            instantiation,\n            dilation,\n            norm_module,\n        )\n\n    def _construct(\n        self,\n        dim_in,\n        dim_out,\n        stride,\n        dim_inner,\n        num_groups,\n        trans_func_name,\n        stride_1x1,\n        inplace_relu,\n        nonlocal_inds,\n        nonlocal_pool,\n        
instantiation,\n        dilation,\n        norm_module,\n    ):\n        for pathway in range(self.num_pathways):\n            for i in range(self.num_blocks[pathway]):\n                # Retrieve the transformation function.\n                trans_func = get_trans_func(trans_func_name)\n                # Construct the block.\n                res_block = ResBlock(\n                    dim_in[pathway] if i == 0 else dim_out[pathway],\n                    dim_out[pathway],\n                    self.temp_kernel_sizes[pathway][i],\n                    stride[pathway] if i == 0 else 1,\n                    trans_func,\n                    dim_inner[pathway],\n                    num_groups[pathway],\n                    stride_1x1=stride_1x1,\n                    inplace_relu=inplace_relu,\n                    dilation=dilation[pathway],\n                    norm_module=norm_module,\n                    block_idx=i,\n                    drop_connect_rate=self._drop_connect_rate,\n                )\n                self.add_module(\"pathway{}_res{}\".format(pathway, i), res_block)\n                if i in nonlocal_inds[pathway]:\n                    nln = Nonlocal(\n                        dim_out[pathway],\n                        dim_out[pathway] // 2,\n                        nonlocal_pool[pathway],\n                        instantiation=instantiation,\n                        norm_module=norm_module,\n                    )\n                    self.add_module(\"pathway{}_nonlocal{}\".format(pathway, i), nln)\n\n    def forward(self, inputs):\n        output = []\n        for pathway in range(self.num_pathways):\n            x = inputs[pathway]\n            for i in range(self.num_blocks[pathway]):\n                m = getattr(self, \"pathway{}_res{}\".format(pathway, i))\n                x = m(x)\n                if hasattr(self, \"pathway{}_nonlocal{}\".format(pathway, i)):\n                    nln = getattr(self, \"pathway{}_nonlocal{}\".format(pathway, i))\n       
             b, c, t, h, w = x.shape\n                    if self.nonlocal_group[pathway] > 1:\n                        # Fold temporal dimension into batch dimension.\n                        x = x.permute(0, 2, 1, 3, 4)\n                        x = x.reshape(\n                            b * self.nonlocal_group[pathway],\n                            t // self.nonlocal_group[pathway],\n                            c,\n                            h,\n                            w,\n                        )\n                        x = x.permute(0, 2, 1, 3, 4)\n                    x = nln(x)\n                    if self.nonlocal_group[pathway] > 1:\n                        # Fold back to temporal dimension.\n                        x = x.permute(0, 2, 1, 3, 4)\n                        x = x.reshape(b, t, c, h, w)\n                        x = x.permute(0, 2, 1, 3, 4)\n            output.append(x)\n\n        return output\n"
  },
  {
    "path": "slowfast/models/reversible_mvit.py",
    "content": "import sys\nfrom functools import partial\n\nimport torch\nfrom slowfast.models.attention import attention_pool, MultiScaleAttention\nfrom slowfast.models.common import drop_path, Mlp, TwoStreamFusion\nfrom slowfast.models.utils import round_width\nfrom torch import nn\nfrom torch.autograd import Function as Function\n\n\nclass ReversibleMViT(nn.Module):\n    \"\"\"\n    Reversible model builder. This builds the reversible transformer encoder\n    and allows reversible training.\n\n    Karttikeya Mangalam, Haoqi Fan, Yanghao Li, Chao-Yuan Wu, Bo Xiong,\n    Christoph Feichtenhofer, Jitendra Malik\n    \"Reversible Vision Transformers\"\n\n    https://openaccess.thecvf.com/content/CVPR2022/papers/Mangalam_Reversible_Vision_Transformers_CVPR_2022_paper.pdf\n    \"\"\"\n\n    def __init__(self, config, model):\n        \"\"\"\n        The `__init__` method of any subclass should also contain these\n            arguments.\n        Args:\n            cfg (CfgNode): model building configs, details are in the\n                comments of the config file.\n            model (nn.Module): parent MViT module this module forms\n                a reversible encoder in.\n        \"\"\"\n\n        super().__init__()\n        self.cfg = config\n\n        embed_dim = self.cfg.MVIT.EMBED_DIM\n        depth = self.cfg.MVIT.DEPTH\n        num_heads = self.cfg.MVIT.NUM_HEADS\n        mlp_ratio = self.cfg.MVIT.MLP_RATIO\n        qkv_bias = self.cfg.MVIT.QKV_BIAS\n\n        drop_path_rate = self.cfg.MVIT.DROPPATH_RATE\n        self.dropout = config.MVIT.DROPOUT_RATE\n        self.pre_q_fusion = self.cfg.MVIT.REV.PRE_Q_FUSION\n        dpr = [\n            x.item() for x in torch.linspace(0, drop_path_rate, depth)\n        ]  # stochastic depth decay rule\n\n        input_size = model.patch_dims\n\n        self.layers = nn.ModuleList([])\n        self.no_custom_backward = False\n\n        if self.cfg.MVIT.NORM == \"layernorm\":\n            norm_layer = 
partial(nn.LayerNorm, eps=1e-6)\n        else:\n            raise NotImplementedError(\"Only supports layernorm.\")\n\n        dim_mul, head_mul = torch.ones(depth + 1), torch.ones(depth + 1)\n        for i in range(len(self.cfg.MVIT.DIM_MUL)):\n            dim_mul[self.cfg.MVIT.DIM_MUL[i][0]] = self.cfg.MVIT.DIM_MUL[i][1]\n        for i in range(len(self.cfg.MVIT.HEAD_MUL)):\n            head_mul[self.cfg.MVIT.HEAD_MUL[i][0]] = self.cfg.MVIT.HEAD_MUL[i][1]\n\n        pool_q = model.pool_q\n        pool_kv = model.pool_kv\n        stride_q = model.stride_q\n        stride_kv = model.stride_kv\n\n        for i in range(depth):\n            num_heads = round_width(num_heads, head_mul[i])\n\n            # Upsampling inside the MHPA, input to the Q-pooling block is lower C dimension\n            # This localizes the feature changes in a single block, making more computation reversible.\n            embed_dim = round_width(\n                embed_dim, dim_mul[i - 1] if i > 0 else 1.0, divisor=num_heads\n            )\n            dim_out = round_width(\n                embed_dim,\n                dim_mul[i],\n                divisor=round_width(num_heads, head_mul[i + 1]),\n            )\n\n            if i in self.cfg.MVIT.REV.BUFFER_LAYERS:\n                layer_type = StageTransitionBlock\n                input_mult = 2 if \"concat\" in self.pre_q_fusion else 1\n            else:\n                layer_type = ReversibleBlock\n                input_mult = 1\n\n            dimout_correction = (\n                2 if (input_mult == 2 and \"concat\" in self.pre_q_fusion) else 1\n            )\n\n            self.layers.append(\n                layer_type(\n                    dim=embed_dim\n                    * input_mult,  # added only for concat fusion before Qpooling layers\n                    input_size=input_size,\n                    dim_out=dim_out * input_mult // dimout_correction,\n                    num_heads=num_heads,\n                    cfg=self.cfg,\n  
                  mlp_ratio=mlp_ratio,\n                    qkv_bias=qkv_bias,\n                    drop_path=dpr[i],\n                    norm_layer=norm_layer,\n                    kernel_q=pool_q[i] if len(pool_q) > i else [],\n                    kernel_kv=pool_kv[i] if len(pool_kv) > i else [],\n                    stride_q=stride_q[i] if len(stride_q) > i else [],\n                    stride_kv=stride_kv[i] if len(stride_kv) > i else [],\n                    layer_id=i,\n                    pre_q_fusion=self.pre_q_fusion,\n                )\n            )\n            # F is the attention block\n            self.layers[-1].F.thw = input_size\n\n            if len(stride_q[i]) > 0:\n                input_size = [\n                    size // stride for size, stride in zip(input_size, stride_q[i])\n                ]\n\n        embed_dim = dim_out\n\n    @staticmethod\n    def vanilla_backward(h, layers, buffer):\n        \"\"\"\n        Using rev layers without rev backpropagation. Debugging purposes only.\n        Activated with self.no_custom_backward.\n        \"\"\"\n\n        # split into hidden states (h) and attention_output (a)\n        h, a = torch.chunk(h, 2, dim=-1)\n        for _, layer in enumerate(layers):\n            a, h = layer(a, h)\n\n        return torch.cat([a, h], dim=-1)\n\n    def forward(self, x):\n        # process the layers in a reversible stack and an irreversible stack.\n        stack = []\n        for l_i in range(len(self.layers)):\n            if isinstance(self.layers[l_i], StageTransitionBlock):\n                stack.append((\"StageTransition\", l_i))\n            else:\n                if len(stack) == 0 or stack[-1][0] == \"StageTransition\":\n                    stack.append((\"Reversible\", []))\n                stack[-1][1].append(l_i)\n\n        for layer_seq in stack:\n            if layer_seq[0] == \"StageTransition\":\n                x = self.layers[layer_seq[1]](x)\n\n            else:\n                x = 
torch.cat([x, x], dim=-1)\n\n                # no need for custom backprop in eval/model stat log\n                if not self.training or self.no_custom_backward:\n                    executing_fn = ReversibleMViT.vanilla_backward\n                else:\n                    executing_fn = RevBackProp.apply\n\n                x = executing_fn(\n                    x,\n                    self.layers[layer_seq[1][0] : layer_seq[1][-1] + 1],\n                    [],  # buffer activations\n                )\n\n        # Apply dropout\n        x = nn.functional.dropout(x, p=self.dropout, training=self.training)\n\n        return x\n\n\nclass RevBackProp(Function):\n    \"\"\"\n    Custom Backpropagation function to allow (A) flushing memory in forward\n    and (B) activation recomputation reversibly in backward for gradient calculation.\n\n    Inspired by https://github.com/RobinBruegger/RevTorch/blob/master/revtorch/revtorch.py\n    \"\"\"\n\n    @staticmethod\n    def forward(\n        ctx,\n        x,\n        layers,\n        buffer_layers,  # List of ids of layers whose activations are buffered\n    ):\n        \"\"\"\n        Reversible Forward pass. Any intermediate activations from `buffer_layers` are\n        cached in ctx for forward pass. 
This is not necessary for standard use cases.\n        Each reversible layer implements its own forward pass logic.\n        \"\"\"\n        buffer_layers.sort()\n\n        X_1, X_2 = torch.chunk(x, 2, dim=-1)\n\n        intermediate = []\n\n        for layer in layers:\n            X_1, X_2 = layer(X_1, X_2)\n\n            if layer.layer_id in buffer_layers:\n                intermediate.extend([X_1.detach(), X_2.detach()])\n\n        if len(buffer_layers) == 0:\n            all_tensors = [X_1.detach(), X_2.detach()]\n        else:\n            intermediate = [torch.LongTensor(buffer_layers), *intermediate]\n            all_tensors = [X_1.detach(), X_2.detach(), *intermediate]\n\n        ctx.save_for_backward(*all_tensors)\n        ctx.layers = layers\n\n        return torch.cat([X_1, X_2], dim=-1)\n\n    @staticmethod\n    def backward(ctx, dx):\n        \"\"\"\n        Reversible Backward pass. Any intermediate activations from `buffer_layers` are\n        recovered from ctx. Each layer implements its own logic for backward pass (both\n        activation recomputation and grad calculation).\n        \"\"\"\n        dX_1, dX_2 = torch.chunk(dx, 2, dim=-1)\n\n        # retrieve params from ctx for backward\n        X_1, X_2, *int_tensors = ctx.saved_tensors\n\n        # recover the buffered layer ids if any activations were buffered\n        if len(int_tensors) != 0:\n            buffer_layers = int_tensors[0].tolist()\n\n        else:\n            buffer_layers = []\n\n        layers = ctx.layers\n\n        for _, layer in enumerate(layers[::-1]):\n            if layer.layer_id in buffer_layers:\n                X_1, X_2, dX_1, dX_2 = layer.backward_pass(\n                    Y_1=int_tensors[buffer_layers.index(layer.layer_id) * 2 + 1],\n                    Y_2=int_tensors[buffer_layers.index(layer.layer_id) * 2 + 2],\n                    dY_1=dX_1,\n                    dY_2=dX_2,\n                )\n\n            else:\n                X_1, X_2, dX_1, dX_2 = layer.backward_pass(\n                    Y_1=X_1,\n     
               Y_2=X_2,\n                    dY_1=dX_1,\n                    dY_2=dX_2,\n                )\n\n        dx = torch.cat([dX_1, dX_2], dim=-1)\n\n        del int_tensors\n        del dX_1, dX_2, X_1, X_2\n\n        return dx, None, None\n\n\nclass StageTransitionBlock(nn.Module):\n    \"\"\"\n    Blocks for changing the feature dimensions in MViT (using Q-pooling).\n    See Section 3.3.1 in paper for details.\n    \"\"\"\n\n    def __init__(\n        self,\n        dim,\n        input_size,\n        dim_out,\n        num_heads,\n        mlp_ratio,\n        qkv_bias,\n        drop_path,\n        kernel_q,\n        kernel_kv,\n        stride_q,\n        stride_kv,\n        cfg,\n        norm_layer=nn.LayerNorm,\n        pre_q_fusion=None,\n        layer_id=0,\n    ):\n        \"\"\"\n        Uses the same structure of F and G functions as Reversible Block except\n        without using reversible forward (and backward) pass.\n        \"\"\"\n        super().__init__()\n\n        self.drop_path_rate = drop_path\n\n        embed_dim = dim\n\n        self.F = AttentionSubBlock(\n            dim=embed_dim,\n            input_size=input_size,\n            num_heads=num_heads,\n            cfg=cfg,\n            dim_out=dim_out,\n            kernel_q=kernel_q,\n            kernel_kv=kernel_kv,\n            stride_q=stride_q,\n            stride_kv=stride_kv,\n            norm_layer=norm_layer,\n        )\n\n        self.G = MLPSubblock(\n            dim=dim_out,\n            mlp_ratio=mlp_ratio,\n            norm_layer=norm_layer,\n        )\n\n        self.layer_id = layer_id\n\n        self.is_proj = False\n        self.has_cls_embed = cfg.MVIT.CLS_EMBED_ON\n\n        self.is_conv = False\n        self.pool_first = cfg.MVIT.POOL_FIRST\n        self.mode = cfg.MVIT.MODE\n        self.pre_q_fuse = TwoStreamFusion(pre_q_fusion, dim=dim)\n\n        if cfg.MVIT.REV.RES_PATH == \"max\":\n            self.res_conv = False\n            self.pool_skip = nn.MaxPool3d(\n  
              # self.attention.attn.pool_q.kernel_size,\n                [s + 1 if s > 1 else s for s in self.F.attn.pool_q.stride],\n                self.F.attn.pool_q.stride,\n                [int(k // 2) for k in self.F.attn.pool_q.stride],\n                # self.attention.attn.pool_q.padding,\n                ceil_mode=False,\n            )\n\n        elif cfg.MVIT.REV.RES_PATH == \"conv\":\n            self.res_conv = True\n        else:\n            raise NotImplementedError\n\n        # Add a linear projection in residual branch\n        if embed_dim != dim_out:\n            self.is_proj = True\n            self.res_proj = nn.Linear(embed_dim, dim_out, bias=True)\n\n    def forward(\n        self,\n        x,\n    ):\n        \"\"\"\n        Forward logic is similar to MultiScaleBlock with Q-pooling.\n        \"\"\"\n        x = self.pre_q_fuse(x)\n\n        # fork tensor for residual connections\n        x_res = x\n\n        # This uses conv to pool the residual hidden features\n        # but done before pooling only if not pool_first\n        if self.is_proj and not self.pool_first:\n            x_res = self.res_proj(x_res)\n\n        if self.res_conv:\n            # Pooling the hidden features with the same conv as Q\n            N, L, C = x_res.shape\n\n            # This handling is the same as that of q in MultiScaleAttention\n            if self.mode == \"conv_unshared\":\n                fold_dim = 1\n            else:\n                fold_dim = self.F.attn.num_heads\n\n            # Output is (B, N, L, C)\n            x_res = x_res.reshape(N, L, fold_dim, C // fold_dim).permute(0, 2, 1, 3)\n\n            x_res, _ = attention_pool(\n                x_res,\n                self.F.attn.pool_q,\n                # thw_shape = self.attention.attn.thw,\n                thw_shape=self.F.thw,\n                has_cls_embed=self.has_cls_embed,\n                norm=self.F.attn.norm_q if hasattr(self.F.attn, \"norm_q\") else None,\n            )\n            
x_res = x_res.permute(0, 2, 1, 3).reshape(N, x_res.shape[2], C)\n\n        else:\n            # Pooling the hidden features with max op\n            x_res, _ = attention_pool(\n                x_res,\n                self.pool_skip,\n                thw_shape=self.F.attn.thw,\n                has_cls_embed=self.has_cls_embed,\n            )\n\n        # If pool_first then project to higher dim now\n        if self.is_proj and self.pool_first:\n            x_res = self.res_proj(x_res)\n\n        x = self.F(x)\n        x = x_res + x\n        x = x + self.G(x)\n\n        x = drop_path(x, drop_prob=self.drop_path_rate, training=self.training)\n\n        return x\n\n\nclass ReversibleBlock(nn.Module):\n    \"\"\"\n    Reversible Blocks for Reversible Vision Transformer and also\n    for state-preserving blocks in Reversible MViT. See Section\n    3.3.2 in paper for details.\n    \"\"\"\n\n    def __init__(\n        self,\n        dim,\n        input_size,\n        dim_out,\n        num_heads,\n        mlp_ratio,\n        qkv_bias,\n        drop_path,\n        kernel_q,\n        kernel_kv,\n        stride_q,\n        stride_kv,\n        cfg,\n        norm_layer=nn.LayerNorm,\n        layer_id=0,\n        **kwargs,\n    ):\n        \"\"\"\n        Block is composed entirely of function F (Attention\n        sub-block) and G (MLP sub-block) including layernorm.\n        \"\"\"\n        super().__init__()\n\n        self.drop_path_rate = drop_path\n\n        self.F = AttentionSubBlock(\n            dim=dim,\n            input_size=input_size,\n            num_heads=num_heads,\n            cfg=cfg,\n            dim_out=dim_out,\n            kernel_q=kernel_q,\n            kernel_kv=kernel_kv,\n            stride_q=stride_q,\n            stride_kv=stride_kv,\n            norm_layer=norm_layer,\n        )\n\n        self.G = MLPSubblock(\n            dim=dim,\n            mlp_ratio=mlp_ratio,\n            norm_layer=norm_layer,\n        )\n\n        self.layer_id = 
layer_id\n\n        self.seeds = {}\n\n    def seed_cuda(self, key):\n        \"\"\"\n        Fix seeds to allow for stochastic elements such as\n        dropout to be reproduced exactly in activation\n        recomputation in the backward pass.\n        \"\"\"\n\n        # randomize seeds\n        # use cuda generator if available\n        if (\n            hasattr(torch.cuda, \"default_generators\")\n            and len(torch.cuda.default_generators) > 0\n        ):\n            # GPU\n            device_idx = torch.cuda.current_device()\n            seed = torch.cuda.default_generators[device_idx].seed()\n        else:\n            # CPU\n            seed = int(torch.seed() % sys.maxsize)\n\n        self.seeds[key] = seed\n        torch.manual_seed(self.seeds[key])\n\n    def forward(self, X_1, X_2):\n        \"\"\"\n        forward pass equations:\n        Y_1 = X_1 + Attention(X_2), F = Attention\n        Y_2 = X_2 + MLP(Y_1), G = MLP\n        \"\"\"\n\n        self.seed_cuda(\"attn\")\n        # Y_1 : attn_output\n        f_X_2 = self.F(X_2)\n\n        self.seed_cuda(\"droppath\")\n        f_X_2_dropped = drop_path(\n            f_X_2, drop_prob=self.drop_path_rate, training=self.training\n        )\n\n        # Y_1 = X_1 + f(X_2)\n        Y_1 = X_1 + f_X_2_dropped\n\n        # free memory\n        del X_1\n\n        self.seed_cuda(\"FFN\")\n        g_Y_1 = self.G(Y_1)\n\n        torch.manual_seed(self.seeds[\"droppath\"])\n        g_Y_1_dropped = drop_path(\n            g_Y_1, drop_prob=self.drop_path_rate, training=self.training\n        )\n\n        # Y_2 = X_2 + g(Y_1)\n        Y_2 = X_2 + g_Y_1_dropped\n\n        del X_2\n\n        return Y_1, Y_2\n\n    def backward_pass(\n        self,\n        Y_1,\n        Y_2,\n        dY_1,\n        dY_2,\n    ):\n        \"\"\"\n        equation for activation recomputation:\n        X_2 = Y_2 - G(Y_1), G = MLP\n        X_1 = Y_1 - F(X_2), F = Attention\n        \"\"\"\n\n        # temporarily record intermediate 
activations for G\n        # and use them for gradient calculation of G\n        with torch.enable_grad():\n            Y_1.requires_grad = True\n\n            torch.manual_seed(self.seeds[\"FFN\"])\n            g_Y_1 = self.G(Y_1)\n\n            torch.manual_seed(self.seeds[\"droppath\"])\n            g_Y_1 = drop_path(\n                g_Y_1, drop_prob=self.drop_path_rate, training=self.training\n            )\n\n            g_Y_1.backward(dY_2, retain_graph=True)\n\n        # activation recomputation is by design and not part of\n        # the computation graph in forward pass.\n        with torch.no_grad():\n            X_2 = Y_2 - g_Y_1\n            del g_Y_1\n\n            dY_1 = dY_1 + Y_1.grad\n            Y_1.grad = None\n\n        # record F activations and calc gradients on F\n        with torch.enable_grad():\n            X_2.requires_grad = True\n\n            torch.manual_seed(self.seeds[\"attn\"])\n            f_X_2 = self.F(X_2)\n\n            torch.manual_seed(self.seeds[\"droppath\"])\n            f_X_2 = drop_path(\n                f_X_2, drop_prob=self.drop_path_rate, training=self.training\n            )\n\n            f_X_2.backward(dY_1, retain_graph=True)\n\n        # propagate reverse computed activations at the start of\n        # the previous block for backprop.\n        with torch.no_grad():\n            X_1 = Y_1 - f_X_2\n\n            del f_X_2, Y_1\n            dY_2 = dY_2 + X_2.grad\n\n            X_2.grad = None\n            X_2 = X_2.detach()\n\n        return X_1, X_2, dY_1, dY_2\n\n\nclass MLPSubblock(nn.Module):\n    \"\"\"\n    This creates the function G such that the entire block can be\n    expressed as F(G(X)). 
Includes pre-LayerNorm.\n    \"\"\"\n\n    def __init__(\n        self,\n        dim,\n        mlp_ratio,\n        norm_layer=nn.LayerNorm,\n    ):\n        super().__init__()\n        self.norm = norm_layer(dim, eps=1e-6, elementwise_affine=True)\n\n        mlp_hidden_dim = int(dim * mlp_ratio)\n\n        self.mlp = Mlp(\n            in_features=dim,\n            hidden_features=mlp_hidden_dim,\n            act_layer=nn.GELU,\n        )\n\n    def forward(self, x):\n        return self.mlp(self.norm(x))\n\n\nclass AttentionSubBlock(nn.Module):\n    \"\"\"\n    This creates the function F such that the entire block can be\n    expressed as F(G(X)). Includes pre-LayerNorm.\n    \"\"\"\n\n    def __init__(\n        self,\n        dim,\n        input_size,\n        num_heads,\n        cfg,\n        dim_out=None,\n        kernel_q=(1, 1, 1),\n        kernel_kv=(1, 1, 1),\n        stride_q=(1, 1, 1),\n        stride_kv=(1, 1, 1),\n        norm_layer=nn.LayerNorm,\n    ):\n        super().__init__()\n        self.norm = norm_layer(dim, eps=1e-6, elementwise_affine=True)\n\n        # This will be set externally during init\n        self.thw = None\n\n        # The attention details are the same as multiscale\n        # attention for MViTv2 (with channel up-projection inside the block).\n        # It can also implement attention without up-projection for vanilla ViT.\n        self.attn = MultiScaleAttention(\n            dim,\n            dim_out,\n            input_size=input_size,\n            num_heads=num_heads,\n            kernel_q=kernel_q,\n            kernel_kv=kernel_kv,\n            stride_q=stride_q,\n            stride_kv=stride_kv,\n            norm_layer=norm_layer,\n            drop_rate=cfg.MVIT.DROPOUT_RATE,\n            qkv_bias=cfg.MVIT.QKV_BIAS,\n            has_cls_embed=cfg.MVIT.CLS_EMBED_ON,\n            mode=cfg.MVIT.MODE,\n            pool_first=cfg.MVIT.POOL_FIRST,\n            rel_pos_spatial=cfg.MVIT.REL_POS_SPATIAL,\n            
rel_pos_temporal=cfg.MVIT.REL_POS_TEMPORAL,\n            rel_pos_zero_init=cfg.MVIT.REL_POS_ZERO_INIT,\n            residual_pooling=cfg.MVIT.RESIDUAL_POOLING,\n            separate_qkv=cfg.MVIT.SEPARATE_QKV,\n        )\n\n    def forward(self, x):\n        out, _ = self.attn(self.norm(x), self.thw)\n        return out\n"
  },
  {
    "path": "slowfast/models/stem_helper.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"ResNe(X)t 3D stem helper.\"\"\"\n\nimport torch.nn as nn\n\n\ndef get_stem_func(name):\n    \"\"\"\n    Retrieves the stem module by name.\n    \"\"\"\n    trans_funcs = {\"x3d_stem\": X3DStem, \"basic_stem\": ResNetBasicStem}\n    assert name in trans_funcs.keys(), (\n        \"Transformation function '{}' not supported\".format(name)\n    )\n    return trans_funcs[name]\n\n\nclass VideoModelStem(nn.Module):\n    \"\"\"\n    Video 3D stem module. Provides stem operations of Conv, BN, ReLU, MaxPool\n    on input data tensor for one or multiple pathways.\n    \"\"\"\n\n    def __init__(\n        self,\n        dim_in,\n        dim_out,\n        kernel,\n        stride,\n        padding,\n        inplace_relu=True,\n        eps=1e-5,\n        bn_mmt=0.1,\n        norm_module=nn.BatchNorm3d,\n        stem_func_name=\"basic_stem\",\n    ):\n        \"\"\"\n        The `__init__` method of any subclass should also contain these\n        arguments. List size of 1 for single pathway models (C2D, I3D, Slow\n        and etc), list size of 2 for two pathway models (SlowFast).\n\n        Args:\n            dim_in (list): the list of channel dimensions of the inputs.\n            dim_out (list): the output dimension of the convolution in the stem\n                layer.\n            kernel (list): the kernels' size of the convolutions in the stem\n                layers. Temporal kernel size, height kernel size, width kernel\n                size in order.\n            stride (list): the stride sizes of the convolutions in the stem\n                layer. Temporal kernel stride, height kernel size, width kernel\n                size in order.\n            padding (list): the paddings' sizes of the convolutions in the stem\n                layer. 
Temporal padding size, height padding size, width padding\n                size in order.\n            inplace_relu (bool): calculate the relu on the original input\n                without allocating new memory.\n            eps (float): epsilon for batch norm.\n            bn_mmt (float): momentum for batch norm. Note that BN momentum in\n                PyTorch = 1 - BN momentum in Caffe2.\n            norm_module (nn.Module): nn.Module for the normalization layer. The\n                default is nn.BatchNorm3d.\n            stem_func_name (string): name of the stem function applied on\n                input to the network.\n        \"\"\"\n        super(VideoModelStem, self).__init__()\n\n        assert (\n            len(\n                {\n                    len(dim_in),\n                    len(dim_out),\n                    len(kernel),\n                    len(stride),\n                    len(padding),\n                }\n            )\n            == 1\n        ), \"Input pathway dimensions are not consistent. 
{} {} {} {} {}\".format(\n            len(dim_in),\n            len(dim_out),\n            len(kernel),\n            len(stride),\n            len(padding),\n        )\n\n        self.num_pathways = len(dim_in)\n        self.kernel = kernel\n        self.stride = stride\n        self.padding = padding\n        self.inplace_relu = inplace_relu\n        self.eps = eps\n        self.bn_mmt = bn_mmt\n        # Construct the stem layer.\n        self._construct_stem(dim_in, dim_out, norm_module, stem_func_name)\n\n    def _construct_stem(self, dim_in, dim_out, norm_module, stem_func_name):\n        trans_func = get_stem_func(stem_func_name)\n\n        for pathway in range(len(dim_in)):\n            stem = trans_func(\n                dim_in[pathway],\n                dim_out[pathway],\n                self.kernel[pathway],\n                self.stride[pathway],\n                self.padding[pathway],\n                self.inplace_relu,\n                self.eps,\n                self.bn_mmt,\n                norm_module,\n            )\n            self.add_module(\"pathway{}_stem\".format(pathway), stem)\n\n    def forward(self, x):\n        assert len(x) == self.num_pathways, (\n            \"Input tensor does not contain {} pathway\".format(self.num_pathways)\n        )\n        # Use a new list instead of modifying the x list in place, which is\n        # bad for activation checkpointing.\n        y = []\n        for pathway in range(len(x)):\n            m = getattr(self, \"pathway{}_stem\".format(pathway))\n            y.append(m(x[pathway]))\n        return y\n\n\nclass ResNetBasicStem(nn.Module):\n    \"\"\"\n    ResNe(X)t 3D stem module.\n    Performs spatiotemporal convolution, BN, and ReLU, followed by a\n    spatiotemporal pooling.\n    \"\"\"\n\n    def __init__(\n        self,\n        dim_in,\n        dim_out,\n        kernel,\n        stride,\n        padding,\n        inplace_relu=True,\n        eps=1e-5,\n        bn_mmt=0.1,\n        norm_module=nn.BatchNorm3d,\n    
):\n        \"\"\"\n        The `__init__` method of any subclass should also contain these arguments.\n\n        Args:\n            dim_in (int): the channel dimension of the input. Normally 3 is used\n                for rgb input, and 2 or 3 is used for optical flow input.\n            dim_out (int): the output dimension of the convolution in the stem\n                layer.\n            kernel (list): the kernel size of the convolution in the stem layer.\n                temporal kernel size, height kernel size, width kernel size in\n                order.\n            stride (list): the stride size of the convolution in the stem layer.\n                temporal stride, height stride, width stride in order.\n            padding (list): the padding size of the convolution in the stem\n                layer, temporal padding size, height padding size, width\n                padding size in order.\n            inplace_relu (bool): calculate the relu on the original input\n                without allocating new memory.\n            eps (float): epsilon for batch norm.\n            bn_mmt (float): momentum for batch norm. Note that BN momentum in\n                PyTorch = 1 - BN momentum in Caffe2.\n            norm_module (nn.Module): nn.Module for the normalization layer. 
The\n                default is nn.BatchNorm3d.\n        \"\"\"\n        super(ResNetBasicStem, self).__init__()\n        self.kernel = kernel\n        self.stride = stride\n        self.padding = padding\n        self.inplace_relu = inplace_relu\n        self.eps = eps\n        self.bn_mmt = bn_mmt\n        # Construct the stem layer.\n        self._construct_stem(dim_in, dim_out, norm_module)\n\n    def _construct_stem(self, dim_in, dim_out, norm_module):\n        self.conv = nn.Conv3d(\n            dim_in,\n            dim_out,\n            self.kernel,\n            stride=self.stride,\n            padding=self.padding,\n            bias=False,\n        )\n        self.bn = norm_module(num_features=dim_out, eps=self.eps, momentum=self.bn_mmt)\n        self.relu = nn.ReLU(self.inplace_relu)\n        self.pool_layer = nn.MaxPool3d(\n            kernel_size=[1, 3, 3], stride=[1, 2, 2], padding=[0, 1, 1]\n        )\n\n    def forward(self, x):\n        x = self.conv(x)\n        x = self.bn(x)\n        x = self.relu(x)\n        x = self.pool_layer(x)\n        return x\n\n\nclass X3DStem(nn.Module):\n    \"\"\"\n    X3D's 3D stem module.\n    Performs a spatial convolution followed by a depthwise temporal\n    convolution, BN, and ReLU. Unlike ResNetBasicStem, no pooling layer is\n    applied.\n    \"\"\"\n\n    def __init__(\n        self,\n        dim_in,\n        dim_out,\n        kernel,\n        stride,\n        padding,\n        inplace_relu=True,\n        eps=1e-5,\n        bn_mmt=0.1,\n        norm_module=nn.BatchNorm3d,\n    ):\n        \"\"\"\n        The `__init__` method of any subclass should also contain these arguments.\n\n        Args:\n            dim_in (int): the channel dimension of the input. 
Normally 3 is used\n                for rgb input, and 2 or 3 is used for optical flow input.\n            dim_out (int): the output dimension of the convolution in the stem\n                layer.\n            kernel (list): the kernel size of the convolution in the stem layer.\n                temporal kernel size, height kernel size, width kernel size in\n                order.\n            stride (list): the stride size of the convolution in the stem layer.\n                temporal stride, height stride, width stride in order.\n            padding (list): the padding size of the convolution in the stem\n                layer, temporal padding size, height padding size, width\n                padding size in order.\n            inplace_relu (bool): calculate the relu on the original input\n                without allocating new memory.\n            eps (float): epsilon for batch norm.\n            bn_mmt (float): momentum for batch norm. Note that BN momentum in\n                PyTorch = 1 - BN momentum in Caffe2.\n            norm_module (nn.Module): nn.Module for the normalization layer. 
The\n                default is nn.BatchNorm3d.\n        \"\"\"\n        super(X3DStem, self).__init__()\n        self.kernel = kernel\n        self.stride = stride\n        self.padding = padding\n        self.inplace_relu = inplace_relu\n        self.eps = eps\n        self.bn_mmt = bn_mmt\n        # Construct the stem layer.\n        self._construct_stem(dim_in, dim_out, norm_module)\n\n    def _construct_stem(self, dim_in, dim_out, norm_module):\n        self.conv_xy = nn.Conv3d(\n            dim_in,\n            dim_out,\n            kernel_size=(1, self.kernel[1], self.kernel[2]),\n            stride=(1, self.stride[1], self.stride[2]),\n            padding=(0, self.padding[1], self.padding[2]),\n            bias=False,\n        )\n        self.conv = nn.Conv3d(\n            dim_out,\n            dim_out,\n            kernel_size=(self.kernel[0], 1, 1),\n            stride=(self.stride[0], 1, 1),\n            padding=(self.padding[0], 0, 0),\n            bias=False,\n            groups=dim_out,\n        )\n\n        self.bn = norm_module(num_features=dim_out, eps=self.eps, momentum=self.bn_mmt)\n        self.relu = nn.ReLU(self.inplace_relu)\n\n    def forward(self, x):\n        x = self.conv_xy(x)\n        x = self.conv(x)\n        x = self.bn(x)\n        x = self.relu(x)\n        return x\n\n\nclass PatchEmbed(nn.Module):\n    \"\"\"\n    PatchEmbed.\n    \"\"\"\n\n    def __init__(\n        self,\n        dim_in=3,\n        dim_out=768,\n        kernel=(1, 16, 16),\n        stride=(1, 4, 4),\n        padding=(1, 7, 7),\n        conv_2d=False,\n    ):\n        super().__init__()\n        if conv_2d:\n            conv = nn.Conv2d\n        else:\n            conv = nn.Conv3d\n        self.proj = conv(\n            dim_in,\n            dim_out,\n            kernel_size=kernel,\n            stride=stride,\n            padding=padding,\n        )\n\n    def forward(self, x, keep_spatial=False):\n        x = self.proj(x)\n        if keep_spatial:\n            
return x, x.shape\n        # B C (T) H W -> B (T)HW C\n        return x.flatten(2).transpose(1, 2), x.shape\n"
  },
  {
    "path": "slowfast/models/utils.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport numpy as np\nimport slowfast.utils.logging as logging\nimport torch\n\nlogger = logging.get_logger(__name__)\n\n\ndef round_width(width, multiplier, min_width=1, divisor=1, verbose=False):\n    if not multiplier:\n        return width\n    width *= multiplier\n    min_width = min_width or divisor\n    if verbose:\n        logger.info(f\"min width {min_width}\")\n        logger.info(f\"width {width} divisor {divisor}\")\n        logger.info(f\"other {int(width + divisor / 2) // divisor * divisor}\")\n\n    width_out = max(min_width, int(width + divisor / 2) // divisor * divisor)\n    if width_out < 0.9 * width:\n        width_out += divisor\n    return int(width_out)\n\n\ndef validate_checkpoint_wrapper_import(checkpoint_wrapper):\n    \"\"\"\n    Check if checkpoint_wrapper is imported.\n    \"\"\"\n    if checkpoint_wrapper is None:\n        raise ImportError(\"Please install fairscale.\")\n\n\ndef get_gkern(kernlen, std):\n    \"\"\"Returns a 2D Gaussian kernel array.\"\"\"\n\n    def _gaussian_fn(kernlen, std):\n        n = torch.arange(0, kernlen).float()\n        n -= n.mean()\n        n /= std\n        w = torch.exp(-0.5 * n**2)\n        return w\n\n    gkern1d = _gaussian_fn(kernlen, std)\n    gkern2d = torch.outer(gkern1d, gkern1d)\n    return gkern2d / gkern2d.sum()\n\n\n# --------------------------------------------------------\n# 2D sine-cosine position embedding\n# References:\n# Transformer: https://github.com/tensorflow/models/blob/master/official/nlp/transformer/model_utils.py\n# MoCo v3: https://github.com/facebookresearch/moco-v3\n# --------------------------------------------------------\ndef get_3d_sincos_pos_embed(embed_dim, grid_size, t_size, cls_token=False):\n    \"\"\"\n    grid_size: int of the grid height and width\n    t_size: int of the temporal size\n    return:\n    pos_embed: [t_size*grid_size*grid_size, embed_dim] or 
[1+t_size*grid_size*grid_size, embed_dim] (w/ or w/o cls_token)\n    \"\"\"\n    assert embed_dim % 4 == 0\n    embed_dim_spatial = embed_dim // 4 * 3\n    embed_dim_temporal = embed_dim // 4\n\n    # spatial\n    grid_h = np.arange(grid_size, dtype=np.float32)\n    grid_w = np.arange(grid_size, dtype=np.float32)\n    grid = np.meshgrid(grid_w, grid_h)  # here w goes first\n    grid = np.stack(grid, axis=0)\n\n    grid = grid.reshape([2, 1, grid_size, grid_size])\n    pos_embed_spatial = get_2d_sincos_pos_embed_from_grid(embed_dim_spatial, grid)\n\n    # temporal\n    grid_t = np.arange(t_size, dtype=np.float32)\n    pos_embed_temporal = get_1d_sincos_pos_embed_from_grid(embed_dim_temporal, grid_t)\n\n    # concate: [T, H, W] order\n    pos_embed_temporal = pos_embed_temporal[:, np.newaxis, :]\n    pos_embed_temporal = np.repeat(\n        pos_embed_temporal, grid_size**2, axis=1\n    )  # [T, H*W, D // 4]\n    pos_embed_spatial = pos_embed_spatial[np.newaxis, :, :]\n    pos_embed_spatial = np.repeat(\n        pos_embed_spatial, t_size, axis=0\n    )  # [T, H*W, D // 4 * 3]\n\n    pos_embed = np.concatenate([pos_embed_temporal, pos_embed_spatial], axis=-1)\n    pos_embed = pos_embed.reshape([-1, embed_dim])  # [T*H*W, D]\n\n    if cls_token:\n        pos_embed = np.concatenate([np.zeros([1, embed_dim]), pos_embed], axis=0)\n    return pos_embed\n\n\ndef get_2d_sincos_pos_embed(embed_dim, grid_size, cls_token=False):\n    \"\"\"\n    grid_size: int of the grid height and width\n    return:\n    pos_embed: [grid_size*grid_size, embed_dim] or [1+grid_size*grid_size, embed_dim] (w/ or w/o cls_token)\n    \"\"\"\n    grid_h = np.arange(grid_size, dtype=np.float32)\n    grid_w = np.arange(grid_size, dtype=np.float32)\n    grid = np.meshgrid(grid_w, grid_h)  # here w goes first\n    grid = np.stack(grid, axis=0)\n\n    grid = grid.reshape([2, 1, grid_size, grid_size])\n    pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)\n    if cls_token:\n        pos_embed 
= np.concatenate([np.zeros([1, embed_dim]), pos_embed], axis=0)\n    return pos_embed\n\n\ndef get_2d_sincos_pos_embed_from_grid(embed_dim, grid):\n    assert embed_dim % 2 == 0\n\n    # use half of dimensions to encode grid_h\n    emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0])  # (H*W, D/2)\n    emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1])  # (H*W, D/2)\n\n    emb = np.concatenate([emb_h, emb_w], axis=1)  # (H*W, D)\n    return emb\n\n\ndef get_1d_sincos_pos_embed_from_grid(embed_dim, pos):\n    \"\"\"\n    embed_dim: output dimension for each position\n    pos: a list of positions to be encoded: size (M,)\n    out: (M, D)\n    \"\"\"\n    assert embed_dim % 2 == 0\n    omega = np.arange(embed_dim // 2, dtype=float)\n    omega /= embed_dim / 2.0\n    omega = 1.0 / 10000**omega  # (D/2,)\n\n    pos = pos.reshape(-1)  # (M,)\n    out = np.einsum(\"m,d->md\", pos, omega)  # (M, D/2), outer product\n\n    emb_sin = np.sin(out)  # (M, D/2)\n    emb_cos = np.cos(out)  # (M, D/2)\n\n    emb = np.concatenate([emb_sin, emb_cos], axis=1)  # (M, D)\n    return emb\n\n\n# --------------------------------------------------------\n# Interpolate position embeddings for high-resolution\n# References:\n# DeiT: https://github.com/facebookresearch/deit\n# --------------------------------------------------------\ndef interpolate_pos_embed(model, checkpoint_model):\n    if \"pos_embed\" in checkpoint_model:\n        pos_embed_checkpoint = checkpoint_model[\"pos_embed\"]\n        embedding_size = pos_embed_checkpoint.shape[-1]\n        num_patches = model.patch_embed.num_patches\n        num_extra_tokens = model.pos_embed.shape[-2] - num_patches\n        # height (== width) for the checkpoint position embedding\n        orig_size = int((pos_embed_checkpoint.shape[-2] - num_extra_tokens) ** 0.5)\n        # height (== width) for the new position embedding\n        new_size = int(num_patches**0.5)\n        # class_token and dist_token are kept 
unchanged\n        if orig_size != new_size:\n            print(\n                \"Position interpolate from %dx%d to %dx%d\"\n                % (orig_size, orig_size, new_size, new_size)\n            )\n            extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens]\n            # only the position tokens are interpolated\n            pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:]\n            pos_tokens = pos_tokens.reshape(\n                -1, orig_size, orig_size, embedding_size\n            ).permute(0, 3, 1, 2)\n            pos_tokens = torch.nn.functional.interpolate(\n                pos_tokens,\n                size=(new_size, new_size),\n                mode=\"bicubic\",\n                align_corners=False,\n            )\n            pos_tokens = pos_tokens.permute(0, 2, 3, 1).flatten(1, 2)\n            new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1)\n            checkpoint_model[\"pos_embed\"] = new_pos_embed\n\n\ndef calc_mvit_feature_geometry(cfg):\n    feat_size = [\n        [\n            (\n                cfg.DATA.NUM_FRAMES // cfg.MVIT.PATCH_STRIDE[0]\n                if len(cfg.MVIT.PATCH_STRIDE) > 2\n                else 1\n            ),\n            cfg.DATA.TRAIN_CROP_SIZE // cfg.MVIT.PATCH_STRIDE[-2],\n            cfg.DATA.TRAIN_CROP_SIZE // cfg.MVIT.PATCH_STRIDE[-1],\n        ]\n        for i in range(cfg.MVIT.DEPTH)\n    ]\n    feat_stride = [\n        [\n            cfg.MVIT.PATCH_STRIDE[0] if len(cfg.MVIT.PATCH_STRIDE) > 2 else 1,\n            cfg.MVIT.PATCH_STRIDE[-2],\n            cfg.MVIT.PATCH_STRIDE[-1],\n        ]\n        for i in range(cfg.MVIT.DEPTH)\n    ]\n    for _, x in enumerate(cfg.MVIT.POOL_Q_STRIDE):\n        for i in range(cfg.MVIT.DEPTH):\n            if i >= x[0]:\n                for j in range(len(feat_size[i])):\n                    feat_size[i][j] = feat_size[i][j] // x[j + 1]\n                    feat_stride[i][j] = feat_stride[i][j] * x[j + 1]\n    return feat_size, feat_stride\n"
  },
  {
    "path": "slowfast/models/video_model_builder.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\n\"\"\"Video models.\"\"\"\n\nimport math\nfrom functools import partial\n\nimport slowfast.utils.logging as logging\nimport slowfast.utils.weight_init_helper as init_helper\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom slowfast.models.attention import MultiScaleBlock\nfrom slowfast.models.batchnorm_helper import get_norm\nfrom slowfast.models.common import TwoStreamFusion\nfrom slowfast.models.reversible_mvit import ReversibleMViT\nfrom slowfast.models.utils import (\n    calc_mvit_feature_geometry,\n    get_3d_sincos_pos_embed,\n    round_width,\n    validate_checkpoint_wrapper_import,\n)\nfrom torch.nn.init import trunc_normal_\n\nfrom . import head_helper, operators, resnet_helper, stem_helper  # noqa\nfrom .build import MODEL_REGISTRY\n\ntry:\n    from fairscale.nn.checkpoint import checkpoint_wrapper\nexcept ImportError:\n    checkpoint_wrapper = None\n\n\nlogger = logging.get_logger(__name__)\n\n# Number of blocks for different stages given the model depth.\n_MODEL_STAGE_DEPTH = {18: (2, 2, 2, 2), 50: (3, 4, 6, 3), 101: (3, 4, 23, 3)}\n\n# Basis of temporal kernel sizes for each of the stage.\n_TEMPORAL_KERNEL_BASIS = {\n    \"2d\": [\n        [[1]],  # conv1 temporal kernel.\n        [[1]],  # res2 temporal kernel.\n        [[1]],  # res3 temporal kernel.\n        [[1]],  # res4 temporal kernel.\n        [[1]],  # res5 temporal kernel.\n    ],\n    \"c2d\": [\n        [[1]],  # conv1 temporal kernel.\n        [[1]],  # res2 temporal kernel.\n        [[1]],  # res3 temporal kernel.\n        [[1]],  # res4 temporal kernel.\n        [[1]],  # res5 temporal kernel.\n    ],\n    \"slow_c2d\": [\n        [[1]],  # conv1 temporal kernel.\n        [[1]],  # res2 temporal kernel.\n        [[1]],  # res3 temporal kernel.\n        [[1]],  # res4 temporal kernel.\n        [[1]],  # res5 temporal kernel.\n    ],\n    \"i3d\": [\n        [[5]],  # 
conv1 temporal kernel.\n        [[3]],  # res2 temporal kernel.\n        [[3, 1]],  # res3 temporal kernel.\n        [[3, 1]],  # res4 temporal kernel.\n        [[1, 3]],  # res5 temporal kernel.\n    ],\n    \"slow_i3d\": [\n        [[5]],  # conv1 temporal kernel.\n        [[3]],  # res2 temporal kernel.\n        [[3, 1]],  # res3 temporal kernel.\n        [[3, 1]],  # res4 temporal kernel.\n        [[1, 3]],  # res5 temporal kernel.\n    ],\n    \"slow\": [\n        [[1]],  # conv1 temporal kernel.\n        [[1]],  # res2 temporal kernel.\n        [[1]],  # res3 temporal kernel.\n        [[3]],  # res4 temporal kernel.\n        [[3]],  # res5 temporal kernel.\n    ],\n    \"slowfast\": [\n        [[1], [5]],  # conv1 temporal kernel for slow and fast pathway.\n        [[1], [3]],  # res2 temporal kernel for slow and fast pathway.\n        [[1], [3]],  # res3 temporal kernel for slow and fast pathway.\n        [[3], [3]],  # res4 temporal kernel for slow and fast pathway.\n        [[3], [3]],  # res5 temporal kernel for slow and fast pathway.\n    ],\n    \"x3d\": [\n        [[5]],  # conv1 temporal kernels.\n        [[3]],  # res2 temporal kernels.\n        [[3]],  # res3 temporal kernels.\n        [[3]],  # res4 temporal kernels.\n        [[3]],  # res5 temporal kernels.\n    ],\n}\n\n_POOL1 = {\n    \"2d\": [[1, 1, 1]],\n    \"c2d\": [[2, 1, 1]],\n    \"slow_c2d\": [[1, 1, 1]],\n    \"i3d\": [[2, 1, 1]],\n    \"slow_i3d\": [[1, 1, 1]],\n    \"slow\": [[1, 1, 1]],\n    \"slowfast\": [[1, 1, 1], [1, 1, 1]],\n    \"x3d\": [[1, 1, 1]],\n}\n\n\nclass FuseFastToSlow(nn.Module):\n    \"\"\"\n    Fuses the information from the Fast pathway to the Slow pathway. 
Given the\n    tensors from Slow pathway and Fast pathway, fuse information from Fast to\n    Slow, then return the fused tensors from Slow and Fast pathway in order.\n    \"\"\"\n\n    def __init__(\n        self,\n        dim_in,\n        fusion_conv_channel_ratio,\n        fusion_kernel,\n        alpha,\n        eps=1e-5,\n        bn_mmt=0.1,\n        inplace_relu=True,\n        norm_module=nn.BatchNorm3d,\n    ):\n        \"\"\"\n        Args:\n            dim_in (int): the channel dimension of the input.\n            fusion_conv_channel_ratio (int): channel ratio for the convolution\n                used to fuse from Fast pathway to Slow pathway.\n            fusion_kernel (int): kernel size of the convolution used to fuse\n                from Fast pathway to Slow pathway.\n            alpha (int): the frame rate ratio between the Fast and Slow pathway.\n            eps (float): epsilon for batch norm.\n            bn_mmt (float): momentum for batch norm. Note that BN momentum in\n                PyTorch = 1 - BN momentum in Caffe2.\n            inplace_relu (bool): if True, calculate the relu on the original\n                input without allocating new memory.\n            norm_module (nn.Module): nn.Module for the normalization layer. 
The\n                default is nn.BatchNorm3d.\n        \"\"\"\n        super(FuseFastToSlow, self).__init__()\n        self.conv_f2s = nn.Conv3d(\n            dim_in,\n            dim_in * fusion_conv_channel_ratio,\n            kernel_size=[fusion_kernel, 1, 1],\n            stride=[alpha, 1, 1],\n            padding=[fusion_kernel // 2, 0, 0],\n            bias=False,\n        )\n        self.bn = norm_module(\n            num_features=dim_in * fusion_conv_channel_ratio,\n            eps=eps,\n            momentum=bn_mmt,\n        )\n        self.relu = nn.ReLU(inplace_relu)\n\n    def forward(self, x):\n        x_s = x[0]\n        x_f = x[1]\n        fuse = self.conv_f2s(x_f)\n        fuse = self.bn(fuse)\n        fuse = self.relu(fuse)\n        x_s_fuse = torch.cat([x_s, fuse], 1)\n        return [x_s_fuse, x_f]\n\n\n@MODEL_REGISTRY.register()\nclass SlowFast(nn.Module):\n    \"\"\"\n    SlowFast model builder for SlowFast network.\n\n    Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He.\n    \"SlowFast networks for video recognition.\"\n    https://arxiv.org/pdf/1812.03982.pdf\n    \"\"\"\n\n    def __init__(self, cfg):\n        \"\"\"\n        The `__init__` method of any subclass should also contain these\n            arguments.\n        Args:\n            cfg (CfgNode): model building configs, details are in the\n                comments of the config file.\n        \"\"\"\n        super(SlowFast, self).__init__()\n        self.norm_module = get_norm(cfg)\n        self.cfg = cfg\n        self.enable_detection = cfg.DETECTION.ENABLE\n        self.num_pathways = 2\n        self._construct_network(cfg)\n        init_helper.init_weights(\n            self,\n            cfg.MODEL.FC_INIT_STD,\n            cfg.RESNET.ZERO_INIT_FINAL_BN,\n            cfg.RESNET.ZERO_INIT_FINAL_CONV,\n        )\n\n    def _construct_network(self, cfg):\n        \"\"\"\n        Builds a SlowFast model. 
The first pathway is the Slow pathway and the\n            second pathway is the Fast pathway.\n        Args:\n            cfg (CfgNode): model building configs, details are in the\n                comments of the config file.\n        \"\"\"\n        assert cfg.MODEL.ARCH in _POOL1.keys()\n        pool_size = _POOL1[cfg.MODEL.ARCH]\n        assert len({len(pool_size), self.num_pathways}) == 1\n        assert cfg.RESNET.DEPTH in _MODEL_STAGE_DEPTH.keys()\n\n        (d2, d3, d4, d5) = _MODEL_STAGE_DEPTH[cfg.RESNET.DEPTH]\n\n        num_groups = cfg.RESNET.NUM_GROUPS\n        width_per_group = cfg.RESNET.WIDTH_PER_GROUP\n        dim_inner = num_groups * width_per_group\n        out_dim_ratio = cfg.SLOWFAST.BETA_INV // cfg.SLOWFAST.FUSION_CONV_CHANNEL_RATIO\n\n        temp_kernel = _TEMPORAL_KERNEL_BASIS[cfg.MODEL.ARCH]\n\n        self.s1 = stem_helper.VideoModelStem(\n            dim_in=cfg.DATA.INPUT_CHANNEL_NUM,\n            dim_out=[width_per_group, width_per_group // cfg.SLOWFAST.BETA_INV],\n            kernel=[temp_kernel[0][0] + [7, 7], temp_kernel[0][1] + [7, 7]],\n            stride=[[1, 2, 2]] * 2,\n            padding=[\n                [temp_kernel[0][0][0] // 2, 3, 3],\n                [temp_kernel[0][1][0] // 2, 3, 3],\n            ],\n            norm_module=self.norm_module,\n        )\n        self.s1_fuse = FuseFastToSlow(\n            width_per_group // cfg.SLOWFAST.BETA_INV,\n            cfg.SLOWFAST.FUSION_CONV_CHANNEL_RATIO,\n            cfg.SLOWFAST.FUSION_KERNEL_SZ,\n            cfg.SLOWFAST.ALPHA,\n            norm_module=self.norm_module,\n        )\n\n        self.s2 = resnet_helper.ResStage(\n            dim_in=[\n                width_per_group + width_per_group // out_dim_ratio,\n                width_per_group // cfg.SLOWFAST.BETA_INV,\n            ],\n            dim_out=[\n                width_per_group * 4,\n                width_per_group * 4 // cfg.SLOWFAST.BETA_INV,\n            ],\n            dim_inner=[dim_inner, dim_inner // 
cfg.SLOWFAST.BETA_INV],\n            temp_kernel_sizes=temp_kernel[1],\n            stride=cfg.RESNET.SPATIAL_STRIDES[0],\n            num_blocks=[d2] * 2,\n            num_groups=[num_groups] * 2,\n            num_block_temp_kernel=cfg.RESNET.NUM_BLOCK_TEMP_KERNEL[0],\n            nonlocal_inds=cfg.NONLOCAL.LOCATION[0],\n            nonlocal_group=cfg.NONLOCAL.GROUP[0],\n            nonlocal_pool=cfg.NONLOCAL.POOL[0],\n            instantiation=cfg.NONLOCAL.INSTANTIATION,\n            trans_func_name=cfg.RESNET.TRANS_FUNC,\n            dilation=cfg.RESNET.SPATIAL_DILATIONS[0],\n            norm_module=self.norm_module,\n        )\n        self.s2_fuse = FuseFastToSlow(\n            width_per_group * 4 // cfg.SLOWFAST.BETA_INV,\n            cfg.SLOWFAST.FUSION_CONV_CHANNEL_RATIO,\n            cfg.SLOWFAST.FUSION_KERNEL_SZ,\n            cfg.SLOWFAST.ALPHA,\n            norm_module=self.norm_module,\n        )\n\n        for pathway in range(self.num_pathways):\n            pool = nn.MaxPool3d(\n                kernel_size=pool_size[pathway],\n                stride=pool_size[pathway],\n                padding=[0, 0, 0],\n            )\n            self.add_module(\"pathway{}_pool\".format(pathway), pool)\n\n        self.s3 = resnet_helper.ResStage(\n            dim_in=[\n                width_per_group * 4 + width_per_group * 4 // out_dim_ratio,\n                width_per_group * 4 // cfg.SLOWFAST.BETA_INV,\n            ],\n            dim_out=[\n                width_per_group * 8,\n                width_per_group * 8 // cfg.SLOWFAST.BETA_INV,\n            ],\n            dim_inner=[dim_inner * 2, dim_inner * 2 // cfg.SLOWFAST.BETA_INV],\n            temp_kernel_sizes=temp_kernel[2],\n            stride=cfg.RESNET.SPATIAL_STRIDES[1],\n            num_blocks=[d3] * 2,\n            num_groups=[num_groups] * 2,\n            num_block_temp_kernel=cfg.RESNET.NUM_BLOCK_TEMP_KERNEL[1],\n            nonlocal_inds=cfg.NONLOCAL.LOCATION[1],\n            
nonlocal_group=cfg.NONLOCAL.GROUP[1],\n            nonlocal_pool=cfg.NONLOCAL.POOL[1],\n            instantiation=cfg.NONLOCAL.INSTANTIATION,\n            trans_func_name=cfg.RESNET.TRANS_FUNC,\n            dilation=cfg.RESNET.SPATIAL_DILATIONS[1],\n            norm_module=self.norm_module,\n        )\n        self.s3_fuse = FuseFastToSlow(\n            width_per_group * 8 // cfg.SLOWFAST.BETA_INV,\n            cfg.SLOWFAST.FUSION_CONV_CHANNEL_RATIO,\n            cfg.SLOWFAST.FUSION_KERNEL_SZ,\n            cfg.SLOWFAST.ALPHA,\n            norm_module=self.norm_module,\n        )\n\n        self.s4 = resnet_helper.ResStage(\n            dim_in=[\n                width_per_group * 8 + width_per_group * 8 // out_dim_ratio,\n                width_per_group * 8 // cfg.SLOWFAST.BETA_INV,\n            ],\n            dim_out=[\n                width_per_group * 16,\n                width_per_group * 16 // cfg.SLOWFAST.BETA_INV,\n            ],\n            dim_inner=[dim_inner * 4, dim_inner * 4 // cfg.SLOWFAST.BETA_INV],\n            temp_kernel_sizes=temp_kernel[3],\n            stride=cfg.RESNET.SPATIAL_STRIDES[2],\n            num_blocks=[d4] * 2,\n            num_groups=[num_groups] * 2,\n            num_block_temp_kernel=cfg.RESNET.NUM_BLOCK_TEMP_KERNEL[2],\n            nonlocal_inds=cfg.NONLOCAL.LOCATION[2],\n            nonlocal_group=cfg.NONLOCAL.GROUP[2],\n            nonlocal_pool=cfg.NONLOCAL.POOL[2],\n            instantiation=cfg.NONLOCAL.INSTANTIATION,\n            trans_func_name=cfg.RESNET.TRANS_FUNC,\n            dilation=cfg.RESNET.SPATIAL_DILATIONS[2],\n            norm_module=self.norm_module,\n        )\n        self.s4_fuse = FuseFastToSlow(\n            width_per_group * 16 // cfg.SLOWFAST.BETA_INV,\n            cfg.SLOWFAST.FUSION_CONV_CHANNEL_RATIO,\n            cfg.SLOWFAST.FUSION_KERNEL_SZ,\n            cfg.SLOWFAST.ALPHA,\n            norm_module=self.norm_module,\n        )\n\n        self.s5 = resnet_helper.ResStage(\n            dim_in=[\n  
              width_per_group * 16 + width_per_group * 16 // out_dim_ratio,\n                width_per_group * 16 // cfg.SLOWFAST.BETA_INV,\n            ],\n            dim_out=[\n                width_per_group * 32,\n                width_per_group * 32 // cfg.SLOWFAST.BETA_INV,\n            ],\n            dim_inner=[dim_inner * 8, dim_inner * 8 // cfg.SLOWFAST.BETA_INV],\n            temp_kernel_sizes=temp_kernel[4],\n            stride=cfg.RESNET.SPATIAL_STRIDES[3],\n            num_blocks=[d5] * 2,\n            num_groups=[num_groups] * 2,\n            num_block_temp_kernel=cfg.RESNET.NUM_BLOCK_TEMP_KERNEL[3],\n            nonlocal_inds=cfg.NONLOCAL.LOCATION[3],\n            nonlocal_group=cfg.NONLOCAL.GROUP[3],\n            nonlocal_pool=cfg.NONLOCAL.POOL[3],\n            instantiation=cfg.NONLOCAL.INSTANTIATION,\n            trans_func_name=cfg.RESNET.TRANS_FUNC,\n            dilation=cfg.RESNET.SPATIAL_DILATIONS[3],\n            norm_module=self.norm_module,\n        )\n\n        if cfg.DETECTION.ENABLE:\n            self.head = head_helper.ResNetRoIHead(\n                dim_in=[\n                    width_per_group * 32,\n                    width_per_group * 32 // cfg.SLOWFAST.BETA_INV,\n                ],\n                num_classes=cfg.MODEL.NUM_CLASSES,\n                pool_size=[\n                    [\n                        cfg.DATA.NUM_FRAMES // cfg.SLOWFAST.ALPHA // pool_size[0][0],\n                        1,\n                        1,\n                    ],\n                    [cfg.DATA.NUM_FRAMES // pool_size[1][0], 1, 1],\n                ],\n                resolution=[[cfg.DETECTION.ROI_XFORM_RESOLUTION] * 2] * 2,\n                scale_factor=[cfg.DETECTION.SPATIAL_SCALE_FACTOR] * 2,\n                dropout_rate=cfg.MODEL.DROPOUT_RATE,\n                act_func=cfg.MODEL.HEAD_ACT,\n                aligned=cfg.DETECTION.ALIGNED,\n                detach_final_fc=cfg.MODEL.DETACH_FINAL_FC,\n            )\n        else:\n            
self.head = head_helper.ResNetBasicHead(\n                dim_in=[\n                    width_per_group * 32,\n                    width_per_group * 32 // cfg.SLOWFAST.BETA_INV,\n                ],\n                num_classes=cfg.MODEL.NUM_CLASSES,\n                pool_size=(\n                    [None, None]\n                    if cfg.MULTIGRID.SHORT_CYCLE\n                    or cfg.MODEL.MODEL_NAME == \"ContrastiveModel\"\n                    else [\n                        [\n                            cfg.DATA.NUM_FRAMES\n                            // cfg.SLOWFAST.ALPHA\n                            // pool_size[0][0],\n                            cfg.DATA.TRAIN_CROP_SIZE // 32 // pool_size[0][1],\n                            cfg.DATA.TRAIN_CROP_SIZE // 32 // pool_size[0][2],\n                        ],\n                        [\n                            cfg.DATA.NUM_FRAMES // pool_size[1][0],\n                            cfg.DATA.TRAIN_CROP_SIZE // 32 // pool_size[1][1],\n                            cfg.DATA.TRAIN_CROP_SIZE // 32 // pool_size[1][2],\n                        ],\n                    ]\n                ),  # None for AdaptiveAvgPool3d((1, 1, 1))\n                dropout_rate=cfg.MODEL.DROPOUT_RATE,\n                act_func=cfg.MODEL.HEAD_ACT,\n                detach_final_fc=cfg.MODEL.DETACH_FINAL_FC,\n                cfg=cfg,\n            )\n\n    def forward(self, x, bboxes=None):\n        x = x[:]  # avoid pass by reference\n        x = self.s1(x)\n        x = self.s1_fuse(x)\n        x = self.s2(x)\n        x = self.s2_fuse(x)\n        for pathway in range(self.num_pathways):\n            pool = getattr(self, \"pathway{}_pool\".format(pathway))\n            x[pathway] = pool(x[pathway])\n        x = self.s3(x)\n        x = self.s3_fuse(x)\n        x = self.s4(x)\n        x = self.s4_fuse(x)\n        x = self.s5(x)\n        if self.enable_detection:\n            x = self.head(x, bboxes)\n        else:\n            x = self.head(x)\n  
      return x\n\n\n@MODEL_REGISTRY.register()\nclass ResNet(nn.Module):\n    \"\"\"\n    ResNet model builder. It builds a ResNet like network backbone without\n    lateral connection (C2D, I3D, Slow).\n\n    Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He.\n    \"SlowFast networks for video recognition.\"\n    https://arxiv.org/pdf/1812.03982.pdf\n\n    Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He.\n    \"Non-local neural networks.\"\n    https://arxiv.org/pdf/1711.07971.pdf\n    \"\"\"\n\n    def __init__(self, cfg):\n        \"\"\"\n        The `__init__` method of any subclass should also contain these\n            arguments.\n\n        Args:\n            cfg (CfgNode): model building configs, details are in the\n                comments of the config file.\n        \"\"\"\n        super(ResNet, self).__init__()\n        self.norm_module = get_norm(cfg)\n        self.enable_detection = cfg.DETECTION.ENABLE\n        self.num_pathways = 1\n        self._construct_network(cfg)\n        init_helper.init_weights(\n            self,\n            cfg.MODEL.FC_INIT_STD,\n            cfg.RESNET.ZERO_INIT_FINAL_BN,\n            cfg.RESNET.ZERO_INIT_FINAL_CONV,\n        )\n\n    def _construct_network(self, cfg):\n        \"\"\"\n        Builds a single pathway ResNet model.\n\n        Args:\n            cfg (CfgNode): model building configs, details are in the\n                comments of the config file.\n        \"\"\"\n        assert cfg.MODEL.ARCH in _POOL1.keys()\n        pool_size = _POOL1[cfg.MODEL.ARCH]\n        assert len({len(pool_size), self.num_pathways}) == 1\n        assert cfg.RESNET.DEPTH in _MODEL_STAGE_DEPTH.keys()\n        self.cfg = cfg\n\n        (d2, d3, d4, d5) = _MODEL_STAGE_DEPTH[cfg.RESNET.DEPTH]\n\n        num_groups = cfg.RESNET.NUM_GROUPS\n        width_per_group = cfg.RESNET.WIDTH_PER_GROUP\n        dim_inner = num_groups * width_per_group\n\n        temp_kernel = _TEMPORAL_KERNEL_BASIS[cfg.MODEL.ARCH]\n\n  
      s1 = stem_helper.VideoModelStem(\n            dim_in=cfg.DATA.INPUT_CHANNEL_NUM,\n            dim_out=[width_per_group],\n            kernel=[temp_kernel[0][0] + [7, 7]],\n            stride=[[1, 2, 2]],\n            padding=[[temp_kernel[0][0][0] // 2, 3, 3]],\n            norm_module=self.norm_module,\n        )\n\n        s2 = resnet_helper.ResStage(\n            dim_in=[width_per_group],\n            dim_out=[width_per_group * 4],\n            dim_inner=[dim_inner],\n            temp_kernel_sizes=temp_kernel[1],\n            stride=cfg.RESNET.SPATIAL_STRIDES[0],\n            num_blocks=[d2],\n            num_groups=[num_groups],\n            num_block_temp_kernel=cfg.RESNET.NUM_BLOCK_TEMP_KERNEL[0],\n            nonlocal_inds=cfg.NONLOCAL.LOCATION[0],\n            nonlocal_group=cfg.NONLOCAL.GROUP[0],\n            nonlocal_pool=cfg.NONLOCAL.POOL[0],\n            instantiation=cfg.NONLOCAL.INSTANTIATION,\n            trans_func_name=cfg.RESNET.TRANS_FUNC,\n            stride_1x1=cfg.RESNET.STRIDE_1X1,\n            inplace_relu=cfg.RESNET.INPLACE_RELU,\n            dilation=cfg.RESNET.SPATIAL_DILATIONS[0],\n            norm_module=self.norm_module,\n        )\n\n        # Based on profiling data of activation size, s1 and s2 have the activation sizes\n        # that are 4X larger than the second largest. Therefore, checkpointing them gives\n        # best memory savings. 
Further tuning is possible for better memory saving and tradeoffs\n        # with recomputing FLOPs.\n        if cfg.MODEL.ACT_CHECKPOINT:\n            validate_checkpoint_wrapper_import(checkpoint_wrapper)\n            self.s1 = checkpoint_wrapper(s1)\n            self.s2 = checkpoint_wrapper(s2)\n        else:\n            self.s1 = s1\n            self.s2 = s2\n\n        for pathway in range(self.num_pathways):\n            pool = nn.MaxPool3d(\n                kernel_size=pool_size[pathway],\n                stride=pool_size[pathway],\n                padding=[0, 0, 0],\n            )\n            self.add_module(\"pathway{}_pool\".format(pathway), pool)\n\n        self.s3 = resnet_helper.ResStage(\n            dim_in=[width_per_group * 4],\n            dim_out=[width_per_group * 8],\n            dim_inner=[dim_inner * 2],\n            temp_kernel_sizes=temp_kernel[2],\n            stride=cfg.RESNET.SPATIAL_STRIDES[1],\n            num_blocks=[d3],\n            num_groups=[num_groups],\n            num_block_temp_kernel=cfg.RESNET.NUM_BLOCK_TEMP_KERNEL[1],\n            nonlocal_inds=cfg.NONLOCAL.LOCATION[1],\n            nonlocal_group=cfg.NONLOCAL.GROUP[1],\n            nonlocal_pool=cfg.NONLOCAL.POOL[1],\n            instantiation=cfg.NONLOCAL.INSTANTIATION,\n            trans_func_name=cfg.RESNET.TRANS_FUNC,\n            stride_1x1=cfg.RESNET.STRIDE_1X1,\n            inplace_relu=cfg.RESNET.INPLACE_RELU,\n            dilation=cfg.RESNET.SPATIAL_DILATIONS[1],\n            norm_module=self.norm_module,\n        )\n\n        self.s4 = resnet_helper.ResStage(\n            dim_in=[width_per_group * 8],\n            dim_out=[width_per_group * 16],\n            dim_inner=[dim_inner * 4],\n            temp_kernel_sizes=temp_kernel[3],\n            stride=cfg.RESNET.SPATIAL_STRIDES[2],\n            num_blocks=[d4],\n            num_groups=[num_groups],\n            num_block_temp_kernel=cfg.RESNET.NUM_BLOCK_TEMP_KERNEL[2],\n            
nonlocal_inds=cfg.NONLOCAL.LOCATION[2],\n            nonlocal_group=cfg.NONLOCAL.GROUP[2],\n            nonlocal_pool=cfg.NONLOCAL.POOL[2],\n            instantiation=cfg.NONLOCAL.INSTANTIATION,\n            trans_func_name=cfg.RESNET.TRANS_FUNC,\n            stride_1x1=cfg.RESNET.STRIDE_1X1,\n            inplace_relu=cfg.RESNET.INPLACE_RELU,\n            dilation=cfg.RESNET.SPATIAL_DILATIONS[2],\n            norm_module=self.norm_module,\n        )\n\n        self.s5 = resnet_helper.ResStage(\n            dim_in=[width_per_group * 16],\n            dim_out=[width_per_group * 32],\n            dim_inner=[dim_inner * 8],\n            temp_kernel_sizes=temp_kernel[4],\n            stride=cfg.RESNET.SPATIAL_STRIDES[3],\n            num_blocks=[d5],\n            num_groups=[num_groups],\n            num_block_temp_kernel=cfg.RESNET.NUM_BLOCK_TEMP_KERNEL[3],\n            nonlocal_inds=cfg.NONLOCAL.LOCATION[3],\n            nonlocal_group=cfg.NONLOCAL.GROUP[3],\n            nonlocal_pool=cfg.NONLOCAL.POOL[3],\n            instantiation=cfg.NONLOCAL.INSTANTIATION,\n            trans_func_name=cfg.RESNET.TRANS_FUNC,\n            stride_1x1=cfg.RESNET.STRIDE_1X1,\n            inplace_relu=cfg.RESNET.INPLACE_RELU,\n            dilation=cfg.RESNET.SPATIAL_DILATIONS[3],\n            norm_module=self.norm_module,\n        )\n\n        if self.enable_detection:\n            self.head = head_helper.ResNetRoIHead(\n                dim_in=[width_per_group * 32],\n                num_classes=cfg.MODEL.NUM_CLASSES,\n                pool_size=[[cfg.DATA.NUM_FRAMES // pool_size[0][0], 1, 1]],\n                resolution=[[cfg.DETECTION.ROI_XFORM_RESOLUTION] * 2],\n                scale_factor=[cfg.DETECTION.SPATIAL_SCALE_FACTOR],\n                dropout_rate=cfg.MODEL.DROPOUT_RATE,\n                act_func=cfg.MODEL.HEAD_ACT,\n                aligned=cfg.DETECTION.ALIGNED,\n                detach_final_fc=cfg.MODEL.DETACH_FINAL_FC,\n            )\n        else:\n            self.head 
= head_helper.ResNetBasicHead(\n                dim_in=[width_per_group * 32],\n                num_classes=cfg.MODEL.NUM_CLASSES,\n                pool_size=(\n                    [None]\n                    if cfg.MULTIGRID.SHORT_CYCLE\n                    or cfg.MODEL.MODEL_NAME == \"ContrastiveModel\"\n                    else [\n                        [\n                            cfg.DATA.NUM_FRAMES // pool_size[0][0],\n                            cfg.DATA.TRAIN_CROP_SIZE // 32 // pool_size[0][1],\n                            cfg.DATA.TRAIN_CROP_SIZE // 32 // pool_size[0][2],\n                        ]\n                    ]\n                ),  # None for AdaptiveAvgPool3d((1, 1, 1))\n                dropout_rate=cfg.MODEL.DROPOUT_RATE,\n                act_func=cfg.MODEL.HEAD_ACT,\n                detach_final_fc=cfg.MODEL.DETACH_FINAL_FC,\n                cfg=cfg,\n            )\n\n    def forward(self, x, bboxes=None):\n        x = x[:]  # avoid pass by reference\n        x = self.s1(x)\n        x = self.s2(x)\n        y = []  # Don't modify x list in place due to activation checkpoint.\n        for pathway in range(self.num_pathways):\n            pool = getattr(self, \"pathway{}_pool\".format(pathway))\n            y.append(pool(x[pathway]))\n        x = self.s3(y)\n        x = self.s4(x)\n        x = self.s5(x)\n        if self.enable_detection:\n            x = self.head(x, bboxes)\n        else:\n            x = self.head(x)\n        return x\n\n\n@MODEL_REGISTRY.register()\nclass X3D(nn.Module):\n    \"\"\"\n    X3D model builder. 
It builds an X3D network backbone, which is a ResNet.\n\n    Christoph Feichtenhofer.\n    \"X3D: Expanding Architectures for Efficient Video Recognition.\"\n    https://arxiv.org/abs/2004.04730\n    \"\"\"\n\n    def __init__(self, cfg):\n        \"\"\"\n        The `__init__` method of any subclass should also contain these\n            arguments.\n\n        Args:\n            cfg (CfgNode): model building configs, details are in the\n                comments of the config file.\n        \"\"\"\n        super(X3D, self).__init__()\n        self.norm_module = get_norm(cfg)\n        self.enable_detection = cfg.DETECTION.ENABLE\n        self.num_pathways = 1\n\n        exp_stage = 2.0\n        self.dim_c1 = cfg.X3D.DIM_C1\n\n        self.dim_res2 = (\n            round_width(self.dim_c1, exp_stage, divisor=8)\n            if cfg.X3D.SCALE_RES2\n            else self.dim_c1\n        )\n        self.dim_res3 = round_width(self.dim_res2, exp_stage, divisor=8)\n        self.dim_res4 = round_width(self.dim_res3, exp_stage, divisor=8)\n        self.dim_res5 = round_width(self.dim_res4, exp_stage, divisor=8)\n\n        self.block_basis = [\n            # blocks, c, stride\n            [1, self.dim_res2, 2],\n            [2, self.dim_res3, 2],\n            [5, self.dim_res4, 2],\n            [3, self.dim_res5, 2],\n        ]\n        self._construct_network(cfg)\n        init_helper.init_weights(\n            self, cfg.MODEL.FC_INIT_STD, cfg.RESNET.ZERO_INIT_FINAL_BN\n        )\n\n    def _round_repeats(self, repeats, multiplier):\n        \"\"\"Round number of layers based on depth multiplier.\"\"\"\n        if not multiplier:\n            return repeats\n        return int(math.ceil(multiplier * repeats))\n\n    def _construct_network(self, cfg):\n        \"\"\"\n        Builds a single pathway X3D model.\n\n        Args:\n            cfg (CfgNode): model building configs, details are in the\n                comments of the config file.\n   
     \"\"\"\n        assert cfg.MODEL.ARCH in _POOL1.keys()\n        assert cfg.RESNET.DEPTH in _MODEL_STAGE_DEPTH.keys()\n\n        (d2, d3, d4, d5) = _MODEL_STAGE_DEPTH[cfg.RESNET.DEPTH]\n\n        num_groups = cfg.RESNET.NUM_GROUPS\n        width_per_group = cfg.RESNET.WIDTH_PER_GROUP\n        dim_inner = num_groups * width_per_group\n\n        w_mul = cfg.X3D.WIDTH_FACTOR\n        d_mul = cfg.X3D.DEPTH_FACTOR\n        dim_res1 = round_width(self.dim_c1, w_mul)\n\n        temp_kernel = _TEMPORAL_KERNEL_BASIS[cfg.MODEL.ARCH]\n\n        self.s1 = stem_helper.VideoModelStem(\n            dim_in=cfg.DATA.INPUT_CHANNEL_NUM,\n            dim_out=[dim_res1],\n            kernel=[temp_kernel[0][0] + [3, 3]],\n            stride=[[1, 2, 2]],\n            padding=[[temp_kernel[0][0][0] // 2, 1, 1]],\n            norm_module=self.norm_module,\n            stem_func_name=\"x3d_stem\",\n        )\n\n        # blob_in = s1\n        dim_in = dim_res1\n        for stage, block in enumerate(self.block_basis):\n            dim_out = round_width(block[1], w_mul)\n            dim_inner = int(cfg.X3D.BOTTLENECK_FACTOR * dim_out)\n\n            n_rep = self._round_repeats(block[0], d_mul)\n            prefix = \"s{}\".format(stage + 2)  # start w res2 to follow convention\n\n            s = resnet_helper.ResStage(\n                dim_in=[dim_in],\n                dim_out=[dim_out],\n                dim_inner=[dim_inner],\n                temp_kernel_sizes=temp_kernel[1],\n                stride=[block[2]],\n                num_blocks=[n_rep],\n                num_groups=[dim_inner] if cfg.X3D.CHANNELWISE_3x3x3 else [num_groups],\n                num_block_temp_kernel=[n_rep],\n                nonlocal_inds=cfg.NONLOCAL.LOCATION[0],\n                nonlocal_group=cfg.NONLOCAL.GROUP[0],\n                nonlocal_pool=cfg.NONLOCAL.POOL[0],\n                instantiation=cfg.NONLOCAL.INSTANTIATION,\n                trans_func_name=cfg.RESNET.TRANS_FUNC,\n                
stride_1x1=cfg.RESNET.STRIDE_1X1,\n                norm_module=self.norm_module,\n                dilation=cfg.RESNET.SPATIAL_DILATIONS[stage],\n                drop_connect_rate=cfg.MODEL.DROPCONNECT_RATE\n                * (stage + 2)\n                / (len(self.block_basis) + 1),\n            )\n            dim_in = dim_out\n            self.add_module(prefix, s)\n\n        if self.enable_detection:\n            raise NotImplementedError\n        else:\n            spat_sz = int(math.ceil(cfg.DATA.TRAIN_CROP_SIZE / 32.0))\n            self.head = head_helper.X3DHead(\n                dim_in=dim_out,\n                dim_inner=dim_inner,\n                dim_out=cfg.X3D.DIM_C5,\n                num_classes=cfg.MODEL.NUM_CLASSES,\n                pool_size=[cfg.DATA.NUM_FRAMES, spat_sz, spat_sz],\n                dropout_rate=cfg.MODEL.DROPOUT_RATE,\n                act_func=cfg.MODEL.HEAD_ACT,\n                bn_lin5_on=cfg.X3D.BN_LIN5,\n            )\n\n    def forward(self, x, bboxes=None):\n        for module in self.children():\n            x = module(x)\n        return x\n\n\n@MODEL_REGISTRY.register()\nclass MViT(nn.Module):\n    \"\"\"\n    Model builder for MViTv1 and MViTv2.\n\n    \"MViTv2: Improved Multiscale Vision Transformers for Classification and Detection\"\n    Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer\n    https://arxiv.org/abs/2112.01526\n    \"Multiscale Vision Transformers\"\n    Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer\n    https://arxiv.org/abs/2104.11227\n    \"\"\"\n\n    def __init__(self, cfg):\n        super().__init__()\n        # Get parameters.\n        assert cfg.DATA.TRAIN_CROP_SIZE == cfg.DATA.TEST_CROP_SIZE\n        self.cfg = cfg\n        pool_first = cfg.MVIT.POOL_FIRST\n        # Prepare input.\n        spatial_size = cfg.DATA.TRAIN_CROP_SIZE\n        temporal_size = cfg.DATA.NUM_FRAMES\n        
in_chans = cfg.DATA.INPUT_CHANNEL_NUM[0]\n        self.use_2d_patch = cfg.MVIT.PATCH_2D\n        self.enable_detection = cfg.DETECTION.ENABLE\n        self.enable_rev = cfg.MVIT.REV.ENABLE\n        self.patch_stride = cfg.MVIT.PATCH_STRIDE\n        if self.use_2d_patch:\n            self.patch_stride = [1] + self.patch_stride\n        self.T = cfg.DATA.NUM_FRAMES // self.patch_stride[0]\n        self.H = cfg.DATA.TRAIN_CROP_SIZE // self.patch_stride[1]\n        self.W = cfg.DATA.TRAIN_CROP_SIZE // self.patch_stride[2]\n        # Prepare output.\n        num_classes = cfg.MODEL.NUM_CLASSES\n        embed_dim = cfg.MVIT.EMBED_DIM\n        # Prepare backbone\n        num_heads = cfg.MVIT.NUM_HEADS\n        mlp_ratio = cfg.MVIT.MLP_RATIO\n        qkv_bias = cfg.MVIT.QKV_BIAS\n        self.drop_rate = cfg.MVIT.DROPOUT_RATE\n        depth = cfg.MVIT.DEPTH\n        drop_path_rate = cfg.MVIT.DROPPATH_RATE\n        layer_scale_init_value = cfg.MVIT.LAYER_SCALE_INIT_VALUE\n        head_init_scale = cfg.MVIT.HEAD_INIT_SCALE\n        mode = cfg.MVIT.MODE\n        self.cls_embed_on = cfg.MVIT.CLS_EMBED_ON\n        self.use_mean_pooling = cfg.MVIT.USE_MEAN_POOLING\n        # Params for positional embedding\n        self.use_abs_pos = cfg.MVIT.USE_ABS_POS\n        self.use_fixed_sincos_pos = cfg.MVIT.USE_FIXED_SINCOS_POS\n        self.sep_pos_embed = cfg.MVIT.SEP_POS_EMBED\n        self.rel_pos_spatial = cfg.MVIT.REL_POS_SPATIAL\n        self.rel_pos_temporal = cfg.MVIT.REL_POS_TEMPORAL\n        if cfg.MVIT.NORM == \"layernorm\":\n            norm_layer = partial(nn.LayerNorm, eps=1e-6)\n        else:\n            raise NotImplementedError(\"Only supports layernorm.\")\n        self.num_classes = num_classes\n        self.patch_embed = stem_helper.PatchEmbed(\n            dim_in=in_chans,\n            dim_out=embed_dim,\n            kernel=cfg.MVIT.PATCH_KERNEL,\n            stride=cfg.MVIT.PATCH_STRIDE,\n            padding=cfg.MVIT.PATCH_PADDING,\n            
conv_2d=self.use_2d_patch,\n        )\n\n        if cfg.MODEL.ACT_CHECKPOINT:\n            validate_checkpoint_wrapper_import(checkpoint_wrapper)\n            self.patch_embed = checkpoint_wrapper(self.patch_embed)\n        self.input_dims = [temporal_size, spatial_size, spatial_size]\n        assert self.input_dims[1] == self.input_dims[2]\n        self.patch_dims = [\n            self.input_dims[i] // self.patch_stride[i]\n            for i in range(len(self.input_dims))\n        ]\n        num_patches = math.prod(self.patch_dims)\n\n        dpr = [\n            x.item() for x in torch.linspace(0, drop_path_rate, depth)\n        ]  # stochastic depth decay rule\n\n        if self.cls_embed_on:\n            self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))\n            pos_embed_dim = num_patches + 1\n        else:\n            pos_embed_dim = num_patches\n\n        if self.use_abs_pos:\n            if self.sep_pos_embed:\n                self.pos_embed_spatial = nn.Parameter(\n                    torch.zeros(1, self.patch_dims[1] * self.patch_dims[2], embed_dim)\n                )\n                self.pos_embed_temporal = nn.Parameter(\n                    torch.zeros(1, self.patch_dims[0], embed_dim)\n                )\n                if self.cls_embed_on:\n                    self.pos_embed_class = nn.Parameter(torch.zeros(1, 1, embed_dim))\n            else:\n                self.pos_embed = nn.Parameter(\n                    torch.zeros(\n                        1,\n                        pos_embed_dim,\n                        embed_dim,\n                    ),\n                    requires_grad=not self.use_fixed_sincos_pos,\n                )\n\n        if self.drop_rate > 0.0:\n            self.pos_drop = nn.Dropout(p=self.drop_rate)\n\n        dim_mul, head_mul = torch.ones(depth + 1), torch.ones(depth + 1)\n        for i in range(len(cfg.MVIT.DIM_MUL)):\n            dim_mul[cfg.MVIT.DIM_MUL[i][0]] = cfg.MVIT.DIM_MUL[i][1]\n        for i in range(len(cfg.MVIT.HEAD_MUL)):\n            
head_mul[cfg.MVIT.HEAD_MUL[i][0]] = cfg.MVIT.HEAD_MUL[i][1]\n\n        pool_q = [[] for i in range(cfg.MVIT.DEPTH)]\n        pool_kv = [[] for i in range(cfg.MVIT.DEPTH)]\n        stride_q = [[] for i in range(cfg.MVIT.DEPTH)]\n        stride_kv = [[] for i in range(cfg.MVIT.DEPTH)]\n\n        for i in range(len(cfg.MVIT.POOL_Q_STRIDE)):\n            stride_q[cfg.MVIT.POOL_Q_STRIDE[i][0]] = cfg.MVIT.POOL_Q_STRIDE[i][1:]\n            if cfg.MVIT.POOL_KVQ_KERNEL is not None:\n                pool_q[cfg.MVIT.POOL_Q_STRIDE[i][0]] = cfg.MVIT.POOL_KVQ_KERNEL\n            else:\n                pool_q[cfg.MVIT.POOL_Q_STRIDE[i][0]] = [\n                    s + 1 if s > 1 else s for s in cfg.MVIT.POOL_Q_STRIDE[i][1:]\n                ]\n\n        # If POOL_KV_STRIDE_ADAPTIVE is not None, initialize POOL_KV_STRIDE.\n        if cfg.MVIT.POOL_KV_STRIDE_ADAPTIVE is not None:\n            _stride_kv = cfg.MVIT.POOL_KV_STRIDE_ADAPTIVE\n            cfg.MVIT.POOL_KV_STRIDE = []\n            for i in range(cfg.MVIT.DEPTH):\n                if len(stride_q[i]) > 0:\n                    _stride_kv = [\n                        max(_stride_kv[d] // stride_q[i][d], 1)\n                        for d in range(len(_stride_kv))\n                    ]\n                cfg.MVIT.POOL_KV_STRIDE.append([i] + _stride_kv)\n\n        for i in range(len(cfg.MVIT.POOL_KV_STRIDE)):\n            stride_kv[cfg.MVIT.POOL_KV_STRIDE[i][0]] = cfg.MVIT.POOL_KV_STRIDE[i][1:]\n            if cfg.MVIT.POOL_KVQ_KERNEL is not None:\n                pool_kv[cfg.MVIT.POOL_KV_STRIDE[i][0]] = cfg.MVIT.POOL_KVQ_KERNEL\n            else:\n                pool_kv[cfg.MVIT.POOL_KV_STRIDE[i][0]] = [\n                    s + 1 if s > 1 else s for s in cfg.MVIT.POOL_KV_STRIDE[i][1:]\n                ]\n\n        self.pool_q = pool_q\n        self.pool_kv = pool_kv\n        self.stride_q = stride_q\n        self.stride_kv = stride_kv\n\n        self.norm_stem = norm_layer(embed_dim) if cfg.MVIT.NORM_STEM else None\n\n        
input_size = self.patch_dims\n\n        if self.enable_rev:\n            # rev does not allow cls token\n            assert not self.cls_embed_on\n\n            self.rev_backbone = ReversibleMViT(cfg, self)\n\n            embed_dim = round_width(embed_dim, dim_mul.prod(), divisor=num_heads)\n\n            self.fuse = TwoStreamFusion(cfg.MVIT.REV.RESPATH_FUSE, dim=2 * embed_dim)\n\n            if \"concat\" in self.cfg.MVIT.REV.RESPATH_FUSE:\n                self.norm = norm_layer(2 * embed_dim)\n            else:\n                self.norm = norm_layer(embed_dim)\n\n        else:\n            self.blocks = nn.ModuleList()\n\n            for i in range(depth):\n                num_heads = round_width(num_heads, head_mul[i])\n                if cfg.MVIT.DIM_MUL_IN_ATT:\n                    dim_out = round_width(\n                        embed_dim,\n                        dim_mul[i],\n                        divisor=round_width(num_heads, head_mul[i]),\n                    )\n                else:\n                    dim_out = round_width(\n                        embed_dim,\n                        dim_mul[i + 1],\n                        divisor=round_width(num_heads, head_mul[i + 1]),\n                    )\n                attention_block = MultiScaleBlock(\n                    dim=embed_dim,\n                    dim_out=dim_out,\n                    num_heads=num_heads,\n                    input_size=input_size,\n                    mlp_ratio=mlp_ratio,\n                    qkv_bias=qkv_bias,\n                    drop_rate=self.drop_rate,\n                    drop_path=dpr[i],\n                    norm_layer=norm_layer,\n                    kernel_q=pool_q[i] if len(pool_q) > i else [],\n                    kernel_kv=pool_kv[i] if len(pool_kv) > i else [],\n                    stride_q=stride_q[i] if len(stride_q) > i else [],\n                    stride_kv=stride_kv[i] if len(stride_kv) > i else [],\n                    mode=mode,\n                    
has_cls_embed=self.cls_embed_on,\n                    pool_first=pool_first,\n                    rel_pos_spatial=self.rel_pos_spatial,\n                    rel_pos_temporal=self.rel_pos_temporal,\n                    rel_pos_zero_init=cfg.MVIT.REL_POS_ZERO_INIT,\n                    residual_pooling=cfg.MVIT.RESIDUAL_POOLING,\n                    dim_mul_in_att=cfg.MVIT.DIM_MUL_IN_ATT,\n                    separate_qkv=cfg.MVIT.SEPARATE_QKV,\n                )\n\n                if cfg.MODEL.ACT_CHECKPOINT:\n                    attention_block = checkpoint_wrapper(attention_block)\n                self.blocks.append(attention_block)\n                if len(stride_q[i]) > 0:\n                    input_size = [\n                        size // stride for size, stride in zip(input_size, stride_q[i])\n                    ]\n\n                embed_dim = dim_out\n\n            self.norm = norm_layer(embed_dim)\n\n        if self.enable_detection:\n            self.head = head_helper.ResNetRoIHead(\n                dim_in=[embed_dim],\n                num_classes=num_classes,\n                pool_size=[[temporal_size // self.patch_stride[0], 1, 1]],\n                resolution=[[cfg.DETECTION.ROI_XFORM_RESOLUTION] * 2],\n                scale_factor=[cfg.DETECTION.SPATIAL_SCALE_FACTOR],\n                dropout_rate=cfg.MODEL.DROPOUT_RATE,\n                act_func=cfg.MODEL.HEAD_ACT,\n                aligned=cfg.DETECTION.ALIGNED,\n            )\n        else:\n            self.head = head_helper.TransformerBasicHead(\n                (\n                    2 * embed_dim\n                    if (\"concat\" in cfg.MVIT.REV.RESPATH_FUSE and self.enable_rev)\n                    else embed_dim\n                ),\n                num_classes,\n                dropout_rate=cfg.MODEL.DROPOUT_RATE,\n                act_func=cfg.MODEL.HEAD_ACT,\n                cfg=cfg,\n            )\n        if self.use_abs_pos:\n            if self.sep_pos_embed:\n                
trunc_normal_(self.pos_embed_spatial, std=0.02)\n                trunc_normal_(self.pos_embed_temporal, std=0.02)\n                if self.cls_embed_on:\n                    trunc_normal_(self.pos_embed_class, std=0.02)\n            else:\n                trunc_normal_(self.pos_embed, std=0.02)\n                if self.use_fixed_sincos_pos:\n                    pos_embed = get_3d_sincos_pos_embed(\n                        self.pos_embed.shape[-1],\n                        self.H,\n                        self.T,\n                        cls_token=self.cls_embed_on,\n                    )\n                    self.pos_embed.data.copy_(\n                        torch.from_numpy(pos_embed).float().unsqueeze(0)\n                    )\n\n        if self.cls_embed_on:\n            trunc_normal_(self.cls_token, std=0.02)\n        self.apply(self._init_weights)\n\n        self.head.projection.weight.data.mul_(head_init_scale)\n        self.head.projection.bias.data.mul_(head_init_scale)\n\n        self.feat_size, self.feat_stride = calc_mvit_feature_geometry(cfg)\n\n    def _init_weights(self, m):\n        if isinstance(m, (nn.Linear, nn.Conv2d, nn.Conv3d)):\n            nn.init.trunc_normal_(m.weight, std=0.02)\n            if isinstance(m, nn.Linear) and m.bias is not None:\n                nn.init.constant_(m.bias, 0.02)\n        elif isinstance(m, nn.LayerNorm):\n            nn.init.constant_(m.bias, 0.02)\n            nn.init.constant_(m.weight, 1.0)\n\n    @torch.jit.ignore\n    def no_weight_decay(self):\n        names = []\n        if self.cfg.MVIT.ZERO_DECAY_POS_CLS:\n            if self.use_abs_pos:\n                if self.sep_pos_embed:\n                    names.extend(\n                        [\n                            \"pos_embed_spatial\",\n                            \"pos_embed_temporal\",\n                            \"pos_embed_class\",\n                        ]\n                    )\n                else:\n                    
names.append(\"pos_embed\")\n            if self.rel_pos_spatial:\n                names.extend([\"rel_pos_h\", \"rel_pos_w\", \"rel_pos_hw\"])\n            if self.rel_pos_temporal:\n                names.extend([\"rel_pos_t\"])\n            if self.cls_embed_on:\n                names.append(\"cls_token\")\n\n        return names\n\n    def _get_pos_embed(self, pos_embed, bcthw):\n        if len(bcthw) == 4:\n            t, h, w = 1, bcthw[-2], bcthw[-1]\n        else:\n            t, h, w = bcthw[-3], bcthw[-2], bcthw[-1]\n        if self.cls_embed_on:\n            cls_pos_embed = pos_embed[:, 0:1, :]\n            pos_embed = pos_embed[:, 1:]\n        txy_num = pos_embed.shape[1]\n        p_t, p_h, p_w = self.patch_dims\n        assert p_t * p_h * p_w == txy_num\n\n        if (p_t, p_h, p_w) != (t, h, w):\n            new_pos_embed = F.interpolate(\n                pos_embed[:, :, :].reshape(1, p_t, p_h, p_w, -1).permute(0, 4, 1, 2, 3),\n                size=(t, h, w),\n                mode=\"trilinear\",\n            )\n            pos_embed = new_pos_embed.reshape(1, -1, t * h * w).permute(0, 2, 1)\n\n        if self.cls_embed_on:\n            pos_embed = torch.cat((cls_pos_embed, pos_embed), dim=1)\n\n        return pos_embed\n\n    def _forward_reversible(self, x):\n        \"\"\"\n        Reversible specific code for forward computation.\n        \"\"\"\n        # rev does not support cls token or detection\n        assert not self.cls_embed_on\n        assert not self.enable_detection\n\n        x = self.rev_backbone(x)\n\n        if self.use_mean_pooling:\n            x = self.fuse(x)\n            x = x.mean(1)\n            x = self.norm(x)\n        else:\n            x = self.norm(x)\n            x = self.fuse(x)\n            x = x.mean(1)\n\n        x = self.head(x)\n\n        return x\n\n    def forward(self, x, bboxes=None, return_attn=False):\n        x = x[0]\n        x, bcthw = self.patch_embed(x)\n        bcthw = list(bcthw)\n        if len(bcthw) 
== 4:  # Fix bcthw in case of 4D tensor\n            bcthw.insert(2, torch.tensor(self.T))\n        T, H, W = bcthw[-3], bcthw[-2], bcthw[-1]\n        assert len(bcthw) == 5 and (T, H, W) == (self.T, self.H, self.W), bcthw\n        B, N, C = x.shape\n\n        s = 1 if self.cls_embed_on else 0\n        if self.use_fixed_sincos_pos:\n            x += self.pos_embed[:, s:, :]  # s: on/off cls token\n\n        if self.cls_embed_on:\n            cls_tokens = self.cls_token.expand(\n                B, -1, -1\n            )  # stole cls_tokens impl from Phil Wang, thanks\n            if self.use_fixed_sincos_pos:\n                cls_tokens = cls_tokens + self.pos_embed[:, :s, :]\n            x = torch.cat((cls_tokens, x), dim=1)\n\n        if self.use_abs_pos:\n            if self.sep_pos_embed:\n                pos_embed = self.pos_embed_spatial.repeat(\n                    1, self.patch_dims[0], 1\n                ) + torch.repeat_interleave(\n                    self.pos_embed_temporal,\n                    self.patch_dims[1] * self.patch_dims[2],\n                    dim=1,\n                )\n                if self.cls_embed_on:\n                    pos_embed = torch.cat([self.pos_embed_class, pos_embed], 1)\n                x += self._get_pos_embed(pos_embed, bcthw)\n            else:\n                x += self._get_pos_embed(self.pos_embed, bcthw)\n\n        if self.drop_rate:\n            x = self.pos_drop(x)\n\n        if self.norm_stem:\n            x = self.norm_stem(x)\n\n        thw = [T, H, W]\n\n        if self.enable_rev:\n            x = self._forward_reversible(x)\n\n        else:\n            for blk in self.blocks:\n                x, thw = blk(x, thw)\n\n            if self.enable_detection:\n                assert not self.enable_rev\n\n                x = self.norm(x)\n                if self.cls_embed_on:\n                    x = x[:, 1:]\n\n                B, _, C = x.shape\n                x = x.transpose(1, 2).reshape(B, C, thw[0], thw[1], 
thw[2])\n\n                x = self.head([x], bboxes)\n\n            else:\n                if self.use_mean_pooling:\n                    if self.cls_embed_on:\n                        x = x[:, 1:]\n                    x = x.mean(1)\n                    x = self.norm(x)\n                elif self.cls_embed_on:\n                    x = self.norm(x)\n                    x = x[:, 0]\n                else:  # this is default, [norm->mean]\n                    x = self.norm(x)\n                    x = x.mean(1)\n                x = self.head(x)\n\n        return x\n"
  },
  {
    "path": "slowfast/utils/__init__.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n"
  },
  {
    "path": "slowfast/utils/ava_eval_helper.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n##############################################################################\n#\n# Based on:\n# --------------------------------------------------------\n# ActivityNet\n# Copyright (c) 2015 ActivityNet\n# Licensed under The MIT License\n# [see https://github.com/activitynet/ActivityNet/blob/master/LICENSE for details]\n# --------------------------------------------------------\n\n\"\"\"Helper functions for AVA evaluation.\"\"\"\n\nfrom __future__ import absolute_import, division, print_function, unicode_literals\n\nimport csv\nimport logging\nimport pprint\nimport time\nfrom collections import defaultdict\n\nimport numpy as np\nimport slowfast.utils.distributed as du\nfrom slowfast.utils.env import pathmgr\nfrom vision.fair.slowfast.ava_evaluation import (\n    object_detection_evaluation,\n    standard_fields,\n)\n\nlogger = logging.getLogger(__name__)\n\n\ndef make_image_key(video_id, timestamp):\n    \"\"\"Returns a unique identifier for a video id & timestamp.\"\"\"\n    return \"%s,%04d\" % (video_id, int(timestamp))\n\n\ndef read_csv(csv_file, class_whitelist=None, load_score=False):\n    \"\"\"Loads boxes and class labels from a CSV file in the AVA format.\n    CSV file format described at https://research.google.com/ava/download.html.\n    Args:\n      csv_file: A file object.\n      class_whitelist: If provided, boxes corresponding to 
(integer) class labels\n        not in this set are skipped.\n    Returns:\n      boxes: A dictionary mapping each unique image key (string) to a list of\n        boxes, given as coordinates [y1, x1, y2, x2].\n      labels: A dictionary mapping each unique image key (string) to a list of\n        integer class labels, matching the corresponding box in `boxes`.\n      scores: A dictionary mapping each unique image key (string) to a list of\n        score values, matching the corresponding label in `labels`. If\n        scores are not provided in the csv, then they will default to 1.0.\n    \"\"\"\n    boxes = defaultdict(list)\n    labels = defaultdict(list)\n    scores = defaultdict(list)\n    with pathmgr.open(csv_file, \"r\") as f:\n        reader = csv.reader(f)\n        for row in reader:\n            assert len(row) in [7, 8], \"Wrong number of columns: \" + str(row)\n            image_key = make_image_key(row[0], row[1])\n            x1, y1, x2, y2 = [float(n) for n in row[2:6]]\n            action_id = int(row[6])\n            if class_whitelist and action_id not in class_whitelist:\n                continue\n            score = 1.0\n            if load_score:\n                score = float(row[7])\n            boxes[image_key].append([y1, x1, y2, x2])\n            labels[image_key].append(action_id)\n            scores[image_key].append(score)\n    return boxes, labels, scores\n\n\ndef read_exclusions(exclusions_file):\n    \"\"\"Reads a CSV file of excluded timestamps.\n    Args:\n      exclusions_file: A file object containing a csv of video-id,timestamp.\n    Returns:\n      A set of strings containing excluded image keys, e.g. 
\"aaaaaaaaaaa,0904\",\n      or an empty set if exclusions file is None.\n    \"\"\"\n    excluded = set()\n    if exclusions_file:\n        with pathmgr.open(exclusions_file, \"r\") as f:\n            reader = csv.reader(f)\n            for row in reader:\n                assert len(row) == 2, \"Expected only 2 columns, got: \" + str(row)\n                excluded.add(make_image_key(row[0], row[1]))\n    return excluded\n\n\ndef read_labelmap(labelmap_file):\n    \"\"\"Read label map and class ids.\"\"\"\n\n    labelmap = []\n    class_ids = set()\n    name = \"\"\n    class_id = \"\"\n    with pathmgr.open(labelmap_file, \"r\") as f:\n        for line in f:\n            if line.startswith(\"  name:\"):\n                name = line.split('\"')[1]\n            elif line.startswith(\"  id:\") or line.startswith(\"  label_id:\"):\n                class_id = int(line.strip().split(\" \")[-1])\n                labelmap.append({\"id\": class_id, \"name\": name})\n                class_ids.add(class_id)\n    return labelmap, class_ids\n\n\ndef evaluate_ava_from_files(labelmap, groundtruth, detections, exclusions):\n    \"\"\"Run AVA evaluation given annotation/prediction files.\"\"\"\n\n    categories, class_whitelist = read_labelmap(labelmap)\n    excluded_keys = read_exclusions(exclusions)\n    groundtruth = read_csv(groundtruth, class_whitelist, load_score=False)\n    detections = read_csv(detections, class_whitelist, load_score=True)\n    run_evaluation(categories, groundtruth, detections, excluded_keys)\n\n\ndef evaluate_ava(\n    preds,\n    original_boxes,\n    metadata,\n    excluded_keys,\n    class_whitelist,\n    categories,\n    groundtruth=None,\n    video_idx_to_name=None,\n    name=\"latest\",\n):\n    \"\"\"Run AVA evaluation given numpy arrays.\"\"\"\n\n    eval_start = time.time()\n\n    detections = get_ava_eval_data(\n        preds,\n        original_boxes,\n        metadata,\n        class_whitelist,\n        video_idx_to_name=video_idx_to_name,\n    
)\n\n    logger.info(\"Evaluating with %d unique GT frames.\" % len(groundtruth[0]))\n    logger.info(\"Evaluating with %d unique detection frames\" % len(detections[0]))\n\n    write_results(detections, \"detections_%s.csv\" % name)\n    write_results(groundtruth, \"groundtruth_%s.csv\" % name)\n\n    results = run_evaluation(categories, groundtruth, detections, excluded_keys)\n\n    logger.info(\"AVA eval done in %f seconds.\" % (time.time() - eval_start))\n    return results[\"PascalBoxes_Precision/mAP@0.5IOU\"]\n\n\ndef run_evaluation(categories, groundtruth, detections, excluded_keys, verbose=True):\n    \"\"\"AVA evaluation main logic.\"\"\"\n\n    pascal_evaluator = object_detection_evaluation.PascalDetectionEvaluator(categories)\n\n    boxes, labels, _ = groundtruth\n\n    gt_keys = []\n    pred_keys = []\n\n    for image_key in boxes:\n        if image_key in excluded_keys:\n            logging.info(\n                (\"Found excluded timestamp in ground truth: %s. It will be ignored.\"),\n                image_key,\n            )\n            continue\n        pascal_evaluator.add_single_ground_truth_image_info(\n            image_key,\n            {\n                standard_fields.InputDataFields.groundtruth_boxes: np.array(\n                    boxes[image_key], dtype=float\n                ),\n                standard_fields.InputDataFields.groundtruth_classes: np.array(\n                    labels[image_key], dtype=int\n                ),\n                standard_fields.InputDataFields.groundtruth_difficult: np.zeros(\n                    len(boxes[image_key]), dtype=bool\n                ),\n            },\n        )\n\n        gt_keys.append(image_key)\n\n    boxes, labels, scores = detections\n\n    for image_key in boxes:\n        if image_key in excluded_keys:\n            logging.info(\n                (\"Found excluded timestamp in detections: %s. 
It will be ignored.\"),\n                image_key,\n            )\n            continue\n        pascal_evaluator.add_single_detected_image_info(\n            image_key,\n            {\n                standard_fields.DetectionResultFields.detection_boxes: np.array(\n                    boxes[image_key], dtype=float\n                ),\n                standard_fields.DetectionResultFields.detection_classes: np.array(\n                    labels[image_key], dtype=int\n                ),\n                standard_fields.DetectionResultFields.detection_scores: np.array(\n                    scores[image_key], dtype=float\n                ),\n            },\n        )\n\n        pred_keys.append(image_key)\n\n    metrics = pascal_evaluator.evaluate()\n\n    if du.is_master_proc():\n        pprint.pprint(metrics, indent=2)\n    return metrics\n\n\ndef get_ava_eval_data(\n    scores,\n    boxes,\n    metadata,\n    class_whitelist,\n    verbose=False,\n    video_idx_to_name=None,\n):\n    \"\"\"\n    Convert our data format into the data format used in official AVA\n    evaluation.\n    \"\"\"\n\n    out_scores = defaultdict(list)\n    out_labels = defaultdict(list)\n    out_boxes = defaultdict(list)\n    count = 0\n    for i in range(scores.shape[0]):\n        video_idx = int(np.round(metadata[i][0]))\n        sec = int(np.round(metadata[i][1]))\n\n        video = video_idx_to_name[video_idx]\n\n        key = video + \",\" + \"%04d\" % (sec)\n        batch_box = boxes[i].tolist()\n        # The first is batch idx.\n        batch_box = [batch_box[j] for j in [0, 2, 1, 4, 3]]\n\n        one_scores = scores[i].tolist()\n        for cls_idx, score in enumerate(one_scores):\n            if cls_idx + 1 in class_whitelist:\n                out_scores[key].append(score)\n                out_labels[key].append(cls_idx + 1)\n                out_boxes[key].append(batch_box[1:])\n                count += 1\n\n    return out_boxes, out_labels, out_scores\n\n\ndef 
write_results(detections, filename):\n    \"\"\"Write prediction results in the official AVA CSV format.\"\"\"\n    start = time.time()\n\n    boxes, labels, scores = detections\n    with pathmgr.open(filename, \"w\") as f:\n        for key in boxes.keys():\n            for box, label, score in zip(boxes[key], labels[key], scores[key]):\n                f.write(\n                    \"%s,%.03f,%.03f,%.03f,%.03f,%d,%.04f\\n\"\n                    % (key, box[1], box[0], box[3], box[2], label, score)\n                )\n\n    logger.info(\"AVA results written to %s\" % filename)\n    logger.info(\"\\ttook %d seconds.\" % (time.time() - start))\n"
  },
  {
    "path": "slowfast/utils/benchmark.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved\n\"\"\"\nFunctions for benchmarks.\n\"\"\"\n\nimport pprint\n\nimport numpy as np\nimport slowfast.utils.logging as logging\nimport slowfast.utils.misc as misc\nimport torch\nimport tqdm\nfrom fvcore.common.timer import Timer\nfrom slowfast.datasets import loader\nfrom slowfast.utils.env import setup_environment\n\nlogger = logging.get_logger(__name__)\n\n\ndef benchmark_data_loading(cfg):\n    \"\"\"\n    Benchmark the speed of data loading in PySlowFast.\n    Args:\n\n        cfg (CfgNode): configs. Details can be found in\n            slowfast/config/defaults.py\n    \"\"\"\n    # Set up environment.\n    setup_environment()\n    # Set random seed from configs.\n    np.random.seed(cfg.RNG_SEED)\n    torch.manual_seed(cfg.RNG_SEED)\n\n    # Setup logging format.\n    logging.setup_logging(cfg.OUTPUT_DIR)\n\n    # Print config.\n    logger.info(\"Benchmark data loading with config:\")\n    logger.info(pprint.pformat(cfg))\n\n    timer = Timer()\n    dataloader = loader.construct_loader(cfg, \"train\")\n    logger.info(\"Initialize loader using {:.2f} seconds.\".format(timer.seconds()))\n    # Total batch size across different machines.\n    batch_size = cfg.TRAIN.BATCH_SIZE * cfg.NUM_SHARDS\n    log_period = cfg.BENCHMARK.LOG_PERIOD\n    epoch_times = []\n    # Test for a few epochs.\n    for cur_epoch in range(cfg.BENCHMARK.NUM_EPOCHS):\n        timer = Timer()\n        timer_epoch = Timer()\n        iter_times = []\n        if cfg.BENCHMARK.SHUFFLE:\n            loader.shuffle_dataset(dataloader, cur_epoch)\n        for cur_iter, _ in enumerate(tqdm.tqdm(dataloader)):\n            if cur_iter > 0 and cur_iter % log_period == 0:\n                iter_times.append(timer.seconds())\n                ram_usage, ram_total = misc.cpu_mem_usage()\n                logger.info(\n                    \"Epoch {}: {} iters ({} videos) in {:.2f} seconds. 
\"\n                    \"RAM Usage: {:.2f}/{:.2f} GB.\".format(\n                        cur_epoch,\n                        log_period,\n                        log_period * batch_size,\n                        iter_times[-1],\n                        ram_usage,\n                        ram_total,\n                    )\n                )\n                timer.reset()\n        epoch_times.append(timer_epoch.seconds())\n        ram_usage, ram_total = misc.cpu_mem_usage()\n        logger.info(\n            \"Epoch {}: in total {} iters ({} videos) in {:.2f} seconds. \"\n            \"RAM Usage: {:.2f}/{:.2f} GB.\".format(\n                cur_epoch,\n                len(dataloader),\n                len(dataloader) * batch_size,\n                epoch_times[-1],\n                ram_usage,\n                ram_total,\n            )\n        )\n        logger.info(\n            \"Epoch {}: on average every {} iters ({} videos) take {:.2f}/{:.2f} \"\n            \"(avg/std) seconds.\".format(\n                cur_epoch,\n                log_period,\n                log_period * batch_size,\n                np.mean(iter_times),\n                np.std(iter_times),\n            )\n        )\n    logger.info(\n        \"On average every epoch ({} videos) takes {:.2f}/{:.2f} \"\n        \"(avg/std) seconds.\".format(\n            len(dataloader) * batch_size,\n            np.mean(epoch_times),\n            np.std(epoch_times),\n        )\n    )\n"
  },
  {
    "path": "slowfast/utils/bn_helper.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"bn helper.\"\"\"\n\nimport itertools\n\nimport torch\n\n\n@torch.no_grad()\ndef compute_and_update_bn_stats(model, data_loader, num_batches=200):\n    \"\"\"\n    Compute and update the batch norm stats to make it more precise. During\n    training both bn stats and the weight are changing after every iteration,\n    so the bn can not precisely reflect the latest stats of the current model.\n    Here the bn stats is recomputed without change of weights, to make the\n    running mean and running var more precise.\n    Args:\n        model (model): the model using to compute and update the bn stats.\n        data_loader (dataloader): dataloader using to provide inputs.\n        num_batches (int): running iterations using to compute the stats.\n    \"\"\"\n\n    # Prepares all the bn layers.\n    bn_layers = [\n        m\n        for m in model.modules()\n        if any(\n            (\n                isinstance(m, bn_type)\n                for bn_type in (\n                    torch.nn.BatchNorm1d,\n                    torch.nn.BatchNorm2d,\n                    torch.nn.BatchNorm3d,\n                )\n            )\n        )\n    ]\n\n    # In order to make the running stats only reflect the current batch, the\n    # momentum is disabled.\n    # bn.running_mean = (1 - momentum) * bn.running_mean + momentum * batch_mean\n    # Setting the momentum to 1.0 to compute the stats without momentum.\n    momentum_actual = [bn.momentum for bn in bn_layers]\n    for bn in bn_layers:\n        bn.momentum = 1.0\n\n    # Calculates the running iterations for precise stats computation.\n    running_mean = [torch.zeros_like(bn.running_mean) for bn in bn_layers]\n    running_square_mean = [torch.zeros_like(bn.running_var) for bn in bn_layers]\n\n    for ind, (inputs, _, _) in enumerate(itertools.islice(data_loader, num_batches)):\n        # Forwards the model 
to update the bn stats.\n        if isinstance(inputs, (list,)):\n            for i in range(len(inputs)):\n                inputs[i] = inputs[i].float().cuda(non_blocking=True)\n        else:\n            inputs = inputs.cuda(non_blocking=True)\n        model(inputs)\n\n        for i, bn in enumerate(bn_layers):\n            # Accumulates the bn stats.\n            running_mean[i] += (bn.running_mean - running_mean[i]) / (ind + 1)\n            # $E(x^2) = Var(x) + E(x)^2$.\n            cur_square_mean = bn.running_var + bn.running_mean**2\n            running_square_mean[i] += (cur_square_mean - running_square_mean[i]) / (\n                ind + 1\n            )\n\n    for i, bn in enumerate(bn_layers):\n        bn.running_mean = running_mean[i]\n        # Var(x) = $E(x^2) - E(x)^2$.\n        bn.running_var = running_square_mean[i] - bn.running_mean**2\n        # Sets the precise bn stats.\n        bn.momentum = momentum_actual[i]\n"
  },
  {
    "path": "slowfast/utils/c2_model_loading.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Caffe2 to PyTorch checkpoint name converting utility.\"\"\"\n\nimport re\n\n\ndef get_name_convert_func():\n    \"\"\"\n    Get the function to convert Caffe2 layer names to PyTorch layer names.\n    Returns:\n        (func): function to convert parameter name from Caffe2 format to PyTorch\n        format.\n    \"\"\"\n    pairs = [\n        # ------------------------------------------------------------\n        # 'nonlocal_conv3_1_theta_w' -> 's3.pathway0_nonlocal3.conv_g.weight'\n        [\n            r\"^nonlocal_conv([0-9]+)_([0-9]+)_(.*)\",\n            r\"s\\1.pathway0_nonlocal\\2_\\3\",\n        ],\n        # 'theta' -> 'conv_theta'\n        [r\"^(.*)_nonlocal([0-9]+)_(theta)(.*)\", r\"\\1_nonlocal\\2.conv_\\3\\4\"],\n        # 'g' -> 'conv_g'\n        [r\"^(.*)_nonlocal([0-9]+)_(g)(.*)\", r\"\\1_nonlocal\\2.conv_\\3\\4\"],\n        # 'phi' -> 'conv_phi'\n        [r\"^(.*)_nonlocal([0-9]+)_(phi)(.*)\", r\"\\1_nonlocal\\2.conv_\\3\\4\"],\n        # 'out' -> 'conv_out'\n        [r\"^(.*)_nonlocal([0-9]+)_(out)(.*)\", r\"\\1_nonlocal\\2.conv_\\3\\4\"],\n        # 'nonlocal_conv4_5_bn_s' -> 's4.pathway0_nonlocal3.bn.weight'\n        [r\"^(.*)_nonlocal([0-9]+)_(bn)_(.*)\", r\"\\1_nonlocal\\2.\\3.\\4\"],\n        # ------------------------------------------------------------\n        # 't_pool1_subsample_bn' -> 's1_fuse.conv_f2s.bn.running_mean'\n        [r\"^t_pool1_subsample_bn_(.*)\", r\"s1_fuse.bn.\\1\"],\n        # 't_pool1_subsample' -> 's1_fuse.conv_f2s'\n        [r\"^t_pool1_subsample_(.*)\", r\"s1_fuse.conv_f2s.\\1\"],\n        # 't_res4_5_branch2c_bn_subsample_bn_rm' -> 's4_fuse.conv_f2s.bias'\n        [\n            r\"^t_res([0-9]+)_([0-9]+)_branch2c_bn_subsample_bn_(.*)\",\n            r\"s\\1_fuse.bn.\\3\",\n        ],\n        # 't_pool1_subsample' -> 's1_fuse.conv_f2s'\n        [\n            
r\"^t_res([0-9]+)_([0-9]+)_branch2c_bn_subsample_(.*)\",\n            r\"s\\1_fuse.conv_f2s.\\3\",\n        ],\n        # ------------------------------------------------------------\n        # 'res4_4_branch_2c_bn_b' -> 's4.pathway0_res4.branch2.c_bn_b'\n        [\n            r\"^res([0-9]+)_([0-9]+)_branch([0-9]+)([a-z])_(.*)\",\n            r\"s\\1.pathway0_res\\2.branch\\3.\\4_\\5\",\n        ],\n        # 'res_conv1_bn_' -> 's1.pathway0_stem.bn.'\n        [r\"^res_conv1_bn_(.*)\", r\"s1.pathway0_stem.bn.\\1\"],\n        # 'conv1_xy_w_momentum' -> 's1.pathway0_stem.conv_xy.'\n        [r\"^conv1_xy(.*)\", r\"s1.pathway0_stem.conv_xy\\1\"],\n        # 'conv1_w_momentum' -> 's1.pathway0_stem.conv.'\n        [r\"^conv1_(.*)\", r\"s1.pathway0_stem.conv.\\1\"],\n        # 'res4_0_branch1_w' -> 'S4.pathway0_res0.branch1.weight'\n        [\n            r\"^res([0-9]+)_([0-9]+)_branch([0-9]+)_(.*)\",\n            r\"s\\1.pathway0_res\\2.branch\\3_\\4\",\n        ],\n        # 'res_conv1_' -> 's1.pathway0_stem.conv.'\n        [r\"^res_conv1_(.*)\", r\"s1.pathway0_stem.conv.\\1\"],\n        # ------------------------------------------------------------\n        # 'res4_4_branch_2c_bn_b' -> 's4.pathway0_res4.branch2.c_bn_b'\n        [\n            r\"^t_res([0-9]+)_([0-9]+)_branch([0-9]+)([a-z])_(.*)\",\n            r\"s\\1.pathway1_res\\2.branch\\3.\\4_\\5\",\n        ],\n        # 'res_conv1_bn_' -> 's1.pathway0_stem.bn.'\n        [r\"^t_res_conv1_bn_(.*)\", r\"s1.pathway1_stem.bn.\\1\"],\n        # 'conv1_w_momentum' -> 's1.pathway0_stem.conv.'\n        [r\"^t_conv1_(.*)\", r\"s1.pathway1_stem.conv.\\1\"],\n        # 'res4_0_branch1_w' -> 'S4.pathway0_res0.branch1.weight'\n        [\n            r\"^t_res([0-9]+)_([0-9]+)_branch([0-9]+)_(.*)\",\n            r\"s\\1.pathway1_res\\2.branch\\3_\\4\",\n        ],\n        # 'res_conv1_' -> 's1.pathway0_stem.conv.'\n        [r\"^t_res_conv1_(.*)\", r\"s1.pathway1_stem.conv.\\1\"],\n        # 
------------------------------------------------------------\n        # pred_ -> head.projection.\n        [r\"pred_(.*)\", r\"head.projection.\\1\"],\n        # '.b_bn_fc' -> '.se.fc'\n        [r\"(.*)b_bn_fc(.*)\", r\"\\1se.fc\\2\"],\n        # conv_5 -> head.conv_5.\n        [r\"conv_5(.*)\", r\"head.conv_5\\1\"],\n        # conv_5 -> head.conv_5.\n        [r\"lin_5(.*)\", r\"head.lin_5\\1\"],\n        # '.bn_b' -> '.weight'\n        [r\"(.*)bn.b\\Z\", r\"\\1bn.bias\"],\n        # '.bn_s' -> '.weight'\n        [r\"(.*)bn.s\\Z\", r\"\\1bn.weight\"],\n        # '_bn_rm' -> '.running_mean'\n        [r\"(.*)bn.rm\\Z\", r\"\\1bn.running_mean\"],\n        # '_bn_riv' -> '.running_var'\n        [r\"(.*)bn.riv\\Z\", r\"\\1bn.running_var\"],\n        # '_b' -> '.bias'\n        [r\"(.*)[\\._]b\\Z\", r\"\\1.bias\"],\n        # '_w' -> '.weight'\n        [r\"(.*)[\\._]w\\Z\", r\"\\1.weight\"],\n    ]\n\n    def convert_caffe2_name_to_pytorch(caffe2_layer_name):\n        \"\"\"\n        Convert the caffe2_layer_name to pytorch format by apply the list of\n        regular expressions.\n        Args:\n            caffe2_layer_name (str): caffe2 layer name.\n        Returns:\n            (str): pytorch layer name.\n        \"\"\"\n        for source, dest in pairs:\n            caffe2_layer_name = re.sub(source, dest, caffe2_layer_name)\n        return caffe2_layer_name\n\n    return convert_caffe2_name_to_pytorch\n"
  },
  {
    "path": "slowfast/utils/checkpoint.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Functions that handle saving and loading of checkpoints.\"\"\"\n\nimport copy\nimport math\nimport os\nimport pickle\nfrom collections import OrderedDict\n\nimport numpy as np\nimport slowfast.utils.distributed as du\nimport slowfast.utils.logging as logging\nimport torch\nfrom slowfast.utils.c2_model_loading import get_name_convert_func\nfrom slowfast.utils.env import checkpoint_pathmgr as pathmgr\n\nlogger = logging.get_logger(__name__)\n\n\ndef make_checkpoint_dir(path_to_job):\n    \"\"\"\n    Creates the checkpoint directory (if not present already).\n    Args:\n        path_to_job (string): the path to the folder of the current job.\n    \"\"\"\n    checkpoint_dir = os.path.join(path_to_job, \"checkpoints\")\n    # Create the checkpoint dir from the master process\n    if du.is_master_proc() and not pathmgr.exists(checkpoint_dir):\n        try:\n            pathmgr.mkdirs(checkpoint_dir)\n        except Exception:\n            pass\n    return checkpoint_dir\n\n\ndef get_checkpoint_dir(path_to_job):\n    \"\"\"\n    Get path for storing checkpoints.\n    Args:\n        path_to_job (string): the path to the folder of the current job.\n    \"\"\"\n    return os.path.join(path_to_job, \"checkpoints\")\n\n\ndef get_path_to_checkpoint(path_to_job, epoch, task=\"\"):\n    \"\"\"\n    Get the full path to a checkpoint file.\n    Args:\n        path_to_job (string): the path to the folder of the current job.\n        epoch (int): the number of epoch for the checkpoint.\n    \"\"\"\n    if task != \"\":\n        name = \"{}_checkpoint_epoch_{:05d}.pyth\".format(task, epoch)\n    else:\n        name = \"checkpoint_epoch_{:05d}.pyth\".format(epoch)\n    return os.path.join(get_checkpoint_dir(path_to_job), name)\n\n\ndef get_last_checkpoint(path_to_job, task):\n    \"\"\"\n    Get the last checkpoint from the checkpointing folder.\n    Args:\n        
path_to_job (string): the path to the folder of the current job.\n    \"\"\"\n\n    d = get_checkpoint_dir(path_to_job)\n    names = pathmgr.ls(d) if pathmgr.exists(d) else []\n    if task != \"\":\n        names = [f for f in names if \"{}_checkpoint\".format(task) in f]\n    else:\n        names = [f for f in names if f.startswith(\"checkpoint\")]\n    if len(names) == 0:\n        return None\n    # Sort the checkpoints by epoch.\n    name = sorted(names)[-1]\n    return os.path.join(d, name)\n\n\ndef has_checkpoint(path_to_job):\n    \"\"\"\n    Determines if the given directory contains a checkpoint.\n    Args:\n        path_to_job (string): the path to the folder of the current job.\n    \"\"\"\n    d = get_checkpoint_dir(path_to_job)\n    files = pathmgr.ls(d) if pathmgr.exists(d) else []\n    return any(\"checkpoint\" in f for f in files)\n\n\ndef is_checkpoint_epoch(cfg, cur_epoch, multigrid_schedule=None):\n    \"\"\"\n    Determine if a checkpoint should be saved on current epoch.\n    Args:\n        cfg (CfgNode): configs to save.\n        cur_epoch (int): current number of epoch of the model.\n        multigrid_schedule (List): schedule for multigrid training.\n    \"\"\"\n    if cur_epoch + 1 == cfg.SOLVER.MAX_EPOCH:\n        return True\n    if multigrid_schedule is not None:\n        prev_epoch = 0\n        for s in multigrid_schedule:\n            if cur_epoch < s[-1]:\n                period = max((s[-1] - prev_epoch) // cfg.MULTIGRID.EVAL_FREQ + 1, 1)\n                return (s[-1] - 1 - cur_epoch) % period == 0\n            prev_epoch = s[-1]\n\n    return (cur_epoch + 1) % cfg.TRAIN.CHECKPOINT_PERIOD == 0\n\n\ndef save_checkpoint(path_to_job, model, optimizer, epoch, cfg, scaler=None):\n    \"\"\"\n    Save a checkpoint.\n    Args:\n        model (model): model to save the weight to the checkpoint.\n        optimizer (optim): optimizer to save the historical state.\n        epoch (int): current number of epoch of the model.\n        cfg 
(CfgNode): configs to save.\n        scaler (GradScaler): the mixed precision scale.\n    \"\"\"\n    # Save checkpoints only from the master process.\n    if not du.is_master_proc(cfg.NUM_GPUS * cfg.NUM_SHARDS):\n        return\n    # Ensure that the checkpoint dir exists.\n    pathmgr.mkdirs(get_checkpoint_dir(path_to_job))\n    # Omit the DDP wrapper in the multi-gpu setting.\n    sd = model.module.state_dict() if cfg.NUM_GPUS > 1 else model.state_dict()\n    normalized_sd = sub_to_normal_bn(sd)\n\n    # Record the state.\n    checkpoint = {\n        \"epoch\": epoch,\n        \"model_state\": normalized_sd,\n        \"optimizer_state\": optimizer.state_dict(),\n        \"cfg\": cfg.dump(),\n    }\n    if scaler is not None:\n        checkpoint[\"scaler_state\"] = scaler.state_dict()\n    # Write the checkpoint.\n    path_to_checkpoint = get_path_to_checkpoint(path_to_job, epoch + 1, cfg.TASK)\n    with pathmgr.open(path_to_checkpoint, \"wb\") as f:\n        torch.save(checkpoint, f)\n    return path_to_checkpoint\n\n\ndef inflate_weight(state_dict_2d, state_dict_3d):\n    \"\"\"\n    Inflate 2D model weights in state_dict_2d to the 3D model weights in\n    state_dict_3d. The details can be found in:\n    Joao Carreira, and Andrew Zisserman.\n    \"Quo vadis, action recognition? 
a new model and the kinetics dataset.\"\n    Args:\n        state_dict_2d (OrderedDict): a dict of parameters from a 2D model.\n        state_dict_3d (OrderedDict): a dict of parameters from a 3D model.\n    Returns:\n        state_dict_inflated (OrderedDict): a dict of inflated parameters.\n    \"\"\"\n    state_dict_inflated = OrderedDict()\n    for k, v2d in state_dict_2d.items():\n        assert k in state_dict_3d.keys()\n        v3d = state_dict_3d[k]\n        # Inflate the weight of 2D conv to 3D conv.\n        if len(v2d.shape) == 4 and len(v3d.shape) == 5:\n            logger.info(\"Inflate {}: {} -> {}: {}\".format(k, v2d.shape, k, v3d.shape))\n            # The spatial and channel dimensions need to match.\n            assert v2d.shape[-2:] == v3d.shape[-2:]\n            assert v2d.shape[:2] == v3d.shape[:2]\n            # Repeat the 2D kernel along the temporal axis and divide by its\n            # length so that the summed response is preserved.\n            v3d = v2d.unsqueeze(2).repeat(1, 1, v3d.shape[2], 1, 1) / v3d.shape[2]\n        elif v2d.shape == v3d.shape:\n            v3d = v2d\n        else:\n            logger.info(\n                \"Unexpected {}: {} -|> {}: {}\".format(k, v2d.shape, k, v3d.shape)\n            )\n        state_dict_inflated[k] = v3d.clone()\n    return state_dict_inflated\n\n\ndef load_checkpoint(\n    path_to_checkpoint,\n    model,\n    data_parallel=True,\n    optimizer=None,\n    scaler=None,\n    inflation=False,\n    convert_from_caffe2=False,\n    epoch_reset=False,\n    clear_name_pattern=(),\n    image_init=False,\n):\n    \"\"\"\n    Load the checkpoint from the given file. 
If inflation is True, inflate the\n    2D Conv weights from the checkpoint to 3D Conv.\n    Args:\n        path_to_checkpoint (string): path to the checkpoint to load.\n        model (model): model to load the weights from the checkpoint.\n        data_parallel (bool): if true, model is wrapped by\n        torch.nn.parallel.DistributedDataParallel.\n        optimizer (optim): optimizer to load the historical state.\n        scaler (GradScaler): GradScaler to load the mixed precision scale.\n        inflation (bool): if True, inflate the weights from the checkpoint.\n        convert_from_caffe2 (bool): if True, load the model from caffe2 and\n            convert it to pytorch.\n        epoch_reset (bool): if True, reset #train iterations from the checkpoint.\n        clear_name_pattern (string): if given, this (sub)string will be cleared\n            from a layer name if it can be matched.\n    Returns:\n        (int): the number of training epoch of the checkpoint.\n    \"\"\"\n    logger.info(\"Loading network weights from {}.\".format(path_to_checkpoint))\n\n    # Account for the DDP wrapper in the multi-gpu setting.\n    ms = model.module if data_parallel else model\n    if convert_from_caffe2:\n        with pathmgr.open(path_to_checkpoint, \"rb\") as f:\n            caffe2_checkpoint = pickle.load(f, encoding=\"latin1\")\n        state_dict = OrderedDict()\n        name_convert_func = get_name_convert_func()\n        for key in caffe2_checkpoint[\"blobs\"].keys():\n            converted_key = name_convert_func(key)\n            converted_key = c2_normal_to_sub_bn(converted_key, ms.state_dict())\n            if converted_key in ms.state_dict():\n                c2_blob_shape = caffe2_checkpoint[\"blobs\"][key].shape\n                model_blob_shape = ms.state_dict()[converted_key].shape\n\n                # expand shape dims if they differ (eg for converting linear to conv params)\n                if len(c2_blob_shape) < len(model_blob_shape):\n                 
   c2_blob_shape += (1,) * (len(model_blob_shape) - len(c2_blob_shape))\n                    caffe2_checkpoint[\"blobs\"][key] = np.reshape(\n                        caffe2_checkpoint[\"blobs\"][key], c2_blob_shape\n                    )\n                # Load BN stats to Sub-BN.\n                if (\n                    len(model_blob_shape) == 1\n                    and len(c2_blob_shape) == 1\n                    and model_blob_shape[0] > c2_blob_shape[0]\n                    and model_blob_shape[0] % c2_blob_shape[0] == 0\n                ):\n                    caffe2_checkpoint[\"blobs\"][key] = np.concatenate(\n                        [caffe2_checkpoint[\"blobs\"][key]]\n                        * (model_blob_shape[0] // c2_blob_shape[0])\n                    )\n                    c2_blob_shape = caffe2_checkpoint[\"blobs\"][key].shape\n\n                if c2_blob_shape == tuple(model_blob_shape):\n                    state_dict[converted_key] = torch.tensor(\n                        caffe2_checkpoint[\"blobs\"][key]\n                    ).clone()\n                    logger.info(\n                        \"{}: {} => {}: {}\".format(\n                            key,\n                            c2_blob_shape,\n                            converted_key,\n                            tuple(model_blob_shape),\n                        )\n                    )\n                else:\n                    logger.warn(\n                        \"!! {}: {} does not match {}: {}\".format(\n                            key,\n                            c2_blob_shape,\n                            converted_key,\n                            tuple(model_blob_shape),\n                        )\n                    )\n            else:\n                if not any(\n                    prefix in key for prefix in [\"momentum\", \"lr\", \"model_iter\"]\n                ):\n                    logger.warn(\n                        \"!! 
{}: can not be converted, got {}\".format(key, converted_key)\n                    )\n        diff = set(ms.state_dict()) - set(state_dict)\n        diff = {d for d in diff if \"num_batches_tracked\" not in d}\n        if len(diff) > 0:\n            logger.warn(\"Not loaded {}\".format(diff))\n        ms.load_state_dict(state_dict, strict=False)\n        epoch = -1\n    else:\n        # Load the checkpoint on CPU to avoid GPU mem spike.\n        with pathmgr.open(path_to_checkpoint, \"rb\") as f:\n            checkpoint = torch.load(f, map_location=\"cpu\")\n        model_state_dict_3d = (\n            model.module.state_dict() if data_parallel else model.state_dict()\n        )\n        checkpoint[\"model_state\"] = normal_to_sub_bn(\n            checkpoint[\"model_state\"], model_state_dict_3d\n        )\n        if inflation:\n            # Try to inflate the model.\n            inflated_model_dict = inflate_weight(\n                checkpoint[\"model_state\"], model_state_dict_3d\n            )\n            ms.load_state_dict(inflated_model_dict, strict=False)\n        else:\n            if clear_name_pattern:\n                for item in clear_name_pattern:\n                    model_state_dict_new = OrderedDict()\n                    for k in checkpoint[\"model_state\"]:\n                        if item in k:\n                            k_re = k.replace(\n                                item, \"\", 1\n                            )  # only repace first occurence of pattern\n                            model_state_dict_new[k_re] = checkpoint[\"model_state\"][k]\n                            logger.info(\"renaming: {} -> {}\".format(k, k_re))\n                        else:\n                            model_state_dict_new[k] = checkpoint[\"model_state\"][k]\n                    checkpoint[\"model_state\"] = model_state_dict_new\n\n            pre_train_dict = checkpoint[\"model_state\"]\n            model_dict = ms.state_dict()\n\n            if image_init:\n    
            if (\n                    \"pos_embed\" in pre_train_dict.keys()\n                    and \"pos_embed_xy\" in model_dict.keys()\n                ):\n                    print(\n                        pre_train_dict[\"pos_embed\"].shape,\n                        model_dict[\"pos_embed_xy\"].shape,\n                        model_dict[\"pos_embed_class\"].shape,\n                    )\n                    if (\n                        pre_train_dict[\"pos_embed\"].shape[1]\n                        == model_dict[\"pos_embed_xy\"].shape[1] + 1\n                    ):\n                        pre_train_dict[\"pos_embed_xy\"] = pre_train_dict[\"pos_embed\"][\n                            :, 1:\n                        ]\n                        pre_train_dict[\"pos_embed_class\"] = pre_train_dict[\"pos_embed\"][\n                            :, :1\n                        ]\n\n                if (\n                    \"patch_embed.proj.weight\" in pre_train_dict.keys()\n                    and \"patch_embed.proj.weight\" in model_dict.keys()\n                ):\n                    print(\n                        pre_train_dict[\"patch_embed.proj.weight\"].shape,\n                        model_dict[\"patch_embed.proj.weight\"].shape,\n                    )\n                    if (\n                        len(pre_train_dict[\"patch_embed.proj.weight\"].shape) == 4\n                        and len(model_dict[\"patch_embed.proj.weight\"].shape) == 5\n                    ):  # img->video\n                        t = model_dict[\"patch_embed.proj.weight\"].shape[2]\n                        pre_train_dict[\"patch_embed.proj.weight\"] = pre_train_dict[\n                            \"patch_embed.proj.weight\"\n                        ][:, :, None, :, :].repeat(1, 1, t, 1, 1)\n                        logger.info(\n                            f\"inflate patch_embed.proj.weight to {pre_train_dict['patch_embed.proj.weight'].shape}\"\n                        )\n          
          elif (\n                        len(pre_train_dict[\"patch_embed.proj.weight\"].shape) == 5\n                        and len(model_dict[\"patch_embed.proj.weight\"].shape) == 4\n                    ):  # video->img\n                        orig_shape = pre_train_dict[\"patch_embed.proj.weight\"].shape\n                        # pre_train_dict[\"patch_embed.proj.weight\"] = pre_train_dict[\"patch_embed.proj.weight\"][:, :, orig_shape[2]//2, :, :] # take center\n                        pre_train_dict[\"patch_embed.proj.weight\"] = pre_train_dict[\n                            \"patch_embed.proj.weight\"\n                        ].sum(2)  # take avg\n                        logger.info(\n                            f\"deflate patch_embed.proj.weight from {orig_shape} to {pre_train_dict['patch_embed.proj.weight'].shape}\"\n                        )\n                        if (\n                            \"pos_embed_spatial\" in pre_train_dict.keys()\n                            and \"pos_embed\" in model_dict.keys()\n                        ):\n                            pos_embds = pre_train_dict[\"pos_embed_spatial\"]\n                            if (\n                                \"pos_embed_class\" in pre_train_dict.keys()\n                                and pos_embds.shape != model_dict[\"pos_embed\"].shape\n                            ):\n                                pos_embds = torch.cat(\n                                    [\n                                        pre_train_dict[\"pos_embed_class\"],\n                                        pos_embds,\n                                    ],\n                                    1,\n                                )\n                                pre_train_dict.pop(\"pos_embed_class\")\n                            if pos_embds.shape == model_dict[\"pos_embed\"].shape:\n                                pre_train_dict[\"pos_embed\"] = pos_embds\n                                
pre_train_dict.pop(\"pos_embed_spatial\")\n                                logger.info(\n                                    f\"successful surgery of pos embed w/ shape {pos_embds.shape} \"\n                                )\n                            else:\n                                logger.info(\n                                    f\"UNSUCCESSFUL surgery of pos embed w/ shape {pos_embds.shape} \"\n                                )\n\n                qkv = [\n                    \"attn.pool_k.weight\",\n                    \"attn.pool_q.weight\",\n                    \"attn.pool_v.weight\",\n                ]\n                for k in pre_train_dict.keys():\n                    if (\n                        any([x in k for x in qkv])\n                        and pre_train_dict[k].shape != model_dict[k].shape\n                    ):\n                        # print(pre_train_dict[k].shape, model_dict[k].shape)\n                        logger.info(\n                            f\"inflate {k} from {pre_train_dict[k].shape} to {model_dict[k].shape}\"\n                        )\n                        t = model_dict[k].shape[2]\n                        pre_train_dict[k] = pre_train_dict[k].repeat(1, 1, t, 1, 1)\n\n                for k in pre_train_dict.keys():\n                    if (\n                        \"rel_pos\" in k\n                        and pre_train_dict[k].shape != model_dict[k].shape\n                    ):\n                        # print(pre_train_dict[k].shape, model_dict[k].shape)\n                        logger.info(\n                            f\"interpolating {k} from {pre_train_dict[k].shape} to {model_dict[k].shape}\"\n                        )\n                        new_pos_embed = torch.nn.functional.interpolate(\n                            pre_train_dict[k]\n                            .reshape(1, pre_train_dict[k].shape[0], -1)\n                            .permute(0, 2, 1),\n                            
size=model_dict[k].shape[0],\n                            mode=\"linear\",\n                        )\n                        new_pos_embed = (\n                            new_pos_embed.reshape(-1, model_dict[k].shape[0])\n                            .permute(1, 0)\n                            .squeeze()\n                        )\n                        pre_train_dict[k] = new_pos_embed\n\n            # Match pre-trained weights that have same shape as current model.\n            pre_train_dict_match = {}\n            not_used_layers = []\n            for k, v in pre_train_dict.items():\n                if k in model_dict:\n                    if v.size() == model_dict[k].size():\n                        pre_train_dict_match[k] = v\n                    else:\n                        if \"attn.rel_pos\" in k:\n                            v_shape = v.shape\n                            v = v.t().unsqueeze(0)\n                            v = torch.nn.functional.interpolate(\n                                v,\n                                size=model_dict[k].size()[0],\n                                mode=\"linear\",\n                            )\n                            v = v[0].t()\n                            pre_train_dict_match[k] = v\n                            logger.info(\n                                \"{} reshaped from {} to {}\".format(k, v_shape, v.shape)\n                            )\n                        elif \"pos_embed_temporal\" in k:\n                            v_shape = v.shape\n                            v = torch.nn.functional.interpolate(\n                                v.permute(0, 2, 1),\n                                size=model_dict[k].shape[1],\n                                mode=\"linear\",\n                            )\n                            pre_train_dict_match[k] = v.permute(0, 2, 1)\n                            logger.info(\n                                \"{} reshaped from {} to {}\".format(\n            
                        k, v_shape, pre_train_dict_match[k].shape\n                                )\n                            )\n                        elif \"pos_embed_spatial\" in k:\n                            v_shape = v.shape\n                            pretrain_size = int(math.sqrt(v_shape[1]))\n                            model_size = int(math.sqrt(model_dict[k].shape[1]))\n                            assert pretrain_size * pretrain_size == v_shape[1]\n                            assert model_size * model_size == model_dict[k].shape[1]\n                            v = torch.nn.functional.interpolate(\n                                v.reshape(1, pretrain_size, pretrain_size, -1).permute(\n                                    0, 3, 1, 2\n                                ),\n                                size=(model_size, model_size),\n                                mode=\"bicubic\",\n                            )\n                            pre_train_dict_match[k] = v.reshape(\n                                1, -1, model_size * model_size\n                            ).permute(0, 2, 1)\n                            logger.info(\n                                \"{} reshaped from {} to {}\".format(\n                                    k, v_shape, pre_train_dict_match[k].shape\n                                )\n                            )\n                        else:\n                            not_used_layers.append(k)\n                else:\n                    not_used_layers.append(k)\n            # Weights that do not have match from the pre-trained model.\n            not_load_layers = [\n                k for k in model_dict.keys() if k not in pre_train_dict_match.keys()\n            ]\n            # Log weights that are not loaded with the pre-trained weights.\n            if not_load_layers:\n                for k in not_load_layers:\n                    logger.info(\"Network weights {} not loaded.\".format(k))\n            if 
not_used_layers:\n                for k in not_used_layers:\n                    logger.info(\"Network weights {} not used.\".format(k))\n            # Load pre-trained weights.\n            missing_keys, unexpected_keys = ms.load_state_dict(\n                pre_train_dict_match, strict=False\n            )\n\n            print(\"missing keys: {}\".format(missing_keys))\n            print(\"unexpected keys: {}\".format(unexpected_keys))\n            epoch = -1\n\n        # Load the optimizer state (commonly not done when fine-tuning).\n        if \"epoch\" in checkpoint.keys() and not epoch_reset:\n            epoch = checkpoint[\"epoch\"]\n            if optimizer:\n                optimizer.load_state_dict(checkpoint[\"optimizer_state\"])\n            if scaler:\n                scaler.load_state_dict(checkpoint[\"scaler_state\"])\n        else:\n            epoch = -1\n    return epoch\n\n\ndef sub_to_normal_bn(sd):\n    \"\"\"\n    Convert the Sub-BN parameters to normal BN parameters in a state dict.\n    There are two copies of BN layers in a Sub-BN implementation: `bn.bn` and\n    `bn.split_bn`. `bn.split_bn` is used during training and\n    \"compute_precise_bn\". Before saving or evaluation, its stats are copied to\n    `bn.bn`. 
We rename `bn.bn` to `bn` and store it to be consistent with normal\n    BN layers.\n    Args:\n        sd (OrderedDict): a dict of parameters which might contain Sub-BN\n        parameters.\n    Returns:\n        new_sd (OrderedDict): a dict with Sub-BN parameters reshaped to\n        normal parameters.\n    \"\"\"\n    new_sd = copy.deepcopy(sd)\n    modifications = [\n        (\"bn.bn.running_mean\", \"bn.running_mean\"),\n        (\"bn.bn.running_var\", \"bn.running_var\"),\n        (\"bn.split_bn.num_batches_tracked\", \"bn.num_batches_tracked\"),\n    ]\n    to_remove = [\"bn.bn.\", \".split_bn.\"]\n    for key in sd:\n        for before, after in modifications:\n            if key.endswith(before):\n                new_key = key.split(before)[0] + after\n                new_sd[new_key] = new_sd.pop(key)\n\n        for rm in to_remove:\n            if rm in key and key in new_sd:\n                del new_sd[key]\n\n    for key in new_sd:\n        if key.endswith(\"bn.weight\") or key.endswith(\"bn.bias\"):\n            if len(new_sd[key].size()) == 4:\n                assert all(d == 1 for d in new_sd[key].size()[1:])\n                new_sd[key] = new_sd[key][:, 0, 0, 0]\n\n    return new_sd\n\n\ndef c2_normal_to_sub_bn(key, model_keys):\n    \"\"\"\n    Convert a BN parameter name to its Sub-BN name if the model contains\n    Sub-BNs.\n    Args:\n        key (str): converted Caffe2 parameter name.\n        model_keys (OrderedDict): state dict of the target model; only its\n        keys are consulted.\n    Returns:\n        (str or None): the matching parameter name, or None if a BN running\n        stat has no counterpart in the model.\n    \"\"\"\n    if \"bn.running_\" in key:\n        if key in model_keys:\n            return key\n\n        new_key = key.replace(\"bn.running_\", \"bn.split_bn.running_\")\n        if new_key in model_keys:\n            return new_key\n    else:\n        return key\n\n\ndef normal_to_sub_bn(checkpoint_sd, model_sd):\n    \"\"\"\n    Convert BN parameters to Sub-BN parameters if model contains Sub-BNs.\n    Args:\n        checkpoint_sd 
(OrderedDict): source dict of parameters.\n        model_sd (OrderedDict): target dict of parameters.\n    Returns:\n        new_sd (OrderedDict): converted dict of parameters.\n    \"\"\"\n    for key in model_sd:\n        if key not in checkpoint_sd:\n            if \"bn.split_bn.\" in key:\n                load_key = key.replace(\"bn.split_bn.\", \"bn.\")\n                bn_key = key.replace(\"bn.split_bn.\", \"bn.bn.\")\n                checkpoint_sd[key] = checkpoint_sd.pop(load_key)\n                checkpoint_sd[bn_key] = checkpoint_sd[key]\n\n    for key in model_sd:\n        if key in checkpoint_sd:\n            model_blob_shape = model_sd[key].shape\n            c2_blob_shape = checkpoint_sd[key].shape\n\n            if (\n                len(model_blob_shape) == 1\n                and len(c2_blob_shape) == 1\n                and model_blob_shape[0] > c2_blob_shape[0]\n                and model_blob_shape[0] % c2_blob_shape[0] == 0\n            ):\n                before_shape = checkpoint_sd[key].shape\n                checkpoint_sd[key] = torch.cat(\n                    [checkpoint_sd[key]] * (model_blob_shape[0] // c2_blob_shape[0])\n                )\n                logger.info(\n                    \"{} {} -> {}\".format(key, before_shape, checkpoint_sd[key].shape)\n                )\n    return checkpoint_sd\n\n\ndef load_test_checkpoint(cfg, model):\n    \"\"\"\n    Loading checkpoint logic for testing.\n    \"\"\"\n    # Load a checkpoint to test if applicable.\n    if cfg.TEST.CHECKPOINT_FILE_PATH != \"\":\n        # If no checkpoint found in MODEL_VIS.CHECKPOINT_FILE_PATH or in the current\n        # checkpoint folder, try to load checkpoint from\n        # TEST.CHECKPOINT_FILE_PATH and test it.\n        load_checkpoint(\n            cfg.TEST.CHECKPOINT_FILE_PATH,\n            model,\n            cfg.NUM_GPUS > 1,\n            None,\n            inflation=False,\n            convert_from_caffe2=cfg.TEST.CHECKPOINT_TYPE == \"caffe2\",\n        
)\n    elif has_checkpoint(cfg.OUTPUT_DIR):\n        last_checkpoint = get_last_checkpoint(cfg.OUTPUT_DIR, cfg.TASK)\n        load_checkpoint(last_checkpoint, model, cfg.NUM_GPUS > 1)\n    elif cfg.TRAIN.CHECKPOINT_FILE_PATH != \"\":\n        # If no checkpoint found in TEST.CHECKPOINT_FILE_PATH or in the current\n        # checkpoint folder, try to load checkpoint from\n        # TRAIN.CHECKPOINT_FILE_PATH and test it.\n        load_checkpoint(\n            cfg.TRAIN.CHECKPOINT_FILE_PATH,\n            model,\n            cfg.NUM_GPUS > 1,\n            None,\n            inflation=False,\n            convert_from_caffe2=cfg.TRAIN.CHECKPOINT_TYPE == \"caffe2\",\n        )\n    else:\n        logger.info(\n            \"Unknown way of loading checkpoint. Using with random initialization, only for debugging.\"\n        )\n\n\ndef load_train_checkpoint(cfg, model, optimizer, scaler=None):\n    \"\"\"\n    Loading checkpoint logic for training.\n    \"\"\"\n    if cfg.TRAIN.AUTO_RESUME and has_checkpoint(cfg.OUTPUT_DIR):\n        last_checkpoint = get_last_checkpoint(cfg.OUTPUT_DIR, cfg.TASK)\n        logger.info(\"Load from last checkpoint, {}.\".format(last_checkpoint))\n        checkpoint_epoch = load_checkpoint(\n            last_checkpoint,\n            model,\n            cfg.NUM_GPUS > 1,\n            optimizer,\n            scaler=scaler,\n            clear_name_pattern=cfg.TRAIN.CHECKPOINT_CLEAR_NAME_PATTERN,\n        )\n        start_epoch = checkpoint_epoch + 1\n    elif cfg.TRAIN.CHECKPOINT_FILE_PATH != \"\":\n        logger.info(\"Load from given checkpoint file.\")\n        checkpoint_epoch = load_checkpoint(\n            cfg.TRAIN.CHECKPOINT_FILE_PATH,\n            model,\n            cfg.NUM_GPUS > 1,\n            optimizer,\n            scaler=scaler,\n            inflation=cfg.TRAIN.CHECKPOINT_INFLATE,\n            convert_from_caffe2=cfg.TRAIN.CHECKPOINT_TYPE == \"caffe2\",\n            epoch_reset=cfg.TRAIN.CHECKPOINT_EPOCH_RESET,\n            
clear_name_pattern=cfg.TRAIN.CHECKPOINT_CLEAR_NAME_PATTERN,\n            image_init=cfg.TRAIN.CHECKPOINT_IN_INIT,\n        )\n        start_epoch = checkpoint_epoch + 1\n    else:\n        start_epoch = 0\n\n    return start_epoch\n"
  },
  {
    "path": "slowfast/utils/distributed.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Distributed helpers.\"\"\"\n\nimport functools\nimport logging\nimport pickle\n\nimport torch\nimport torch.distributed as dist\nfrom pytorchvideo.layers.distributed import (  # noqa\n    cat_all_gather,\n    get_local_process_group,\n    get_local_rank,\n    get_local_size,\n    get_world_size,\n    init_distributed_training as _init_distributed_training,\n)\n\n\ndef init_distributed_training(cfg):\n    return _init_distributed_training(cfg.NUM_GPUS, cfg.SHARD_ID)\n\n\ndef all_gather(tensors):\n    \"\"\"\n    All gathers the provided tensors from all processes across machines.\n    Args:\n        tensors (list): tensors to perform all gather across all processes in\n        all machines.\n    \"\"\"\n\n    gather_list = []\n    output_tensor = []\n    world_size = dist.get_world_size()\n    for tensor in tensors:\n        tensor_placeholder = [torch.ones_like(tensor) for _ in range(world_size)]\n        dist.all_gather(tensor_placeholder, tensor, async_op=False)\n        gather_list.append(tensor_placeholder)\n    for gathered_tensor in gather_list:\n        output_tensor.append(torch.cat(gathered_tensor, dim=0))\n    return output_tensor\n\n\ndef all_reduce(tensors, average=True):\n    \"\"\"\n    All reduce the provided tensors from all processes across machines.\n    Args:\n        tensors (list): tensors to perform all reduce across all processes in\n        all machines.\n        average (bool): scales the reduced tensor by the number of overall\n        processes across all machines.\n    \"\"\"\n\n    for tensor in tensors:\n        dist.all_reduce(tensor, async_op=False)\n    if average:\n        world_size = dist.get_world_size()\n        for tensor in tensors:\n            tensor.mul_(1.0 / world_size)\n    return tensors\n\n\ndef init_process_group(\n    local_rank,\n    local_world_size,\n    shard_id,\n    num_shards,\n    
init_method,\n    dist_backend=\"nccl\",\n):\n    \"\"\"\n    Initializes the default process group.\n    Args:\n        local_rank (int): the rank on the current local machine.\n        local_world_size (int): the world size (number of processes running) on\n        the current local machine.\n        shard_id (int): the shard index (machine rank) of the current machine.\n        num_shards (int): number of shards for distributed training.\n        init_method (string): method for initializing process groups.\n            Options include:\n            \"file\": use shared file system to initialize the groups across\n            different processes.\n            \"tcp\": use tcp address to initialize the groups across different\n            processes.\n        dist_backend (string): backend to use for distributed training. Options\n            include gloo, mpi and nccl; details can be found here:\n            https://pytorch.org/docs/stable/distributed.html\n    \"\"\"\n    # Sets the GPU to use.\n    torch.cuda.set_device(local_rank)\n    # Initialize the process group.\n    proc_rank = local_rank + shard_id * local_world_size\n    world_size = local_world_size * num_shards\n    dist.init_process_group(\n        backend=dist_backend,\n        init_method=init_method,\n        world_size=world_size,\n        rank=proc_rank,\n    )\n\n\ndef is_master_proc(num_gpus=8):\n    \"\"\"\n    Determines if the current process is the master process.\n    \"\"\"\n    if torch.distributed.is_initialized():\n        return dist.get_rank() % num_gpus == 0\n    else:\n        return True\n\n\ndef is_root_proc():\n    \"\"\"\n    Determines if the current process is the root process.\n    \"\"\"\n    if torch.distributed.is_initialized():\n        return dist.get_rank() == 0\n    else:\n        return True\n\n\ndef get_rank():\n    \"\"\"\n    Get the rank of the current process.\n    \"\"\"\n    if not dist.is_available():\n        return 0\n    if not dist.is_initialized():\n        
return 0\n    return dist.get_rank()\n\n\ndef synchronize():\n    \"\"\"\n    Helper function to synchronize (barrier) among all processes when\n    using distributed training.\n    \"\"\"\n    if not dist.is_available():\n        return\n    if not dist.is_initialized():\n        return\n    world_size = dist.get_world_size()\n    if world_size == 1:\n        return\n    dist.barrier()\n\n\n@functools.lru_cache()\ndef _get_global_gloo_group():\n    \"\"\"\n    Return a process group based on gloo backend, containing all the ranks.\n    The result is cached.\n    Returns:\n        (group): pytorch dist group.\n    \"\"\"\n    if dist.get_backend() == \"nccl\":\n        return dist.new_group(backend=\"gloo\")\n    else:\n        return dist.group.WORLD\n\n\ndef _serialize_to_tensor(data, group):\n    \"\"\"\n    Serialize the data to a ByteTensor. Note that only the `gloo` and `nccl`\n    backends are supported.\n    Args:\n        data (data): data to be serialized.\n        group (group): pytorch dist group.\n    Returns:\n        tensor (ByteTensor): the serialized tensor.\n    \"\"\"\n\n    backend = dist.get_backend(group)\n    assert backend in [\"gloo\", \"nccl\"]\n    device = torch.device(\"cpu\" if backend == \"gloo\" else \"cuda\")\n\n    buffer = pickle.dumps(data)\n    if len(buffer) > 1024**3:\n        logger = logging.getLogger(__name__)\n        logger.warning(\n            \"Rank {} trying to all-gather {:.2f} GB of data on device {}\".format(\n                get_rank(), len(buffer) / (1024**3), device\n            )\n        )\n    storage = torch.ByteStorage.from_buffer(buffer)\n    tensor = torch.ByteTensor(storage).to(device=device)\n    return tensor\n\n\ndef _pad_to_largest_tensor(tensor, group):\n    \"\"\"\n    Pads the tensor from each GPU to the size of the largest one.\n    Args:\n        tensor (tensor): tensor to pad.\n        group (group): pytorch dist group.\n    Returns:\n        list[int]: size of the tensor, on each rank\n       
 Tensor: padded tensor that has the max size\n    \"\"\"\n    world_size = dist.get_world_size(group=group)\n    assert world_size >= 1, (\n        \"comm.gather/all_gather must be called from ranks within the given group!\"\n    )\n    local_size = torch.tensor([tensor.numel()], dtype=torch.int64, device=tensor.device)\n    size_list = [\n        torch.zeros([1], dtype=torch.int64, device=tensor.device)\n        for _ in range(world_size)\n    ]\n    dist.all_gather(size_list, local_size, group=group)\n    size_list = [int(size.item()) for size in size_list]\n\n    max_size = max(size_list)\n\n    # we pad the tensor because torch all_gather does not support\n    # gathering tensors of different shapes\n    if local_size != max_size:\n        padding = torch.zeros(\n            (max_size - local_size,), dtype=torch.uint8, device=tensor.device\n        )\n        tensor = torch.cat((tensor, padding), dim=0)\n    return size_list, tensor\n\n\ndef all_gather_unaligned(data, group=None):\n    \"\"\"\n    Run all_gather on arbitrary picklable data (not necessarily tensors).\n\n    Args:\n        data: any picklable object\n        group: a torch process group. 
By default, will use a group which\n            contains all ranks on gloo backend.\n\n    Returns:\n        list[data]: list of data gathered from each rank\n    \"\"\"\n    if get_world_size() == 1:\n        return [data]\n    if group is None:\n        group = _get_global_gloo_group()\n    if dist.get_world_size(group) == 1:\n        return [data]\n\n    tensor = _serialize_to_tensor(data, group)\n\n    size_list, tensor = _pad_to_largest_tensor(tensor, group)\n    max_size = max(size_list)\n\n    # receiving Tensor from all ranks\n    tensor_list = [\n        torch.empty((max_size,), dtype=torch.uint8, device=tensor.device)\n        for _ in size_list\n    ]\n    dist.all_gather(tensor_list, tensor, group=group)\n\n    data_list = []\n    for size, tensor in zip(size_list, tensor_list):\n        buffer = tensor.cpu().numpy().tobytes()[:size]\n        data_list.append(pickle.loads(buffer))\n\n    return data_list\n\n\nclass GatherLayer(torch.autograd.Function):\n    \"\"\"Gather tensors from all processes, supporting backward propagation.\"\"\"\n\n    @staticmethod\n    def forward(ctx, input):\n        ctx.save_for_backward(input)\n        output = [torch.zeros_like(input) for _ in range(dist.get_world_size())]\n        dist.all_gather(output, input)\n        return tuple(output)\n\n    @staticmethod\n    def backward(ctx, *grads):\n        (input,) = ctx.saved_tensors\n        grad_out = torch.zeros_like(input)\n        grad_out[:] = grads[dist.get_rank()]\n        return grad_out\n\n\nclass AllGatherWithGradient(torch.autograd.Function):\n    \"\"\"\n    All-gathers the input along dim 0. The backward pass all-reduces the\n    gradient and returns the slice that belongs to the current rank.\n    \"\"\"\n\n    @staticmethod\n    def forward(ctx, input):\n        world_size = dist.get_world_size()\n        x_gather = [torch.ones_like(input) for _ in range(world_size)]\n        torch.distributed.all_gather(x_gather, input, async_op=False)\n        x_gather = torch.cat(x_gather, dim=0)\n        return x_gather\n\n    @staticmethod\n    def backward(ctx, grad_output):\n        reduction = 
torch.distributed.all_reduce(grad_output, async_op=True)\n        reduction.wait()\n\n        world_size = dist.get_world_size()\n        N = grad_output.size(0)\n        mini_batchsize = N // world_size\n        cur_gpu = torch.distributed.get_rank()\n        grad_output = grad_output[\n            cur_gpu * mini_batchsize : (cur_gpu + 1) * mini_batchsize\n        ]\n        return grad_output\n"
  },
  {
    "path": "slowfast/utils/env.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Set up Environment.\"\"\"\n\nfrom iopath.common.file_io import PathManagerFactory\n\n_ENV_SETUP_DONE = False\npathmgr = PathManagerFactory.get(key=\"pyslowfast\")\ncheckpoint_pathmgr = PathManagerFactory.get(key=\"pyslowfast_checkpoint\")\n\n\ndef setup_environment():\n    global _ENV_SETUP_DONE\n    if _ENV_SETUP_DONE:\n        return\n    _ENV_SETUP_DONE = True\n"
  },
  {
    "path": "slowfast/utils/logging.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Logging.\"\"\"\n\nimport atexit\nimport builtins\nimport decimal\nimport functools\nimport logging\nimport os\nimport sys\n\nimport simplejson\nimport slowfast.utils.distributed as du\nfrom slowfast.utils.env import pathmgr\n\n\ndef _suppress_print():\n    \"\"\"\n    Suppresses printing from the current process.\n    \"\"\"\n\n    def print_pass(*objects, sep=\" \", end=\"\\n\", file=sys.stdout, flush=False):\n        pass\n\n    builtins.print = print_pass\n\n\n@functools.lru_cache(maxsize=None)\ndef _cached_log_stream(filename):\n    # Use 1K buffer if writing to cloud storage.\n    io = pathmgr.open(filename, \"a\", buffering=1024 if \"://\" in filename else -1)\n    atexit.register(io.close)\n    return io\n\n\ndef setup_logging(output_dir=None):\n    \"\"\"\n    Sets up the logging for multiple processes. Only enable the logging for the\n    master process, and suppress logging for the non-master processes.\n    \"\"\"\n    # Set up logging format.\n    _FORMAT = \"[%(levelname)s: %(filename)s: %(lineno)4d]: %(message)s\"\n\n    if du.is_master_proc():\n        # Enable logging for the master process.\n        logging.root.handlers = []\n    else:\n        # Suppress logging for non-master processes.\n        _suppress_print()\n\n    logger = logging.getLogger()\n    logger.setLevel(logging.DEBUG)\n    logger.propagate = False\n    plain_formatter = logging.Formatter(\n        \"[%(asctime)s][%(levelname)s] %(filename)s: %(lineno)4d: %(message)s\",\n        datefmt=\"%m/%d %H:%M:%S\",\n    )\n\n    if du.is_master_proc():\n        ch = logging.StreamHandler(stream=sys.stdout)\n        ch.setLevel(logging.DEBUG)\n        ch.setFormatter(plain_formatter)\n        logger.addHandler(ch)\n\n    if output_dir is not None and du.is_master_proc(du.get_world_size()):\n        filename = os.path.join(output_dir, \"stdout.log\")\n        fh = 
logging.StreamHandler(_cached_log_stream(filename))\n        fh.setLevel(logging.DEBUG)\n        fh.setFormatter(plain_formatter)\n        logger.addHandler(fh)\n\n\ndef get_logger(name):\n    \"\"\"\n    Retrieve the logger with the specified name or, if name is None, return a\n    logger which is the root logger of the hierarchy.\n    Args:\n        name (string): name of the logger.\n    \"\"\"\n    return logging.getLogger(name)\n\n\ndef log_json_stats(stats, output_dir=None):\n    \"\"\"\n    Logs json stats.\n    Args:\n        stats (dict): a dictionary of statistical information to log.\n    \"\"\"\n    stats = {\n        k: decimal.Decimal(\"{:.5f}\".format(v)) if isinstance(v, float) else v\n        for k, v in stats.items()\n    }\n    json_stats = simplejson.dumps(stats, sort_keys=True, use_decimal=True)\n    logger = get_logger(__name__)\n    logger.info(\"json_stats: {:s}\".format(json_stats))\n    if du.is_master_proc(du.get_world_size()) and output_dir:\n        filename = os.path.join(output_dir, \"json_stats.log\")\n        try:\n            with pathmgr.open(\n                filename, \"a\", buffering=1024 if \"://\" in filename else -1\n            ) as f:\n                f.write(\"json_stats: {:s}\\n\".format(json_stats))\n        except Exception:\n            logger.info(\"Failed to write to json_stats.log: {}\".format(json_stats))\n"
  },
  {
    "path": "slowfast/utils/lr_policy.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Learning rate policy.\"\"\"\n\nimport math\n\n\ndef get_lr_at_epoch(cfg, cur_epoch):\n    \"\"\"\n    Retrieve the learning rate of the current epoch with the option to perform\n    warm up in the beginning of the training stage.\n    Args:\n        cfg (CfgNode): configs. Details can be found in\n            slowfast/config/defaults.py\n        cur_epoch (float): the number of epoch of the current training stage.\n    \"\"\"\n    lr = get_lr_func(cfg.SOLVER.LR_POLICY)(cfg, cur_epoch)\n    # Perform warm up.\n    if cur_epoch < cfg.SOLVER.WARMUP_EPOCHS:\n        lr_start = cfg.SOLVER.WARMUP_START_LR\n        lr_end = get_lr_func(cfg.SOLVER.LR_POLICY)(cfg, cfg.SOLVER.WARMUP_EPOCHS)\n        alpha = (lr_end - lr_start) / cfg.SOLVER.WARMUP_EPOCHS\n        lr = cur_epoch * alpha + lr_start\n    return lr\n\n\ndef lr_func_cosine(cfg, cur_epoch):\n    \"\"\"\n    Retrieve the learning rate to specified values at specified epoch with the\n    cosine learning rate schedule. Details can be found in:\n    Ilya Loshchilov, and  Frank Hutter\n    SGDR: Stochastic Gradient Descent With Warm Restarts.\n    Args:\n        cfg (CfgNode): configs. 
Details can be found in\n            slowfast/config/defaults.py\n        cur_epoch (float): the number of epoch of the current training stage.\n    \"\"\"\n    offset = cfg.SOLVER.WARMUP_EPOCHS if cfg.SOLVER.COSINE_AFTER_WARMUP else 0.0\n    assert cfg.SOLVER.COSINE_END_LR < cfg.SOLVER.BASE_LR\n    return (\n        cfg.SOLVER.COSINE_END_LR\n        + (cfg.SOLVER.BASE_LR - cfg.SOLVER.COSINE_END_LR)\n        * (\n            math.cos(math.pi * (cur_epoch - offset) / (cfg.SOLVER.MAX_EPOCH - offset))\n            + 1.0\n        )\n        * 0.5\n    )\n\n\ndef lr_func_steps_with_relative_lrs(cfg, cur_epoch):\n    \"\"\"\n    Retrieve the learning rate to specified values at specified epoch with the\n    steps with relative learning rate schedule.\n    Args:\n        cfg (CfgNode): configs. Details can be found in\n            slowfast/config/defaults.py\n        cur_epoch (float): the number of epoch of the current training stage.\n    \"\"\"\n    ind = get_step_index(cfg, cur_epoch)\n    return cfg.SOLVER.LRS[ind] * cfg.SOLVER.BASE_LR\n\n\ndef get_step_index(cfg, cur_epoch):\n    \"\"\"\n    Retrieves the lr step index for the given epoch.\n    Args:\n        cfg (CfgNode): configs. Details can be found in\n            slowfast/config/defaults.py\n        cur_epoch (float): the number of epoch of the current training stage.\n    \"\"\"\n    steps = cfg.SOLVER.STEPS + [cfg.SOLVER.MAX_EPOCH]\n    for ind, step in enumerate(steps):  # NoQA\n        if cur_epoch < step:\n            break\n    return ind - 1\n\n\ndef get_lr_func(lr_policy):\n    \"\"\"\n    Given the configs, retrieve the specified lr policy function.\n    Args:\n        lr_policy (string): the learning rate policy to use for the job.\n    \"\"\"\n    policy = \"lr_func_\" + lr_policy\n    if policy not in globals():\n        raise NotImplementedError(\"Unknown LR policy: {}\".format(lr_policy))\n    else:\n        return globals()[policy]\n"
  },
  {
    "path": "slowfast/utils/meters.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Meters.\"\"\"\n\nimport datetime\nimport os\nfrom collections import defaultdict, deque\n\nimport numpy as np\nimport slowfast.datasets.ava_helper as ava_helper\nimport slowfast.utils.logging as logging\nimport slowfast.utils.metrics as metrics\nimport slowfast.utils.misc as misc\nimport torch\nfrom fvcore.common.timer import Timer\nfrom sklearn.metrics import average_precision_score\nfrom slowfast.utils.ava_eval_helper import (\n    evaluate_ava,\n    read_csv,\n    read_exclusions,\n    read_labelmap,\n)\n\nlogger = logging.get_logger(__name__)\n\n\ndef get_ava_mini_groundtruth(full_groundtruth):\n    \"\"\"\n    Get the groundtruth annotations corresponding the \"subset\" of AVA val set.\n    We define the subset to be the frames such that (second % 4 == 0).\n    We optionally use subset for faster evaluation during training\n    (in order to track training progress).\n    Args:\n        full_groundtruth(dict): list of groundtruth.\n    \"\"\"\n    ret = [defaultdict(list), defaultdict(list), defaultdict(list)]\n\n    for i in range(3):\n        for key in full_groundtruth[i].keys():\n            if int(key.split(\",\")[1]) % 4 == 0:\n                ret[i][key] = full_groundtruth[i][key]\n    return ret\n\n\nclass AVAMeter:\n    \"\"\"\n    Measure the AVA train, val, and test stats.\n    \"\"\"\n\n    def __init__(self, overall_iters, cfg, mode):\n        \"\"\"\n        overall_iters (int): the overall number of iterations of one epoch.\n        cfg (CfgNode): configs.\n        mode (str): `train`, `val`, or `test` mode.\n        \"\"\"\n        self.cfg = cfg\n        self.lr = None\n        self.loss = ScalarMeter(cfg.LOG_PERIOD)\n        self.full_ava_test = cfg.AVA.FULL_TEST_ON_VAL\n        self.mode = mode\n        self.iter_timer = Timer()\n        self.data_timer = Timer()\n        self.net_timer = Timer()\n        self.all_preds = 
[]\n        self.all_ori_boxes = []\n        self.all_metadata = []\n        self.overall_iters = overall_iters\n        self.excluded_keys = read_exclusions(\n            os.path.join(cfg.AVA.ANNOTATION_DIR, cfg.AVA.EXCLUSION_FILE)\n        )\n        self.categories, self.class_whitelist = read_labelmap(\n            os.path.join(cfg.AVA.ANNOTATION_DIR, cfg.AVA.LABEL_MAP_FILE)\n        )\n        gt_filename = os.path.join(cfg.AVA.ANNOTATION_DIR, cfg.AVA.GROUNDTRUTH_FILE)\n        self.full_groundtruth = read_csv(gt_filename, self.class_whitelist)\n        self.mini_groundtruth = get_ava_mini_groundtruth(self.full_groundtruth)\n\n        _, self.video_idx_to_name = ava_helper.load_image_lists(cfg, mode == \"train\")\n        self.output_dir = cfg.OUTPUT_DIR\n\n        self.min_top1_err = 100.0\n        self.min_top5_err = 100.0\n        self.stats = {}\n        self.stats[\"top1_acc\"] = 100.0\n        self.stats[\"top5_acc\"] = 100.0\n\n    def log_iter_stats(self, cur_epoch, cur_iter):\n        \"\"\"\n        Log the stats.\n        Args:\n            cur_epoch (int): the current epoch.\n            cur_iter (int): the current iteration.\n        \"\"\"\n\n        if (cur_iter + 1) % self.cfg.LOG_PERIOD != 0:\n            return\n\n        eta_sec = self.iter_timer.seconds() * (self.overall_iters - cur_iter)\n        eta = str(datetime.timedelta(seconds=int(eta_sec)))\n        if self.mode == \"train\":\n            stats = {\n                \"_type\": \"{}_iter\".format(self.mode),\n                \"cur_epoch\": \"{}/{}\".format(cur_epoch + 1, self.cfg.SOLVER.MAX_EPOCH),\n                \"cur_iter\": \"{}\".format(cur_iter + 1),\n                \"eta\": eta,\n                \"dt\": self.iter_timer.seconds(),\n                \"dt_data\": self.data_timer.seconds(),\n                \"dt_net\": self.net_timer.seconds(),\n                \"mode\": self.mode,\n                \"loss\": self.loss.get_win_median(),\n                \"lr\": self.lr,\n           
 }\n        elif self.mode == \"val\":\n            stats = {\n                \"_type\": \"{}_iter\".format(self.mode),\n                \"cur_epoch\": \"{}/{}\".format(cur_epoch + 1, self.cfg.SOLVER.MAX_EPOCH),\n                \"cur_iter\": \"{}\".format(cur_iter + 1),\n                \"eta\": eta,\n                \"dt\": self.iter_timer.seconds(),\n                \"dt_data\": self.data_timer.seconds(),\n                \"dt_net\": self.net_timer.seconds(),\n                \"mode\": self.mode,\n            }\n        elif self.mode == \"test\":\n            stats = {\n                \"_type\": \"{}_iter\".format(self.mode),\n                \"cur_iter\": \"{}\".format(cur_iter + 1),\n                \"eta\": eta,\n                \"dt\": self.iter_timer.seconds(),\n                \"dt_data\": self.data_timer.seconds(),\n                \"dt_net\": self.net_timer.seconds(),\n                \"mode\": self.mode,\n            }\n        else:\n            raise NotImplementedError(\"Unknown mode: {}\".format(self.mode))\n\n        logging.log_json_stats(stats)\n\n    def iter_tic(self):\n        \"\"\"\n        Start to record time.\n        \"\"\"\n        self.iter_timer.reset()\n        self.data_timer.reset()\n\n    def iter_toc(self):\n        \"\"\"\n        Stop to record time.\n        \"\"\"\n        self.iter_timer.pause()\n        self.net_timer.pause()\n\n    def data_toc(self):\n        self.data_timer.pause()\n        self.net_timer.reset()\n\n    def reset(self):\n        \"\"\"\n        Reset the Meter.\n        \"\"\"\n        self.loss.reset()\n\n        self.all_preds = []\n        self.all_ori_boxes = []\n        self.all_metadata = []\n\n    def update_stats(self, preds, ori_boxes, metadata, loss=None, lr=None):\n        \"\"\"\n        Update the current stats.\n        Args:\n            preds (tensor): prediction embedding.\n            ori_boxes (tensor): original boxes (x1, y1, x2, y2).\n            metadata (tensor): metadata of the 
AVA data.\n            loss (float): loss value.\n            lr (float): learning rate.\n        \"\"\"\n        if self.mode in [\"val\", \"test\"]:\n            self.all_preds.append(preds)\n            self.all_ori_boxes.append(ori_boxes)\n            self.all_metadata.append(metadata)\n        if loss is not None:\n            self.loss.add_value(loss)\n        if lr is not None:\n            self.lr = lr\n\n    def finalize_metrics(self, log=True):\n        \"\"\"\n        Calculate and log the final AVA metrics.\n        \"\"\"\n        all_preds = torch.cat(self.all_preds, dim=0)\n        all_ori_boxes = torch.cat(self.all_ori_boxes, dim=0)\n        all_metadata = torch.cat(self.all_metadata, dim=0)\n\n        if self.mode == \"test\" or (self.full_ava_test and self.mode == \"val\"):\n            groundtruth = self.full_groundtruth\n        else:\n            groundtruth = self.mini_groundtruth\n\n        self.full_map = evaluate_ava(\n            all_preds,\n            all_ori_boxes,\n            all_metadata.tolist(),\n            self.excluded_keys,\n            self.class_whitelist,\n            self.categories,\n            groundtruth=groundtruth,\n            video_idx_to_name=self.video_idx_to_name,\n        )\n        if log:\n            stats = {\"mode\": self.mode, \"map\": self.full_map}\n            logging.log_json_stats(stats, self.output_dir)\n\n        map_str = \"{:.{prec}f}\".format(self.full_map * 100.0, prec=2)\n\n        self.min_top1_err = self.full_map\n        self.stats[\"top1_acc\"] = map_str\n        self.stats[\"top5_acc\"] = map_str\n\n    def log_epoch_stats(self, cur_epoch):\n        \"\"\"\n        Log the stats of the current epoch.\n        Args:\n            cur_epoch (int): the number of current epoch.\n        \"\"\"\n        if self.mode in [\"val\", \"test\"]:\n            self.finalize_metrics(log=False)\n            stats = {\n                \"_type\": \"{}_epoch\".format(self.mode),\n                
\"cur_epoch\": \"{}\".format(cur_epoch + 1),\n                \"mode\": self.mode,\n                \"map\": self.full_map,\n                \"gpu_mem\": \"{:.2f}G\".format(misc.gpu_mem_usage()),\n                \"RAM\": \"{:.2f}/{:.2f}G\".format(*misc.cpu_mem_usage()),\n            }\n            logging.log_json_stats(stats, self.output_dir)\n\n\nclass TestMeter:\n    \"\"\"\n    Perform the multi-view ensemble for testing: each video with an unique index\n    will be sampled with multiple clips, and the predictions of the clips will\n    be aggregated to produce the final prediction for the video.\n    The accuracy is calculated with the given ground truth labels.\n    \"\"\"\n\n    def __init__(\n        self,\n        num_videos,\n        num_clips,\n        num_cls,\n        overall_iters,\n        multi_label=False,\n        ensemble_method=\"sum\",\n    ):\n        \"\"\"\n        Construct tensors to store the predictions and labels. Expect to get\n        num_clips predictions from each video, and calculate the metrics on\n        num_videos videos.\n        Args:\n            num_videos (int): number of videos to test.\n            num_clips (int): number of clips sampled from each video for\n                aggregating the final prediction for the video.\n            num_cls (int): number of classes for each prediction.\n            overall_iters (int): overall iterations for testing.\n            multi_label (bool): if True, use map as the metric.\n            ensemble_method (str): method to perform the ensemble, options\n                include \"sum\", and \"max\".\n        \"\"\"\n\n        self.iter_timer = Timer()\n        self.data_timer = Timer()\n        self.net_timer = Timer()\n        self.num_clips = num_clips\n        self.overall_iters = overall_iters\n        self.multi_label = multi_label\n        self.ensemble_method = ensemble_method\n        # Initialize tensors.\n        self.video_preds = torch.zeros((num_videos, num_cls))\n      
  if multi_label:\n            self.video_preds -= 1e10\n\n        self.video_labels = (\n            torch.zeros((num_videos, num_cls))\n            if multi_label\n            else torch.zeros((num_videos)).long()\n        )\n        self.clip_count = torch.zeros((num_videos)).long()\n        self.topk_accs = []\n        self.stats = {}\n\n        # Reset metric.\n        self.reset()\n\n    def reset(self):\n        \"\"\"\n        Reset the metric.\n        \"\"\"\n        self.clip_count.zero_()\n        self.video_preds.zero_()\n        if self.multi_label:\n            self.video_preds -= 1e10\n        self.video_labels.zero_()\n\n    def update_stats(self, preds, labels, clip_ids):\n        \"\"\"\n        Collect the predictions from the current batch and perform on-the-fly\n        aggregation (sum or max) as the ensemble.\n        Args:\n            preds (tensor): predictions from the current batch. Dimension is\n                N x C where N is the batch size and C is the channel size\n                (num_cls).\n            labels (tensor): the corresponding labels of the current batch.\n                Dimension is N.\n            clip_ids (tensor): clip indexes of the current batch, dimension is\n                N.\n        \"\"\"\n        for ind in range(preds.shape[0]):\n            vid_id = int(clip_ids[ind]) // self.num_clips\n            if self.video_labels[vid_id].sum() > 0:\n                assert torch.equal(\n                    self.video_labels[vid_id].type(torch.FloatTensor),\n                    labels[ind].type(torch.FloatTensor),\n                )\n            self.video_labels[vid_id] = labels[ind]\n            if self.ensemble_method == \"sum\":\n                self.video_preds[vid_id] += preds[ind]\n            elif self.ensemble_method == \"max\":\n                self.video_preds[vid_id] = torch.max(\n                    self.video_preds[vid_id], preds[ind]\n                )\n            else:\n                raise NotImplementedError(\n   
                 \"Ensemble Method {} is not supported\".format(self.ensemble_method)\n                )\n            self.clip_count[vid_id] += 1\n\n    def log_iter_stats(self, cur_iter):\n        \"\"\"\n        Log the stats.\n        Args:\n            cur_iter (int): the current iteration of testing.\n        \"\"\"\n        eta_sec = self.iter_timer.seconds() * (self.overall_iters - cur_iter)\n        eta = str(datetime.timedelta(seconds=int(eta_sec)))\n        stats = {\n            \"split\": \"test_iter\",\n            \"cur_iter\": \"{}\".format(cur_iter + 1),\n            \"eta\": eta,\n            \"time_diff\": self.iter_timer.seconds(),\n        }\n        logging.log_json_stats(stats)\n\n    def iter_tic(self):\n        \"\"\"\n        Start to record time.\n        \"\"\"\n        self.iter_timer.reset()\n        self.data_timer.reset()\n\n    def iter_toc(self):\n        \"\"\"\n        Stop to record time.\n        \"\"\"\n        self.iter_timer.pause()\n        self.net_timer.pause()\n\n    def data_toc(self):\n        self.data_timer.pause()\n        self.net_timer.reset()\n\n    def finalize_metrics(self, ks=(1, 5)):\n        \"\"\"\n        Calculate and log the final ensembled metrics.\n        ks (tuple): list of top-k values for topk_accuracies. 
For example,\n            ks = (1, 5) corresponds to top-1 and top-5 accuracy.\n        \"\"\"\n        clip_check = self.clip_count == self.num_clips\n        if not all(clip_check):\n            logger.warning(\n                \"clip count Ids={} = {} (should be {})\".format(\n                    np.argwhere(~clip_check),\n                    self.clip_count[~clip_check],\n                    self.num_clips,\n                )\n            )\n\n        self.stats = {\"split\": \"test_final\"}\n        if self.multi_label:\n            mean_ap = get_map(\n                self.video_preds.cpu().numpy(), self.video_labels.cpu().numpy()\n            )\n            map_str = \"{:.{prec}f}\".format(mean_ap * 100.0, prec=2)\n            self.stats[\"map\"] = map_str\n            self.stats[\"top1_acc\"] = map_str\n            self.stats[\"top5_acc\"] = map_str\n        else:\n            num_topks_correct = metrics.topks_correct(\n                self.video_preds, self.video_labels, ks\n            )\n            topks = [(x / self.video_preds.size(0)) * 100.0 for x in num_topks_correct]\n            assert len({len(ks), len(topks)}) == 1\n            for k, topk in zip(ks, topks):\n                # self.stats[\"top{}_acc\".format(k)] = topk.cpu().numpy()\n                self.stats[\"top{}_acc\".format(k)] = \"{:.{prec}f}\".format(topk, prec=2)\n        logging.log_json_stats(self.stats)\n\n\nclass ScalarMeter:\n    \"\"\"\n    A scalar meter uses a deque to track a series of scalar values with a given\n    window size. 
It supports calculating the median and average values of the\n    window, and also supports calculating the global average.\n    \"\"\"\n\n    def __init__(self, window_size):\n        \"\"\"\n        Args:\n            window_size (int): size of the max length of the deque.\n        \"\"\"\n        self.deque = deque(maxlen=window_size)\n        self.total = 0.0\n        self.count = 0\n\n    def reset(self):\n        \"\"\"\n        Reset the deque.\n        \"\"\"\n        self.deque.clear()\n        self.total = 0.0\n        self.count = 0\n\n    def add_value(self, value):\n        \"\"\"\n        Add a new scalar value to the deque.\n        \"\"\"\n        self.deque.append(value)\n        self.count += 1\n        self.total += value\n\n    def get_win_median(self):\n        \"\"\"\n        Calculate the current median value of the deque.\n        \"\"\"\n        return np.median(self.deque)\n\n    def get_current_value(self):\n        return self.deque[-1]\n\n    def get_win_avg(self):\n        \"\"\"\n        Calculate the current average value of the deque.\n        \"\"\"\n        return np.mean(self.deque)\n\n    def get_global_avg(self):\n        \"\"\"\n        Calculate the global mean value.\n        \"\"\"\n        return self.total / self.count\n\n\nclass ListMeter:\n    def __init__(self, list_size):\n        \"\"\"\n        Args:\n            list_size (int): size of the list.\n        \"\"\"\n        self.list = np.zeros(list_size)\n        self.total = np.zeros(list_size)\n        self.count = 0\n\n    def reset(self):\n        \"\"\"\n        Reset the meter.\n        \"\"\"\n        self.list = np.zeros_like(self.list)\n        self.total = np.zeros_like(self.total)\n        self.count = 0\n\n    def add_value(self, value):\n        \"\"\"\n        Add a new list value to the meter.\n        \"\"\"\n        self.list = np.array(value)\n        self.count += 1\n        self.total += self.list\n\n    def get_value(self):\n        return 
self.list\n\n    def get_global_avg(self):\n        \"\"\"\n        Calculate the global mean value.\n        \"\"\"\n        return self.total / self.count\n\n\nclass TrainMeter:\n    \"\"\"\n    Measure training stats.\n    \"\"\"\n\n    def __init__(self, epoch_iters, cfg):\n        \"\"\"\n        Args:\n            epoch_iters (int): the overall number of iterations of one epoch.\n            cfg (CfgNode): configs.\n        \"\"\"\n        self._cfg = cfg\n        self.epoch_iters = epoch_iters\n        self.MAX_EPOCH = cfg.SOLVER.MAX_EPOCH * epoch_iters\n        self.iter_timer = Timer()\n        self.data_timer = Timer()\n        self.net_timer = Timer()\n        self.loss = ScalarMeter(cfg.LOG_PERIOD)\n        self.loss_total = 0.0\n        self.lr = None\n        self.grad_norm = None\n        # Current minibatch errors (smoothed over a window).\n        self.mb_top1_err = ScalarMeter(cfg.LOG_PERIOD)\n        self.mb_top5_err = ScalarMeter(cfg.LOG_PERIOD)\n        # Number of misclassified examples.\n        self.num_top1_mis = 0\n        self.num_top5_mis = 0\n        self.num_samples = 0\n        self.output_dir = cfg.OUTPUT_DIR\n        self.multi_loss = None\n\n    def reset(self):\n        \"\"\"\n        Reset the Meter.\n        \"\"\"\n        self.loss.reset()\n        self.loss_total = 0.0\n        self.lr = None\n        self.grad_norm = None\n        self.mb_top1_err.reset()\n        self.mb_top5_err.reset()\n        self.num_top1_mis = 0\n        self.num_top5_mis = 0\n        self.num_samples = 0\n        if self.multi_loss is not None:\n            self.multi_loss.reset()\n\n    def iter_tic(self):\n        \"\"\"\n        Start to record time.\n        \"\"\"\n        self.iter_timer.reset()\n        self.data_timer.reset()\n\n    def iter_toc(self):\n        \"\"\"\n        Stop to record time.\n        \"\"\"\n        self.iter_timer.pause()\n        self.net_timer.pause()\n\n    def data_toc(self):\n        self.data_timer.pause()\n     
   self.net_timer.reset()\n\n    def update_stats(\n        self, top1_err, top5_err, loss, lr, grad_norm, mb_size, multi_loss=None\n    ):\n        \"\"\"\n        Update the current stats.\n        Args:\n            top1_err (float): top1 error rate.\n            top5_err (float): top5 error rate.\n            loss (float): loss value.\n            lr (float): learning rate.\n            mb_size (int): mini batch size.\n            multi_loss (list): a list of values for multi-tasking losses.\n        \"\"\"\n        self.loss.add_value(loss)\n        self.lr = lr\n        self.grad_norm = grad_norm\n        self.loss_total += loss * mb_size\n        self.num_samples += mb_size\n\n        if not self._cfg.DATA.MULTI_LABEL:\n            # Current minibatch stats\n            self.mb_top1_err.add_value(top1_err)\n            self.mb_top5_err.add_value(top5_err)\n            # Aggregate stats\n            self.num_top1_mis += top1_err * mb_size\n            self.num_top5_mis += top5_err * mb_size\n        if multi_loss:\n            if self.multi_loss is None:\n                self.multi_loss = ListMeter(len(multi_loss))\n            self.multi_loss.add_value(multi_loss)\n        if (\n            self._cfg.TRAIN.KILL_LOSS_EXPLOSION_FACTOR > 0.0\n            and len(self.loss.deque) > 6\n        ):\n            prev_loss = 0.0\n            for i in range(2, 7):\n                prev_loss += self.loss.deque[len(self.loss.deque) - i]\n            if loss > self._cfg.TRAIN.KILL_LOSS_EXPLOSION_FACTOR * prev_loss / 5.0:\n                raise RuntimeError(\n                    \"ERROR: Got Loss explosion of {} {}\".format(\n                        loss, datetime.datetime.now()\n                    )\n                )\n\n    def log_iter_stats(self, cur_epoch, cur_iter):\n        \"\"\"\n        log the stats of the current iteration.\n        Args:\n            cur_epoch (int): the number of current epoch.\n            cur_iter (int): the number of current iteration.\n 
       \"\"\"\n        if (cur_iter + 1) % self._cfg.LOG_PERIOD != 0:\n            return\n        eta_sec = self.iter_timer.seconds() * (\n            self.MAX_EPOCH - (cur_epoch * self.epoch_iters + cur_iter + 1)\n        )\n        eta = str(datetime.timedelta(seconds=int(eta_sec)))\n        stats = {\n            \"_type\": \"train_iter_{}\".format(\"ssl\" if self._cfg.TASK == \"ssl\" else \"\"),\n            \"epoch\": \"{}/{}\".format(cur_epoch + 1, self._cfg.SOLVER.MAX_EPOCH),\n            \"iter\": \"{}/{}\".format(cur_iter + 1, self.epoch_iters),\n            \"dt\": self.iter_timer.seconds(),\n            \"dt_data\": self.data_timer.seconds(),\n            \"dt_net\": self.net_timer.seconds(),\n            \"eta\": eta,\n            \"loss\": self.loss.get_win_median(),\n            \"lr\": self.lr,\n            \"grad_norm\": self.grad_norm,\n            \"gpu_mem\": \"{:.2f}G\".format(misc.gpu_mem_usage()),\n        }\n        if not self._cfg.DATA.MULTI_LABEL:\n            stats[\"top1_err\"] = self.mb_top1_err.get_win_median()\n            stats[\"top5_err\"] = self.mb_top5_err.get_win_median()\n        if self.multi_loss is not None:\n            loss_list = self.multi_loss.get_value()\n            for idx, loss in enumerate(loss_list):\n                stats[\"loss_\" + str(idx)] = loss\n        logging.log_json_stats(stats)\n\n    def log_epoch_stats(self, cur_epoch):\n        \"\"\"\n        Log the stats of the current epoch.\n        Args:\n            cur_epoch (int): the number of current epoch.\n        \"\"\"\n        eta_sec = self.iter_timer.seconds() * (\n            self.MAX_EPOCH - (cur_epoch + 1) * self.epoch_iters\n        )\n        eta = str(datetime.timedelta(seconds=int(eta_sec)))\n        stats = {\n            \"_type\": \"train_epoch{}\".format(\"_ssl\" if self._cfg.TASK == \"ssl\" else \"\"),\n            \"epoch\": \"{}/{}\".format(cur_epoch + 1, self._cfg.SOLVER.MAX_EPOCH),\n            \"dt\": self.iter_timer.seconds(),\n  
          \"dt_data\": self.data_timer.seconds(),\n            \"dt_net\": self.net_timer.seconds(),\n            \"eta\": eta,\n            \"lr\": self.lr,\n            \"grad_norm\": self.grad_norm,\n            \"gpu_mem\": \"{:.2f}G\".format(misc.gpu_mem_usage()),\n            \"RAM\": \"{:.2f}/{:.2f}G\".format(*misc.cpu_mem_usage()),\n        }\n        if not self._cfg.DATA.MULTI_LABEL:\n            top1_err = self.num_top1_mis / self.num_samples\n            top5_err = self.num_top5_mis / self.num_samples\n            avg_loss = self.loss_total / self.num_samples\n            stats[\"top1_err\"] = top1_err\n            stats[\"top5_err\"] = top5_err\n            stats[\"loss\"] = avg_loss\n        if self.multi_loss is not None:\n            avg_loss_list = self.multi_loss.get_global_avg()\n            for idx, loss in enumerate(avg_loss_list):\n                stats[\"loss_\" + str(idx)] = loss\n        logging.log_json_stats(stats, self.output_dir)\n\n\nclass ValMeter:\n    \"\"\"\n    Measures validation stats.\n    \"\"\"\n\n    def __init__(self, max_iter, cfg):\n        \"\"\"\n        Args:\n            max_iter (int): the max number of iteration of the current epoch.\n            cfg (CfgNode): configs.\n        \"\"\"\n        self._cfg = cfg\n        self.max_iter = max_iter\n        self.iter_timer = Timer()\n        self.data_timer = Timer()\n        self.net_timer = Timer()\n        # Current minibatch errors (smoothed over a window).\n        self.mb_top1_err = ScalarMeter(cfg.LOG_PERIOD)\n        self.mb_top5_err = ScalarMeter(cfg.LOG_PERIOD)\n        # Min errors (over the full val set).\n        self.min_top1_err = 100.0\n        self.min_top5_err = 100.0\n        # Number of misclassified examples.\n        self.num_top1_mis = 0\n        self.num_top5_mis = 0\n        self.num_samples = 0\n        self.all_preds = []\n        self.all_labels = []\n        self.output_dir = cfg.OUTPUT_DIR\n\n    def reset(self):\n        \"\"\"\n        
Reset the Meter.\n        \"\"\"\n        self.iter_timer.reset()\n        self.data_timer.reset()\n        self.net_timer.reset()\n        self.mb_top1_err.reset()\n        self.mb_top5_err.reset()\n        self.num_top1_mis = 0\n        self.num_top5_mis = 0\n        self.num_samples = 0\n        self.all_preds = []\n        self.all_labels = []\n\n    def iter_tic(self):\n        \"\"\"\n        Start to record time.\n        \"\"\"\n        self.iter_timer.reset()\n        self.data_timer.reset()\n\n    def iter_toc(self):\n        \"\"\"\n        Stop to record time.\n        \"\"\"\n        self.iter_timer.pause()\n        self.net_timer.pause()\n\n    def data_toc(self):\n        self.data_timer.pause()\n        self.net_timer.reset()\n\n    def update_stats(self, top1_err, top5_err, mb_size):\n        \"\"\"\n        Update the current stats.\n        Args:\n            top1_err (float): top1 error rate.\n            top5_err (float): top5 error rate.\n            mb_size (int): mini batch size.\n        \"\"\"\n        self.mb_top1_err.add_value(top1_err)\n        self.mb_top5_err.add_value(top5_err)\n        self.num_top1_mis += top1_err * mb_size\n        self.num_top5_mis += top5_err * mb_size\n        self.num_samples += mb_size\n\n    def update_predictions(self, preds, labels):\n        \"\"\"\n        Update predictions and labels.\n        Args:\n            preds (tensor): model output predictions.\n            labels (tensor): labels.\n        \"\"\"\n        # TODO: merge update_prediction with update_stats.\n        self.all_preds.append(preds)\n        self.all_labels.append(labels)\n\n    def log_iter_stats(self, cur_epoch, cur_iter):\n        \"\"\"\n        log the stats of the current iteration.\n        Args:\n            cur_epoch (int): the number of current epoch.\n            cur_iter (int): the number of current iteration.\n        \"\"\"\n        if (cur_iter + 1) % self._cfg.LOG_PERIOD != 0:\n            return\n        eta_sec = 
self.iter_timer.seconds() * (self.max_iter - cur_iter - 1)\n        eta = str(datetime.timedelta(seconds=int(eta_sec)))\n        stats = {\n            \"_type\": \"val_iter{}\".format(\"_ssl\" if self._cfg.TASK == \"ssl\" else \"\"),\n            \"epoch\": \"{}/{}\".format(cur_epoch + 1, self._cfg.SOLVER.MAX_EPOCH),\n            \"iter\": \"{}/{}\".format(cur_iter + 1, self.max_iter),\n            \"time_diff\": self.iter_timer.seconds(),\n            \"eta\": eta,\n            \"gpu_mem\": \"{:.2f}G\".format(misc.gpu_mem_usage()),\n        }\n        if not self._cfg.DATA.MULTI_LABEL:\n            stats[\"top1_err\"] = self.mb_top1_err.get_win_median()\n            stats[\"top5_err\"] = self.mb_top5_err.get_win_median()\n        logging.log_json_stats(stats)\n\n    def log_epoch_stats(self, cur_epoch):\n        \"\"\"\n        Log the stats of the current epoch.\n        Args:\n            cur_epoch (int): the number of current epoch.\n        \"\"\"\n        stats = {\n            \"_type\": \"val_epoch{}\".format(\"_ssl\" if self._cfg.TASK == \"ssl\" else \"\"),\n            \"epoch\": \"{}/{}\".format(cur_epoch + 1, self._cfg.SOLVER.MAX_EPOCH),\n            \"time_diff\": self.iter_timer.seconds(),\n            \"gpu_mem\": \"{:.2f}G\".format(misc.gpu_mem_usage()),\n            \"RAM\": \"{:.2f}/{:.2f}G\".format(*misc.cpu_mem_usage()),\n        }\n        if self._cfg.DATA.MULTI_LABEL:\n            stats[\"map\"] = get_map(\n                torch.cat(self.all_preds).cpu().numpy(),\n                torch.cat(self.all_labels).cpu().numpy(),\n            )\n        else:\n            top1_err = self.num_top1_mis / self.num_samples\n            top5_err = self.num_top5_mis / self.num_samples\n            self.min_top1_err = min(self.min_top1_err, top1_err)\n            self.min_top5_err = min(self.min_top5_err, top5_err)\n\n            stats[\"top1_err\"] = top1_err\n            stats[\"top5_err\"] = top5_err\n            stats[\"min_top1_err\"] = 
self.min_top1_err\n            stats[\"min_top5_err\"] = self.min_top5_err\n\n        logging.log_json_stats(stats, self.output_dir)\n\n\ndef get_map(preds, labels):\n    \"\"\"\n    Compute mAP for multi-label case.\n    Args:\n        preds (numpy tensor): num_examples x num_classes.\n        labels (numpy tensor): num_examples x num_classes.\n    Returns:\n        mean_ap (float): final mAP score.\n    \"\"\"\n\n    logger.info(\"Getting mAP for {} examples\".format(preds.shape[0]))\n\n    # Drop classes that have no positive label in any example.\n    preds = preds[:, ~(np.all(labels == 0, axis=0))]\n    labels = labels[:, ~(np.all(labels == 0, axis=0))]\n    aps = [0]\n    try:\n        aps = average_precision_score(labels, preds, average=None)\n    except ValueError:\n        print(\n            \"Average precision requires a sufficient number of samples \"\n            \"in a batch which are missing in this sample.\"\n        )\n\n    mean_ap = np.mean(aps)\n    return mean_ap\n\n\nclass EpochTimer:\n    \"\"\"\n    A timer which computes the epoch time.\n    \"\"\"\n\n    def __init__(self) -> None:\n        self.timer = Timer()\n        self.timer.reset()\n        self.epoch_times = []\n\n    def reset(self) -> None:\n        \"\"\"\n        Reset the epoch timer.\n        \"\"\"\n        self.timer.reset()\n        self.epoch_times = []\n\n    def epoch_tic(self):\n        \"\"\"\n        Start to record time.\n        \"\"\"\n        self.timer.reset()\n\n    def epoch_toc(self):\n        \"\"\"\n        Stop to record time.\n        \"\"\"\n        self.timer.pause()\n        self.epoch_times.append(self.timer.seconds())\n\n    def last_epoch_time(self):\n        \"\"\"\n        Get the time for the last epoch.\n        \"\"\"\n        assert len(self.epoch_times) > 0, \"No epoch time has been recorded!\"\n\n        return self.epoch_times[-1]\n\n    def avg_epoch_time(self):\n        \"\"\"\n        Calculate the average epoch time among the recorded epochs.\n        \"\"\"\n        assert len(self.epoch_times) > 0, 
\"No epoch time has been recorded!\"\n\n        return np.mean(self.epoch_times)\n\n    def median_epoch_time(self):\n        \"\"\"\n        Calculate the median epoch time among the recorded epochs.\n        \"\"\"\n        assert len(self.epoch_times) > 0, \"No epoch time has been recorded!\"\n\n        return np.median(self.epoch_times)\n"
  },
  {
    "path": "slowfast/utils/metrics.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Functions for computing metrics.\"\"\"\n\nimport torch\n\n\ndef topks_correct(preds, labels, ks):\n    \"\"\"\n    Given the predictions, labels, and a list of top-k values, compute the\n    number of correct predictions for each top-k value.\n\n    Args:\n        preds (array): array of predictions. Dimension is batchsize\n            N x ClassNum.\n        labels (array): array of labels. Dimension is batchsize N.\n        ks (list): list of top-k values. For example, ks = [1, 5] correspods\n            to top-1 and top-5.\n\n    Returns:\n        topks_correct (list): list of numbers, where the `i`-th entry\n            corresponds to the number of top-`ks[i]` correct predictions.\n    \"\"\"\n    assert preds.size(0) == labels.size(0), (\n        \"Batch dim of predictions and labels must match\"\n    )\n    # Find the top max_k predictions for each sample\n    _top_max_k_vals, top_max_k_inds = torch.topk(\n        preds, max(ks), dim=1, largest=True, sorted=True\n    )\n    # (batch_size, max_k) -> (max_k, batch_size).\n    top_max_k_inds = top_max_k_inds.t()\n    # (batch_size, ) -> (max_k, batch_size).\n    rep_max_k_labels = labels.view(1, -1).expand_as(top_max_k_inds)\n    # (i, j) = 1 if top i-th prediction for the j-th sample is correct.\n    top_max_k_correct = top_max_k_inds.eq(rep_max_k_labels)\n    # Compute the number of topk correct predictions for each k.\n    topks_correct = [top_max_k_correct[:k, :].float().sum() for k in ks]\n    return topks_correct\n\n\ndef topk_errors(preds, labels, ks):\n    \"\"\"\n    Computes the top-k error for each k.\n    Args:\n        preds (array): array of predictions. Dimension is N.\n        labels (array): array of labels. 
Dimension is N.\n        ks (list): list of ks to calculate the top errors.\n    Returns:\n        list: top-k error (in percent) for each k in `ks`.\n    \"\"\"\n    num_topks_correct = topks_correct(preds, labels, ks)\n    return [(1.0 - x / preds.size(0)) * 100.0 for x in num_topks_correct]\n\n\ndef topk_accuracies(preds, labels, ks):\n    \"\"\"\n    Computes the top-k accuracy for each k.\n    Args:\n        preds (array): array of predictions. Dimension is N x ClassNum.\n        labels (array): array of labels. Dimension is N.\n        ks (list): list of ks to calculate the top accuracies.\n    Returns:\n        list: top-k accuracy (in percent) for each k in `ks`.\n    \"\"\"\n    num_topks_correct = topks_correct(preds, labels, ks)\n    return [(x / preds.size(0)) * 100.0 for x in num_topks_correct]\n"
  },
  {
    "path": "slowfast/utils/misc.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport json\nimport math\nimport os\nfrom datetime import datetime\n\nimport numpy as np\nimport psutil\nimport slowfast.utils.logging as logging\nimport slowfast.utils.multiprocessing as mpu\nimport torch\nimport torchvision.io as io\nfrom fvcore.nn.activation_count import activation_count\nfrom fvcore.nn.flop_count import flop_count\nfrom matplotlib import pyplot as plt\nfrom slowfast.datasets.utils import pack_pathway_output\nfrom slowfast.models.batchnorm_helper import SubBatchNorm3d\nfrom slowfast.utils.env import pathmgr\nfrom torch import nn\nfrom torchvision.utils import make_grid\n\nlogger = logging.get_logger(__name__)\n\n\ndef check_nan_losses(loss):\n    \"\"\"\n    Determine whether the loss is NaN (not a number).\n    Args:\n        loss (loss): loss to check whether is NaN.\n    \"\"\"\n    if math.isnan(loss):\n        raise RuntimeError(\"ERROR: Got NaN losses {}\".format(datetime.now()))\n\n\ndef params_count(model, ignore_bn=False):\n    \"\"\"\n    Compute the number of parameters.\n    Args:\n        model (model): model to count the number of parameters.\n    \"\"\"\n    if not ignore_bn:\n        return np.sum([p.numel() for p in model.parameters()]).item()\n    else:\n        count = 0\n        for m in model.modules():\n            if not isinstance(m, nn.BatchNorm3d):\n                for p in m.parameters(recurse=False):\n                    count += p.numel()\n    return count\n\n\ndef gpu_mem_usage():\n    \"\"\"\n    Compute the GPU memory usage for the current device (GB).\n    \"\"\"\n    if torch.cuda.is_available():\n        mem_usage_bytes = torch.cuda.max_memory_allocated()\n    else:\n        mem_usage_bytes = 0\n    return mem_usage_bytes / 1024**3\n\n\ndef cpu_mem_usage():\n    \"\"\"\n    Compute the system memory (RAM) usage for the current device (GB).\n    Returns:\n        usage (float): used memory (GB).\n   
     total (float): total memory (GB).\n    \"\"\"\n    vram = psutil.virtual_memory()\n    usage = (vram.total - vram.available) / 1024**3\n    total = vram.total / 1024**3\n\n    return usage, total\n\n\ndef _get_model_analysis_input(cfg, use_train_input):\n    \"\"\"\n    Return a dummy input for model analysis with batch size 1. The input is\n        used for analyzing the model (counting flops and activations etc.).\n    Args:\n        cfg (CfgNode): configs. Details can be found in\n            slowfast/config/defaults.py\n        use_train_input (bool): if True, return the input for training. Otherwise,\n            return the input for testing.\n\n    Returns:\n        inputs: the input for model analysis.\n    \"\"\"\n    rgb_dimension = 3\n    if use_train_input:\n        if \"imagenet\" in cfg.TRAIN.DATASET:\n            input_tensors = torch.rand(\n                rgb_dimension,\n                cfg.DATA.TRAIN_CROP_SIZE,\n                cfg.DATA.TRAIN_CROP_SIZE,\n            )\n        else:\n            input_tensors = torch.rand(\n                rgb_dimension,\n                cfg.DATA.NUM_FRAMES,\n                cfg.DATA.TRAIN_CROP_SIZE,\n                cfg.DATA.TRAIN_CROP_SIZE,\n            )\n    else:\n        if \"imagenet\" in cfg.TEST.DATASET:\n            input_tensors = torch.rand(\n                rgb_dimension,\n                cfg.DATA.TEST_CROP_SIZE,\n                cfg.DATA.TEST_CROP_SIZE,\n            )\n        else:\n            input_tensors = torch.rand(\n                rgb_dimension,\n                cfg.DATA.NUM_FRAMES,\n                cfg.DATA.TEST_CROP_SIZE,\n                cfg.DATA.TEST_CROP_SIZE,\n            )\n    model_inputs = pack_pathway_output(cfg, input_tensors)\n    for i in range(len(model_inputs)):\n        model_inputs[i] = model_inputs[i].unsqueeze(0)\n        if cfg.NUM_GPUS:\n            model_inputs[i] = model_inputs[i].cuda(non_blocking=True)\n\n    # If detection is enabled, count flops for one 
proposal.\n    if cfg.DETECTION.ENABLE:\n        bbox = torch.tensor([[0, 0, 1.0, 0, 1.0]])\n        if cfg.NUM_GPUS:\n            bbox = bbox.cuda()\n        inputs = (model_inputs, bbox)\n    else:\n        inputs = (model_inputs,)\n    return inputs\n\n\ndef get_model_stats(model, cfg, mode, use_train_input):\n    \"\"\"\n    Compute statistics for the current model given the config.\n    Args:\n        model (model): model to perform analysis.\n        cfg (CfgNode): configs. Details can be found in\n            slowfast/config/defaults.py\n        mode (str): Options include `flop` or `activation`. Compute either flop\n            (gflops) or activation count (mega).\n        use_train_input (bool): if True, compute statistics for training. Otherwise,\n            compute statistics for testing.\n\n    Returns:\n        float: the total number of count of the given model.\n    \"\"\"\n    assert mode in [\n        \"flop\",\n        \"activation\",\n    ], \"'{}' not supported for model analysis\".format(mode)\n    if mode == \"flop\":\n        model_stats_fun = flop_count\n    elif mode == \"activation\":\n        model_stats_fun = activation_count\n\n    # Set model to evaluation mode for analysis.\n    # Evaluation mode can avoid getting stuck with sync batchnorm.\n    model_mode = model.training\n    model.eval()\n    inputs = _get_model_analysis_input(cfg, use_train_input)\n    count_dict, *_ = model_stats_fun(model, inputs)\n    count = sum(count_dict.values())\n    model.train(model_mode)\n    return count\n\n\ndef log_model_info(model, cfg, use_train_input=True):\n    \"\"\"\n    Log info, includes number of parameters, gpu usage, gflops and activation count.\n        The model info is computed when the model is in validation mode.\n    Args:\n        model (model): model to log the info.\n        cfg (CfgNode): configs. Details can be found in\n            slowfast/config/defaults.py\n        use_train_input (bool): if True, log info for training. 
Otherwise,\n            log info for testing.\n    \"\"\"\n    logger.info(\"Model:\\n{}\".format(model))\n    params = params_count(model)\n    logger.info(\"Params: {:,}\".format(params))\n    # gpu_mem_usage() reports gigabytes, so label the value accordingly.\n    logger.info(\"Mem: {:,} GB\".format(gpu_mem_usage()))\n    flops = get_model_stats(model, cfg, \"flop\", use_train_input)\n    logger.info(\"Flops: {:,} G\".format(flops))\n    logger.info(\n        \"Activations: {:,} M\".format(\n            get_model_stats(model, cfg, \"activation\", use_train_input)\n        )\n    )\n    logger.info(\"nvidia-smi\")\n    os.system(\"nvidia-smi\")\n    return flops, params\n\n\ndef is_eval_epoch(cfg, cur_epoch, multigrid_schedule):\n    \"\"\"\n    Determine if the model should be evaluated at the current epoch.\n    Args:\n        cfg (CfgNode): configs. Details can be found in\n            slowfast/config/defaults.py\n        cur_epoch (int): current epoch.\n        multigrid_schedule (List): schedule for multigrid training.\n    Returns:\n        bool: True if the model should be evaluated at `cur_epoch`.\n    \"\"\"\n    if cur_epoch + 1 == cfg.SOLVER.MAX_EPOCH:\n        return True\n    if multigrid_schedule is not None:\n        prev_epoch = 0\n        for s in multigrid_schedule:\n            if cur_epoch < s[-1]:\n                period = max((s[-1] - prev_epoch) // cfg.MULTIGRID.EVAL_FREQ + 1, 1)\n                return (s[-1] - 1 - cur_epoch) % period == 0\n            prev_epoch = s[-1]\n\n    return (cur_epoch + 1) % cfg.TRAIN.EVAL_PERIOD == 0\n\n\ndef plot_input(tensor, bboxes=(), texts=(), path=\"./tmp_vis.png\"):\n    \"\"\"\n    Plot the input tensor with the optional bounding box and save it to disk.\n    Args:\n        tensor (tensor): a tensor with shape of `NxCxHxW`.\n        bboxes (tuple): bounding boxes for each image, each box in the\n            format of [x1, y1, x2, y2].\n        texts (tuple): a tuple of string to plot.\n        path (str): path to the image to save to.\n    \"\"\"\n    tensor = tensor.float()\n    tensor = tensor - tensor.min()\n    tensor = tensor / tensor.max()\n    f, ax = plt.subplots(nrows=1, 
ncols=tensor.shape[0], figsize=(50, 20))\n    for i in range(tensor.shape[0]):\n        ax[i].axis(\"off\")\n        ax[i].imshow(tensor[i].permute(1, 2, 0))\n        # ax[1][0].axis('off')\n        if bboxes is not None and len(bboxes) > i:\n            for box in bboxes[i]:\n                x1, y1, x2, y2 = box\n                ax[i].vlines(x1, y1, y2, colors=\"g\", linestyles=\"solid\")\n                ax[i].vlines(x2, y1, y2, colors=\"g\", linestyles=\"solid\")\n                ax[i].hlines(y1, x1, x2, colors=\"g\", linestyles=\"solid\")\n                ax[i].hlines(y2, x1, x2, colors=\"g\", linestyles=\"solid\")\n\n        if texts is not None and len(texts) > i:\n            ax[i].text(0, 0, texts[i])\n    f.savefig(path)\n\n\ndef plot_input_normed(\n    tensor,\n    bboxes=(),\n    texts=(),\n    path=\"./tmp_vis.png\",\n    folder_path=\"\",\n    make_grids=False,\n    output_video=False,\n):\n    \"\"\"\n    Plot the input tensor with the optional bounding box and save it to disk.\n    Args:\n        tensor (tensor): a tensor with shape of `NxCxHxW`.\n        bboxes (tuple): bounding boxes with format of [[x, y, h, w]].\n        texts (tuple): a tuple of string to plot.\n        path (str): path to the image to save to.\n    \"\"\"\n    tensor = tensor.float()\n    try:\n        os.mkdir(folder_path)\n    except Exception as e:\n        pass\n    tensor = convert_normalized_images(tensor)\n    if output_video:\n        # assert make_grids, \"video needs to have make_grids on\"\n        assert tensor.ndim == 5\n        sz = tensor.shape\n\n        if make_grids:\n            vid = tensor.reshape([sz[0], sz[1] * sz[2], sz[3], sz[4]])\n            vid = make_grid(vid, padding=8, pad_value=1.0, nrow=sz[0])\n            vid = vid.reshape([sz[1], sz[2], vid.shape[1], vid.shape[2]])\n        else:\n            vid = tensor.reshape([sz[0] * sz[1], sz[2], sz[3], sz[4]])\n\n        vid = vid.permute([0, 2, 3, 1])\n        vid *= 255.0\n        vid = 
vid.to(torch.uint8)\n        fps = 30.0 * vid.shape[0] / 64.0\n        io.video.write_video(path, vid, fps, video_codec=\"libx264\")\n    elif make_grids:\n        if tensor.ndim > 4 and tensor.shape[0] == 1:\n            tensor = tensor.squeeze()\n            nrow = 1\n        elif tensor.ndim == 5:\n            nrow = tensor.shape[1]\n            tensor = tensor.reshape(\n                shape=(-1, tensor.shape[2], tensor.shape[3], tensor.shape[4])\n            )\n        vis2 = (\n            make_grid(tensor, padding=8, pad_value=1.0, nrow=nrow)\n            .permute(1, 2, 0)\n            .cpu()\n            .numpy()\n        )\n        plt.imsave(fname=path, arr=vis2, format=\"png\")\n    else:\n        f, ax = plt.subplots(\n            nrows=tensor.shape[0],\n            ncols=tensor.shape[1],\n            figsize=(10 * tensor.shape[1], 10 * tensor.shape[0]),\n        )\n\n        if tensor.shape[0] == 1:\n            for i in range(tensor.shape[1]):\n                ax[i].axis(\"off\")\n                ax[i].imshow(tensor[0][i].permute(1, 2, 0))\n                # ax[1][0].axis('off')\n                if bboxes is not None and len(bboxes) > i:\n                    for box in bboxes[i]:\n                        x1, y1, x2, y2 = box\n                        ax[i].vlines(x1, y1, y2, colors=\"g\", linestyles=\"solid\")\n                        ax[i].vlines(x2, y1, y2, colors=\"g\", linestyles=\"solid\")\n                        ax[i].hlines(y1, x1, x2, colors=\"g\", linestyles=\"solid\")\n                        ax[i].hlines(y2, x1, x2, colors=\"g\", linestyles=\"solid\")\n\n            if texts is not None and len(texts) > i:\n                ax[i].text(0, 0, texts[i])\n        else:\n            for i in range(tensor.shape[0]):\n                for j in range(tensor.shape[1]):\n                    ax[i][j].axis(\"off\")\n                    ax[i][j].imshow(tensor[i][j].permute(1, 2, 0))\n                    # ax[1][0].axis('off')\n                    if 
bboxes is not None and len(bboxes) > i:\n                        for box in bboxes[i]:\n                            x1, y1, x2, y2 = box\n                            # `ax` is a 2D grid here, so index the subplot by (i, j).\n                            ax[i][j].vlines(x1, y1, y2, colors=\"g\", linestyles=\"solid\")\n                            ax[i][j].vlines(x2, y1, y2, colors=\"g\", linestyles=\"solid\")\n                            ax[i][j].hlines(y1, x1, x2, colors=\"g\", linestyles=\"solid\")\n                            ax[i][j].hlines(y2, x1, x2, colors=\"g\", linestyles=\"solid\")\n\n                    if texts is not None and len(texts) > i:\n                        ax[i][j].text(0, 0, texts[i])\n        print(f\"{path}\")\n        f.tight_layout(pad=0.0)\n        with pathmgr.open(path, \"wb\") as h:\n            f.savefig(h)\n\n\ndef convert_normalized_images(tensor):\n    \"\"\"\n    Undo the input normalization (scale by 0.225, shift by 0.45) and clamp\n    the result to [0, 1].\n    Args:\n        tensor (tensor): normalized image tensor.\n    \"\"\"\n    tensor = tensor * 0.225\n    tensor = tensor + 0.45\n\n    tensor = tensor.clamp(min=0.0, max=1.0)\n\n    return tensor\n\n\ndef frozen_bn_stats(model):\n    \"\"\"\n    Set all the bn layers to eval mode.\n    Args:\n        model (model): model to set bn layers to eval mode.\n    \"\"\"\n    for m in model.modules():\n        if isinstance(m, nn.BatchNorm3d):\n            m.eval()\n\n\ndef aggregate_sub_bn_stats(module):\n    \"\"\"\n    Recursively find all SubBN modules and aggregate sub-BN stats.\n    Args:\n        module (nn.Module)\n    Returns:\n        count (int): number of SubBN modules found.\n    \"\"\"\n    count = 0\n    for child in module.children():\n        if isinstance(child, SubBatchNorm3d):\n            child.aggregate_stats()\n            count += 1\n        else:\n            count += aggregate_sub_bn_stats(child)\n    return count\n\n\ndef launch_job(cfg, init_method, func, daemon=False):\n    \"\"\"\n    Run 'func' on one or more GPUs, specified in cfg.\n    Args:\n        cfg (CfgNode): configs. 
Details can be found in\n            slowfast/config/defaults.py\n        init_method (str): initialization method to launch the job with multiple\n            devices.\n        func (function): job to run on GPU(s)\n        daemon (bool): The spawned processes’ daemon flag. If set to True,\n            daemonic processes will be created\n    \"\"\"\n    if cfg.NUM_GPUS > 1:\n        torch.multiprocessing.spawn(\n            mpu.run,\n            nprocs=cfg.NUM_GPUS,\n            args=(\n                cfg.NUM_GPUS,\n                func,\n                init_method,\n                cfg.SHARD_ID,\n                cfg.NUM_SHARDS,\n                cfg.DIST_BACKEND,\n                cfg,\n            ),\n            daemon=daemon,\n        )\n    else:\n        func(cfg=cfg)\n\n\ndef get_class_names(path, parent_path=None, subset_path=None):\n    \"\"\"\n    Read json file with entries {classname: index} and return\n    an array of class names in order.\n    If parent_path is provided, load and map all children to their ids.\n    Args:\n        path (str): path to class ids json file.\n            File must be in the format {\"class1\": id1, \"class2\": id2, ...}\n        parent_path (Optional[str]): path to parent-child json file.\n            File must be in the format {\"parent1\": [\"child1\", \"child2\", ...], ...}\n        subset_path (Optional[str]): path to text file containing a subset\n            of class names, separated by newline characters.\n    Returns:\n        class_names (list of strs): list of class names.\n        class_parents (dict): a dictionary where key is the name of the parent class\n            and value is a list of ids of the children classes.\n        subset_ids (list of ints): list of ids of the classes provided in the\n            subset file.\n    \"\"\"\n    try:\n        with pathmgr.open(path, \"r\") as f:\n            class2idx = json.load(f)\n    except Exception as err:\n        print(\"Fail to load file from {} with error 
{}\".format(path, err))\n        return\n\n    max_key = max(class2idx.values())\n    class_names = [None] * (max_key + 1)\n\n    for k, i in class2idx.items():\n        class_names[i] = k\n\n    class_parent = None\n    if parent_path is not None and parent_path != \"\":\n        try:\n            with pathmgr.open(parent_path, \"r\") as f:\n                d_parent = json.load(f)\n        except EnvironmentError as err:\n            print(\"Fail to load file from {} with error {}\".format(parent_path, err))\n            return\n        class_parent = {}\n        for parent, children in d_parent.items():\n            indices = [class2idx[c] for c in children if class2idx.get(c) is not None]\n            class_parent[parent] = indices\n\n    subset_ids = None\n    if subset_path is not None and subset_path != \"\":\n        try:\n            with pathmgr.open(subset_path, \"r\") as f:\n                subset = f.read().split(\"\\n\")\n                subset_ids = [\n                    class2idx[name]\n                    for name in subset\n                    if class2idx.get(name) is not None\n                ]\n        except EnvironmentError as err:\n            print(\"Fail to load file from {} with error {}\".format(subset_path, err))\n            return\n\n    return class_names, class_parent, subset_ids\n"
  },
  {
    "path": "slowfast/utils/multigrid.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Helper functions for multigrid training.\"\"\"\n\nimport numpy as np\nimport slowfast.utils.logging as logging\n\nlogger = logging.get_logger(__name__)\n\n\nclass MultigridSchedule:\n    \"\"\"\n    This class defines multigrid training schedule and update cfg accordingly.\n    \"\"\"\n\n    def init_multigrid(self, cfg):\n        \"\"\"\n        Update cfg based on multigrid settings.\n        Args:\n            cfg (configs): configs that contains training and multigrid specific\n                hyperparameters. Details can be seen in\n                slowfast/config/defaults.py.\n        Returns:\n            cfg (configs): the updated cfg.\n        \"\"\"\n        self.schedule = None\n        # We may modify cfg.TRAIN.BATCH_SIZE, cfg.DATA.NUM_FRAMES, and\n        # cfg.DATA.TRAIN_CROP_SIZE during training, so we store their original\n        # value in cfg and use them as global variables.\n        cfg.MULTIGRID.DEFAULT_B = cfg.TRAIN.BATCH_SIZE\n        cfg.MULTIGRID.DEFAULT_T = cfg.DATA.NUM_FRAMES\n        cfg.MULTIGRID.DEFAULT_S = cfg.DATA.TRAIN_CROP_SIZE\n\n        if cfg.MULTIGRID.LONG_CYCLE:\n            self.schedule = self.get_long_cycle_schedule(cfg)\n            cfg.SOLVER.STEPS = [0] + [s[-1] for s in self.schedule]\n            # Fine-tuning phase.\n            cfg.SOLVER.STEPS[-1] = (cfg.SOLVER.STEPS[-2] + cfg.SOLVER.STEPS[-1]) // 2\n            cfg.SOLVER.LRS = [cfg.SOLVER.GAMMA ** s[0] * s[1][0] for s in self.schedule]\n            # Fine-tuning phase.\n            cfg.SOLVER.LRS = cfg.SOLVER.LRS[:-1] + [\n                cfg.SOLVER.LRS[-2],\n                cfg.SOLVER.LRS[-1],\n            ]\n\n            cfg.SOLVER.MAX_EPOCH = self.schedule[-1][-1]\n\n        elif cfg.MULTIGRID.SHORT_CYCLE:\n            cfg.SOLVER.STEPS = [\n                int(s * cfg.MULTIGRID.EPOCH_FACTOR) for s in cfg.SOLVER.STEPS\n            ]\n        
    cfg.SOLVER.MAX_EPOCH = int(\n                cfg.SOLVER.MAX_EPOCH * cfg.MULTIGRID.EPOCH_FACTOR\n            )\n        return cfg\n\n    def update_long_cycle(self, cfg, cur_epoch):\n        \"\"\"\n        Before every epoch, check if long cycle shape should change. If it\n            should, update cfg accordingly.\n        Args:\n            cfg (configs): configs that contains training and multigrid specific\n                hyperparameters. Details can be seen in\n                slowfast/config/defaults.py.\n            cur_epoch (int): current epoch index.\n        Returns:\n            cfg (configs): the updated cfg.\n            changed (bool): do we change long cycle shape at this epoch?\n        \"\"\"\n        base_b, base_t, base_s = get_current_long_cycle_shape(self.schedule, cur_epoch)\n        if base_s != cfg.DATA.TRAIN_CROP_SIZE or base_t != cfg.DATA.NUM_FRAMES:\n            cfg.DATA.NUM_FRAMES = base_t\n            cfg.DATA.TRAIN_CROP_SIZE = base_s\n            cfg.TRAIN.BATCH_SIZE = base_b * cfg.MULTIGRID.DEFAULT_B\n\n            bs_factor = (\n                float(cfg.TRAIN.BATCH_SIZE / cfg.NUM_GPUS) / cfg.MULTIGRID.BN_BASE_SIZE\n            )\n\n            if bs_factor < 1:\n                cfg.BN.NORM_TYPE = \"sync_batchnorm\"\n                cfg.BN.NUM_SYNC_DEVICES = int(1.0 / bs_factor)\n            elif bs_factor > 1:\n                cfg.BN.NORM_TYPE = \"sub_batchnorm\"\n                cfg.BN.NUM_SPLITS = int(bs_factor)\n            else:\n                cfg.BN.NORM_TYPE = \"batchnorm\"\n\n            cfg.MULTIGRID.LONG_CYCLE_SAMPLING_RATE = cfg.DATA.SAMPLING_RATE * (\n                cfg.MULTIGRID.DEFAULT_T // cfg.DATA.NUM_FRAMES\n            )\n            logger.info(\"Long cycle updates:\")\n            logger.info(\"\\tBN.NORM_TYPE: {}\".format(cfg.BN.NORM_TYPE))\n            if cfg.BN.NORM_TYPE == \"sync_batchnorm\":\n                logger.info(\"\\tBN.NUM_SYNC_DEVICES: {}\".format(cfg.BN.NUM_SYNC_DEVICES))\n            
elif cfg.BN.NORM_TYPE == \"sub_batchnorm\":\n                logger.info(\"\\tBN.NUM_SPLITS: {}\".format(cfg.BN.NUM_SPLITS))\n            logger.info(\"\\tTRAIN.BATCH_SIZE: {}\".format(cfg.TRAIN.BATCH_SIZE))\n            logger.info(\n                \"\\tDATA.NUM_FRAMES x LONG_CYCLE_SAMPLING_RATE: {}x{}\".format(\n                    cfg.DATA.NUM_FRAMES, cfg.MULTIGRID.LONG_CYCLE_SAMPLING_RATE\n                )\n            )\n            logger.info(\"\\tDATA.TRAIN_CROP_SIZE: {}\".format(cfg.DATA.TRAIN_CROP_SIZE))\n            return cfg, True\n        else:\n            return cfg, False\n\n    def get_long_cycle_schedule(self, cfg):\n        \"\"\"\n        Based on multigrid hyperparameters, define the schedule of a long cycle.\n        Args:\n            cfg (configs): configs that contains training and multigrid specific\n                hyperparameters. Details can be seen in\n                slowfast/config/defaults.py.\n        Returns:\n            schedule (list): Specifies a list long cycle base shapes and their\n                corresponding training epochs.\n        \"\"\"\n\n        steps = cfg.SOLVER.STEPS\n\n        default_size = float(cfg.DATA.NUM_FRAMES * cfg.DATA.TRAIN_CROP_SIZE**2)\n        default_iters = steps[-1]\n\n        # Get shapes and average batch size for each long cycle shape.\n        avg_bs = []\n        all_shapes = []\n        for t_factor, s_factor in cfg.MULTIGRID.LONG_CYCLE_FACTORS:\n            base_t = int(round(cfg.DATA.NUM_FRAMES * t_factor))\n            base_s = int(round(cfg.DATA.TRAIN_CROP_SIZE * s_factor))\n            if cfg.MULTIGRID.SHORT_CYCLE:\n                shapes = [\n                    [\n                        base_t,\n                        cfg.MULTIGRID.DEFAULT_S * cfg.MULTIGRID.SHORT_CYCLE_FACTORS[0],\n                    ],\n                    [\n                        base_t,\n                        cfg.MULTIGRID.DEFAULT_S * cfg.MULTIGRID.SHORT_CYCLE_FACTORS[1],\n                    ],\n       
             [base_t, base_s],\n                ]\n            else:\n                shapes = [[base_t, base_s]]\n\n            # (T, S) -> (B, T, S)\n            shapes = [\n                [int(round(default_size / (s[0] * s[1] * s[1]))), s[0], s[1]]\n                for s in shapes\n            ]\n            avg_bs.append(np.mean([s[0] for s in shapes]))\n            all_shapes.append(shapes)\n\n        # Get schedule regardless of cfg.MULTIGRID.EPOCH_FACTOR.\n        total_iters = 0\n        schedule = []\n        for step_index in range(len(steps) - 1):\n            step_epochs = steps[step_index + 1] - steps[step_index]\n\n            for long_cycle_index, shapes in enumerate(all_shapes):\n                cur_epochs = step_epochs * avg_bs[long_cycle_index] / sum(avg_bs)\n\n                cur_iters = cur_epochs / avg_bs[long_cycle_index]\n                total_iters += cur_iters\n                schedule.append((step_index, shapes[-1], cur_epochs))\n\n        iter_saving = default_iters / total_iters\n\n        final_step_epochs = cfg.SOLVER.MAX_EPOCH - steps[-1]\n\n        # We define the fine-tuning phase to have the same amount of iteration\n        # saving as the rest of the training.\n        ft_epochs = final_step_epochs / iter_saving * avg_bs[-1]\n\n        # Use the last (base) shape; without the short cycle there is only one\n        # shape per long cycle, so index 2 would be out of range.\n        schedule.append((step_index + 1, all_shapes[-1][-1], ft_epochs))\n\n        # Obtain final schedule given desired cfg.MULTIGRID.EPOCH_FACTOR.\n        x = (\n            cfg.SOLVER.MAX_EPOCH\n            * cfg.MULTIGRID.EPOCH_FACTOR\n            / sum(s[-1] for s in schedule)\n        )\n\n        final_schedule = []\n        total_epochs = 0\n        for s in schedule:\n            epochs = s[2] * x\n            total_epochs += epochs\n            final_schedule.append((s[0], s[1], int(round(total_epochs))))\n        print_schedule(final_schedule)\n        return final_schedule\n\n\ndef print_schedule(schedule):\n    \"\"\"\n    Log schedule.\n    \"\"\"\n    logger.info(\"Long cycle 
index\\tBase shape\\tEpochs\")\n    for s in schedule:\n        logger.info(\"{}\\t{}\\t{}\".format(s[0], s[1], s[2]))\n\n\ndef get_current_long_cycle_shape(schedule, epoch):\n    \"\"\"\n    Given a schedule and epoch index, return the long cycle base shape.\n    Args:\n        schedule (configs): configs that contains training and multigrid specific\n            hyperparameters. Details can be seen in\n            slowfast/config/defaults.py.\n        cur_epoch (int): current epoch index.\n    Returns:\n        shapes (list): A list describing the base shape in a long cycle:\n            [batch size relative to default,\n            number of frames, spatial dimension].\n    \"\"\"\n    for s in schedule:\n        if epoch < s[-1]:\n            return s[1]\n    return schedule[-1][1]\n"
  },
  {
    "path": "slowfast/utils/multiprocessing.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Multiprocessing helpers.\"\"\"\n\nimport torch\n\n\ndef run(\n    local_rank,\n    num_proc,\n    func,\n    init_method,\n    shard_id,\n    num_shards,\n    backend,\n    cfg,\n    output_queue=None,\n):\n    \"\"\"\n    Runs a function from a child process.\n    Args:\n        local_rank (int): rank of the current process on the current machine.\n        num_proc (int): number of processes per machine.\n        func (function): function to execute on each process.\n        init_method (string): method to initialize the distributed training.\n            TCP initialization: requires a network address reachable from all\n            processes followed by the port.\n            Shared file-system initialization: makes use of a file system that\n            is shared and visible from all machines. The URL should start with\n            file:// and contain a path to a non-existent file on a shared file\n            system.\n        shard_id (int): the rank of the current machine.\n        num_shards (int): number of overall machines for the distributed\n            training job.\n        backend (string): three distributed backends ('nccl', 'gloo', 'mpi') are\n            supported, each with different capabilities. Details can be found\n            here:\n            https://pytorch.org/docs/stable/distributed.html\n        cfg (CfgNode): configs. 
Details can be found in\n            slowfast/config/defaults.py\n        output_queue (queue): can optionally be used to return values from the\n            master process.\n    \"\"\"\n    # Initialize the process group.\n    world_size = num_proc * num_shards\n    rank = shard_id * num_proc + local_rank\n\n    try:\n        torch.distributed.init_process_group(\n            backend=backend,\n            init_method=init_method,\n            world_size=world_size,\n            rank=rank,\n        )\n    except Exception as e:\n        raise e\n\n    torch.cuda.set_device(local_rank)\n    ret = func(cfg)\n    if output_queue is not None and local_rank == 0:\n        output_queue.put(ret)\n"
  },
  {
    "path": "slowfast/utils/parser.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Argument parser functions.\"\"\"\n\nimport argparse\nimport sys\n\nimport slowfast.utils.checkpoint as cu\nfrom slowfast.config.defaults import get_cfg\n\n\ndef parse_args():\n    \"\"\"\n    Parse the following arguments for a default parser for PySlowFast users.\n    Args:\n        shard_id (int): shard id for the current machine. Ranges from 0 to\n            num_shards - 1. If a single machine is used, set shard id to 0.\n        num_shards (int): number of shards used by the job.\n        init_method (str): initialization method to launch the job with multiple\n            devices. Options include TCP or shared file-system\n            initialization. Details can be found in\n            https://pytorch.org/docs/stable/distributed.html#tcp-initialization\n        cfg (list of str): path(s) to the config file(s).\n        opts (argument): additional options from the command line; they\n            overwrite the config loaded from file.\n    \"\"\"\n    parser = argparse.ArgumentParser(\n        description=\"Provide SlowFast video training and testing pipeline.\"\n    )\n    parser.add_argument(\n        \"--shard_id\",\n        help=\"The shard id of the current node, from 0 to num_shards - 1\",\n        default=0,\n        type=int,\n    )\n    parser.add_argument(\n        \"--num_shards\",\n        help=\"Number of shards used by the job\",\n        default=1,\n        type=int,\n    )\n    parser.add_argument(\n        \"--init_method\",\n        help=\"Initialization method: TCP or shared file-system\",\n        default=\"tcp://localhost:9999\",\n        type=str,\n    )\n    parser.add_argument(\n        \"--cfg\",\n        dest=\"cfg_files\",\n        help=\"Path to the config files\",\n        default=[\"configs/Kinetics/SLOWFAST_4x16_R50.yaml\"],\n        nargs=\"+\",\n    )\n    parser.add_argument(\n        
\"--opts\",\n        help=\"See slowfast/config/defaults.py for all options\",\n        default=None,\n        nargs=argparse.REMAINDER,\n    )\n    if len(sys.argv) == 1:\n        parser.print_help()\n    return parser.parse_args()\n\n\ndef load_config(args, path_to_config=None):\n    \"\"\"\n    Given the arguments, load and initialize the configs.\n    Args:\n        args (argument): arguments including `shard_id`, `num_shards`,\n            `init_method`, `cfg_files`, and `opts`.\n        path_to_config (Optional[str]): path to a YAML config file to merge\n            into the default config.\n    Returns:\n        cfg (CfgNode): the initialized configs.\n    \"\"\"\n    # Setup cfg.\n    cfg = get_cfg()\n    # Load config from the file at path_to_config.\n    if path_to_config is not None:\n        cfg.merge_from_file(path_to_config)\n    # Override the config with options from the command line.\n    if args.opts is not None:\n        cfg.merge_from_list(args.opts)\n\n    # Inherit parameters from args.\n    if hasattr(args, \"num_shards\") and hasattr(args, \"shard_id\"):\n        cfg.NUM_SHARDS = args.num_shards\n        cfg.SHARD_ID = args.shard_id\n    if hasattr(args, \"rng_seed\"):\n        cfg.RNG_SEED = args.rng_seed\n    if hasattr(args, \"output_dir\"):\n        cfg.OUTPUT_DIR = args.output_dir\n\n    # Create the checkpoint dir.\n    cu.make_checkpoint_dir(cfg.OUTPUT_DIR)\n    return cfg\n"
  },
  {
    "path": "slowfast/utils/weight_init_helper.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Utility function for weight initialization\"\"\"\n\nimport torch.nn as nn\nfrom fvcore.nn.weight_init import c2_msra_fill, c2_xavier_fill\n\n\ndef init_weights(\n    model, fc_init_std=0.01, zero_init_final_bn=True, zero_init_final_conv=False\n):\n    \"\"\"\n    Performs ResNet style weight initialization.\n    Args:\n        fc_init_std (float): the expected standard deviation for fc layer.\n        zero_init_final_bn (bool): if True, zero initialize the final bn for\n            every bottleneck.\n    \"\"\"\n    for m in model.modules():\n        if isinstance(m, nn.Conv3d):\n            # Note that there is no bias due to BN\n            if hasattr(m, \"final_conv\") and zero_init_final_conv:\n                m.weight.data.zero_()\n            else:\n                \"\"\"\n                Follow the initialization method proposed in:\n                {He, Kaiming, et al.\n                \"Delving deep into rectifiers: Surpassing human-level\n                performance on imagenet classification.\"\n                arXiv preprint arXiv:1502.01852 (2015)}\n                \"\"\"\n                c2_msra_fill(m)\n\n        elif isinstance(m, (nn.BatchNorm3d, nn.BatchNorm2d, nn.BatchNorm1d)):\n            if (\n                hasattr(m, \"transform_final_bn\")\n                and m.transform_final_bn\n                and zero_init_final_bn\n            ):\n                batchnorm_weight = 0.0\n            else:\n                batchnorm_weight = 1.0\n            if m.weight is not None:\n                m.weight.data.fill_(batchnorm_weight)\n            if m.bias is not None:\n                m.bias.data.zero_()\n        if isinstance(m, nn.Linear):\n            if hasattr(m, \"xavier_init\") and m.xavier_init:\n                c2_xavier_fill(m)\n            else:\n                m.weight.data.normal_(mean=0.0, std=fc_init_std)\n        
    if m.bias is not None:\n                m.bias.data.zero_()\n"
  },
  {
    "path": "slowfast/visualization/__init__.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n"
  },
  {
    "path": "slowfast/visualization/async_predictor.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport atexit\nimport queue\n\nimport numpy as np\nimport slowfast.utils.logging as logging\nimport torch\nimport torch.multiprocessing as mp\nfrom slowfast.datasets import cv2_transform\nfrom slowfast.visualization.predictor import Predictor\n\nlogger = logging.get_logger(__name__)\n\n\nclass AsycnActionPredictor:\n    class _Predictor(mp.Process):\n        def __init__(self, cfg, task_queue, result_queue, gpu_id=None):\n            \"\"\"\n            Predict Worker for Detectron2.\n            Args:\n                cfg (CfgNode): configs. Details can be found in\n                    slowfast/config/defaults.py\n                task_queue (mp.Queue): a shared queue for incoming task.\n                result_queue (mp.Queue): a shared queue for predicted results.\n                gpu_id (int): index of the GPU device for the current child process.\n            \"\"\"\n            super().__init__()\n            self.cfg = cfg\n            self.task_queue = task_queue\n            self.result_queue = result_queue\n            self.gpu_id = gpu_id\n\n            self.device = (\n                torch.device(\"cuda:{}\".format(self.gpu_id))\n                if self.cfg.NUM_GPUS\n                else \"cpu\"\n            )\n\n        def run(self):\n            \"\"\"\n            Run prediction asynchronously.\n            \"\"\"\n            # Build the video model and print model statistics.\n            model = Predictor(self.cfg, gpu_id=self.gpu_id)\n            while True:\n                task = self.task_queue.get()\n                if isinstance(task, _StopToken):\n                    break\n                task = model(task)\n                self.result_queue.put(task)\n\n    def __init__(self, cfg, result_queue=None):\n        num_workers = cfg.NUM_GPUS\n\n        self.task_queue = mp.Queue()\n        self.result_queue = mp.Queue() if 
result_queue is None else result_queue\n\n        self.get_idx = -1\n        self.put_idx = -1\n        self.procs = []\n        cfg = cfg.clone()\n        cfg.defrost()\n        cfg.NUM_GPUS = 1\n        for gpu_id in range(num_workers):\n            self.procs.append(\n                AsycnActionPredictor._Predictor(\n                    cfg, self.task_queue, self.result_queue, gpu_id\n                )\n            )\n\n        self.result_data = {}\n        for p in self.procs:\n            p.start()\n        atexit.register(self.shutdown)\n\n    def put(self, task):\n        \"\"\"\n        Add the new task to task queue.\n        Args:\n            task (TaskInfo object): task object that contain\n                the necessary information for action prediction. (e.g. frames)\n        \"\"\"\n        self.put_idx += 1\n        self.task_queue.put(task)\n\n    def get(self):\n        \"\"\"\n        Return a task object in the correct order based on task id if\n        result(s) is available. 
Otherwise, raise queue.Empty exception.\n        \"\"\"\n        if self.result_data.get(self.get_idx + 1) is not None:\n            self.get_idx += 1\n            res = self.result_data[self.get_idx]\n            del self.result_data[self.get_idx]\n            return res\n        while True:\n            res = self.result_queue.get(block=False)\n            idx = res.id\n            if idx == self.get_idx + 1:\n                self.get_idx += 1\n                return res\n            self.result_data[idx] = res\n\n    def __call__(self, task):\n        self.put(task)\n        return self.get()\n\n    def shutdown(self):\n        for _ in self.procs:\n            self.task_queue.put(_StopToken())\n\n    @property\n    def result_available(self):\n        \"\"\"\n        How many results are ready to be returned.\n        \"\"\"\n        return self.result_queue.qsize() + len(self.result_data)\n\n    @property\n    def default_buffer_size(self):\n        return len(self.procs) * 5\n\n\nclass AsyncVis:\n    class _VisWorker(mp.Process):\n        def __init__(self, video_vis, task_queue, result_queue):\n            \"\"\"\n            Visualization Worker for AsyncVis.\n            Args:\n                video_vis (VideoVisualizer object): object with tools for visualization.\n                task_queue (mp.Queue): a shared queue for incoming task for visualization.\n                result_queue (mp.Queue): a shared queue for visualized results.\n            \"\"\"\n            self.video_vis = video_vis\n            self.task_queue = task_queue\n            self.result_queue = result_queue\n            super().__init__()\n\n        def run(self):\n            \"\"\"\n            Run visualization asynchronously.\n            \"\"\"\n            while True:\n                task = self.task_queue.get()\n                if isinstance(task, _StopToken):\n                    break\n\n                frames = draw_predictions(task, self.video_vis)\n                
task.frames = np.array(frames)\n                self.result_queue.put(task)\n\n    def __init__(self, video_vis, n_workers=None):\n        \"\"\"\n        Args:\n            video_vis (VideoVisualizer object): object with tools for visualization.\n            n_workers (Optional[int]): number of CPUs for running video visualizer.\n                If not given, use all CPUs.\n        \"\"\"\n\n        num_workers = mp.cpu_count() if n_workers is None else n_workers\n\n        self.task_queue = mp.Queue()\n        self.result_queue = mp.Queue()\n        self.get_indices_ls = []\n        self.procs = []\n        self.result_data = {}\n        self.put_id = -1\n        for _ in range(max(num_workers, 1)):\n            self.procs.append(\n                AsyncVis._VisWorker(video_vis, self.task_queue, self.result_queue)\n            )\n\n        for p in self.procs:\n            p.start()\n\n        atexit.register(self.shutdown)\n\n    def put(self, task):\n        \"\"\"\n        Add the new task to task queue.\n        Args:\n            task (TaskInfo object): task object that contains\n                the necessary information for visualization (e.g. frames, boxes, predictions).\n        \"\"\"\n        self.put_id += 1\n        self.task_queue.put(task)\n\n    def get(self):\n        \"\"\"\n        Return visualized frames/clips in the correct order based on task id if\n        result(s) is available. 
Otherwise, raise queue.Empty exception.\n        \"\"\"\n        get_idx = self.get_indices_ls[0]\n        if self.result_data.get(get_idx) is not None:\n            res = self.result_data[get_idx]\n            del self.result_data[get_idx]\n            del self.get_indices_ls[0]\n            return res\n\n        while True:\n            res = self.result_queue.get(block=False)\n            idx = res.id\n            if idx == get_idx:\n                del self.get_indices_ls[0]\n                return res\n            self.result_data[idx] = res\n\n    def __call__(self, task):\n        \"\"\"\n        Add the task to the task queue, then return the next available\n        result (see `get`).\n        \"\"\"\n        self.put(task)\n        return self.get()\n\n    def shutdown(self):\n        for _ in self.procs:\n            self.task_queue.put(_StopToken())\n\n    @property\n    def result_available(self):\n        \"\"\"\n        How many results are ready to be returned.\n        \"\"\"\n        return self.result_queue.qsize() + len(self.result_data)\n\n    @property\n    def default_buffer_size(self):\n        return len(self.procs) * 5\n\n\nclass _StopToken:\n    pass\n\n\nclass AsyncDemo:\n    \"\"\"\n    Asynchronous Action Prediction and Visualization pipeline with AsyncVis.\n    \"\"\"\n\n    def __init__(self, cfg, async_vis):\n        \"\"\"\n        Args:\n            cfg (CfgNode): configs. Details can be found in\n                slowfast/config/defaults.py\n            async_vis (AsyncVis object): asynchronous visualizer.\n        \"\"\"\n        self.model = AsycnActionPredictor(cfg=cfg, result_queue=async_vis.task_queue)\n        self.async_vis = async_vis\n\n    def put(self, task):\n        \"\"\"\n        Put task into task queue for prediction and visualization.\n        Args:\n            task (TaskInfo object): task object that contains\n                the necessary information for action prediction. (e.g. 
frames)\n        \"\"\"\n        self.async_vis.get_indices_ls.append(task.id)\n        self.model.put(task)\n\n    def get(self):\n        \"\"\"\n        Get the visualized clips if any.\n        \"\"\"\n        try:\n            task = self.async_vis.get()\n        except (queue.Empty, IndexError):\n            raise IndexError(\"Results are not available yet.\")\n\n        return task\n\n\ndef draw_predictions(task, video_vis):\n    \"\"\"\n    Draw prediction for the given task.\n    Args:\n        task (TaskInfo object): task object that contain\n            the necessary information for visualization. (e.g. frames, preds)\n            All attributes must lie on CPU devices.\n        video_vis (VideoVisualizer object): the video visualizer object.\n    \"\"\"\n    boxes = task.bboxes\n    frames = task.frames\n    preds = task.action_preds\n    if boxes is not None:\n        img_width = task.img_width\n        img_height = task.img_height\n        if boxes.device != torch.device(\"cpu\"):\n            boxes = boxes.cpu()\n        boxes = cv2_transform.revert_scaled_boxes(\n            task.crop_size, boxes, img_height, img_width\n        )\n\n    keyframe_idx = len(frames) // 2 - task.num_buffer_frames\n    draw_range = [\n        keyframe_idx - task.clip_vis_size,\n        keyframe_idx + task.clip_vis_size,\n    ]\n    buffer = frames[: task.num_buffer_frames]\n    frames = frames[task.num_buffer_frames :]\n    if boxes is not None:\n        if len(boxes) != 0:\n            frames = video_vis.draw_clip_range(\n                frames,\n                preds,\n                boxes,\n                keyframe_idx=keyframe_idx,\n                draw_range=draw_range,\n            )\n    else:\n        frames = video_vis.draw_clip_range(\n            frames, preds, keyframe_idx=keyframe_idx, draw_range=draw_range\n        )\n    del task\n\n    return buffer + frames\n"
  },
  {
    "path": "slowfast/visualization/ava_demo_precomputed_boxes.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport os\n\nimport cv2\nimport numpy as np\nimport slowfast.utils.checkpoint as cu\nimport slowfast.utils.logging as logging\nimport torch\nimport tqdm\nfrom slowfast.datasets.ava_helper import parse_bboxes_file\nfrom slowfast.datasets.cv2_transform import scale, scale_boxes\nfrom slowfast.datasets.utils import get_sequence\nfrom slowfast.models import build_model\nfrom slowfast.utils import misc\nfrom slowfast.utils.env import pathmgr\nfrom slowfast.visualization.utils import process_cv2_inputs\nfrom slowfast.visualization.video_visualizer import VideoVisualizer\n\nlogger = logging.get_logger(__name__)\n\n\nclass AVAVisualizerWithPrecomputedBox:\n    \"\"\"\n    Visualize action predictions for videos or folder of images with precomputed\n    and ground-truth boxes in AVA format.\n    \"\"\"\n\n    def __init__(self, cfg):\n        \"\"\"\n        Args:\n            cfg (CfgNode): configs. 
Details can be found in\n                slowfast/config/defaults.py\n        \"\"\"\n        self.source = pathmgr.get_local_path(path=cfg.DEMO.INPUT_VIDEO)\n        self.fps = None\n        if pathmgr.isdir(self.source):\n            self.fps = cfg.DEMO.FPS\n            self.video_name = self.source.split(\"/\")[-1]\n            self.source = os.path.join(\n                self.source, \"{}_%06d.jpg\".format(self.video_name)\n            )\n        else:\n            self.video_name = self.source.split(\"/\")[-1]\n            self.video_name = self.video_name.split(\".\")[0]\n\n        self.cfg = cfg\n        self.cap = cv2.VideoCapture(self.source)\n        if self.fps is None:\n            self.fps = self.cap.get(cv2.CAP_PROP_FPS)\n\n        self.total_frames = int(self.cap.get(cv2.CAP_PROP_FRAME_COUNT))\n\n        self.display_width = int(self.cap.get(cv2.CAP_PROP_FRAME_WIDTH))\n        self.display_height = int(self.cap.get(cv2.CAP_PROP_FRAME_HEIGHT))\n\n        if not self.cap.isOpened():\n            raise IOError(\"Video {} cannot be opened\".format(self.source))\n\n        self.output_file = None\n\n        if cfg.DEMO.OUTPUT_FILE != \"\":\n            self.output_file = self.get_output_file(cfg.DEMO.OUTPUT_FILE)\n\n        self.pred_boxes, self.gt_boxes = load_boxes_labels(\n            cfg,\n            self.video_name,\n            self.fps,\n            self.display_width,\n            self.display_height,\n        )\n\n        self.seq_length = cfg.DATA.NUM_FRAMES * cfg.DATA.SAMPLING_RATE\n        self.no_frames_repeat = cfg.DEMO.SLOWMO\n\n    def get_output_file(self, path):\n        \"\"\"\n        Return a video writer object.\n        Args:\n            path (str): path to the output video file.\n        \"\"\"\n        return cv2.VideoWriter(\n            filename=path,\n            fourcc=cv2.VideoWriter_fourcc(*\"mp4v\"),\n            fps=float(30),\n            frameSize=(self.display_width, self.display_height),\n            isColor=True,\n  
      )\n\n    def get_input_clip(self, keyframe_idx):\n        \"\"\"\n        Get input clip from the video/folder of images for a given\n        keyframe index.\n        Args:\n            keyframe_idx (int): index of the current keyframe.\n        Returns:\n            clip (list of tensors): formatted input clip(s) corresponding to\n                the current keyframe.\n        \"\"\"\n        seq = get_sequence(\n            keyframe_idx,\n            self.seq_length // 2,\n            self.cfg.DATA.SAMPLING_RATE,\n            self.total_frames,\n        )\n        clip = []\n        for frame_idx in seq:\n            self.cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)\n            was_read, frame = self.cap.read()\n            if was_read:\n                frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)\n                frame = scale(self.cfg.DATA.TEST_CROP_SIZE, frame)\n                clip.append(frame)\n            else:\n                logger.error(\"Unable to read frame. Duplicating previous frame.\")\n                clip.append(clip[-1])\n\n        clip = process_cv2_inputs(clip, self.cfg)\n        return clip\n\n    def get_predictions(self):\n        \"\"\"\n        Predict and append prediction results to each box in each keyframe in\n        `self.pred_boxes` dictionary.\n        \"\"\"\n        # Set random seed from configs.\n        np.random.seed(self.cfg.RNG_SEED)\n        torch.manual_seed(self.cfg.RNG_SEED)\n\n        # Setup logging format.\n        logging.setup_logging(self.cfg.OUTPUT_DIR)\n\n        # Print config.\n        logger.info(\"Run demo with config:\")\n        logger.info(self.cfg)\n        assert self.cfg.NUM_GPUS <= 1, \"Cannot run demo visualization on multiple GPUs.\"\n\n        # Build the video model and print model statistics.\n        model = build_model(self.cfg)\n        model.eval()\n        logger.info(\"Start loading model info\")\n        misc.log_model_info(model, self.cfg, use_train_input=False)\n        
logger.info(\"Start loading model weights\")\n        cu.load_test_checkpoint(self.cfg, model)\n        logger.info(\"Finish loading model weights\")\n        logger.info(\"Start making predictions for precomputed boxes.\")\n        for keyframe_idx, boxes_and_labels in tqdm.tqdm(self.pred_boxes.items()):\n            inputs = self.get_input_clip(keyframe_idx)\n            boxes = boxes_and_labels[0]\n            boxes = torch.from_numpy(np.array(boxes)).float()\n\n            box_transformed = scale_boxes(\n                self.cfg.DATA.TEST_CROP_SIZE,\n                boxes,\n                self.display_height,\n                self.display_width,\n            )\n\n            # Pad frame index for each box.\n            box_inputs = torch.cat(\n                [\n                    torch.full((box_transformed.shape[0], 1), float(0)),\n                    box_transformed,\n                ],\n                axis=1,\n            )\n            if self.cfg.NUM_GPUS:\n                # Transfer the data to the current GPU device.\n                if isinstance(inputs, (list,)):\n                    for i in range(len(inputs)):\n                        inputs[i] = inputs[i].cuda(non_blocking=True)\n                else:\n                    inputs = inputs.cuda(non_blocking=True)\n\n                box_inputs = box_inputs.cuda()\n\n            preds = model(inputs, box_inputs)\n\n            preds = preds.detach()\n\n            if self.cfg.NUM_GPUS:\n                preds = preds.cpu()\n\n            boxes_and_labels[1] = preds\n\n    def draw_video(self):\n        \"\"\"\n        Draw predicted and ground-truth (if provided) results on the video/folder of images.\n        Write the visualized result to a video output file.\n        \"\"\"\n        all_boxes = merge_pred_gt_boxes(self.pred_boxes, self.gt_boxes)\n        common_classes = (\n            self.cfg.DEMO.COMMON_CLASS_NAMES\n            if len(self.cfg.DEMO.LABEL_FILE_PATH) != 0\n            else None\n 
       )\n        video_vis = VideoVisualizer(\n            num_classes=self.cfg.MODEL.NUM_CLASSES,\n            class_names_path=self.cfg.DEMO.LABEL_FILE_PATH,\n            top_k=self.cfg.TENSORBOARD.MODEL_VIS.TOPK_PREDS,\n            thres=self.cfg.DEMO.COMMON_CLASS_THRES,\n            lower_thres=self.cfg.DEMO.UNCOMMON_CLASS_THRES,\n            common_class_names=common_classes,\n            colormap=self.cfg.TENSORBOARD.MODEL_VIS.COLORMAP,\n            mode=self.cfg.DEMO.VIS_MODE,\n        )\n\n        all_keys = sorted(all_boxes.keys())\n        # Draw around the keyframe for 2/10 of the sequence length.\n        # This is chosen using heuristics.\n        draw_range = [\n            self.seq_length // 2 - self.seq_length // 10,\n            self.seq_length // 2 + self.seq_length // 10,\n        ]\n        draw_range_repeat = [\n            draw_range[0],\n            (draw_range[1] - draw_range[0]) * self.no_frames_repeat + draw_range[0],\n        ]\n        prev_buffer = []\n        prev_end_idx = 0\n\n        logger.info(\"Start Visualization...\")\n        for keyframe_idx in tqdm.tqdm(all_keys):\n            pred_gt_boxes = all_boxes[keyframe_idx]\n            # Find the starting index of the clip. 
If start_idx exceeds the beginning\n            # of the video, we only choose valid frame from index 0.\n            start_idx = max(0, keyframe_idx - self.seq_length // 2)\n            # Number of frames from the start of the current clip and the\n            # end of the previous clip.\n            dist = start_idx - prev_end_idx\n            # If there are unwritten frames in between clips.\n            if dist >= 0:\n                # Get the frames in between previous clip and current clip.\n                frames = self._get_frame_range(prev_end_idx, dist)\n                # We keep a buffer of frames for overlapping visualization.\n                # Write these to the output file.\n                for frame in prev_buffer:\n                    frame = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)\n                    self.display(frame)\n                # Write them to output file without any visualization\n                # since they don't have any corresponding keyframes.\n                for frame in frames:\n                    self.display(frame)\n                prev_buffer = []\n                num_new_frames = self.seq_length\n\n            # If there are overlapping frames in between clips.\n            elif dist < 0:\n                # Flush all ready frames.\n                for frame in prev_buffer[:dist]:\n                    frame = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)\n                    self.display(frame)\n                prev_buffer = prev_buffer[dist:]\n                num_new_frames = self.seq_length + dist\n            # Obtain new frames for the current clip from the input video file.\n            new_frames = self._get_frame_range(\n                max(start_idx, prev_end_idx), num_new_frames\n            )\n            new_frames = [\n                cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) for frame in new_frames\n            ]\n            clip = prev_buffer + new_frames\n            # Calculate the end of this clip. 
This will be `prev_end_idx` for the\n            # next iteration.\n            prev_end_idx = max(start_idx, prev_end_idx) + len(new_frames)\n            # For each precomputed or gt boxes.\n            for i, boxes in enumerate(pred_gt_boxes):\n                if i == 0:\n                    repeat = self.no_frames_repeat\n                    current_draw_range = draw_range\n                else:\n                    repeat = 1\n                    current_draw_range = draw_range_repeat\n                # Make sure draw range does not fall out of end of clip.\n                current_draw_range[1] = min(current_draw_range[1], len(clip) - 1)\n                ground_truth = boxes[0]\n                bboxes = boxes[1]\n                label = boxes[2]\n                # Draw predictions.\n                clip = video_vis.draw_clip_range(\n                    clip,\n                    label,\n                    bboxes=torch.Tensor(bboxes),\n                    ground_truth=ground_truth,\n                    draw_range=current_draw_range,\n                    repeat_frame=repeat,\n                )\n            # Store the current clip as buffer.\n            prev_buffer = clip\n\n        # Write the remaining buffer to output file.\n        for frame in prev_buffer:\n            frame = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)\n            self.display(frame)\n        # If we still have some remaining frames in the input file,\n        # write those to the output file as well.\n        if prev_end_idx < self.total_frames:\n            dist = self.total_frames - prev_end_idx\n            remaining_clip = self._get_frame_range(prev_end_idx, dist)\n            for frame in remaining_clip:\n                self.display(frame)\n\n    def __call__(self):\n        self.get_predictions()\n        self.draw_video()\n\n    def display(self, frame):\n        \"\"\"\n        Either display a single frame (BGR image) to a window or write to\n        an output file if output path 
is provided.\n        \"\"\"\n        if self.output_file is None:\n            cv2.imshow(\"SlowFast\", frame)\n        else:\n            self.output_file.write(frame)\n\n    def _get_keyframe_clip(self, keyframe_idx):\n        \"\"\"\n        Return a clip corresponding to a keyframe index for visualization.\n        Args:\n            keyframe_idx (int): keyframe index.\n        \"\"\"\n        start_idx = max(0, keyframe_idx - self.seq_length // 2)\n\n        clip = self._get_frame_range(start_idx, self.seq_length)\n\n        return clip\n\n    def _get_frame_range(self, start_idx, num_frames):\n        \"\"\"\n        Return a clip of `num_frames` frames starting from `start_idx`. If not enough frames\n        from `start_idx`, return the remaining frames from `start_idx`.\n        Args:\n            start_idx (int): starting idx.\n            num_frames (int): number of frames in the returned clip.\n        \"\"\"\n        was_read = True\n        assert start_idx < self.total_frames, \"Start index out of range.\"\n\n        self.cap.set(cv2.CAP_PROP_POS_FRAMES, start_idx)\n        all_frames = []\n        for _ in range(num_frames):\n            was_read, frame = self.cap.read()\n            if was_read:\n                all_frames.append(frame)\n            else:\n                break\n\n        return all_frames\n\n\ndef merge_pred_gt_boxes(pred_dict, gt_dict=None):\n    \"\"\"\n    Merge data from precomputed and ground-truth boxes dictionaries.\n    Args:\n        pred_dict (dict): a dict which maps from `frame_idx` to a list of `boxes`\n            and `labels`. Each `box` is a list of 4 box coordinates. `labels[i]` is\n            a list of labels for `boxes[i]`.\n        gt_dict (Optional[dict]): a dict which maps from `frame_idx` to a list of `boxes`\n            and `labels`. Each `box` is a list of 4 box coordinates. `labels[i]` is\n            a list of labels for `boxes[i]`. 
Note that label is -1 for predicted boxes.\n    Returns:\n        merged_dict (dict): merged dictionary from `pred_dict` and `gt_dict` if given.\n            It is a dict which maps from `frame_idx` to a list of [`is_gt`, `boxes`, `labels`],\n            where `is_gt` is a boolean indicating whether the `boxes` and `labels` are ground-truth.\n    \"\"\"\n    merged_dict = {}\n    for key, item in pred_dict.items():\n        merged_dict[key] = [[False, item[0], item[1]]]\n\n    if gt_dict is not None:\n        for key, item in gt_dict.items():\n            if merged_dict.get(key) is None:\n                merged_dict[key] = [[True, item[0], item[1]]]\n            else:\n                merged_dict[key].append([True, item[0], item[1]])\n    return merged_dict\n\n\ndef load_boxes_labels(cfg, video_name, fps, img_width, img_height):\n    \"\"\"\n    Load boxes and labels from AVA bounding boxes CSV files.\n    Args:\n        cfg (CfgNode): config.\n        video_name (str): name of the given video.\n        fps (int or float): frames per second of the input video/images folder.\n        img_width (int): width of images in input video/images folder.\n        img_height (int): height of images in input video/images folder.\n    Returns:\n        preds_boxes (dict): a dict which maps from `frame_idx` to a list of `boxes`\n            and `labels`. Each `box` is a list of 4 box coordinates. `labels[i]` is\n            a list of labels for `boxes[i]`. 
Note that label is -1 for predicted boxes.\n        gt_boxes (dict): if `cfg.DEMO.GT_BOXES` is given, a dict with the same layout as\n            `preds_boxes` but for ground-truth boxes; otherwise None.\n    \"\"\"\n    starting_second = cfg.DEMO.STARTING_SECOND\n\n    def sec_to_frameidx(sec):\n        return (sec - starting_second) * fps\n\n    def process_bboxes_dict(dictionary):\n        \"\"\"\n        Replace all `keyframe_sec` in `dictionary` with `keyframe_idx` and\n        merge all [`box_coordinate`, `box_labels`] pairs into\n        [`all_boxes_coordinates`, `all_boxes_labels`] for each `keyframe_idx`.\n        Args:\n            dictionary (dict): a dictionary which maps `frame_sec` to a list of `box`.\n                Each `box` is a [`box_coord`, `box_labels`] where `box_coord` is the\n                coordinates of the box and `box_labels` are the corresponding\n                labels for the box.\n        Returns:\n            new_dict (dict): a dict which maps from `frame_idx` to a list of `boxes`\n                and `labels`. Each `box` in `boxes` is a list of 4 box coordinates. `labels[i]`\n                is a list of labels for `boxes[i]`. 
Note that label is -1 for predicted boxes.\n        \"\"\"\n        # Replace all keyframe_sec with keyframe_idx.\n        new_dict = {}\n        for keyframe_sec, boxes_and_labels in dictionary.items():\n            # Ignore keyframes with no boxes\n            if len(boxes_and_labels) == 0:\n                continue\n            keyframe_idx = sec_to_frameidx(keyframe_sec)\n            boxes, labels = list(zip(*boxes_and_labels))\n            # Shift labels from [1, n_classes] to [0, n_classes - 1].\n            labels = [[i - 1 for i in box_label] for box_label in labels]\n            boxes = np.array(boxes)\n            boxes[:, [0, 2]] *= img_width\n            boxes[:, [1, 3]] *= img_height\n            new_dict[keyframe_idx] = [boxes.tolist(), list(labels)]\n        return new_dict\n\n    preds_boxes_path = cfg.DEMO.PREDS_BOXES\n    gt_boxes_path = cfg.DEMO.GT_BOXES\n\n    preds_boxes, _, _ = parse_bboxes_file(\n        ann_filenames=[preds_boxes_path],\n        ann_is_gt_box=[False],\n        detect_thresh=cfg.AVA.DETECTION_SCORE_THRESH,\n        boxes_sample_rate=1,\n    )\n    preds_boxes = preds_boxes[video_name]\n    if gt_boxes_path == \"\":\n        gt_boxes = None\n    else:\n        gt_boxes, _, _ = parse_bboxes_file(\n            ann_filenames=[gt_boxes_path],\n            ann_is_gt_box=[True],\n            detect_thresh=cfg.AVA.DETECTION_SCORE_THRESH,\n            boxes_sample_rate=1,\n        )\n        gt_boxes = gt_boxes[video_name]\n\n    preds_boxes = process_bboxes_dict(preds_boxes)\n    if gt_boxes is not None:\n        gt_boxes = process_bboxes_dict(gt_boxes)\n\n    return preds_boxes, gt_boxes\n"
  },
  {
    "path": "slowfast/visualization/demo_loader.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport atexit\nimport copy\nimport queue\nimport threading\nimport time\n\nimport cv2\nimport slowfast.utils.logging as logging\nfrom slowfast.visualization.utils import TaskInfo\n\nlogger = logging.get_logger(__name__)\n\n\nclass VideoManager:\n    \"\"\"\n    VideoManager object for getting frames from video source for inference.\n    \"\"\"\n\n    def __init__(self, cfg):\n        \"\"\"\n        Args:\n            cfg (CfgNode): configs. Details can be found in\n            slowfast/config/defaults.py\n        \"\"\"\n        assert cfg.DEMO.WEBCAM > -1 or cfg.DEMO.INPUT_VIDEO != \"\", (\n            \"Must specify a data source as input.\"\n        )\n\n        self.source = cfg.DEMO.WEBCAM if cfg.DEMO.WEBCAM > -1 else cfg.DEMO.INPUT_VIDEO\n\n        self.display_width = cfg.DEMO.DISPLAY_WIDTH\n        self.display_height = cfg.DEMO.DISPLAY_HEIGHT\n\n        self.cap = cv2.VideoCapture(self.source)\n\n        if self.display_width > 0 and self.display_height > 0:\n            self.cap.set(cv2.CAP_PROP_FRAME_WIDTH, self.display_width)\n            self.cap.set(cv2.CAP_PROP_FRAME_HEIGHT, self.display_height)\n        else:\n            self.display_width = int(self.cap.get(cv2.CAP_PROP_FRAME_WIDTH))\n            self.display_height = int(self.cap.get(cv2.CAP_PROP_FRAME_HEIGHT))\n\n        if not self.cap.isOpened():\n            raise IOError(\"Video {} cannot be opened\".format(self.source))\n\n        self.output_file = None\n        if cfg.DEMO.OUTPUT_FPS == -1:\n            self.output_fps = self.cap.get(cv2.CAP_PROP_FPS)\n        else:\n            self.output_fps = cfg.DEMO.OUTPUT_FPS\n        if cfg.DEMO.OUTPUT_FILE != \"\":\n            self.output_file = self.get_output_file(\n                cfg.DEMO.OUTPUT_FILE, fps=self.output_fps\n            )\n        self.id = -1\n        self.buffer = []\n        self.buffer_size = 
cfg.DEMO.BUFFER_SIZE\n        self.seq_length = cfg.DATA.NUM_FRAMES * cfg.DATA.SAMPLING_RATE\n        self.test_crop_size = cfg.DATA.TEST_CROP_SIZE\n        self.clip_vis_size = cfg.DEMO.CLIP_VIS_SIZE\n\n    def __iter__(self):\n        return self\n\n    def __next__(self):\n        \"\"\"\n        Read and return the required number of frames for 1 clip.\n        Returns:\n            was_read (bool): False if not enough frames to return.\n            task (TaskInfo object): object contains metadata for the current clips.\n        \"\"\"\n        self.id += 1\n        task = TaskInfo()\n\n        task.img_height = self.display_height\n        task.img_width = self.display_width\n        task.crop_size = self.test_crop_size\n        task.clip_vis_size = self.clip_vis_size\n\n        frames = []\n        if len(self.buffer) != 0:\n            frames = self.buffer\n        was_read = True\n        while was_read and len(frames) < self.seq_length:\n            was_read, frame = self.cap.read()\n            frames.append(frame)\n        if was_read and self.buffer_size != 0:\n            self.buffer = frames[-self.buffer_size :]\n\n        task.add_frames(self.id, frames)\n        task.num_buffer_frames = 0 if self.id == 0 else self.buffer_size\n\n        return was_read, task\n\n    def get_output_file(self, path, fps=30):\n        \"\"\"\n        Return a video writer object.\n        Args:\n            path (str): path to the output video file.\n            fps (int or float): frames per second.\n        \"\"\"\n        return cv2.VideoWriter(\n            filename=path,\n            fourcc=cv2.VideoWriter_fourcc(*\"mp4v\"),\n            fps=float(fps),\n            frameSize=(self.display_width, self.display_height),\n            isColor=True,\n        )\n\n    def display(self, task):\n        \"\"\"\n        Either display a single frame (BGR image) to a window or write to\n        an output file if output path is provided.\n        Args:\n            task 
(TaskInfo object): task object that contains\n                the necessary information for prediction visualization (e.g. visualized frames).\n        \"\"\"\n        for frame in task.frames[task.num_buffer_frames :]:\n            if self.output_file is None:\n                cv2.imshow(\"SlowFast\", frame)\n                time.sleep(1 / self.output_fps)\n            else:\n                self.output_file.write(frame)\n\n    def clean(self):\n        \"\"\"\n        Clean up open video files and windows.\n        \"\"\"\n        self.cap.release()\n        if self.output_file is None:\n            cv2.destroyAllWindows()\n        else:\n            self.output_file.release()\n\n    def start(self):\n        return self\n\n    def join(self):\n        pass\n\n\nclass ThreadVideoManager:\n    \"\"\"\n    VideoManager object for getting frames from video source for inference\n    using multithreading to read and write frames.\n    \"\"\"\n\n    def __init__(self, cfg):\n        \"\"\"\n        Args:\n            cfg (CfgNode): configs. 
Details can be found in\n            slowfast/config/defaults.py\n        \"\"\"\n        assert cfg.DEMO.WEBCAM > -1 or cfg.DEMO.INPUT_VIDEO != \"\", (\n            \"Must specify a data source as input.\"\n        )\n\n        self.source = cfg.DEMO.WEBCAM if cfg.DEMO.WEBCAM > -1 else cfg.DEMO.INPUT_VIDEO\n\n        self.display_width = cfg.DEMO.DISPLAY_WIDTH\n        self.display_height = cfg.DEMO.DISPLAY_HEIGHT\n\n        self.cap = cv2.VideoCapture(self.source)\n\n        if self.display_width > 0 and self.display_height > 0:\n            self.cap.set(cv2.CAP_PROP_FRAME_WIDTH, self.display_width)\n            self.cap.set(cv2.CAP_PROP_FRAME_HEIGHT, self.display_height)\n        else:\n            self.display_width = int(self.cap.get(cv2.CAP_PROP_FRAME_WIDTH))\n            self.display_height = int(self.cap.get(cv2.CAP_PROP_FRAME_HEIGHT))\n\n        if not self.cap.isOpened():\n            raise IOError(\"Video {} cannot be opened\".format(self.source))\n\n        self.output_file = None\n\n        if cfg.DEMO.OUTPUT_FPS == -1:\n            self.output_fps = self.cap.get(cv2.CAP_PROP_FPS)\n        else:\n            self.output_fps = cfg.DEMO.OUTPUT_FPS\n        if cfg.DEMO.OUTPUT_FILE != \"\":\n            self.output_file = self.get_output_file(\n                cfg.DEMO.OUTPUT_FILE, fps=self.output_fps\n            )\n        self.num_skip = cfg.DEMO.NUM_CLIPS_SKIP + 1\n        self.get_id = -1\n        self.put_id = -1\n        self.buffer = []\n        self.buffer_size = cfg.DEMO.BUFFER_SIZE\n        self.seq_length = cfg.DATA.NUM_FRAMES * cfg.DATA.SAMPLING_RATE\n        self.test_crop_size = cfg.DATA.TEST_CROP_SIZE\n        self.clip_vis_size = cfg.DEMO.CLIP_VIS_SIZE\n\n        self.read_queue = queue.Queue()\n        self.write_queue = {}\n        self.not_end = True\n        self.write_lock = threading.Lock()\n        self.put_id_lock = threading.Lock()\n        self.input_lock = threading.Lock()\n        self.output_lock = threading.Lock()\n        
self.stopped = False\n        atexit.register(self.clean)\n\n    def get_output_file(self, path, fps=30):\n        \"\"\"\n        Return a video writer object.\n        Args:\n            path (str): path to the output video file.\n            fps (int or float): frames per second.\n        \"\"\"\n        return cv2.VideoWriter(\n            filename=path,\n            fourcc=cv2.VideoWriter_fourcc(*\"mp4v\"),\n            fps=float(fps),\n            frameSize=(self.display_width, self.display_height),\n            isColor=True,\n        )\n\n    def __iter__(self):\n        return self\n\n    def put_fn(self):\n        \"\"\"\n        Grabbing frames from VideoCapture.\n        \"\"\"\n        was_read = True\n        while was_read and not self.stopped:\n            task = TaskInfo()\n\n            task.img_height = self.display_height\n            task.img_width = self.display_width\n            task.crop_size = self.test_crop_size\n            task.clip_vis_size = self.clip_vis_size\n            frames = []\n            if len(self.buffer) != 0:\n                frames = self.buffer\n            self.input_lock.acquire()\n            while was_read and len(frames) < self.seq_length:\n                was_read, frame = self.cap.read()\n                if was_read:\n                    frames.append(frame)\n            self.input_lock.release()\n            if was_read:\n                self.buffer = frames[-self.buffer_size :]\n\n            task.add_frames(self.put_id + 1, frames)\n            task.num_buffer_frames = 0 if self.put_id == -1 else self.buffer_size\n            with self.put_id_lock:\n                self.put_id += 1\n                self.not_end = was_read\n            # If mode is to read the most recent clip or we reach task\n            # index that is not supposed to be skipped.\n            if self.num_skip == 0 or self.put_id % self.num_skip == 0:\n                self.read_queue.put((was_read, copy.deepcopy(task)))\n            else:\n   
             with self.write_lock:\n                    self.write_queue[task.id] = (was_read, copy.deepcopy(task))\n\n    def __next__(self):\n        # If there is nothing in the task queue.\n        if self.read_queue.qsize() == 0:\n            return self.not_end, None\n        else:\n            with self.put_id_lock:\n                put_id = self.put_id\n            was_read, task = None, None\n            # If mode is to predict most recent read clip.\n            if self.num_skip == 0:\n                # Write all previous clips to write queue.\n                with self.write_lock:\n                    while True:\n                        was_read, task = self.read_queue.get()\n                        if task.id == put_id:\n                            break\n                        self.write_queue[task.id] = (was_read, task)\n            else:\n                was_read, task = self.read_queue.get()\n            # If we reach the end of the video.\n            if not was_read:\n                # Put to write queue.\n                with self.write_lock:\n                    self.write_queue[put_id] = was_read, copy.deepcopy(task)\n                task = None\n            return was_read, task\n\n    def get_fn(self):\n        while not self.stopped:\n            with self.put_id_lock:\n                put_id = self.put_id\n                not_end = self.not_end\n\n            with self.write_lock:\n                # If video ended and we have display all frames.\n                if not not_end and self.get_id == put_id:\n                    break\n                # If the next frames are not available, wait.\n                if (\n                    len(self.write_queue) == 0\n                    or self.write_queue.get(self.get_id + 1) is None\n                ):\n                    time.sleep(0.02)\n                    continue\n                else:\n                    self.get_id += 1\n                    was_read, task = 
self.write_queue[self.get_id]\n                    del self.write_queue[self.get_id]\n\n            with self.output_lock:\n                for frame in task.frames[task.num_buffer_frames :]:\n                    if self.output_file is None:\n                        cv2.imshow(\"SlowFast\", frame)\n                        time.sleep(1 / self.output_fps)\n                    else:\n                        self.output_file.write(frame)\n\n    def display(self, task):\n        \"\"\"\n        Add the visualized task to the write queue for display or writing to the output file.\n        Args:\n            task (TaskInfo object): task object that contains\n                the necessary information for prediction visualization (e.g. visualized frames).\n        \"\"\"\n        with self.write_lock:\n            self.write_queue[task.id] = (True, task)\n\n    def start(self):\n        \"\"\"\n        Start threads to read and write frames.\n        \"\"\"\n        self.put_thread = threading.Thread(\n            target=self.put_fn, args=(), name=\"VidRead-Thread\", daemon=True\n        )\n        self.put_thread.start()\n        self.get_thread = threading.Thread(\n            target=self.get_fn, args=(), name=\"VidDisplay-Thread\", daemon=True\n        )\n        self.get_thread.start()\n\n        return self\n\n    def join(self):\n        self.get_thread.join()\n\n    def clean(self):\n        \"\"\"\n        Clean up open video files and windows.\n        \"\"\"\n        self.stopped = True\n        self.input_lock.acquire()\n        self.cap.release()\n        self.input_lock.release()\n        self.output_lock.acquire()\n        if self.output_file is None:\n            cv2.destroyAllWindows()\n        else:\n            self.output_file.release()\n        self.output_lock.release()\n"
  },
  {
    "path": "slowfast/visualization/gradcam_utils.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport matplotlib.pyplot as plt\nimport slowfast.datasets.utils as data_utils\nimport torch\nimport torch.nn.functional as F\nfrom slowfast.visualization.utils import get_layer\n\n\nclass GradCAM:\n    \"\"\"\n    GradCAM class helps create localization maps using the Grad-CAM method for input videos\n    and overlap the maps over the input videos as heatmaps.\n    https://arxiv.org/pdf/1610.02391.pdf\n    \"\"\"\n\n    def __init__(self, model, target_layers, data_mean, data_std, colormap=\"viridis\"):\n        \"\"\"\n        Args:\n            model (model): the model to be used.\n            target_layers (list of str(s)): name of convolutional layer to be used to get\n                gradients and feature maps from for creating localization maps.\n            data_mean (tensor or list): mean value to add to input videos.\n            data_std (tensor or list): std to multiply for input videos.\n            colormap (Optional[str]): matplotlib colormap used to create heatmap.\n                See https://matplotlib.org/3.3.0/tutorials/colors/colormaps.html\n        \"\"\"\n\n        self.model = model\n        # Run in eval mode.\n        self.model.eval()\n        self.target_layers = target_layers\n\n        self.gradients = {}\n        self.activations = {}\n        self.colormap = plt.get_cmap(colormap)\n        self.data_mean = data_mean\n        self.data_std = data_std\n        self._register_hooks()\n\n    def _register_single_hook(self, layer_name):\n        \"\"\"\n        Register forward and backward hook to a layer, given layer_name,\n        to obtain gradients and activations.\n        Args:\n            layer_name (str): name of the layer.\n        \"\"\"\n\n        def get_gradients(module, grad_input, grad_output):\n            self.gradients[layer_name] = grad_output[0].detach()\n\n        def get_activations(module, input, 
output):\n            self.activations[layer_name] = output.clone().detach()\n\n        target_layer = get_layer(self.model, layer_name=layer_name)\n        target_layer.register_forward_hook(get_activations)\n        target_layer.register_backward_hook(get_gradients)\n\n    def _register_hooks(self):\n        \"\"\"\n        Register hooks to layers in `self.target_layers`.\n        \"\"\"\n        for layer_name in self.target_layers:\n            self._register_single_hook(layer_name=layer_name)\n\n    def _calculate_localization_map(self, inputs, labels=None):\n        \"\"\"\n        Calculate localization map for all inputs with Grad-CAM.\n        Args:\n            inputs (list of tensor(s)): the input clips.\n            labels (Optional[tensor]): labels of the current input clips.\n        Returns:\n            localization_maps (list of ndarray(s)): the localization map for\n                each corresponding input.\n            preds (tensor): shape (n_instances, n_class). Model predictions for `inputs`.\n        \"\"\"\n        assert len(inputs) == len(self.target_layers), (\n            \"Must register the same number of target layers as the number of input pathways.\"\n        )\n        input_clone = [inp.clone() for inp in inputs]\n        preds = self.model(input_clone)\n\n        if labels is None:\n            score = torch.max(preds, dim=-1)[0]\n        else:\n            if labels.ndim == 1:\n                labels = labels.unsqueeze(-1)\n            score = torch.gather(preds, dim=1, index=labels)\n\n        self.model.zero_grad()\n        score = torch.sum(score)\n        score.backward()\n        localization_maps = []\n        for i, inp in enumerate(inputs):\n            _, _, T, H, W = inp.size()\n\n            gradients = self.gradients[self.target_layers[i]]\n            activations = self.activations[self.target_layers[i]]\n            B, C, Tg, _, _ = gradients.size()\n\n            weights = torch.mean(gradients.view(B, C, Tg, -1), 
dim=3)\n\n            weights = weights.view(B, C, Tg, 1, 1)\n            localization_map = torch.sum(weights * activations, dim=1, keepdim=True)\n            localization_map = F.relu(localization_map)\n            localization_map = F.interpolate(\n                localization_map,\n                size=(T, H, W),\n                mode=\"trilinear\",\n                align_corners=False,\n            )\n            localization_map_min, localization_map_max = (\n                torch.min(localization_map.view(B, -1), dim=-1, keepdim=True)[0],\n                torch.max(localization_map.view(B, -1), dim=-1, keepdim=True)[0],\n            )\n            localization_map_min = torch.reshape(\n                localization_map_min, shape=(B, 1, 1, 1, 1)\n            )\n            localization_map_max = torch.reshape(\n                localization_map_max, shape=(B, 1, 1, 1, 1)\n            )\n            # Normalize the localization map.\n            localization_map = (localization_map - localization_map_min) / (\n                localization_map_max - localization_map_min + 1e-6\n            )\n            localization_map = localization_map.data\n\n            localization_maps.append(localization_map)\n\n        return localization_maps, preds\n\n    def __call__(self, inputs, labels=None, alpha=0.5):\n        \"\"\"\n        Visualize the localization maps on their corresponding inputs as heatmap,\n        using Grad-CAM.\n        Args:\n            inputs (list of tensor(s)): the input clips.\n            labels (Optional[tensor]): labels of the current input clips.\n            alpha (float): transparency level of the heatmap, in the range [0, 1].\n        Returns:\n            result_ls (list of tensor(s)): the visualized inputs.\n            preds (tensor): shape (n_instances, n_class). 
Model predictions for `inputs`.\n        \"\"\"\n        result_ls = []\n        localization_maps, preds = self._calculate_localization_map(\n            inputs, labels=labels\n        )\n        for i, localization_map in enumerate(localization_maps):\n            # Convert (B, 1, T, H, W) to (B, T, H, W)\n            localization_map = localization_map.squeeze(dim=1)\n            if localization_map.device != torch.device(\"cpu\"):\n                localization_map = localization_map.cpu()\n            heatmap = self.colormap(localization_map)\n            heatmap = heatmap[:, :, :, :, :3]\n            # Permute input from (B, C, T, H, W) to (B, T, H, W, C)\n            curr_inp = inputs[i].permute(0, 2, 3, 4, 1)\n            if curr_inp.device != torch.device(\"cpu\"):\n                curr_inp = curr_inp.cpu()\n            curr_inp = data_utils.revert_tensor_normalize(\n                curr_inp, self.data_mean, self.data_std\n            )\n            heatmap = torch.from_numpy(heatmap)\n            curr_inp = alpha * heatmap + (1 - alpha) * curr_inp\n            # Permute inp to (B, T, C, H, W)\n            curr_inp = curr_inp.permute(0, 1, 4, 2, 3)\n            result_ls.append(curr_inp)\n\n        return result_ls, preds\n"
  },
  {
    "path": "slowfast/visualization/prediction_vis.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport numpy as np\nimport slowfast.datasets.utils as data_utils\nimport slowfast.utils.logging as logging\nimport slowfast.visualization.tensorboard_vis as tb\nimport torch\nfrom slowfast.utils.misc import get_class_names\nfrom slowfast.visualization.video_visualizer import VideoVisualizer\n\nlogger = logging.get_logger(__name__)\n\n\nclass WrongPredictionVis:\n    \"\"\"\n    WrongPredictionVis class for visualizing video inputs to Tensorboard\n    for instances on which the model makes wrong predictions.\n    \"\"\"\n\n    def __init__(self, cfg):\n        \"\"\"\n        Args:\n            cfg (CfgNode): configs. Details can be found in\n                slowfast/config/defaults.py\n        \"\"\"\n        self.cfg = cfg\n        self.class_names, _, self.subset = get_class_names(\n            cfg.TENSORBOARD.CLASS_NAMES_PATH,\n            subset_path=cfg.TENSORBOARD.WRONG_PRED_VIS.SUBSET_PATH,\n        )\n        if self.subset is not None:\n            self.subset = set(self.subset)\n        self.num_class = cfg.MODEL.NUM_CLASSES\n        self.video_vis = VideoVisualizer(\n            cfg.MODEL.NUM_CLASSES,\n            cfg.TENSORBOARD.CLASS_NAMES_PATH,\n            1,\n            cfg.TENSORBOARD.MODEL_VIS.COLORMAP,\n        )\n        self.tag = cfg.TENSORBOARD.WRONG_PRED_VIS.TAG\n        self.writer = tb.TensorboardWriter(cfg)\n        self.model_incorrect_classes = set()\n\n    def _pick_wrong_preds(self, labels, preds):\n        \"\"\"\n        Returns a 1D boolean mask marking the instances that have wrong\n        predictions and whose true labels are in the specified subset.\n        Args:\n            labels (tensor): tensor of shape (n_instances,) containing class ids.\n            preds (tensor): class scores from model, shape (n_instances, n_classes)\n        Returns:\n            mask (tensor): boolean tensor. 
`mask[i]` is True if `model` makes a wrong prediction.\n        \"\"\"\n        subset_mask = torch.ones(size=(len(labels),), dtype=torch.bool)\n        if self.subset is not None:\n            for i, label in enumerate(labels):\n                # Compare as a Python int: tensors hash by identity, so a\n                # tensor element would never match the ints in the subset.\n                if int(label) not in self.subset:\n                    subset_mask[i] = False\n\n        preds_ids = torch.argmax(preds, dim=-1)\n\n        mask = preds_ids != labels\n        mask &= subset_mask\n        for i, wrong_pred in enumerate(mask):\n            if wrong_pred:\n                # Store plain ints so that repeated class ids are deduplicated.\n                self.model_incorrect_classes.add(int(labels[i]))\n\n        return mask\n\n    def visualize_vid(self, video_input, labels, preds, batch_idx):\n        \"\"\"\n        Draw predicted labels on video inputs and visualize all incorrectly classified\n        videos in the current batch.\n        Args:\n            video_input (list of list of tensor(s)): list of videos for all pathways.\n            labels (array-like): shape (n_instances,) of true label for each instance.\n            preds (tensor): shape (n_instances, n_classes). The predicted scores for all instances.\n            batch_idx (int): batch index of the current videos.\n        \"\"\"\n\n        def add_video(vid, preds, tag, true_class_name):\n            \"\"\"\n            Draw predicted label on video and add it to Tensorboard.\n            Args:\n                vid (array-like): shape (C, T, H, W). Each image in `vid` is an RGB image.\n                preds (tensor): shape (n_classes,) or (1, n_classes). 
The predicted scores\n                    for the current `vid`.\n                tag (str): tag for `vid` in Tensorboard.\n                true_class_name (str): the ground-truth class name of the current `vid` instance.\n            \"\"\"\n            # Permute to (T, H, W, C).\n            vid = vid.permute(1, 2, 3, 0)\n            vid = data_utils.revert_tensor_normalize(\n                vid.cpu(), self.cfg.DATA.MEAN, self.cfg.DATA.STD\n            )\n            vid = self.video_vis.draw_clip(vid, preds)\n            vid = torch.from_numpy(np.array(vid)).permute(0, 3, 1, 2)\n            vid = torch.unsqueeze(vid, dim=0)\n            self.writer.add_video(vid, tag=\"{}: {}\".format(tag, true_class_name))\n\n        mask = self._pick_wrong_preds(labels, preds)\n        video_indices = torch.squeeze(mask.nonzero(), dim=-1)\n        # Visualize each wrongly classified video.\n        for vid_idx in video_indices:\n            cur_vid_idx = batch_idx * len(video_input[0]) + vid_idx\n            for pathway in range(len(video_input)):\n                add_video(\n                    video_input[pathway][vid_idx],\n                    preds=preds[vid_idx],\n                    tag=self.tag + \"/Video {}, Pathway {}\".format(cur_vid_idx, pathway),\n                    true_class_name=self.class_names[labels[vid_idx]],\n                )\n\n    @property\n    def wrong_class_prediction(self):\n        \"\"\"\n        Return the names of the classes that the model predicted incorrectly.\n        \"\"\"\n        incorrect_class_names = [\n            self.class_names[i] for i in self.model_incorrect_classes\n        ]\n        return list(set(incorrect_class_names))\n\n    def clean(self):\n        \"\"\"\n        Close Tensorboard writer.\n        \"\"\"\n        self.writer.close()\n"
  },
  {
    "path": "slowfast/visualization/predictor.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport queue\n\nimport cv2\nimport slowfast.utils.checkpoint as cu\nimport torch\nfrom detectron2 import model_zoo\nfrom detectron2.config import get_cfg\nfrom detectron2.engine import DefaultPredictor\nfrom slowfast.datasets import cv2_transform\nfrom slowfast.models import build_model\nfrom slowfast.utils import logging\nfrom slowfast.visualization.utils import process_cv2_inputs\n\nlogger = logging.get_logger(__name__)\n\n\nclass Predictor:\n    \"\"\"\n    Action Predictor for action recognition.\n    \"\"\"\n\n    def __init__(self, cfg, gpu_id=None):\n        \"\"\"\n        Args:\n            cfg (CfgNode): configs. Details can be found in\n                slowfast/config/defaults.py\n            gpu_id (Optional[int]): GPU id.\n        \"\"\"\n        if cfg.NUM_GPUS:\n            self.gpu_id = torch.cuda.current_device() if gpu_id is None else gpu_id\n\n        # Build the video model and print model statistics.\n        self.model = build_model(cfg, gpu_id=gpu_id)\n        self.model.eval()\n        self.cfg = cfg\n\n        if cfg.DETECTION.ENABLE:\n            self.object_detector = Detectron2Predictor(cfg, gpu_id=self.gpu_id)\n\n        logger.info(\"Start loading model weights.\")\n        cu.load_test_checkpoint(cfg, self.model)\n        logger.info(\"Finish loading model weights\")\n\n    def __call__(self, task):\n        \"\"\"\n        Returns the prediction results for the current task.\n        Args:\n            task (TaskInfo object): task object that contain\n                the necessary information for action prediction. (e.g. 
frames, boxes)\n        Returns:\n            task (TaskInfo object): the same task info object but filled with\n                prediction values (a tensor) and the corresponding boxes for\n                action detection task.\n        \"\"\"\n        if self.cfg.DETECTION.ENABLE:\n            task = self.object_detector(task)\n\n        frames, bboxes = task.frames, task.bboxes\n        if bboxes is not None:\n            bboxes = cv2_transform.scale_boxes(\n                self.cfg.DATA.TEST_CROP_SIZE,\n                bboxes,\n                task.img_height,\n                task.img_width,\n            )\n        if self.cfg.DEMO.INPUT_FORMAT == \"BGR\":\n            frames = [cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) for frame in frames]\n\n        frames = [\n            cv2_transform.scale(self.cfg.DATA.TEST_CROP_SIZE, frame) for frame in frames\n        ]\n        inputs = process_cv2_inputs(frames, self.cfg)\n        if bboxes is not None:\n            index_pad = torch.full(\n                size=(bboxes.shape[0], 1),\n                fill_value=float(0),\n                device=bboxes.device,\n            )\n\n            # Pad frame index for each box.\n            bboxes = torch.cat([index_pad, bboxes], axis=1)\n        if self.cfg.NUM_GPUS > 0:\n            # Transfer the data to the current GPU device.\n            if isinstance(inputs, (list,)):\n                for i in range(len(inputs)):\n                    inputs[i] = inputs[i].cuda(\n                        device=torch.device(self.gpu_id), non_blocking=True\n                    )\n            else:\n                inputs = inputs.cuda(\n                    device=torch.device(self.gpu_id), non_blocking=True\n                )\n        if self.cfg.DETECTION.ENABLE and not bboxes.shape[0]:\n            preds = torch.tensor([])\n        else:\n            preds = self.model(inputs, bboxes)\n\n        if self.cfg.NUM_GPUS:\n            preds = preds.cpu()\n            if bboxes is not None:\n   
             bboxes = bboxes.detach().cpu()\n\n        preds = preds.detach()\n        task.add_action_preds(preds)\n        if bboxes is not None:\n            task.add_bboxes(bboxes[:, 1:])\n\n        return task\n\n\nclass ActionPredictor:\n    \"\"\"\n    Synchronous Action Prediction and Visualization pipeline with AsyncVis.\n    \"\"\"\n\n    def __init__(self, cfg, async_vis=None, gpu_id=None):\n        \"\"\"\n        Args:\n            cfg (CfgNode): configs. Details can be found in\n                slowfast/config/defaults.py\n            async_vis (AsyncVis object): asynchronous visualizer.\n            gpu_id (Optional[int]): GPU id.\n        \"\"\"\n        self.predictor = Predictor(cfg=cfg, gpu_id=gpu_id)\n        self.async_vis = async_vis\n\n    def put(self, task):\n        \"\"\"\n        Make prediction and put the results in `async_vis` task queue.\n        Args:\n            task (TaskInfo object): task object that contain\n                the necessary information for action prediction. (e.g. frames, boxes)\n        \"\"\"\n        task = self.predictor(task)\n        self.async_vis.get_indices_ls.append(task.id)\n        self.async_vis.put(task)\n\n    def get(self):\n        \"\"\"\n        Get the visualized clips if any.\n        \"\"\"\n        try:\n            task = self.async_vis.get()\n        except (queue.Empty, IndexError):\n            raise IndexError(\"Results are not available yet.\")\n\n        return task\n\n\nclass Detectron2Predictor:\n    \"\"\"\n    Wrapper around Detectron2 to return the required predicted bounding boxes\n    as a ndarray.\n    \"\"\"\n\n    def __init__(self, cfg, gpu_id=None):\n        \"\"\"\n        Args:\n            cfg (CfgNode): configs. 
Details can be found in\n                slowfast/config/defaults.py\n            gpu_id (Optional[int]): GPU id.\n        \"\"\"\n\n        self.cfg = get_cfg()\n        self.cfg.merge_from_file(model_zoo.get_config_file(cfg.DEMO.DETECTRON2_CFG))\n        self.cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = cfg.DEMO.DETECTRON2_THRESH\n        self.cfg.MODEL.WEIGHTS = cfg.DEMO.DETECTRON2_WEIGHTS\n        self.cfg.INPUT.FORMAT = cfg.DEMO.INPUT_FORMAT\n        if cfg.NUM_GPUS and gpu_id is None:\n            gpu_id = torch.cuda.current_device()\n        self.cfg.MODEL.DEVICE = \"cuda:{}\".format(gpu_id) if cfg.NUM_GPUS > 0 else \"cpu\"\n\n        logger.info(\"Initialized Detectron2 Object Detection Model.\")\n\n        self.predictor = DefaultPredictor(self.cfg)\n\n    def __call__(self, task):\n        \"\"\"\n        Predict bounding boxes of humans on the middle frame of the clip.\n        Args:\n            task (TaskInfo object): task object that contains\n                the necessary information for action prediction. (e.g. frames)\n        Returns:\n            task (TaskInfo object): the same task info object but filled with\n                the predicted bounding boxes (a tensor) of the detected\n                humans.\n        \"\"\"\n        middle_frame = task.frames[len(task.frames) // 2]\n        outputs = self.predictor(middle_frame)\n        # Get only human instances\n        mask = outputs[\"instances\"].pred_classes == 0\n        pred_boxes = outputs[\"instances\"].pred_boxes.tensor[mask]\n        task.add_bboxes(pred_boxes)\n\n        return task\n"
  },
  {
    "path": "slowfast/visualization/tensorboard_vis.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport logging as log\nimport math\nimport os\n\nimport matplotlib.pyplot as plt\nimport slowfast.utils.logging as logging\nimport slowfast.visualization.utils as vis_utils\nimport torch\nfrom slowfast.utils.misc import get_class_names\nfrom torch.utils.tensorboard import SummaryWriter\nfrom torchvision.utils import make_grid\n\nlogger = logging.get_logger(__name__)\nlog.getLogger(\"matplotlib\").setLevel(log.ERROR)\n\n\nclass TensorboardWriter:\n    \"\"\"\n    Helper class to log information to Tensorboard.\n    \"\"\"\n\n    def __init__(self, cfg):\n        \"\"\"\n        Args:\n            cfg (CfgNode): configs. Details can be found in\n                slowfast/config/defaults.py\n        \"\"\"\n        # class_names: list of class names.\n        # cm_subset_classes: a list of class ids -- a user-specified subset.\n        # parent_map: dictionary where key is the parent class name and\n        #   value is a list of ids of its children classes.\n        # hist_subset_classes: a list of class ids -- user-specified to plot histograms.\n        (\n            self.class_names,\n            self.cm_subset_classes,\n            self.parent_map,\n            self.hist_subset_classes,\n        ) = (None, None, None, None)\n        self.cfg = cfg\n        self.cm_figsize = cfg.TENSORBOARD.CONFUSION_MATRIX.FIGSIZE\n        self.hist_figsize = cfg.TENSORBOARD.HISTOGRAM.FIGSIZE\n\n        if cfg.TENSORBOARD.LOG_DIR == \"\":\n            log_dir = os.path.join(cfg.OUTPUT_DIR, \"runs-{}\".format(cfg.TRAIN.DATASET))\n        else:\n            log_dir = os.path.join(cfg.OUTPUT_DIR, cfg.TENSORBOARD.LOG_DIR)\n\n        self.writer = SummaryWriter(log_dir=log_dir)\n        logger.info(\n            \"To see logged results in Tensorboard, please launch using the command \\\n            `tensorboard  --port=<port-number> --logdir {}`\".format(log_dir)\n        
)\n\n        if cfg.TENSORBOARD.CLASS_NAMES_PATH != \"\":\n            if cfg.DETECTION.ENABLE:\n                logger.info(\n                    \"Plotting confusion matrix is currently \\\n                    not supported for detection.\"\n                )\n            (\n                self.class_names,\n                self.parent_map,\n                self.cm_subset_classes,\n            ) = get_class_names(\n                cfg.TENSORBOARD.CLASS_NAMES_PATH,\n                cfg.TENSORBOARD.CATEGORIES_PATH,\n                cfg.TENSORBOARD.CONFUSION_MATRIX.SUBSET_PATH,\n            )\n\n            if cfg.TENSORBOARD.HISTOGRAM.ENABLE:\n                if cfg.DETECTION.ENABLE:\n                    logger.info(\n                        \"Plotting histogram is not currently \\\n                    supported for detection tasks.\"\n                    )\n                if cfg.TENSORBOARD.HISTOGRAM.SUBSET_PATH != \"\":\n                    _, _, self.hist_subset_classes = get_class_names(\n                        cfg.TENSORBOARD.CLASS_NAMES_PATH,\n                        None,\n                        cfg.TENSORBOARD.HISTOGRAM.SUBSET_PATH,\n                    )\n\n    def add_scalars(self, data_dict, global_step=None):\n        \"\"\"\n        Add multiple scalars to Tensorboard logs.\n        Args:\n            data_dict (dict): key is a string specifying the tag of value.\n            global_step (Optinal[int]): Global step value to record.\n        \"\"\"\n        if self.writer is not None:\n            for key, item in data_dict.items():\n                self.writer.add_scalar(key, item, global_step)\n\n    def plot_eval(self, preds, labels, global_step=None):\n        \"\"\"\n        Plot confusion matrices and histograms for eval/test set.\n        Args:\n            preds (tensor or list of tensors): list of predictions.\n            labels (tensor or list of tensors): list of labels.\n            global step (Optional[int]): current step in 
eval/test.\n        \"\"\"\n        if not self.cfg.DETECTION.ENABLE:\n            cmtx = None\n            if self.cfg.TENSORBOARD.CONFUSION_MATRIX.ENABLE:\n                cmtx = vis_utils.get_confusion_matrix(\n                    preds, labels, self.cfg.MODEL.NUM_CLASSES\n                )\n                # Add full confusion matrix.\n                add_confusion_matrix(\n                    self.writer,\n                    cmtx,\n                    self.cfg.MODEL.NUM_CLASSES,\n                    global_step=global_step,\n                    class_names=self.class_names,\n                    figsize=self.cm_figsize,\n                )\n                # If a list of subset is provided, plot confusion matrix subset.\n                if self.cm_subset_classes is not None:\n                    add_confusion_matrix(\n                        self.writer,\n                        cmtx,\n                        self.cfg.MODEL.NUM_CLASSES,\n                        global_step=global_step,\n                        subset_ids=self.cm_subset_classes,\n                        class_names=self.class_names,\n                        tag=\"Confusion Matrix Subset\",\n                        figsize=self.cm_figsize,\n                    )\n                # If a parent-child classes mapping is provided, plot confusion\n                # matrices grouped by parent classes.\n                if self.parent_map is not None:\n                    # Get list of tags (parent categories names) and their children.\n                    for parent_class, children_ls in self.parent_map.items():\n                        tag = (\n                            \"Confusion Matrices Grouped by Parent Classes/\"\n                            + parent_class\n                        )\n                        add_confusion_matrix(\n                            self.writer,\n                            cmtx,\n                            self.cfg.MODEL.NUM_CLASSES,\n                            
global_step=global_step,\n                            subset_ids=children_ls,\n                            class_names=self.class_names,\n                            tag=tag,\n                            figsize=self.cm_figsize,\n                        )\n            if self.cfg.TENSORBOARD.HISTOGRAM.ENABLE:\n                if cmtx is None:\n                    cmtx = vis_utils.get_confusion_matrix(\n                        preds, labels, self.cfg.MODEL.NUM_CLASSES\n                    )\n                plot_hist(\n                    self.writer,\n                    cmtx,\n                    self.cfg.MODEL.NUM_CLASSES,\n                    self.cfg.TENSORBOARD.HISTOGRAM.TOPK,\n                    global_step=global_step,\n                    subset_ids=self.hist_subset_classes,\n                    class_names=self.class_names,\n                    figsize=self.hist_figsize,\n                )\n\n    def add_video(self, vid_tensor, tag=\"Video Input\", global_step=None, fps=4):\n        \"\"\"\n        Add input to tensorboard SummaryWriter as a video.\n        Args:\n            vid_tensor (tensor): shape of (B, T, C, H, W). 
Values should lie\n                [0, 255] for type uint8 or [0, 1] for type float.\n            tag (Optional[str]): name of the video.\n            global_step(Optional[int]): current step.\n            fps (int): frames per second.\n        \"\"\"\n        self.writer.add_video(tag, vid_tensor, global_step=global_step, fps=fps)\n\n    def plot_weights_and_activations(\n        self,\n        weight_activation_dict,\n        tag=\"\",\n        normalize=False,\n        global_step=None,\n        batch_idx=None,\n        indexing_dict=None,\n        heat_map=True,\n    ):\n        \"\"\"\n        Visualize weights/ activations tensors to Tensorboard.\n        Args:\n            weight_activation_dict (dict[str, tensor]): a dictionary of the pair {layer_name: tensor},\n                where layer_name is a string and tensor is the weights/activations of\n                the layer we want to visualize.\n            tag (Optional[str]): name of the video.\n            normalize (bool): If True, the tensor is normalized. (Default to False)\n            global_step(Optional[int]): current step.\n            batch_idx (Optional[int]): current batch index to visualize. 
If None,\n                visualize the entire batch.\n            indexing_dict (Optional[dict]): a dictionary of the {layer_name: indexing}.\n                where indexing is numpy-like fancy indexing.\n            heatmap (bool): whether to add heatmap to the weights/ activations.\n        \"\"\"\n        for name, array in weight_activation_dict.items():\n            if batch_idx is None:\n                # Select all items in the batch if batch_idx is not provided.\n                batch_idx = list(range(array.shape[0]))\n            if indexing_dict is not None:\n                fancy_indexing = indexing_dict[name]\n                fancy_indexing = (batch_idx,) + fancy_indexing\n                array = array[fancy_indexing]\n            else:\n                array = array[batch_idx]\n            add_ndim_array(\n                self.writer,\n                array,\n                tag + name,\n                normalize=normalize,\n                global_step=global_step,\n                heat_map=heat_map,\n            )\n\n    def flush(self):\n        self.writer.flush()\n\n    def close(self):\n        self.writer.flush()\n        self.writer.close()\n\n\ndef add_confusion_matrix(\n    writer,\n    cmtx,\n    num_classes,\n    global_step=None,\n    subset_ids=None,\n    class_names=None,\n    tag=\"Confusion Matrix\",\n    figsize=None,\n):\n    \"\"\"\n    Calculate and plot confusion matrix to a SummaryWriter.\n    Args:\n        writer (SummaryWriter): the SummaryWriter to write the matrix to.\n        cmtx (ndarray): confusion matrix.\n        num_classes (int): total number of classes.\n        global_step (Optional[int]): current step.\n        subset_ids (list of ints): a list of label indices to keep.\n        class_names (list of strs, optional): a list of all class names.\n        tag (str or list of strs): name(s) of the confusion matrix image.\n        figsize (Optional[float, float]): the figure size of the confusion matrix.\n            If 
None, default to [6.4, 4.8].\n\n    \"\"\"\n    if subset_ids is None or len(subset_ids) != 0:\n        # If class names are not provided, use class indices as class names.\n        if class_names is None:\n            class_names = [str(i) for i in range(num_classes)]\n        # If subset is not provided, take every classes.\n        if subset_ids is None:\n            subset_ids = list(range(num_classes))\n\n        sub_cmtx = cmtx[subset_ids, :][:, subset_ids]\n        sub_names = [class_names[j] for j in subset_ids]\n\n        sub_cmtx = vis_utils.plot_confusion_matrix(\n            sub_cmtx,\n            num_classes=len(subset_ids),\n            class_names=sub_names,\n            figsize=figsize,\n        )\n        # Add the confusion matrix image to writer.\n        writer.add_figure(tag=tag, figure=sub_cmtx, global_step=global_step)\n\n\ndef plot_hist(\n    writer,\n    cmtx,\n    num_classes,\n    k=10,\n    global_step=None,\n    subset_ids=None,\n    class_names=None,\n    figsize=None,\n):\n    \"\"\"\n    Given all predictions and all true labels, plot histograms of top-k most\n    frequently predicted classes for each true class.\n\n    Args:\n        writer (SummaryWriter object): a tensorboard SummaryWriter object.\n        cmtx (ndarray): confusion matrix.\n        num_classes (int): total number of classes.\n        k (int): top k to plot histograms.\n        global_step (Optional[int]): current step.\n        subset_ids (list of ints, optional): class indices to plot histogram.\n        mapping (list of strings): names of all classes.\n        figsize (Optional[float, float]): the figure size of the confusion matrix.\n            If None, default to [6.4, 4.8].\n    \"\"\"\n    if subset_ids is None or len(subset_ids) != 0:\n        if subset_ids is None:\n            subset_ids = set(range(num_classes))\n        else:\n            subset_ids = set(subset_ids)\n        # If class names are not provided, use their indices as names.\n        if 
class_names is None:\n            class_names = list(range(num_classes))\n\n        for i in subset_ids:\n            pred = cmtx[i]\n            hist = vis_utils.plot_topk_histogram(\n                class_names[i],\n                torch.Tensor(pred),\n                k,\n                class_names,\n                figsize=figsize,\n            )\n            writer.add_figure(\n                tag=\"Top {} predictions by classes/{}\".format(k, class_names[i]),\n                figure=hist,\n                global_step=global_step,\n            )\n\n\ndef add_ndim_array(\n    writer,\n    array,\n    name,\n    nrow=None,\n    normalize=False,\n    global_step=None,\n    heat_map=True,\n):\n    \"\"\"\n    Visualize and add n-dimensional tensors to a Tensorboard SummaryWriter. Tensors\n    will be visualized as a 2D grid image.\n    Args:\n        writer (SummaryWriter): Tensorboard SummaryWriter.\n        array (tensor): tensor to visualize.\n        name (str): name of the tensor.\n        nrow (Optional[int]): number of 2D filters in each row in the grid image.\n        normalize (bool): whether to normalize when we have multiple 2D filters.\n            Default to False.\n        global_step (Optional[int]): current step.\n        heat_map (bool): whether to add a heat map to each 2D filter in array.\n    \"\"\"\n    if array is not None and array.ndim != 0:\n        if array.ndim == 1:\n            reshaped_array = array.unsqueeze(0)\n            if nrow is None:\n                nrow = int(math.sqrt(reshaped_array.size()[1]))\n            reshaped_array = reshaped_array.view(-1, nrow)\n            if heat_map:\n                reshaped_array = add_heatmap(reshaped_array)\n                writer.add_image(\n                    name,\n                    reshaped_array,\n                    global_step=global_step,\n                    dataformats=\"CHW\",\n                )\n            else:\n                writer.add_image(\n                    
name,\n                    reshaped_array,\n                    global_step=global_step,\n                    dataformats=\"HW\",\n                )\n        elif array.ndim == 2:\n            reshaped_array = array\n            if heat_map:\n                heatmap = add_heatmap(reshaped_array)\n                writer.add_image(\n                    name, heatmap, global_step=global_step, dataformats=\"CHW\"\n                )\n            else:\n                writer.add_image(\n                    name,\n                    reshaped_array,\n                    global_step=global_step,\n                    dataformats=\"HW\",\n                )\n        else:\n            last2_dims = array.size()[-2:]\n            reshaped_array = array.view(-1, *last2_dims)\n            if heat_map:\n                reshaped_array = [\n                    add_heatmap(array_2d).unsqueeze(0) for array_2d in reshaped_array\n                ]\n                reshaped_array = torch.cat(reshaped_array, dim=0)\n            else:\n                reshaped_array = reshaped_array.unsqueeze(1)\n            if nrow is None:\n                nrow = int(math.sqrt(reshaped_array.size()[0]))\n            img_grid = make_grid(reshaped_array, nrow, padding=1, normalize=normalize)\n            writer.add_image(name, img_grid, global_step=global_step)\n\n\ndef add_heatmap(tensor):\n    \"\"\"\n    Add heatmap to 2D tensor.\n    Args:\n        tensor (tensor): a 2D tensor. Tensor value must be in [0..1] range.\n    Returns:\n        heatmap (tensor): a 3D tensor. 
Result of applying heatmap to the 2D tensor.\n    \"\"\"\n    assert tensor.ndim == 2, \"Only support 2D tensors.\"\n    # Move tensor to cpu if necessary.\n    if tensor.device != torch.device(\"cpu\"):\n        arr = tensor.cpu()\n    else:\n        arr = tensor\n    arr = arr.numpy()\n    # Get the color map by name.\n    cm = plt.get_cmap(\"viridis\")\n    heatmap = cm(arr)\n    heatmap = heatmap[:, :, :3]\n    # Convert (H, W, C) to (C, H, W)\n    heatmap = torch.Tensor(heatmap).permute(2, 0, 1)\n    return heatmap\n"
  },
  {
    "path": "slowfast/visualization/utils.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport itertools\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport slowfast.utils.logging as logging\nimport torch\nfrom sklearn.metrics import confusion_matrix\nfrom slowfast.datasets.utils import pack_pathway_output, tensor_normalize\n\nlogger = logging.get_logger(__name__)\n\n\ndef get_confusion_matrix(preds, labels, num_classes, normalize=\"true\"):\n    \"\"\"\n    Calculate confusion matrix on the provided preds and labels.\n    Args:\n        preds (tensor or list of tensors): predictions. Each tensor is\n            in the shape of (n_batch, num_classes). Tensor(s) must be on CPU.\n        labels (tensor or list of tensors): corresponding labels. Each tensor is\n            in the shape of either (n_batch,) or (n_batch, num_classes).\n            Tensor(s) must be on CPU.\n        num_classes (int): number of classes.\n        normalize (Optional[str]): {\"true\", \"pred\", \"all\"}, default=\"true\"\n            Normalizes confusion matrix over the true (rows), predicted (columns)\n            conditions or all the population. 
If None, confusion matrix\n            will not be normalized.\n    Returns:\n        cmtx (ndarray): confusion matrix of size (num_classes x num_classes)\n    \"\"\"\n    if isinstance(preds, list):\n        preds = torch.cat(preds, dim=0)\n    if isinstance(labels, list):\n        labels = torch.cat(labels, dim=0)\n    # If labels are one-hot encoded, get their indices.\n    if labels.ndim == preds.ndim:\n        labels = torch.argmax(labels, dim=-1)\n    # Get the predicted class indices for examples.\n    preds = torch.flatten(torch.argmax(preds, dim=-1))\n    labels = torch.flatten(labels)\n    cmtx = confusion_matrix(\n        labels, preds, labels=list(range(num_classes)), normalize=normalize\n    )\n    return cmtx\n\n\ndef plot_confusion_matrix(cmtx, num_classes, class_names=None, figsize=None):\n    \"\"\"\n    A function to create a colored and labeled confusion matrix matplotlib figure\n    given true labels and preds.\n    Args:\n        cmtx (ndarray): confusion matrix.\n        num_classes (int): total number of classes.\n        class_names (Optional[list of strs]): a list of class names.\n        figsize (Optional[float, float]): the figure size of the confusion matrix.\n            If None, default to [6.4, 4.8].\n\n    Returns:\n        img (figure): matplotlib figure.\n    \"\"\"\n    if class_names is None or type(class_names) != list:\n        class_names = [str(i) for i in range(num_classes)]\n\n    figure = plt.figure(figsize=figsize)\n    plt.imshow(cmtx, interpolation=\"nearest\", cmap=plt.cm.Blues)\n    plt.title(\"Confusion matrix\")\n    plt.colorbar()\n    tick_marks = np.arange(len(class_names))\n    plt.xticks(tick_marks, class_names, rotation=45)\n    plt.yticks(tick_marks, class_names)\n\n    # Use white text if squares are dark; otherwise black.\n    threshold = cmtx.max() / 2.0\n    for i, j in itertools.product(range(cmtx.shape[0]), range(cmtx.shape[1])):\n        color = \"white\" if cmtx[i, j] > threshold else \"black\"\n      
  plt.text(\n            j,\n            i,\n            format(cmtx[i, j], \".2f\") if cmtx[i, j] != 0 else \".\",\n            horizontalalignment=\"center\",\n            color=color,\n        )\n\n    plt.tight_layout()\n    plt.ylabel(\"True label\")\n    plt.xlabel(\"Predicted label\")\n\n    return figure\n\n\ndef plot_topk_histogram(tag, array, k=10, class_names=None, figsize=None):\n    \"\"\"\n    Plot histogram of top-k value from the given array.\n    Args:\n        tag (str): histogram title.\n        array (tensor): a tensor to draw top k value from.\n        k (int): number of top values to draw from array.\n            Defaut to 10.\n        class_names (list of strings, optional):\n            a list of names for values in array.\n        figsize (Optional[float, float]): the figure size of the confusion matrix.\n            If None, default to [6.4, 4.8].\n    Returns:\n        fig (matplotlib figure): a matplotlib figure of the histogram.\n    \"\"\"\n    val, ind = torch.topk(array, k)\n\n    fig = plt.Figure(figsize=figsize, facecolor=\"w\", edgecolor=\"k\")\n\n    ax = fig.add_subplot(1, 1, 1)\n\n    if class_names is None:\n        class_names = [str(i) for i in ind]\n    else:\n        class_names = [class_names[i] for i in ind]\n\n    tick_marks = np.arange(k)\n    width = 0.75\n    ax.bar(\n        tick_marks,\n        val,\n        width,\n        color=\"orange\",\n        tick_label=class_names,\n        edgecolor=\"w\",\n        linewidth=1,\n    )\n\n    ax.set_xlabel(\"Candidates\")\n    ax.set_xticks(tick_marks)\n    ax.set_xticklabels(class_names, rotation=-45, ha=\"center\")\n    ax.xaxis.set_label_position(\"bottom\")\n    ax.xaxis.tick_bottom()\n\n    y_tick = np.linspace(0, 1, num=10)\n    ax.set_ylabel(\"Frequency\")\n    ax.set_yticks(y_tick)\n    y_labels = [format(i, \".1f\") for i in y_tick]\n    ax.set_yticklabels(y_labels, ha=\"center\")\n\n    for i, v in enumerate(val.numpy()):\n        ax.text(\n            i - 0.1,\n 
           v + 0.03,\n            format(v, \".2f\"),\n            color=\"orange\",\n            fontweight=\"bold\",\n        )\n\n    ax.set_title(tag)\n\n    fig.set_tight_layout(True)\n\n    return fig\n\n\nclass GetWeightAndActivation:\n    \"\"\"\n    A class used to get weights and activations from specified layers from a Pytorch model.\n    \"\"\"\n\n    def __init__(self, model, layers):\n        \"\"\"\n        Args:\n            model (nn.Module): the model containing layers to obtain weights and activations from.\n            layers (list of strings): a list of layer names to obtain weights and activations from.\n                Names are hierarchical, separated by /. For example, If a layer follow a path\n                \"s1\" ---> \"pathway0_stem\" ---> \"conv\", the layer path is \"s1/pathway0_stem/conv\".\n        \"\"\"\n        self.model = model\n        self.hooks = {}\n        self.layers_names = layers\n        # eval mode\n        self.model.eval()\n        self._register_hooks()\n\n    def _get_layer(self, layer_name):\n        \"\"\"\n        Return a layer (nn.Module Object) given a hierarchical layer name, separated by /.\n        Args:\n            layer_name (str): the name of the layer.\n        \"\"\"\n        layer_ls = layer_name.split(\"/\")\n        prev_module = self.model\n        for layer in layer_ls:\n            prev_module = prev_module._modules[layer]\n\n        return prev_module\n\n    def _register_single_hook(self, layer_name):\n        \"\"\"\n        Register hook to a layer, given layer_name, to obtain activations.\n        Args:\n            layer_name (str): name of the layer.\n        \"\"\"\n\n        def hook_fn(module, input, output):\n            self.hooks[layer_name] = output.clone().detach()\n\n        layer = get_layer(self.model, layer_name)\n        layer.register_forward_hook(hook_fn)\n\n    def _register_hooks(self):\n        \"\"\"\n        Register hooks to layers in `self.layers_names`.\n        
\"\"\"\n        for layer_name in self.layers_names:\n            self._register_single_hook(layer_name)\n\n    def get_activations(self, input, bboxes=None):\n        \"\"\"\n        Obtain all activations from layers that we register hooks for.\n        Args:\n            input (tensors, list of tensors): the model input.\n            bboxes (Optional): Bouding boxes data that might be required\n                by the model.\n        Returns:\n            activation_dict (Python dictionary): a dictionary of the pair\n                {layer_name: list of activations}, where activations are outputs returned\n                by the layer.\n        \"\"\"\n        input_clone = [inp.clone() for inp in input]\n        if bboxes is not None:\n            preds = self.model(input_clone, bboxes)\n        else:\n            preds = self.model(input_clone)\n\n        activation_dict = {}\n        for layer_name, hook in self.hooks.items():\n            # list of activations for each instance.\n            activation_dict[layer_name] = hook\n\n        return activation_dict, preds\n\n    def get_weights(self):\n        \"\"\"\n        Returns weights from registered layers.\n        Returns:\n            weights (Python dictionary): a dictionary of the pair\n            {layer_name: weight}, where weight is the weight tensor.\n        \"\"\"\n        weights = {}\n        for layer in self.layers_names:\n            cur_layer = get_layer(self.model, layer)\n            if hasattr(cur_layer, \"weight\"):\n                weights[layer] = cur_layer.weight.clone().detach()\n            else:\n                logger.error(\"Layer {} does not have weight attribute.\".format(layer))\n        return weights\n\n\ndef get_indexing(string):\n    \"\"\"\n    Parse numpy-like fancy indexing from a string.\n    Args:\n        string (str): string represent the indices to take\n            a subset of from array. 
Indices within each dimension\n            are separated by `,`; indices for different dimensions\n            are separated by `;`.\n            e.g.: For a numpy array `arr` of shape (3,3,3), the string \"1,2;1,2\"\n            means taking the sub-array `arr[[1,2], [1,2]]`.\n    Returns:\n        final_indexing (tuple): the parsed indexing.\n    \"\"\"\n    index_ls = string.strip().split(\";\")\n    final_indexing = []\n    for index in index_ls:\n        index_single_dim = index.split(\",\")\n        index_single_dim = [int(i) for i in index_single_dim]\n        final_indexing.append(index_single_dim)\n\n    return tuple(final_indexing)\n\n\ndef process_layer_index_data(layer_ls, layer_name_prefix=\"\"):\n    \"\"\"\n    Extract layer names and numpy-like fancy indexing from a list of strings.\n    Args:\n        layer_ls (list of strs): list of strings containing data about layer names\n            and their indexing. For each string, layer name and indexing are separated by whitespace.\n            e.g.: [layer1 1,2;2, layer2, layer3 150;3,4]\n        layer_name_prefix (Optional[str]): prefix to be added to each layer name.\n    Returns:\n        layer_name (list of strings): a list of layer names.\n        indexing_dict (Python dict): a dictionary of the pair\n            {one_layer_name: indexing_for_that_layer}\n    \"\"\"\n\n    layer_name, indexing_dict = [], {}\n    for layer in layer_ls:\n        ls = layer.split()\n        name = layer_name_prefix + ls[0]\n        layer_name.append(name)\n        if len(ls) == 2:\n            indexing_dict[name] = get_indexing(ls[1])\n        else:\n            indexing_dict[name] = ()\n    return layer_name, indexing_dict\n\n\ndef process_cv2_inputs(frames, cfg):\n    \"\"\"\n    Normalize and prepare inputs as a list of tensors. Each tensor\n    corresponds to a unique pathway.\n    Args:\n        frames (list of array): list of input images (corresponding to one clip) in range [0, 255].\n        cfg (CfgNode): configs.
Details can be found in\n            slowfast/config/defaults.py\n    \"\"\"\n    inputs = torch.from_numpy(np.array(frames)).float() / 255\n    inputs = tensor_normalize(inputs, cfg.DATA.MEAN, cfg.DATA.STD)\n    # T H W C -> C T H W.\n    inputs = inputs.permute(3, 0, 1, 2)\n    # Sample frames for num_frames specified.\n    index = torch.linspace(0, inputs.shape[1] - 1, cfg.DATA.NUM_FRAMES).long()\n    inputs = torch.index_select(inputs, 1, index)\n    inputs = pack_pathway_output(cfg, inputs)\n    inputs = [inp.unsqueeze(0) for inp in inputs]\n    return inputs\n\n\ndef get_layer(model, layer_name):\n    \"\"\"\n    Return the targeted layer (nn.Module Object) given a hierarchical layer name,\n    separated by /.\n    Args:\n        model (model): model to get layers from.\n        layer_name (str): name of the layer.\n    Returns:\n        prev_module (nn.Module): the layer from the model with `layer_name` name.\n    \"\"\"\n    layer_ls = layer_name.split(\"/\")\n    prev_module = model\n    for layer in layer_ls:\n        prev_module = prev_module._modules[layer]\n\n    return prev_module\n\n\nclass TaskInfo:\n    def __init__(self):\n        self.frames = None\n        self.id = -1\n        self.bboxes = None\n        self.action_preds = None\n        self.num_buffer_frames = 0\n        self.img_height = -1\n        self.img_width = -1\n        self.crop_size = -1\n        self.clip_vis_size = -1\n\n    def add_frames(self, idx, frames):\n        \"\"\"\n        Add the clip and corresponding id.\n        Args:\n            idx (int): the current index of the clip.\n            frames (list[ndarray]): list of images in \"BGR\" format.\n        \"\"\"\n        self.frames = frames\n        self.id = idx\n\n    def add_bboxes(self, bboxes):\n        \"\"\"\n        Add the corresponding bounding boxes.\n        \"\"\"\n        self.bboxes = bboxes\n\n    def add_action_preds(self, preds):\n        \"\"\"\n        Add the corresponding action predictions.\n        
\"\"\"\n        self.action_preds = preds\n"
  },
  {
    "path": "slowfast/visualization/video_visualizer.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport itertools\nimport logging as log\n\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport slowfast.utils.logging as logging\nimport torch\nfrom detectron2.utils.visualizer import Visualizer\nfrom slowfast.utils.misc import get_class_names\n\nlogger = logging.get_logger(__name__)\nlog.getLogger(\"matplotlib\").setLevel(log.ERROR)\n\n\ndef _create_text_labels(classes, scores, class_names, ground_truth=False):\n    \"\"\"\n    Create text labels.\n    Args:\n        classes (list[int]): a list of class ids for each example.\n        scores (list[float] or None): list of scores for each example.\n        class_names (list[str]): a list of class names, ordered by their ids.\n        ground_truth (bool): whether the labels are ground truth.\n    Returns:\n        labels (list[str]): formatted text labels.\n    \"\"\"\n    try:\n        labels = [class_names[i] for i in classes]\n    except IndexError:\n        logger.error(\"Class indices get out of range: {}\".format(classes))\n        return None\n\n    if ground_truth:\n        labels = [\"[{}] {}\".format(\"GT\", label) for label in labels]\n    elif scores is not None:\n        assert len(classes) == len(scores)\n        labels = [\"[{:.2f}] {}\".format(s, label) for s, label in zip(scores, labels)]\n    return labels\n\n\nclass ImgVisualizer(Visualizer):\n    def __init__(self, img_rgb, meta, **kwargs):\n        \"\"\"\n        See https://github.com/facebookresearch/detectron2/blob/master/detectron2/utils/visualizer.py\n        for more details.\n        Args:\n            img_rgb: a tensor or numpy array of shape (H, W, C), where H and W correspond to\n                the height and width of the image respectively. C is the number of\n                color channels. The image is required to be in RGB format since that\n                is a requirement of the Matplotlib library. 
The image is also expected\n                to be in the range [0, 255].\n            meta (MetadataCatalog): image metadata.\n                See https://github.com/facebookresearch/detectron2/blob/81d5a87763bfc71a492b5be89b74179bd7492f6b/detectron2/data/catalog.py#L90\n        \"\"\"\n        super(ImgVisualizer, self).__init__(img_rgb, meta, **kwargs)\n\n    def draw_text(\n        self,\n        text,\n        position,\n        *,\n        font_size=None,\n        color=\"w\",\n        horizontal_alignment=\"center\",\n        vertical_alignment=\"bottom\",\n        box_facecolor=\"black\",\n        alpha=0.5,\n    ):\n        \"\"\"\n        Draw text at the specified position.\n        Args:\n            text (str): the text to draw on image.\n            position (list of 2 ints): the x,y coordinate to place the text.\n            font_size (Optional[int]): font of the text. If not provided, a font size\n                proportional to the image width is calculated and used.\n            color (str): color of the text. Refer to `matplotlib.colors` for full list\n                of formats that are accepted.\n            horizontal_alignment (str): see `matplotlib.text.Text`.\n            vertical_alignment (str): see `matplotlib.text.Text`.\n            box_facecolor (str): color of the box wrapped around the text. 
Refer to\n                `matplotlib.colors` for full list of formats that are accepted.\n            alpha (float): transparency level of the box.\n        \"\"\"\n        if not font_size:\n            font_size = self._default_font_size\n        x, y = position\n        self.output.ax.text(\n            x,\n            y,\n            text,\n            size=font_size * self.output.scale,\n            family=\"monospace\",\n            bbox={\n                \"facecolor\": box_facecolor,\n                \"alpha\": alpha,\n                \"pad\": 0.7,\n                \"edgecolor\": \"none\",\n            },\n            verticalalignment=vertical_alignment,\n            horizontalalignment=horizontal_alignment,\n            color=color,\n            zorder=10,\n        )\n\n    def draw_multiple_text(\n        self,\n        text_ls,\n        box_coordinate,\n        *,\n        top_corner=True,\n        font_size=None,\n        color=\"w\",\n        box_facecolors=\"black\",\n        alpha=0.5,\n    ):\n        \"\"\"\n        Draw a list of text labels for some bounding box on the image.\n        Args:\n            text_ls (list of strings): a list of text labels.\n            box_coordinate (tensor): shape (4,). The (x_left, y_top, x_right, y_bottom)\n                coordinates of the box.\n            top_corner (bool): If True, draw the text labels at (x_left, y_top) of the box.\n                Else, draw labels at (x_left, y_bottom).\n            font_size (Optional[int]): font of the text. If not provided, a font size\n                proportional to the image width is calculated and used.\n            color (str): color of the text. Refer to `matplotlib.colors` for full list\n                of formats that are accepted.\n            box_facecolors (str): colors of the box wrapped around the text. 
Refer to\n                `matplotlib.colors` for full list of formats that are accepted.\n            alpha (float): transparency level of the box.\n        \"\"\"\n        if not isinstance(box_facecolors, list):\n            box_facecolors = [box_facecolors] * len(text_ls)\n        assert len(box_facecolors) == len(text_ls), (\n            \"Number of colors provided is not equal to the number of text labels.\"\n        )\n        if not font_size:\n            font_size = self._default_font_size\n        text_box_width = font_size + font_size // 2\n        # If the texts does not fit in the assigned location,\n        # we split the text and draw it in another place.\n        if top_corner:\n            num_text_split = self._align_y_top(\n                box_coordinate, len(text_ls), text_box_width\n            )\n            y_corner = 1\n        else:\n            num_text_split = len(text_ls) - self._align_y_bottom(\n                box_coordinate, len(text_ls), text_box_width\n            )\n            y_corner = 3\n\n        text_color_sorted = sorted(\n            zip(text_ls, box_facecolors), key=lambda x: x[0], reverse=True\n        )\n        if len(text_color_sorted) != 0:\n            text_ls, box_facecolors = zip(*text_color_sorted)\n        else:\n            text_ls, box_facecolors = [], []\n        text_ls, box_facecolors = list(text_ls), list(box_facecolors)\n        self.draw_multiple_text_upward(\n            text_ls[:num_text_split][::-1],\n            box_coordinate,\n            y_corner=y_corner,\n            font_size=font_size,\n            color=color,\n            box_facecolors=box_facecolors[:num_text_split][::-1],\n            alpha=alpha,\n        )\n        self.draw_multiple_text_downward(\n            text_ls[num_text_split:],\n            box_coordinate,\n            y_corner=y_corner,\n            font_size=font_size,\n            color=color,\n            box_facecolors=box_facecolors[num_text_split:],\n            
alpha=alpha,\n        )\n\n    def draw_multiple_text_upward(\n        self,\n        text_ls,\n        box_coordinate,\n        *,\n        y_corner=1,\n        font_size=None,\n        color=\"w\",\n        box_facecolors=\"black\",\n        alpha=0.5,\n    ):\n        \"\"\"\n        Draw a list of text labels for some bounding box on the image in upward direction.\n        The next text label will be on top of the previous one.\n        Args:\n            text_ls (list of strings): a list of text labels.\n            box_coordinate (tensor): shape (4,). The (x_left, y_top, x_right, y_bottom)\n                coordinates of the box.\n            y_corner (int): Value of either 1 or 3. Indicate the index of the y-coordinate of\n                the box to draw labels around.\n            font_size (Optional[int]): font of the text. If not provided, a font size\n                proportional to the image width is calculated and used.\n            color (str): color of the text. Refer to `matplotlib.colors` for full list\n                of formats that are accepted.\n            box_facecolors (str or list of strs): colors of the box wrapped around the text. 
Refer to\n                `matplotlib.colors` for full list of formats that are accepted.\n            alpha (float): transparency level of the box.\n        \"\"\"\n        if not isinstance(box_facecolors, list):\n            box_facecolors = [box_facecolors] * len(text_ls)\n        assert len(box_facecolors) == len(text_ls), (\n            \"Number of colors provided is not equal to the number of text labels.\"\n        )\n\n        assert y_corner in [1, 3], \"Y_corner must be either 1 or 3\"\n        if not font_size:\n            font_size = self._default_font_size\n\n        x, horizontal_alignment = self._align_x_coordinate(box_coordinate)\n        y = box_coordinate[y_corner].item()\n        for i, text in enumerate(text_ls):\n            self.draw_text(\n                text,\n                (x, y),\n                font_size=font_size,\n                color=color,\n                horizontal_alignment=horizontal_alignment,\n                vertical_alignment=\"bottom\",\n                box_facecolor=box_facecolors[i],\n                alpha=alpha,\n            )\n            y -= font_size + font_size // 2\n\n    def draw_multiple_text_downward(\n        self,\n        text_ls,\n        box_coordinate,\n        *,\n        y_corner=1,\n        font_size=None,\n        color=\"w\",\n        box_facecolors=\"black\",\n        alpha=0.5,\n    ):\n        \"\"\"\n        Draw a list of text labels for some bounding box on the image in downward direction.\n        The next text label will be below the previous one.\n        Args:\n            text_ls (list of strings): a list of text labels.\n            box_coordinate (tensor): shape (4,). The (x_left, y_top, x_right, y_bottom)\n                coordinates of the box.\n            y_corner (int): Value of either 1 or 3. Indicate the index of the y-coordinate of\n                the box to draw labels around.\n            font_size (Optional[int]): font of the text. 
If not provided, a font size\n                proportional to the image width is calculated and used.\n            color (str): color of the text. Refer to `matplotlib.colors` for full list\n                of formats that are accepted.\n            box_facecolors (str): colors of the box wrapped around the text. Refer to\n                `matplotlib.colors` for full list of formats that are accepted.\n            alpha (float): transparency level of the box.\n        \"\"\"\n        if not isinstance(box_facecolors, list):\n            box_facecolors = [box_facecolors] * len(text_ls)\n        assert len(box_facecolors) == len(text_ls), (\n            \"Number of colors provided is not equal to the number of text labels.\"\n        )\n\n        assert y_corner in [1, 3], \"Y_corner must be either 1 or 3\"\n        if not font_size:\n            font_size = self._default_font_size\n\n        x, horizontal_alignment = self._align_x_coordinate(box_coordinate)\n        y = box_coordinate[y_corner].item()\n        for i, text in enumerate(text_ls):\n            self.draw_text(\n                text,\n                (x, y),\n                font_size=font_size,\n                color=color,\n                horizontal_alignment=horizontal_alignment,\n                vertical_alignment=\"top\",\n                box_facecolor=box_facecolors[i],\n                alpha=alpha,\n            )\n            y += font_size + font_size // 2\n\n    def _align_x_coordinate(self, box_coordinate):\n        \"\"\"\n        Choose an x-coordinate from the box to make sure the text label\n        does not go out of frames. By default, the left x-coordinate is\n        chosen and text is aligned left. If the box is too close to the\n        right side of the image, then the right x-coordinate is chosen\n        instead and the text is aligned right.\n        Args:\n            box_coordinate (array-like): shape (4,). 
The (x_left, y_top, x_right, y_bottom)\n            coordinates of the box.\n        Returns:\n            x_coordinate (float): the chosen x-coordinate.\n            alignment (str): whether to align left or right.\n        \"\"\"\n        # If the left x-coordinate is greater than 5/6 of the image width,\n        # then we align text to the right of the box. This threshold\n        # is a heuristic.\n        if box_coordinate[0] > (self.output.width * 5) // 6:\n            return box_coordinate[2], \"right\"\n\n        return box_coordinate[0], \"left\"\n\n    def _align_y_top(self, box_coordinate, num_text, textbox_width):\n        \"\"\"\n        Calculate the number of text labels to plot on top of the box\n        without going out of the frame.\n        Args:\n            box_coordinate (array-like): shape (4,). The (x_left, y_top, x_right, y_bottom)\n            coordinates of the box.\n            num_text (int): the number of text labels to plot.\n            textbox_width (float): the width of the box wrapped around text label.\n        \"\"\"\n        dist_to_top = box_coordinate[1]\n        num_text_top = dist_to_top // textbox_width\n\n        if isinstance(num_text_top, torch.Tensor):\n            num_text_top = int(num_text_top.item())\n\n        return min(num_text, num_text_top)\n\n    def _align_y_bottom(self, box_coordinate, num_text, textbox_width):\n        \"\"\"\n        Calculate the number of text labels to plot at the bottom of the box\n        without going out of the frame.\n        Args:\n            box_coordinate (array-like): shape (4,).
The (x_left, y_top, x_right, y_bottom)\n            coordinates of the box.\n            num_text (int): the number of text labels to plot.\n            textbox_width (float): the width of the box wrapped around text label.\n        \"\"\"\n        dist_to_bottom = self.output.height - box_coordinate[3]\n        num_text_bottom = dist_to_bottom // textbox_width\n\n        if isinstance(num_text_bottom, torch.Tensor):\n            num_text_bottom = int(num_text_bottom.item())\n\n        return min(num_text, num_text_bottom)\n\n\nclass VideoVisualizer:\n    def __init__(\n        self,\n        num_classes,\n        class_names_path,\n        top_k=1,\n        colormap=\"rainbow\",\n        thres=0.7,\n        lower_thres=0.3,\n        common_class_names=None,\n        mode=\"top-k\",\n    ):\n        \"\"\"\n        Args:\n            num_classes (int): total number of classes.\n            class_names_path (str): path to json file that maps class names to ids.\n                Must be in the format {classname: id}.\n            top_k (int): number of top predicted classes to plot.\n            colormap (str): the colormap to choose color for class labels from.\n                See https://matplotlib.org/tutorials/colors/colormaps.html\n            thres (float): threshold for picking predicted classes to visualize.\n            lower_thres (Optional[float]): If `common_class_names` is given,\n                this `lower_thres` will be applied to uncommon classes and\n                `thres` will be applied to classes in `common_class_names`.\n            common_class_names (Optional[list of str(s)]): list of common class names\n                to apply `thres` to. Class names not included in `common_class_names` will\n                have `lower_thres` as a threshold.
If None, all classes will have `thres` as a threshold.\n                This is helpful for models trained on highly imbalanced datasets.\n            mode (str): Supported modes are {\"top-k\", \"thres\"}.\n                This is used for choosing predictions for visualization.\n\n        \"\"\"\n        assert mode in [\"top-k\", \"thres\"], \"Mode {} is not supported.\".format(mode)\n        self.mode = mode\n        self.num_classes = num_classes\n        self.class_names, _, _ = get_class_names(class_names_path, None, None)\n        self.top_k = top_k\n        self.thres = thres\n        self.lower_thres = lower_thres\n\n        if mode == \"thres\":\n            self._get_thres_array(common_class_names=common_class_names)\n\n        self.color_map = plt.get_cmap(colormap)\n\n    def _get_color(self, class_id):\n        \"\"\"\n        Get color for a class id.\n        Args:\n            class_id (int): class id.\n        \"\"\"\n        return self.color_map(class_id / self.num_classes)[:3]\n\n    def draw_one_frame(\n        self,\n        frame,\n        preds,\n        bboxes=None,\n        alpha=0.5,\n        text_alpha=0.7,\n        ground_truth=False,\n    ):\n        \"\"\"\n        Draw labels and bounding boxes for one image. By default, predicted labels are drawn in\n        the top left corner of the image or corresponding bounding boxes. For ground truth labels\n        (when `ground_truth` is True), labels will be drawn in the bottom left corner.\n        Args:\n            frame (array-like): a tensor or numpy array of shape (H, W, C), where H and W correspond to\n                the height and width of the image respectively. C is the number of\n                color channels. The image is required to be in RGB format since that\n                is a requirement of the Matplotlib library.
The image is also expected\n                to be in the range [0, 255].\n            preds (tensor or list): If ground_truth is False, provide a float tensor of shape (num_boxes, num_classes)\n                that contains all of the confidence scores of the model.\n                For recognition task, input shape can be (num_classes,). To plot true label (ground_truth is True),\n                preds is a list contains int32 of the shape (num_boxes, true_class_ids) or (true_class_ids,).\n            bboxes (Optional[tensor]): shape (num_boxes, 4) that contains the coordinates of the bounding boxes.\n            alpha (Optional[float]): transparency level of the bounding boxes.\n            text_alpha (Optional[float]): transparency level of the box wrapped around text labels.\n            ground_truth (bool): whether the prodived bounding boxes are ground-truth.\n        \"\"\"\n        if isinstance(preds, torch.Tensor):\n            if preds.ndim == 1:\n                preds = preds.unsqueeze(0)\n            n_instances = preds.shape[0]\n        elif isinstance(preds, list):\n            n_instances = len(preds)\n        else:\n            logger.error(\"Unsupported type of prediction input.\")\n            return\n\n        if ground_truth:\n            top_scores, top_classes = [None] * n_instances, preds\n\n        elif self.mode == \"top-k\":\n            top_scores, top_classes = torch.topk(preds, k=self.top_k)\n            top_scores, top_classes = top_scores.tolist(), top_classes.tolist()\n        elif self.mode == \"thres\":\n            top_scores, top_classes = [], []\n            for pred in preds:\n                mask = pred >= self.thres\n                top_scores.append(pred[mask].tolist())\n                top_class = torch.squeeze(torch.nonzero(mask), dim=-1).tolist()\n                top_classes.append(top_class)\n\n        # Create labels top k predicted classes with their scores.\n        text_labels = []\n        for i in 
range(n_instances):\n            text_labels.append(\n                _create_text_labels(\n                    top_classes[i],\n                    top_scores[i],\n                    self.class_names,\n                    ground_truth=ground_truth,\n                )\n            )\n        frame_visualizer = ImgVisualizer(frame, meta=None)\n        font_size = min(max(np.sqrt(frame.shape[0] * frame.shape[1]) // 35, 5), 9)\n        top_corner = not ground_truth\n        if bboxes is not None:\n            assert len(preds) == len(bboxes), (\n                \"Encounter {} predictions and {} bounding boxes\".format(\n                    len(preds), len(bboxes)\n                )\n            )\n            for i, box in enumerate(bboxes):\n                text = text_labels[i]\n                pred_class = top_classes[i]\n                colors = [self._get_color(pred) for pred in pred_class]\n\n                box_color = \"r\" if ground_truth else \"g\"\n                line_style = \"--\" if ground_truth else \"-.\"\n                frame_visualizer.draw_box(\n                    box,\n                    alpha=alpha,\n                    edge_color=box_color,\n                    line_style=line_style,\n                )\n                frame_visualizer.draw_multiple_text(\n                    text,\n                    box,\n                    top_corner=top_corner,\n                    font_size=font_size,\n                    box_facecolors=colors,\n                    alpha=text_alpha,\n                )\n        else:\n            text = text_labels[0]\n            pred_class = top_classes[0]\n            colors = [self._get_color(pred) for pred in pred_class]\n            frame_visualizer.draw_multiple_text(\n                text,\n                torch.Tensor([0, 5, frame.shape[1], frame.shape[0] - 5]),\n                top_corner=top_corner,\n                font_size=font_size,\n                box_facecolors=colors,\n                
alpha=text_alpha,\n            )\n\n        return frame_visualizer.output.get_image()\n\n    def draw_clip_range(\n        self,\n        frames,\n        preds,\n        bboxes=None,\n        text_alpha=0.5,\n        ground_truth=False,\n        keyframe_idx=None,\n        draw_range=None,\n        repeat_frame=1,\n    ):\n        \"\"\"\n        Draw predicted labels or ground truth classes to clip. Draw bounding boxes to clip\n        if bboxes is provided. Boxes will gradually fade in and out of the clip, centered around\n        the clip's central frame, within the provided `draw_range`.\n        Args:\n            frames (array-like): video data in the shape (T, H, W, C).\n            preds (tensor): a tensor of shape (num_boxes, num_classes) that contains all of the confidence scores\n                of the model. For recognition task or for ground_truth labels, input shape can be (num_classes,).\n            bboxes (Optional[tensor]): shape (num_boxes, 4) that contains the coordinates of the bounding boxes.\n            text_alpha (float): transparency level of the box wrapped around text labels.\n            ground_truth (bool): whether the provided bounding boxes are ground-truth.\n            keyframe_idx (int): the index of keyframe in the clip.\n            draw_range (Optional[list[int]]): only draw frames in range [start_idx, end_idx] (inclusive) in the clip.\n                If None, draw on the entire clip.\n            repeat_frame (int): repeat each frame in draw_range `repeat_frame` times for slow-motion effect.\n        \"\"\"\n        if draw_range is None:\n            draw_range = [0, len(frames) - 1]\n        if draw_range is not None:\n            draw_range[0] = max(0, draw_range[0])\n            left_frames = frames[: draw_range[0]]\n            right_frames = frames[draw_range[1] + 1 :]\n\n        draw_frames = frames[draw_range[0] : draw_range[1] + 1]\n        if keyframe_idx is None:\n            keyframe_idx = len(frames) // 2\n\n    
    img_ls = (\n            list(left_frames)\n            + self.draw_clip(\n                draw_frames,\n                preds,\n                bboxes=bboxes,\n                text_alpha=text_alpha,\n                ground_truth=ground_truth,\n                keyframe_idx=keyframe_idx - draw_range[0],\n                repeat_frame=repeat_frame,\n            )\n            + list(right_frames)\n        )\n\n        return img_ls\n\n    def draw_clip(\n        self,\n        frames,\n        preds,\n        bboxes=None,\n        text_alpha=0.5,\n        ground_truth=False,\n        keyframe_idx=None,\n        repeat_frame=1,\n    ):\n        \"\"\"\n        Draw predicted labels or ground truth classes to clip. Draw bounding boxes to clip\n        if bboxes is provided. Boxes will gradually fade in and out of the clip, centered around\n        the clip's central frame.\n        Args:\n            frames (array-like): video data in the shape (T, H, W, C).\n            preds (tensor): a tensor of shape (num_boxes, num_classes) that contains all of the confidence scores\n                of the model. 
For recognition tasks or for ground_truth labels, input shape can be (num_classes,).\n            bboxes (Optional[tensor]): shape (num_boxes, 4) that contains the coordinates of the bounding boxes.\n            text_alpha (float): transparency level of the box wrapped around text labels.\n            ground_truth (bool): whether the provided bounding boxes are ground-truth.\n            keyframe_idx (int): the index of keyframe in the clip.\n            repeat_frame (int): repeat each frame `repeat_frame` times for a slow-motion effect.\n        \"\"\"\n        assert repeat_frame >= 1, \"`repeat_frame` must be a positive integer.\"\n\n        repeated_seq = range(0, len(frames))\n        repeated_seq = list(\n            itertools.chain.from_iterable(\n                itertools.repeat(x, repeat_frame) for x in repeated_seq\n            )\n        )\n\n        frames, adjusted = self._adjust_frames_type(frames)\n        if keyframe_idx is None:\n            half_left = len(repeated_seq) // 2\n            half_right = (len(repeated_seq) + 1) // 2\n        else:\n            mid = int((keyframe_idx / len(frames)) * len(repeated_seq))\n            half_left = mid\n            half_right = len(repeated_seq) - mid\n\n        alpha_ls = np.concatenate(\n            [\n                np.linspace(0, 1, num=half_left),\n                np.linspace(1, 0, num=half_right),\n            ]\n        )\n        frames = frames[repeated_seq]\n        img_ls = []\n        for alpha, frame in zip(alpha_ls, frames):\n            draw_img = self.draw_one_frame(\n                frame,\n                preds,\n                bboxes,\n                alpha=alpha,\n                text_alpha=text_alpha,\n                ground_truth=ground_truth,\n            )\n            if adjusted:\n                draw_img = draw_img.astype(\"float32\") / 255\n\n            img_ls.append(draw_img)\n\n        return img_ls\n\n    def 
_adjust_frames_type(self, frames):\n        \"\"\"\n        Modify video data to have dtype of uint8 and values range in [0, 255].\n        Args:\n            frames (array-like): 4D array of shape (T, H, W, C).\n        Returns:\n            frames (np.ndarray): uint8 frames with values in range [0, 255].\n            adjusted (bool): whether the original frames were floats and had to be rescaled.\n        \"\"\"\n        assert frames is not None and len(frames) != 0, (\n            \"Frames do not contain any values\"\n        )\n        frames = np.array(frames)\n        assert frames.ndim == 4, \"Frames must have 4 dimensions\"\n        adjusted = False\n        if frames.dtype in [np.float32, np.float64]:\n            frames *= 255\n            frames = frames.astype(np.uint8)\n            adjusted = True\n\n        return frames, adjusted\n\n    def _get_thres_array(self, common_class_names=None):\n        \"\"\"\n        Compute a thresholds array for all classes based on `self.thres` and `self.lower_thres`,\n        and store it in `self.thres` as a tensor.\n        Args:\n            common_class_names (Optional[list of strs]): a list of common class names.\n        \"\"\"\n        common_class_ids = []\n        if common_class_names is not None:\n            common_classes = set(common_class_names)\n\n            for i, name in enumerate(self.class_names):\n                if name in common_classes:\n                    common_class_ids.append(i)\n        else:\n            common_class_ids = list(range(self.num_classes))\n\n        thres_array = np.full(shape=(self.num_classes,), fill_value=self.lower_thres)\n        thres_array[common_class_ids] = self.thres\n        self.thres = torch.from_numpy(thres_array)\n"
  },
  {
    "path": "tools/benchmark.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved\n\"\"\"\nA script to benchmark data loading.\n\"\"\"\n\nimport slowfast.utils.logging as logging\nfrom slowfast.utils.benchmark import benchmark_data_loading\nfrom slowfast.utils.misc import launch_job\nfrom slowfast.utils.parser import load_config, parse_args\n\nlogger = logging.get_logger(__name__)\n\n\ndef main():\n    args = parse_args()\n    cfg = load_config(args)\n\n    launch_job(cfg=cfg, init_method=args.init_method, func=benchmark_data_loading)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "tools/demo_net.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport time\n\nimport numpy as np\nimport torch\nimport tqdm\nfrom slowfast.utils import logging\nfrom slowfast.visualization.async_predictor import AsyncDemo, AsyncVis\nfrom slowfast.visualization.ava_demo_precomputed_boxes import (\n    AVAVisualizerWithPrecomputedBox,\n)\nfrom slowfast.visualization.demo_loader import ThreadVideoManager, VideoManager\nfrom slowfast.visualization.predictor import ActionPredictor\nfrom slowfast.visualization.video_visualizer import VideoVisualizer\n\nlogger = logging.get_logger(__name__)\n\n\ndef run_demo(cfg, frame_provider):\n    \"\"\"\n    Run demo visualization.\n    Args:\n        cfg (CfgNode): configs. Details can be found in\n            slowfast/config/defaults.py\n        frame_provider (iterator): Python iterator that return task objects that are filled\n            with necessary information such as `frames`, `id` and `num_buffer_frames` for the\n            prediction and visualization pipeline.\n    \"\"\"\n    # Set random seed from configs.\n    np.random.seed(cfg.RNG_SEED)\n    torch.manual_seed(cfg.RNG_SEED)\n    # Setup logging format.\n    logging.setup_logging(cfg.OUTPUT_DIR)\n    # Print config.\n    logger.info(\"Run demo with config:\")\n    logger.info(cfg)\n    common_classes = (\n        cfg.DEMO.COMMON_CLASS_NAMES if len(cfg.DEMO.LABEL_FILE_PATH) != 0 else None\n    )\n\n    video_vis = VideoVisualizer(\n        num_classes=cfg.MODEL.NUM_CLASSES,\n        class_names_path=cfg.DEMO.LABEL_FILE_PATH,\n        top_k=cfg.TENSORBOARD.MODEL_VIS.TOPK_PREDS,\n        thres=cfg.DEMO.COMMON_CLASS_THRES,\n        lower_thres=cfg.DEMO.UNCOMMON_CLASS_THRES,\n        common_class_names=common_classes,\n        colormap=cfg.TENSORBOARD.MODEL_VIS.COLORMAP,\n        mode=cfg.DEMO.VIS_MODE,\n    )\n\n    async_vis = AsyncVis(video_vis, n_workers=cfg.DEMO.NUM_VIS_INSTANCES)\n\n    if cfg.NUM_GPUS <= 1:\n      
  model = ActionPredictor(cfg=cfg, async_vis=async_vis)\n    else:\n        model = AsyncDemo(cfg=cfg, async_vis=async_vis)\n\n    seq_len = cfg.DATA.NUM_FRAMES * cfg.DATA.SAMPLING_RATE\n\n    assert cfg.DEMO.BUFFER_SIZE <= seq_len // 2, (\n        \"Buffer size cannot be greater than half of sequence length.\"\n    )\n    num_task = 0\n    # Start reading frames.\n    frame_provider.start()\n    for able_to_read, task in frame_provider:\n        if not able_to_read:\n            break\n        if task is None:\n            time.sleep(0.02)\n            continue\n        num_task += 1\n\n        model.put(task)\n        try:\n            task = model.get()\n            num_task -= 1\n            yield task\n        except IndexError:\n            continue\n\n    while num_task != 0:\n        try:\n            task = model.get()\n            num_task -= 1\n            yield task\n        except IndexError:\n            continue\n\n\ndef demo(cfg):\n    \"\"\"\n    Run inference on an input video or stream from webcam.\n    Args:\n        cfg (CfgNode): configs. Details can be found in\n            slowfast/config/defaults.py\n    \"\"\"\n    # AVA format-specific visualization with precomputed boxes.\n    if cfg.DETECTION.ENABLE and cfg.DEMO.PREDS_BOXES != \"\":\n        precomputed_box_vis = AVAVisualizerWithPrecomputedBox(cfg)\n        precomputed_box_vis()\n    else:\n        start = time.time()\n        if cfg.DEMO.THREAD_ENABLE:\n            frame_provider = ThreadVideoManager(cfg)\n        else:\n            frame_provider = VideoManager(cfg)\n\n        for task in tqdm.tqdm(run_demo(cfg, frame_provider)):\n            frame_provider.display(task)\n\n        frame_provider.join()\n        frame_provider.clean()\n        logger.info(\"Finish demo in: {}\".format(time.time() - start))\n"
  },
  {
    "path": "tools/run_net.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Wrapper to train and test a video classification model.\"\"\"\n\nfrom slowfast.config.defaults import assert_and_infer_cfg\nfrom slowfast.utils.misc import launch_job\nfrom slowfast.utils.parser import load_config, parse_args\nfrom vision.fair.slowfast.tools.demo_net import demo\nfrom vision.fair.slowfast.tools.test_net import test\nfrom vision.fair.slowfast.tools.train_net import train\nfrom vision.fair.slowfast.tools.visualization import visualize\n\n\ndef main():\n    \"\"\"\n    Main function to spawn the train and test process.\n    \"\"\"\n    args = parse_args()\n    print(\"config files: {}\".format(args.cfg_files))\n    for path_to_config in args.cfg_files:\n        cfg = load_config(args, path_to_config)\n        cfg = assert_and_infer_cfg(cfg)\n\n        # Perform training.\n        if cfg.TRAIN.ENABLE:\n            launch_job(cfg=cfg, init_method=args.init_method, func=train)\n\n        # Perform multi-clip testing.\n        if cfg.TEST.ENABLE:\n            if cfg.TEST.NUM_ENSEMBLE_VIEWS == -1:\n                num_view_list = [1, 3, 5, 7, 10]\n                for num_view in num_view_list:\n                    cfg.TEST.NUM_ENSEMBLE_VIEWS = num_view\n                    launch_job(cfg=cfg, init_method=args.init_method, func=test)\n            else:\n                launch_job(cfg=cfg, init_method=args.init_method, func=test)\n\n        # Perform model visualization.\n        if cfg.TENSORBOARD.ENABLE and (\n            cfg.TENSORBOARD.MODEL_VIS.ENABLE or cfg.TENSORBOARD.WRONG_PRED_VIS.ENABLE\n        ):\n            launch_job(cfg=cfg, init_method=args.init_method, func=visualize)\n\n        # Run demo.\n        if cfg.DEMO.ENABLE:\n            demo(cfg)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "tools/test_net.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Multi-view test a video classification model.\"\"\"\n\nimport os\nimport pickle\n\nimport numpy as np\nimport slowfast.utils.checkpoint as cu\nimport slowfast.utils.distributed as du\nimport slowfast.utils.logging as logging\nimport slowfast.utils.misc as misc\nimport slowfast.visualization.tensorboard_vis as tb\nimport torch\nfrom slowfast.datasets import loader\nfrom slowfast.models import build_model\nfrom slowfast.utils.env import pathmgr\nfrom slowfast.utils.meters import AVAMeter, TestMeter\n\nlogger = logging.get_logger(__name__)\n\n\n@torch.no_grad()\ndef perform_test(test_loader, model, test_meter, cfg, writer=None):\n    \"\"\"\n    For classification:\n    Perform mutli-view testing that uniformly samples N clips from a video along\n    its temporal axis. For each clip, it takes 3 crops to cover the spatial\n    dimension, followed by averaging the softmax scores across all Nx3 views to\n    form a video-level prediction. All video predictions are compared to\n    ground-truth labels and the final testing performance is logged.\n    For detection:\n    Perform fully-convolutional testing on the full frames without crop.\n    Args:\n        test_loader (loader): video testing loader.\n        model (model): the pretrained video model to test.\n        test_meter (TestMeter): testing meters to log and ensemble the testing\n            results.\n        cfg (CfgNode): configs. 
Details can be found in\n            slowfast/config/defaults.py\n        writer (TensorboardWriter object, optional): TensorboardWriter object\n            to write Tensorboard logs.\n    \"\"\"\n    # Enable eval mode.\n    model.eval()\n    test_meter.iter_tic()\n\n    for cur_iter, (inputs, labels, video_idx, time, meta) in enumerate(test_loader):\n        if cfg.NUM_GPUS:\n            # Transfer the data to the current GPU device.\n            if isinstance(inputs, (list,)):\n                for i in range(len(inputs)):\n                    inputs[i] = inputs[i].cuda(non_blocking=True)\n            else:\n                inputs = inputs.cuda(non_blocking=True)\n            # Transfer the labels, video indices and metadata as well.\n            labels = labels.cuda()\n            video_idx = video_idx.cuda()\n            for key, val in meta.items():\n                if isinstance(val, (list,)):\n                    for i in range(len(val)):\n                        val[i] = val[i].cuda(non_blocking=True)\n                else:\n                    meta[key] = val.cuda(non_blocking=True)\n        test_meter.data_toc()\n\n        if cfg.DETECTION.ENABLE:\n            # Compute the predictions.\n            preds = model(inputs, meta[\"boxes\"])\n            ori_boxes = meta[\"ori_boxes\"]\n            metadata = meta[\"metadata\"]\n\n            preds = preds.detach().cpu() if cfg.NUM_GPUS else preds.detach()\n            ori_boxes = ori_boxes.detach().cpu() if cfg.NUM_GPUS else ori_boxes.detach()\n            metadata = metadata.detach().cpu() if cfg.NUM_GPUS else metadata.detach()\n\n            if cfg.NUM_GPUS > 1:\n                preds = torch.cat(du.all_gather_unaligned(preds), dim=0)\n                ori_boxes = torch.cat(du.all_gather_unaligned(ori_boxes), dim=0)\n                metadata = torch.cat(du.all_gather_unaligned(metadata), dim=0)\n\n            test_meter.iter_toc()\n            # Update and log stats.\n            test_meter.update_stats(preds, 
ori_boxes, metadata)\n            test_meter.log_iter_stats(None, cur_iter)\n        elif cfg.TASK == \"ssl\" and cfg.MODEL.MODEL_NAME == \"ContrastiveModel\":\n            if not cfg.CONTRASTIVE.KNN_ON:\n                test_meter.finalize_metrics()\n                return test_meter\n            # preds = model(inputs, video_idx, time)\n            train_labels = (\n                model.module.train_labels\n                if hasattr(model, \"module\")\n                else model.train_labels\n            )\n            yd, yi = model(inputs, video_idx, time)\n            batchSize = yi.shape[0]\n            K = yi.shape[1]\n            C = cfg.CONTRASTIVE.NUM_CLASSES_DOWNSTREAM  # eg 400 for Kinetics400\n            candidates = train_labels.view(1, -1).expand(batchSize, -1)\n            retrieval = torch.gather(candidates, 1, yi)\n            retrieval_one_hot = torch.zeros((batchSize * K, C)).cuda()\n            retrieval_one_hot.scatter_(1, retrieval.view(-1, 1), 1)\n            yd_transform = yd.clone().div_(cfg.CONTRASTIVE.T).exp_()\n            probs = torch.mul(\n                retrieval_one_hot.view(batchSize, -1, C),\n                yd_transform.view(batchSize, -1, 1),\n            )\n            preds = torch.sum(probs, 1)\n        else:\n            # Perform the forward pass.\n            preds = model(inputs)\n        # Gather all the predictions across all the devices to perform ensemble.\n        if cfg.NUM_GPUS > 1:\n            preds, labels, video_idx = du.all_gather([preds, labels, video_idx])\n        if cfg.NUM_GPUS:\n            preds = preds.cpu()\n            labels = labels.cpu()\n            video_idx = video_idx.cpu()\n\n        test_meter.iter_toc()\n\n        if not cfg.VIS_MASK.ENABLE:\n            # Update and log stats.\n            test_meter.update_stats(preds.detach(), labels.detach(), video_idx.detach())\n        test_meter.log_iter_stats(cur_iter)\n\n        test_meter.iter_tic()\n\n    # Log epoch stats and print the 
final testing results.\n    if not cfg.DETECTION.ENABLE:\n        all_preds = test_meter.video_preds.clone().detach()\n        all_labels = test_meter.video_labels\n        if cfg.NUM_GPUS:\n            all_preds = all_preds.cpu()\n            all_labels = all_labels.cpu()\n        if writer is not None:\n            writer.plot_eval(preds=all_preds, labels=all_labels)\n\n        if cfg.TEST.SAVE_RESULTS_PATH != \"\":\n            save_path = os.path.join(cfg.OUTPUT_DIR, cfg.TEST.SAVE_RESULTS_PATH)\n\n            if du.is_root_proc():\n                with pathmgr.open(save_path, \"wb\") as f:\n                    pickle.dump([all_preds, all_labels], f)\n\n            logger.info(\"Successfully saved prediction results to {}\".format(save_path))\n\n    test_meter.finalize_metrics()\n    return test_meter\n\n\ndef test(cfg):\n    \"\"\"\n    Perform multi-view testing on the pretrained video model.\n    Args:\n        cfg (CfgNode): configs. Details can be found in\n            slowfast/config/defaults.py\n    \"\"\"\n    # Set up environment.\n    du.init_distributed_training(cfg)\n    # Set random seed from configs.\n    np.random.seed(cfg.RNG_SEED)\n    torch.manual_seed(cfg.RNG_SEED)\n\n    # Setup logging format.\n    logging.setup_logging(cfg.OUTPUT_DIR)\n\n    if len(cfg.TEST.NUM_TEMPORAL_CLIPS) == 0:\n        cfg.TEST.NUM_TEMPORAL_CLIPS = [cfg.TEST.NUM_ENSEMBLE_VIEWS]\n\n    test_meters = []\n    for num_view in cfg.TEST.NUM_TEMPORAL_CLIPS:\n        cfg.TEST.NUM_ENSEMBLE_VIEWS = num_view\n\n        # Print config.\n        logger.info(\"Test with config:\")\n        logger.info(cfg)\n\n        # Build the video model and print model statistics.\n        model = build_model(cfg)\n        flops, params = 0.0, 0.0\n        if du.is_master_proc() and cfg.LOG_MODEL_INFO:\n            model.eval()\n            flops, params = misc.log_model_info(model, cfg, use_train_input=False)\n\n        if du.is_master_proc() and cfg.LOG_MODEL_INFO:\n            
misc.log_model_info(model, cfg, use_train_input=False)\n        if (\n            cfg.TASK == \"ssl\"\n            and cfg.MODEL.MODEL_NAME == \"ContrastiveModel\"\n            and cfg.CONTRASTIVE.KNN_ON\n        ):\n            train_loader = loader.construct_loader(cfg, \"train\")\n            if hasattr(model, \"module\"):\n                model.module.init_knn_labels(train_loader)\n            else:\n                model.init_knn_labels(train_loader)\n\n        cu.load_test_checkpoint(cfg, model)\n\n        # Create video testing loaders.\n        test_loader = loader.construct_loader(cfg, \"test\")\n        logger.info(\"Testing model for {} iterations\".format(len(test_loader)))\n\n        if cfg.DETECTION.ENABLE:\n            assert cfg.NUM_GPUS == cfg.TEST.BATCH_SIZE or cfg.NUM_GPUS == 0\n            test_meter = AVAMeter(len(test_loader), cfg, mode=\"test\")\n        else:\n            assert (\n                test_loader.dataset.num_videos\n                % (cfg.TEST.NUM_ENSEMBLE_VIEWS * cfg.TEST.NUM_SPATIAL_CROPS)\n                == 0\n            )\n            # Create meters for multi-view testing.\n            test_meter = TestMeter(\n                test_loader.dataset.num_videos\n                // (cfg.TEST.NUM_ENSEMBLE_VIEWS * cfg.TEST.NUM_SPATIAL_CROPS),\n                cfg.TEST.NUM_ENSEMBLE_VIEWS * cfg.TEST.NUM_SPATIAL_CROPS,\n                (\n                    cfg.MODEL.NUM_CLASSES\n                    if not cfg.TASK == \"ssl\"\n                    else cfg.CONTRASTIVE.NUM_CLASSES_DOWNSTREAM\n                ),\n                len(test_loader),\n                cfg.DATA.MULTI_LABEL,\n                cfg.DATA.ENSEMBLE_METHOD,\n            )\n\n        # Set up writer for logging to Tensorboard format.\n        if cfg.TENSORBOARD.ENABLE and du.is_master_proc(cfg.NUM_GPUS * cfg.NUM_SHARDS):\n            writer = tb.TensorboardWriter(cfg)\n        else:\n            writer = None\n\n        # # Perform multi-view test on the entire 
dataset.\n        test_meter = perform_test(test_loader, model, test_meter, cfg, writer)\n        test_meters.append(test_meter)\n        if writer is not None:\n            writer.close()\n\n    result_string_views = \"_p{:.2f}_f{:.2f}\".format(params / 1e6, flops)\n\n    for view, test_meter in zip(cfg.TEST.NUM_TEMPORAL_CLIPS, test_meters):\n        logger.info(\n            \"Finalized testing with {} temporal clips and {} spatial crops\".format(\n                view, cfg.TEST.NUM_SPATIAL_CROPS\n            )\n        )\n        result_string_views += \"_{}a{}\".format(view, test_meter.stats[\"top1_acc\"])\n\n        result_string = (\n            \"_p{:.2f}_f{:.2f}_{}a{} Top5 Acc: {} MEM: {:.2f} f: {:.4f}\".format(\n                params / 1e6,\n                flops,\n                view,\n                test_meter.stats[\"top1_acc\"],\n                test_meter.stats[\"top5_acc\"],\n                misc.gpu_mem_usage(),\n                flops,\n            )\n        )\n\n        logger.info(\"{}\".format(result_string))\n    logger.info(\"{}\".format(result_string_views))\n    return result_string + \" \\n \" + result_string_views\n"
  },
  {
    "path": "tools/train_net.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\n\"\"\"Train a video classification model.\"\"\"\n\nimport math\nimport pprint\n\nimport numpy as np\nimport slowfast.models.losses as losses\nimport slowfast.models.optimizer as optim\nimport slowfast.utils.checkpoint as cu\nimport slowfast.utils.distributed as du\nimport slowfast.utils.logging as logging\nimport slowfast.utils.metrics as metrics\nimport slowfast.utils.misc as misc\nimport slowfast.visualization.tensorboard_vis as tb\nimport torch\nfrom fvcore.nn.precise_bn import get_bn_modules, update_bn_stats\nfrom slowfast.datasets import loader\nfrom slowfast.datasets.mixup import MixUp\nfrom slowfast.models import build_model\nfrom slowfast.models.contrastive import (\n    contrastive_forward,\n    contrastive_parameter_surgery,\n)\nfrom slowfast.utils.meters import AVAMeter, EpochTimer, TrainMeter, ValMeter\nfrom slowfast.utils.multigrid import MultigridSchedule\n\nlogger = logging.get_logger(__name__)\n\n\ndef train_epoch(\n    train_loader,\n    model,\n    optimizer,\n    scaler,\n    train_meter,\n    cur_epoch,\n    cfg,\n    writer=None,\n):\n    \"\"\"\n    Perform the video training for one epoch.\n    Args:\n        train_loader (loader): video training loader.\n        model (model): the video model to train.\n        optimizer (optim): the optimizer to perform optimization on the model's\n            parameters.\n        train_meter (TrainMeter): training meters to log the training performance.\n        cur_epoch (int): current epoch of training.\n        cfg (CfgNode): configs. 
Details can be found in\n            slowfast/config/defaults.py\n        writer (TensorboardWriter, optional): TensorboardWriter object\n            to writer Tensorboard log.\n    \"\"\"\n    # Enable train mode.\n    model.train()\n    train_meter.iter_tic()\n    data_size = len(train_loader)\n\n    if cfg.MIXUP.ENABLE:\n        mixup_fn = MixUp(\n            mixup_alpha=cfg.MIXUP.ALPHA,\n            cutmix_alpha=cfg.MIXUP.CUTMIX_ALPHA,\n            mix_prob=cfg.MIXUP.PROB,\n            switch_prob=cfg.MIXUP.SWITCH_PROB,\n            label_smoothing=cfg.MIXUP.LABEL_SMOOTH_VALUE,\n            num_classes=cfg.MODEL.NUM_CLASSES,\n        )\n\n    if cfg.MODEL.FROZEN_BN:\n        misc.frozen_bn_stats(model)\n    # Explicitly declare reduction to mean.\n    loss_fun = losses.get_loss_func(cfg.MODEL.LOSS_FUNC)(reduction=\"mean\")\n\n    for cur_iter, (inputs, labels, index, time, meta) in enumerate(train_loader):\n        # Transfer the data to the current GPU device.\n        if cfg.NUM_GPUS:\n            if isinstance(inputs, (list,)):\n                for i in range(len(inputs)):\n                    if isinstance(inputs[i], (list,)):\n                        for j in range(len(inputs[i])):\n                            inputs[i][j] = inputs[i][j].cuda(non_blocking=True)\n                    else:\n                        inputs[i] = inputs[i].cuda(non_blocking=True)\n            else:\n                inputs = inputs.cuda(non_blocking=True)\n            if not isinstance(labels, list):\n                labels = labels.cuda(non_blocking=True)\n                index = index.cuda(non_blocking=True)\n                time = time.cuda(non_blocking=True)\n            for key, val in meta.items():\n                if isinstance(val, (list,)):\n                    for i in range(len(val)):\n                        val[i] = val[i].cuda(non_blocking=True)\n                else:\n                    meta[key] = val.cuda(non_blocking=True)\n\n        batch_size = (\n            
inputs[0][0].size(0) if isinstance(inputs[0], list) else inputs[0].size(0)\n        )\n        # Update the learning rate.\n        epoch_exact = cur_epoch + float(cur_iter) / data_size\n        lr = optim.get_epoch_lr(epoch_exact, cfg)\n        optim.set_lr(optimizer, lr)\n\n        train_meter.data_toc()\n        if cfg.MIXUP.ENABLE:\n            samples, labels = mixup_fn(inputs[0], labels)\n            inputs[0] = samples\n\n        with torch.cuda.amp.autocast(enabled=cfg.TRAIN.MIXED_PRECISION):\n            # Explicitly declare reduction to mean.\n            perform_backward = True\n            optimizer.zero_grad()\n\n            if cfg.MODEL.MODEL_NAME == \"ContrastiveModel\":\n                (\n                    model,\n                    preds,\n                    partial_loss,\n                    perform_backward,\n                ) = contrastive_forward(\n                    model, cfg, inputs, index, time, epoch_exact, scaler\n                )\n            elif cfg.DETECTION.ENABLE:\n                # Compute the predictions.\n                preds = model(inputs, meta[\"boxes\"])\n            elif cfg.MASK.ENABLE:\n                preds, labels = model(inputs)\n            else:\n                preds = model(inputs)\n            if cfg.TASK == \"ssl\" and cfg.MODEL.MODEL_NAME == \"ContrastiveModel\":\n                labels = torch.zeros(\n                    preds.size(0), dtype=labels.dtype, device=labels.device\n                )\n\n            if cfg.MODEL.MODEL_NAME == \"ContrastiveModel\" and partial_loss:\n                loss = partial_loss\n            else:\n                # Compute the loss.\n                loss = loss_fun(preds, labels)\n\n        loss_extra = None\n        if isinstance(loss, (list, tuple)):\n            loss, loss_extra = loss\n\n        # check Nan Loss.\n        misc.check_nan_losses(loss)\n        if perform_backward:\n            scaler.scale(loss).backward()\n        # Unscales the gradients of 
optimizer's assigned params in-place\n        scaler.unscale_(optimizer)\n        # Clip gradients if necessary\n        if cfg.SOLVER.CLIP_GRAD_VAL:\n            grad_norm = torch.nn.utils.clip_grad_value_(\n                model.parameters(), cfg.SOLVER.CLIP_GRAD_VAL\n            )\n        elif cfg.SOLVER.CLIP_GRAD_L2NORM:\n            grad_norm = torch.nn.utils.clip_grad_norm_(\n                model.parameters(), cfg.SOLVER.CLIP_GRAD_L2NORM\n            )\n        else:\n            grad_norm = optim.get_grad_norm_(model.parameters())\n        # Update the parameters. (defaults to True)\n        model, update_param = contrastive_parameter_surgery(\n            model, cfg, epoch_exact, cur_iter\n        )\n        if update_param:\n            scaler.step(optimizer)\n        scaler.update()\n\n        if cfg.MIXUP.ENABLE:\n            _top_max_k_vals, top_max_k_inds = torch.topk(\n                labels, 2, dim=1, largest=True, sorted=True\n            )\n            idx_top1 = torch.arange(labels.shape[0]), top_max_k_inds[:, 0]\n            idx_top2 = torch.arange(labels.shape[0]), top_max_k_inds[:, 1]\n            preds = preds.detach()\n            preds[idx_top1] += preds[idx_top2]\n            preds[idx_top2] = 0.0\n            labels = top_max_k_inds[:, 0]\n\n        if cfg.DETECTION.ENABLE:\n            if cfg.NUM_GPUS > 1:\n                loss = du.all_reduce([loss])[0]\n            loss = loss.item()\n\n            # Update and log stats.\n            train_meter.update_stats(None, None, None, loss, lr)\n            # write to tensorboard format if available.\n            if writer is not None:\n                writer.add_scalars(\n                    {\"Train/loss\": loss, \"Train/lr\": lr},\n                    global_step=data_size * cur_epoch + cur_iter,\n                )\n\n        else:\n            top1_err, top5_err = None, None\n            if cfg.DATA.MULTI_LABEL:\n                # Gather all the predictions across all the devices.\n       
         if cfg.NUM_GPUS > 1:\n                    loss, grad_norm = du.all_reduce([loss, grad_norm])\n                loss, grad_norm = (\n                    loss.item(),\n                    grad_norm.item(),\n                )\n            elif cfg.MASK.ENABLE:\n                # Gather all the predictions across all the devices.\n                if cfg.NUM_GPUS > 1:\n                    loss, grad_norm = du.all_reduce([loss, grad_norm])\n                    if loss_extra:\n                        loss_extra = du.all_reduce(loss_extra)\n                loss, grad_norm, top1_err, top5_err = (\n                    loss.item(),\n                    grad_norm.item(),\n                    0.0,\n                    0.0,\n                )\n                if loss_extra:\n                    loss_extra = [one_loss.item() for one_loss in loss_extra]\n            else:\n                # Compute the errors.\n                num_topks_correct = metrics.topks_correct(preds, labels, (1, 5))\n                top1_err, top5_err = [\n                    (1.0 - x / preds.size(0)) * 100.0 for x in num_topks_correct\n                ]\n                # Gather all the predictions across all the devices.\n                if cfg.NUM_GPUS > 1:\n                    loss, grad_norm, top1_err, top5_err = du.all_reduce(\n                        [loss.detach(), grad_norm, top1_err, top5_err]\n                    )\n\n                # Copy the stats from GPU to CPU (sync point).\n                loss, grad_norm, top1_err, top5_err = (\n                    loss.item(),\n                    grad_norm.item(),\n                    top1_err.item(),\n                    top5_err.item(),\n                )\n\n            # Update and log stats.\n            train_meter.update_stats(\n                top1_err,\n                top5_err,\n                loss,\n                lr,\n                grad_norm,\n                batch_size\n                * max(\n                    cfg.NUM_GPUS, 
1\n                ),  # If running on CPU (cfg.NUM_GPUS == 0), use 1 to represent 1 CPU.\n                loss_extra,\n            )\n            # write to tensorboard format if available.\n            if writer is not None:\n                writer.add_scalars(\n                    {\n                        \"Train/loss\": loss,\n                        \"Train/lr\": lr,\n                        \"Train/Top1_err\": top1_err,\n                        \"Train/Top5_err\": top5_err,\n                    },\n                    global_step=data_size * cur_epoch + cur_iter,\n                )\n        train_meter.iter_toc()  # do measure allreduce for this meter\n        train_meter.log_iter_stats(cur_epoch, cur_iter)\n        torch.cuda.synchronize()\n        train_meter.iter_tic()\n    del inputs\n\n    # in case of fragmented memory\n    torch.cuda.empty_cache()\n\n    # Log epoch stats.\n    train_meter.log_epoch_stats(cur_epoch)\n    train_meter.reset()\n\n\n@torch.no_grad()\ndef eval_epoch(val_loader, model, val_meter, cur_epoch, cfg, train_loader, writer):\n    \"\"\"\n    Evaluate the model on the val set.\n    Args:\n        val_loader (loader): data loader to provide validation data.\n        model (model): model to evaluate the performance.\n        val_meter (ValMeter): meter instance to record and calculate the metrics.\n        cur_epoch (int): number of the current epoch of training.\n        cfg (CfgNode): configs. Details can be found in\n            slowfast/config/defaults.py\n        train_loader (loader): data loader to provide training data.\n        writer (TensorboardWriter, optional): TensorboardWriter object\n            to write Tensorboard logs.\n    \"\"\"\n\n    # Evaluation mode enabled. 
The running stats would not be updated.\n    model.eval()\n    val_meter.iter_tic()\n\n    for cur_iter, (inputs, labels, index, time, meta) in enumerate(val_loader):\n        if cfg.NUM_GPUS:\n            # Transferthe data to the current GPU device.\n            if isinstance(inputs, (list,)):\n                for i in range(len(inputs)):\n                    inputs[i] = inputs[i].cuda(non_blocking=True)\n            else:\n                inputs = inputs.cuda(non_blocking=True)\n            labels = labels.cuda()\n            for key, val in meta.items():\n                if isinstance(val, (list,)):\n                    for i in range(len(val)):\n                        val[i] = val[i].cuda(non_blocking=True)\n                else:\n                    meta[key] = val.cuda(non_blocking=True)\n            index = index.cuda()\n            time = time.cuda()\n        batch_size = (\n            inputs[0][0].size(0) if isinstance(inputs[0], list) else inputs[0].size(0)\n        )\n        val_meter.data_toc()\n\n        if cfg.DETECTION.ENABLE:\n            # Compute the predictions.\n            preds = model(inputs, meta[\"boxes\"])\n            ori_boxes = meta[\"ori_boxes\"]\n            metadata = meta[\"metadata\"]\n\n            if cfg.NUM_GPUS:\n                preds = preds.cpu()\n                ori_boxes = ori_boxes.cpu()\n                metadata = metadata.cpu()\n\n            if cfg.NUM_GPUS > 1:\n                preds = torch.cat(du.all_gather_unaligned(preds), dim=0)\n                ori_boxes = torch.cat(du.all_gather_unaligned(ori_boxes), dim=0)\n                metadata = torch.cat(du.all_gather_unaligned(metadata), dim=0)\n\n            val_meter.iter_toc()\n            # Update and log stats.\n            val_meter.update_stats(preds, ori_boxes, metadata)\n\n        else:\n            if cfg.TASK == \"ssl\" and cfg.MODEL.MODEL_NAME == \"ContrastiveModel\":\n                if not cfg.CONTRASTIVE.KNN_ON:\n                    return\n            
    train_labels = (\n                    model.module.train_labels\n                    if hasattr(model, \"module\")\n                    else model.train_labels\n                )\n                yd, yi = model(inputs, index, time)\n                K = yi.shape[1]\n                C = cfg.CONTRASTIVE.NUM_CLASSES_DOWNSTREAM  # eg 400 for Kinetics400\n                candidates = train_labels.view(1, -1).expand(batch_size, -1)\n                retrieval = torch.gather(candidates, 1, yi)\n                retrieval_one_hot = torch.zeros((batch_size * K, C)).cuda()\n                retrieval_one_hot.scatter_(1, retrieval.view(-1, 1), 1)\n                yd_transform = yd.clone().div_(cfg.CONTRASTIVE.T).exp_()\n                probs = torch.mul(\n                    retrieval_one_hot.view(batch_size, -1, C),\n                    yd_transform.view(batch_size, -1, 1),\n                )\n                preds = torch.sum(probs, 1)\n            else:\n                preds = model(inputs)\n\n            if cfg.DATA.MULTI_LABEL:\n                if cfg.NUM_GPUS > 1:\n                    preds, labels = du.all_gather([preds, labels])\n            else:\n                if cfg.DATA.IN22k_VAL_IN1K != \"\":\n                    preds = preds[:, :1000]\n                # Compute the errors.\n                num_topks_correct = metrics.topks_correct(preds, labels, (1, 5))\n\n                # Combine the errors across the GPUs.\n                top1_err, top5_err = [\n                    (1.0 - x / preds.size(0)) * 100.0 for x in num_topks_correct\n                ]\n                if cfg.NUM_GPUS > 1:\n                    top1_err, top5_err = du.all_reduce([top1_err, top5_err])\n\n                # Copy the errors from GPU to CPU (sync point).\n                top1_err, top5_err = top1_err.item(), top5_err.item()\n\n                val_meter.iter_toc()\n                # Update and log stats.\n                val_meter.update_stats(\n                    top1_err,\n           
         top5_err,\n                    batch_size\n                    * max(\n                        cfg.NUM_GPUS, 1\n                    ),  # If running on CPU (cfg.NUM_GPUS == 0), use 1 to represent 1 CPU.\n                )\n                # Write to Tensorboard format if available.\n                if writer is not None:\n                    writer.add_scalars(\n                        {\"Val/Top1_err\": top1_err, \"Val/Top5_err\": top5_err},\n                        global_step=len(val_loader) * cur_epoch + cur_iter,\n                    )\n\n            val_meter.update_predictions(preds, labels)\n\n        val_meter.log_iter_stats(cur_epoch, cur_iter)\n        val_meter.iter_tic()\n\n    # Log epoch stats.\n    val_meter.log_epoch_stats(cur_epoch)\n    # Write to Tensorboard format if available.\n    if writer is not None:\n        if cfg.DETECTION.ENABLE:\n            writer.add_scalars({\"Val/mAP\": val_meter.full_map}, global_step=cur_epoch)\n        else:\n            all_preds = [pred.clone().detach() for pred in val_meter.all_preds]\n            all_labels = [label.clone().detach() for label in val_meter.all_labels]\n            if cfg.NUM_GPUS:\n                all_preds = [pred.cpu() for pred in all_preds]\n                all_labels = [label.cpu() for label in all_labels]\n            writer.plot_eval(preds=all_preds, labels=all_labels, global_step=cur_epoch)\n\n    val_meter.reset()\n\n\ndef calculate_and_update_precise_bn(loader, model, num_iters=200, use_gpu=True):\n    \"\"\"\n    Update the stats in bn layers by calculating the precise stats.\n    Args:\n        loader (loader): data loader to provide training data.\n        model (model): model to update the bn stats.\n        num_iters (int): number of iterations to compute and update the bn stats.\n        use_gpu (bool): whether to use GPU or not.\n    \"\"\"\n\n    def _gen_loader():\n        for inputs, *_ in loader:\n            if use_gpu:\n                if isinstance(inputs, 
(list,)):\n                    for i in range(len(inputs)):\n                        inputs[i] = inputs[i].cuda(non_blocking=True)\n                else:\n                    inputs = inputs.cuda(non_blocking=True)\n            yield inputs\n\n    # Update the bn stats.\n    update_bn_stats(model, _gen_loader(), num_iters)\n\n\ndef build_trainer(cfg):\n    \"\"\"\n    Build training model and its associated tools, including optimizer,\n    dataloaders and meters.\n    Args:\n        cfg (CfgNode): configs. Details can be found in\n            slowfast/config/defaults.py\n    Returns:\n        model (nn.Module): training model.\n        optimizer (Optimizer): optimizer.\n        train_loader (DataLoader): training data loader.\n        val_loader (DataLoader): validation data loader.\n        precise_bn_loader (DataLoader): training data loader for computing\n            precise BN.\n        train_meter (TrainMeter): tool for measuring training stats.\n        val_meter (ValMeter): tool for measuring validation stats.\n    \"\"\"\n    # Build the video model and print model statistics.\n    model = build_model(cfg)\n    if du.is_master_proc() and cfg.LOG_MODEL_INFO:\n        flops, params = misc.log_model_info(model, cfg, use_train_input=True)\n\n    # Construct the optimizer.\n    optimizer = optim.construct_optimizer(model, cfg)\n\n    # Create the video train and val loaders.\n    train_loader = loader.construct_loader(cfg, \"train\")\n    val_loader = loader.construct_loader(cfg, \"val\")\n    precise_bn_loader = loader.construct_loader(cfg, \"train\", is_precise_bn=True)\n    # Create meters.\n    train_meter = TrainMeter(len(train_loader), cfg)\n    val_meter = ValMeter(len(val_loader), cfg)\n\n    return (\n        model,\n        optimizer,\n        train_loader,\n        val_loader,\n        precise_bn_loader,\n        train_meter,\n        val_meter,\n    )\n\n\ndef train(cfg):\n    \"\"\"\n    Train a video model for many epochs on train set and evaluate 
it on val set.\n    Args:\n        cfg (CfgNode): configs. Details can be found in\n            slowfast/config/defaults.py\n    \"\"\"\n    # Set up environment.\n    du.init_distributed_training(cfg)\n    # Set random seed from configs.\n    np.random.seed(cfg.RNG_SEED)\n    torch.manual_seed(cfg.RNG_SEED)\n\n    # Setup logging format.\n    logging.setup_logging(cfg.OUTPUT_DIR)\n\n    # Init multigrid.\n    multigrid = None\n    if cfg.MULTIGRID.LONG_CYCLE or cfg.MULTIGRID.SHORT_CYCLE:\n        multigrid = MultigridSchedule()\n        cfg = multigrid.init_multigrid(cfg)\n        if cfg.MULTIGRID.LONG_CYCLE:\n            cfg, _ = multigrid.update_long_cycle(cfg, cur_epoch=0)\n    # Print config.\n    logger.info(\"Train with config:\")\n    logger.info(pprint.pformat(cfg))\n\n    # Build the video model and print model statistics.\n    model = build_model(cfg)\n    flops, params = 0.0, 0.0\n    if du.is_master_proc() and cfg.LOG_MODEL_INFO:\n        flops, params = misc.log_model_info(model, cfg, use_train_input=True)\n\n    # Construct the optimizer.\n    optimizer = optim.construct_optimizer(model, cfg)\n    # Create a GradScaler for mixed precision training\n    scaler = torch.cuda.amp.GradScaler(enabled=cfg.TRAIN.MIXED_PRECISION)\n\n    # Load a checkpoint to resume training if applicable.\n    if cfg.TRAIN.AUTO_RESUME and cu.has_checkpoint(cfg.OUTPUT_DIR):\n        logger.info(\"Load from last checkpoint.\")\n        last_checkpoint = cu.get_last_checkpoint(cfg.OUTPUT_DIR, task=cfg.TASK)\n        if last_checkpoint is not None:\n            checkpoint_epoch = cu.load_checkpoint(\n                last_checkpoint,\n                model,\n                cfg.NUM_GPUS > 1,\n                optimizer,\n                scaler if cfg.TRAIN.MIXED_PRECISION else None,\n            )\n            start_epoch = checkpoint_epoch + 1\n        elif \"ssl_eval\" in cfg.TASK:\n            last_checkpoint = cu.get_last_checkpoint(cfg.OUTPUT_DIR, task=\"ssl\")\n            
checkpoint_epoch = cu.load_checkpoint(\n                last_checkpoint,\n                model,\n                cfg.NUM_GPUS > 1,\n                optimizer,\n                scaler if cfg.TRAIN.MIXED_PRECISION else None,\n                epoch_reset=True,\n                clear_name_pattern=cfg.TRAIN.CHECKPOINT_CLEAR_NAME_PATTERN,\n            )\n            start_epoch = checkpoint_epoch + 1\n        else:\n            start_epoch = 0\n    elif cfg.TRAIN.CHECKPOINT_FILE_PATH != \"\":\n        logger.info(\"Load from given checkpoint file.\")\n        checkpoint_epoch = cu.load_checkpoint(\n            cfg.TRAIN.CHECKPOINT_FILE_PATH,\n            model,\n            cfg.NUM_GPUS > 1,\n            optimizer,\n            scaler if cfg.TRAIN.MIXED_PRECISION else None,\n            inflation=cfg.TRAIN.CHECKPOINT_INFLATE,\n            convert_from_caffe2=cfg.TRAIN.CHECKPOINT_TYPE == \"caffe2\",\n            epoch_reset=cfg.TRAIN.CHECKPOINT_EPOCH_RESET,\n            clear_name_pattern=cfg.TRAIN.CHECKPOINT_CLEAR_NAME_PATTERN,\n            image_init=cfg.TRAIN.CHECKPOINT_IN_INIT,\n        )\n        start_epoch = checkpoint_epoch + 1\n    else:\n        start_epoch = 0\n\n    # Create the video train and val loaders.\n    train_loader = loader.construct_loader(cfg, \"train\")\n    val_loader = loader.construct_loader(cfg, \"val\")\n    precise_bn_loader = (\n        loader.construct_loader(cfg, \"train\", is_precise_bn=True)\n        if cfg.BN.USE_PRECISE_STATS\n        else None\n    )\n\n    if (\n        cfg.TASK == \"ssl\"\n        and cfg.MODEL.MODEL_NAME == \"ContrastiveModel\"\n        and cfg.CONTRASTIVE.KNN_ON\n    ):\n        if hasattr(model, \"module\"):\n            model.module.init_knn_labels(train_loader)\n        else:\n            model.init_knn_labels(train_loader)\n\n    # Create meters.\n    if cfg.DETECTION.ENABLE:\n        train_meter = AVAMeter(len(train_loader), cfg, mode=\"train\")\n        val_meter = AVAMeter(len(val_loader), cfg, 
mode=\"val\")\n    else:\n        train_meter = TrainMeter(len(train_loader), cfg)\n        val_meter = ValMeter(len(val_loader), cfg)\n\n    # set up writer for logging to Tensorboard format.\n    if cfg.TENSORBOARD.ENABLE and du.is_master_proc(cfg.NUM_GPUS * cfg.NUM_SHARDS):\n        writer = tb.TensorboardWriter(cfg)\n    else:\n        writer = None\n\n    # Perform the training loop.\n    logger.info(\"Start epoch: {}\".format(start_epoch + 1))\n\n    epoch_timer = EpochTimer()\n    for cur_epoch in range(start_epoch, cfg.SOLVER.MAX_EPOCH):\n        if cur_epoch > 0 and cfg.DATA.LOADER_CHUNK_SIZE > 0:\n            num_chunks = math.ceil(\n                cfg.DATA.LOADER_CHUNK_OVERALL_SIZE / cfg.DATA.LOADER_CHUNK_SIZE\n            )\n            skip_rows = (cur_epoch) % num_chunks * cfg.DATA.LOADER_CHUNK_SIZE\n            logger.info(\n                f\"=================+++ num_chunks {num_chunks} skip_rows {skip_rows}\"\n            )\n            cfg.DATA.SKIP_ROWS = skip_rows\n            logger.info(f\"|===========| skip_rows {skip_rows}\")\n            train_loader = loader.construct_loader(cfg, \"train\")\n            loader.shuffle_dataset(train_loader, cur_epoch)\n\n        if cfg.MULTIGRID.LONG_CYCLE:\n            cfg, changed = multigrid.update_long_cycle(cfg, cur_epoch)\n            if changed:\n                (\n                    model,\n                    optimizer,\n                    train_loader,\n                    val_loader,\n                    precise_bn_loader,\n                    train_meter,\n                    val_meter,\n                ) = build_trainer(cfg)\n\n                # Load checkpoint.\n                if cu.has_checkpoint(cfg.OUTPUT_DIR):\n                    last_checkpoint = cu.get_last_checkpoint(\n                        cfg.OUTPUT_DIR, task=cfg.TASK\n                    )\n                    assert \"{:05d}.pyth\".format(cur_epoch) in last_checkpoint\n                else:\n                    
last_checkpoint = cfg.TRAIN.CHECKPOINT_FILE_PATH\n                logger.info(\"Load from {}\".format(last_checkpoint))\n                cu.load_checkpoint(last_checkpoint, model, cfg.NUM_GPUS > 1, optimizer)\n\n        # Shuffle the dataset.\n        loader.shuffle_dataset(train_loader, cur_epoch)\n        if hasattr(train_loader.dataset, \"_set_epoch_num\"):\n            train_loader.dataset._set_epoch_num(cur_epoch)\n        # Train for one epoch.\n        epoch_timer.epoch_tic()\n        train_epoch(\n            train_loader,\n            model,\n            optimizer,\n            scaler,\n            train_meter,\n            cur_epoch,\n            cfg,\n            writer,\n        )\n        epoch_timer.epoch_toc()\n        logger.info(\n            f\"Epoch {cur_epoch} takes {epoch_timer.last_epoch_time():.2f}s. Epochs \"\n            f\"from {start_epoch} to {cur_epoch} take \"\n            f\"{epoch_timer.avg_epoch_time():.2f}s in average and \"\n            f\"{epoch_timer.median_epoch_time():.2f}s in median.\"\n        )\n        logger.info(\n            f\"For epoch {cur_epoch}, each iteration takes \"\n            f\"{epoch_timer.last_epoch_time() / len(train_loader):.2f}s in average. 
\"\n            f\"From epoch {start_epoch} to {cur_epoch}, each iteraction takes \"\n            f\"{epoch_timer.avg_epoch_time() / len(train_loader):.2f}s in average.\"\n        )\n\n        is_checkp_epoch = (\n            cu.is_checkpoint_epoch(\n                cfg,\n                cur_epoch,\n                None if multigrid is None else multigrid.schedule,\n            )\n            or cur_epoch == cfg.SOLVER.MAX_EPOCH - 1\n        )\n        is_eval_epoch = (\n            misc.is_eval_epoch(\n                cfg,\n                cur_epoch,\n                None if multigrid is None else multigrid.schedule,\n            )\n            and not cfg.MASK.ENABLE\n        )\n\n        # Compute precise BN stats.\n        if (\n            (is_checkp_epoch or is_eval_epoch)\n            and cfg.BN.USE_PRECISE_STATS\n            and len(get_bn_modules(model)) > 0\n        ):\n            calculate_and_update_precise_bn(\n                precise_bn_loader,\n                model,\n                min(cfg.BN.NUM_BATCHES_PRECISE, len(precise_bn_loader)),\n                cfg.NUM_GPUS > 0,\n            )\n        _ = misc.aggregate_sub_bn_stats(model)\n\n        # Save a checkpoint.\n        if is_checkp_epoch:\n            cu.save_checkpoint(\n                cfg.OUTPUT_DIR,\n                model,\n                optimizer,\n                cur_epoch,\n                cfg,\n                scaler if cfg.TRAIN.MIXED_PRECISION else None,\n            )\n        # Evaluate the model on validation set.\n        if is_eval_epoch:\n            eval_epoch(\n                val_loader,\n                model,\n                val_meter,\n                cur_epoch,\n                cfg,\n                train_loader,\n                writer,\n            )\n    if (\n        start_epoch == cfg.SOLVER.MAX_EPOCH and not cfg.MASK.ENABLE\n    ):  # final checkpoint load\n        eval_epoch(val_loader, model, val_meter, start_epoch, cfg, train_loader, writer)\n    if writer 
is not None:\n        writer.close()\n    result_string = (\n        \"_p{:.2f}_f{:.2f} _t{:.2f}_m{:.2f} _a{:.2f} Top5 Acc: {:.2f} MEM: {:.2f} f: {:.4f}\"\n        \"\".format(\n            params / 1e6,\n            flops,\n            (\n                epoch_timer.median_epoch_time() / 60.0\n                if len(epoch_timer.epoch_times)\n                else 0.0\n            ),\n            misc.gpu_mem_usage(),\n            100 - val_meter.min_top1_err,\n            100 - val_meter.min_top5_err,\n            misc.gpu_mem_usage(),\n            flops,\n        )\n    )\n    logger.info(\"training done: {}\".format(result_string))\n\n    return result_string\n"
  },
  {
    "path": "tools/visualization.py",
    "content": "#!/usr/bin/env python3\n# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved.\n\nimport pickle\n\nimport numpy as np\nimport slowfast.datasets.utils as data_utils\nimport slowfast.utils.checkpoint as cu\nimport slowfast.utils.distributed as du\nimport slowfast.utils.logging as logging\nimport slowfast.utils.misc as misc\nimport slowfast.visualization.tensorboard_vis as tb\nimport torch\nimport tqdm\nfrom slowfast.datasets import loader\nfrom slowfast.models import build_model\nfrom slowfast.utils.env import pathmgr\nfrom slowfast.visualization.gradcam_utils import GradCAM\nfrom slowfast.visualization.prediction_vis import WrongPredictionVis\nfrom slowfast.visualization.utils import (\n    GetWeightAndActivation,\n    process_layer_index_data,\n)\nfrom slowfast.visualization.video_visualizer import VideoVisualizer\n\nlogger = logging.get_logger(__name__)\n\n\ndef run_visualization(vis_loader, model, cfg, writer=None):\n    \"\"\"\n    Run model visualization (weights, activations and model inputs) and visualize\n    them on Tensorboard.\n    Args:\n        vis_loader (loader): video visualization loader.\n        model (model): the video model to visualize.\n        cfg (CfgNode): configs. 
Details can be found in\n            slowfast/config/defaults.py\n        writer (TensorboardWriter, optional): TensorboardWriter object\n            to writer Tensorboard log.\n    \"\"\"\n    n_devices = cfg.NUM_GPUS * cfg.NUM_SHARDS\n    prefix = \"module/\" if n_devices > 1 else \"\"\n    # Get a list of selected layer names and indexing.\n    layer_ls, indexing_dict = process_layer_index_data(\n        cfg.TENSORBOARD.MODEL_VIS.LAYER_LIST, layer_name_prefix=prefix\n    )\n    logger.info(\"Start Model Visualization.\")\n    # Register hooks for activations.\n    model_vis = GetWeightAndActivation(model, layer_ls)\n\n    if writer is not None and cfg.TENSORBOARD.MODEL_VIS.MODEL_WEIGHTS:\n        layer_weights = model_vis.get_weights()\n        writer.plot_weights_and_activations(\n            layer_weights, tag=\"Layer Weights/\", heat_map=False\n        )\n\n    video_vis = VideoVisualizer(\n        cfg.MODEL.NUM_CLASSES,\n        cfg.TENSORBOARD.CLASS_NAMES_PATH,\n        cfg.TENSORBOARD.MODEL_VIS.TOPK_PREDS,\n        cfg.TENSORBOARD.MODEL_VIS.COLORMAP,\n    )\n    if n_devices > 1:\n        grad_cam_layer_ls = [\n            \"module/\" + layer for layer in cfg.TENSORBOARD.MODEL_VIS.GRAD_CAM.LAYER_LIST\n        ]\n    else:\n        grad_cam_layer_ls = cfg.TENSORBOARD.MODEL_VIS.GRAD_CAM.LAYER_LIST\n\n    if cfg.TENSORBOARD.MODEL_VIS.GRAD_CAM.ENABLE:\n        gradcam = GradCAM(\n            model,\n            target_layers=grad_cam_layer_ls,\n            data_mean=cfg.DATA.MEAN,\n            data_std=cfg.DATA.STD,\n            colormap=cfg.TENSORBOARD.MODEL_VIS.GRAD_CAM.COLORMAP,\n        )\n    logger.info(\"Finish drawing weights.\")\n    global_idx = -1\n    for inputs, labels, _, meta in tqdm.tqdm(vis_loader):\n        if cfg.NUM_GPUS:\n            # Transfer the data to the current GPU device.\n            if isinstance(inputs, (list,)):\n                for i in range(len(inputs)):\n                    inputs[i] = inputs[i].cuda(non_blocking=True)\n    
        else:\n                inputs = inputs.cuda(non_blocking=True)\n            labels = labels.cuda()\n            for key, val in meta.items():\n                if isinstance(val, (list,)):\n                    for i in range(len(val)):\n                        val[i] = val[i].cuda(non_blocking=True)\n                else:\n                    meta[key] = val.cuda(non_blocking=True)\n\n        if cfg.DETECTION.ENABLE:\n            activations, preds = model_vis.get_activations(inputs, meta[\"boxes\"])\n        else:\n            activations, preds = model_vis.get_activations(inputs)\n        if cfg.TENSORBOARD.MODEL_VIS.GRAD_CAM.ENABLE:\n            if cfg.TENSORBOARD.MODEL_VIS.GRAD_CAM.USE_TRUE_LABEL:\n                inputs, preds = gradcam(inputs, labels=labels)\n            else:\n                inputs, preds = gradcam(inputs)\n        if cfg.NUM_GPUS:\n            inputs = du.all_gather_unaligned(inputs)\n            activations = du.all_gather_unaligned(activations)\n            preds = du.all_gather_unaligned(preds)\n            if isinstance(inputs[0], list):\n                for i in range(len(inputs)):\n                    for j in range(len(inputs[0])):\n                        inputs[i][j] = inputs[i][j].cpu()\n            else:\n                inputs = [inp.cpu() for inp in inputs]\n            preds = [pred.cpu() for pred in preds]\n        else:\n            inputs, activations, preds = [inputs], [activations], [preds]\n\n        boxes = [None] * max(n_devices, 1)\n        if cfg.DETECTION.ENABLE and cfg.NUM_GPUS:\n            boxes = du.all_gather_unaligned(meta[\"boxes\"])\n            boxes = [box.cpu() for box in boxes]\n\n        if writer is not None:\n            total_vids = 0\n            for i in range(max(n_devices, 1)):\n                cur_input = inputs[i]\n                cur_activations = activations[i]\n                cur_batch_size = cur_input[0].shape[0]\n                cur_preds = preds[i]\n                cur_boxes = 
boxes[i]\n                for cur_batch_idx in range(cur_batch_size):\n                    global_idx += 1\n                    total_vids += 1\n                    if (\n                        cfg.TENSORBOARD.MODEL_VIS.INPUT_VIDEO\n                        or cfg.TENSORBOARD.MODEL_VIS.GRAD_CAM.ENABLE\n                    ):\n                        for path_idx, input_pathway in enumerate(cur_input):\n                            if cfg.TEST.DATASET == \"ava\" and cfg.AVA.BGR:\n                                video = input_pathway[cur_batch_idx, [2, 1, 0], ...]\n                            else:\n                                video = input_pathway[cur_batch_idx]\n\n                            if not cfg.TENSORBOARD.MODEL_VIS.GRAD_CAM.ENABLE:\n                                # Permute to (T, H, W, C) from (C, T, H, W).\n                                video = video.permute(1, 2, 3, 0)\n                                video = data_utils.revert_tensor_normalize(\n                                    video, cfg.DATA.MEAN, cfg.DATA.STD\n                                )\n                            else:\n                                # Permute from (T, C, H, W) to (T, H, W, C)\n                                video = video.permute(0, 2, 3, 1)\n                            bboxes = None if cur_boxes is None else cur_boxes[:, 1:]\n                            cur_prediction = (\n                                cur_preds\n                                if cfg.DETECTION.ENABLE\n                                else cur_preds[cur_batch_idx]\n                            )\n                            video = video_vis.draw_clip(\n                                video, cur_prediction, bboxes=bboxes\n                            )\n                            video = (\n                                torch.from_numpy(np.array(video))\n                                .permute(0, 3, 1, 2)\n                                .unsqueeze(0)\n                            )\n            
                writer.add_video(\n                                video,\n                                tag=\"Input {}/Pathway {}\".format(\n                                    global_idx, path_idx + 1\n                                ),\n                            )\n                    if cfg.TENSORBOARD.MODEL_VIS.ACTIVATIONS:\n                        writer.plot_weights_and_activations(\n                            cur_activations,\n                            tag=\"Input {}/Activations: \".format(global_idx),\n                            batch_idx=cur_batch_idx,\n                            indexing_dict=indexing_dict,\n                        )\n\n\ndef perform_wrong_prediction_vis(vis_loader, model, cfg):\n    \"\"\"\n    Visualize video inputs with wrong predictions on Tensorboard.\n    Args:\n        vis_loader (loader): video visualization loader.\n        model (model): the video model to visualize.\n        cfg (CfgNode): configs. Details can be found in\n            slowfast/config/defaults.py\n    \"\"\"\n    wrong_prediction_visualizer = WrongPredictionVis(cfg=cfg)\n    for batch_idx, (inputs, labels, _, _) in tqdm.tqdm(enumerate(vis_loader)):\n        if cfg.NUM_GPUS:\n            # Transfer the data to the current GPU device.\n            if isinstance(inputs, (list,)):\n                for i in range(len(inputs)):\n                    inputs[i] = inputs[i].cuda(non_blocking=True)\n            else:\n                inputs = inputs.cuda(non_blocking=True)\n            labels = labels.cuda()\n\n        # Some model modify the original input.\n        inputs_clone = [inp.clone() for inp in inputs]\n\n        preds = model(inputs)\n\n        if cfg.NUM_GPUS > 1:\n            preds, labels = du.all_gather([preds, labels])\n            if isinstance(inputs_clone, (list,)):\n                inputs_clone = du.all_gather(inputs_clone)\n            else:\n                inputs_clone = du.all_gather([inputs_clone])[0]\n\n        if cfg.NUM_GPUS:\n        
    # Transfer the data to the CPU.\n            labels = labels.cpu()\n            preds = preds.cpu()\n            if isinstance(inputs_clone, (list,)):\n                for i in range(len(inputs_clone)):\n                    inputs_clone[i] = inputs_clone[i].cpu()\n            else:\n                inputs_clone = inputs_clone.cpu()\n\n        # If using CPU (NUM_GPUS = 0), use 1 to represent 1 CPU.\n        n_devices = max(cfg.NUM_GPUS, 1)\n        for device_idx in range(1, n_devices + 1):\n            wrong_prediction_visualizer.visualize_vid(\n                video_input=inputs_clone,\n                labels=labels,\n                preds=preds.detach().clone(),\n                batch_idx=device_idx * batch_idx,\n            )\n\n    logger.info(\n        \"Class indices with wrong predictions: {}\".format(\n            sorted(wrong_prediction_visualizer.wrong_class_prediction)\n        )\n    )\n    wrong_prediction_visualizer.clean()\n\n\ndef visualize(cfg):\n    \"\"\"\n    Perform layer weights and activations visualization on the model.\n    Args:\n        cfg (CfgNode): configs. 
Details can be found in\n            slowfast/config/defaults.py\n    \"\"\"\n    if cfg.TENSORBOARD.ENABLE and (\n        cfg.TENSORBOARD.MODEL_VIS.ENABLE or cfg.TENSORBOARD.WRONG_PRED_VIS.ENABLE\n    ):\n        # Set up environment.\n        du.init_distributed_training(cfg)\n        # Set random seed from configs.\n        np.random.seed(cfg.RNG_SEED)\n        torch.manual_seed(cfg.RNG_SEED)\n\n        # Setup logging format.\n        logging.setup_logging(cfg.OUTPUT_DIR)\n\n        # Print config.\n        logger.info(\"Model Visualization with config:\")\n        logger.info(cfg)\n\n        # Build the video model and print model statistics.\n        model = build_model(cfg)\n        model.eval()\n        if du.is_master_proc() and cfg.LOG_MODEL_INFO:\n            misc.log_model_info(model, cfg, use_train_input=False)\n\n        cu.load_test_checkpoint(cfg, model)\n\n        # Create video testing loaders.\n        vis_loader = loader.construct_loader(cfg, \"test\")\n\n        if cfg.DETECTION.ENABLE:\n            assert cfg.NUM_GPUS == cfg.TEST.BATCH_SIZE or cfg.NUM_GPUS == 0\n\n        # Set up writer for logging to Tensorboard format.\n        if du.is_master_proc(cfg.NUM_GPUS * cfg.NUM_SHARDS):\n            writer = tb.TensorboardWriter(cfg)\n        else:\n            writer = None\n        if cfg.TENSORBOARD.PREDICTIONS_PATH != \"\":\n            assert not cfg.DETECTION.ENABLE, \"Detection is not supported.\"\n            logger.info(\"Visualizing class-level performance from saved results...\")\n            if writer is not None:\n                with pathmgr.open(cfg.TENSORBOARD.PREDICTIONS_PATH, \"rb\") as f:\n                    preds, labels = pickle.load(f, encoding=\"latin1\")\n\n                writer.plot_eval(preds, labels)\n\n        if cfg.TENSORBOARD.MODEL_VIS.ENABLE:\n            if cfg.TENSORBOARD.MODEL_VIS.GRAD_CAM.ENABLE:\n                assert not cfg.DETECTION.ENABLE, (\n                    \"Detection task is currently not 
supported for Grad-CAM visualization.\"\n                )\n                if cfg.MODEL.ARCH in cfg.MODEL.SINGLE_PATHWAY_ARCH:\n                    assert len(cfg.TENSORBOARD.MODEL_VIS.GRAD_CAM.LAYER_LIST) == 1, (\n                        \"The number of chosen CNN layers must be equal to the number of pathway(s), given {} layer(s).\".format(\n                            len(cfg.TENSORBOARD.MODEL_VIS.GRAD_CAM.LAYER_LIST)\n                        )\n                    )\n                elif cfg.MODEL.ARCH in cfg.MODEL.MULTI_PATHWAY_ARCH:\n                    assert len(cfg.TENSORBOARD.MODEL_VIS.GRAD_CAM.LAYER_LIST) == 2, (\n                        \"The number of chosen CNN layers must be equal to the number of pathway(s), given {} layer(s).\".format(\n                            len(cfg.TENSORBOARD.MODEL_VIS.GRAD_CAM.LAYER_LIST)\n                        )\n                    )\n                else:\n                    raise NotImplementedError(\n                        \"Model arch {} is not in {}\".format(\n                            cfg.MODEL.ARCH,\n                            cfg.MODEL.SINGLE_PATHWAY_ARCH\n                            + cfg.MODEL.MULTI_PATHWAY_ARCH,\n                        )\n                    )\n            logger.info(\n                \"Visualize model analysis for {} iterations\".format(len(vis_loader))\n            )\n            # Run visualization on the model\n            run_visualization(vis_loader, model, cfg, writer)\n        if cfg.TENSORBOARD.WRONG_PRED_VIS.ENABLE:\n            logger.info(\n                \"Visualize Wrong Predictions for {} iterations\".format(len(vis_loader))\n            )\n            perform_wrong_prediction_vis(vis_loader, model, cfg)\n\n        if writer is not None:\n            writer.close()\n"
  }
]